---
library_name: transformers
license: mit
language:
- en
base_model: zai-org/GLM-4.7-FP8
pipeline_tag: text-generation
tags:
- eagle3
- speculative-decoding
- sglang
- draft-model
- moe
- mixture-of-experts
- fp8
---

# EAGLE3 Draft Head: GLM-4.7-FP8

A lightweight EAGLE3 draft head for [GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8) (~218B MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

GLM-4.7 uses sigmoid top-8 routing, activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.

**Blog post**: [1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7](https://huggingface.co/blog/lujangusface/tw-eagle3-glm47-fp8)

## Usage

### SGLang (GPU)

Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for GLM-4.7 Eagle3 support.

**B=1 server** (wide tree, optimal for single-user, real-time requests):

```bash
pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**B=32 server** (wide tree is also recommended at B=32 for this model):

```bash
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
  --speculative-num-steps 3 \
  --speculative-num-draft-tokens 6 \
  --speculative-eagle-topk 4 \
  --tp 8 \
  --trust-remote-code \
  --port 30000
```

**Note**: Unlike other MoE models where narrow tree helps at B=32, GLM-4.7-FP8 performs marginally better with wide tree (1.16x vs 1.14x). Use wide tree for all workloads.

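Because both launch commands above use the same wide-tree settings, it can be convenient to build the command from a single config. The following is a minimal sketch, not part of SGLang or SpecForge: `WIDE_TREE` and `build_launch_args` are hypothetical names, and the flags and values are copied from the commands above.

```python
import shlex

# Wide-tree settings copied from the launch commands above.
WIDE_TREE = {
    "speculative-num-steps": 3,
    "speculative-num-draft-tokens": 6,
    "speculative-eagle-topk": 4,
}


def build_launch_args(tree=WIDE_TREE, tp=8, port=30000):
    """Return the argv list for the server launch (hypothetical helper)."""
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "zai-org/GLM-4.7-FP8",
        "--speculative-algorithm", "EAGLE3",
        "--speculative-draft-model-path", "thoughtworks/GLM-4.7-FP8-Eagle3",
    ]
    # Append the speculative-tree flags in the order given by the config.
    for flag, value in tree.items():
        args += [f"--{flag}", str(value)]
    args += ["--tp", str(tp), "--trust-remote-code", "--port", str(port)]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_launch_args()))
```

The returned list can be passed directly to `subprocess.Popen` if the server should be started from Python rather than from a shell.
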
### Python Client

```python
import requests

# Query the OpenAI-compatible chat endpoint exposed by the SGLang server.
response = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512,
        "temperature": 0,
    },
)
response.raise_for_status()  # fail early on HTTP errors
print(response.json()["choices"][0]["message"]["content"])
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
| Hardware | 8x NVIDIA H200 141GB (TP=8, DP=1) |
| Pre-training | 6 epochs on 54K mixed data (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
| Optimizer | AdamW |
| Batch size | 1 (per device) |
| max_length | 1024 |
| TTT length (training-time test steps) | 7 |
| Precision | bfloat16 |
| Training accuracy (acc_0) | 0.97 |

### Training Method

EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 2, 46, and 89: early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.

### Regenerated Data

The final fine-tuning stage uses training data in which the assistant responses were generated by GLM-4.7 itself (at temp=0.8) rather than taken from the generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates, especially at high batch sizes (B=32) where every accepted token matters more.

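The regeneration step described above can be illustrated with a short sketch. This is not the SpecForge pipeline; it only shows the idea of keeping the user turns and replacing each assistant turn with the target model's own reply. The function name `regenerate_assistant_turns` and the `generate` callable are hypothetical.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def regenerate_assistant_turns(
    messages: List[Message],
    generate: Callable[[List[Message]], str],
) -> List[Message]:
    """Replace every assistant turn with a reply from the target model.

    `generate` is a hypothetical callable: it receives the conversation so
    far and returns the target model's reply (for example, sampled at
    temp=0.8 from a running server).
    """
    regenerated: List[Message] = []
    for msg in messages:
        if msg["role"] == "assistant":
            # Condition on the regenerated history, not the original replies,
            # so later turns stay consistent with what the target model said.
            reply = generate(list(regenerated))
            regenerated.append({"role": "assistant", "content": reply})
        else:
            # Copy non-assistant turns so the input list is left untouched.
            regenerated.append(dict(msg))
    return regenerated
```

In practice `generate` could wrap the chat-completions request from the Python Client section with `"temperature": 0.8`; that wiring is left out here so the sketch stays independent of a running server.
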
## Performance

### B=1 Inference Benchmarks (temp=0, FP8, TP=8)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
|---------|-----------------|----------------|---------|-------------|---------------|
| Terminal-Bench | 55.0 | 113.6 | **2.07x** | 42.5% | 2.55 |
| MT-Bench | 66.5 | 106.7 | **1.60x** | 42.5% | 2.55 |
| SWEBench-Verified | 66.1 | 104.0 | **1.57x** | 45.0% | 2.70 |
| HumanEval | 66.8 | 102.2 | **1.53x** | 54.2% | 3.25 |
| **Mean** | **63.6** | **106.6** | **1.69x** | **46.1%** | **2.76** |

### B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)

| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
|---------|-----------------|----------------|---------|
| SWEBench-Verified | 922.7 | 1,108.4 | **1.20x** |
| MT-Bench | 954.2 | 1,109.7 | **1.16x** |
| Terminal-Bench | 952.3 | 1,104.3 | **1.16x** |
| HumanEval | 915.1 | 1,035.9 | **1.13x** |
| **Mean** | **936.1** | **1,089.6** | **1.16x** |

*Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit `63291f7f51`.*

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Architecture | LlamaForCausalLMEagle3 |
| Hidden size | 5120 |
| Num hidden layers | 1 |
| Num attention heads | 40 (8 KV heads) |
| head_dim | 128 |
| Intermediate size | 16384 |
| Auxiliary layers | [2, 46, 89] |
| Vocab size | 151552 (target) / 32000 (draft) |
| Checkpoint size | ~1.2 GB |

## Limitations

- **TP=8 required.** FP8 block constraint: shared_expert intermediate_size=512, and 512/8=64 is not divisible by block_n=128. TP=4 fails at this boundary.
- **Temperature sensitivity.** Best performance at temp=0 (greedy). MoE expert routing is non-deterministic at temp>0, which reduces draft acceptance rates. Deploy at temp=0 for coding and factual workloads.
- **FP8 quantization.** The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
- **Requires SGLang fork.** Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
- **JIT deep_gemm incompatible.** Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures.

## License

This draft head is released under the [MIT License](https://opensource.org/licenses/MIT), matching the [GLM-4.7-FP8 license](https://huggingface.co/zai-org/GLM-4.7-FP8).

## Citation

```bibtex
@inproceedings{li2025eagle3,
  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}
```