

	---
	library_name: transformers
	license: mit
	language:
	- en
	base_model: zai-org/GLM-4.7-FP8
	pipeline_tag: text-generation
	tags:
	- eagle3
	- speculative-decoding
	- sglang
	- draft-model
	- moe
	- mixture-of-experts
	- fp8
	---

	<!-- Internal: exp-e (gpu/glm47-fp8) -->

	# EAGLE3 Draft Head — GLM-4.7-FP8

	A lightweight EAGLE3 draft head for [GLM-4.7-FP8](https://huggingface.co/zai-org/GLM-4.7-FP8) (~218B MoE, 160 experts, sigmoid top-8 routing, ~40B active parameters per token). Trained with [SpecForge](https://github.com/tails-mpt/SpecForge) on 8x H200 GPUs using the [EAGLE-3](https://arxiv.org/abs/2503.01840) training-time test objective.

	GLM-4.7 uses sigmoid top-8 routing — activating 8 out of 160 experts per token rather than the typical 1-2 in most MoE models. This preserves high representational capacity at the cost of increased compute, making speculative decoding especially valuable: the draft head is tiny relative to the 218B target.
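
The routing step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not GLM-4.7's implementation: the function name `sigmoid_topk_route`, the renormalization of the selected gates, and the random logits are assumptions made for the example, and real routers may add bias terms or expert grouping that are omitted here.

```python
import math
import random


def sigmoid_topk_route(router_logits, k=8):
    """Pick the k highest-scoring experts using independent sigmoid gates.

    Returns a list of (expert_index, weight) pairs. Renormalizing the
    selected gates so they sum to 1 is an assumption of this sketch.
    """
    gates = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    total = sum(gates[i] for i in top)
    return [(i, gates[i] / total) for i in top]


# Hypothetical router output for one token over 160 experts.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(160)]
selected = sigmoid_topk_route(logits, k=8)
print(len(selected), "of", len(logits), "experts active")
```

Because each gate is an independent sigmoid rather than one entry of a softmax, an expert's score does not depend on how the other experts score; only the top-k selection couples them.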

	Blog post: [1.7x Faster on a 218B Model: EAGLE3 Speculative Decoding for GLM-4.7](https://huggingface.co/blog/lujangusface/tw-eagle3-glm47-fp8)

	## Usage

	### SGLang (GPU)

	Requires our [SGLang fork](https://github.com/tails-mpt/sglang) for GLM-4.7 Eagle3 support.

	Batch size 1 (B=1) server, using the wide draft tree (best for single-user, real-time requests):

	```bash
	pip install 'git+https://github.com/tails-mpt/sglang.git#subdirectory=python'

	python -m sglang.launch_server \
	    --model-path zai-org/GLM-4.7-FP8 \
	    --speculative-algorithm EAGLE3 \
	    --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
	    --speculative-num-steps 3 \
	    --speculative-num-draft-tokens 6 \
	    --speculative-eagle-topk 4 \
	    --tp 8 \
	    --trust-remote-code \
	    --port 30000
	```

	Batch size 32 (B=32) server (same wide-tree settings as B=1, which are also recommended at B=32 for this model):

	```bash
	python -m sglang.launch_server \
	    --model-path zai-org/GLM-4.7-FP8 \
	    --speculative-algorithm EAGLE3 \
	    --speculative-draft-model-path thoughtworks/GLM-4.7-FP8-Eagle3 \
	    --speculative-num-steps 3 \
	    --speculative-num-draft-tokens 6 \
	    --speculative-eagle-topk 4 \
	    --tp 8 \
	    --trust-remote-code \
	    --port 30000
	```

	Note: In other MoE models a narrow tree helps at B=32, but GLM-4.7-FP8 is marginally faster with the wide tree (1.16x vs 1.14x speedup over baseline). Use the wide tree for all workloads.

	### Python Client

	```python
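	import requests

	# Query the SGLang server's OpenAI-compatible chat endpoint (port 30000, as in the launch command).
	response = requests.post(
	    "http://localhost:30000/v1/chat/completions",
	    json={
	        "model": "default",
	        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
	        "max_tokens": 512,
	        "temperature": 0,  # greedy decoding
	    },
	    timeout=600,
	)
	response.raise_for_status()  # surface HTTP errors before parsing the body
	print(response.json()["choices"][0]["message"]["content"])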
	```
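
To compare a deployment against the tok/s figures reported under Performance, decode throughput can be estimated from the `usage` field of an OpenAI-compatible response. The sketch below is a rough helper, not part of the authors' benchmark harness: `decode_tokens_per_second` and `timed_chat` are hypothetical names, it uses only the standard library instead of `requests`, and wall-clock timing of a single request also includes prefill and network overhead.

```python
import json
import time
import urllib.request


def decode_tokens_per_second(response_body, elapsed_s):
    """Completion tokens divided by wall-clock seconds for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return response_body["usage"]["completion_tokens"] / elapsed_s


def timed_chat(prompt, url="http://localhost:30000/v1/chat/completions"):
    """Send one greedy chat request and return (reply_text, tokens_per_second)."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0,
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    text = body["choices"][0]["message"]["content"]
    return text, decode_tokens_per_second(body, elapsed)
```

With the server running, `text, tps = timed_chat("Write a Python function to merge two sorted lists.")` returns the reply and an approximate throughput for that request.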

	## Training Details

	| Parameter | Value |
	|-----------|-------|
	| Framework | [SpecForge](https://github.com/tails-mpt/SpecForge) (PyTorch), SGLang backend |
	| Hardware | 8x NVIDIA H200 141GB (TP=8, DP=1) |
	| Pre-training | 6 epochs on 54K mixed data (ShareGPT / UltraChat / PerfectBlend), LR=1e-4 |
	| Fine-tuning | 3 epochs on regenerated data (target-model responses at temp=0.8), LR=5e-5 |
	| Optimizer | AdamW |
	| Batch size | 1 (per device) |
	| max_length | 1024 |
	| TTT length (training-time test steps) | 7 |
	| Precision | bfloat16 |
	| Training accuracy (acc_0) | 0.97 |

	### Training Method

	EAGLE3 trains a single-layer draft head that predicts the next token using hidden states captured from three auxiliary layers of the target model (layers 2, 46, 89 — early, middle, and late). The training objective is the Training-Time Test (TTT) loss, which simulates the speculative decoding accept/reject process during training to maximize the expected number of accepted tokens at inference time.
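
The accept/reject process that the TTT loss simulates can be illustrated for the greedy (temp=0) case with a single draft chain. This is a simplified sketch under stated assumptions: `verify_greedy` is a hypothetical helper, the token IDs are made up, and the real system verifies a tree of candidates rather than one chain.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Greedy verification of one draft chain.

    draft_tokens:  tokens proposed by the draft head.
    target_tokens: the target model's greedy choice at each draft position,
                   plus one extra choice after the last draft token, so
                   len(target_tokens) == len(draft_tokens) + 1.
    Returns the tokens emitted in this step: the matching draft prefix
    followed by one token from the target (a correction or a bonus token).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted += 1
    return list(draft_tokens[:accepted]) + [target_tokens[accepted]]


# Hypothetical token IDs: the first two drafts match, the third does not.
emitted = verify_greedy([5, 7, 9], [5, 7, 2, 4])
print(emitted, "-> tokens emitted this step:", len(emitted))
```

Averaged over many steps, a count like this is what an accept-length metric summarizes. Whether the figures reported under Performance include the final target token is not stated in this card.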

	### Regenerated Data

	The final fine-tuning stage uses training data where the assistant responses were generated by GLM-4.7 itself (at temp=0.8), rather than using generic ShareGPT/UltraChat responses. This aligns the draft model's predicted distribution with the target model's actual output, improving acceptance rates — especially at high batch sizes (B=32) where every accepted token matters more.
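
The regeneration step can be sketched as replacing each assistant turn with the target model's own reply while keeping the user turns. This is a minimal sketch, not SpecForge's pipeline: `regenerate_conversation` and the `generate` callback are hypothetical, and `generate` stands in for a sampling request to GLM-4.7 at temp=0.8.

```python
def regenerate_conversation(messages, generate):
    """Rebuild a conversation with assistant turns produced by `generate`.

    messages: list of {"role": ..., "content": ...} dicts.
    generate: callable taking the history so far and returning the target
              model's reply as a string (hypothetical; for example an API
              call sampling at temperature 0.8).
    """
    history = []
    for message in messages:
        if message["role"] == "assistant":
            # Discard the original answer and ask the target model instead.
            history.append({"role": "assistant", "content": generate(list(history))})
        else:
            history.append(dict(message))
    return history


# Stand-in generator so the example runs without a model server.
def fake_generate(history):
    return "target reply to: " + history[-1]["content"]


original = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "generic dataset answer"},
    {"role": "user", "content": "Explain MoE"},
    {"role": "assistant", "content": "another dataset answer"},
]
regenerated = regenerate_conversation(original, fake_generate)
print(regenerated[1]["content"])
```

Each regenerated reply is conditioned on the earlier regenerated turns, not the original dataset answers, so the whole conversation stays on the target model's own distribution.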

	## Performance

	### B=1 Inference Benchmarks (temp=0, FP8, TP=8)

	| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup | Accept Rate | Accept Length |
	|---------|-----------------|----------------|---------|-------------|---------------|
	| Terminal-Bench | 55.0 | 113.6 | 2.07x | 42.5% | 2.55 |
	| MT-Bench | 66.5 | 106.7 | 1.60x | 42.5% | 2.55 |
	| SWEBench-Verified | 66.1 | 104.0 | 1.57x | 45.0% | 2.70 |
	| HumanEval | 66.8 | 102.2 | 1.53x | 54.2% | 3.25 |
	| Mean | 63.6 | 106.6 | 1.69x | 46.1% | 2.76 |
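
The derived columns can be reproduced from the raw throughput numbers. The snippet below recomputes speedup as EAGLE3 tok/s divided by baseline tok/s, and checks an observation about the data: each reported accept rate equals the accept length divided by the 6 draft tokens in the benchmark config. That relationship is inferred from the table values, not a definition given by the authors.

```python
# (dataset, baseline tok/s, EAGLE3 tok/s, accept length) copied from the B=1 table.
ROWS = [
    ("Terminal-Bench", 55.0, 113.6, 2.55),
    ("MT-Bench", 66.5, 106.7, 2.55),
    ("SWEBench-Verified", 66.1, 104.0, 2.70),
    ("HumanEval", 66.8, 102.2, 3.25),
]
DRAFT_TOKENS = 6  # --speculative-num-draft-tokens in the launch command


def speedup(baseline, eagle3):
    """Throughput ratio, rounded to two decimals as in the table."""
    return round(eagle3 / baseline, 2)


def accept_rate_percent(accept_length, draft_tokens=DRAFT_TOKENS):
    """Accept length as a percentage of the draft-token budget (inferred relation)."""
    return round(100.0 * accept_length / draft_tokens, 1)


for name, base, fast, length in ROWS:
    print(f"{name}: {speedup(base, fast)}x, {accept_rate_percent(length)}%")

mean_speedup = round(sum(speedup(b, f) for _, b, f, _ in ROWS) / len(ROWS), 2)
print("mean speedup:", mean_speedup)
```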

	### B=32 Inference Benchmarks (temp=0, FP8, TP=8, wide tree)

	| Dataset | Baseline (tok/s) | EAGLE3 (tok/s) | Speedup |
	|---------|-----------------|----------------|---------|
	| SWEBench-Verified | 922.7 | 1,108.4 | 1.20x |
	| MT-Bench | 954.2 | 1,109.7 | 1.16x |
	| Terminal-Bench | 952.3 | 1,104.3 | 1.16x |
	| HumanEval | 915.1 | 1,035.9 | 1.13x |
	| Mean | 936.1 | 1,089.6 | 1.16x |

	Config: steps=3, topk=4, draft_tokens=6. Hardware: 8x H200 (TP=8), FlashInfer backend. SGLang commit `63291f7f51`.

	## Model Architecture

	| Parameter | Value |
	|-----------|-------|
	| Architecture | LlamaForCausalLMEagle3 |
	| Hidden size | 5120 |
	| Num hidden layers | 1 |
	| Num attention heads | 40 (8 KV heads) |
	| head_dim | 128 |
	| Intermediate size | 16384 |
	| Auxiliary layers | [2, 46, 89] |
	| Vocab size | 151552 (target) / 32000 (draft) |
	| Checkpoint size | ~1.2 GB |
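
As a sanity check on the ~1.2 GB checkpoint size, the weight count can be estimated from the table. The layer shapes below are assumptions based on the usual EAGLE-3 draft-head layout (a fusion layer over three concatenated hidden states, attention that takes the token embedding concatenated with the fused hidden state, a gated MLP, and a draft-vocabulary LM head); they are not read from this checkpoint. The estimate also assumes the input embedding is shared with the target and not stored, and it ignores norms and vocabulary-mapping buffers.

```python
HIDDEN = 5120
HEADS, KV_HEADS, HEAD_DIM = 40, 8, 128
INTERMEDIATE = 16384
DRAFT_VOCAB = 32000
BYTES_PER_PARAM = 2  # bfloat16

# Assumed weight shapes (in_features, out_features); see the caveats above.
shapes = {
    "fc (3 aux layers -> hidden)": (3 * HIDDEN, HIDDEN),
    "q_proj": (2 * HIDDEN, HEADS * HEAD_DIM),
    "k_proj": (2 * HIDDEN, KV_HEADS * HEAD_DIM),
    "v_proj": (2 * HIDDEN, KV_HEADS * HEAD_DIM),
    "o_proj": (HEADS * HEAD_DIM, HIDDEN),
    "mlp gate/up/down": (3 * HIDDEN, INTERMEDIATE),
    "lm_head": (HIDDEN, DRAFT_VOCAB),
}

total_params = sum(rows * cols for rows, cols in shapes.values())
size_gb = total_params * BYTES_PER_PARAM / 1e9
print(f"{total_params / 1e6:.1f}M parameters, about {size_gb:.2f} GB in bfloat16")
```

Under these assumptions the estimate lands close to the reported checkpoint size, which suggests the draft head is dominated by its MLP and LM head weights.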

	## Limitations

	- TP=8 required. FP8 block constraint: shared_expert intermediate_size=512, and 512/8=64 is not divisible by block_n=128. TP=4 fails at this boundary.
	- Temperature sensitivity. Best performance at temp=0 (greedy). At temp>0 the target samples its tokens, so its outputs (and the expert routing that follows from them) vary between runs and the draft's predictions are accepted less often. Deploy at temp=0 for coding and factual workloads.
	- FP8 quantization. The target model runs in FP8. The draft head itself is bfloat16 but depends on the target's FP8 hidden states during inference.
	- Requires SGLang fork. Upstream SGLang does not yet include all patches needed for Eagle3 on this model.
	- JIT deep_gemm incompatible. Training requires `SGLANG_ENABLE_JIT_DEEPGEMM=0` to avoid kernel assertion failures.

	## License

	This draft head is released under the [MIT License](https://opensource.org/licenses/MIT), matching the [GLM-4.7-FP8 license](https://huggingface.co/zai-org/GLM-4.7-FP8).

	## Citation

	```bibtex
	@inproceedings{li2025eagle3,
	  title={{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
	  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
	  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
	  year={2025}
	}
	```