diffuse-cpp: C++ inference engine for Dream on CPU (GGUF format, Q4_K_M quantization)
Hi Dream team! We have built CPU inference support for Dream-v0-Instruct-7B using diffuse-cpp, a C++ inference engine for diffusion language models built on GGML.
Pre-quantized GGUF models
Available at diffuse-cpp/Dream-v0-Instruct-7B-GGUF:
| File | Type | Size |
|---|---|---|
| dream-7b-f16.gguf | F16 | 15.2 GB |
| dream-7b-q8_0.gguf | Q8_0 | 8.6 GB |
| dream-7b-q4km.gguf | Q4_K_M | 5.3 GB |
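The file sizes above give a rough idea of how many bits each quantization spends per weight. The sketch below is a back-of-the-envelope estimate, not a figure from diffuse-cpp: it assumes the F16 file stores every weight in 16 bits and that file metadata is negligible. The function name `effective_bits_per_weight` is illustrative only.

```python
# File sizes in GB, copied from the table above.
SIZES_GB = {"F16": 15.2, "Q8_0": 8.6, "Q4_K_M": 5.3}


def effective_bits_per_weight(size_gb: float, f16_size_gb: float = 15.2) -> float:
    """Estimate average bits per weight.

    Assumption: the F16 file uses exactly 16 bits per weight, so the ratio
    of file sizes scales the bit width directly.
    """
    return 16.0 * size_gb / f16_size_gb


for name, size in SIZES_GB.items():
    bits = effective_bits_per_weight(size)
    shrink = SIZES_GB["F16"] / size
    print(f"{name}: ~{bits:.2f} bits/weight, {shrink:.2f}x smaller than F16")
```

The estimate counts everything in the file, including any tensors kept at higher precision, so it can sit above the nominal bit width of a format.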
Performance (Q4_K_M, entropy_exit + inter-step cache, 12 threads)
| Prompt | tok/s | Steps | vs llama.cpp |
|---|---|---|---|
| Capital of France? | 21.6 | 2 | 2.5x |
| 15 x 23? | 21.6 | 2 | 2.5x |
| Translate to French | 14.3 | 6 | 1.7x |
| Python is_prime() | 8.2 | 7 | 1.0x |
| Average | 11.6 | - | 1.4x |
Dream excels at math and code prompts: it correctly solves 15 x 23 = 345 in just 2 denoising steps at 21.6 tok/s.
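The speedup column can be inverted to see which llama.cpp throughput each row implies. This is a sketch under one assumption the post does not state explicitly: that "vs llama.cpp" is the plain ratio of diffuse-cpp tok/s to llama.cpp tok/s. Because the table values are rounded, the results are approximate.

```python
# (prompt, diffuse-cpp tok/s, speedup vs llama.cpp), copied from the table above.
ROWS = [
    ("Capital of France?", 21.6, 2.5),
    ("15 x 23?", 21.6, 2.5),
    ("Translate to French", 14.3, 1.7),
    ("Python is_prime()", 8.2, 1.0),
    ("Average", 11.6, 1.4),
]


def implied_baseline(tok_per_s: float, speedup: float) -> float:
    """Baseline tok/s implied by a throughput and its reported speedup ratio."""
    return tok_per_s / speedup


for prompt, tok_per_s, speedup in ROWS:
    baseline = implied_baseline(tok_per_s, speedup)
    print(f"{prompt:<20} implied llama.cpp baseline: {baseline:5.2f} tok/s")
```

Under that assumption every row points to a baseline between 8 and 9 tok/s, which is consistent with a fixed-cost autoregressive decoder and a diffusion decoder whose cost depends on the number of denoising steps.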
Key features
- entropy_exit: adaptive scheduler that exits early when the model is confident (2-7 steps for easy prompts vs 16 for hard ones)
- Inter-step KV cache: reuses K,V tensors between denoising steps (1.6x average speedup)
- Full GQA support: 28 query / 4 KV heads handled natively
- QKV biases: preserved at F32 in all quantizations
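The idea behind an entropy-based early exit can be sketched in a few lines. This is a minimal illustration, not diffuse-cpp's implementation: the functions `entropy`, `should_exit`, `denoise`, and `toy_step`, the threshold of 0.1 nats, and the use of mean entropy as the confidence signal are all assumptions made for the example. Only the 16-step ceiling is taken from the post.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_exit(distributions: Sequence[Sequence[float]], threshold: float) -> bool:
    """True when the mean per-position entropy is at or below the threshold."""
    if not distributions:
        return True
    mean = sum(entropy(d) for d in distributions) / len(distributions)
    return mean <= threshold


def denoise(step_fn: Callable[[int], List[List[float]]],
            max_steps: int = 16,
            threshold: float = 0.1) -> int:
    """Run denoising steps until the model looks confident; return steps used."""
    for step in range(1, max_steps + 1):
        distributions = step_fn(step)  # hypothetical: one forward pass
        if should_exit(distributions, threshold):
            return step
    return max_steps


def toy_step(step: int) -> List[List[float]]:
    """Stand-in for a model: a two-token distribution that sharpens each step."""
    p = 1.0 - 0.5 ** step
    return [[p, 1.0 - p]]


print("steps used:", denoise(toy_step))
```

A prompt whose distributions sharpen quickly leaves the loop after a few steps, while one that stays uncertain runs to the ceiling, which matches the 2-7 versus 16 step behaviour described in the list above.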
Comparison with LLaDA-8B
We also support LLaDA-8B. The two models are complementary:
- Dream excels at math and code (21.6 tok/s)
- LLaDA excels at translation (27.7 tok/s)
Links
- Engine: github.com/iafiscal1212/diffuse-cpp
- Paper: doi.org/10.5281/zenodo.19119813
- GGUF models: huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF
Thank you for creating Dream. The GQA architecture and autoregressive logit shift are elegant design choices that translate well to CPU inference!
That's so cool, thanks for your efforts on building this!
jiacheng-ye, thank you so much!
I'll be uploading more improvements soon!