I-DLM-8B
Introspective Diffusion Language Model (8B): a diffusion language model (DLM) converted from Qwen3-8B that matches autoregressive (AR) quality while enabling parallel token generation.
Highlights
- First DLM to match same-scale AR quality across 15 benchmarks
- Introspective Strided Decoding (ISD): single-pass generation + verification with p/q acceptance criterion
- AR-compatible serving via SGLang (paged KV cache, continuous batching, CUDA graphs)
- 2.9–4.1× higher throughput than prior DLMs at high concurrency
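This card does not spell out the p/q acceptance criterion used by ISD. The sketch below assumes it resembles the standard speculative-sampling test, in which a proposed token is accepted with probability min(1, p/q): q is the probability under which the token was proposed and p is the probability assigned to it at verification. The function names, the argument names, and the stop-at-first-rejection policy are illustrative assumptions, not the authors' implementation.

```python
import random


def accept_prob(p: float, q: float) -> float:
    """Speculative-sampling acceptance probability min(1, p/q)."""
    if q <= 0.0:
        raise ValueError("proposal probability q must be positive")
    return min(1.0, p / q)


def count_accepted(p_probs, q_probs, rng=None):
    """Return how many leading proposed tokens are accepted.

    p_probs[i]: verification probability of proposed token i (assumed name).
    q_probs[i]: proposal probability of the same token (assumed name).
    Stops at the first rejection, as in standard speculative decoding.
    """
    rng = rng or random.Random()
    accepted = 0
    for p, q in zip(p_probs, q_probs):
        if rng.random() < accept_prob(p, q):
            accepted += 1
        else:
            break
    return accepted
```

Under this assumed rule, a token whose verification probability is at least its proposal probability is always kept, so a model that agrees with its own proposals accepts most of each block.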
Results
Quality (I-DLM-8B vs baselines)
| Benchmark | I-DLM-8B | Qwen3-8B (AR) | LLaDA-2.1-mini (16B) | SDAR (8B) |
|---|---|---|---|---|
| ARC-C | 95.8 | 95.8 | 90.2 | 91.9 |
| MMLU | 82.4 | 83.5 | 74.5 | 78.6 |
| MMLU-Pro | 73.1 | 75.1 | 64.8 | 56.9 |
| GPQA-D | 55.6 | 58.9 | 46.0 | 40.2 |
| GPQA | 54.9 | 55.4 | 53.3 | --- |
| GSM8K | 95.0 | 96.0 | 89.0 | 91.7 |
| MATH-500 | 96.8 | 95.8 | 85.0 | 78.6 |
| MathBench | 89.1 | 93.1 | 84.2 | 76.9 |
| AIME-24 | 69.6 | 73.1 | 43.3 | 10.0 |
| AIME-25 | 60.8 | 65.4 | 43.3 | 10.0 |
| HumanEval | 93.3 | 95.1 | 86.0 | 78.7 |
| MBPP | 92.2 | 93.4 | 82.1 | 72.0 |
| LiveCodeBench-v6 | 45.7 | 50.3 | 30.4 | 16.6 |
| IFEval | 84.7 | 84.7 | 83.2 | 61.4 |
Usage
Note: This model checkpoint is hosted on HuggingFace for weight distribution. For inference, please use our SGLang-based ISD pipeline, which implements the Introspective Strided Decoding algorithm described in the paper. Direct loading via transformers is not currently supported for reproducing paper results.
Inference via SGLang (Recommended)
# Install
git clone https://github.com/Introspective-Diffusion/I-DLM.git
cd I-DLM/inference && bash install.sh
# Launch server
python -m sglang.launch_server \
--model-path yifanyu/I-DLM-8B \
--trust-remote-code --tp-size 1 --dtype bfloat16 \
--mem-fraction-static 0.85 --max-running-requests 32 \
--attention-backend flashinfer --dllm-algorithm IDLMBlockN \
--dllm-algorithm-config inference/configs/idlm_blockN4_config.yaml \
--port 30000
# Generate
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Prove sqrt(2) is irrational."}],"max_tokens":4096}'
See the inference README for detailed setup, evaluation, and benchmarking.
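The curl call above can also be issued from Python. This is a minimal sketch using only the standard library; it assumes the server launched above is listening on localhost:30000 and that the response follows the OpenAI chat-completions schema. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:30000", max_tokens=4096):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    # Response layout assumed to follow the OpenAI chat-completions schema.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Prove sqrt(2) is irrational.")` should return the generated reply as a string.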
Method
I-DLM recovers introspective consistency (AR models' inherent self-agreement) through:
- Strict causal masking across both masked and clean tokens
- Logit shift (Dream shift): hidden state at position i predicts token i+1
- All-masked training with auto-balanced loss: CE loss on both noisy and clean token positions, dynamically balanced
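The first two ingredients above can be illustrated with a small, dependency-free sketch: a strict causal mask, and a cross-entropy in which the output at position i is scored against token i+1. The function names are hypothetical, and the auto-balancing between noisy and clean positions is not modeled here because the card does not describe how the weights are chosen.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def shifted_ce(logits, tokens):
    """Mean cross-entropy where logits[i] is scored against tokens[i + 1].

    logits: list of per-position score lists over the vocabulary.
    tokens: list of token ids, same length as logits.
    """
    losses = []
    for i in range(len(tokens) - 1):
        row = logits[i]
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        # Negative log-probability of the next token under position i.
        losses.append(log_z - row[tokens[i + 1]])
    return sum(losses) / len(losses)
```

The last position has no next token to predict, so the loop stops one step early; this mirrors the usual next-token objective of AR models.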
Related Models
| Model | HuggingFace | Description |
|---|---|---|
| I-DLM-8B | yifanyu/I-DLM-8B | Converted from Qwen3-8B |
| I-DLM-32B | yifanyu/I-DLM-32B | Converted from Qwen3-32B |
| I-DLM-8B-LoRA | yifanyu/I-DLM-8B-lora-r128 | Gated LoRA adapter (rank=128) for lossless R-ISD |
Citation
@article{yu2026introspective,
title={Introspective Diffusion Language Models},
author={Yu, Yifan and Jian, Yuqing and Wang, Junxiong and Zhou, Zhongzhu
and Zhuang, Donglin and Fang, Xinyu and Yanamandra, Sri
and Wu, Xiaoxia and Wu, Qingyang and Song, Shuaiwen Leon
and Dao, Tri and Athiwaratkun, Ben and Zou, James
and Lai, Fan and Xu, Chenfeng},
journal={arXiv preprint arXiv:2604.11035},
year={2026}
}