Instructions to use unsloth/DeepSeek-R1-GGUF with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use unsloth/DeepSeek-R1-GGUF with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-GGUF", trust_remote_code=True)
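Note that this repository contains GGUF files rather than standard safetensors weights; loading GGUF with Transformers generally requires pointing at a specific .gguf file via the gguf_file argument, and support varies by architecture (multi-part GGUF splits like this one may not load). A hedged sketch, reusing the BF16 shard path from the llama-cpp-python example below:

# Hedged sketch: Transformers dequantizes GGUF weights on load and needs an
# explicit gguf_file. The shard path is taken from the llama-cpp-python example
# below and may not be loadable as a multi-part file for this model.
from transformers import AutoTokenizer, AutoModelForCausalLM

gguf_file = "DeepSeek-R1-BF16/DeepSeek-R1.BF16-00001-of-00030.gguf"
tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-GGUF", gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained("unsloth/DeepSeek-R1-GGUF", gguf_file=gguf_file)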
- llama-cpp-python
How to use unsloth/DeepSeek-R1-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    filename="DeepSeek-R1-BF16/DeepSeek-R1.BF16-00001-of-00030.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
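The create_chat_completion call returns a dict in the OpenAI chat-completion format, so the reply text can be read from the first choice; a minimal usage sketch reusing the llm object created above:

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
# The assistant reply lives in the first choice's message content
print(response["choices"][0]["message"]["content"])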
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use unsloth/DeepSeek-R1-GGUF with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf unsloth/DeepSeek-R1-GGUF:Q4_K_M
Use Docker
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
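Whichever install method you choose, a running llama-server exposes an OpenAI-compatible HTTP API (port 8080 by default); a minimal sketch of querying it with curl, assuming the default host and port:

curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'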
- LM Studio
- Jan
- vLLM
How to use unsloth/DeepSeek-R1-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "unsloth/DeepSeek-R1-GGUF"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
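Because the vLLM server speaks the OpenAI API, it can also be called from Python with the openai client; a minimal sketch assuming the default port 8000 (the API key is a placeholder, since vLLM does not require one by default):

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="unsloth/DeepSeek-R1-GGUF",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)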
Use Docker

# Serve with vLLM's OpenAI-compatible Docker image
# (assumes NVIDIA GPUs and the vllm/vllm-openai image):
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model "unsloth/DeepSeek-R1-GGUF"
- SGLang
How to use unsloth/DeepSeek-R1-GGUF with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "unsloth/DeepSeek-R1-GGUF" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
Use Docker images

docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "unsloth/DeepSeek-R1-GGUF" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "unsloth/DeepSeek-R1-GGUF",
    "messages": [
      { "role": "user", "content": "What is the capital of France?" }
    ]
  }'
- Ollama
How to use unsloth/DeepSeek-R1-GGUF with Ollama:
ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
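Once the model has been pulled, Ollama also serves a local HTTP API (port 11434 by default); a hedged curl sketch using the same model tag, with streaming disabled so a single JSON response is returned:

curl http://localhost:11434/api/chat -d '{
  "model": "hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M",
  "messages": [
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "stream": false
}'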
- Unsloth Studio
How to use unsloth/DeepSeek-R1-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
Use Hugging Face Spaces for Unsloth Studio
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for unsloth/DeepSeek-R1-GGUF to start chatting
- Docker Model Runner
How to use unsloth/DeepSeek-R1-GGUF with Docker Model Runner:
docker model run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
- Lemonade
How to use unsloth/DeepSeek-R1-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull unsloth/DeepSeek-R1-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.DeepSeek-R1-GGUF-Q4_K_M
List all available models
lemonade list
Community discussions for unsloth/DeepSeek-R1-GGUF:
- Model evaluation for dynamic quantized models (#52, opened about 1 year ago by shkashya)
- PLEASE make one like this for Maverick (LLaMA 4) (#50, 7 replies, opened about 1 year ago by ccocks-deca)
- After downloading DeepSeek-R1-UD-Q2_K_XL with LM Studio, it is not recognized in LM Studio (#49, opened about 1 year ago by den1209)
- Are there plans to release dynamic quantized versions of the distilled models? (#48, opened about 1 year ago by CoiaPrant)
- 404 not found (#47, opened about 1 year ago by Sanjain)
- Didn't work with Ollama: out of memory (#46, 3 replies, opened about 1 year ago by AlekseyStart)
- MTP weights? (#45, opened about 1 year ago by SzymonOzog)
- Production-ready DeepSeek R1 GGUF deployment instructions (with CPU offloading) on AWS, 10x cheaper than Bedrock imports (#44, opened about 1 year ago by samagra14)
- DeepSeek-R1-UD-Q2_K_XL inference with llama.cpp can't use flash attention with n_embd_head_k != n_embd_head_v (#43, 2 replies, opened about 1 year ago by fuzhenxin)
- Sharing an MMLU test result: the 2.51-bit quant, compared with the DeepSeek API and Baidu's DeepSeek, seems very capable, at least on MMLU (#42, 2 replies, opened about 1 year ago by tarjintor)
- RTX 5090 with 600 GB of RAM: which models? (#40, 4 replies, opened about 1 year ago by frank-mx)
- Deploying a production-ready service with GGUF on an AWS account (#39, 1 reply, opened about 1 year ago by samagra-tensorfuse)
- How to convert DeepSeek-R1-UD-IQ1_M GGUF back to safetensors? (#38, opened about 1 year ago by Cheryl33990)
- Perplexity comparison results (updated) (#37, 2 replies, opened about 1 year ago by inputout)
- Is the Q2_K_XL model the best? IQ2_XXS beats Q2_K_XL on the MMLU-Pro benchmark (#36, 11 replies, opened about 1 year ago by albertchow)
- Long-form input takes too long (#35, opened about 1 year ago by htkim27)
- Is Q2_K_XL or Q4 better? (#34, 3 replies, opened about 1 year ago by jializou)
- Is it uncensored? (#33, 5 replies, opened about 1 year ago by Morrigan-Ship)
- Cannot run the `unsloth/DeepSeek-R1-GGUF` model: missing `configuration_deepseek.py` (#32, 2 replies, opened over 1 year ago by syrys4750)
- When using llama.cpp to deploy the DeepSeek-R1-Q4_K_M model, garbled characters appear in the server's response (#31, 4 replies, opened over 1 year ago by KAMING)
- How do the various quantized versions perform on different evaluation datasets? Are there any concrete test results? (#29, 3 replies, opened over 1 year ago by huanfa)
- When used with Ollama, does it support kv_cache_type=q4_0 and flash_attention=1? (#28, 3 replies, opened over 1 year ago by leonzy04)
- How to handle multiple HTTP requests concurrently? (#27, 4 replies, opened over 1 year ago by 007hao)
- After merging the IQ1_S model and deploying it on Ollama, generation quality is poor (#26, 4 replies, opened over 1 year ago by gaozj)
- The model seems to have been fine-tuned (#25, 2 replies, opened over 1 year ago by mogazheng)
- What is the base precision type (FP32/FP16) used in Q2/Q1 quantization? (#23, opened over 1 year ago by ArYuZzz1)
- Any benchmark results? (#22, 3 replies, opened over 1 year ago by lxww301)
- Accuracy of the dynamic quants compared to usual quants? (#21, 19 replies, opened over 1 year ago by inputout)
- 8-bit quantization (#20, 5 replies, opened over 1 year ago by ramkumarkoppu)
- New research paper: R1-type reasoning models can be drastically improved in quality (#19, 2 replies, opened over 1 year ago by krustik)
- MD5 / SHA-256 hashes please (#18, 1 reply, opened over 1 year ago by ivanvolosyuk)
- Is there a model with the non-shared MoE experts removed? (#17, 4 replies, opened over 1 year ago by ghostplant)
- A step-by-step deployment guide with Ollama (#16, 4 replies, opened over 1 year ago by snowkylin)
- No think tokens visible (#15, 6 replies, opened over 1 year ago by sudkamath)
- Over 2 tok/sec aggregate backed by NVMe SSD on a 96 GB RAM + 24 GB VRAM AM5 rig with llama.cpp (#13, 9 replies, opened over 1 year ago by ubergarm)
- Running the model with vLLM does not actually work (#12, 8 replies, opened over 1 year ago by aikitoria)
- DeepSeek-R1-GGUF not available on LM Studio (#11, 2 replies, opened over 1 year ago by 32SkyDive)
- Where did the BF16 come from? (#10, 8 replies, opened over 1 year ago by gshpychka)
- Inference speed (#9, 2 replies, opened over 1 year ago by Iker)
- Running this model using vLLM Docker (#8, 4 replies, opened over 1 year ago by moficodes)
- UD-IQ1_M models for distilled R1 versions? (#6, 3 replies, opened over 1 year ago by SamPurkis)
- Llama.cpp server chat template (#4, 5 replies, opened over 1 year ago by softwareweaver)
- Are the Q4 and Q5 models R1 or R1-Zero? (#2, 18 replies, opened over 1 year ago by gng2info)
- What is the VRAM requirement to run this? (#1, 5 replies, opened over 1 year ago by RageshAntony)