SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!
SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of SmolTulu-1.7b-Instruct, which leverages AllenAI's Tulu 3 post-training pipeline.
This model achieves the highest IFEval and GSM8K scores of the models compared below, while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the dataset used for the RLVR stage below; it is the same one used in the Tulu 3 paper.
Evaluation
I ran these evaluations using SmolLM2's evaluation code for a fairer comparison.
| Metric | SmolTulu-1.7b-Instruct | SmolTulu-1.7b-Reinforced | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|---|---|---|---|---|---|---|
| ARC (Average) | 51.5 | 51.1 | 51.7 | 41.6 | 46.2 | 43.7 |
| BBH (3-shot) | 33.8 | 33.4 | 32.2 | 27.6 | 35.3 | 25.7 |
| GSM8K (5-shot) | 51.6 | 61.0 | 48.2 | 26.8 | 42.8 | 4.6 |
| HellaSwag | 61.1 | 60.4 | 66.1 | 56.1 | 60.9 | 55.5 |
| IFEval (Average prompt/inst) | 67.7 | 69.3 | 56.7 | 53.5 | 47.4 | 23.1 |
| MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | 24.2 | 11.7 |
| PIQA | 72.2 | 72.1 | 74.4 | 72.3 | 73.2 | 71.6 |
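To make the effect of the RLVR stage easier to read off the table, the following sketch computes the per-benchmark difference between SmolTulu-1.7b-Reinforced and SmolTulu-1.7b-Instruct. The scores are copied from the table above; the variable names are illustrative only.

```python
# Scores copied from the evaluation table as (Instruct, Reinforced) pairs.
scores = {
    "ARC (Average)": (51.5, 51.1),
    "BBH (3-shot)": (33.8, 33.4),
    "GSM8K (5-shot)": (51.6, 61.0),
    "HellaSwag": (61.1, 60.4),
    "IFEval (Average prompt/inst)": (67.7, 69.3),
    "MMLU-Pro (MCF)": (17.4, 17.3),
    "PIQA": (72.2, 72.1),
}

# A positive delta means the Reinforced model scores higher than the Instruct model.
deltas = {
    name: round(reinforced - instruct, 1)
    for name, (instruct, reinforced) in scores.items()
}

# Print benchmarks from largest gain to largest drop.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {delta:+.1f}")
```

GSM8K and IFEval are the only benchmarks that improve; every other benchmark moves by less than one point.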
Training Details
The reinforced model used PPO with verifiable rewards:
- Base model: SmolTulu-1.7b-Instruct
- Learning rate: 3e-6
- Total training episodes: 10M
- PPO KL penalty coefficient (beta): 0.05
- Maximum sequence/prompt length: 2048 tokens
- Response length: 2048 tokens
- Rollout batch size: 32
- Minibatch size: 32
- Temperature: 1.0
- Penalty reward: -10.0 for incomplete generations
- DeepSpeed Stage 3 optimization
- Gradient checkpointing enabled
- Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
- Reward model multiplier: 0.0 (pure verifiable rewards)
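As a rough illustration of what "pure verifiable rewards" with a penalty for incomplete generations could look like for a GSM8K-style prompt, here is a minimal sketch. It is not the training code used for this model: the function name `verifiable_reward`, the last-number answer extraction, and the `correct_reward` value are assumptions made for illustration. Only the -10.0 penalty comes from the list above.

```python
import re


def verifiable_reward(response, ground_truth, finished,
                      correct_reward=10.0, penalty=-10.0):
    """Hypothetical reward for a math prompt with a known numeric answer.

    response:       generated text
    ground_truth:   reference answer as a string, e.g. "42"
    finished:       True if generation ended on its own (end-of-sequence)
    correct_reward: assumed value, not taken from the model card
    penalty:        reward for incomplete generations (-10.0 in the list above)
    """
    if not finished:
        # The response ran out of tokens without terminating.
        return penalty
    # Assumption: strip commas so "1,000" parses as 1000, then treat the
    # last number in the response as the final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if not numbers:
        return 0.0
    return correct_reward if float(numbers[-1]) == float(ground_truth) else 0.0
```

With a reward model multiplier of 0.0, a signal of this kind would be the only reward the policy sees, apart from the KL penalty.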
Usage
Like any Hugging Face model, it can be run with the transformers library:
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
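Since the model is instruction-tuned, prompts are usually better passed through the tokenizer's chat template than as raw text. A minimal sketch follows; the helper names `build_messages` and `chat` and the `max_new_tokens` default are illustrative and not part of the model card.

```python
def build_messages(prompt):
    """Wrap a user prompt in the chat message format (hypothetical helper)."""
    return [{"role": "user", "content": prompt}]


def chat(prompt, checkpoint="SultanR/SmolTulu-1.7b-Reinforced", max_new_tokens=128):
    """Generate a single reply to `prompt` using the model's chat template."""
    # Imported lazily so build_messages can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Apply the chat template and append the assistant turn marker.
    inputs = tokenizer.apply_chat_template(
        build_messages(prompt),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `chat("What is the capital of France?")` downloads the checkpoint on first use and returns the decoded reply as a string.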
Citation
@misc{alrashed2024smoltuluhigherlearningrate,
title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs},
author={Sultan Alrashed},
year={2024},
eprint={2412.08347},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.08347},
}
The training methodology follows the Tulu 3 paper:
@article{lambert2024tulu3,
title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
year={2024},
journal={arXiv preprint arXiv:2411.15124}
}
