Libraries

How to use SultanR/SmolTulu-1.7b-Reinforced with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SultanR/SmolTulu-1.7b-Reinforced")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SultanR/SmolTulu-1.7b-Reinforced")
model = AutoModelForCausalLM.from_pretrained("SultanR/SmolTulu-1.7b-Reinforced")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use SultanR/SmolTulu-1.7b-Reinforced with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SultanR/SmolTulu-1.7b-Reinforced"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SultanR/SmolTulu-1.7b-Reinforced",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
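
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the server started by `vllm serve` is listening on the default `localhost:8000`, and the helper names `build_chat_request` and `ask` are hypothetical, chosen for illustration.

```python
import json
import urllib.request

# Assumed values: they mirror the curl example (vLLM on localhost:8000).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "SultanR/SmolTulu-1.7b-Reinforced"


def build_chat_request(prompt, url=BASE_URL, model=MODEL):
    """Build the same POST request that the curl command sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, url=BASE_URL, model=MODEL, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, url, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers place the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the model's answer. For the SGLang server described further down, pass `url="http://localhost:30000/v1/chat/completions"` instead.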

Use Docker

docker model run hf.co/SultanR/SmolTulu-1.7b-Reinforced

SGLang

How to use SultanR/SmolTulu-1.7b-Reinforced with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "SultanR/SmolTulu-1.7b-Reinforced" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SultanR/SmolTulu-1.7b-Reinforced",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "SultanR/SmolTulu-1.7b-Reinforced" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SultanR/SmolTulu-1.7b-Reinforced",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use SultanR/SmolTulu-1.7b-Reinforced with Docker Model Runner:
```
docker model run hf.co/SultanR/SmolTulu-1.7b-Reinforced
```

SmolLM2 1.7b Aligned and Reinforced Through Tulu 3!

SmolTulu-1.7b-Reinforced is the reinforcement learning with verifiable rewards (RLVR) version of SmolTulu-1.7b-Instruct, built with AllenAI's Tulu 3 post-training pipeline.

This model achieves the highest IFEval and GSM8K scores of the models compared in the evaluation below, while maintaining the extremely low contamination levels of Tulu 3 and SmolLM2! I've listed the dataset used for the RLVR stage below; it is the same one used in the Tulu 3 paper.

Evaluation

I ran these evaluations using SmolLM2's evaluation code for a fairer comparison.

| Metric | SmolTulu-1.7b-Instruct | SmolTulu-1.7b-Reinforced | SmolLM2-1.7B-Instruct | Llama-1B-Instruct | Qwen2.5-1.5B-Instruct | SmolLM1-1.7B-Instruct |
|---|---|---|---|---|---|---|
| ARC (Average) | 51.5 | 51.1 | 51.7 | 41.6 | 46.2 | 43.7 |
| BBH (3-shot) | 33.8 | 33.4 | 32.2 | 27.6 | 35.3 | 25.7 |
| GSM8K (5-shot) | 51.6 | 61.0 | 48.2 | 26.8 | 42.8 | 4.6 |
| HellaSwag | 61.1 | 60.4 | 66.1 | 56.1 | 60.9 | 55.5 |
| IFEval (Average prompt/inst) | 67.7 | 69.3 | 56.7 | 53.5 | 47.4 | 23.1 |
| MMLU-Pro (MCF) | 17.4 | 17.3 | 19.3 | 12.7 | 24.2 | 11.7 |
| PIQA | 72.2 | 72.1 | 74.4 | 72.3 | 73.2 | 71.6 |
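
To make the effect of the RLVR stage easier to read off, here is a small sketch that computes the per-metric change from SmolTulu-1.7b-Instruct to SmolTulu-1.7b-Reinforced. The numbers are copied from the table above; the variable names are only illustrative.

```python
# Scores copied from the evaluation table: (Instruct, Reinforced).
scores = {
    "ARC (Average)": (51.5, 51.1),
    "BBH (3-shot)": (33.8, 33.4),
    "GSM8K (5-shot)": (51.6, 61.0),
    "HellaSwag": (61.1, 60.4),
    "IFEval (Average prompt/inst)": (67.7, 69.3),
    "MMLU-Pro (MCF)": (17.4, 17.3),
    "PIQA": (72.2, 72.1),
}

# Change in points from the Instruct model to the Reinforced model.
deltas = {
    metric: round(reinforced - instruct, 1)
    for metric, (instruct, reinforced) in scores.items()
}

# Print the metrics from largest gain to largest drop.
for metric, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{metric:<30}{delta:+.1f}")
```

The gains are concentrated in GSM8K (+9.4) and IFEval (+1.6), while every other metric moves by less than one point.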

Training Details

The reinforced model used PPO with verifiable rewards:

Base model: SmolTulu-1.7b-Instruct
Learning rate: 3e-6
Total training episodes: 10M
PPO KL penalty coefficient (beta): 0.05
Maximum sequence/prompt length: 2048 tokens
Response length: 2048 tokens
Rollout batch size: 32
Minibatch size: 32
Temperature: 1.0
Penalty reward: -10.0 for incomplete generations
DeepSpeed Stage 3 optimization
Gradient checkpointing enabled
Training data: RLVR-GSM-MATH-IF-Mixed-Constraints
Reward model multiplier: 0.0 (pure verifiable rewards)
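
As a rough illustration of how these settings fit together, the sketch below shows a verifiable reward for a GSM8K-style prompt and the usual way a KL penalty is folded into per-token PPO rewards. It is not the training code: the function names are hypothetical, the answer-matching rule is simplified, and `CORRECT_REWARD = 10.0` is an assumption, since the reward for a correct answer is not stated here. Only the penalty (-10.0) and beta (0.05) come from the list above.

```python
import re

INCOMPLETE_PENALTY = -10.0  # from the list: penalty for incomplete generations
KL_BETA = 0.05              # from the list: PPO KL penalty coefficient
CORRECT_REWARD = 10.0       # assumption: not stated in this model card


def verifiable_reward(response, ground_truth, finished):
    """Score a response by checking its last number against the reference answer."""
    if not finished:  # generation ran out of length before ending
        return INCOMPLETE_PENALTY
    # Drop thousands separators, then collect every number in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    if numbers and float(numbers[-1]) == float(ground_truth):
        return CORRECT_REWARD
    return 0.0


def shaped_token_rewards(policy_logprobs, ref_logprobs, sequence_reward, beta=KL_BETA):
    """Per-token KL penalty, with the sequence-level reward added on the last token."""
    rewards = [-beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += sequence_reward
    return rewards
```

Because the reward model multiplier is 0.0, no learned reward model score enters the sequence-level reward in this sketch; only the programmatic check and the incomplete-generation penalty do.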

Usage

As with any Hugging Face model, you can run it with the transformers library:

# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "SultanR/SmolTulu-1.7b-Reinforced"
device = "cuda"  # use "cpu" if no GPU is available
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# For multiple GPUs, install accelerate and load the model with
# `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
# Tokenize the prompt and move the input IDs to the same device as the model
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
# Set max_new_tokens explicitly; the default generation length can be very short
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Citation

@misc{alrashed2024smoltuluhigherlearningrate,
      title={SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs}, 
      author={Sultan Alrashed},
      year={2024},
      eprint={2412.08347},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.08347}, 
}

The training methodology follows the Tulu 3 paper:

@article{lambert2024tulu3,
  title={TÜLU 3: Pushing Frontiers in Open Language Model Post-Training},
  author={Lambert, Nathan and Morrison, Jacob and Pyatkin, Valentina and others},
  year={2024},
  journal={arXiv preprint arXiv:2411.15124}
}