hubertmarek/agent-diff-bench
Viewer • Updated • 224 • 100 • 1
LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Box, Google Calendar, Linear, and Slack.
Evaluated on agent-diff-bench (45 tasks, test split).
Per-example average reward (higher is better):
| Config | Reward | Error Rate |
|---|---|---|
| LoRA ep5 t=0.5 | 0.356 | 22.2% |
| LoRA ep4 t=0.7 | 0.341 | 21.6% |
| Base t=0.5 | 0.322 | 28.4% |
| Base t=0.7 | 0.220 | 27.8% |
Grand means (per-example average, all rollouts pooled):
Best-of per example:
| Service | LoRA ep5 t=0.5 | Base t=0.5 | Delta |
|---|---|---|---|
| Box | 0.266 | 0.100 | +0.166 |
| Calendar | 0.453 | 0.369 | +0.084 |
| Linear | 0.317 | 0.142 | +0.175 |
| Slack | 0.435 | 0.452 | -0.017 |
Head-to-head (best LoRA vs best Base per example): LoRA wins 14, Base wins 5, Tied 15
SFTConfig(
num_train_epochs=8,
per_device_train_batch_size=1,
gradient_accumulation_steps=6,
learning_rate=5e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.08,
bf16=True,
max_length=64000,
optim="adamw_torch_fused",
gradient_checkpointing=True,
save_strategy="epoch",
packing=False,
)
export HF_TOKEN='your_token'
vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-lora \
--lora-modules agent-diff=ministral-3-14b-agent-diff-sft-lora \
--enable-auto-tool-choice --tool-call-parser mistral \
--max-model-len 64000 \
--max-lora-rank 64 \
--enforce-eager
prime eval hubert-marek/agent-diff-bench \
-m agent-diff \
--api-base-url http://localhost:8000/v1 \
-n -1 -r 3 -c 15 \
--max-retries 20 \
--env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
--save-results \
--temperature 0.5
This repo contains multiple epochs as commits:
Base model
mistralai/Ministral-3-14B-Base-2512