Mistral Small Agent-Diff

LoRA fine-tune of Ministral-3-14B-Instruct-2512-BF16 for API tool-calling tasks across Box, Google Calendar, Linear, and Slack.

Results

Evaluated on agent-diff-bench (45 tasks, test split).

Per-example average reward (higher is better):

Config	Reward	Error Rate
LoRA ep5 t=0.5	0.356	22.2%
LoRA ep4 t=0.7	0.341	21.6%
Base t=0.5	0.322	28.4%
Base t=0.7	0.220	27.8%

Grand means (per-example average, all rollouts pooled):

All LoRA configs: 0.350
All Base configs: 0.282
Delta: +0.068

Best-of per example:

Best LoRA: 0.454
Best Base: 0.362
Delta: +0.092

Per-Service Breakdown

Service	LoRA ep5 t=0.5	Base t=0.5	Delta
Box	0.266	0.100	+0.166
Calendar	0.453	0.369	+0.084
Linear	0.317	0.142	+0.175
Slack	0.435	0.452	-0.017

Head-to-head (best LoRA vs best Base per example): LoRA wins 14, Base wins 5, Tied 15

Training

Data

Source: Devstral-2512 rollouts on agent-diff-bench, filtered for reward > 0.8
Processing pipeline:
1. Native formatting (0 missing content, 0 consecutive assistant issues)
2. Command flattening (multi-line curl to single-line, reduced from 44% to 6%)
3. Error turn removal (failed API call + error response pairs removed, error rate 20% to 1.8%)
4. Prompt-level train/val split (0% leakage)
Final dataset: 361 rows, 142 unique prompts, ~2.5 rollouts per prompt
Dataset: hubertmarek/mistral-large-agent-diff-sft-mixed-old-plus-devstral-r0p8-64k

Hyperparameters

SFTConfig(
    num_train_epochs=8,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=6,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.08,
    bf16=True,
    max_length=64000,
    optim="adamw_torch_fused",
    gradient_checkpointing=True,
    save_strategy="epoch",
    packing=False,
)

LoRA Config

Rank: 64
Alpha: 128
Target modules: all linear layers

Inference

vLLM

export HF_TOKEN='your_token'
vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-lora \
  --lora-modules agent-diff=ministral-3-14b-agent-diff-sft-lora \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --max-model-len 64000 \
  --max-lora-rank 64 \
  --enforce-eager

Evaluation

prime eval hubert-marek/agent-diff-bench \
  -m agent-diff \
  --api-base-url http://localhost:8000/v1 \
  -n -1 -r 3 -c 15 \
  --max-retries 20 \
  --env-args '{"agentdiff_api_key": "YOUR_KEY"}' \
  --save-results \
  --temperature 0.5

Checkpoints

This repo contains multiple epochs as commits:

Epoch 3 (checkpoint-183): Recommended starting point
Epoch 5 (checkpoint-305): Best benchmark performance
Epoch 8 (checkpoint-488): Overfitting

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for hubertmarek/Ministral-3-14B-Agent-Diff-SFT-LoRA

Base model

mistralai/Ministral-3-14B-Base-2512

Finetuned

mistralai/Ministral-3-14B-Instruct-2512-BF16

Adapter

(3)

this model

hubertmarek
/

Ministral-3-14B-Agent-Diff-SFT-LoRA