AristaeusAgent

AristaeusAgent is a QLoRA fine-tune of EphAsad/Aristaeus — itself a full fine-tune of Qwen2.5-1.5B-Instruct — trained to add structured agentic tool-use on top of the reasoning foundation established in Stage 1.

This is Stage 2 of a two-stage training pipeline:

  • Stage 1 — Aristaeus: Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
  • Stage 2 — AristaeusAgent (this model): Agentic tool-calling with <think>-before-act behaviour

The model uses the Hermes-style tool-call format: reasoning goes in <think>...</think> blocks, and tool invocations are emitted as <tool_call>{"name": "...", "arguments": {...}}</tool_call>.


Training

Detail                Value
Base model            EphAsad/Aristaeus (Stage 1)
Fine-tune type        QLoRA (4-bit base, bf16 adapter)
LoRA rank             r=16, alpha=32
LoRA targets          q/k/v/o/gate/up/down projections
Hardware              NVIDIA A100-SXM4-40GB
Epochs                1
Sequence length       8192 tokens
Effective batch size  16 (batch 1 × grad accum 16)
Learning rate         5e-6 (cosine schedule)
Framework             Unsloth + TRL SFTTrainer
Packing               Disabled — agentic trajectories must stay intact
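
For reference, a minimal sketch of what this configuration looks like in Unsloth + TRL. Only the hyperparameters in the table above are taken from the actual run; the dataset loading here is a placeholder (the real run trained on the merged 5-dataset stack), and exact API details may differ by library version.

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Load the Stage 1 model 4-bit quantised for QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    "EphAsad/Aristaeus", max_seq_length=8192, load_in_4bit=True,
)

# Attach a bf16 LoRA adapter over all seven projection matrices
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder: the real run used the merged 5-dataset stack
dataset = load_dataset("DJLougen/hermes-agent-traces-filtered", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # effective batch size 16
        num_train_epochs=1,
        learning_rate=5e-6,
        lr_scheduler_type="cosine",
        bf16=True,
        packing=False,  # trajectories must stay intact
        max_seq_length=8192,
        output_dir="aristaeus-agent-qlora",
    ),
)
trainer.train()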

Why QLoRA over full fine-tune

Stage 1 established chain-of-thought reasoning over 29,400 examples in an 81-minute full fine-tune. Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those Stage 1 gains: a full fine-tune at Stage 2's data scale would overwrite the reasoning capability rather than extend it.

Datasets

DJLougen/hermes-agent-traces-filtered — Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial <think> blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal.
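
The filter logic reduces to something like the following (a sketch; the exact heuristics and trace schema are not published):

import json
import re

def keep_trace(text: str) -> bool:
    """Keep traces with non-trivial reasoning and valid tool-call JSON."""
    # Non-trivial <think> block: more than 50 characters of reasoning
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not think or len(think.group(1).strip()) <= 50:
        return False
    # Every <tool_call> body must parse as JSON
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            json.loads(body)
        except json.JSONDecodeError:
            return False
    # The third criterion (deliberate tool-selection reasoning) needs a
    # content heuristic and is omitted here.
    return True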

lambda/hermes-agent-reasoning-traces — Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory — Multi-turn trajectories with per-row quality scores. Filtered to score >= 0.7 (~2,000 rows). Includes reasoning_content fields converted to <think> blocks during normalisation. Apache 2.0.
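
The reasoning_content conversion looks roughly like this (field names are assumptions from the dataset description; note the single-line JSON serialisation, which matters, as discussed under Honest Limitations):

import json

def normalise_turn(turn: dict) -> str:
    """Convert an assistant turn to Hermes format (schema assumed)."""
    parts = []
    if turn.get("reasoning_content"):
        parts.append(f"<think> {turn['reasoning_content'].strip()} </think>")
    for call in turn.get("tool_calls", []):
        # Single-line default serialisation; indent=2 here caused the v1 failure
        parts.append(f"<tool_call>\n{json.dumps(call)}\n</tool_call>")
    if turn.get("content"):
        parts.append(turn["content"])
    return "\n".join(parts)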

glaiveai/glaive-function-calling-v2 — No-call subset only (~1,200 rows after filtering). Tool schemas present in the system prompt, assistant answers directly from knowledge. These negative examples teach tool refusal — the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.
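
The no-call selection reduces to a row filter of roughly this shape (glaive v2 marks calls with <functioncall> and FUNCTION RESPONSE inside the chat field; treat the field names and markers as assumptions):

def keep_no_call(row: dict) -> bool:
    """Keep rows where tools are offered but never called, with no
    markdown fences that could bleed into tool-call syntax."""
    chat = row["chat"]
    return ("<functioncall>" not in chat
            and "FUNCTION RESPONSE" not in chat
            and "```" not in chat)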


Evaluation

AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).
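
Scoring aggregates per dimension in the obvious way (hypothetical sketch; each of the 5 dimensions holds 10 tests scored 0-3, hence the 30-point dimension maximum):

from collections import defaultdict

def aggregate(results: list[dict]) -> dict[str, int]:
    """Sum per-test scores (0-3) into per-dimension totals (max 30)."""
    totals: dict[str, int] = defaultdict(int)
    for r in results:  # e.g. {"id": "D01", "dimension": "Tool Refusal", "score": 0}
        totals[r["dimension"]] += r["score"]
    return dict(totals)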

Overall

Model                     Score       %
Aristaeus (Stage 1 base)  68 / 150    45.3%
AristaeusAgent            94 / 150    62.7%
Delta                     +26 points  +17.4pp

AristaeusAgent wins 23 tests, draws 19, loses 8.

By dimension

Dimension              Base   Base%  Agent  Agent%  Delta
Reasoning Quality       5/30  16.7%  25/30  83.3%   +20
Multi-Step Planning     8/30  26.7%  22/30  73.3%   +14
Tool Selection         11/30  36.7%  18/30  60.0%    +7
Argument Construction  22/30  73.3%  23/30  76.7%    +1
Tool Refusal           22/30  73.3%   6/30  20.0%   -16

Regressions (8 tests)

Test                                 Dimension              Base  Agent  Note
B09 — run grep -r 'error' /var/log/  Argument Construction     3      1  Searched web instead of running bash
D01 — What is 2+2?                   Tool Refusal              2      0  Called a tool unnecessarily
D02 — What language is Python?       Tool Refusal              3      0  Called a tool unnecessarily
D03 — Explain REST API               Tool Refusal              3      0  Called a tool unnecessarily
D05 — Write a haiku                  Tool Refusal              3      0  Called a tool unnecessarily
D07 — Summarise AMR                  Tool Refusal              3      0  Called a tool unnecessarily
D09 — SOLID principles               Tool Refusal              3      0  Called a tool unnecessarily
E10 — Weather → write report         Multi-Step Planning       2      1  Partial step sequencing

Tool Refusal — the remaining limitation

Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it: the model still calls tools on static knowledge questions it should answer directly. This reflects a persistent training-data imbalance. Agentic trajectories dominate the mix and almost always end in a tool call, so when a system prompt offers tools, the model's prior is to use one.

Partially addressable at inference time with an explicit system prompt instruction:

Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions, definitions,
and simple arithmetic.

Spot check — format confirmation

Prompt: "What were the top AI news stories this week?"

── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}}   ← raw JSON, no tags

── AristaeusAgent ──
<think> To find the top AI news stories this week, I should search for recent
articles. Let me use the web_search tool. </think>
<tool_call>
{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
</tool_call>

AristaeusAgent correctly learned the Hermes <think> + <tool_call> format, while the base model produces raw JSON without wrapper tags. This format difference explains why several of the base model's otherwise correct answers scored 0 in earlier benchmark runs.


Honest Limitations

Tool over-triggering (primary limitation). AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions — "explain REST", "write a haiku", "what are the SOLID principles" — where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this but not sufficiently at the data scale used. Partially mitigable via system prompt instruction; fully addressable with a larger proportion of refusal examples in training or a larger base model.

Tool-call format resolved in v2. An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this: AristaeusAgent now reliably produces the <think>...</think> + <tool_call>{...}</tool_call> format. The root cause of the v1 failure was indent=2 in JSON serialisation during zake7749 normalisation, which produced a different token sequence from the compact JSON used by all other datasets.
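
The difference is easy to see (runnable illustration of the root cause):

import json

call = {"name": "web_search", "arguments": {"query": "example"}}

print(json.dumps(call))
# {"name": "web_search", "arguments": {"query": "example"}}   <- compact, one line

print(json.dumps(call, indent=2))
# {
#   "name": "web_search",
#   "arguments": {
#     "query": "example"
#   }
# }
# Same JSON, but a completely different token sequence once tokenised.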

Hallucination at 1.5B. The model confabulates supporting detail for correct answers: in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a historical attribution for the term. This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

Recursive reasoning failure. Inherited from Aristaeus Stage 1: deep recursive call-stack tracing (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.


Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bf16 to match the adapter's training precision
model     = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent",
                                                 torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

SYSTEM = """You are a helpful assistant with access to the following tools.

<tools>
[
  {
    "name": "bash",
    "description": "Execute a bash command and return stdout/stderr.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
]
</tools>

Think carefully before calling any tool. Use <think>...</think> to reason first.
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "How many Python files are in /workspace?"},
]

# Render the chat template, generate, then strip the prompt tokens from the output
text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
                        top_p=0.9, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Expected output format

<think>
To count Python files in /workspace I should use bash with find or ls.
The find command with -name "*.py" will be more reliable.
</think>
<tool_call>
{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
</tool_call>
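
A deployment consuming this output can recover the call with a small parser along these lines (illustrative, not a shipped utility):

import json
import re

def parse_response(text: str):
    """Split a model response into (reasoning, tool_call); either may be None."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    tool_call = json.loads(call.group(1)) if call else None
    return reasoning, tool_call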

Two-Stage Pipeline

This model is the output of a reproducible two-stage training pipeline:

Qwen2.5-1.5B-Instruct
        │
        ▼  Stage 1 — Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │ 
        ▼
   EphAsad/Aristaeus
        │
        ▼  Stage 2 — QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
   EphAsad/AristaeusAgent  ← this model

All training scripts, validation scripts, and benchmark code are available on request.


Design Notes

Proof of concept, not production. This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which would address the hallucination and format consistency limitations without changing any training code.

Dataset licensing. All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

Deterministic fallback philosophy. Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed for explicit reasoning before action: <think> blocks are not decorative; they are the intended inference path. Deployments should treat absent or trivially short think blocks as a quality signal, as sketched below.
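
A minimal version of that check (the 50-character threshold mirrors the training-data filter and is illustrative):

import re

def think_quality_ok(response: str, min_chars: int = 50) -> bool:
    """Flag responses whose <think> block is missing or trivially short."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return bool(m) and len(m.group(1).strip()) >= min_chars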


Author

Built by Zain Asad (Eph) — Senior Microbiology Analyst and Applied AI Engineer.


Licence

Apache 2.0 — consistent with the base model and all training datasets used.
