AristaeusAgent

AristaeusAgent is a QLoRA fine-tune of EphAsad/Aristaeus — itself a full fine-tune of Qwen2.5-1.5B-Instruct — trained to add structured agentic tool-use on top of the reasoning foundation established in Stage 1.

This is Stage 2 of a two-stage training pipeline:

  • Stage 1 — Aristaeus: Chain-of-thought reasoning (OpenThoughts3 + Bespoke-Stratos)
  • Stage 2 — AristaeusAgent (this model): Agentic tool-calling with <think>-before-act behaviour

The model uses the Hermes-style tool-call format: reasoning goes in <think>...</think> blocks, and tool invocations are emitted as <tool_call>{"name": "...", "arguments": {...}}</tool_call>.


Training

Detail                Value
Base model            EphAsad/Aristaeus (Stage 1)
Fine-tune type        QLoRA (4-bit base, bf16 adapter)
LoRA rank             r=16, alpha=32
LoRA targets          q/k/v/o/gate/up/down projections
Hardware              NVIDIA A100-SXM4-40GB
Epochs                1
Sequence length       8192 tokens
Effective batch size  16 (batch 1 × grad accum 16)
Learning rate         5e-6 (cosine schedule)
Framework             Unsloth + TRL SFTTrainer
Packing               Disabled — agentic trajectories must stay intact
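
For reference, a minimal sketch of what this configuration looks like in Unsloth + TRL. Only the hyperparameters in the table above are taken from the actual run; the dataset loading here is a placeholder (the real run trained on the merged 5-dataset stack), and exact API details may differ by library version.

from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

# Load the Stage 1 model 4-bit quantised for QLoRA
model, tokenizer = FastLanguageModel.from_pretrained(
    "EphAsad/Aristaeus", max_seq_length=8192, load_in_4bit=True,
)

# Attach a bf16 LoRA adapter over all seven projection matrices
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Placeholder: the real run used the merged 5-dataset stack
dataset = load_dataset("DJLougen/hermes-agent-traces-filtered", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # effective batch size 16
        num_train_epochs=1,
        learning_rate=5e-6,
        lr_scheduler_type="cosine",
        bf16=True,
        packing=False,  # trajectories must stay intact
        max_seq_length=8192,
        output_dir="aristaeus-agent-qlora",
    ),
)
trainer.train()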

Why QLoRA over full fine-tune

Stage 1 established chain-of-thought reasoning over 29,400 examples in an 81-minute full fine-tune. Stage 2 uses QLoRA (r=16, LR=5e-6, 1 epoch) specifically to preserve those Stage 1 gains: a full fine-tune at Stage 2's data scale would overwrite the reasoning capability rather than extend it.

Datasets

DJLougen/hermes-agent-traces-filtered — Quality-filtered subset of the Hermes agent traces. Filtered for non-trivial <think> blocks (>50 chars), valid tool-call JSON, and evidence of deliberate tool selection reasoning. Primary training signal.
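
The filter logic reduces to something like the following (a sketch; the exact heuristics and trace schema are not published):

import json
import re

def keep_trace(text: str) -> bool:
    """Keep traces with non-trivial reasoning and valid tool-call JSON."""
    # Non-trivial <think> block: more than 50 characters of reasoning
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not think or len(think.group(1).strip()) <= 50:
        return False
    # Every <tool_call> body must parse as JSON
    for body in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        try:
            json.loads(body)
        except json.JSONDecodeError:
            return False
    # The third criterion (deliberate tool-selection reasoning) needs a
    # content heuristic and is omitted here.
    return True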

lambda/hermes-agent-reasoning-traces — Multi-turn agentic trajectories generated via the Hermes Agent harness using Kimi-K2.5 and GLM-5.1 (both run locally, no API ToS concerns). 2,000 rows from each config. Apache 2.0.

zake7749/Qwen-3.6-plus-agent-tool-calling-trajectory — Multi-turn trajectories with per-row quality scores. Filtered to score >= 0.7 (~2,000 rows). Includes reasoning_content fields converted to <think> blocks during normalisation. Apache 2.0.
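
The reasoning_content conversion looks roughly like this (field names are assumptions from the dataset description; note the single-line JSON serialisation, which matters, as discussed under Honest Limitations):

import json

def normalise_turn(turn: dict) -> str:
    """Convert an assistant turn to Hermes format (schema assumed)."""
    parts = []
    if turn.get("reasoning_content"):
        parts.append(f"<think> {turn['reasoning_content'].strip()} </think>")
    for call in turn.get("tool_calls", []):
        # Single-line default serialisation; indent=2 here caused the v1 failure
        parts.append(f"<tool_call>\n{json.dumps(call)}\n</tool_call>")
    if turn.get("content"):
        parts.append(turn["content"])
    return "\n".join(parts)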

glaiveai/glaive-function-calling-v2 — No-call subset only (~1,200 rows after filtering). Tool schemas present in the system prompt, assistant answers directly from knowledge. These negative examples teach tool refusal — the model learns that having tools available does not mean they must be called. Rows containing markdown code blocks were explicitly excluded to prevent format bleed into tool-call syntax. Apache 2.0.
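
The no-call selection reduces to a row filter of roughly this shape (glaive v2 marks calls with <functioncall> and FUNCTION RESPONSE inside the chat field; treat the field names and markers as assumptions):

def keep_no_call(row: dict) -> bool:
    """Keep rows where tools are offered but never called, with no
    markdown fences that could bleed into tool-call syntax."""
    chat = row["chat"]
    return ("<functioncall>" not in chat
            and "FUNCTION RESPONSE" not in chat
            and "```" not in chat)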


Evaluation

AristaeusAgent was evaluated against Aristaeus (base) using a custom 50-test benchmark across 5 capability dimensions (max 3 points per test, 150 total). Results below are from the final v2 training run (5-dataset stack including glaive no-call subset).
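
Scoring aggregates per dimension in the obvious way (hypothetical sketch; each of the 5 dimensions holds 10 tests scored 0-3, hence the 30-point dimension maximum):

from collections import defaultdict

def aggregate(results: list[dict]) -> dict[str, int]:
    """Sum per-test scores (0-3) into per-dimension totals (max 30)."""
    totals: dict[str, int] = defaultdict(int)
    for r in results:  # e.g. {"id": "D01", "dimension": "Tool Refusal", "score": 0}
        totals[r["dimension"]] += r["score"]
    return dict(totals)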

Overall

Model                     Score       %
Aristaeus (Stage 1 base)  68 / 150    45.3%
AristaeusAgent            94 / 150    62.7%
Delta                     +26 points  +17.4pp

AristaeusAgent wins 23 tests, draws 19, loses 8.

By dimension

Dimension              Base   Base%  Agent  Agent%  Delta
Reasoning Quality       5/30  16.7%  25/30  83.3%   +20
Multi-Step Planning     8/30  26.7%  22/30  73.3%   +14
Tool Selection         11/30  36.7%  18/30  60.0%    +7
Argument Construction  22/30  73.3%  23/30  76.7%    +1
Tool Refusal           22/30  73.3%   6/30  20.0%   -16

Regressions (8 tests)

Test                                 Dimension              Base  Agent  Note
B09 — run grep -r 'error' /var/log/  Argument Construction     3      1  Searched web instead of running bash
D01 — What is 2+2?                   Tool Refusal              2      0  Called a tool unnecessarily
D02 — What language is Python?       Tool Refusal              3      0  Called a tool unnecessarily
D03 — Explain REST API               Tool Refusal              3      0  Called a tool unnecessarily
D05 — Write a haiku                  Tool Refusal              3      0  Called a tool unnecessarily
D07 — Summarise AMR                  Tool Refusal              3      0  Called a tool unnecessarily
D09 — SOLID principles               Tool Refusal              3      0  Called a tool unnecessarily
E10 — Weather → write report         Multi-Step Planning       2      1  Partial step sequencing

Tool Refusal — the remaining limitation

Tool Refusal is the only dimension where AristaeusAgent scores lower than base (6/30 vs 22/30). The glaive no-call dataset reduced the over-triggering seen in v1 but did not eliminate it: the model still calls tools on static knowledge questions it should answer directly. This reflects a persistent training-data imbalance. Agentic trajectories dominate the mix and almost always end in a tool call, so when a system prompt offers tools, the model's prior is to use one.

Partially addressable at inference time with an explicit system prompt instruction:

Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions, definitions,
and simple arithmetic.

Spot check — format confirmation

Prompt: "What were the top AI news stories this week?"

── Aristaeus (base) ──
...{"name": "web_search", "arguments": {"query": "..."}}   ← raw JSON, no tags

── AristaeusAgent ──
<think> To find the top AI news stories this week, I should search for recent
articles. Let me use the web_search tool. </think>
<tool_call>
{"name": "web_search", "arguments": {"query": "top AI news stories this week"}}
</tool_call>

AristaeusAgent correctly learned the Hermes <think> + <tool_call> format, while the base model produces raw JSON without wrapper tags. This format difference explains why several of the base model's otherwise correct answers scored 0 in earlier benchmark runs.


Honest Limitations

Tool over-triggering (primary limitation). AristaeusAgent scores 6/30 on Tool Refusal vs the base model's 22/30. The model consistently calls tools on static knowledge questions — "explain REST", "write a haiku", "what are the SOLID principles" — where it should answer directly. Root cause: agentic training data is dominated by trajectories that end in tool calls. The glaive no-call subset partially addressed this but not sufficiently at the data scale used. Partially mitigable via system prompt instruction; fully addressable with a larger proportion of refusal examples in training or a larger base model.

Tool-call format resolved in v2. An earlier training run (v1, lambda dataset only) produced a model that emitted markdown code blocks and hallucinated tool syntax. The v2 run corrected this: AristaeusAgent now reliably produces the <think>...</think> + <tool_call>{...}</tool_call> format. The root cause of the v1 failure was indent=2 in JSON serialisation during zake7749 normalisation, which produced a different token sequence from the compact JSON used by all other datasets.
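
The difference is easy to see (runnable illustration of the root cause):

import json

call = {"name": "web_search", "arguments": {"query": "example"}}

print(json.dumps(call))
# {"name": "web_search", "arguments": {"query": "example"}}   <- compact, one line

print(json.dumps(call, indent=2))
# {
#   "name": "web_search",
#   "arguments": {
#     "query": "example"
#   }
# }
# Same JSON, but a completely different token sequence once tokenised.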

Hallucination at 1.5B. The model confabulates supporting detail for correct answers: in testing it correctly identified MIC as Minimum Inhibitory Concentration but fabricated a historical attribution for the term. This is a fundamental capacity constraint at 1.5B parameters, not addressable through fine-tuning at this scale. Recommended mitigation: use Qwen2.5-3B or 7B as the base for any production use.

Recursive reasoning failure. Inherited from Aristaeus Stage 1: deep recursive call-stack tracing (e.g. Fibonacci f(7)) causes the model to lose the thread and produce no answer. Documented in the Aristaeus model card.


Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in bf16 to match the adapter's training precision
model     = AutoModelForCausalLM.from_pretrained("EphAsad/AristaeusAgent",
                                                 torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("EphAsad/AristaeusAgent")

SYSTEM = """You are a helpful assistant with access to the following tools.

<tools>
[
  {
    "name": "bash",
    "description": "Execute a bash command and return stdout/stderr.",
    "parameters": {
      "type": "object",
      "properties": {
        "command": {"type": "string"}
      },
      "required": ["command"]
    }
  }
]
</tools>

Think carefully before calling any tool. Use <think>...</think> to reason first.
Only call a tool if the task genuinely requires external data, computation, or
system access. Answer directly from knowledge for factual questions."""

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "How many Python files are in /workspace?"},
]

# Render the chat template, generate, then strip the prompt tokens from the output
text   = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, temperature=0.4,
                        top_p=0.9, do_sample=True)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Expected output format

<think>
To count Python files in /workspace I should use bash with find or ls.
The find command with -name "*.py" will be more reliable.
</think>
<tool_call>
{"name": "bash", "arguments": {"command": "find /workspace -name '*.py' | wc -l"}}
</tool_call>
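
A deployment consuming this output can recover the call with a small parser along these lines (illustrative, not a shipped utility):

import json
import re

def parse_response(text: str):
    """Split a model response into (reasoning, tool_call); either may be None."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    call = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    reasoning = think.group(1).strip() if think else None
    tool_call = json.loads(call.group(1)) if call else None
    return reasoning, tool_call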

Two-Stage Pipeline

This model is the output of a reproducible two-stage training pipeline:

Qwen2.5-1.5B-Instruct
        │
        ▼  Stage 1 — Full fine-tune (81 min, A100)
        │  OpenThoughts3-1.2M (30k sampled) + Bespoke-Stratos-17k
        │ 
        ▼
   EphAsad/Aristaeus
        │
        ▼  Stage 2 — QLoRA r=16 (1 epoch, A100)
        │  Hermes agent traces + zake7749 + glaive no-call subset
        ▼
   EphAsad/AristaeusAgent  ← this model

All training scripts, validation scripts, and benchmark code are available on request.


Design Notes

Proof of concept, not production. This model demonstrates that a two-stage reasoning → agentic pipeline is viable at 1.5B parameters with open datasets and a single A100 session. It is not production-ready. For practical deployment the recommended path is to apply the same pipeline to Qwen2.5-3B-Instruct or Qwen2.5-7B-Instruct, which would address the hallucination and format consistency limitations without changing any training code.

Dataset licensing. All training datasets are Apache 2.0. No API-generated outputs from closed models (Claude, GPT-4, Gemini) were used at any stage.

Deterministic fallback philosophy. Consistent with prior work in this portfolio (BactAID, FireSOP, Eidos), the model is designed for explicit reasoning before action: <think> blocks are not decorative; they are the intended inference path. Deployments should treat absent or trivially short think blocks as a quality signal, as sketched below.
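
A minimal version of that check (the 50-character threshold mirrors the training-data filter and is illustrative):

import re

def think_quality_ok(response: str, min_chars: int = 50) -> bool:
    """Flag responses whose <think> block is missing or trivially short."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return bool(m) and len(m.group(1).strip()) >= min_chars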


Author

Built by Zain Asad (Eph) — Senior Microbiology Analyst and Applied AI Engineer.


Licence

Apache 2.0 — consistent with the base model and all training datasets used.
