Ouroboros V24: Cognitive Architecture for Reflexive Financial Reasoning

Ouroboros V24 is the latest iteration of a cognitive architecture designed for autonomous financial decision-making. It is built on a 35B-parameter Mixture-of-Experts (MoE) base model with ~3B active parameters and trained through 24 iterative rounds of multi-reward GRPO with a 54-dimensional cognitive reward topology.

⚠️ Weights are not publicly released. This model card documents the architecture and training methodology. For research collaboration inquiries, contact the author.

Architecture

Base Model

  • Type: Mixture-of-Experts (MoE)
  • Total Parameters: ~35B
  • Active Parameters: ~3B per token
  • Context Window: 32K tokens

Training Methodology

  • Algorithm: R-GRPO (Reflexive Group Relative Policy Optimization)
  • Training Rounds: 24 iterative cycles (V1 → V24)
  • Adapter Strategy: 20-layer sequential LoRA merge chain
  • Reward Architecture: SCRGNDWMT (9-tier, 54 sub-dimensions)
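The 20-layer sequential LoRA merge chain amounts to repeatedly folding low-rank updates into the base weights. The sketch below illustrates that mechanic with NumPy; the adapter shapes, the α/rank scaling, and the number of adapters are assumptions for illustration, not the model's actual configuration:

```python
import numpy as np

def merge_lora_chain(W, adapters, alpha=16, rank=8):
    """Fold a sequence of LoRA adapters into a base weight matrix.

    Each adapter (A, B) contributes a low-rank update scaled by alpha/rank,
    applied in order -- mirroring a sequential merge chain in which each
    round's adapter is merged before the next round trains on top of it.
    """
    scale = alpha / rank
    W_merged = W.copy()
    for A, B in adapters:              # A: (rank, d_in), B: (d_out, rank)
        W_merged += scale * (B @ A)
    return W_merged

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W0 = rng.standard_normal((d_out, d_in))
# 20 small hypothetical adapters, one per merge step in the chain
chain = [(rng.standard_normal((r, d_in)) * 0.01,
          rng.standard_normal((d_out, r)) * 0.01) for _ in range(20)]
W24 = merge_lora_chain(W0, chain)
```

Because each merge is a rank-limited additive update, the final weights equal the base plus the sum of all scaled adapter products, regardless of merge order.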

9-Tier Reward Topology (SCRGNDWMT)

| Tier | Name | Sub-dimensions | Description |
|------|------|----------------|-------------|
| S | Structure | 6 | XML formatting, JSON decision blocks |
| C | Content | 7 | Domain expertise, data fidelity, causal depth |
| R | Reasoning | 5 | Temporal-causal chains, counterfactual depth |
| G | Game Theory | 5 | K-level thinking, deception detection, coalition |
| N | Narrative | 4 | Scenario construction, debate, arc coherence |
| D | Data Fidelity | 3 | Numerical accuracy, source attribution |
| W | World Model | 6 | Regime detection, cross-market transmission, macro |
| M | Metacognition | 7 | Self-awareness, Bayesian confidence, falsification |
| T | Temporal-Causal | 5 | Causal chains, temporal depth, granularity |
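One way to read the topology is as a weighted aggregation of per-dimension scores into a single scalar reward for GRPO. The tier weights, the mean-of-sub-dimensions rule, and the example scores below are illustrative assumptions, not the trained reward configuration:

```python
# Sketch: aggregate sub-dimension scores (each in 0..1) into one reward.
# Tier keys follow the SCRGNDWMT table; the weights are hypothetical.
TIER_WEIGHTS = {"S": 1.0, "C": 1.5, "R": 1.5, "G": 1.0, "N": 0.5,
                "D": 1.0, "W": 1.0, "M": 1.5, "T": 1.0}

def aggregate_reward(scores):
    """scores: {tier: [sub-dimension scores]} -> weighted mean reward."""
    total, weight_sum = 0.0, 0.0
    for tier, subs in scores.items():
        w = TIER_WEIGHTS[tier]
        total += w * (sum(subs) / len(subs))   # tier score = mean of sub-dims
        weight_sum += w
    return total / weight_sum

r = aggregate_reward({"S": [1.0, 1.0], "M": [0.5, 0.7]})
```

A weighted mean keeps the reward bounded in [0, 1] no matter how many tiers are scored, which avoids one tier dominating simply because it has more sub-dimensions.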

V24 Upgrades (from V22)

  • C7 (CausalChainDepthV2): Multi-step causal chains with time-lag annotations
  • M7 (BayesianConfidence): Calibrated confidence field in JSON decisions
  • W3 (CrossMarketPath): Structural contagion paths (Market A → Mechanism → Market B)
  • M5 (FalsificationV2): Quantitative, price-based invalidation conditions
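The V24 upgrades can be pictured together in a single decision payload: a calibrated confidence field (M7), a time-lag-annotated causal chain (C7), and a quantitative, price-based invalidation condition (M5). Every field name and value below is a guess at the schema for illustration, not the model's actual output format:

```python
import json

# Hypothetical JSON decision block combining the V24 upgrades.
decision = {
    "action": "reduce_exposure",
    "instrument": "SPX",
    "confidence": 0.72,                      # M7: calibrated Bayesian confidence in [0, 1]
    "causal_chain": [                        # C7: multi-step chain with time-lag annotations
        {"cause": "rate_hike", "effect": "credit_tightening", "lag_days": 30},
        {"cause": "credit_tightening", "effect": "equity_drawdown", "lag_days": 60},
    ],
    "falsification": {                       # M5: quantitative, price-based invalidation
        "condition": "SPX closes above 5200 for 5 consecutive sessions",
        "threshold": 5200,
    },
}

payload = json.dumps(decision, indent=2)     # serialize as the S-tier expects
parsed = json.loads(payload)                 # round-trips cleanly
```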

Key Training Parameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 5 × 10⁻⁷ |
| Group size | 12 |
| Max completion tokens | 1000 |
| Temperature | 1.15 |
| β-annealing | Stable (β = 0.05) ↔ Break-up (β = 0.03) |
| LoRA rank | ≥ 10 |
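The β-annealing row alternates the KL coefficient between a stable phase (β = 0.05) and a break-up phase (β = 0.03). A minimal schedule implementing that alternation might look as follows; the phase length is an assumed value, not a documented parameter:

```python
def beta_schedule(step, phase_len=25):
    """Alternate the GRPO KL coefficient between stable and break-up phases.

    Even-numbered phases hold beta at 0.05 (stable, tighter KL constraint);
    odd-numbered phases drop it to 0.03 (break-up, looser constraint so the
    policy can explore). phase_len is a hypothetical phase length.
    """
    return 0.05 if (step // phase_len) % 2 == 0 else 0.03
```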

Key Results

Reflexive Intelligence Emergence

During V17 training, reflexive reasoning emerged through a discontinuous phase transition at Step 153: after 150+ steps of zero reflexivity scores, the capability appeared spontaneously and persisted. This is documented in Papers 1-3 of the research program.

V24 Training (ongoing)

  • 54-dimensional reward actively guiding cognitive development
  • Bayesian confidence calibration observed from Step 18
  • Cross-market causal reasoning emerging by Step 25
  • Zero gradient failures through 55+ steps

Research Program

This model is part of a six-paper research program:

| Paper | Title | DOI |
|-------|-------|-----|
| P1 | Reflexive Intelligence in LLMs | 10.5281/zenodo.19557261 |
| P2 | Observer Depth (ReflexBench) | 10.5281/zenodo.19627242 |
| P3 | When Rewards Collide (Multi-Reward GRPO) | 10.5281/zenodo.19665969 |
| P4 | Ouroboros V22 Architecture | 10.5281/zenodo.19666786 |
| P5 | The Cognitive Lifecycle | 10.5281/zenodo.19666806 |
| P6 | Cognitive Reward Topology | 10.5281/zenodo.19666829 |

Related Resources

| Resource | Link |
|----------|------|
| ReflexBench Dataset | MMJBDS/reflexbench |
| ReflexBench Eval Results | MMJBDS/reflexbench-eval |
| Papers Repository | github.com/mmjbds/ouroboros-papers |
| Evaluation Code | github.com/mmjbds/reflexbench |

Citation

```bibtex
@article{zhang2026ouroborosv22,
  title={Ouroboros V22: Bayesian Scenario Simulation and Recurrent Depth Cognition},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666786}
}

@article{zhang2026topology,
  title={Cognitive Reward Topology: A Nine-Tier Architecture for Multi-Reward GRPO},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666829}
}
```

Author

Mian Zhang
License

This model card is released under CC BY 4.0. Model weights are not publicly available.
