Ouroboros V24: Cognitive Architecture for Reflexive Financial Reasoning

Ouroboros V24 is the latest iteration of a cognitive architecture designed for autonomous financial decision-making. It is built on a 35B-parameter Mixture-of-Experts (MoE) base model with ~3B active parameters and trained through 24 iterative rounds of multi-reward GRPO with a 54-dimensional cognitive reward topology.

⚠️ Weights are not publicly released. This model card documents the architecture and training methodology. For research collaboration inquiries, contact the author.

Architecture

Base Model

  • Type: Mixture-of-Experts (MoE)
  • Total Parameters: ~35B
  • Active Parameters: ~3B per token
  • Context Window: 32K tokens

Training Methodology

  • Algorithm: R-GRPO (Reflexive Group Relative Policy Optimization)
  • Training Rounds: 24 iterative cycles (V1 → V24)
  • Adapter Strategy: 20-layer sequential LoRA merge chain
  • Reward Architecture: SCRGNDWMT (9-tier, 54 sub-dimensions)
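The 20-layer sequential LoRA merge chain amounts to repeatedly folding low-rank updates into the base weights. The sketch below illustrates that mechanic with NumPy; the adapter shapes, the α/rank scaling, and the number of adapters are assumptions for illustration, not the model's actual configuration:

```python
import numpy as np

def merge_lora_chain(W, adapters, alpha=16, rank=8):
    """Fold a sequence of LoRA adapters into a base weight matrix.

    Each adapter (A, B) contributes a low-rank update scaled by alpha/rank,
    applied in order -- mirroring a sequential merge chain in which each
    round's adapter is merged before the next round trains on top of it.
    """
    scale = alpha / rank
    W_merged = W.copy()
    for A, B in adapters:              # A: (rank, d_in), B: (d_out, rank)
        W_merged += scale * (B @ A)
    return W_merged

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W0 = rng.standard_normal((d_out, d_in))
# 20 small hypothetical adapters, one per merge step in the chain
chain = [(rng.standard_normal((r, d_in)) * 0.01,
          rng.standard_normal((d_out, r)) * 0.01) for _ in range(20)]
W24 = merge_lora_chain(W0, chain)
```

Because each merge is a rank-limited additive update, the final weights equal the base plus the sum of all scaled adapter products, regardless of merge order.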

9-Tier Reward Topology (SCRGNDWMT)

| Tier | Name | Sub-dimensions | Description |
|------|------|----------------|-------------|
| S | Structure | 6 | XML formatting, JSON decision blocks |
| C | Content | 7 | Domain expertise, data fidelity, causal depth |
| R | Reasoning | 5 | Temporal-causal chains, counterfactual depth |
| G | Game Theory | 5 | K-level thinking, deception detection, coalition |
| N | Narrative | 4 | Scenario construction, debate, arc coherence |
| D | Data Fidelity | 3 | Numerical accuracy, source attribution |
| W | World Model | 6 | Regime detection, cross-market transmission, macro |
| M | Metacognition | 7 | Self-awareness, Bayesian confidence, falsification |
| T | Temporal-Causal | 5 | Causal chains, temporal depth, granularity |
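One way to read the topology is as a weighted aggregation of per-dimension scores into a single scalar reward for GRPO. The tier weights, the mean-of-sub-dimensions rule, and the example scores below are illustrative assumptions, not the trained reward configuration:

```python
# Sketch: aggregate sub-dimension scores (each in 0..1) into one reward.
# Tier keys follow the SCRGNDWMT table; the weights are hypothetical.
TIER_WEIGHTS = {"S": 1.0, "C": 1.5, "R": 1.5, "G": 1.0, "N": 0.5,
                "D": 1.0, "W": 1.0, "M": 1.5, "T": 1.0}

def aggregate_reward(scores):
    """scores: {tier: [sub-dimension scores]} -> weighted mean reward."""
    total, weight_sum = 0.0, 0.0
    for tier, subs in scores.items():
        w = TIER_WEIGHTS[tier]
        total += w * (sum(subs) / len(subs))   # tier score = mean of sub-dims
        weight_sum += w
    return total / weight_sum

r = aggregate_reward({"S": [1.0, 1.0], "M": [0.5, 0.7]})
```

A weighted mean keeps the reward bounded in [0, 1] no matter how many tiers are scored, which avoids one tier dominating simply because it has more sub-dimensions.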

V24 Upgrades (from V22)

  • C7 (CausalChainDepthV2): Multi-step causal chains with time-lag annotations
  • M7 (BayesianConfidence): Calibrated confidence field in JSON decisions
  • W3 (CrossMarketPath): Structural contagion paths (Market A → Mechanism → Market B)
  • M5 (FalsificationV2): Quantitative, price-based invalidation conditions
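The V24 upgrades can be pictured together in a single decision payload: a calibrated confidence field (M7), a time-lag-annotated causal chain (C7), and a quantitative, price-based invalidation condition (M5). Every field name and value below is a guess at the schema for illustration, not the model's actual output format:

```python
import json

# Hypothetical JSON decision block combining the V24 upgrades.
decision = {
    "action": "reduce_exposure",
    "instrument": "SPX",
    "confidence": 0.72,                      # M7: calibrated Bayesian confidence in [0, 1]
    "causal_chain": [                        # C7: multi-step chain with time-lag annotations
        {"cause": "rate_hike", "effect": "credit_tightening", "lag_days": 30},
        {"cause": "credit_tightening", "effect": "equity_drawdown", "lag_days": 60},
    ],
    "falsification": {                       # M5: quantitative, price-based invalidation
        "condition": "SPX closes above 5200 for 5 consecutive sessions",
        "threshold": 5200,
    },
}

payload = json.dumps(decision, indent=2)     # serialize as the S-tier expects
parsed = json.loads(payload)                 # round-trips cleanly
```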

Key Training Parameters

| Parameter | Value |
|-----------|-------|
| Learning rate | 5 × 10⁻⁷ |
| Group size | 12 |
| Max completion tokens | 1000 |
| Temperature | 1.15 |
| β-annealing | Stable (β = 0.05) ↔ Break-up (β = 0.03) |
| LoRA rank | ≥ 10 |
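The β-annealing row alternates the KL coefficient between a stable phase (β = 0.05) and a break-up phase (β = 0.03). A minimal schedule implementing that alternation might look as follows; the phase length is an assumed value, not a documented parameter:

```python
def beta_schedule(step, phase_len=25):
    """Alternate the GRPO KL coefficient between stable and break-up phases.

    Even-numbered phases hold beta at 0.05 (stable, tighter KL constraint);
    odd-numbered phases drop it to 0.03 (break-up, looser constraint so the
    policy can explore). phase_len is a hypothetical phase length.
    """
    return 0.05 if (step // phase_len) % 2 == 0 else 0.03
```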

Key Results

Reflexive Intelligence Emergence

During V17 training, reflexive reasoning emerged through a discontinuous phase transition at Step 153: after 150+ steps of zero reflexivity scores, the capability appeared spontaneously and persisted. This is documented in Papers 1-3 of the research program.

V24 Training (ongoing)

  • 54-dimensional reward actively guiding cognitive development
  • Bayesian confidence calibration observed from Step 18
  • Cross-market causal reasoning emerging by Step 25
  • Zero gradient failures through 55+ steps

Research Program

This model is part of a six-paper research program:

| Paper | Title | DOI |
|-------|-------|-----|
| P1 | Reflexive Intelligence in LLMs | 10.5281/zenodo.19557261 |
| P2 | Observer Depth (ReflexBench) | 10.5281/zenodo.19627242 |
| P3 | When Rewards Collide (Multi-Reward GRPO) | 10.5281/zenodo.19665969 |
| P4 | Ouroboros V22 Architecture | 10.5281/zenodo.19666786 |
| P5 | The Cognitive Lifecycle | 10.5281/zenodo.19666806 |
| P6 | Cognitive Reward Topology | 10.5281/zenodo.19666829 |

Related Resources

| Resource | Link |
|----------|------|
| ReflexBench Dataset | MMJBDS/reflexbench |
| ReflexBench Eval Results | MMJBDS/reflexbench-eval |
| Papers Repository | github.com/mmjbds/ouroboros-papers |
| Evaluation Code | github.com/mmjbds/reflexbench |

Citation

```bibtex
@article{zhang2026ouroborosv22,
  title={Ouroboros V22: Bayesian Scenario Simulation and Recurrent Depth Cognition},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666786}
}

@article{zhang2026topology,
  title={Cognitive Reward Topology: A Nine-Tier Architecture for Multi-Reward GRPO},
  author={Zhang, Mian},
  year={2026},
  doi={10.5281/zenodo.19666829}
}
```

Author

Mian Zhang
License

This model card is released under CC BY 4.0. Model weights are not publicly available.
