Sleeping Agents FEST-Style Few-Shot RL for Reasoning: Solve math problems with step-by-step reasoning
Sleeping Agents Implicit Memory Conflict Validator: Evaluate LLM responses for outdated memory conflicts
Sleeping Agents Sudanese CoT Reasoning Benchmark: Run Sudanese Arabic reasoning benchmark with step-by-step analysis
Sleeping Agents COPSD Sudanese Reasoning Demo: Compare Sudanese math reasoning with and without English context
Sleeping Agents PrefixGuard Demo - Agent Failure Detection: Detect potential agent failures from execution traces
Sleeping Agents LoPE Demo - Prompt Perturbation for Reasoning Exploration: Compare baseline and perturbed reasoning for tasks
Paused Agents Lost-in-Thought Benchmark: Run a benchmark to see how reasoning steps affect retrieval accuracy
Sleeping Agents Master Key Capability Demo: Show expected accuracy boost for a math problem via steering