MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models
Abstract
A benchmark called MemoryRewardBench is introduced to systematically evaluate reward models' ability to assess long-term memory management in large language models across various context lengths and memory patterns.
Existing works increasingly adopt memory-centric mechanisms to process long contexts segment by segment, and effective memory management is one of the key capabilities that allows large language models to propagate information across the entire sequence. Leveraging reward models (RMs) to evaluate memory quality automatically and reliably is therefore critical. In this work, we introduce MemoryRewardBench, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. MemoryRewardBench covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns and context lengths ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.
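To make the segment-wise setting concrete, the following is a minimal Python sketch of how a long input might be processed with a running memory. It is an illustration under stated assumptions, not the paper's implementation: the function names (`split_into_segments`, `run_with_memory`, `keep_capitalized`) are hypothetical, and the toy update rule merely stands in for an LLM call that rewrites the memory. The list of intermediate memory states it returns is the kind of process a reward model could be asked to judge.

```python
from typing import Callable, List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Cut a long token sequence into consecutive fixed-size segments."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_with_memory(
    tokens: List[str],
    segment_len: int,
    update_memory: Callable[[str, List[str]], str],
    initial_memory: str = "",
) -> List[str]:
    """Read the input one segment at a time, carrying a memory forward.

    Returns every intermediate memory state, not just the final one.
    """
    memory = initial_memory
    trajectory = []
    for segment in split_into_segments(tokens, segment_len):
        # Hypothetical update step: in practice this would be a model call.
        memory = update_memory(memory, segment)
        trajectory.append(memory)
    return trajectory


def keep_capitalized(memory: str, segment: List[str]) -> str:
    """Toy update rule (assumption): append capitalized tokens to the memory."""
    kept = [tok for tok in segment if tok[:1].isupper()]
    return " ".join(filter(None, [memory, *kept]))


if __name__ == "__main__":
    text = "Alice met Bob in Paris last week".split()
    for step, state in enumerate(run_with_memory(text, 3, keep_capitalized), 1):
        print(step, state)
```

In this toy run the memory grows as each segment contributes new entities; a real memory manager would also have to decide what to compress or discard, which is where memory quality becomes a matter for evaluation.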
Community
Check out our code: https://github.com/LCM-Lab/MemRewardBench
and benchmark: https://huggingface.co/datasets/LCM-Lab/MemRewardBench
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Fine-Mem: Fine-Grained Feedback Alignment for Long-Horizon Memory Management (2026)
- EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory (2026)
- Mem-Gallery: Benchmarking Multimodal Long-Term Conversational Memory for MLLM Agents (2026)
- Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents (2026)
- RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction (2026)
- MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards (2026)
- CloneMem: Benchmarking Long-Term Memory for AI Clones (2026)