Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories Paper • 2606.02060 • Published 10 days ago • 53
TVIR: Building Deep Research Agents Towards Text--Visual Interleaved Report Generation Paper • 2606.02320 • Published 10 days ago • 14
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues Paper • 2510.17722 • Published Oct 20, 2025 • 20
IF-VidCap: Can Video Caption Models Follow Instructions? Paper • 2510.18726 • Published Oct 21, 2025 • 27
DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation Paper • 2604.14683 • Published Apr 16 • 36