Benchmark and Evaluation
Paper 2501.14249
Benchmarking LLMs for Political Science: A United Nations Perspective • Paper 2502.14122
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval • Paper 2503.04644
ExpertGenQA: Open-ended QA generation in Specialized Domains • Paper 2503.02948
Toward Stable and Consistent Evaluation Results: A New Methodology for Base Model Evaluation • Paper 2503.00812
Deceptive Humor: A Synthetic Multilingual Benchmark Dataset for Bridging Fabricated Claims with Humorous Content • Paper 2503.16031
JudgeLRM: Large Reasoning Models as a Judge • Paper 2504.00050
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents • Paper 2504.13128
Cost-of-Pass: An Economic Framework for Evaluating Language Models • Paper 2504.13359
Benchmarking LLMs' Swarm Intelligence • Paper 2505.04364
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models • Paper 2505.07591
On Robustness and Reliability of Benchmark-Based Evaluation of LLMs • Paper 2509.04013
COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs • Paper 2601.01836