FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Paper • 2507.12759 • Published Jul 17
From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models Paper • 2511.10899 • Published Nov 14