NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist Paper • 2602.16756 • Published Feb 18, 2026 • 4
Sequential Causal Normal Form Games: Theory, Computation, and Strategic Signaling Paper • 2511.06934 • Published Nov 10, 2025
Causal Regime Detection in Energy Markets With Augmented Time Series Structural Causal Models Paper • 2511.04361 • Published Nov 6, 2025
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM Paper • 2509.18058 • Published Sep 22, 2025 • 12
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers Paper • 2506.15674 • Published Jun 18, 2025 • 2
Baseline Defenses for Adversarial Attacks Against Aligned Language Models Paper • 2309.00614 • Published Sep 1, 2023 • 2
Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion Paper • 2403.16365 • Published Mar 25, 2024 • 1
Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise Paper • 2208.09392 • Published Aug 19, 2022 • 2
Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries Paper • 2210.10750 • Published Oct 19, 2022 • 1
Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs Paper • 2406.10209 • Published Jun 14, 2024 • 8
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Paper • 2502.19414 • Published Feb 26, 2025 • 20
GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching Paper • 2506.20480 • Published Jun 25, 2025 • 7