arxiv:2603.06638

HEARTS: Benchmarking LLM Reasoning on Health Time Series

Published on Feb 25

Authors:

Abstract

HEARTS benchmark evaluates LLMs' hierarchical reasoning capabilities across diverse health time series domains and tasks, revealing significant gaps compared to specialized models.

AI-generated summary

The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform specialized models, and their performance is only weakly related to general reasoning scores. Moreover, LLMs often rely on simple heuristics and struggle with multi-step temporal reasoning. Finally, performance declines with increasing temporal complexity, with similar failure modes within model families, indicating that scaling alone is insufficient. By making these gaps measurable, HEARTS provides a standardized testbed and living benchmark for developing next-generation LLM agents capable of reasoning over diverse health signals.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.06638 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.06638 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.