Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Abstract
Multimodal large language models show consistent limitations in exact multi-digit multiplication across representations and modalities; their accuracy is closely tied to a novel arithmetic-load metric that predicts performance better than traditional step-counting measures.
Multimodal LLMs can accurately perceive numerical content across modalities, yet they fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances drawn from a reproducible generator. We also define arithmetic load, C, the product of the total and non-zero digit counts, as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often approaching zero once C > 100. C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearly matching more complex measures of arithmetic load that count intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes, including columnar multiplication, distributive decomposition, and rounding/compensation. Decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating that the base model maintains a well-tuned internal router.
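The arithmetic-load metric C can be illustrated with a short sketch. The abstract defines C as the product of the total and non-zero digit counts; the exact pooling over the two operands is not spelled out there, so the function below assumes one plausible reading, (digits of the multiplicand) × (non-zero digits of the multiplier), which roughly counts the single-digit multiplications in schoolbook long multiplication when zero partial products are skipped. The function name and operand convention are illustrative, not the paper's reference implementation.

```python
def arithmetic_load(a: int, b: int) -> int:
    """Proxy for multiplication operation count, per the abstract's
    definition of arithmetic load C: total digit count times non-zero
    digit count. Assumption (not specified in the abstract): total
    digits are taken from operand a, non-zero digits from operand b."""
    total_digits = len(str(a))
    nonzero_digits = sum(1 for d in str(b) if d != "0")
    return total_digits * nonzero_digits

# Example: 123 x 405 -> 3 total digits x 2 non-zero digits -> C = 6,
# well below the C > 100 regime where accuracy reportedly nears zero.
print(arithmetic_load(123, 405))  # 6
```

Under this reading, sparse operands (many zero digits) yield low C even at large digit lengths, which is why the benchmark varies digit sparsity factorially alongside digit length.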
Community
You can try out the multimodal benchmark yourself here! 😅 https://www.youtube.com/watch?v=WyxnP3UVc58&t=5560s
Or train your models on it using the Hugging Face data repo: https://huggingface.co/datasets/cjerzak/MultimodalMathBenchmarks
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs (2026)
- The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models (2026)
- Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs (2026)
- UNICBench: UNIfied Counting Benchmark for MLLM (2026)
- Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap (2026)
- Post-Routing Arithmetic in Llama-3: Last-Token Result Writing and Rotation-Structured Digit Directions (2026)
- Do Vision Language Models Need to Process Image Tokens? (2026)
Get this paper in your agent:
hf papers read 2604.18203