Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Abstract
Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
Community
Future-L1 nails video event prediction by letting MLLMs interleave text reasoning with latent visual "imagination" of future frames.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs (2026)
- STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models (2026)
- Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning (2026)
- LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning (2026)
- DeepLatent: Think with Images via Parallel Latent Visual Reasoning (2026)
- Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs (2026)
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2606.05769 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper