Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Paper • 2602.02185 • Published 17 days ago • 125
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Paper • 2601.22060 • Published 21 days ago • 153
iFSQ: Improving FSQ for Image Generation with 1 Line of Code Paper • 2601.17124 • Published 27 days ago • 32
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision Paper • 2601.03193 • Published Jan 6 • 47
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision Paper • 2601.03193 • Published Jan 6 • 47
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action Paper • 2511.22134 • Published Nov 27, 2025 • 22
Look-Back: Implicit Visual Re-focusing in MLLM Reasoning Paper • 2507.03019 • Published Jul 2, 2025 • 1
Can Understanding and Generation Truly Benefit Together -- or Just Coexist? Paper • 2509.09666 • Published Sep 11, 2025 • 34
FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation Paper • 2509.25187 • Published Sep 29, 2025 • 2
GIR-Bench: Versatile Benchmark for Generating Images with Reasoning Paper • 2510.11026 • Published Oct 13, 2025 • 18
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback Paper • 2510.16888 • Published Oct 19, 2025 • 22
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models Paper • 2510.01304 • Published Oct 1, 2025 • 11
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion Paper • 2502.08590 • Published Feb 12, 2025 • 42
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings Paper • 2506.04997 • Published Jun 5, 2025
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing Paper • 2506.19848 • Published Jun 24, 2025 • 26
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models Paper • 2508.00819 • Published Aug 1, 2025 • 63
Next Patch Prediction for Autoregressive Visual Generation Paper • 2412.15321 • Published Dec 19, 2024 • 1
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses Paper • 2412.00397 • Published Nov 30, 2024
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation Paper • 2503.07265 • Published Mar 10, 2025 • 4