- TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding (arXiv:2511.16595, published Nov 20, 2025)
- REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding (arXiv:2511.13026, published Nov 17, 2025)
- POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World (arXiv:2403.05856, published Mar 9, 2024)
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks (arXiv:2409.06709, published Aug 25, 2024)
- SPAFormer: Sequential 3D Part Assembly with Transformers (arXiv:2403.05874, published Mar 9, 2024)
- Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? (arXiv:2405.17719, published May 28, 2024)
- TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM (arXiv:2503.13377, published Mar 17, 2025)