Collections
Discover the best community collections!
Collections including paper arxiv:2511.04962

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 29
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 19
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- GuardReasoner: Towards Reasoning-based LLM Safeguards
  Paper • 2501.18492 • Published • 88
- Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
  Paper • 2412.19512 • Published • 9
- Course-Correction: Safety Alignment Using Synthetic Preferences
  Paper • 2407.16637 • Published • 26
- Refusal in Language Models Is Mediated by a Single Direction
  Paper • 2406.11717 • Published • 4

- Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
  Paper • 2309.15915 • Published • 2
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
  Paper • 2310.00653 • Published • 3
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 11
- An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
  Paper • 2309.09958 • Published • 19

- Can Large Language Models Understand Context?
  Paper • 2402.00858 • Published • 23
- OLMo: Accelerating the Science of Language Models
  Paper • 2402.00838 • Published • 85
- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 151
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
  Paper • 2401.17072 • Published • 25

- When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
  Paper • 2511.02779 • Published • 58
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
  Paper • 2511.04962 • Published • 54
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
  Paper • 2511.04570 • Published • 211

- Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
  Paper • 2507.17512 • Published • 36
- Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
  Paper • 2511.04962 • Published • 54
- 10 Open Challenges Steering the Future of Vision-Language-Action Models
  Paper • 2511.05936 • Published • 5