| --- |
| license: apple-amlr |
| library_name: ml-fastvlm |
| --- |
| # FastVLM: Efficient Vision Encoding for Vision Language Models |
|
|
| FastVLM was introduced in |
| **[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303) (CVPR 2025).** |
|
|
| <p align="center"> |
| <img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/> |
| </p> |
|
|
| ### Highlights |
| * We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images. |
| * Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder. |
| * Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder and achieving a 7.9x faster TTFT. |
|
|
|
|
| ### Evaluations |
| | Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B | |
| |:--------------|:------------:|:------------:|:----------:| |
| | Ai2D | 68.0 | 77.4 | 83.6 | |
| | ScienceQA | 85.2 | 94.4 | 96.7 | |
| | MMMU | 33.9 | 37.8 | 45.4 | |
| | VQAv2 | 76.3 | 79.1 | 80.8 | |
| | ChartQA | 76.0 | 80.1 | 85.0 | |
| | TextVQA | 64.5 | 70.4 | 74.9 | |
| | InfoVQA | 46.4 | 59.7 | 75.8 | |
| | DocVQA | 82.5 | 88.3 | 93.2 | |
| | OCRBench | 63.9 | 70.2 | 73.1 | |
| | RealWorldQA | 56.1 | 61.2 | 67.2 | |
| | SeedBench-Img | 71.0 | 74.2 | 75.4 | |
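
To see where scaling the LLM helps most, the scores above can be compared per benchmark. The following is a minimal sketch, not part of the FastVLM codebase: the names `SCORES`, `gains`, and `largest_gain` are illustrative, and the numbers are copied directly from the table. Because the benchmarks use different metrics, the differences are only a rough guide.

```python
# Scores copied from the evaluation table above.
# Tuple order: (FastVLM-0.5B, FastVLM-1.5B, FastVLM-7B).
SCORES = {
    "Ai2D": (68.0, 77.4, 83.6),
    "ScienceQA": (85.2, 94.4, 96.7),
    "MMMU": (33.9, 37.8, 45.4),
    "VQAv2": (76.3, 79.1, 80.8),
    "ChartQA": (76.0, 80.1, 85.0),
    "TextVQA": (64.5, 70.4, 74.9),
    "InfoVQA": (46.4, 59.7, 75.8),
    "DocVQA": (82.5, 88.3, 93.2),
    "OCRBench": (63.9, 70.2, 73.1),
    "RealWorldQA": (56.1, 61.2, 67.2),
    "SeedBench-Img": (71.0, 74.2, 75.4),
}


def gains(scores):
    """Return {benchmark: 7B score minus 0.5B score}, rounded to one decimal."""
    return {name: round(vals[-1] - vals[0], 1) for name, vals in scores.items()}


def largest_gain(scores):
    """Return the (benchmark, gain) pair with the largest 0.5B-to-7B gain."""
    g = gains(scores)
    name = max(g, key=g.get)
    return name, g[name]


if __name__ == "__main__":
    # Print benchmarks from largest to smallest gain.
    for name, delta in sorted(gains(SCORES).items(), key=lambda kv: -kv[1]):
        print(f"{name:<14} +{delta:.1f}")
```

With these numbers, the largest gain from the 0.5B to the 7B variant is on InfoVQA (+29.4 points) and the smallest is on SeedBench-Img (+4.4 points).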
|
|
|
|
| ### Usage Example |
| This model has been exported to run with MLX. To use it in an iOS or macOS app, follow the instructions in the official ml-fastvlm repository. |
|
|
|
|
| ## Citation |
| If you find this model useful, please cite the following paper: |
| ``` |
| @InProceedings{fastvlm2025, |
| author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari}, |
| title = {FastVLM: Efficient Vision Encoding for Vision Language Models}, |
| booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, |
| month = {June}, |
| year = {2025}, |
| } |
| ``` |
|
|