Adding desc
AI & ML interests
None defined yet.
Recent Activity
View all activity
Papers
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Collection for Code World Model, an agentic coding model from FAIR.
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann
-
facebook/vjepa2-vitl-fpc64-256
Video Classification • 0.3B • Updated • 44.3k • 175 -
facebook/vjepa2-vith-fpc64-256
Video Classification • 0.7B • Updated • 1.18k • 15 -
facebook/vjepa2-vitg-fpc64-256
Video Classification • 1B • Updated • 12.3k • 22 -
facebook/vjepa2-vitg-fpc64-384
Video Classification • 1B • Updated • 8.77k • 36
A collection of small (sub-1B) multilingual dense retrievers that generalize well across a number of tasks and languages.
Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024) https://arxiv.org/abs/2402.14905
-
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Paper • 2402.14905 • Published • 134 -
facebook/MobileLLM-125M
Text Generation • Updated • 909 • 128 -
facebook/MobileLLM-350M
Text Generation • Updated • 214 • 36 -
facebook/MobileLLM-600M
Text Generation • Updated • 158 • 29
Models continually pretrained using LayerSkip - https://arxiv.org/abs/2404.16710
-
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80 -
facebook/layerskip-llama2-7B
Text Generation • 7B • Updated • 91 • 15 -
facebook/layerskip-llama2-13B
Text Generation • 13B • Updated • 44 • 5 -
facebook/layerskip-llama2-70B
Text Generation • 69B • Updated • 22 • 5
A significant step towards removing language barriers through expressive, fast and high-quality AI translation.
-
Seamless: Multilingual Expressive and Streaming Speech Translation
Paper • 2312.05187 • Published • 14 -
facebook/seamless-m4t-v2-large
Automatic Speech Recognition • 2B • Updated • 214k • 945 -
Seamless M4T v2
📞517Translate speech and text between languages
-
facebook/seamless-expressive
Text-to-Speech • Updated • 187
A collection for the first release of Wav2Vec 2.0, a speech encoder that learns powerful representations from unlabelled audio data.
-
facebook/wav2vec2-large-960h-lv60-self
Automatic Speech Recognition • Updated • 99.2k • 155 -
facebook/wav2vec2-large-960h
Automatic Speech Recognition • Updated • 12.1k • 32 -
facebook/wav2vec2-base-960h
Automatic Speech Recognition • 94.4M • Updated • 1.21M • 385 -
facebook/wav2vec2-base-100h
Automatic Speech Recognition • Updated • 695 • 7
A collection of multilingual Wav2Vec 2.0 checkpoints pre-trained on 53 languages and fine-tuned for CTC speech recognition.
-
facebook/wav2vec2-large-xlsr-53
Updated • 289k • 151 -
facebook/wav2vec2-xlsr-53-espeak-cv-ft
Automatic Speech Recognition • Updated • 212k • 43 -
facebook/wav2vec2-large-xlsr-53-dutch
Automatic Speech Recognition • Updated • 114 • 3 -
facebook/wav2vec2-large-xlsr-53-french
Automatic Speech Recognition • Updated • 882 • 13
A collection of "robust" Wav2Vec 2.0 checkpoints pre-trained on datasets from multiple domains.
-
facebook/wav2vec2-large-robust
Updated • 2.45k • 38 -
facebook/wav2vec2-large-robust-ft-libri-960h
Automatic Speech Recognition • 0.3B • Updated • 107k • 15 -
facebook/wav2vec2-large-robust-ft-swbd-300h
Automatic Speech Recognition • Updated • 2.36k • 20 -
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Paper • 2104.01027 • Published • 1
A collection of checkpoints from the second VoxPopuli release.
-
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-bg-voxpopuli-v2
Automatic Speech Recognition • Updated • 7 • 2 -
facebook/wav2vec2-base-cs-voxpopuli-v2
Automatic Speech Recognition • Updated • 6 • 1 -
facebook/wav2vec2-base-da-voxpopuli-v2
Automatic Speech Recognition • Updated • 5
Text-to-speech models from fairseq s^2
A collection of stereo music generation models as part of the v2 MusicGen release.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
OPT (Open Pretrained Transformer) is a series of open-sourced large causal language models which perform similar in performance to GPT3.
-
facebook/metaclip-2-worldwide-huge-quickgelu
Zero-Shot Image Classification • 2B • Updated • 5.55k • 16 -
facebook/metaclip-2-worldwide-huge-378
Zero-Shot Image Classification • 2B • Updated • 149 • 6 -
facebook/metaclip-2-worldwide-giant
Zero-Shot Image Classification • 4B • Updated • 581 • 7 -
facebook/metaclip-2-worldwide-giant-378
Zero-Shot Image Classification • 4B • Updated • 507 • 11
MobileLLM-R1, a series of sub-billion parameter reasoning models
DINOv3: foundation models producing excellent dense features, outperforming SotA w/o fine-tuning - https://arxiv.org/abs/2508.10104
-
facebook/dinov3-vit7b16-pretrain-lvd1689m
Image Feature Extraction • 7B • Updated • 13.5k • 202 -
facebook/dinov3-vits16-pretrain-lvd1689m
Image Feature Extraction • 21.6M • Updated • 180k • 61 -
facebook/dinov3-convnext-small-pretrain-lvd1689m
Image Feature Extraction • 49.5M • Updated • 16.3k • 22 -
facebook/dinov3-vitb16-pretrain-lvd1689m
Image Feature Extraction • 85.7M • Updated • 203k • 92
Scaling CLIP data with transparent training distribution from an end-to-end pipeline.
-
facebook/metaclip-h14-fullcc2.5b
Zero-Shot Image Classification • 1.0B • Updated • 7.59k • 49 -
facebook/metaclip-l14-fullcc2.5b
Zero-Shot Image Classification • Updated • 766 • 7 -
facebook/metaclip-b16-fullcc2.5b
Zero-Shot Image Classification • Updated • 3.69k • 11 -
facebook/metaclip-b32-fullcc2.5b
Zero-Shot Image Classification • Updated • 104 • 9
-
facebook/webssl-dino300m-full2b-224
Image Feature Extraction • 0.3B • Updated • 689 • 10 -
facebook/webssl-dino1b-full2b-224
Image Feature Extraction • 1B • Updated • 113 • 3 -
facebook/webssl-dino2b-full2b-224
Image Feature Extraction • 2B • Updated • 7 -
facebook/webssl-dino3b-full2b-224
Image Feature Extraction • 3B • Updated • 5
A first-of-its-kind behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.
Models and datasets for Sparsh: Self-supervised touch representations for vision-based tactile sensing
MelodyFlow: High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching
Masked Audio Generation using a Single Non-Autoregressive Transformer
SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly.
First release checkpoints for XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0.
A collection of open-source artefacts (datasets + checkpoints) from the first VoxPopuli release.
-
facebook/voxpopuli
Updated • 6.98k • 138 -
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-100k-voxpopuli
Automatic Speech Recognition • Updated • 58 • 4 -
facebook/wav2vec2-base-10k-voxpopuli-ft-cs
Automatic Speech Recognition • Updated • 5
A collection of checkpoints from the HuBERT release, a speech encoder that learns powerful representations from unlabelled audio data.
-
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Paper • 2106.07447 • Published • 4 -
facebook/hubert-base-ls960
Feature Extraction • Updated • 247k • • 65 -
facebook/hubert-large-ll60k
Feature Extraction • Updated • 34.1k • • 32 -
facebook/hubert-large-ls960-ft
Automatic Speech Recognition • Updated • 226k • 75
DINOv2: foundation models producing robust visual features suitable for image-level and pixel-level visual tasks - https://arxiv.org/abs/2304.07193
-
facebook/dinov2-small
Image Feature Extraction • 22.1M • Updated • 1.67M • 51 -
facebook/dinov2-base
Image Feature Extraction • 86.6M • Updated • 1.12M • 163 -
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 2.33M • 99 -
facebook/dinov2-giant
Image Feature Extraction • 1B • Updated • 78.2k • 54
Meta LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning.
Foundation models for human tasks. Code: https://github.com/facebookresearch/sapiens
Adding desc
-
facebook/metaclip-2-worldwide-huge-quickgelu
Zero-Shot Image Classification • 2B • Updated • 5.55k • 16 -
facebook/metaclip-2-worldwide-huge-378
Zero-Shot Image Classification • 2B • Updated • 149 • 6 -
facebook/metaclip-2-worldwide-giant
Zero-Shot Image Classification • 4B • Updated • 581 • 7 -
facebook/metaclip-2-worldwide-giant-378
Zero-Shot Image Classification • 4B • Updated • 507 • 11
MobileLLM-R1, a series of sub-billion parameter reasoning models
Collection for Code World Model, an agentic coding model from FAIR.
DINOv3: foundation models producing excellent dense features, outperforming SotA w/o fine-tuning - https://arxiv.org/abs/2508.10104
-
facebook/dinov3-vit7b16-pretrain-lvd1689m
Image Feature Extraction • 7B • Updated • 13.5k • 202 -
facebook/dinov3-vits16-pretrain-lvd1689m
Image Feature Extraction • 21.6M • Updated • 180k • 61 -
facebook/dinov3-convnext-small-pretrain-lvd1689m
Image Feature Extraction • 49.5M • Updated • 16.3k • 22 -
facebook/dinov3-vitb16-pretrain-lvd1689m
Image Feature Extraction • 85.7M • Updated • 203k • 92
Scaling CLIP data with transparent training distribution from an end-to-end pipeline.
-
facebook/metaclip-h14-fullcc2.5b
Zero-Shot Image Classification • 1.0B • Updated • 7.59k • 49 -
facebook/metaclip-l14-fullcc2.5b
Zero-Shot Image Classification • Updated • 766 • 7 -
facebook/metaclip-b16-fullcc2.5b
Zero-Shot Image Classification • Updated • 3.69k • 11 -
facebook/metaclip-b32-fullcc2.5b
Zero-Shot Image Classification • Updated • 104 • 9
A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann
-
facebook/vjepa2-vitl-fpc64-256
Video Classification • 0.3B • Updated • 44.3k • 175 -
facebook/vjepa2-vith-fpc64-256
Video Classification • 0.7B • Updated • 1.18k • 15 -
facebook/vjepa2-vitg-fpc64-256
Video Classification • 1B • Updated • 12.3k • 22 -
facebook/vjepa2-vitg-fpc64-384
Video Classification • 1B • Updated • 8.77k • 36
-
facebook/webssl-dino300m-full2b-224
Image Feature Extraction • 0.3B • Updated • 689 • 10 -
facebook/webssl-dino1b-full2b-224
Image Feature Extraction • 1B • Updated • 113 • 3 -
facebook/webssl-dino2b-full2b-224
Image Feature Extraction • 2B • Updated • 7 -
facebook/webssl-dino3b-full2b-224
Image Feature Extraction • 3B • Updated • 5
A collection of small (sub-1B) multilingual dense retrievers that generalize well across a number of tasks and languages.
A first-of-its-kind behavioral foundation model to control a virtual physics-based humanoid agent for a wide range of whole-body tasks.
Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024) https://arxiv.org/abs/2402.14905
-
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Paper • 2402.14905 • Published • 134 -
facebook/MobileLLM-125M
Text Generation • Updated • 909 • 128 -
facebook/MobileLLM-350M
Text Generation • Updated • 214 • 36 -
facebook/MobileLLM-600M
Text Generation • Updated • 158 • 29
Models and datasets for Sparsh: Self-supervised touch representations for vision-based tactile sensing
Models continually pretrained using LayerSkip - https://arxiv.org/abs/2404.16710
-
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
Paper • 2404.16710 • Published • 80 -
facebook/layerskip-llama2-7B
Text Generation • 7B • Updated • 91 • 15 -
facebook/layerskip-llama2-13B
Text Generation • 13B • Updated • 44 • 5 -
facebook/layerskip-llama2-70B
Text Generation • 69B • Updated • 22 • 5
MelodyFlow: High Fidelity Text-Guided Music Generation and Editing via Single-Stage Flow Matching
A significant step towards removing language barriers through expressive, fast and high-quality AI translation.
-
Seamless: Multilingual Expressive and Streaming Speech Translation
Paper • 2312.05187 • Published • 14 -
facebook/seamless-m4t-v2-large
Automatic Speech Recognition • 2B • Updated • 214k • 945 -
Seamless M4T v2
📞517Translate speech and text between languages
-
facebook/seamless-expressive
Text-to-Speech • Updated • 187
Masked Audio Generation using a Single Non-Autoregressive Transformer
A collection for the first release of Wav2Vec 2.0, a speech encoder that learns powerful representations from unlabelled audio data.
-
facebook/wav2vec2-large-960h-lv60-self
Automatic Speech Recognition • Updated • 99.2k • 155 -
facebook/wav2vec2-large-960h
Automatic Speech Recognition • Updated • 12.1k • 32 -
facebook/wav2vec2-base-960h
Automatic Speech Recognition • 94.4M • Updated • 1.21M • 385 -
facebook/wav2vec2-base-100h
Automatic Speech Recognition • Updated • 695 • 7
SeamlessM4T is designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly.
A collection of multilingual Wav2Vec 2.0 checkpoints pre-trained on 53 languages and fine-tuned for CTC speech recognition.
-
facebook/wav2vec2-large-xlsr-53
Updated • 289k • 151 -
facebook/wav2vec2-xlsr-53-espeak-cv-ft
Automatic Speech Recognition • Updated • 212k • 43 -
facebook/wav2vec2-large-xlsr-53-dutch
Automatic Speech Recognition • Updated • 114 • 3 -
facebook/wav2vec2-large-xlsr-53-french
Automatic Speech Recognition • Updated • 882 • 13
First release checkpoints for XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0.
A collection of "robust" Wav2Vec 2.0 checkpoints pre-trained on datasets from multiple domains.
-
facebook/wav2vec2-large-robust
Updated • 2.45k • 38 -
facebook/wav2vec2-large-robust-ft-libri-960h
Automatic Speech Recognition • 0.3B • Updated • 107k • 15 -
facebook/wav2vec2-large-robust-ft-swbd-300h
Automatic Speech Recognition • Updated • 2.36k • 20 -
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training
Paper • 2104.01027 • Published • 1
A collection of open-source artefacts (datasets + checkpoints) from the first VoxPopuli release.
-
facebook/voxpopuli
Updated • 6.98k • 138 -
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-100k-voxpopuli
Automatic Speech Recognition • Updated • 58 • 4 -
facebook/wav2vec2-base-10k-voxpopuli-ft-cs
Automatic Speech Recognition • Updated • 5
A collection of checkpoints from the second VoxPopuli release.
-
VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation
Paper • 2101.00390 • Published • 1 -
facebook/wav2vec2-base-bg-voxpopuli-v2
Automatic Speech Recognition • Updated • 7 • 2 -
facebook/wav2vec2-base-cs-voxpopuli-v2
Automatic Speech Recognition • Updated • 6 • 1 -
facebook/wav2vec2-base-da-voxpopuli-v2
Automatic Speech Recognition • Updated • 5
A collection of checkpoints from the HuBERT release, a speech encoder that learns powerful representations from unlabelled audio data.
-
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
Paper • 2106.07447 • Published • 4 -
facebook/hubert-base-ls960
Feature Extraction • Updated • 247k • • 65 -
facebook/hubert-large-ll60k
Feature Extraction • Updated • 34.1k • • 32 -
facebook/hubert-large-ls960-ft
Automatic Speech Recognition • Updated • 226k • 75
Text-to-speech models from fairseq s^2
DINOv2: foundation models producing robust visual features suitable for image-level and pixel-level visual tasks - https://arxiv.org/abs/2304.07193
-
facebook/dinov2-small
Image Feature Extraction • 22.1M • Updated • 1.67M • 51 -
facebook/dinov2-base
Image Feature Extraction • 86.6M • Updated • 1.12M • 163 -
facebook/dinov2-large
Image Feature Extraction • 0.3B • Updated • 2.33M • 99 -
facebook/dinov2-giant
Image Feature Extraction • 1B • Updated • 78.2k • 54
A collection of stereo music generation models as part of the v2 MusicGen release.
Meta LLM Compiler is a state-of-the-art LLM that builds upon Code Llama with improved performance for code optimization and compiler reasoning.
Repository for Meta Chameleon, a mixed-modal early-fusion foundation model from FAIR.
Foundation models for human tasks. Code: https://github.com/facebookresearch/sapiens
OPT (Open Pretrained Transformer) is a series of open-sourced large causal language models which perform similar in performance to GPT3.