A Top-K Sparse Autoencoder trained on Majestrino 1.00 voice/audio embeddings. It decomposes 768-dimensional audio embeddings into 12,288 interpretable features covering emotions, speaking styles, languages, vocal qualities, and more.
| Property | Value |
|---|---|
| Input dimension | 768 (Majestrino 1.00 embedding) |
| Dictionary size | 12,288 features (16x expansion) |
| Active features per input | 5 (top-k) |
| Parameters | 18.9M |
| Training data | 7.6M embeddings from majestrino-data |
| Training epochs | 30 |
| Best validation MSE | 0.000116 |
| Annotated features | 9,575 / 12,288 (77.9%) |
| Semantic groups | 14 |
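
The headline numbers in the table can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes the only learned tensors are the encoder and decoder weight matrices plus the two bias vectors (pre_bias and latent_bias) named in the architecture diagram; the variable names are illustrative and not taken from sae.py.

```python
# Assumed tensor inventory: two bias-free linear layers plus two bias vectors.
d_in = 768          # Majestrino 1.00 embedding size
n_latents = 12_288  # dictionary size

encoder_weights = d_in * n_latents   # Linear(768 -> 12288, no bias)
decoder_weights = n_latents * d_in   # Linear(12288 -> 768, no bias)
pre_bias = d_in                      # subtracted before encoding, added after decoding
latent_bias = n_latents              # added to the encoder output

total_params = encoder_weights + decoder_weights + pre_bias + latent_bias
expansion = n_latents // d_in
annotated_pct = round(100 * 9_575 / n_latents, 1)

print(f"{total_params:,} parameters")    # 18,887,424, i.e. about 18.9M
print(f"{expansion}x expansion")         # 16x
print(f"{annotated_pct}% annotated")     # 77.9%
```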
Each of the 9,575 annotated features has been classified into one of 14 semantic groups (183 features belong to 2 groups):
| # | Group | Features | Description |
|---|---|---|---|
| 1 | Sound Effects | 98 | Non-speech sounds: impacts, clicks, mechanical noises, foley |
| 2 | Music & Singing | 216 | Singing, instruments, rap, humming, melodies |
| 3 | Recording / Technical | 26 | Microphone type, reverb, compression, audio quality |
| 4 | Environmental / Ambient | 194 | Background noise, crowd, traffic, weather, room tone |
| 5 | Vocal Bursts | 998 | Laughter, crying, gasping, sighing, coughing, screaming |
| 6 | Cognitive States | 369 | Hesitation, filler words, confusion, uncertainty |
| 7 | Speed / Tempo | 80 | Speech rate, pacing, cadence, rhythm |
| 8 | Vocal Register | 154 | Falsetto, vocal fry, pitch range, chest/head voice |
| 9 | Languages | 1,533 | Language identity (French, Arabic, Japanese, etc.) |
| 10 | Accents / Slang | 228 | Regional pronunciation, dialect, AAVE, code-switching |
| 11 | Emotions (EmoNet 40) | 1,760 | 40 emotion categories: joy, anger, fear, sadness, etc. |
| 12 | Talking Styles | 3,452 | Narration, broadcast, whisper, theatrical, casual, didactic |
| 13 | Character Archetypes | 303 | Villain, mentor, child, gamer, military commander |
| 14 | Timbre & Speaker Qualities | 347 | Raspy, nasal, smooth, breathy, warm, deep, bright |
```bash
pip install torch huggingface_hub transformers torchaudio safetensors
```

```python
from sae import SparseAutoencoder

# Download from HuggingFace and load
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae")
sae.eval()
```
```python
import json

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import WhisperModel, WhisperFeatureExtractor

from sae import SparseAutoencoder

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


# ── Step 1: Load Majestrino 1.00 base model ──
class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768
        self.projector = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)  # mean-pool over time frames
        return F.normalize(self.projector(out), p=2, dim=1)


majestrino = MajestrinoCLAP().to(DEVICE).eval()

# Load weights (note: key remapping audio_proj -> projector)
weights_path = hf_hub_download("laion/Majestrino-1.00", "model.safetensors")
state_dict = load_file(weights_path)
remapped = {k.replace("audio_proj.", "projector."): v for k, v in state_dict.items()}
majestrino.load_state_dict(remapped, strict=False)

# ── Step 2: Load SAE ──
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae", device=DEVICE)

# ── Step 3: Load annotations ──
annotations_path = hf_hub_download("laion/majestrino-1.00-16xk5-sae", "annotations.json")
with open(annotations_path) as f:
    annotations = json.load(f)  # dict: feature_id_str -> {title, description, ...}

# ── Step 4: Process audio ──
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
waveform, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # mono
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
mel = inputs.input_features.to(DEVICE)

with torch.no_grad():
    embedding = majestrino.encode_audio(mel)  # (1, 768)
    recons, info = sae(embedding)  # top-k decomposition

top_indices = info["inds"][0].cpu().tolist()  # 5 feature indices
top_values = info["vals"][0].cpu().tolist()  # 5 activation values

print("Active features:")
for idx, val in zip(top_indices, top_values):
    ann = annotations.get(str(idx), {})
    title = ann.get("title", "Unknown")
    print(f"  Feature {idx}: {title} (activation={val:.4f})")
```
```
Active features:
  Feature 4821: Casual American Male Speech (activation=0.3142)
  Feature 7203: Conversational Narration (activation=0.2891)
  Feature 1156: Standard American English (activation=0.2453)
  Feature 9834: Clear Articulate Delivery (activation=0.1987)
  Feature 3291: Warm Baritone Timbre (activation=0.1654)
```
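
Since only five features are active per input, it can be useful to check how much of an embedding the reconstruction actually captures. The sketch below is one way to do that; `reconstruction_report` is a hypothetical helper, not part of sae.py, and the dummy tensors stand in for the `embedding` and `recons` produced above. It assumes MSE is averaged over all 768 dimensions, which may or may not match how the reported validation MSE was computed.

```python
import torch
import torch.nn.functional as F


def reconstruction_report(embedding, recons):
    """Hypothetical helper: per-element MSE and fraction of squared norm not reconstructed."""
    mse = F.mse_loss(recons, embedding).item()
    sq_err = (recons - embedding).pow(2).sum(dim=-1)
    sq_norm = embedding.pow(2).sum(dim=-1)
    unexplained = (sq_err / sq_norm).mean().item()
    return {"mse": mse, "fraction_unexplained": unexplained}


# Dummy data standing in for real SAE inputs and outputs (assumption for illustration).
torch.manual_seed(0)
embedding = F.normalize(torch.randn(4, 768), dim=-1)  # unit-norm, like Majestrino embeddings
recons = embedding + 0.01 * torch.randn(4, 768)       # pretend reconstruction

report = reconstruction_report(embedding, recons)
print(report)
```

For unit-norm 768-dimensional embeddings the two numbers are tied together: the unexplained fraction is roughly the per-element MSE times 768. Under that reading, an MSE of 0.000116 would correspond to about 0.089 of the squared norm left unreconstructed.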
```
├── sae.py                    # Standalone SAE class (copy to your project)
├── model/
│   ├── config.json           # Model hyperparameters
│   └── state_dict.pth        # PyTorch weights (73 MB)
├── annotations.json          # 9,575 feature annotations
├── group_assignments.json    # Feature → group mapping
└── reports/
    ├── index.html            # Main feature index (browseable)
    ├── index_groups.html     # Grouped feature view
    └── feature_reports.tar   # 10,684 individual feature pages with audio
```
```bash
# Download and extract the interactive HTML reports
cd reports/
tar xf feature_reports.tar
# Open index.html in a browser to explore all features
```
```
Input (768-d Majestrino embedding)
  │
  ├─ subtract pre_bias
  │
  ├─ encoder: Linear(768 → 12288, no bias)
  │
  ├─ add latent_bias
  │
  ├─ top-k (k=5): keep 5 largest activations
  │
  ├─ ReLU
  │
  ├─ decoder: Linear(12288 → 768, no bias)
  │
  └─ add pre_bias → reconstruction (768-d)
```
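
The data flow above can be sketched in PyTorch as follows. This is an illustrative reimplementation written from the diagram, not the shipped sae.py: the class name `TopKSAESketch`, the zero-initialised biases, and the default `nn.Linear` initialisation are assumptions. The `inds` and `vals` keys mirror the ones used in the usage example.

```python
import torch
import torch.nn as nn


class TopKSAESketch(nn.Module):
    """Illustrative Top-K SAE following the diagram; weights here are untrained."""

    def __init__(self, d_in=768, n_latents=12288, k=5):
        super().__init__()
        self.k = k
        self.pre_bias = nn.Parameter(torch.zeros(d_in))
        self.latent_bias = nn.Parameter(torch.zeros(n_latents))
        self.encoder = nn.Linear(d_in, n_latents, bias=False)
        self.decoder = nn.Linear(n_latents, d_in, bias=False)

    def forward(self, x):
        # Subtract pre_bias, encode, add latent_bias.
        pre_acts = self.encoder(x - self.pre_bias) + self.latent_bias
        # Keep the k largest pre-activations, then apply ReLU to them.
        vals, inds = torch.topk(pre_acts, self.k, dim=-1)
        vals = torch.relu(vals)
        # Scatter the k values back into an otherwise all-zero latent vector.
        latents = torch.zeros_like(pre_acts).scatter(-1, inds, vals)
        # Decode and add pre_bias back to obtain the reconstruction.
        recons = self.decoder(latents) + self.pre_bias
        return recons, {"inds": inds, "vals": vals}


torch.manual_seed(0)
sketch = TopKSAESketch().eval()
x = torch.nn.functional.normalize(torch.randn(2, 768), dim=-1)  # dummy unit-norm inputs
with torch.no_grad():
    recons, info = sketch(x)
print(recons.shape, info["inds"].shape)
```

Because at most five latents are non-zero per input, each reconstruction is pre_bias plus a weighted sum of at most five decoder columns, which is what makes individual features inspectable.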
Training embeddings come from the embedding_0_11 column of majestrino-data.

Each annotated feature in annotations.json has:
```json
{
  "3400": {
    "bin": 18,
    "bin_name": "Angry & Hostile State",
    "title": "Intense Anger and Frustration",
    "description": "The primary commonality across all positive samples is ...",
    "consistency": "high",
    "reasoning": "..."
  }
}
```
Group assignments in group_assignments.json:
```json
{
  "3400": [11],
  "5234": [12, 14]
}
```
Values are lists of group IDs (1-14). Features can belong to multiple groups (183 do).
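
As a small illustration of working with this mapping, the sketch below lists the features in a given group and finds features assigned to more than one group. The helper names are hypothetical, and the inline dictionary only reuses the two example entries shown above; in practice the mapping would be read from group_assignments.json with `json.load`.

```python
import json
from collections import Counter

# Stand-in for the real file contents (only the two example entries).
group_assignments = json.loads('{"3400": [11], "5234": [12, 14]}')


def features_in_group(assignments, group_id):
    """Return sorted integer feature IDs assigned to the given group ID (1-14)."""
    return sorted(int(fid) for fid, groups in assignments.items() if group_id in groups)


def group_sizes(assignments):
    """Count features per group; multi-group features are counted once per group."""
    return Counter(g for groups in assignments.values() for g in groups)


multi_group = [fid for fid, groups in group_assignments.items() if len(groups) > 1]

print(features_in_group(group_assignments, 12))  # [5234]
print(dict(group_sizes(group_assignments)))      # {11: 1, 12: 1, 14: 1}
print(multi_group)                               # ['5234']
```

Counting multi-group features once per group is why the per-group totals in the table sum to more than 9,575.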
```bibtex
@misc{majestrino-sae-2025,
  title={Sparse Autoencoder for Majestrino 1.00 Voice Embeddings},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/majestrino-1.00-16xk5-sae}
}
```
Apache 2.0
Base model
laion/Majestrino-1.00