Majestrino 1.00 Sparse Autoencoder (16x, k=5)

A Top-K Sparse Autoencoder trained on Majestrino 1.00 voice/audio embeddings. It decomposes 768-dimensional audio embeddings into 12,288 interpretable features covering emotions, speaking styles, languages, vocal qualities, and more.

Key Numbers

Metric	Value
Input dimension	768 (Majestrino 1.00 embedding)
Dictionary size	12,288 features (16x expansion)
Active features per input	5 (top-k)
Parameters	18.9M
Training data	7.6M embeddings from majestrino-data
Training epochs	30
Best validation MSE	0.000116
Annotated features	9,575 / 12,288 (77.9%)
Semantic groups	14
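
The 18.9M parameter figure can be reproduced from the dimensions in the table. The sketch below assumes the parameters are exactly the four tensors listed under Architecture (a bias-free encoder and decoder, plus pre_bias and latent_bias). The variable names are illustrative and not taken from sae.py.

```python
# Dimensions taken from the Key Numbers table.
d_model = 768
n_features = 12_288  # 16 x 768

# Assumed parameter tensors (see the Architecture section).
encoder_weights = d_model * n_features
decoder_weights = n_features * d_model
pre_bias = d_model
latent_bias = n_features

total_params = encoder_weights + decoder_weights + pre_bias + latent_bias
print(f"{total_params:,} parameters (~{total_params / 1e6:.1f}M)")
# 18,887,424 parameters (~18.9M)
```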

Feature Groups

Each of the 9,575 annotated features has been assigned to at least one of 14 semantic groups (183 features belong to two groups):

#	Group	Features	Description
1	Sound Effects	98	Non-speech sounds: impacts, clicks, mechanical noises, foley
2	Music & Singing	216	Singing, instruments, rap, humming, melodies
3	Recording / Technical	26	Microphone type, reverb, compression, audio quality
4	Environmental / Ambient	194	Background noise, crowd, traffic, weather, room tone
5	Vocal Bursts	998	Laughter, crying, gasping, sighing, coughing, screaming
6	Cognitive States	369	Hesitation, filler words, confusion, uncertainty
7	Speed / Tempo	80	Speech rate, pacing, cadence, rhythm
8	Vocal Register	154	Falsetto, vocal fry, pitch range, chest/head voice
9	Languages	1,533	Language identity (French, Arabic, Japanese, etc.)
10	Accents / Slang	228	Regional pronunciation, dialect, AAVE, code-switching
11	Emotions (EmoNet 40)	1,760	40 emotion categories: joy, anger, fear, sadness, etc.
12	Talking Styles	3,452	Narration, broadcast, whisper, theatrical, casual, didactic
13	Character Archetypes	303	Villain, mentor, child, gamer, military commander
14	Timbre & Speaker Qualities	347	Raspy, nasal, smooth, breathy, warm, deep, bright
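
As a quick consistency check, the per-group counts should add up to the number of annotated features plus the number of features that appear in two groups, since those are counted twice. The snippet below copies the counts from the table by hand.

```python
# Per-group feature counts copied from the table above (groups 1-14).
group_counts = [98, 216, 26, 194, 998, 369, 80, 154, 1533, 228, 1760, 3452, 303, 347]

annotated = 9_575
in_two_groups = 183

# A feature assigned to two groups is counted once in each group's row.
total_rows = sum(group_counts)
assert total_rows == annotated + in_two_groups
print(total_rows)  # 9758
```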

Quick Start

Install dependencies

pip install torch huggingface_hub transformers torchaudio safetensors

Load the SAE

# Requires sae.py from this repository to be importable (see Files)
from sae import SparseAutoencoder

# Download from HuggingFace and load
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae")
sae.eval()

Full pipeline: Audio → Majestrino embedding → SAE features

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from sae import SparseAutoencoder
import json

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── Step 1: Load Majestrino 1.00 base model ──

class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768
        self.projector = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.projector(out), p=2, dim=1)

majestrino = MajestrinoCLAP().to(DEVICE).eval()

# Load weights (note: key remapping audio_proj -> projector)
weights_path = hf_hub_download("laion/Majestrino-1.00", "model.safetensors")
state_dict = load_file(weights_path)
remapped = {k.replace("audio_proj.", "projector."): v for k, v in state_dict.items()}
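# strict=False skips keys that do not match this class; load_state_dict returns the
# missing and unexpected keys, which are worth inspecting if embeddings look wrong.
majestrino.load_state_dict(remapped, strict=False)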

# ── Step 2: Load SAE ──

sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae", device=DEVICE)
sae.eval()

# ── Step 3: Load annotations ──

annotations_path = hf_hub_download("laion/majestrino-1.00-16xk5-sae", "annotations.json")
with open(annotations_path) as f:
    annotations = json.load(f)  # dict: feature_id_str -> {title, description, ...}

# ── Step 4: Process audio ──

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

waveform, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # mono

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
mel = inputs.input_features.to(DEVICE)

with torch.no_grad():
    embedding = majestrino.encode_audio(mel)       # (1, 768)
    recons, info = sae(embedding)                   # top-k decomposition
    top_indices = info["inds"][0].cpu().tolist()     # 5 feature indices
    top_values = info["vals"][0].cpu().tolist()      # 5 activation values

print("Active features:")
for idx, val in zip(top_indices, top_values):
    ann = annotations.get(str(idx), {})
    title = ann.get("title", "Unknown")
    print(f"  Feature {idx}: {title} (activation={val:.4f})")

Example output

Active features:
  Feature 4821: Casual American Male Speech (activation=0.3142)
  Feature 7203: Conversational Narration (activation=0.2891)
  Feature 1156: Standard American English (activation=0.2453)
  Feature 9834: Clear Articulate Delivery (activation=0.1987)
  Feature 3291: Warm Baritone Timbre (activation=0.1654)

Files

├── sae.py                     # Standalone SAE class (copy to your project)
├── model/
│   ├── config.json            # Model hyperparameters
│   └── state_dict.pth         # PyTorch weights (73 MB)
├── annotations.json           # 9,575 feature annotations
├── group_assignments.json     # Feature → group mapping
└── reports/
    ├── index.html             # Main feature index (browseable)
    ├── index_groups.html      # Grouped feature view
    └── feature_reports.tar    # 10,684 individual feature pages with audio

Extracting feature reports

# Extract the interactive HTML reports (run from the downloaded repository root)
cd reports/
tar xf feature_reports.tar
# Open index.html in a browser to explore all features

Architecture

Input (768-d Majestrino embedding)
  │
  ├─ subtract pre_bias
  │
  ├─ encoder: Linear(768 → 12288, no bias)
  │
  ├─ add latent_bias
  │
  ├─ top-k (k=5): keep 5 largest activations
  │
  ├─ ReLU
  │
  ├─ decoder: Linear(12288 → 768, no bias)
  │
  └─ add pre_bias → reconstruction (768-d)
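
The diagram above can be sketched in PyTorch as follows. This is an illustrative re-implementation under stated assumptions, not the shipped sae.py: the class name TopKSAESketch, the zero-initialized biases, and the random weights are all hypothetical. The returned "inds" and "vals" keys mirror the ones used in the Quick Start pipeline.

```python
import torch
import torch.nn as nn


class TopKSAESketch(nn.Module):
    """Illustrative version of the architecture diagram, not the shipped sae.py."""

    def __init__(self, d_model: int = 768, n_features: int = 12288, k: int = 5):
        super().__init__()
        self.k = k
        self.pre_bias = nn.Parameter(torch.zeros(d_model))
        self.latent_bias = nn.Parameter(torch.zeros(n_features))
        self.encoder = nn.Linear(d_model, n_features, bias=False)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # Subtract pre_bias, project to the dictionary, add latent_bias.
        pre_acts = self.encoder(x - self.pre_bias) + self.latent_bias
        # Keep the k largest pre-activations per row, then apply ReLU.
        vals, inds = torch.topk(pre_acts, self.k, dim=-1)
        vals = torch.relu(vals)
        # Scatter the surviving activations into an otherwise zero latent vector.
        latents = torch.zeros_like(pre_acts).scatter(-1, inds, vals)
        # Decode and add pre_bias back to obtain the reconstruction.
        recons = self.decoder(latents) + self.pre_bias
        return recons, {"inds": inds, "vals": vals}


torch.manual_seed(0)
sketch = TopKSAESketch()
recons, info = sketch(torch.randn(2, 768))
print(recons.shape, info["inds"].shape)
```

Because the ReLU is applied after the top-k selection, at most 5 latents are non-zero per input; a selected latent with a negative pre-activation contributes nothing to the reconstruction.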

Training Details

Base embeddings: Majestrino 1.00 (embedding_0_11 column from majestrino-data)
Training samples: 7,608,199 embeddings
Validation samples: 7,615 embeddings
Optimizer: Adam (lr=1e-4)
Loss: MSE reconstruction + AuxK dead neuron recovery + frequency overactivation penalty (coef=3.0, decay=0.999)
Dead features: 2,713 / 12,288 (22.1%); these features never activate and are excluded from annotations
Alive & annotated: 9,575 features with Gemini-generated titles and descriptions
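
One way to count dead features is to tally how often each dictionary index appears among the top-k indices over a dataset. The sketch below uses random indices as a stand-in for real SAE outputs, so its result is unrelated to the figures above, and it is not necessarily the procedure used to produce them.

```python
import torch

n_features = 12288

# Stand-in for info["inds"] collected over a dataset: shape (num_samples, k).
torch.manual_seed(0)
all_inds = torch.randint(0, n_features, (1000, 5))

# Count how many times each feature index was selected.
fire_counts = torch.bincount(all_inds.flatten(), minlength=n_features)

# A feature is treated as dead here if it was never selected.
dead = int((fire_counts == 0).sum())
print(f"dead features: {dead} / {n_features} ({dead / n_features:.1%})")
```

With real outputs, indices whose activation is zero after the ReLU could be masked out before counting.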

Annotations

Each annotated feature in annotations.json has:

{
  "3400": {
    "bin": 18,
    "bin_name": "Angry & Hostile State",
    "title": "Intense Anger and Frustration",
    "description": "The primary commonality across all positive samples is ...",
    "consistency": "high",
    "reasoning": "..."
  }
}
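
As a usage sketch, annotations can be filtered on the consistency field. The sample dictionary below is hypothetical: the second entry and the "low" value are assumptions made for illustration, since only "high" appears in the excerpt above.

```python
# Hypothetical sample mirroring the annotations.json schema shown above.
annotations = {
    "3400": {"bin": 18, "bin_name": "Angry & Hostile State",
             "title": "Intense Anger and Frustration", "consistency": "high"},
    "17": {"bin": 3, "bin_name": "Example Bin",
           "title": "Example Feature", "consistency": "low"},  # assumed value
}

# Keep only features whose annotation was rated "high" consistency.
high_consistency = {
    int(fid): ann["title"]
    for fid, ann in annotations.items()
    if ann.get("consistency") == "high"
}
print(high_consistency)  # {3400: 'Intense Anger and Frustration'}
```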

Group assignments in group_assignments.json:

{
  "3400": [11],
  "5234": [12, 14]
}

Values are lists of group IDs (1-14). Features can belong to multiple groups (183 do).
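
A minimal sketch of working with this mapping: inverting it to look up all features in a group and listing features that belong to more than one group. The sample below reuses the two entries from the excerpt above; with the real file, load group_assignments.json with json.load instead.

```python
from collections import defaultdict

# Sample taken from the group_assignments.json excerpt above.
group_assignments = {"3400": [11], "5234": [12, 14]}

# Invert the mapping: group ID -> list of feature IDs.
features_by_group = defaultdict(list)
for fid, groups in group_assignments.items():
    for gid in groups:
        features_by_group[gid].append(int(fid))

# Features assigned to more than one group.
multi_group = [int(fid) for fid, groups in group_assignments.items() if len(groups) > 1]

print(dict(features_by_group))  # {11: [3400], 12: [5234], 14: [5234]}
print(multi_group)              # [5234]
```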

Citation

@misc{majestrino-sae-2025,
  title={Sparse Autoencoder for Majestrino 1.00 Voice Embeddings},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/majestrino-1.00-16xk5-sae}
}

License

Apache 2.0
