Majestrino 1.00 Sparse Autoencoder (16x, k=5)

A Top-K Sparse Autoencoder trained on Majestrino 1.00 voice/audio embeddings. It decomposes 768-dimensional audio embeddings into 12,288 interpretable features covering emotions, speaking styles, languages, vocal qualities, and more.

Key Numbers

Input dimension             768 (Majestrino 1.00 embedding)
Dictionary size             12,288 features (16x expansion)
Active features per input   5 (top-k)
Parameters                  18.9M
Training data               7.6M embeddings from majestrino-data
Training epochs             30
Best validation MSE         0.000116
Annotated features          9,575 / 12,288 (77.9%)
Semantic groups             14
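
The parameter count can be reproduced from the layer shapes listed in the Architecture section. The sketch below assumes the model holds nothing beyond the two bias-free weight matrices and the pre_bias and latent_bias vectors; that assumption is inferred from the diagram, not confirmed against sae.py.

```python
# Back-of-the-envelope parameter count for the SAE.
# Assumption: the only parameters are two weight matrices and two bias vectors.
INPUT_DIM = 768
DICT_SIZE = 16 * INPUT_DIM  # 12,288 features (16x expansion)

encoder_weights = INPUT_DIM * DICT_SIZE  # Linear(768 -> 12288), no bias
decoder_weights = DICT_SIZE * INPUT_DIM  # Linear(12288 -> 768), no bias
pre_bias = INPUT_DIM                     # subtracted from the input, added back at the end
latent_bias = DICT_SIZE                  # added after the encoder

total_params = encoder_weights + decoder_weights + pre_bias + latent_bias
print(f"{total_params:,} parameters (~{total_params / 1e6:.1f}M)")
```

Under these assumptions the total rounds to the 18.9M quoted above.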

Feature Groups

Each of the 9,575 annotated features has been assigned to at least one of 14 semantic groups. 183 features belong to two groups, which is why the per-group counts below sum to 9,758:

#   Group                       Features  Description
1   Sound Effects               98        Non-speech sounds: impacts, clicks, mechanical noises, foley
2   Music & Singing             216       Singing, instruments, rap, humming, melodies
3   Recording / Technical       26        Microphone type, reverb, compression, audio quality
4   Environmental / Ambient     194       Background noise, crowd, traffic, weather, room tone
5   Vocal Bursts                998       Laughter, crying, gasping, sighing, coughing, screaming
6   Cognitive States            369       Hesitation, filler words, confusion, uncertainty
7   Speed / Tempo               80        Speech rate, pacing, cadence, rhythm
8   Vocal Register              154       Falsetto, vocal fry, pitch range, chest/head voice
9   Languages                   1,533     Language identity (French, Arabic, Japanese, etc.)
10  Accents / Slang             228       Regional pronunciation, dialect, AAVE, code-switching
11  Emotions (EmoNet 40)        1,760     40 emotion categories: joy, anger, fear, sadness, etc.
12  Talking Styles              3,452     Narration, broadcast, whisper, theatrical, casual, didactic
13  Character Archetypes        303       Villain, mentor, child, gamer, military commander
14  Timbre & Speaker Qualities  347       Raspy, nasal, smooth, breathy, warm, deep, bright

Quick Start

Install dependencies

pip install torch huggingface_hub transformers torchaudio safetensors

Load the SAE

# sae.py is the standalone class shipped in this repository; copy it next to your script first
from sae import SparseAutoencoder

# Download from HuggingFace and load
sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae")
sae.eval()

Full pipeline: Audio → Majestrino embedding → SAE features

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import WhisperModel, WhisperFeatureExtractor
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from sae import SparseAutoencoder
import json

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# ── Step 1: Load Majestrino 1.00 base model ──

class MajestrinoCLAP(nn.Module):
    def __init__(self):
        super().__init__()
        self.whisper = WhisperModel.from_pretrained("openai/whisper-small")
        self.audio_encoder = self.whisper.encoder
        input_dim = self.whisper.config.d_model  # 768
        self.projector = nn.Sequential(
            nn.Linear(input_dim, 2048),
            nn.GELU(),
            nn.Linear(2048, 768),
        )

    def encode_audio(self, features):
        out = self.audio_encoder(features).last_hidden_state
        out = out.mean(dim=1)
        return F.normalize(self.projector(out), p=2, dim=1)

majestrino = MajestrinoCLAP().to(DEVICE).eval()

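# Load weights (note: key remapping audio_proj -> projector)
weights_path = hf_hub_download("laion/Majestrino-1.00", "model.safetensors")
state_dict = load_file(weights_path)
remapped = {k.replace("audio_proj.", "projector."): v for k, v in state_dict.items()}
# strict=False skips keys that do not match; if embeddings look wrong, inspect the
# missing_keys / unexpected_keys fields of the value returned by load_state_dict
majestrino.load_state_dict(remapped, strict=False)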

# ── Step 2: Load SAE ──

sae = SparseAutoencoder.from_pretrained("laion/majestrino-1.00-16xk5-sae", device=DEVICE)
sae.eval()

# ── Step 3: Load annotations ──

annotations_path = hf_hub_download("laion/majestrino-1.00-16xk5-sae", "annotations.json")
with open(annotations_path) as f:
    annotations = json.load(f)  # dict: feature_id_str -> {title, description, ...}

# ── Step 4: Process audio ──

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")

waveform, sr = torchaudio.load("your_audio.wav")
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
waveform = waveform.mean(dim=0)  # mono

# The Whisper feature extractor pads or truncates every clip to 30 seconds
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
mel = inputs.input_features.to(DEVICE)

with torch.no_grad():
    embedding = majestrino.encode_audio(mel)       # (1, 768)
    recons, info = sae(embedding)                   # top-k decomposition
    top_indices = info["inds"][0].cpu().tolist()     # 5 feature indices
    top_values = info["vals"][0].cpu().tolist()      # 5 activation values

print("Active features:")
for idx, val in zip(top_indices, top_values):
    ann = annotations.get(str(idx), {})
    title = ann.get("title", "Unknown")
    print(f"  Feature {idx}: {title} (activation={val:.4f})")

Example output

Active features:
  Feature 4821: Casual American Male Speech (activation=0.3142)
  Feature 7203: Conversational Narration (activation=0.2891)
  Feature 1156: Standard American English (activation=0.2453)
  Feature 9834: Clear Articulate Delivery (activation=0.1987)
  Feature 3291: Warm Baritone Timbre (activation=0.1654)

Files

├── sae.py                     # Standalone SAE class (copy to your project)
├── model/
│   ├── config.json            # Model hyperparameters
│   └── state_dict.pth         # PyTorch weights (73 MB)
├── annotations.json           # 9,575 feature annotations
├── group_assignments.json     # Feature → group mapping
└── reports/
    ├── index.html             # Main feature index (browseable)
    ├── index_groups.html      # Grouped feature view
    └── feature_reports.tar    # 10,684 individual feature pages with audio

Extracting feature reports

# Extract the interactive HTML reports (run from a local copy of this repository)
cd reports/
tar xf feature_reports.tar
# Open index.html in a browser to explore all features
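
The same extraction can be scripted in Python, which also covers the download step. This is a minimal sketch: the helper names extract_reports and download_and_extract are hypothetical, and the in-repo path reports/feature_reports.tar is assumed from the file listing above.

```python
import tarfile
from pathlib import Path


def extract_reports(tar_path, out_dir):
    """Unpack the feature report archive into out_dir; return the number of files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        file_members = [m for m in archive.getmembers() if m.isfile()]
        archive.extractall(out_dir)
    return len(file_members)


def download_and_extract(out_dir="reports"):
    """Fetch the archive from the Hub, then unpack it (needs network access)."""
    # Imported lazily so extract_reports() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # Assumed path inside the repository, taken from the file listing.
    tar_path = hf_hub_download(
        "laion/majestrino-1.00-16xk5-sae", "reports/feature_reports.tar"
    )
    return extract_reports(tar_path, out_dir)
```

Calling download_and_extract() should leave the individual feature pages under reports/, after which index.html can be opened as before.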

Architecture

Input (768-d Majestrino embedding)
  │
  ├─ subtract pre_bias
  │
  ├─ encoder: Linear(768 → 12288, no bias)
  │
  ├─ add latent_bias
  │
  ├─ top-k (k=5): keep 5 largest activations
  │
  ├─ ReLU
  │
  ├─ decoder: Linear(12288 → 768, no bias)
  │
  └─ add pre_bias → reconstruction (768-d)
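
The diagram can be read as a short PyTorch forward pass. The class below is an illustrative sketch written from the diagram alone, not the shipped sae.py: the name TopKSAESketch is hypothetical, the biases are zero-initialised for simplicity, and the returned dictionary only mimics the inds and vals keys used in the Quick Start.

```python
import torch
import torch.nn as nn


class TopKSAESketch(nn.Module):
    """Illustrative re-implementation of the diagram; not the shipped sae.py."""

    def __init__(self, input_dim=768, dict_size=12288, k=5):
        super().__init__()
        self.k = k
        self.pre_bias = nn.Parameter(torch.zeros(input_dim))
        self.latent_bias = nn.Parameter(torch.zeros(dict_size))
        self.encoder = nn.Linear(input_dim, dict_size, bias=False)
        self.decoder = nn.Linear(dict_size, input_dim, bias=False)

    def forward(self, x):
        # Centre the input, project into the dictionary, add the latent bias.
        pre_acts = self.encoder(x - self.pre_bias) + self.latent_bias
        # Keep the k largest pre-activations per row, then clamp negatives to zero.
        vals, inds = torch.topk(pre_acts, self.k, dim=-1)
        vals = torch.relu(vals)
        # Scatter the survivors into an otherwise all-zero latent vector.
        latents = torch.zeros_like(pre_acts).scatter(-1, inds, vals)
        recons = self.decoder(latents) + self.pre_bias
        return recons, {"inds": inds, "vals": vals}


torch.manual_seed(0)
sketch = TopKSAESketch(input_dim=16, dict_size=64, k=5)  # small sizes for a quick check
recons, info = sketch(torch.randn(2, 16))
print(recons.shape, info["inds"].shape)
```

Applying ReLU after the top-k selection, as the diagram does, means an input can end up with fewer than five non-zero activations if some of its five largest pre-activations are negative.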

Training Details

  • Base embeddings: Majestrino 1.00 (embedding_0_11 column from majestrino-data)
  • Training samples: 7,608,199 embeddings
  • Validation samples: 7,615 embeddings
  • Optimizer: Adam (lr=1e-4)
  • Loss: MSE reconstruction + AuxK dead neuron recovery + frequency overactivation penalty (coef=3.0, decay=0.999)
  • Dead features: 2,713 / 12,288 (22.1%); these never activate and are excluded from annotations
  • Alive & annotated: 9,575 features with Gemini-generated titles and descriptions

Annotations

Each annotated feature in annotations.json has:

{
  "3400": {
    "bin": 18,
    "bin_name": "Angry & Hostile State",
    "title": "Intense Anger and Frustration",
    "description": "The primary commonality across all positive samples is ...",
    "consistency": "high",
    "reasoning": "..."
  }
}
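
A structure like this makes it easy to search features by title. The sketch below is one possible approach: find_features is a hypothetical helper, the inline sample stands in for the real file, and the second entry (including its "low" consistency value) is invented for the example, since only "high" appears in this card.

```python
import json

# Two inline entries stand in for annotations.json.
# The entry "17" and the value "low" are made up for illustration.
sample_json = """
{
  "3400": {"bin": 18, "bin_name": "Angry & Hostile State",
           "title": "Intense Anger and Frustration", "consistency": "high"},
  "17": {"bin": 2, "bin_name": "Example Bin",
         "title": "Example Low-Confidence Feature", "consistency": "low"}
}
"""
annotations = json.loads(sample_json)


def find_features(annotations, keyword, consistency=None):
    """Return sorted feature IDs whose title contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for feature_id, ann in annotations.items():
        if keyword not in ann.get("title", "").lower():
            continue
        if consistency is not None and ann.get("consistency") != consistency:
            continue
        hits.append(int(feature_id))
    return sorted(hits)


print(find_features(annotations, "anger", consistency="high"))
```

With the real file, annotations would come from json.load() as in the Quick Start, and the same helper could be reused unchanged.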

Group assignments in group_assignments.json:

{
  "3400": [11],
  "5234": [12, 14]
}

Values are lists of group IDs (1-14). Features can belong to multiple groups (183 do).
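
Because the file maps features to groups, listing every feature in a given group requires inverting it. The following is a small sketch using the two entries shown above; features_by_group is a hypothetical helper, and GROUP_NAMES copies only the three relevant names from the Feature Groups table.

```python
from collections import defaultdict

# Names for three of the 14 groups, copied from the Feature Groups table.
GROUP_NAMES = {
    11: "Emotions (EmoNet 40)",
    12: "Talking Styles",
    14: "Timbre & Speaker Qualities",
}

# Same shape as group_assignments.json; in practice this would come from json.load().
group_assignments = {"3400": [11], "5234": [12, 14]}


def features_by_group(assignments):
    """Invert feature -> groups into group -> sorted list of feature IDs."""
    by_group = defaultdict(list)
    for feature_id, group_ids in assignments.items():
        for group_id in group_ids:
            by_group[group_id].append(int(feature_id))
    return {g: sorted(ids) for g, ids in sorted(by_group.items())}


by_group = features_by_group(group_assignments)
multi_group = [fid for fid, groups in group_assignments.items() if len(groups) > 1]

for group_id, feature_ids in by_group.items():
    print(f"{group_id:>2} {GROUP_NAMES[group_id]}: {feature_ids}")
print("Features in more than one group:", multi_group)
```

Run on the full file, the length of multi_group would be expected to match the 183 multi-group features mentioned above.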

Citation

@misc{majestrino-sae-2025,
  title={Sparse Autoencoder for Majestrino 1.00 Voice Embeddings},
  author={LAION},
  year={2025},
  url={https://huggingface.co/laion/majestrino-1.00-16xk5-sae}
}

License

Apache 2.0
