HT-Demucs FT: Production-ready PyTorch model card
The open-source stem separator with the highest vocal SDR in our MUSDB18-HQ
benchmark (9.19 dB median), packaged for Hugging Face Inference Endpoints with a ready-to-deploy
handler.py. Use it for vocal removal, karaoke generation, acapella
extraction, and any task that needs clean 4-stem separation of music
(vocals, drums, bass, other).
This is the htdemucs_ft 4-bag ensemble by Défossez et al. (Meta AI),
repackaged with attribution. Original training and weights are unchanged;
we add the deployment handler, the model card, and the benchmark context.
Need it as a REST API today, without standing up GPUs? Use the StemSplit API: same model, hosted for you, with credits and a dashboard.
Quality (independently benchmarked)
Median SDR per stem on the standard MUSDB18-HQ test split (50 songs), BSS
Eval v4 via museval. Higher is better. Source:
StemSplitio/stem-separation-benchmark-2026 v1.1.
| Model | vocals | drums | bass | other |
|---|---|---|---|---|
| htdemucs_ft (this card) | 9.19 | 10.11 | 10.38 | 6.34 |
| mdx_extra_q | 9.04 | 11.49 | 11.42 | 7.67 |
| htdemucs_6s | 8.66 | 9.54 | 9.11 | 5.74 |
| htdemucs | 8.53 | 10.01 | 9.78 | 6.42 |
| mdx_net_inst_hq3 (vocals-only) | 5.81 | n/a | n/a | n/a |
Pick this model when vocals are the priority: it has the highest vocal SDR of
the models benchmarked here on MUSDB18-HQ. For drums- or bass-focused work,
consider mdx_extra_q instead.
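As a rough guide to reading the table, the sketch below computes a simplified signal-to-distortion ratio, 10 * log10(signal energy / error energy), over consecutive windows and takes the median. It is an illustration only: museval's BSS Eval v4 also allows for a short distortion filter, so its scores will differ from this simplified form, and the function names and the window length are assumptions made for the example.

```python
import math
from statistics import median


def simple_sdr(reference, estimate, eps=1e-10):
    """Simplified SDR in dB: 10 * log10(signal energy / error energy)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps keeps the ratio finite for silent or perfectly matched windows.
    return 10.0 * math.log10((signal + eps) / (error + eps))


def median_windowed_sdr(reference, estimate, window):
    """Median of simple_sdr over consecutive, non-overlapping windows."""
    scores = [
        simple_sdr(reference[i:i + window], estimate[i:i + window])
        for i in range(0, len(reference) - window + 1, window)
    ]
    return median(scores)
```

An estimate whose amplitude is off by 10% everywhere scores about 20 dB under this definition, which gives a sense of scale for the roughly 6 to 11 dB values in the table.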
Quick start (Python)
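import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient

# Read the song and base64-encode it for the JSON payload.
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
# post() returns raw bytes, so parse the JSON body before indexing it.
# Note: post() is deprecated in recent huggingface_hub releases.
result = json.loads(client.post(json={"inputs": audio_b64}))

# Decode each base64 WAV and write it to disk.
for stem in ("vocals", "drums", "bass", "other"):
    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
    sf.write(f"out_{stem}.wav", wav, sr)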
Or run locally without Hugging Face at all:
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").eval()
# MP3 decoding through soundfile needs libsndfile 1.1.0 or newer.
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()  # (channels, samples)
# Resample and remix to what the model expects, then add a batch dimension.
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)
device = "mps" if torch.backends.mps.is_available() else "cpu"
with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]
for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
    sf.write(f"out_{name}.wav", stems[i].T.cpu().numpy(), model.samplerate)
Deploy on Hugging Face Inference Endpoints
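Click Deploy → Inference Endpoints above, pick a GPU instance, and HF
will spin up a container running handler.py. Recommended
hardware tiers, with latencies estimated from the M4 Pro reference measurement
(RTF is the real-time factor: processing time divided by audio duration):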
| Hardware | RTF | Latency for 3-min song |
|---|---|---|
| NVIDIA L4 | ~0.04 | ~7 s |
| NVIDIA T4 small | ~0.10 | ~18 s |
| CPU x4 (basic) | ~0.7 | ~125 s |
Then call the endpoint:
curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$(base64 < your-song.mp3 | tr -d '\n')\"}"
The response is a JSON object whose vocals, drums, bass, and other keys each
hold a base64-encoded WAV at 44.1 kHz.
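The same request and response handling can be sketched in Python with only the standard library. This assumes the request body is `{"inputs": <base64 audio>}` and the response has the four stem keys described above; the helper names and the 300-second timeout are illustrative choices, not part of the repo.

```python
import base64
import json
import urllib.request

STEMS = ("vocals", "drums", "bass", "other")


def build_payload(audio_bytes: bytes) -> bytes:
    """Wrap raw audio bytes in the {"inputs": <base64>} JSON body."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"inputs": encoded}).encode("utf-8")


def decode_stems(response_body: bytes) -> dict:
    """Map each stem name to its decoded WAV bytes; fail if one is missing."""
    payload = json.loads(response_body)
    missing = [name for name in STEMS if name not in payload]
    if missing:
        raise KeyError(f"response is missing stems: {missing}")
    return {name: base64.b64decode(payload[name]) for name in STEMS}


def separate(endpoint_url: str, token: str, audio_path: str,
             timeout: float = 300.0) -> dict:
    """POST one audio file to the endpoint and return {stem: wav_bytes}."""
    with open(audio_path, "rb") as f:
        body = build_payload(f.read())
    request = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_stems(response.read())
```

Sending the body from a file-backed buffer like this also sidesteps shell argument-length limits, which the inline curl form can hit with long tracks.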
Skip the infrastructure: use the StemSplit API
If you'd rather not run your own endpoint, the StemSplit API wraps this same model (and the rest of the benchmarked lineup) behind a hosted REST API with credits, a dashboard, and webhooks.
curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-song.mp3" \
  -F "model=htdemucs_ft"
- Developer docs
- API reference
- Guides & recipes
Or try it in your browser, no code:
- Vocal Remover (free online tool): upload a song, get an instrumental + isolated vocals
- Karaoke Maker: same model, optimised for karaoke output
- Acapella Maker: extract clean vocal acapellas for remixes and sampling
- YouTube Stem Splitter: paste a YouTube URL, get the stems
Performance
Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min, RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs benchmarks.
| Hardware | Per 3-min song | Peak RAM | Notes |
|---|---|---|---|
| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in our benchmark (RTF 0.26) |
| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |
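To estimate latency for other track lengths, the arithmetic behind these figures can be sketched as follows. It assumes RTF means processing time divided by audio duration, which is consistent with the numbers in this card (for example, 0.26 × 180 s ≈ 47 s); the function names are illustrative.

```python
def estimated_latency_s(rtf: float, duration_s: float) -> float:
    """Estimated processing time in seconds for audio of the given length."""
    return rtf * duration_s


def real_time_factor(processing_s: float, duration_s: float) -> float:
    """RTF from a measured run: processing time divided by audio duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return processing_s / duration_s
```

These are linear estimates; fixed costs such as model loading or a cold start are not included.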
How htdemucs_ft differs from the other Demucs models
| Variant | Bag size | Best at | When to choose |
|---|---|---|---|
| htdemucs_ft (this) | 4 | Vocals | Karaoke, vocal isolation, acapella extraction |
| htdemucs | 1 | Balanced | Lower latency / smaller deploy |
| htdemucs_6s | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
| mdx_extra_q | 4 | Drums, bass | Music production where rhythm section is the priority |
See the full stem-separation benchmark dataset for SDR / ISR / SIR / SAR across all stems.
Single-stem specialist variants (faster, smaller)
If you only need one stem in production, ship a specialist sub-model instead of the full 4-bag ensemble. Same per-stem quality, ~160 MB instead of ~640 MB, ~2.6× faster on M4 Pro MPS:
| Repo | Stem | Use cases |
|---|---|---|
| htdemucs-ft-drums-pytorch | drums | Drum extraction, beat transcription, sample-pack creation |
| htdemucs-ft-bass-pytorch | bass | Bassline transcription, mix rebalancing, sub-bass mastering |
| htdemucs-ft-other-pytorch | other / instrumental | Karaoke instrumentals (pair with this vocals model), sample-flipping |
This repo (the full bag) remains the best choice when you need vocals plus any other stem in a single request: it amortises the inference cost across all 4 stems.
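One way to encode that choice in a deployment script is sketched below. The repo names come from the table above and from the quick start; routing every multi-stem request (and vocals-only requests, for which no specialist is listed) to the full bag is an assumption of this sketch, and `pick_repo` is a hypothetical helper.

```python
FULL_BAG = "htdemucs-ft-pytorch"
SPECIALISTS = {
    "drums": "htdemucs-ft-drums-pytorch",
    "bass": "htdemucs-ft-bass-pytorch",
    "other": "htdemucs-ft-other-pytorch",
}
ALL_STEMS = {"vocals", "drums", "bass", "other"}


def pick_repo(stems):
    """Return the repo to deploy for the requested set of stems."""
    wanted = set(stems)
    unknown = wanted - ALL_STEMS
    if not wanted or unknown:
        raise ValueError(f"expected stems from {sorted(ALL_STEMS)}, got {sorted(wanted)}")
    if len(wanted) == 1:
        (stem,) = wanted
        # Single stem: use its specialist if one is listed, else the full bag.
        return SPECIALISTS.get(stem, FULL_BAG)
    # Several stems in one request: the full bag shares the inference cost.
    return FULL_BAG
```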
Files in this repo
- handler.py: the EndpointHandler class that HF Inference Endpoints calls on each request. Accepts base64 audio in, returns base64 stems out.
- requirements.txt: Python deps (torch, demucs, soundfile).
- README.md: this card.
Model weights are downloaded into the container's torch hub cache on first
run (no .pt / .th files are stored in this repo to keep it small).
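To make the handler contract concrete, here is a minimal sketch of the request/response shape. It is not the handler.py shipped in this repo: the separation step is passed in as a hypothetical `separate_fn` callable (audio bytes in, `{stem: WAV bytes}` out) so that the base64 input and output handling stays visible without loading the model. A real handler would instead load htdemucs_ft in `__init__`, since Inference Endpoints constructs the class with only the model path.

```python
import base64
from typing import Any, Callable, Dict, Optional

# Hypothetical type: encoded audio bytes in, {stem name: WAV bytes} out.
Separator = Callable[[bytes], Dict[str, bytes]]


class EndpointHandler:
    def __init__(self, path: str = "", separate_fn: Optional[Separator] = None):
        # Sketch only: the separation step is injected rather than loaded here.
        if separate_fn is None:
            raise ValueError("this sketch needs a separate_fn")
        self.separate_fn = separate_fn

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        # Decode the base64 audio sent under the "inputs" key.
        audio_bytes = base64.b64decode(data["inputs"])
        stems = self.separate_fn(audio_bytes)
        # Re-encode each stem so the response is JSON-serialisable.
        return {
            name: base64.b64encode(wav).decode("ascii")
            for name, wav in stems.items()
        }
```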
License & attribution
This repo is MIT-licensed, matching the original HT-Demucs.
Please cite the original authors if you use this model in research:
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
And if you use the benchmark or this packaging:
@misc{stemsplit_benchmark_2026,
title = {StemSplit Stem-Separation Benchmark 2026},
author = {StemSplit},
year = {2026},
url = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}
- Original model: facebookresearch/demucs
- Packaging by StemSplit
- Benchmark dataset: StemSplitio/stem-separation-benchmark-2026