HT-Demucs FT — Production-ready PyTorch model card

The open-source stem separator with the highest vocal SDR in our MUSDB18-HQ benchmark (9.19 dB median), packaged for Hugging Face Inference Endpoints with a ready-to-deploy handler.py. Use it for vocal removal, karaoke generation, acapella extraction, and any task that needs clean 4-stem separation of music (vocals, drums, bass, other).

This is the htdemucs_ft 4-bag ensemble by Défossez et al. (Meta AI), repackaged with attribution. Original training and weights are unchanged; we add the deployment handler, the model card, and the benchmark context.

Need it as a REST API today, without standing up GPUs? Use the StemSplit API — same model, hosted for you, with credits and a dashboard.

Quality (independently benchmarked)

Median SDR in dB per stem on the standard MUSDB18-HQ test split (50 songs), computed with BSS Eval v4 via museval. Higher is better. Source: StemSplitio/stem-separation-benchmark-2026 v1.1.

Model	vocals	drums	bass	other
`htdemucs_ft` (this card)	9.19	10.11	10.38	6.34
`mdx_extra_q`	9.04	11.49	11.42	7.67
`htdemucs_6s`	8.66	9.54	9.11	5.74
`htdemucs`	8.53	10.01	9.78	6.42
`mdx_net_inst_hq3` (vocals-only)	5.81	—	—	—

Pick this model when vocals are the priority: it has the highest vocal SDR of the models benchmarked above. For drums- or bass-focused work, consider mdx_extra_q instead, which leads on drums, bass, and other.
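
For readers new to the metric, here is a minimal sketch of a signal-to-distortion ratio and of the median aggregation used in the table. It is a simplified energy-ratio formula, not museval's BSS Eval v4 (which, as far as we understand it, scores windowed frames and tolerates a filtered version of the reference), so it will not reproduce the numbers above. The function names `simple_sdr` and `median_sdr` are hypothetical.

```python
import math
from statistics import median


def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10 * log10(||ref||^2 / ||ref - est||^2)."""
    signal = sum(r * r for r in reference)
    distortion = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps avoids log(0) and division by zero for silent or perfect inputs.
    return 10.0 * math.log10((signal + eps) / (distortion + eps))


def median_sdr(per_track_sdr):
    """Aggregate per-track scores the way the table does: by median."""
    return median(per_track_sdr)


# Toy example: the estimate has a 10% amplitude error on every sample.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(f"{simple_sdr(reference, estimate):.1f} dB")
```

A 10% amplitude error leaves 1% of the energy as distortion, which is why the toy example lands at roughly 20 dB. Real separations are scored against the ground-truth stems shipped with MUSDB18-HQ.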

Quick start (Python)

import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient

with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
# post() returns the raw response bytes, so parse the JSON body before indexing by stem.
result = json.loads(client.post(json={"inputs": audio_b64}))

for stem in ("vocals", "drums", "bass", "other"):
    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
    sf.write(f"out_{stem}.wav", wav, sr)

Or run locally without Hugging Face at all:

import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").eval()
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)

# Prefer CUDA, then Apple MPS, and fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]

for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
    sf.write(f"out_{name}.wav", stems[i].T.cpu().numpy(), model.samplerate)

Deploy on Hugging Face Inference Endpoints

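Click Deploy → Inference Endpoints above, pick a GPU instance, and HF will spin up a container running handler.py. Recommended hardware tiers, with estimates scaled from the M4 Pro reference latency (RTF, the real-time factor, is processing time divided by audio duration; lower is faster):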

Hardware	RTF	Latency for 3-min song
NVIDIA L4	~0.04	~7 s
NVIDIA T4 small	~0.10	~18 s
CPU x4 (basic)	~0.7	~125 s
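
To size an endpoint for other track lengths, the table's relationship is simply latency = RTF × audio duration. A minimal sketch, using the RTF estimates from the table above (the helper name `latency_seconds` is hypothetical):

```python
def latency_seconds(rtf: float, audio_seconds: float) -> float:
    """Expected processing time for a track, given a real-time factor."""
    return rtf * audio_seconds


# RTF estimates copied from the table above; they are approximate.
RTF_BY_HARDWARE = {
    "NVIDIA L4": 0.04,
    "NVIDIA T4 small": 0.10,
    "CPU x4 (basic)": 0.7,
}

for name, rtf in RTF_BY_HARDWARE.items():
    print(f"{name}: about {latency_seconds(rtf, 180):.0f} s for a 3-minute song")
```

The table's figures are rounded, so the computed values can differ from it by a second or so. Cold starts and upload time are not included.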

Then call the endpoint:

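# Stream the JSON body on stdin: a base64-encoded song is too large for a
# command-line argument, and base64 line wraps must be stripped to keep the JSON valid.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
  curl -X POST "https://<your-endpoint>.endpoints.huggingface.cloud" \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-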

The response is a JSON object whose keys vocals, drums, bass, and other each hold a base64-encoded WAV at 44.1 kHz.

Skip the infrastructure — use the StemSplit API

If you'd rather not run your own endpoint, the StemSplit API wraps this same model (and the rest of the benchmarked lineup) behind a hosted REST API with credits, a dashboard, and webhooks.

curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-song.mp3" \
  -F "model=htdemucs_ft"

Or try it in your browser, no code:

🎤 Vocal Remover (free online tool) — upload a song, get an instrumental + isolated vocals
🎶 Karaoke Maker — same model, optimised for karaoke output
🎙️ Acapella Maker — extract clean vocal acapellas for remixes and sampling
📺 YouTube Stem Splitter — paste a YouTube URL, get the stems

Performance

Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min, RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs benchmarks.

Hardware	Per 3-min song	Peak RAM	Notes
Apple M4 Pro (MPS)	~47 s	3.1 GB	Measured in our benchmark (RTF 0.26)
NVIDIA L4 (CUDA)	~7 s	4 GB	Extrapolated
NVIDIA T4 small (CUDA)	~18 s	4 GB	Extrapolated
CPU (8-core)	~125 s	3 GB	Slow, but works for batch jobs
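
Because several rows above are extrapolated, it can be worth measuring RTF on your own hardware. The sketch below times any zero-argument callable and divides the median wall-clock time by the audio duration. `measure_rtf` is a hypothetical helper, and the `time.sleep` call only stands in for a real separation call.

```python
import time


def measure_rtf(separate, audio_seconds, runs=3):
    """Median wall-clock time of `separate()` divided by the audio duration."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        separate()
        timings.append(time.perf_counter() - start)
    timings.sort()
    return timings[len(timings) // 2] / audio_seconds


# Stand-in workload: pretend a 1-second clip takes about 20 ms to process.
rtf = measure_rtf(lambda: time.sleep(0.02), audio_seconds=1.0)
print(f"RTF is about {rtf:.2f}")
```

With the local quick-start code, the callable could be `lambda: apply_model(model, wav, device="cpu")` and `audio_seconds` would be `wav.shape[-1] / model.samplerate`. GPU work is asynchronous, so make sure the callable returns only once the result is back on the host, and discard a warm-up run if the first call includes model loading.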

How `htdemucs_ft` differs from the other Demucs models

Variant	Bag size	Best at	When to choose
`htdemucs_ft` (this)	4	Vocals	Karaoke, vocal isolation, acapella extraction
`htdemucs`	1	Balanced	Lower latency / smaller deploy
`htdemucs_6s`	1	6-stem (adds piano, guitar)	When you need piano/guitar separately
`mdx_extra_q`	4	Drums, bass	Music production where rhythm section is the priority

See the full stem-separation benchmark dataset for SDR / ISR / SIR / SAR across all stems.

Single-stem specialist variants (faster, smaller)

If you only need one stem in production, ship a specialist sub-model instead of the full 4-bag ensemble. Same per-stem quality, ~160 MB instead of ~640 MB, ~2.6× faster on M4 Pro MPS:

Repo	Stem	Use cases
`htdemucs-ft-drums-pytorch`	drums	Drum extraction, beat transcription, sample-pack creation
`htdemucs-ft-bass-pytorch`	bass	Bassline transcription, mix rebalancing, sub-bass mastering
`htdemucs-ft-other-pytorch`	other / instrumental	Karaoke instrumentals (pair with this vocals model), sample-flipping

This repo (the full bag) remains the best choice when you need vocals plus any other stem in a single request — it amortises the inference cost across all 4 stems.
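
The sketch below illustrates why a specialist can match the bag's per-stem quality. It assumes the upstream Demucs bag carries a weight matrix with one row per sub-model and one column per source, and that for htdemucs_ft this matrix is the identity, so each stem comes from exactly one fine-tuned sub-model. Both the matrix values and the helper `specialist_index` are assumptions for illustration, not taken from this repo.

```python
def specialist_index(weights, sources, stem):
    """Index of the sub-model that contributes most to `stem`."""
    column = sources.index(stem)
    contributions = [row[column] for row in weights]
    return max(range(len(contributions)), key=contributions.__getitem__)


# Source order as printed in the local quick-start example.
SOURCES = ["drums", "bass", "other", "vocals"]

# ASSUMPTION: identity weights, i.e. sub-model i is responsible for source i.
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(specialist_index(WEIGHTS, SOURCES, "vocals"))
```

If the assumption holds, running only the selected sub-model yields the same output for that stem while skipping the other three forward passes, which is consistent with the size and speed figures quoted above.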

Files in this repo

handler.py — EndpointHandler class HF Inference Endpoints calls on each request. Accepts base64 audio in, returns base64 stems out.
requirements.txt — Python deps (torch, demucs, soundfile).
README.md — this card.
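
For orientation, a custom handler for Inference Endpoints is a class named `EndpointHandler` with an `__init__(path)` and a `__call__(data)` method. The following is a minimal sketch of the base64-in, base64-out contract described above, not the handler.py shipped in this repo. The `separate` argument is a hypothetical injection point so the example runs without Demucs installed; a real handler would load the model in `__init__` and return WAV bytes per stem.

```python
import base64
from typing import Any, Callable, Dict, Optional

STEMS = ("vocals", "drums", "bass", "other")


def _not_configured(audio: bytes) -> Dict[str, bytes]:
    raise RuntimeError("no separation backend configured")


class EndpointHandler:
    def __init__(self, path: str = "",
                 separate: Optional[Callable[[bytes], Dict[str, bytes]]] = None):
        # A real handler would load the model here, once per container.
        self.separate = separate or _not_configured

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        audio_bytes = base64.b64decode(data["inputs"])
        stems = self.separate(audio_bytes)  # stem name -> encoded audio bytes
        return {name: base64.b64encode(stems[name]).decode("ascii")
                for name in STEMS}


# Stand-in backend for the demo: tags the input bytes with each stem name.
def fake_separate(audio: bytes) -> Dict[str, bytes]:
    return {name: name.encode() + b":" + audio for name in STEMS}


handler = EndpointHandler(separate=fake_separate)
request = {"inputs": base64.b64encode(b"demo").decode("ascii")}
response = handler(request)
print(sorted(response))
```

Keeping the separation step behind a callable makes the request and response plumbing testable on a machine without a GPU.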

Model weights are downloaded into the container's torch hub cache on first run (no .pt / .th files are stored in this repo to keep it small).

License & attribution

This repo is MIT-licensed, matching the original HT-Demucs.

Please cite the original authors if you use this model in research:

@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}

And if you use the benchmark or this packaging:

@misc{stemsplit_benchmark_2026,
  title  = {StemSplit Stem-Separation Benchmark 2026},
  author = {StemSplit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}

Original model: facebookresearch/demucs
Packaging by StemSplit
Benchmark dataset: StemSplitio/stem-separation-benchmark-2026
