Hi everyone!
After publishing my analysis of the Muon optimizer’s distributed behavior (“Reproducing and Validating Distributed Muon”), I received several DMs asking for the raw artifacts.
So I packaged everything into a reproducibility dataset:
Dataset: https://huggingface.co/datasets/bird-of-paradise/muon-distributed-reproducibility
Report: https://medium.com/@jenwei0312/reproducing-and-validating-distributed-muon-a-practical-verification-of-communication-0be4d1d9b893
What’s inside:
- Full PyTorch Profiler Chrome traces (AdamW vs Muon)
- Hybrid parallelism traces (DP=2/TP=2, DP=4, TP=4)
- Analysis scripts used for parsing comm/compute ratios (a sketch of the idea follows this list)
- High-res figures from the report
- PDF of the full write-up
- Trace files ready for chrome://tracing or Perfetto
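For anyone who wants to sanity-check the numbers themselves, here is a minimal sketch of the kind of parsing involved (not the exact script in the dataset). It assumes the standard Chrome trace JSON that torch.profiler exports, and matching communication kernels by an "nccl" substring is a heuristic:

```python
# A minimal sketch, not the exact script in the dataset. Assumes the standard
# Chrome trace JSON that torch.profiler exports; matching communication
# kernels by an "nccl" substring (and GPU kernels by cat == "kernel") is a
# heuristic that may need adjusting for your traces.
import json

def comm_compute_ratio(trace_path: str) -> float:
    with open(trace_path) as f:
        events = json.load(f)["traceEvents"]
    comm_us = compute_us = 0.0
    for ev in events:
        # Only complete ("X") GPU-kernel events carry the durations we want.
        if ev.get("ph") != "X" or ev.get("cat") != "kernel":
            continue
        dur = ev.get("dur", 0)  # microseconds
        if "nccl" in ev.get("name", "").lower():
            comm_us += dur
        else:
            compute_us += dur
    return comm_us / max(compute_us, 1e-9)

print(comm_compute_ratio("trace_muon_dp2_tp2.json"))  # hypothetical filename
```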
Why this dataset?
Muon’s communication-efficiency claims are interesting, but hard to verify without raw traces.
This repository aims to make those experiments:
- transparent
- reproducible
- independently verifiable
Phase 3: Where This Work Goes Next (Wishlist)
The real fun begins beyond 4 GPUs.
If you have compute or want to collaborate, here’s the roadmap I’m aiming at next:
1. Scale to 32+ GPUs
Validate whether the DP=2 / TP=2 hybrid sweet spot holds up across multi-node environments.
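If it helps anyone planning a run, here is a minimal sketch of how such a hybrid layout can be expressed with PyTorch's DeviceMesh API; the 8 x 4 split for 32 GPUs is illustrative, not a validated configuration:

```python
# A minimal sketch, assuming PyTorch >= 2.2 and a multi-node torchrun launch,
# e.g.: torchrun --nnodes=4 --nproc_per_node=8 train.py
# The 8 x 4 (DP x TP) split for 32 GPUs is illustrative only.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
mesh = init_device_mesh("cuda", mesh_shape=(8, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh.get_group("dp")  # gradient all-reduce runs over this group
tp_group = mesh.get_group("tp")  # tensor-parallel collectives run over this one
```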
2. Real LLM Pretraining Runs
Move beyond synthetic benchmarks and evaluate Muon vs AdamW on actual convergence curves.
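The protocol here could be as simple as two otherwise-identical runs that differ only in the optimizer. A hedged sketch (the Muon branch is a deliberate placeholder, since its constructor depends on which implementation you use):

```python
# A minimal A/B sketch: same model, data, and seed; only the optimizer differs.
# The "muon" branch is a placeholder, since Muon's constructor varies across
# implementations; the AdamW hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def build_optimizer(name: str, model: torch.nn.Module):
    if name == "adamw":
        return torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    if name == "muon":
        raise NotImplementedError("plug in your Muon implementation here")
    raise ValueError(f"unknown optimizer: {name}")

def run(name: str, model, batches):
    opt = build_optimizer(name, model)
    curve = []
    for x, y in batches:
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        curve.append(loss.item())  # the convergence curve to compare across runs
    return curve
```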
3. Gather–Compute Overlap (Nightmare 2.0)
Implement overlap logic to hide the remaining 1% latency and stress-test Muon’s comm pattern under load.
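For context, the core pattern is to launch the gather as an async collective and keep independent compute on the default stream until the result is needed. A minimal sketch (the tensor names are illustrative, not Muon internals):

```python
# A minimal sketch of gather-compute overlap using an async NCCL collective;
# local_shard, x, and w are illustrative tensors, not Muon internals.
import torch
import torch.distributed as dist

def overlapped_gather(local_shard: torch.Tensor,
                      x: torch.Tensor, w: torch.Tensor):
    world = dist.get_world_size()
    out = torch.empty(world * local_shard.shape[0], *local_shard.shape[1:],
                      dtype=local_shard.dtype, device=local_shard.device)
    # Kick off the all-gather asynchronously (it runs on NCCL's own stream)...
    work = dist.all_gather_into_tensor(out, local_shard, async_op=True)
    y = x @ w     # ...and overlap independent compute on the default stream
    work.wait()   # block only once the gathered tensor is actually needed
    return out, y
```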
4. Pipeline Parallelism Integration
Test Muon inside a full 3D parallelism setup (DP + TP + PP).
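A 3D layout can be expressed with the same DeviceMesh API; a minimal sketch, where the 2 x 2 x 2 shape (8 GPUs) and the dim ordering are purely illustrative:

```python
# A minimal sketch of a 3D (DP + PP + TP) mesh; the 2 x 2 x 2 shape and the
# dim ordering are illustrative, not a recommendation.
from torch.distributed.device_mesh import init_device_mesh

mesh3d = init_device_mesh("cuda", mesh_shape=(2, 2, 2),
                          mesh_dim_names=("dp", "pp", "tp"))
dp_mesh = mesh3d["dp"]  # sub-mesh for gradient all-reduce
pp_mesh = mesh3d["pp"]  # sub-mesh for pipeline-stage point-to-point sends
tp_mesh = mesh3d["tp"]  # sub-mesh for tensor-parallel collectives
```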
5. Stress-Testing Real Clusters
If anyone has access to a cluster that could use a real distributed training workout, I have the scripts, configs, and experiment plan ready to go.
Call for Collaboration
If you:
- want to run these experiments on your hardware
- have cluster time and want to stress-test your environment
- or are exploring optimizer behavior at scale
I’d love to collaborate or compare findings.
Reproducibility datasets are only the first step; scaling the experiments is where things get interesting.
— Jen