Hi everyone!
After publishing my analysis of the Muon optimizer’s distributed behavior (“Reproducing and Validating Distributed Muon”), I received several DMs asking for the raw artifacts.
So I packaged everything into a reproducibility dataset:
Dataset: https://huggingface.co/datasets/bird-of-paradise/muon-distributed-reproducibility
Report: https://medium.com/@jenwei0312/reproducing-and-validating-distributed-muon-a-practical-verification-of-communication-0be4d1d9b893
What’s inside:
- Full PyTorch Profiler Chrome traces (AdamW vs Muon)
- Hybrid parallelism traces (DP=2/TP=2, DP=4, TP=4)
- Analysis scripts used for parsing comm/compute ratios (a sketch of the idea follows this list)
- High-res figures from the report
- PDF of the full write-up
- Trace files ready for chrome://tracing or Perfetto
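For anyone who wants to sanity-check the numbers themselves, here is a minimal sketch of the kind of parsing involved (not the exact script in the dataset). It assumes the standard Chrome trace JSON that torch.profiler exports, and matching communication kernels by an "nccl" substring is a heuristic:

```python
# A minimal sketch, not the exact script in the dataset. Assumes the standard
# Chrome trace JSON that torch.profiler exports; matching communication
# kernels by an "nccl" substring (and GPU kernels by cat == "kernel") is a
# heuristic that may need adjusting for your traces.
import json

def comm_compute_ratio(trace_path: str) -> float:
    with open(trace_path) as f:
        events = json.load(f)["traceEvents"]
    comm_us = compute_us = 0.0
    for ev in events:
        # Only complete ("X") GPU-kernel events carry the durations we want.
        if ev.get("ph") != "X" or ev.get("cat") != "kernel":
            continue
        dur = ev.get("dur", 0)  # microseconds
        if "nccl" in ev.get("name", "").lower():
            comm_us += dur
        else:
            compute_us += dur
    return comm_us / max(compute_us, 1e-9)

print(comm_compute_ratio("trace_muon_dp2_tp2.json"))  # hypothetical filename
```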
Why this dataset?
Muon’s communication-efficiency claims are interesting, but hard to verify without raw traces.
This repository aims to make those experiments:
- transparent
- reproducible
- independently verifiable
Phase 3: Where This Work Goes Next (Wishlist)
The real fun begins beyond 4 GPUs.
If you have compute or want to collaborate, here’s the roadmap I’m aiming at next:
1. Scale to 32+ GPUs
Validate whether the DP=2 / TP=2 hybrid sweet spot holds up across multi-node environments.
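If it helps anyone planning a run, here is a minimal sketch of how such a hybrid layout can be expressed with PyTorch's DeviceMesh API; the 8 x 4 split for 32 GPUs is illustrative, not a validated configuration:

```python
# A minimal sketch, assuming PyTorch >= 2.2 and a multi-node torchrun launch,
# e.g.: torchrun --nnodes=4 --nproc_per_node=8 train.py
# The 8 x 4 (DP x TP) split for 32 GPUs is illustrative only.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group(backend="nccl")
mesh = init_device_mesh("cuda", mesh_shape=(8, 4), mesh_dim_names=("dp", "tp"))
dp_group = mesh.get_group("dp")  # gradient all-reduce runs over this group
tp_group = mesh.get_group("tp")  # tensor-parallel collectives run over this one
```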
2. Real LLM Pretraining Runs
Move beyond synthetic benchmarks and evaluate Muon vs AdamW on actual convergence curves.
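The protocol here could be as simple as two otherwise-identical runs that differ only in the optimizer. A hedged sketch (the Muon branch is a deliberate placeholder, since its constructor depends on which implementation you use):

```python
# A minimal A/B sketch: same model, data, and seed; only the optimizer differs.
# The "muon" branch is a placeholder, since Muon's constructor varies across
# implementations; the AdamW hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def build_optimizer(name: str, model: torch.nn.Module):
    if name == "adamw":
        return torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    if name == "muon":
        raise NotImplementedError("plug in your Muon implementation here")
    raise ValueError(f"unknown optimizer: {name}")

def run(name: str, model, batches):
    opt = build_optimizer(name, model)
    curve = []
    for x, y in batches:
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        curve.append(loss.item())  # the convergence curve to compare across runs
    return curve
```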
3. Gather–Compute Overlap (Nightmare 2.0)
Implement overlap logic to hide the remaining 1% latency and stress-test Muon’s comm pattern under load.
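For context, the core pattern is to launch the gather as an async collective and keep independent compute on the default stream until the result is needed. A minimal sketch (the tensor names are illustrative, not Muon internals):

```python
# A minimal sketch of gather-compute overlap using an async NCCL collective;
# local_shard, x, and w are illustrative tensors, not Muon internals.
import torch
import torch.distributed as dist

def overlapped_gather(local_shard: torch.Tensor,
                      x: torch.Tensor, w: torch.Tensor):
    world = dist.get_world_size()
    out = torch.empty(world * local_shard.shape[0], *local_shard.shape[1:],
                      dtype=local_shard.dtype, device=local_shard.device)
    # Kick off the all-gather asynchronously (it runs on NCCL's own stream)...
    work = dist.all_gather_into_tensor(out, local_shard, async_op=True)
    y = x @ w     # ...and overlap independent compute on the default stream
    work.wait()   # block only once the gathered tensor is actually needed
    return out, y
```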
4. Pipeline Parallelism Integration
Test Muon inside a full 3D parallelism setup (DP + TP + PP).
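A 3D layout can be expressed with the same DeviceMesh API; a minimal sketch, where the 2 x 2 x 2 shape (8 GPUs) and the dim ordering are purely illustrative:

```python
# A minimal sketch of a 3D (DP + PP + TP) mesh; the 2 x 2 x 2 shape and the
# dim ordering are illustrative, not a recommendation.
from torch.distributed.device_mesh import init_device_mesh

mesh3d = init_device_mesh("cuda", mesh_shape=(2, 2, 2),
                          mesh_dim_names=("dp", "pp", "tp"))
dp_mesh = mesh3d["dp"]  # sub-mesh for gradient all-reduce
pp_mesh = mesh3d["pp"]  # sub-mesh for pipeline-stage point-to-point sends
tp_mesh = mesh3d["tp"]  # sub-mesh for tensor-parallel collectives
```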
5. Stress-Testing Real Clusters
If anyone has access to a cluster that could use a real distributed training workout, I have the scripts, configs, and experiment plan ready to go.
Call for Collaboration
If you:
- want to run these experiments on your hardware
- have cluster time and want to stress-test your environment
- or are exploring optimizer behavior at scale
I’d love to collaborate or compare findings.
Reproducibility datasets are only the first step; scaling the experiments is where things get interesting.
— Jen