Distributed Muon – Reproducibility Dataset (Chrome Traces, Scripts, Figures, Logs)

Hi everyone! :waving_hand:

After publishing my analysis of the Muon optimizer’s distributed behavior (“Reproducing and Validating Distributed Muon”), I received several DMs asking for the raw artifacts.

So I packaged everything into a reproducibility dataset:

:backhand_index_pointing_right: Dataset: https://huggingface.co/datasets/bird-of-paradise/muon-distributed-reproducibility
:backhand_index_pointing_right: Report: https://medium.com/@jenwei0312/reproducing-and-validating-distributed-muon-a-practical-verification-of-communication-0be4d1d9b893

:globe_with_meridians: What’s inside:

  • Full PyTorch Profiler Chrome traces (AdamW vs Muon)

  • Hybrid parallelism traces (DP=2/TP=2, DP=4, TP=4)

  • Analysis scripts used for parsing comm/compute ratios

  • High-res figures from the report

  • PDF of the full write-up

  • Trace files ready for chrome://tracing or Perfetto
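If you want to poke at the traces programmatically rather than in a viewer, here's a minimal sketch of pulling a comm/compute ratio out of a Chrome trace. The filters are assumptions, not the dataset's actual analysis scripts: it treats PyTorch Profiler complete events (`ph == "X"`, `dur` in microseconds) with "nccl" in the name as communication and everything else as compute. Adjust the matching to whatever the real traces contain.

```python
import json  # real files: json.load(open(path))["traceEvents"]

def comm_compute_ratio(trace_events):
    """Fraction of summed event duration spent in NCCL communication.

    Assumes Chrome-trace conventions: complete events (ph == "X") carry
    a "dur" field in microseconds, and NCCL collective kernels are
    identifiable by "nccl" in the event name.
    """
    comm_us = 0.0
    compute_us = 0.0
    for ev in trace_events:
        if ev.get("ph") != "X" or "dur" not in ev:
            continue  # skip metadata / flow / instant events
        if "nccl" in ev.get("name", "").lower():
            comm_us += ev["dur"]
        else:
            compute_us += ev["dur"]
    total = comm_us + compute_us
    return comm_us / total if total else 0.0

# Tiny hand-made trace just to show the shape of the input:
events = [
    {"ph": "X", "name": "ncclDevKernel_AllGather", "dur": 100},
    {"ph": "X", "name": "ampere_sgemm_128x64", "dur": 900},
]
print(comm_compute_ratio(events))  # 0.1
```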

:magnifying_glass_tilted_right: Why this dataset?

Muon’s communication-efficiency claims are interesting, but hard to verify without raw traces.
This repository aims to make those experiments:

  • transparent

  • reproducible

  • independently verifiable

:crystal_ball: Phase 3: Where This Work Goes Next (Wishlist)

The real fun begins beyond 4 GPUs.
If you have compute or want to collaborate, here's the roadmap I'm targeting next:

1. Scale to 32+ GPUs

Validate whether the DP=2 / TP=2 hybrid sweet spot holds up across multi-node environments.

2. Real LLM Pretraining Runs

Move beyond synthetic benchmarks and evaluate Muon vs AdamW on actual convergence curves.

3. Gather–Compute Overlap (Nightmare 2.0 :smiling_face_with_horns:)

Implement overlap logic to hide the remaining 1% latency and stress-test Muon’s comm pattern under load.
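For anyone curious what "overlap logic" means here, this is a framework-agnostic sketch of the idea using plain Python threads as a stand-in: launch the gather asynchronously, do independent work while it's in flight, then wait. In actual PyTorch this would be a collective issued with `async_op=True` followed by `work.wait()`; the function names and timings below are purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fake_all_gather(shard):
    """Stand-in for an async collective: pretend to wait on the network."""
    time.sleep(0.05)       # simulated communication latency
    return [shard, shard]  # pretend we gathered from 2 ranks

def independent_compute(x):
    """Work that does not depend on the gathered result."""
    time.sleep(0.05)       # simulated kernel time
    return x * 2

with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    work = pool.submit(fake_all_gather, 3)  # kick off the "collective"
    y = independent_compute(10)             # overlap independent compute
    gathered = work.result()                # the wait() equivalent
    elapsed = time.perf_counter() - start

print(gathered, y)  # [3, 3] 20
# elapsed lands near 0.05s rather than the 0.10s of running the two
# steps back-to-back: the communication is hidden behind the compute.
```

The hard part in practice isn't this pattern but finding compute that is genuinely independent of the gathered tensors, which is why I'm calling it Nightmare 2.0.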

4. Pipeline Parallelism Integration

Test Muon inside a full 3D parallelism setup (DP + TP + PP).

5. Stress-Testing Real Clusters

If anyone has access to a cluster that could use a real distributed-training workout:
I have the scripts, configs, and experiment plan ready to go.


:handshake: Call for Collaboration

If you:

  • want to run these experiments on your hardware

  • have cluster time and want to stress-test your environment

  • or are exploring optimizer behavior at scale

I’d love to collaborate or compare findings.

Reproducibility datasets are only the first step;
scaling the experiments is where things get interesting. :smiling_face_with_horns::turtle::sparkles:

— Jen
