Ah… sorry…
Let's continue the clarification. For now, there is something similar to a known bug (related to UlyssesSPDataLoaderAdapter), but it looks different.
However, the meaning behind the numbers the library outputs seems inherently difficult to grasp, I think…
What you are seeing is consistent with how DeepSpeed Ulysses SP is integrated in Accelerate and Transformers. Your conclusion "the same sequence is replicated, so compute is ×4" is the wrong mental model for SP. The batch contents can look replicated on all GPUs for a step, while the attention compute is sharded across GPUs.
Below is the clean model that matches all three facts you reported: (1) steps do not shrink, (2) num_tokens becomes exactly ×4, (3) you do not need to "have examples of length 15,922".
1) What "Ulysses sequence parallelism" actually does
Ulysses SP is a long-context enabler. It partitions along the sequence dimension and uses collectives (all-to-all) during attention so multiple GPUs cooperate on one long sequence. This is the core idea in the official Accelerate docs and DeepSpeed tutorial. (Hugging Face)
Key detail that resolves your "replication" concern:
The DataLoader protocol is "one rank's batch, all ranks cooperate"
The ALST integration paper (the one that upstreamed the HF Transformers integration) describes the DataLoader adapter behavior explicitly:
- UlyssesSPDataLoaderAdapter shards each batch along the sequence dimension
- then uses a single rank's batch and processes it collaboratively across all ranks
- and iterates over ranks, doing "one batch at a time using all ranks in parallel" (arXiv)
So yes, at a given optimizer step, all 4 GPUs can appear to "see the same sample". That is expected for SP because they must cooperate on the same batch to compute attention correctly.
But that is not "4 independent full computations". It is "1 computation spread across 4 GPUs, plus communication".
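To make "sharded, not replicated" concrete, here is toy shape arithmetic (illustrative numbers, not the real implementation) showing how Ulysses trades a sequence shard for a head shard inside attention:

```python
# Toy shape arithmetic for Ulysses SP (illustrative, not DeepSpeed's code).
# Outside attention, each rank holds a slice of the sequence. The all-to-all
# swaps that for a slice of the heads, so each rank attends over the FULL
# sequence, but only for its subset of heads.
seq_len, num_heads, sp_size = 16_384, 32, 4

per_rank_seq = seq_len // sp_size      # 4096 tokens held by each rank
per_rank_heads = num_heads // sp_size  # 8 heads per rank during attention

# Rough per-rank attention cost: full seq_len^2, but only 1/sp_size of the
# heads, so compute per rank is ~1/sp_size of the single-GPU cost (plus comms).
print(per_rank_seq, per_rank_heads)
```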
2) Why num_tokens becomes exactly ×4 in TRL SFTTrainer
This one is simpler: it is a logging artifact.
TRL's trainers compute "tokens in batch" as something like:
- inputs["attention_mask"].sum() (or a position_ids fallback)
- then they call accelerator.gather_for_metrics(...)
- then they sum across processes (GitHub)

So if each of the 4 processes reports ~15,922 tokens locally for that step, the logged metric becomes 4 × 15,922 = 63,688.
That matches your observation exactly.
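A minimal sketch of that logging path, assuming a `batch` dict with an `attention_mask` tensor; this mirrors the gather-then-sum pattern conceptually, not TRL's exact code:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Hypothetical batch; under SP every rank can see the same global-looking mask.
batch = {"attention_mask": torch.ones(1, 15_922, dtype=torch.long)}

# Each rank computes a "local" token count -- identical on all SP ranks here.
local_tokens = batch["attention_mask"].sum().reshape(1)

# gather_for_metrics returns one value per process; summing them multiplies
# the real count by sp_size when every rank reports the same number.
num_tokens = accelerator.gather_for_metrics(local_tokens).sum().item()
print(num_tokens)  # 63,688 on 4 ranks: 4 * 15,922
```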
Why local token counts can be identical on all ranks
With Ulysses SP, the framework often relies on position_ids rather than attention_mask for the SP path. Accelerate's SP docs explicitly call out that this SP implementation uses position_ids. (Hugging Face)
That combination means:
- the model can be doing correct SP sharding internally
- while your attention_mask.sum() is still "global-looking" on every rank
- and TRL then sums it across ranks for logging
So num_tokens is not a reliable indicator of "how many unique tokens were processed" under SP.
3) Why "Total optimization steps = 20,243" is not a bug
Transformers' Trainer explicitly accounts for SP when computing the effective DP world size:
- dp_world_size = world_size // (tp_size * cp_size * sp_size)
- the total batch for optimizer math uses that dp_world_size (Hugging Face)
In your run:
- world_size = 4
- sp_size = 4
- so dp_world_size = 1
That makes "optimizer steps per epoch ≈ number of examples" plausible and often expected in SP-only layouts.
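For intuition, the same step math as a back-of-the-envelope script; the per-device batch size, gradient accumulation, and the dataset size of 20,243 are assumptions for illustration, not values read from your config:

```python
# Trainer-style step math under SP (variable names illustrative).
world_size = 4
tp_size, cp_size, sp_size = 1, 1, 4   # assuming no TP/CP in your run
per_device_batch, grad_accum = 1, 1   # assumed; adjust to your config

dp_world_size = world_size // (tp_size * cp_size * sp_size)   # = 1
global_batch = per_device_batch * grad_accum * dp_world_size  # = 1

num_examples = 20_243                 # if this were your dataset size...
steps_per_epoch = num_examples // global_batch
print(steps_per_epoch)                # ...steps == examples, as observed
```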
Separately, the Transformers DeepSpeed doc text for SP says your DataLoader should keep feeding different samples per rank, and the adapter handles distribution across SP ranks. (Hugging Face)
Combine that with the ALST paper's "iterate over ranks, one batch at a time using all ranks", and you get: you still cover the dataset, but you do not get DP throughput scaling. (arXiv)
4) Your "I don't have sequences of length 15,922" concern
Three separate reasons this is not alarming:
1. You might simply have that length somewhere. Exact lengths are easy to miss in a large dataset.
2. Packing and truncation create lengths that are not "single-example lengths". TRL explicitly distinguishes:
   - global sequence length (your configured max_seq_length or max_length)
   - micro sequence length per GPU after splitting (roughly global / sp_size) (Hugging Face)
   If packing is enabled anywhere, a "training sequence" can be a concatenation of multiple dataset rows (see the sketch after this list).
3. That 15,922 is a logging-derived number from the first step you looked at. Under SP it is especially not safe to interpret it as "one dataset example length".
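As referenced above, a tiny illustration of why packing can produce a length like 15,922 without any single example having that length; the row lengths here are made up:

```python
# Hypothetical row lengths -- none is 15,922, but a packed sequence can be.
example_lengths = [5_211, 6_480, 4_231]
packed_length = sum(example_lengths)
print(packed_length)  # 15922: a "training sequence" length matching no row
```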
5) Is it ever a real bug? Yes, but it looks different
There is a real DeepSpeed issue report titled:
- "UlyssesSPDataLoaderAdapter returns duplicate data" (GitHub)
That issue's claim is stronger than what you observed. It reports each rank printing the same input IDs repeatedly (N times) when using the adapter. (GitHub)
So:
- "batch looks identical across ranks for a given step" can be expected in SP
- "the same batch repeats multiple times unexpectedly" could be a bug (and is being reported upstream)
6) What to do to confirm SP is actually working in your run
These checks avoid misleading metrics.
Check A: micro-sequence shape per rank
- Log input_ids.shape[-1] on each rank for one step.
- If SP is active, the per-rank sequence length should look like "global_length / 4" (maybe with padding to make it divisible).
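A minimal sketch for this check, assuming you can print from a callback or a temporary hook in your collator; log_microseq_shape is a hypothetical helper:

```python
import torch.distributed as dist

def log_microseq_shape(batch):
    # Print each rank's local sequence length for one step. Under Ulysses SP
    # with sp_size=4 this should be ~global_length / 4 (plus any padding
    # added to make the length divisible by sp_size).
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] input_ids shape: {tuple(batch['input_ids'].shape)}")
```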
Check B: memory and OOM boundary
- Pick a global length that OOMs with sp_size=1.
- Run the same config with sp_size=4.
- If it fits, SP is doing the intended sharding.
Check C: stop summing tokens across ranks in your own logs
Instead of TRL's gather_for_metrics(...).sum(), log one of:
- rank-0 local tokens only
- mean tokens across ranks (sum / world_size)
This makes the number stable across 1-GPU and 4-GPU runs.
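A sketch of that alternative, assuming access to the Accelerator object (e.g. from a custom callback); log_stable_token_count is a hypothetical helper:

```python
def log_stable_token_count(accelerator, batch):
    # Gather per-rank counts, then divide by world size instead of summing,
    # so the number is comparable between 1-GPU and 4-GPU runs.
    local = batch["attention_mask"].sum().reshape(1)
    total = accelerator.gather_for_metrics(local).sum().item()
    if accelerator.is_main_process:
        mean = total / accelerator.num_processes
        print(f"tokens/step (mean across ranks): {mean:.0f}")
```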
Bottom line
- Seeing 63,688 = 4 × 15,922 does not prove duplicated training. It strongly suggests "token metric summed across ranks". (GitHub)
- In Ulysses SP, it is expected that one batch is processed collaboratively across all ranks at a step. That can look like replication, but it is the mechanism that enables long context. (arXiv)
- Your optimizer step count is consistent with Trainer's explicit dp_world_size formula that divides out sp_size. (Hugging Face)