Confusion with sequence parallelism

Hi everyone,

I’m training a model using SFTTrainer with 4 GPUs and I’m confused about the number of optimization steps reported, and in general about how sequence parallelism works in Accelerate. Here’s my configuration:

training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    
    parallelism_config=ParallelismConfig(
        sp_backend="deepspeed",
        dp_replicate_size=1,
        dp_shard_size=1,
        sp_size=4,
        sp_handler=DeepSpeedSequenceParallelConfig(
            sp_seq_length_is_variable=True,
            sp_attn_implementation="flash_attention_2",
        ),
    )
)

Training output:

***** Running training *****
Num examples = 20,243
Num Epochs = 1
Instantaneous batch size per device = 1
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 20,243

After one step:

{'loss': 1.9432, 'grad_norm': 0.236328125, 'learning_rate': 0.0002, 
 'num_tokens': 63688.0, 'mean_token_accuracy': 0.8572250008583069, 'epoch': 0.0}

My understanding:

  • With sequence parallelism (sp_size=4), each sequence is split across 4 GPUs

Questions:

Note: the largest example in my dataset is 40K tokens

num_tokens after 1 optimization step is ~64K, which seems to mean that more than 1 sequence was processed in that step

Any clarification would be greatly appreciated!


GPT says:


You are misunderstanding two things:

  1. Sequence parallelism (SP) does not mean “4 different sequences per step.”
  2. The num_tokens you see is a global, aggregated token counter, not “tokens in the one sample you think you trained on.”

Below is the clean mental model that matches your exact logs.


Background you need (DP vs SP)

Data parallelism (DP)

  • DP scales examples per optimizer step.
  • Each GPU sees different samples.
  • Steps per epoch usually drop as you add GPUs.

Sequence parallelism (SP, DeepSpeed ALST/Ulysses)

  • SP scales max sequence length you can fit.
  • Multiple GPUs cooperate to process the sequence dimension (tokens) in parallel.
  • From the user perspective, “multiple GPUs are used to process a single batch.” (Hugging Face)

So SP is primarily a long-context enabler, not a throughput multiplier. (Hugging Face)


Why “Total optimization steps = 20,243” is actually correct

Transformers’ Trainer computes the effective DP size like this:

  • dp_world_size = world_size // (tp_size * cp_size * sp_size) (Hugging Face)
  • Total batch size for optimizer math is essentially:
    micro_batch * grad_accum * dp_world_size (Hugging Face)

In your run:

  • world_size = 4
  • sp_size = 4
  • no TP and no CP
  • so dp_world_size = 4 // 4 = 1 (Hugging Face)

That means your effective global batch for optimizer updates is:

  • 1 (per_device) * 1 (grad_accum) * 1 (dp_world_size) = 1

So one epoch over 20,243 examples produces about 20,243 optimizer steps, which is exactly what you see.
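
A minimal sketch of that arithmetic in plain Python (the variable names are mine, not actual Trainer attributes):

# Reproduces the effective-batch arithmetic described above for this run.
world_size = 4                      # number of GPUs / processes
tp_size, cp_size, sp_size = 1, 1, 4
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_examples = 20_243

dp_world_size = world_size // (tp_size * cp_size * sp_size)   # 4 // 4 = 1
global_batch = per_device_train_batch_size * gradient_accumulation_steps * dp_world_size

steps_per_epoch = num_examples // global_batch
print(dp_world_size, global_batch, steps_per_epoch)            # 1 1 20243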

Why the banner says “Total train batch size … = 4”

That line is often computed using “number of processes” (4) in a way that is easy to misread under SP.

The Trainer’s own documented batch-size logic explicitly divides out sp_size via dp_world_size. (Hugging Face)
So in SP-heavy runs, you should trust the dp_world_size formula more than the “Total train batch size” banner.


Why num_tokens = 63688 after one step does NOT mean multiple sequences were processed

What TRL is actually counting

In SFTTrainer, the token counter is computed from something like attention_mask.sum() and then gathered across processes for metrics. The implementation pattern uses accelerator.gather_for_metrics(...) and then sums. (GitHub)

So num_tokens is best interpreted as:

  • “tokens this step, aggregated across processes” (and then accumulated over time).

It is not “the length of the one dataset example.”
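
A rough sketch of that logging pattern (simplified, not TRL's exact code; the helper name is mine):

def count_logged_tokens(accelerator, inputs):
    # Each process counts its own non-padding tokens for the step...
    local_tokens = inputs["attention_mask"].sum()
    # ...then the per-process counts are gathered and summed for logging, so
    # under SP the reported value is roughly per-rank tokens * number of processes.
    return accelerator.gather_for_metrics(local_tokens).sum().item()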

Why your number looks like ~64K

Do the simplest consistency check:

  • 63688 / 4 = 15,922

That is exactly 15,922 tokens per process (~16k). With SP and metric gathering, it is completely normal for the logged token count to be a multiple of the per-rank count. (GitHub)

Another common reason: packing can concatenate multiple samples

TRL’s distributed-training guidance explicitly shows SP setups often used with packing=True, and also clarifies that max_seq_length is the global sequence length before it is split into “micro sequences” across GPUs. (Hugging Face)
If packing is enabled anywhere in your pipeline, a single “training sequence” can be built from multiple dataset rows, and its token count can exceed your longest single row.
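
If you want to rule that out, check whether packing is enabled in your SFTConfig. A hedged sketch of what to look for (illustrative values; the length argument is max_seq_length in older TRL versions and max_length in newer ones):

from trl import SFTConfig

training_args = SFTConfig(
    packing=True,        # concatenates dataset rows into fixed-length training sequences
    max_length=16384,    # global sequence length before SP splits it across GPUs
)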


What you should do (depending on your goal)

Goal A: train with very long sequences (your “40k tokens” goal)

Keep SP, but align expectations:

  1. Accept that steps won’t shrink with GPU count when sp_size consumes all GPUs and dp_world_size=1. (Hugging Face)

  2. Make sequence splitting safe and efficient:

    • Use Flash Attention (you already do). HF explicitly recommends Flash Attention because SDPA may attend across packed sample boundaries. (Hugging Face)
    • Ensure padding divisibility. HF recommends pad_to_multiple_of so lengths are divisible by sp_size. (Hugging Face)
  3. Scale batch size the way SP expects:

    • If you want a larger effective optimizer batch, increase gradient_accumulation_steps (since DP is effectively 1 here); see the sketch after this list.
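
Put together, a sketch of Goal A settings (illustrative values; pad_to_multiple_of is the padding knob the docs recommend, but check your TRL version for the exact argument name and location):

from trl import SFTConfig

training_args = SFTConfig(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective optimizer batch = 8, since dp_world_size = 1
    pad_to_multiple_of=4,            # keep sequence lengths divisible by sp_size
    # keep sp_attn_implementation="flash_attention_2" in the SP handler
)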

Goal B: increase throughput (more examples per optimizer step)

Use 2D parallelism (DP + SP) instead of pure SP.

Both Accelerate and TRL docs show that with 4 GPUs a common balanced setup is:

  • sp_size=2 and dp_shard_size=2 (or dp_replicate_size=2) so you get both longer context and more data parallelism. (Hugging Face)

This changes dp_world_size from 1 to 2, so steps per epoch drop and you actually process more unique examples per optimizer update.
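
A sketch of that 2D layout, reusing the same classes from your snippet:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    dp_shard_size=2,   # 2-way data parallelism -> dp_world_size = 2
    sp_size=2,         # 2-way sequence parallelism -> half the context per GPU
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)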


Two quick checks that make the behavior obvious

Check 1: local vs global token counting

Log these for one step:

  • local_tokens = inputs["attention_mask"].sum().item()
  • global_tokens = accelerator.gather_for_metrics(inputs["attention_mask"].sum()).sum().item() (GitHub)

If global_tokens ≈ 4 * local_tokens, your ~64k is just “summed across 4 ranks.”

Check 2: are you packing or truncating?

  • Confirm whether packing is enabled (packing=True can inflate per-step tokens). (Hugging Face)
  • Confirm what max_seq_length actually is in your run. TRL explicitly calls it the global sequence length before splitting, and gives the SP example (global 8192 with sp_size=2 means 4096 tokens per GPU). (Hugging Face)


Summary

  • Your step count is correct because dp_world_size = world_size // sp_size = 1, so you do not get fewer steps per epoch. (Hugging Face)

  • num_tokens is aggregated across ranks, so ~64k after one step can simply be “~16k per rank summed across 4.” (GitHub)

  • Decide your goal:

    • Long context: keep SP, use pad_to_multiple_of, FlashAttention, tune grad accumulation. (Hugging Face)
    • Throughput: switch to 2D DP+SP like sp_size=2, dp_shard_size=2. (Hugging Face)

Thanks @John6666, but you can assume I already asked the chatbots about it before posting here. Anyway, I did another analysis:

I ran my script once with

parallelism_config=ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=1,
    dp_shard_size=1,
    sp_handler=DeepSpeedSequenceParallelConfig(
        sp_seq_length_is_variable=True,
        sp_attn_implementation="flash_attention_2",
    ),
)
CUDA_VISIBLE_DEVICES=0 accelerate launch train_sft.py

and got:

{'loss': 0.4858, 'num_tokens': 15922}

which is exactly 1/4 of the value from the run with CUDA_VISIBLE_DEVICES=0,1,2,3!

I don't have multiple sequences of length 15,922.

Meaning that while running with sp_size=4, the same sequence is replicated to all devices and all computed losses/statistics etc. are ×4.

So I wonder whether I'm missing something or it's a bug…


Ah… sorry… :downcast_face_with_sweat:

Let’s continue the clarification. For now, there is a known bug report that looks similar (related to UlyssesSPDataLoaderAdapter), but your case looks different.

However, the meaning behind the numbers the library outputs seems inherently difficult to grasp, I think…


What you are seeing is consistent with how DeepSpeed Ulysses SP is integrated in Accelerate and Transformers. Your conclusion “the same sequence is replicated so compute is ×4” is the wrong mental model for SP. The batch contents can look replicated on all GPUs for a step, while the attention compute is sharded across GPUs.

Below is the clean model that matches all three facts you reported: (1) steps do not shrink, (2) num_tokens becomes exactly ×4, (3) you do not need to “have examples of length 15,922”.


1) What “Ulysses sequence parallelism” actually does

Ulysses SP is a long-context enabler. It partitions along the sequence dimension and uses collectives (all-to-all) during attention so multiple GPUs cooperate on one long sequence. This is the core idea in the official Accelerate docs and DeepSpeed tutorial. (Hugging Face)

Key detail that resolves your “replication” concern:

The DataLoader protocol is “one rank’s batch, all ranks cooperate”

The ALST integration paper (the one that upstreamed the HF Transformers integration) describes the DataLoader adapter behavior explicitly:

  • UlyssesSPDataLoaderAdapter shards each batch along sequence
  • then uses a single rank’s batch and processes it collaboratively across all ranks
  • and iterates over ranks, doing “one batch at a time using all ranks in parallel” (arXiv)

So yes, at a given optimizer step, all 4 GPUs can appear to “see the same sample”. That is expected for SP because they must cooperate on the same batch to compute attention correctly.

But that is not “4 independent full computations”. It is “1 computation spread across 4 GPUs, plus communication”.
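
A conceptual sketch of that protocol (not the real UlyssesSPDataLoaderAdapter code, just the shape of the iteration described in the ALST paper):

def ulysses_like_schedule(per_rank_batches, sp_world_size):
    # Each SP rank fetched its own batch from the DataLoader; the adapter then
    # iterates over those batches one at a time. For each batch, every rank works
    # on a different slice of the sequence dimension, and the ranks exchange
    # activations via all-to-all inside attention (not modeled here).
    for owner_rank in range(sp_world_size):
        batch = per_rank_batches[owner_rank]       # this step: same batch visible on all ranks
        shard_len = len(batch) // sp_world_size    # micro-sequence length per rank
        yield [batch[r * shard_len:(r + 1) * shard_len] for r in range(sp_world_size)]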


2) Why num_tokens becomes exactly ×4 in TRL SFTTrainer

This one is simpler: it is a logging artifact.

TRL’s trainers compute “tokens in batch” as something like:

  • inputs["attention_mask"].sum() (or position_ids fallback)
  • then they call accelerator.gather_for_metrics(...)
  • then they sum across processes (GitHub)

So if each of the 4 processes reports ~15,922 tokens locally for that step, the logged metric becomes:

  • 4 × 15,922 = 63,688

That matches your observation exactly.

Why local token counts can be identical on all ranks

With Ulysses SP, the framework often relies on position_ids rather than attention_mask for the SP path. Accelerate’s SP docs explicitly call out that this SP implementation uses position_ids. (Hugging Face)

That combination means:

  • the model can be doing correct SP sharding internally
  • while your attention_mask.sum() is still “global-looking” on every rank
  • and TRL then sums it across ranks for logging

So num_tokens is not a reliable indicator of “how many unique tokens were processed” under SP.


3) Why “Total optimization steps = 20,243” is not a bug

Transformers’ Trainer explicitly accounts for SP when computing the effective DP world size:

  • dp_world_size = world_size // (tp_size * cp_size * sp_size)
  • total batch for optimizer math uses that dp_world_size (Hugging Face)

In your run:

  • world_size = 4
  • sp_size = 4
  • so dp_world_size = 1

That makes “optimizer steps per epoch ≈ number of examples” plausible and often expected in SP-only layouts.

Separately, the Transformers DeepSpeed doc text for SP says your DataLoader should keep feeding different samples per rank, and the adapter handles distribution across SP ranks. (Hugging Face)
Combine that with the ALST paper’s “iterate over ranks, one batch at a time using all ranks”, and you get: you still cover the dataset, but you do not get DP throughput scaling. (arXiv)


4) Your “I don’t have sequences of length 15,922” concern

Three separate reasons this is not alarming:

  1. You might simply have that length somewhere. Exact lengths are easy to miss in a large dataset.

  2. Packing and truncation create lengths that are not “single-example lengths”. TRL explicitly distinguishes:

  • global sequence length (your configured max_seq_length or max_length)
  • micro sequence length per GPU after splitting (roughly global / sp_size) (Hugging Face)
    If packing is enabled anywhere, a “training sequence” can be a concatenation of multiple dataset rows.
  3. That 15,922 is a logging-derived number from the first step you looked at. Under SP, it is especially unsafe to interpret it as “one dataset example length”.

5) Is it ever a real bug? Yes, but it looks different

There is a real DeepSpeed issue report titled:

  • “UlyssesSPDataLoaderAdapter returns duplicate data” (GitHub)

That issue’s claim is stronger than what you observed. It reports each rank printing the same input IDs repeatedly (N times) when using the adapter. (GitHub)

So:

  • “batch looks identical across ranks for a given step” can be expected in SP
  • “the same batch repeats multiple times unexpectedly” could be a bug (and is being reported upstream)

6) What to do to confirm SP is actually working in your run

These checks avoid misleading metrics.

Check A: micro-sequence shape per rank

Log input_ids.shape[-1] on each rank for one step.

  • If SP is active, per-rank sequence length should look like “global_length / 4” (maybe with padding to be divisible).
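
For example, a minimal sketch (assuming you can run this wherever you have access to accelerator and the batch inputs):

def log_micro_seq_len(accelerator, inputs):
    # Under active SP, each rank should hold roughly global_length / sp_size tokens.
    print(f"rank {accelerator.process_index}: micro seq len = {inputs['input_ids'].shape[-1]}")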

Check B: memory and OOM boundary

Pick a global length that OOMs with sp_size=1.
Run the same config with sp_size=4.
If it fits, SP is doing the intended sharding.

Check C: stop summing tokens across ranks in your own logs

Instead of TRL’s gather_for_metrics(...).sum(), log one of:

  • rank0 local tokens only
  • mean tokens across ranks (sum / world_size)
    This makes the number stable across 1-GPU and 4-GPU runs.
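
A minimal sketch of the normalized version (helper name is mine):

def per_rank_mean_tokens(accelerator, inputs):
    # Average the per-rank token counts instead of summing them, so the number
    # is directly comparable between 1-GPU and 4-GPU runs.
    local = inputs["attention_mask"].sum()
    return (accelerator.gather_for_metrics(local).sum() / accelerator.num_processes).item()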


Bottom line

  • Seeing 63,688 = 4 × 15,922 does not prove duplicated training. It strongly suggests “token metric summed across ranks”. (GitHub)
  • In Ulysses SP, it is expected that “one batch is processed collaboratively across all ranks” at a step. That can look like replication, but it is the mechanism that enables long context. (arXiv)
  • Your optimizer step count is consistent with Trainer’s explicit dp_world_size formula that divides out sp_size. (Hugging Face)