Combining multiple IOB datasets with unique tokenization

HarryPotterPayan · December 13, 2025, 9:21am

I am trying to fine-tune a pretrained model for NER tasks for which I have multiple datasets from different sources. They are a mishmash of different types ie, some are of standoff format and some use the IOB format.

The problem is that even among the datasets that use the IOB format, the tokens seem to be tokenized in different ways. For example, the words “U.S.A” is tokenized as:

dataset_A = [‘U’, ‘.’, ‘S’, ‘.’, ‘A’]
dataset_B = [‘U.’, ‘S.’, ‘A’]

As you can see, there is a big discrepancy here. How do I deal with this if I want to use both datasets to fine tune my model without any conflicts

John6666 · December 13, 2025, 12:57pm

Don’t expect tokenization to be consistent or universal. The safest approach is to use the tokenizer provided with the model.

So, while it may seem like a detour, converting back to words and re-tokenizing is the most reliable method.

AmosTipton · December 14, 2025, 4:52pm

You’ve identified the core issue correctly: IOB labels are defined over tokens, not text, so once tokenization diverges, the labels are no longer comparable.

There are essentially three safe strategies, depending on how much control you want.

1. Canonicalize tokenization first (recommended)

Pick one tokenizer (ideally the pretrained model’s tokenizer) as the source of truth and re-tokenize all datasets to it.

Practical approach:

Convert all datasets (IOB + standoff) into a common character-span representation first.
Then re-tokenize the raw text using your chosen tokenizer.
Re-project labels from spans → tokens.

This avoids heuristics like splitting/merging tokens after the fact, which tends to silently corrupt labels.

2. Use alignment with offsets (acceptable, but brittle)

If you already have offset mappings:

Tokenize with return_offsets_mapping=True
Align IOB tags by matching character spans to token offsets

This works, but you’ll still need to handle edge cases like punctuation splits (U.S.A vs U . S . A) carefully.

3. Don’t mix tokenizations at all (simplest)

If the datasets are small or heterogeneous:

Train separate adapters / heads per tokenization
Or fine-tune on one dataset at a time

This avoids conflicts but limits cross-dataset learning.

Important rule of thumb:

If two datasets disagree on token boundaries, fix it before training. Trying to “average it out” during training usually just teaches the model inconsistent supervision.

Your instinct is right — this isn’t a modeling problem, it’s a data contract problem.

Happy to clarify further if you want to share which model/tokenizer you’re targeting.

Topic		Replies	Views
IOB tagging for NER / DatasetDict Beginners	1	963	November 20, 2023
Tokenization in a NER context 🤗Tokenizers	5	5883	August 11, 2021
NER model fine tuning with labeled spans Beginners	5	4065	May 7, 2023
Token Classification Label order Intermediate	0	583	November 11, 2022
How to deal with differences between CoNLL 2003 dataset tokenisation and BER tokeniser when fine tuning NER model? Intermediate	6	2847	November 23, 2021

Combining multiple IOB datasets with unique tokenization

Related topics