Combining multiple IOB datasets with unique tokenization

I am trying to fine-tune a pretrained model for NER, and I have multiple datasets from different sources. They are a mishmash of different types, i.e., some use the standoff format and some use the IOB format.

The problem is that even among the datasets that use the IOB format, the text seems to be tokenized in different ways. For example, the word “U.S.A” is tokenized as:

dataset_A = ['U', '.', 'S', '.', 'A']
dataset_B = ['U.', 'S.', 'A']

As you can see, there is a big discrepancy here. How do I deal with this if I want to use both datasets to fine-tune my model without any conflicts?


Don’t expect tokenization to be consistent or universal. The safest approach is to use the tokenizer provided with the model.

So, while it may seem like a detour, converting back to words and re-tokenizing is the most reliable method.
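
Here is a minimal sketch of one common way to do this with a Hugging Face fast tokenizer (the bert-base-cased checkpoint and the tags below are illustrative, not from your datasets): pass the dataset's existing token split, let the model's tokenizer re-split it into subwords, and copy each IOB tag onto the first subword of its source token.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # illustrative checkpoint

# Tokens and IOB tags as they might appear in dataset_A.
words = ["U", ".", "S", ".", "A", "is", "big"]
tags  = ["B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "O", "O"]

# Re-tokenize with the model's own tokenizer, keeping the mapping
# from each subword piece back to the original token.
encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned = []
prev_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:               # special tokens like [CLS] / [SEP]
        aligned.append(-100)          # -100 is ignored by the loss
    elif word_id != prev_word_id:     # first piece of an original token
        aligned.append(tags[word_id])
    else:                             # continuation piece of the same token
        aligned.append(-100)
    prev_word_id = word_id

print(list(zip(encoding.tokens(), aligned)))

If a dataset stores each word pre-split (as in dataset_A above), passing the split directly with is_split_into_words=True effectively does the “converting back to words” step: each original token is treated as a word, and the model's tokenizer then decides the final subword pieces, so both datasets end up tokenized the same way.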