I am trying to fine-tune a pretrained model for NER, for which I have multiple datasets from different sources. They are a mishmash of different annotation formats, i.e., some use the standoff format and some use the IOB format.
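For the standoff datasets, my plan is to convert everything to IOB first. Here is a rough sketch of what I mean; it assumes my standoff annotations are character-offset spans `(start, end, label)`, and the whitespace tokenization is just a placeholder for whatever tokenizer I end up standardizing on:

```python
import re

def standoff_to_iob(text, spans):
    """Convert character-offset spans [(start, end, label), ...] to IOB-tagged tokens."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    iob = []
    for tok, start, end in tokens:
        tag = "O"
        for s, e, label in spans:
            # token falls inside an annotated span
            if start >= s and end <= e:
                tag = ("B-" if start == s else "I-") + label
                break
        iob.append((tok, tag))
    return iob

# standoff_to_iob("Born in New York", [(8, 16, "LOC")])
# -> [("Born", "O"), ("in", "O"), ("New", "B-LOC"), ("York", "I-LOC")]
```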
The problem is that even among the datasets that use the IOB format, the text has been tokenized in different ways. For example, the word "U.S.A" is tokenized as:
```python
dataset_A = ['U', '.', 'S', '.', 'A']
dataset_B = ['U.', 'S.', 'A']
```
As you can see, there is a big discrepancy here. How do I deal with this if I want to use both datasets to fine-tune my model without any conflicts?
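One idea I had is to ignore the original tokenization entirely: pass the pre-split tokens to the model's tokenizer with `is_split_into_words=True` and align the IOB labels to the resulting subwords via `word_ids()`. A rough sketch of what I mean, assuming a HuggingFace fast tokenizer (`bert-base-cased` is just a placeholder checkpoint, and copying each word's tag to all of its subwords is a simplification):

```python
from transformers import AutoTokenizer

# placeholder checkpoint -- any fast tokenizer exposes word_ids()
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def encode_with_labels(words, iob_labels):
    """Re-tokenize pre-split words and copy each word's IOB label to its subwords."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned = []
    for word_id in encoding.word_ids():
        if word_id is None:           # special tokens like [CLS]/[SEP]
            aligned.append(-100)      # -100 is ignored by the loss
        else:
            aligned.append(iob_labels[word_id])
    return encoding, aligned

# my hope is that both pre-tokenizations of "U.S.A" end up on the same
# subword units, since BERT-style tokenizers split punctuation anyway
enc_a, lab_a = encode_with_labels(['U', '.', 'S', '.', 'A'], [1, 1, 1, 1, 1])
enc_b, lab_b = encode_with_labels(['U.', 'S.', 'A'], [1, 1, 1])
```

Would this actually make the two datasets consistent, or am I missing something about how the B-/I- prefixes should be handled when a word gets split across subwords?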