MatSciBERT: Chemical Named Entity Recognition

Fine-tuned m3rg-iitd/matscibert on the CHEMDNER corpus for chemical named entity recognition (NER) in biomedical and scientific text.

The model identifies chemical compound names, drug names, and chemical formulas in free text and marks them as CHEM spans.


Model Description

Property       Value
Base model     m3rg-iitd/matscibert (BERT-base, domain pre-trained on 2M+ materials science papers)
Task           Token classification (NER)
Labels         O, B-CHEM, I-CHEM
Training data  CHEMDNER corpus via kjappelbaum/chemnlp-chemdner
Framework      Hugging Face Transformers + Trainer API
Hardware       NVIDIA Quadro P2000 (4 GB VRAM)

Labels

Label   Description                     Example
O       Outside (not a chemical)        reacts, with, is
B-CHEM  Beginning of a chemical entity  nitric (start of "nitric oxide")
I-CHEM  Inside a chemical entity        oxide (continuation of "nitric oxide")

With aggregation enabled, the pipeline merges consecutive B-CHEM/I-CHEM tokens into single CHEM spans.
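
For intuition, a minimal re-implementation of that merge step (illustrative only, not the pipeline's internal code):

def merge_bio(tokens, labels):
    """Merge runs of B-CHEM/I-CHEM tokens into (text, "CHEM") spans."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-CHEM":  # a new entity starts; close any open one
            if current:
                spans.append((" ".join(current), "CHEM"))
            current = [token]
        elif label == "I-CHEM" and current:  # entity continues
            current.append(token)
        else:  # "O" closes any open entity
            if current:
                spans.append((" ".join(current), "CHEM"))
            current = []
    if current:  # flush an entity that runs to the end of the sentence
        spans.append((" ".join(current), "CHEM"))
    return spans

# merge_bio(["nitric", "oxide", "reacts"], ["B-CHEM", "I-CHEM", "O"])
# -> [("nitric oxide", "CHEM")]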


Evaluation Results

Evaluated on the CHEMDNER validation set (~6,808 examples) using seqeval (entity-span level):

Metric            Score
F1                0.9146
Precision         0.9075
Recall            0.9219
Accuracy (token)  0.9927
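
For reference, span-level scoring with seqeval looks like this (toy labels shown; this is not the actual evaluation script):

from seqeval.metrics import f1_score, precision_score, recall_score

# seqeval expects nested lists: one inner list of BIO tags per sentence.
y_true = [["O", "B-CHEM", "I-CHEM", "O", "B-CHEM"]]
y_pred = [["O", "B-CHEM", "I-CHEM", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))  # 1.0  (1 of 1 predicted span correct)
print("recall:   ", recall_score(y_true, y_pred))     # 0.5  (1 of 2 gold spans found)
print("f1:       ", f1_score(y_true, y_pred))         # ~0.667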

Usage

With pipeline

from transformers import pipeline

ner = pipeline(
    "ner",
    model="teman67/matscibert-chem-ner",
    aggregation_strategy="simple",
)

text = "Nitric oxide reacts with oxygen to form nitrogen dioxide."
results = ner(text)

for entity in results:
    print(f"{entity['word']:<25} {entity['entity_group']}  ({entity['score']:.1%})")

Output:

nitric oxide              CHEM  (100.0%)
oxygen                    CHEM  (100.0%)
nitrogen dioxide          CHEM  (99.9%)

With AutoModel

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("teman67/matscibert-chem-ner")
model = AutoModelForTokenClassification.from_pretrained("teman67/matscibert-chem-ner")

inputs = tokenizer("Aspirin inhibits COX-1 and COX-2 enzymes.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

predictions = outputs.logits.argmax(-1)  # best label id per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Print one label per word-initial token, skipping special tokens and "##" subword pieces
for token, label in zip(tokens, labels):
    if not token.startswith("##") and token not in ("[CLS]", "[SEP]"):
        print(f"{token:<20} {label}")
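
Note that the filter above drops "##" subword pieces entirely, so multi-piece words print only their first fragment. A sketch of reassembling whole words with the fast tokenizer's word_ids(), continuing the snippet above (requires Python 3.9+ for str.removeprefix):

# Map each token back to its source word; None marks special tokens.
word_ids = inputs.word_ids(0)
words = {}
for idx, wid in enumerate(word_ids):
    if wid is None:
        continue
    piece = tokens[idx].removeprefix("##")
    if wid in words:
        words[wid] = (words[wid][0] + piece, words[wid][1])  # append subword piece
    else:
        words[wid] = (piece, labels[idx])  # label of the first subword wins

for word, label in words.values():
    print(f"{word:<20} {label}")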

Training Details

Hyperparameter        Value
Epochs                5
Batch size            8
Learning rate         2e-5
Weight decay          0.01
Warmup ratio          0.1
Max sequence length   128
Optimiser             AdamW
LR schedule           Linear decay with warmup
Best model selection  Highest validation F1
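
Expressed through the Trainer API, these settings look roughly like the following (a sketch: output_dir and the per-epoch evaluation/save cadence are assumptions, not taken from the training script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="matscibert-chem-ner",  # placeholder path (assumption)
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=2e-5,                # paired with AdamW, the Trainer default optimiser
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",        # linear decay after warmup
    evaluation_strategy="epoch",       # assumption: evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # keep the checkpoint with the highest validation F1
)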

Splits used from the CHEMDNER corpus:

Split       Examples
Train       6,796
Validation  6,808
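
For reproduction, the dataset can be pulled from the Hub with the datasets library (split names and column layout are not documented here, so inspecting the object first is the safest move):

from datasets import load_dataset

# Load the CHEMDNER-derived dataset used for fine-tuning.
dataset = load_dataset("kjappelbaum/chemnlp-chemdner")
print(dataset)  # shows the available splits and their column names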

Training Code

Source code available at: github.com/teman67/Fine-tuning-Materials-Scientific-NER-


Intended Use & Limitations

Intended for:

  • Extracting chemical and drug names from biomedical literature
  • Pre-processing step for downstream chemistry/materials science NLP tasks
  • Scientific text mining pipelines

Limitations:

  • Trained on PubMed/patent text; performance may degrade on very different domains
  • Recognises a single entity type (CHEM) and does not distinguish subtypes (drugs, elements, formulas, etc.)
  • Sequences longer than 128 WordPiece tokens are truncated, so chemicals late in long passages may be missed (a chunking workaround is sketched after this list)
  • Entities are matched on surface form, so ambiguous terms (e.g. "Mercury" the planet vs. the element) are always tagged as CHEM
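
One simple mitigation for the truncation limit is to split long passages into sentences before tagging. A minimal sketch using a naive regex splitter (a proper sentence segmenter for scientific text would do better):

import re
from transformers import pipeline

ner = pipeline(
    "ner",
    model="teman67/matscibert-chem-ner",
    aggregation_strategy="simple",
)

long_text = "Nitric oxide reacts with oxygen. Aspirin inhibits COX-1."  # stand-in for a long passage

entities = []
# Naive split on sentence-final punctuation; each piece stays well under 128 tokens.
for sentence in re.split(r"(?<=[.!?])\s+", long_text):
    entities.extend(ner(sentence))

for e in entities:
    print(f"{e['word']:<25} {e['entity_group']}  ({e['score']:.1%})")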

References

  • MatSciBERT: Gupta et al., "MatSciBERT: A materials domain language model for text mining and information extraction", npj Computational Materials, 2022. doi:10.1038/s41524-022-00784-w
  • CHEMDNER: Krallinger et al., "The CHEMDNER corpus of chemicals and drugs and its annotation principles", Journal of Cheminformatics, 2015. doi:10.1186/1758-2946-7-S1-S2

License

MIT
