# MatSciBERT: Chemical Named Entity Recognition
Fine-tuned [`m3rg-iitd/matscibert`](https://huggingface.co/m3rg-iitd/matscibert) on the CHEMDNER corpus for chemical named entity recognition (NER) in biomedical and scientific text.
The model identifies spans of chemical compound names, drug names, and chemical formulas in free text.
## Model Description
| Property | Value |
|---|---|
| Base model | m3rg-iitd/matscibert (BERT-base, domain pre-trained on 2M+ materials science papers) |
| Task | Token classification (NER) |
| Labels | `O`, `B-CHEM`, `I-CHEM` |
| Training data | CHEMDNER corpus via kjappelbaum/chemnlp-chemdner |
| Framework | HuggingFace Transformers + Trainer API |
| Hardware | NVIDIA Quadro P2000 (4 GB VRAM) |
## Labels
| Label | Description | Example |
|---|---|---|
| `O` | Outside (not a chemical) | reacts, with, is |
| `B-CHEM` | Beginning of a chemical entity | nitric (start of "nitric oxide") |
| `I-CHEM` | Inside a chemical entity | oxide (continuation of "nitric oxide") |
After aggregation, the pipeline merges consecutive `B-CHEM`/`I-CHEM` tokens into single `CHEM` spans.
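For intuition, here is a minimal sketch of that merging logic on hand-written tags (illustrative only; the pipeline's `aggregation_strategy="simple"` does this internally, plus score averaging):

```python
# Minimal sketch of BIO-to-span merging; tags are hand-written, not model output.
tokens = ["Nitric", "oxide", "reacts", "with", "oxygen"]
tags = ["B-CHEM", "I-CHEM", "O", "O", "B-CHEM"]

spans, current = [], []
for token, tag in zip(tokens, tags):
    if tag == "B-CHEM":                # a new entity starts here
        if current:
            spans.append(" ".join(current))
        current = [token]
    elif tag == "I-CHEM" and current:  # extend the open entity
        current.append(token)
    else:                              # O closes any open entity
        if current:
            spans.append(" ".join(current))
        current = []
if current:
    spans.append(" ".join(current))

print(spans)  # ['Nitric oxide', 'oxygen']
```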
## Evaluation Results
Evaluated on the CHEMDNER validation set (~6,808 examples) with `seqeval`; scores are at the entity-span level:
| Metric | Score |
|---|---|
| F1 | 0.9146 |
| Precision | 0.9075 |
| Recall | 0.9219 |
| Accuracy (token) | 0.9927 |
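Note that `seqeval` scores whole entity spans, not individual tokens: a prediction counts as correct only if its boundaries and type match the gold annotation exactly. A small illustration (hand-made tags, not real model output):

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Gold: "nitric oxide" (2 tokens) and "oxygen"; the prediction truncates the first span.
y_true = [["B-CHEM", "I-CHEM", "O", "O", "B-CHEM"]]
y_pred = [["B-CHEM", "O",      "O", "O", "B-CHEM"]]

print(precision_score(y_true, y_pred))  # 0.5 -> 1 of 2 predicted spans is exact
print(recall_score(y_true, y_pred))     # 0.5 -> 1 of 2 gold spans is found
print(f1_score(y_true, y_pred))         # 0.5
```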
## Usage

### With pipeline
```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="teman67/matscibert-chem-ner",
    aggregation_strategy="simple",
)

text = "Nitric oxide reacts with oxygen to form nitrogen dioxide."
results = ner(text)

for entity in results:
    print(f"{entity['word']:<25} {entity['entity_group']} ({entity['score']:.1%})")
```
Output:

```
nitric oxide              CHEM (100.0%)
oxygen                    CHEM (100.0%)
nitrogen dioxide          CHEM (99.9%)
```
### With AutoModel
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("teman67/matscibert-chem-ner")
model = AutoModelForTokenClassification.from_pretrained("teman67/matscibert-chem-ner")

inputs = tokenizer("Aspirin inhibits COX-1 and COX-2 enzymes.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token
predictions = outputs.logits.argmax(-1)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    # Skip subword pieces and special tokens for a cleaner printout
    if not token.startswith("##") and token not in ("[CLS]", "[SEP]"):
        print(f"{token:<20} {label}")
## Training Details
| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Batch size | 8 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 0.1 |
| Max sequence length | 128 |
| Optimiser | AdamW |
| LR schedule | Linear decay with warmup |
| Best model selection | Highest validation F1 |
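For reference, these settings map onto the Trainer API roughly as follows (a sketch, not the exact training script; `output_dir` is a placeholder, and older Transformers releases spell `eval_strategy` as `evaluation_strategy`):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="matscibert-chem-ner",  # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,       # reload the best checkpoint at the end
    metric_for_best_model="f1",        # "best" = highest validation F1
)
# AdamW with linear decay + warmup is the Trainer default, matching the table.
```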
Training data split from the CHEMDNER corpus:
| Split | Examples |
|---|---|
| Train | 6,796 |
| Validation | 6,808 |
### Training Code

Source code is available at [github.com/teman67/Fine-tuning-Materials-Scientific-NER-](https://github.com/teman67/Fine-tuning-Materials-Scientific-NER-).
## Intended Use & Limitations
Intended for:
- Extracting chemical and drug names from biomedical literature
- Pre-processing step for downstream chemistry/materials science NLP tasks
- Scientific text mining pipelines
Limitations:
- Trained on PubMed/patent text; performance may degrade on very different domains
- Recognises a single entity type (`CHEM`); does not distinguish subtypes (drugs, elements, formulas, etc.)
- Sentences longer than 128 WordPiece tokens are truncated, so chemicals near the end of long passages may be missed (see the sketch after this list)
- Entities are matched by surface form; ambiguous terms (e.g. "Mercury" the planet vs. the element) are always tagged as `CHEM`
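One way to work around the 128-token limit is to split long passages before tagging. A minimal sketch, where the regex sentence splitter is a naive assumption (substitute a proper segmenter for real pipelines):

```python
import re

def tag_long_text(ner, text):
    """Tag a long passage sentence by sentence to stay under the length limit."""
    # Naive split on ., !, ? followed by whitespace; assumption for illustration.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    entities = []
    for sentence in sentences:
        entities.extend(ner(sentence))
    return entities
```

Recent Transformers releases also accept a `stride` argument on the token-classification pipeline, which runs overlapping windows over long inputs automatically.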
## References
- MatSciBERT: Gupta et al., "MatSciBERT: A materials domain language model for text mining and information extraction", npj Computational Materials, 2022. doi:10.1038/s41524-022-00784-w
- CHEMDNER: Krallinger et al., "The CHEMDNER corpus of chemicals and drugs and its annotation principles", Journal of Cheminformatics, 2015. doi:10.1186/1758-2946-7-S1-S2
## License
MIT