	---
	license: mit
	language:
	- en
	tags:
	- text-embeddings
	- telecom
	- domain-adaptation
	- triplet-loss
	- transformer
	- semantic-search
	- sentence-transformers
	- domain-specific
	- contrastive-learning
	- simcse
	- bio-bert
	- don’t-stop-pretraining

	metrics:
	- name: Telecom Triplet Score
	  type: accuracy
	  value: 0.9380
	  verified: false
	- name: Average MTEB Score
	  type: accuracy
	  value: 0.825
	  verified: false
	- name: Average STS Score
	  type: spearman
	  value: 82.19
	  verified: false
	- name: AllNLI Triplet Score
	  type: accuracy
	  value: 0.6150
	  verified: false
	base_model:
	- Alibaba-NLP/gte-Qwen2-1.5B-instruct
	model-index:
	- name: T-VEC
	  results:
	  - task:
	      type: text-embedding
	      name: Telecom Triplet Benchmark
	    dataset:
	      type: custom
	      name: Telecom Triplet Benchmark
	    metrics:
	    - name: Telecom Triplet Score
	      type: accuracy
	      value: 0.9380
	      verified: false
	  - task:
	      type: text-embedding
	      name: MTEB Benchmark
	    dataset:
	      type: openai_humaneval
	      name: MTEB Benchmark
	    metrics:
	    - name: Average MTEB Score
	      type: accuracy
	      value: 0.825
	      verified: false
	  - task:
	      type: text-embedding
	      name: STS Benchmark
	    dataset:
	      type: openai_humaneval
	      name: STS Benchmark
	    metrics:
	    - name: Average STS Score
	      type: spearman
	      value: 82.19
	      verified: false
	  - task:
	      type: text-embedding
	      name: AllNLI Triplet
	    dataset:
	      type: openai_humaneval
	      name: AllNLI Triplet
	    metrics:
	    - name: Triplet Score
	      type: accuracy
	      value: 0.6150
	      verified: false

	extra_gated_prompt: "Please provide answers to the below questions to gain access to the model"
	extra_gated_fields:
	  Company: text
	  Full Name: text
	  Email: text
	  I want to use this model for:
	    type: select
	    options:
	    - Research
	    - Education
	    - Commercial
	    - label: Other
	      value: other
	---


	# T-VEC: A Telecom-Specific Text Embedding Model

	## Overview

	T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from [Alibaba-NLP/gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct). Trained with a deeply supervised triplet-loss objective, it learns semantic representations tailored to telecom use cases. It reaches 0.9380 accuracy on a custom telecom triplet benchmark while remaining competitive on standard benchmarks such as MTEB and STS.

	## Model Details

	- Model Name: T-VEC
	- Developer: [NetoAI](https://www.netoai.ai)
	- Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
	- Parameters: 1.5 Billion
	- Embedding Dimension: 1536
	- Max Input Tokens: 32,000
	- Languages: Multilingual (optimized for English)
	- License: MIT
	- Tokenizer: Custom telecom-specific tokenizer (open-source)

	## Intended Uses

	- Semantic search over telecom documents (3GPP standards, vendor manuals)
	- Fault log analysis for root-cause detection
	- Telecom-specific chatbots and Q&A systems
	- Regulatory compliance analysis and semantic auditing
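
As a minimal sketch of the semantic-search use case, the snippet below ranks documents by cosine similarity to a query. The three-dimensional vectors and document names are made-up placeholders standing in for 1536-dimensional T-VEC embeddings, and `rank_by_cosine` is a hypothetical helper, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption): real inputs would be T-VEC embeddings.
docs = {
    "handover_procedure": [0.9, 0.1, 0.0],
    "billing_policy": [0.0, 0.2, 0.9],
    "cell_reselection": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_cosine(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

For large collections, the same ranking is usually delegated to a vector index rather than computed pairwise in Python.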

	## Training Details

	- Objective: Triplet loss using cosine similarity
	- Dataset: 100k+ telecom triplets curated by domain experts over 3 months
	- Layer Modification: 338 transformer layers fine-tuned
	- Avg. L2 Norm Weight Change: 0.7735
	- Enhancements: Telecom-specific tokenizer and query-aware anchor strategies
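
The objective above can be sketched in plain Python. For an anchor `a`, a positive `p`, and a negative `n`, a cosine triplet loss is commonly written as `max(0, margin + cos(a, n) - cos(a, p))`. The margin value and the toy vectors below are assumptions for illustration; this card does not state the margin or the exact formulation used in training.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that reaches zero once the positive beats the negative by `margin`."""
    gap = cosine(anchor, positive) - cosine(anchor, negative)
    return max(0.0, margin - gap)

# Toy 2-D vectors (assumption), standing in for real embeddings.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # semantically close to the anchor
negative = [0.0, 1.0]   # unrelated to the anchor

easy = cosine_triplet_loss(anchor, positive, negative)
hard = cosine_triplet_loss(anchor, negative, positive)  # roles swapped on purpose
print(easy, hard)
```

A well-separated triplet contributes no loss, while the swapped triplet is penalized by the margin plus the similarity gap. In actual training the same expression would be computed on model outputs inside an autograd framework.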

	## Evaluation Results

	| Benchmark | Metric | Score |
	|-----------------------------|----------------------|--------|
	| Telecom Triplet Benchmark | Accuracy | 0.9380 |
	| MTEB Benchmark | Accuracy | 0.825 |
	| STS Benchmark | Spearman Correlation | 82.19 |
	| AllNLI Triplet | Accuracy | 0.6150 |
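
Triplet accuracy is typically the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. The sketch below assumes cosine similarity as the comparison and uses made-up vectors; the exact evaluation protocol behind the scores above is not specified in this card.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy triplets (assumption): three are ranked correctly, one is not.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),   # negative is closer: counted as wrong
    ([0.5, 0.5], [0.6, 0.4], [-1.0, 0.0]),
]

accuracy = triplet_accuracy(toy_triplets)
print(accuracy)
```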

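	T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance. The table below lists per-task scores on selected MTEB tasks for T-VEC, its base model, and several general-purpose embedding models.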


	| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
	|--------------------------------|---------|--------------|-------------|------------|------------|------------|------------|--------------|
	| gte-Qwen2-1.5B-instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
	| T-VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
	| all-MiniLM-L6-v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
	| all-mpnet-base-v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
	| bge-base-en-v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
	| e5-base-v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
	| jina-embeddings-v2-base-en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
	| instructor-xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
	| gte-base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
	| multilingual-e5-base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |
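
STS tasks are usually scored by the Spearman rank correlation between a model's cosine similarities and human similarity ratings. The sketch below implements that correlation in plain Python with the rank-difference formula, which assumes no tied values. The scores are made-up examples, not data from any benchmark.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation via the rank-difference formula (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical values for five sentence pairs.
model_scores = [0.91, 0.35, 0.62, 0.10, 0.78]   # cosine similarities from a model
human_scores = [4.8, 2.5, 2.0, 0.5, 4.0]        # gold similarity ratings

rho = spearman(model_scores, human_scores)
print(rho)
```

Because only the ordering matters, the correlation is unaffected by the scale of either score list. Real evaluations with tied ratings need a tie-aware implementation.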


	![image/png](https://huggingface.co/static-proxy/cdn-uploads.huggingface.co/production/uploads/66fa4fb0ec6983f03c2b1ca2/oIX2bc76Er4TDd5eZCb_C.png)



	## Limitations

	- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
	- Large size may impact deployment on edge devices
	- May miss recent telecom developments outside the training set
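
To put the deployment limitation in numbers, the back-of-the-envelope estimate below uses the 1.5 billion parameters and 1536-dimensional embeddings listed under Model Details. The byte sizes (2 for float16, 4 for float32) are standard, but the one-million-vector index is a hypothetical example, and activation memory is ignored.

```python
PARAMS = 1_500_000_000   # parameter count from Model Details
EMBED_DIM = 1536         # embedding dimension from Model Details

def gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / 1024**3

weights_fp16 = gib(PARAMS * 2)                 # 2 bytes per float16 weight
weights_fp32 = gib(PARAMS * 4)                 # 4 bytes per float32 weight
index_1m = gib(1_000_000 * EMBED_DIM * 4)      # 1M float32 vectors (hypothetical corpus)

print(f"weights, float16: {weights_fp16:.2f} GiB")
print(f"weights, float32: {weights_fp32:.2f} GiB")
print(f"1M-vector index:  {index_1m:.2f} GiB")
```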

	## Ethical Considerations

	- Use in critical telecom systems should be validated by domain experts
	- May reflect terminology biases from dominant vendors in the dataset
	- Open licensing (MIT) supports transparency and community contributions

	## Usage

	### Installation

	```bash
	pip install transformers torch
	```

	### Load and Run

	```python
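	from transformers import AutoModel, AutoTokenizer
	import torch

	# The repository is gated: request access and authenticate (e.g. `huggingface-cli login`) first.
	model = AutoModel.from_pretrained("netoai/t-vec")
	tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
	model.eval()

	texts = ["5G NR architecture", "LTE handover", "Core network functions"]
	inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

	# Run inference without tracking gradients.
	with torch.no_grad():
	    hidden = model(**inputs).last_hidden_state

	# Mean-pool over real tokens only, so padding does not skew shorter texts.
	mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
	emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

	# Cosine similarity between the first text and each of the others.
	cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
	print(cos_sim)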
	```
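
When a batch is padded, averaging over every position lets padding tokens dilute the embeddings of shorter texts. A minimal plain-Python sketch of mask-aware mean pooling follows. The token vectors and mask are toy values, and `masked_mean` is a hypothetical helper rather than a `transformers` function.

```python
def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by two padding positions (toy values).
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(tokens, mask)
# Naive mean over all positions, including padding, for comparison.
naive = [sum(vec[i] for vec in tokens) / len(tokens) for i in range(2)]
print(pooled, naive)
```

The two results differ whenever padding positions carry non-zero hidden states, which is why the attention mask should enter the pooling step.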

	## Citation

	```bibtex
	@article{ethiraj2025tvec,
	title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
	author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
	journal={arXiv preprint},
	year={2025},
	url={https://arxiv.org/abs/2504.16460}
	}
	```

	## References
	- Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
	- Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
	- Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
	- Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
	- Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
	- Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
	- Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
	- Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
	- Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.


	## Contact
	- For questions or contributions, visit https://www.netoai.ai.