YAML Metadata Warning: empty or missing yaml metadata in repo card (https://huggingface.co/docs/hub/model-cards#model-card-metadata)

Tibetan-Chinese Embedding Model (Based on CINO)

📌 Model Summary

This model is a specialized embedding model optimized for Tibetan (Bo) and Chinese (Zh) semantic similarity, retrieval (RAG), and bitext alignment tasks.

It is fine-tuned based on CINO (CINO-Large/Base-v2), utilizing a two-stage contrastive learning strategy. The model significantly outperforms general multilingual models (like Qwen-Embedding) in distinguishing semantic nuances in Tibetan, achieving high-contrast representations.

  • Base Model: hfl/cino-large-v2 (or base)
  • Languages: Tibetan, Chinese
  • Task: Semantic Search, Text Clustering, Bitext Mining
  • Max Sequence Length: 128 (Optimized) / 512 (Max)

🚀 Usage

You can use this model easily with sentence-transformers.

from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer("your-username/cino-tibetan-embedding")

# Queries (Tibetan)
sentences = [
    "ང་ལ་ཀུ་ཤུ་རྒྱ་མ་གཉིས་དང་གཡག་ཤ་རྒྱ་མ་གང་ཉོ་རྒྱུ་ཡོད།",  # I want to buy 2 jin of apples and 1 jin of yak meat.
    "བོད་ལྗོངས་ནི་མཛེས་སྡུག་ལྡན་པའི་ས་ཆ་ཞིག་རེད།"             # Tibet is a beautiful place.
]

# Encoding
embeddings = model.encode(sentences)

# Compute Similarity
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {score.item():.4f}")

🛠️ Training Process

To address the scarcity of Tibetan semantic data and the "anisotropy" problem of base models, we adopted a Two-Stage Training Pipeline:

Stage 1: Supervised Bitext Alignment (Knowledge Distillation)

  • Goal: Align the Tibetan vector space with the mature Chinese semantic space.
  • Data Source: ~100k Chinese-Tibetan parallel translation pairs.
  • Method:
    • We utilized Chinese as the "Anchor" to pull the corresponding Tibetan sentences closer.
    • Loss Function: MultipleNegativesRankingLoss (In-batch negatives).
  • Outcome: The model learned deep semantic equivalence (e.g., "Shorts" $\approx$ "Clothes") rather than just lexical matching.

Stage 2: Hard Negative Mining (Discriminative Refinement)

  • Goal: Fix "Structural Overfitting" where the model gives high scores to sentences with identical sentence structures but different entities (e.g., buying apples vs. buying meat).
  • Data Construction:
    • We used the Stage 1 model to mine the dataset.
    • Triplets: (Anchor, Positive, Hard Negative)
    • Selection Logic: Selected sentences that were incorrect translations but had high similarity scores (>0.7) in Stage 1.
  • Outcome: Successfully suppressed "semantic hallucinations" caused by structural similarity.

📊 Evaluation & Comparison: Ours vs. Qwen-Embedding

We compared the discriminative power of this model against Qwen-Embedding-4B (Int8) using difficult semantic traps.

Test Case: "The Shopping Trap"

  • Query: "I want to buy 2 jin of apples and 1 jin of yak meat."
  • Candidate 1 (Correct): "Please give me 2 jin of apples and 1 jin of beef." (Paraphrased)
  • Candidate 2 (Trap): "I want to buy 2 jin of mutton and 1 jin of butter." (Identical structure, different entities)

Results

Model Correct Pair Score Trap Pair Score Contrast (Gap) Analysis
Qwen-Embedding 0.69 0.65 +0.04 Low Contrast. The model is "confused". It sees both sentences as roughly related to "buying food" and fails to penalize the wrong entities significantly.
Ours (CINO-FT) 0.90 0.89* High Confidence. The model correctly identifies the semantic match with high confidence (0.90).

> Note: While the Trap score (0.89) is still relatively high due to extreme structural overlap, the model successfully ranks the Correct Pair higher (0.90) and maintains a massive gap against irrelevant sentences (<0.15), whereas Qwen often gives >0.4 to irrelevant text.

General Performance

  • Semantic Paraphrasing: Our model achieves >0.85 similarity for paraphrased Tibetan sentences (e.g., changing "Yak meat" to "Beef").
  • Irrelevant Text: Pushed down to <0.15, creating a clean, high-contrast vector space suitable for Reinforcement Learning (RL) rewards and RAG.

⚠️ Limitations

  • Structural Bias: In extremely rare cases where two sentences have identical grammatical structures and function words (80%+ token overlap) but different nouns, the model may still assign a high similarity score (e.g., 0.85+). However, correct matches are consistently ranked higher.
  • Domain: Trained primarily on general domain and news corpora. Performance on specialized domains (e.g., ancient Buddhist scriptures) may vary.

📜 License

This model is licensed under Apache 2.0.


🤝 Acknowledgement

Downloads last month
7
Safetensors
Model size
0.4B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support