SerendipLLM V2 πŸ‡±πŸ‡°

The largest Sinhala instruction-following language model, trained on 309,328 examples

SerendipLLM V2 is a Sinhala language model specialized for news classification, question answering, and general Sinhala text generation. Built on Llama-3-8B with continued pre-training (CPT) and instruction fine-tuning, it represents a significant advancement in Sinhala NLP.

πŸ† Key Achievements

  • βœ… 6.2x larger dataset than existing Sinhala models (309K vs ~50K examples)
  • βœ… 45,080 news classification examples for specialized Sinhala news categorization
  • βœ… 50% training loss reduction (0.54 β†’ 0.27) over 3 epochs
  • βœ… Comprehensive training on diverse Sinhala tasks
  • βœ… Open-source - Complete pipeline and dataset available

πŸ“Š Model Details

Attribute           Value
Base Model          Meta Llama-3-8B
CPT Foundation      serendib-llm-cpt-llama3-8b
Parameters          8.16B total, 130M trainable (1.59%)
Training Examples   309,328
Training Method     LoRA fine-tuning
Training Duration   26.5 hours on A100 80GB
Final Loss          0.27
License             Apache 2.0

🎯 Specialized Capabilities

News Classification (Our Strength!)

Trained on 45,080 Sinhala news examples - the largest news classification dataset for Sinhala.

Example:

Input:  "ශ්‍රී ලංකා ක්‍රිකට් කණ්ඩායම අද ඉන්දියාවට එරෙහිව තරගයක් ආරම්භ කළේය"
        ("The Sri Lanka cricket team began a match against India today")
Output: "මෙය ක්‍රීඩා පුවතකි" ("This is a sports news item") ✅

Question Answering

29,390 QA pairs covering geography, history, culture, and general knowledge.

Example:

Input:  "ශ්‍රී ලංකාවේ අගනුවර කුමක්ද?" ("What is the capital of Sri Lanka?")
Output: "ශ්‍රී ලංකාවේ අගනුවර කොළඹයි" ("The capital of Sri Lanka is Colombo") ✅

The helper sketch at the end of the Usage section below runs this same question through the prompt template.

πŸ“ˆ Dataset Composition

Category              Examples   Percentage
General Sinhala        205,403        66.4%
News Classification     45,080        14.6%
QA Pairs                29,390         9.5%
Summarization           19,593         6.3%
Rewrite/Formatting       9,862         3.2%
TOTAL                  309,328       100.0%
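
The table is internally consistent; a quick plain-Python check of the totals and rounded percentages:

# Sanity check: category counts sum to the stated total, and the
# percentages round to the values listed above.
counts = {
    "General Sinhala": 205_403,
    "News Classification": 45_080,
    "QA Pairs": 29_390,
    "Summarization": 19_593,
    "Rewrite/Formatting": 9_862,
}
total = sum(counts.values())
assert total == 309_328
for name, n in counts.items():
    print(f"{name}: {n:,} ({100 * n / total:.1f}%)")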

πŸš€ Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Chamaka8/Serendip-LLM-CPT-SFT-v2",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Chamaka8/Serendip-LLM-CPT-SFT-v2")

# Format prompt (Alpaca style). The instruction reads:
# "Classify the following news article"
prompt = """### Instruction:
ΰΆ΄ΰ·„ΰΆ­ ΰΆ΄ΰ·”ΰ·€ΰΆ­ΰ·Š ΰΆ½ΰ·’ΰΆ΄ΰ·’ΰΆΊ ΰ·€ΰΆ»ΰ·ŠΰΆœΰ·“ΰΆšΰΆ»ΰΆ«ΰΆΊ ࢚ࢻࢱ්ࢱ

### Input:
ΰ·ΰ·Šβ€ΰΆ»ΰ·“ ΰΆ½ΰΆ‚ΰΆšΰ· ΰΆšΰ·Šβ€ΰΆ»ΰ·’ΰΆšΰΆ§ΰ·Š ࢚ࢫ්ࢩාࢺࢸ ΰΆ…ΰΆ― ΰΆ‰ΰΆ±ΰ·ŠΰΆ―ΰ·’ΰΆΊΰ·ΰ·€ΰΆ§ ΰΆ‘ΰΆ»ΰ·™ΰ·„ΰ·’ΰ·€ ࢭࢻ࢜ࢺ්࢚ ΰΆ†ΰΆ»ΰΆΈΰ·ŠΰΆ· ΰΆšΰ·…ΰ·šΰΆΊ.

### Response:
"""

# Generate (sampling must be enabled, otherwise temperature/top_p are ignored)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response.split("### Response:")[-1].strip())
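
The formatting and decoding steps above can be wrapped in a small reusable helper that covers both classification (with an Input block) and question answering (instruction only). A minimal sketch reusing the model and tokenizer loaded above; the helper name is illustrative, and whether the model was trained on input-less prompts is an assumption:

def generate_sinhala(instruction, input_text="", max_new_tokens=150):
    """Build an Alpaca-style prompt and return only the model's response."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text.split("### Response:")[-1].strip()

# Question answering uses the Instruction block alone, with no Input section.
print(generate_sinhala("ශ්‍රී ලංකාවේ අගනුවර කුමක්ද?"))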

βš™οΈ Training Configuration

Hardware

  • GPU: NVIDIA A100 SXM 80GB
  • Training Time: 26.5 hours
  • Cost: ~$37 USD

Hyperparameters

num_train_epochs = 3
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
learning_rate = 2e-5
max_seq_length = 384
lora_r = 64
lora_alpha = 128
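
As a reproduction aid, these settings map onto a PEFT LoRA configuration and standard TrainingArguments roughly as follows. A hedged sketch: the target modules, dropout, and checkpoint path are assumptions not stated in the card, and the effective batch size works out to 8 × 4 = 32:

from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# CPT foundation named in the card; the full hub path is an assumption.
base = AutoModelForCausalLM.from_pretrained("serendib-llm-cpt-llama3-8b")

lora_config = LoraConfig(
    r=64,                # lora_r from the card
    lora_alpha=128,      # lora_alpha from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: not listed in the card
    lora_dropout=0.05,   # assumption: not listed in the card
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # card reports 130M trainable (1.59%)

training_args = TrainingArguments(
    output_dir="serendip-sft-v2",    # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size: 8 * 4 = 32
    learning_rate=2e-5,
    fp16=True,                       # assumption, matching the F16 release
)
# max_seq_length = 384 is applied at tokenization time (e.g. tokenizer truncation).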

Training Loss

Epoch   Loss
1.0     0.28
2.0     0.24
3.0     0.27

πŸ“Š Comparison

Model            Training Examples   News Examples
SinLlama         ~50,000             Limited
SerendipLLM V2   309,328             45,080 ✅

πŸ”— Resources

πŸ“š Citation

@misc{serendipllm2026,
  title={SerendipLLM V2: Large-Scale Instruction-Tuning for Sinhala},
  author={Chamaka Alwis},
  year={2026},
  url={https://huggingface.co/Chamaka8/Serendip-LLM-CPT-SFT-v2}
}

πŸ“„ License

Apache 2.0


Built with ❀️ for the Sinhala NLP community
