# Qwen2.5-1.5B-Singlish-Transliteration

## Model Summary
Qwen2.5-1.5B-Singlish-Transliteration is a fine-tuned version of the Qwen/Qwen2.5-1.5B-Instruct large language model, specialized for transliterating Singlish (phonetic Sinhala typed in English) into Sinhala script.
This model was developed to bridge the gap between informal Romanized typing and formal Sinhala script, particularly for social media content, chat logs, and digital communication in Sri Lanka.
- Developed by: Afeef Zeed
- Task: Singlish to Sinhala Transliteration
- Base Model: Qwen 2.5 (1.5B Parameters)
- Fine-Tuning Technique: LoRA (Low-Rank Adaptation) via PEFT
- Dataset Size: ~500,000 pairs of Singlish-Sinhala text
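The card states that fine-tuning used LoRA via PEFT. As an illustrative sketch only, a typical PEFT LoRA setup for a Qwen2.5 causal LM looks like the following; the rank, alpha, dropout, and target modules shown here are common defaults and are *not* stated in this card, so treat them as assumptions:

```python
# Illustrative LoRA fine-tuning setup via PEFT. The hyperparameters below
# (r, lora_alpha, target_modules, dropout) are typical defaults, NOT the
# values used to train this model -- they are not published in this card.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```

Because only the adapter weights are trained, the result is a small adapter checkpoint that is loaded on top of the frozen base model, as shown in the getting-started code below.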
## Uses

### Direct Use
The model is designed to take a Singlish sentence as input and output the corresponding Sinhala script. Unlike rule-based transliterators, which map characters or words in isolation, it can use surrounding sentence context to resolve ambiguous spellings.
Example:
- Input: "oyage nama mokakda"
- Output: "ඔයාගේ නම මොකක්ද"
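To make the contrast with rule-based transliteration concrete, here is a minimal word-lookup transliterator. The mappings are taken from the example above; this is an illustrative sketch, not part of the model, and a real rule-based system would map character sequences rather than whole words:

```python
# A minimal rule-based transliterator: a fixed word-for-word lookup table.
# The mappings are illustrative, taken from the example above. Any word not
# in the table passes through unchanged -- it cannot use context at all.
WORD_MAP = {
    "oyage": "ඔයාගේ",
    "nama": "නම",
    "mokakda": "මොකක්ද",
}

def rule_based_transliterate(text):
    # Look up each word; fall back to the original token when unknown.
    return " ".join(WORD_MAP.get(word, word) for word in text.lower().split())

print(rule_based_transliterate("oyage nama mokakda"))  # ඔයාගේ නම මොකක්ද
```

A table like this breaks down whenever one Romanized spelling corresponds to several Sinhala words; the fine-tuned model can instead pick the reading that fits the sentence.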
## How to Get Started with the Model
You can use this model directly with the Hugging Face `transformers` and `peft` libraries.
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# 1. Load the base model and tokenizer
base_model_name = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model_name = "Afeefzeed/Qwen2.5-Singlish-Transliteration"  # Replace with your actual username if different

print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# 2. Load the fine-tuned LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, adapter_model_name)
model.eval()

# 3. Define the transliteration function
def transliterate(text):
    prompt = (
        f"<|im_start|>user\nTransliterate this Singlish text to Sinhala: {text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.1,
            do_sample=True,
        )
    # Decode and keep only the assistant's response
    full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full_output.split("assistant\n")[-1].strip()

# 4. Test
input_text = "mama heta gedara yanawa"
print(f"Input: {input_text}")
print(f"Output: {transliterate(input_text)}")
```
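The `split("assistant\n")` step above can be pulled out into a small helper, which makes it easy to test without loading the model. This is a sketch; the marker string is an assumption based on how Qwen's ChatML role names appear once `skip_special_tokens=True` drops the `<|im_start|>`/`<|im_end|>` tokens:

```python
def extract_assistant_response(decoded_text, marker="assistant\n"):
    # Keep only the text after the last assistant marker in the decoded output.
    # With skip_special_tokens=True the <|im_start|>/<|im_end|> tokens are
    # removed, leaving the bare role name followed by a newline.
    return decoded_text.split(marker)[-1].strip()

decoded = (
    "user\nTransliterate this Singlish text to Sinhala: oyage nama mokakda\n"
    "assistant\nඔයාගේ නම මොකක්ද"
)
print(extract_assistant_response(decoded))  # ඔයාගේ නම මොකක්ද
```

If the marker is absent (e.g. generation was cut off), `split` returns the whole string, so the helper degrades to returning the full decoded text rather than raising an error.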