HachimiMT-60: Chinese→Vietnamese Web-Novel Translation Model

A 56.94M-parameter Marian-class Chinese-to-Vietnamese translation model optimized for web-novel content (xianxia, modern, cross-domain).

TL;DR

Aspect	Value
Params	56.94M
Architecture	Asymmetric Marian (8 encoder + 2 decoder, d_model 512)
Vocab	Custom SPM-BPE 24k joint ZH+VI
Max position	512
Best for	Xianxia + cross-domain web-novel paragraph translation

Quick Start

from transformers import AutoTokenizer, MarianMTModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is present

tokenizer = AutoTokenizer.from_pretrained("ngocdang83/HachimiMT-60-zh-vi")
model = MarianMTModel.from_pretrained("ngocdang83/HachimiMT-60-zh-vi").to(device).eval()

src = "他必须得抓紧时间了。凌伊山掏出手机，查询起了临江市最近开往雪霏市的机票。"
inp = tokenizer(src, return_tensors="pt", truncation=True, max_length=256).to(device)
with torch.inference_mode():
    out = model.generate(
        **inp,
        max_new_tokens=300,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=2,
        repetition_penalty=1.2,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Output: "Hắn phải tranh thủ thời gian rồi. Lăng Y Sơn lấy điện thoại ra, tra
#  vé máy bay gần nhất từ thành phố Lâm Giang đến thành phố Tuyết Phi."

Fast CPU Runtime

This repository also includes a CTranslate2 INT8 export under ct2-int8_float32/, used by the public demo Space for faster CPU inference.

import ctranslate2
from pathlib import Path
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

model_id = "ngocdang83/HachimiMT-60-zh-vi"
model_path = Path(snapshot_download(model_id, allow_patterns=[
    "config.json", "source.spm", "target.spm", "vocab.json", "tokenizer_config.json",
    "ct2-int8_float32/*",
]))

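tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = ctranslate2.Translator(
    str(model_path / "ct2-int8_float32"),
    device="cpu",
    compute_type="int8_float32",
)

# CTranslate2 works on token strings rather than tensors, so convert ids to tokens and back.
src = "他必须得抓紧时间了。"
src_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(src))
result = translator.translate_batch([src_tokens], beam_size=4)
out_ids = tokenizer.convert_tokens_to_ids(result[0].hypotheses[0])
print(tokenizer.decode(out_ids, skip_special_tokens=True))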

Speed Benchmark

Tested on an RTX 5070 Ti Laptop GPU with num_beams=4 on a mixed test set (20 short + 20 medium + 20 long rows).

Model	Params	Mean Latency	max_position	Notes
Hirashiba-tiny	15.1M	377ms	512	Fastest
Hirashiba-medium	57.07M	495ms	128	Truncates paragraphs
HachimiMT-60 (this)	56.94M	603ms	512	Handles long paragraphs without truncation

Per-bucket mean latency (ms):

Bucket	HachimiMT-60	Hirashiba-medium	Hirashiba-tiny
short (~70-120ch)	330	390	310
medium (~150-250ch)	626	546	430
long (>250ch)	853	548	390

⚠️ Hirashiba-medium and Hirashiba-tiny truncate on medium/long buckets due to max_position_embeddings=128, which caps output to ~120 tokens regardless of source length. Their lower latency on the long bucket reflects truncated output rather than faster decoding. HachimiMT-60 produces full-length output up to ~1000 chars without truncation.

For ultra-low-latency short-content use cases, consider Hirashiba-tiny. For paragraph-level web-novel translation, HachimiMT-60 is recommended.

Architecture

MarianMTModel:
  vocab_size: 24000
  d_model: 512
  encoder_layers: 8
  decoder_layers: 2
  encoder_attention_heads: 8
  decoder_attention_heads: 8
  encoder_ffn_dim: 3072
  decoder_ffn_dim: 3072
  max_position_embeddings: 512
  share_encoder_decoder_embeddings: true
  tie_word_embeddings: true
  scale_embedding: true
  activation_function: swish

Total params: 56,935,424 (~57M).
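The total can be re-derived from the config values above. The following is a minimal sketch of that arithmetic, assuming the standard Hugging Face Marian layer layout (biased linear layers, one LayerNorm per sub-block, no embedding LayerNorm, and fixed sinusoidal position tables that are counted as parameters). That layout is an assumption, not something this card states; the variable names are illustrative only.

```python
# Parameter count reconstructed from the config listed above.
# ASSUMPTION: standard Hugging Face Marian layout (see lead-in).
vocab_size, d_model, ffn_dim = 24000, 512, 3072
encoder_layers, decoder_layers, max_positions = 8, 2, 512


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of one dense layer."""
    return n_in * n_out + n_out


attention = 4 * linear(d_model, d_model)                    # q, k, v, out projections
ffn = linear(d_model, ffn_dim) + linear(ffn_dim, d_model)   # two dense layers
layer_norm = 2 * d_model                                    # weight + bias

encoder_layer = attention + ffn + 2 * layer_norm
decoder_layer = 2 * attention + ffn + 3 * layer_norm        # self-attn + cross-attn

shared_embedding = vocab_size * d_model        # shared by encoder, decoder, tied output layer
position_tables = 2 * max_positions * d_model  # encoder + decoder sinusoidal tables

learned = (shared_embedding
           + encoder_layers * encoder_layer
           + decoder_layers * decoder_layer)
total = learned + position_tables

print(f"learned weights:      {learned:,}")   # 56,411,136
print(f"with position tables: {total:,}")     # 56,935,424
```

Under these assumptions the 8-layer encoder accounts for about 33.6M parameters and the 2-layer decoder for about 10.5M, which is the asymmetry the architecture name refers to.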

Training Datasets

Primary training sources:

ngocdang83/tran-vi-teacher: 350k strict-clean Chinese-Vietnamese parallel pairs from Gemini 2.5/3.0/3.1 teacher models (Pro/Flash/Flash-Lite tiers). Provides paragraph-level training examples and cross-domain coverage (urban, fantasy, sci-fi, history).
chi-vi/hirashiba-mt-zh2vi-b-filtered: filtered Chinese-Vietnamese translation dataset for the web-novel domain.
Gold teacher data: additional quality-targeted training examples generated via the Gemini API.

Decode Configuration

Recommended generation parameters:

out = model.generate(
    **inputs,
    max_new_tokens=300,       # adjust based on expected length
    num_beams=4,              # quality/speed tradeoff
    early_stopping=True,
    no_repeat_ngram_size=2,   # prevent repetition
    repetition_penalty=1.2,
)

For shorter inputs (single sentence), reduce max_new_tokens to 150. For long paragraphs, increase it to 400.
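One way to apply this guidance automatically is to pick the token budget from the source length. The helper below is a hypothetical sketch: the function name and the character thresholds (120 and 250, borrowed from the benchmark bucket boundaries) are assumptions rather than values this card prescribes.

```python
def pick_max_new_tokens(src: str, short_chars: int = 120, long_chars: int = 250) -> int:
    """Choose a max_new_tokens budget from the source length in characters.

    ASSUMPTION: the thresholds are illustrative; tune them on your own data.
    """
    n = len(src.strip())
    if n <= short_chars:
        return 150   # short, sentence-like input
    if n > long_chars:
        return 400   # long paragraph
    return 300       # default budget


print(pick_max_new_tokens("他必须得抓紧时间了。"))
```

The returned value can be passed as max_new_tokens to model.generate in place of the fixed 300.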

Intended Uses

Not Recommended

Non-Chinese sources (ZH→VI only, not bidirectional)
Traditional Chinese (繁體) input — model trained on Simplified Chinese (简体). Traditional characters may degrade output quality; convert to Simplified first (e.g. via opencc).
Bilingual editing/post-editing without verification — automated MT should be reviewed before publication.

Limitations

Hallucination on rare proper nouns: Western names (Klein, Audrey, Bernadette) are usually preserved, but uncommon proper nouns may be hallucinated.
Trained on web-novel corpus: scientific, legal, or news domains may give suboptimal results.
Long-context drift: When translating a single long input (>200 chars in one go), proper names and consistent terminology may drift after several mentions in the same context (e.g., "Trương Vũ" → "Trương Huyền" → "Trương XX"). Mitigation: split long inputs by paragraph/sentence and translate each chunk independently. The HF Space demo applies this automatically.
Output length asymptote: outputs >1000 chars per chunk may degrade.
Simplified Chinese only: Traditional Chinese inputs untested and likely to degrade.
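The chunking mitigation for long-context drift can be sketched as below. This is not the HF Space's actual implementation, which this card does not show. The function names, the punctuation set, and the 200-character budget (taken from the drift threshold quoted above) are assumptions, and translate_one stands for any callable that translates a single chunk.

```python
import re
from typing import Callable, List

# A sentence is a run of text ending in terminal punctuation plus any closing quotes.
# ASSUMPTION: this punctuation set is illustrative and may need extending.
_SENTENCE = re.compile(r"[^。！？!?…]*[。！？!?…]+[”’」』]*|[^。！？!?…]+")


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Split text by paragraph, then pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks: List[str] = []
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        current = ""
        for sentence in _SENTENCE.findall(paragraph):
            if current and len(current) + len(sentence) > max_chars:
                chunks.append(current)
                current = ""
            current += sentence
        if current:
            chunks.append(current)
    return chunks


def translate_long(text: str, translate_one: Callable[[str], str], max_chars: int = 200) -> str:
    """Translate each chunk independently and join the results with newlines."""
    return "\n".join(translate_one(chunk) for chunk in split_into_chunks(text, max_chars))


demo = "他必须得抓紧时间了。凌伊山掏出手机。\n\n“走吧！”他说。"
for chunk in split_into_chunks(demo, max_chars=12):
    print(chunk)
```

Because chunks never cross paragraph boundaries, a name that drifts in one chunk cannot contaminate the next. The trade-off is that the model loses cross-chunk context, so pronoun or terminology consistency between chunks is not guaranteed.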

Evaluation Methodology

Quality validation uses a trio of AI reviewers to produce cross-validated, human-style preference judgments and reduce single-reviewer bias.

Reviewers

Three independent reviewer sessions, each using a different agent/runtime context:

Reviewer 1: gemini-3.1-pro via Gemini CLI
Reviewer 2: gemini-3.5-flash via Gemini CLI (different temperature)
Reviewer 3: gemini-3.5-flash via Antigravity (AGY) agents

Each reviewer reads one review TSV in isolation — they cannot see other reviewers' outputs.

Scoring

Per row, per model:

Severity 0-3 scale (0 = OK / acceptable, 1 = minor error, 2 = moderate error, 3 = severe error — hallucination, truncation, or word salad)
Winner pick: choose the best of 4 model outputs, or tie / all_bad
winner_reason short text (model-specific failure modes or strengths)

Aggregation

Pooled severity = mean of all severity scores across reviewers (lower = better)
Winner aggregate = vote count across 180 judgments (60 rows × 3 reviewers)
Trio consensus = rows where all 3 reviewers agree on the same winner (highest-confidence signal)
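The three aggregates above can be computed as in the following sketch. The record layout (dicts with row, reviewer, winner, and severity keys) and the toy sample are assumptions made for illustration. They are not the project's actual review TSV schema or real evaluation results.

```python
from collections import Counter, defaultdict
from statistics import mean


def aggregate(judgments, n_reviewers=3):
    """Return (pooled severity per model, winner vote counts, trio-consensus rows)."""
    severities = defaultdict(list)
    votes = Counter()
    winners_by_row = defaultdict(list)

    for j in judgments:
        for model, score in j["severity"].items():
            severities[model].append(score)
        votes[j["winner"]] += 1
        winners_by_row[j["row"]].append(j["winner"])

    # Pooled severity: mean over every reviewer and row (lower is better).
    pooled = {model: mean(scores) for model, scores in severities.items()}
    # Trio consensus: rows where all reviewers picked the same winner.
    consensus = {
        row: picks[0]
        for row, picks in winners_by_row.items()
        if len(picks) == n_reviewers and len(set(picks)) == 1
    }
    return pooled, votes, consensus


# Toy sample with made-up values, two rows x three reviewers.
sample = [
    {"row": 1, "reviewer": "r1", "winner": "model_a", "severity": {"model_a": 0, "model_b": 2}},
    {"row": 1, "reviewer": "r2", "winner": "model_a", "severity": {"model_a": 1, "model_b": 2}},
    {"row": 1, "reviewer": "r3", "winner": "model_a", "severity": {"model_a": 0, "model_b": 3}},
    {"row": 2, "reviewer": "r1", "winner": "model_b", "severity": {"model_a": 2, "model_b": 1}},
    {"row": 2, "reviewer": "r2", "winner": "tie", "severity": {"model_a": 1, "model_b": 1}},
    {"row": 2, "reviewer": "r3", "winner": "model_a", "severity": {"model_a": 1, "model_b": 2}},
]

pooled, votes, consensus = aggregate(sample)
print(pooled, dict(votes), consensus)
```

In this toy sample only row 1 reaches trio consensus; row 2 still contributes to the pooled severity and the vote counts.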

Test Sets

Two complementary evaluation sets covering web-novel translation diversity:

Cross-novel paragraph (60 rows, 20 short + 20 medium + 20 long buckets) — random paragraphs from two web-novels (Lovecraftian fantasy + sci-fi mecha), tests cross-domain + long-output handling.
Xianxia in-distribution (60 rows, 30 classical xianxia + 30 modern xianxia hybrid chapter excerpts) — tests xianxia genre quality and register polish (Hán Việt accuracy, tu tiên vocabulary, modern colloquial Vietnamese register).

Anti-Bias Rules

To prevent single-reviewer drift:

Each session opens only one review file (no cross-read)
Anti-boilerplate rules enforced (no default severity=0, no default winner=tie)
Reviewer-specific bias patterns identified post-hoc and weighted in interpretation

Citation

@misc{hachimimt60-2026,
  author = {ngocdang83},
  title = {HachimiMT-60: Chinese-to-Vietnamese Web-Novel Translation},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/ngocdang83/HachimiMT-60-zh-vi}
}

License

CC-BY-4.0 — free use with attribution. Training data includes Gemini API teacher distillation; downstream users should verify current Gemini API terms for derivative-work training.
