# mni-ml/transformer

A 12.3M-parameter decoder-only Transformer (GPT-style) trained in Node.js with `@mni-ml/framework` on the TinyStories corpus, using a HuggingFace-style ByteLevel BPE tokenizer (vocab 4096). Source code, training scripts, and data-prep utilities live at [github.com/mni-ml/transformer](https://github.com/mni-ml/transformer).

The HF inference widget is disabled for this model. It uses a custom Node.js runtime (`@mni-ml/framework`), not `transformers`, so the widget cannot load it. See *Running locally* below.
## Files

| File | Size | Description |
|---|---|---|
| `model-final.json` | ~249 MB | Final checkpoint: weights, config, and optimizer state; loaded by `@mni-ml/framework` |
| `tokenizer.json` | ~266 KB | HuggingFace-format ByteLevel BPE tokenizer (vocab 4096, special token `<\|endoftext\|>`) |
## Architecture
Standard GPT-style decoder-only Transformer with pre-norm blocks, causal self-attention, learnable position embeddings, and weight-tied output head.
| Hyperparameter | Value |
|---|---|
| Parameters | 12,322,816 |
| Layers (`n_layer`) | 6 |
| Attention heads (`n_head`) | 6 |
| Embedding dim (`n_embd`) | 384 |
| Head dim | 64 |
| Context window (`block_size`) | 256 tokens |
| Vocab size | 4,096 |
| Activation | GELU |
| Normalization | LayerNorm (pre-norm), ε = 1e-5 |
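As a sanity check, the parameter count in the table can be reproduced from the other hyperparameters. The sketch below assumes a GPT-2-style layout (biases on every linear layer and LayerNorm, 4× MLP expansion, and a bias on the weight-tied output head); those layout details are inferred from the architecture description, not read from the checkpoint:

```javascript
// Back-of-the-envelope parameter count for the model in the table above.
// Assumption: GPT-2-style biases everywhere and a weight-tied output head
// that keeps its own bias vector.
function countParams({ vocab = 4096, block = 256, nEmbd = 384, nLayer = 6 } = {}) {
  const mlpHidden = 4 * nEmbd;                  // standard 4x MLP expansion
  const tokEmb = vocab * nEmbd;                 // token embedding (tied with head)
  const posEmb = block * nEmbd;                 // learnable position embeddings
  const perLayer =
    2 * (2 * nEmbd) +                           // two LayerNorms (scale + shift)
    (3 * nEmbd * nEmbd + 3 * nEmbd) +           // fused QKV projection + bias
    (nEmbd * nEmbd + nEmbd) +                   // attention output projection + bias
    (nEmbd * mlpHidden + mlpHidden) +           // MLP up-projection + bias
    (mlpHidden * nEmbd + nEmbd);                // MLP down-projection + bias
  const finalLn = 2 * nEmbd;                    // final LayerNorm
  const headBias = vocab;                       // output head bias (weights are tied)
  return tokEmb + posEmb + nLayer * perLayer + finalLn + headBias;
}

console.log(countParams()); // 12322816, matching the table
```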
The full config is also embedded in `model-final.json` under the `config` key and is read automatically by the generate scripts.
## Running locally

Because this model uses a custom JS runtime, you need three pieces to run inference: the npm framework and two source files (`src/generate.js` and `src/bpe.js`) from the GitHub repo.
### Prerequisites

- Node.js ≥ 22.18 (required by `@mni-ml/framework`)
- `git` (to grab the source files)
- `hf` CLI (to download the weights)
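To confirm the Node.js floor before installing anything, a tiny version check like the following can help (a hypothetical helper, not part of the repo):

```javascript
// Compare a Node.js version string against a major.minor floor.
// Usage: nodeAtLeast(process.versions.node, 22, 18)
function nodeAtLeast(version, floorMajor, floorMinor) {
  const [major, minor] = version.split(".").map(Number);
  return major > floorMajor || (major === floorMajor && minor >= floorMinor);
}
```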
### Step-by-step

```bash
# 1. Clone the source repo (needed for src/generate.js + src/bpe.js)
git clone https://github.com/mni-ml/transformer.git
cd transformer

# 2. Install the JS runtime
npm install

# 3. Download the checkpoint + tokenizer into ./out
hf download mni-ml/transformer model-final.json tokenizer.json --local-dir ./out

# 4. Generate
node src/generate.js out/model-final.json "<|endoftext|>" 400 0.9 out/tokenizer.json
```
CLI arguments to `generate.js`:

```bash
node src/generate.js <checkpoint> <prompt> <max_new_tokens> <temperature> <tokenizer_path>
```

⚠️ The 5th argument (`tokenizer_path`) is effectively required when using this public checkpoint: `model-final.json` internally records the path `/app/data/tokenizer.json` (the training container's path), which will not exist on your machine. Always pass `out/tokenizer.json` (or wherever you downloaded it) as the 5th argument.
Temperature 0 gives greedy decoding; values > 0 do temperature sampling. The prompt is encoded with the BPE tokenizer, so any UTF-8 string works; `<|endoftext|>` is the only special token.
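What that temperature argument does can be sketched as follows; this is the standard greedy/softmax-sampling recipe, not the repo's actual `generate.js`:

```javascript
// Temperature-controlled next-token sampling from raw logits.
// temperature === 0 falls back to greedy argmax; otherwise logits are
// scaled by 1/temperature and sampled from the softmax distribution.
function sampleToken(logits, temperature) {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits));   // greedy decoding
  }
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled);               // subtract max for stability
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const sum = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * sum;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;                         // numerical fallback
}
```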
### GPU (optional)

If you install a matching `@mni-ml/framework-*` native package that exposes `native.flashAttention`:

```bash
node src/generate_gpu.js out/model-final.json "<|endoftext|>" 400 0.9 out/tokenizer.json
```
### Quick sanity check

```bash
node src/generate.js out/model-final.json "Once upon a time" 100 0.8 out/tokenizer.json
```
Expected output style: short, simple, children's-story English (since the training corpus is TinyStories).
## Intended use
Small research / educational model that demonstrates training a Transformer end-to-end in JavaScript. It is fluent on short children's-story-style English and is not a general-purpose chat or instruction model.
- Suitable for: short-form story continuation, JS/Node learning demos, tokenizer experiments.
- Not suitable for: factual Q&A, code generation, non-English text, long-context tasks (256-token window), safety-critical use.
## Training data

TinyStories is a synthetic corpus of short English children's stories, originally generated by GPT-3.5 / GPT-4 and designed for training small language models. The BPE tokenizer in `tokenizer.json` was trained on the same corpus via `scripts/prepare_tinystories.py` in the source repo.
## Training procedure

- Framework: `@mni-ml/framework` v0.3.4 (Node.js)
- Entry point: `src/train.js` (CPU) or `src/train_gpu.js` (GPU)
- Objective: next-token cross-entropy
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Max grad norm | 1.0 |
| Peak LR | 3e-4 |
| Min LR | 6e-5 |
| LR schedule | Linear warmup (200 steps) → cosine decay |
| Max iterations | 7,500 |
| Batch size | 8 |
| Gradient accumulation | 4 (→ effective batch 32) |
| Dropout | 0.1 (training only) |
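The warmup-then-cosine schedule from the table can be written out explicitly. A minimal sketch using the values above (the exact warmup interpolation in `src/train.js` may differ slightly):

```javascript
// Linear warmup over the first 200 steps, then cosine decay from the
// peak LR (3e-4) down to the min LR (6e-5) by the final iteration.
function learningRate(
  step,
  { warmup = 200, maxIters = 7500, peakLr = 3e-4, minLr = 6e-5 } = {}
) {
  if (step < warmup) {
    return (peakLr * (step + 1)) / warmup;                  // linear ramp up
  }
  const progress = (step - warmup) / (maxIters - warmup);   // 0 -> 1 after warmup
  const cosine = 0.5 * (1 + Math.cos(Math.PI * progress));  // 1 -> 0
  return minLr + (peakLr - minLr) * cosine;
}
```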
## Limitations and biases
- Trained only on TinyStories, so outputs mimic simple children's stories and will hallucinate or produce nonsense for anything outside that domain.
- TinyStories is itself GPT-generated, so any biases or artifacts of the generating models can propagate here.
- 256-token context window is very short.
- No RLHF, no instruction tuning, no safety alignment.
- English-only.
## License
MIT — see the source repository for details.
## Citation

```bibtex
@misc{mni-ml-transformer,
  title  = {mni-ml/transformer: a 12M-parameter Transformer trained in Node.js},
  author = {mni-ml},
  year   = {2026},
  url    = {https://github.com/mni-ml/transformer}
}

@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```