# mni-ml/transformer

A 12.3M-parameter decoder-only Transformer (GPT-style) trained in Node.js with `@mni-ml/framework` on the TinyStories corpus, using a HuggingFace-style ByteLevel BPE tokenizer (vocab 4096). Source code, training scripts, and data-prep utilities live at [github.com/mni-ml/transformer](https://github.com/mni-ml/transformer).

The HF inference widget is disabled for this model. It uses a custom Node.js runtime (`@mni-ml/framework`), not `transformers`, so the widget cannot load it. See *Running locally* below.
## Files

| File | Size | Description |
|---|---|---|
| `model-final.json` | ~249 MB | Final checkpoint: weights, config, and optimizer state; loaded by `@mni-ml/framework` |
| `tokenizer.json` | ~266 KB | HuggingFace-format ByteLevel BPE tokenizer (vocab 4096, special token `<\|endoftext\|>`) |
## Architecture
Standard GPT-style decoder-only Transformer with pre-norm blocks, causal self-attention, learnable position embeddings, and weight-tied output head.
| Hyperparameter | Value |
|---|---|
| Parameters | 12,322,816 |
| Layers (`n_layer`) | 6 |
| Attention heads (`n_head`) | 6 |
| Embedding dim (`n_embd`) | 384 |
| Head dim | 64 |
| Context window (`block_size`) | 256 tokens |
| Vocab size | 4,096 |
| Activation | GELU |
| Normalization | LayerNorm (pre-norm), ε = 1e-5 |
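As a sanity check, the parameter count in the table can be reproduced from the other hyperparameters. The sketch below assumes a GPT-2-style layout (biases on every linear layer and LayerNorm, 4× MLP expansion, and a bias on the weight-tied output head); those layout details are inferred from the architecture description, not read from the checkpoint:

```javascript
// Back-of-the-envelope parameter count for the model in the table above.
// Assumption: GPT-2-style biases everywhere and a weight-tied output head
// that keeps its own bias vector.
function countParams({ vocab = 4096, block = 256, nEmbd = 384, nLayer = 6 } = {}) {
  const mlpHidden = 4 * nEmbd;                  // standard 4x MLP expansion
  const tokEmb = vocab * nEmbd;                 // token embedding (tied with head)
  const posEmb = block * nEmbd;                 // learnable position embeddings
  const perLayer =
    2 * (2 * nEmbd) +                           // two LayerNorms (scale + shift)
    (3 * nEmbd * nEmbd + 3 * nEmbd) +           // fused QKV projection + bias
    (nEmbd * nEmbd + nEmbd) +                   // attention output projection + bias
    (nEmbd * mlpHidden + mlpHidden) +           // MLP up-projection + bias
    (mlpHidden * nEmbd + nEmbd);                // MLP down-projection + bias
  const finalLn = 2 * nEmbd;                    // final LayerNorm
  const headBias = vocab;                       // output head bias (weights are tied)
  return tokEmb + posEmb + nLayer * perLayer + finalLn + headBias;
}

console.log(countParams()); // 12322816, matching the table
```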
The full config is also embedded in `model-final.json` under the `config` key and is read automatically by the generate scripts.
## Running locally

Because this model uses a custom JS runtime, you need three pieces to run inference: the npm framework and two source files (`src/generate.js` and `src/bpe.js`) from the GitHub repo.
### Prerequisites

- Node.js ≥ 22.18 (required by `@mni-ml/framework`)
- `git` (to grab the source files)
- `hf` CLI (to download the weights)
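To confirm the Node.js floor before installing anything, a tiny version check like the following can help (a hypothetical helper, not part of the repo):

```javascript
// Compare a Node.js version string against a major.minor floor.
// Usage: nodeAtLeast(process.versions.node, 22, 18)
function nodeAtLeast(version, floorMajor, floorMinor) {
  const [major, minor] = version.split(".").map(Number);
  return major > floorMajor || (major === floorMajor && minor >= floorMinor);
}
```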
### Step-by-step

```bash
# 1. Clone the source repo (needed for src/generate.js + src/bpe.js)
git clone https://github.com/mni-ml/transformer.git
cd transformer

# 2. Install the JS runtime
npm install

# 3. Download the checkpoint + tokenizer into ./out
hf download mni-ml/transformer model-final.json tokenizer.json --local-dir ./out

# 4. Generate
node src/generate.js out/model-final.json "<|endoftext|>" 400 0.9 out/tokenizer.json
```
CLI arguments to `generate.js`:

```bash
node src/generate.js <checkpoint> <prompt> <max_new_tokens> <temperature> <tokenizer_path>
```

⚠️ The 5th argument (`tokenizer_path`) is effectively required when using this public checkpoint: `model-final.json` internally records the path `/app/data/tokenizer.json` (the training container's path), which will not exist on your machine. Always pass `out/tokenizer.json` (or wherever you downloaded it) as the 5th argument.
Temperature 0 gives greedy decoding; values > 0 do temperature sampling. The prompt is encoded with the BPE tokenizer, so any UTF-8 string works; `<|endoftext|>` is the only special token.
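What that temperature argument does can be sketched as follows; this is the standard greedy/softmax-sampling recipe, not the repo's actual `generate.js`:

```javascript
// Temperature-controlled next-token sampling from raw logits.
// temperature === 0 falls back to greedy argmax; otherwise logits are
// scaled by 1/temperature and sampled from the softmax distribution.
function sampleToken(logits, temperature) {
  if (temperature === 0) {
    return logits.indexOf(Math.max(...logits));   // greedy decoding
  }
  const scaled = logits.map((l) => l / temperature);
  const maxL = Math.max(...scaled);               // subtract max for stability
  const exps = scaled.map((l) => Math.exp(l - maxL));
  const sum = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * sum;
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;                         // numerical fallback
}
```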
### GPU (optional)

If you install a matching `@mni-ml/framework-*` native package that exposes `native.flashAttention`:

```bash
node src/generate_gpu.js out/model-final.json "<|endoftext|>" 400 0.9 out/tokenizer.json
```
### Quick sanity check

```bash
node src/generate.js out/model-final.json "Once upon a time" 100 0.8 out/tokenizer.json
```
Expected output style: short, simple, children's-story English (since the training corpus is TinyStories).
## Intended use
Small research / educational model that demonstrates training a Transformer end-to-end in JavaScript. It is fluent on short children's-story-style English and is not a general-purpose chat or instruction model.
- Suitable for: short-form story continuation, JS/Node learning demos, tokenizer experiments.
- Not suitable for: factual Q&A, code generation, non-English text, long-context tasks (256-token window), safety-critical use.
## Training data

TinyStories is a synthetic corpus of short English children's stories, originally generated by GPT-3.5 / GPT-4 and designed for training small language models. The BPE tokenizer in `tokenizer.json` was trained on the same corpus via `scripts/prepare_tinystories.py` in the source repo.
## Training procedure

- Framework: `@mni-ml/framework` v0.3.4 (Node.js)
- Entry point: `src/train.js` (CPU) or `src/train_gpu.js` (GPU)
- Objective: next-token cross-entropy
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| β₁, β₂ | 0.9, 0.95 |
| Weight decay | 0.1 |
| Max grad norm | 1.0 |
| Peak LR | 3e-4 |
| Min LR | 6e-5 |
| LR schedule | Linear warmup (200 steps) → cosine decay |
| Max iterations | 7,500 |
| Batch size | 8 |
| Gradient accumulation | 4 (→ effective batch 32) |
| Dropout | 0.1 (training only) |
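The warmup-then-cosine schedule from the table can be written out explicitly. A minimal sketch using the values above (the exact warmup interpolation in `src/train.js` may differ slightly):

```javascript
// Linear warmup over the first 200 steps, then cosine decay from the
// peak LR (3e-4) down to the min LR (6e-5) by the final iteration.
function learningRate(
  step,
  { warmup = 200, maxIters = 7500, peakLr = 3e-4, minLr = 6e-5 } = {}
) {
  if (step < warmup) {
    return (peakLr * (step + 1)) / warmup;                  // linear ramp up
  }
  const progress = (step - warmup) / (maxIters - warmup);   // 0 -> 1 after warmup
  const cosine = 0.5 * (1 + Math.cos(Math.PI * progress));  // 1 -> 0
  return minLr + (peakLr - minLr) * cosine;
}
```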
## Limitations and biases
- Trained only on TinyStories, so outputs mimic simple children's stories and will hallucinate or produce nonsense for anything outside that domain.
- TinyStories is itself GPT-generated, so any biases or artifacts of the generating models can propagate here.
- 256-token context window is very short.
- No RLHF, no instruction tuning, no safety alignment.
- English-only.
## License
MIT — see the source repository for details.
## Citation

```bibtex
@misc{mni-ml-transformer,
  title  = {mni-ml/transformer: a 12M-parameter Transformer trained in Node.js},
  author = {mni-ml},
  year   = {2026},
  url    = {https://github.com/mni-ml/transformer}
}

@article{eldan2023tinystories,
  title   = {TinyStories: How Small Can Language Models Be and Still Speak Coherent English?},
  author  = {Eldan, Ronen and Li, Yuanzhi},
  journal = {arXiv preprint arXiv:2305.07759},
  year    = {2023}
}
```