Growing Transformers: Frozen 16-bit Baseline (Monolithic, 181M)
This repository contains growing-transformers-model-frozen-16-bit-baseline-monolyth-181m, a monolithic baseline model from the paper "Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate".
It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study
Code:
https://github.com/AVBochkov/PGT
What this model is (in one paragraph)
This is a 9-layer decoder-only Transformer trained monolithically end-to-end (all Transformer layers trained simultaneously from scratch), without any constructive / layer-wise growth procedure. The key constraint is that the token embedding layer is frozen and uses an extremely small 16-dimensional binary embedding (n_embed = 16): each token is mapped to a 16-bit vector derived from its token ID (because vocab_size = 65,536 = 2^16). The 16-dim vector is deterministically expanded to the model hidden size (d_model = 1024) by repetition, as described in the paper.
This repository serves as a clean baseline to isolate the effect of monolithic training vs constructive growth, under the same frozen-embedding substrate.
Primary comparison (why this repo exists)
This model is intended to be compared to the constructive-growth counterpart:
- Bochkov/growing-transformers-model-16-bit-1-9-181m
(constructive, layer-wise growth; same 16-bit frozen embedding idea)
What is identical
- Same controlled-study Transformer stack architecture (9 layers, d_model=1024, n_head=32)
- Same tokenizer family / vocabulary size (65,536)
- Same 16-bit frozen embedding definition
What differs
- Training procedure:
- This repo: monolithic training (end-to-end; no staged growth)
- Constructive repo: trained in stages (layers 1-3, then 4-6, then 7-9), freezing previously trained layers
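The procedural difference can be sketched in a few lines. This is an illustrative toy, not the repo's actual training code: the layer container, block boundaries, and `freeze_blocks` helper are assumptions made for clarity.

```python
import torch.nn as nn

def freeze_blocks(layers: nn.ModuleList, upto: int) -> None:
    """Freeze the first `upto` layers (blocks trained in earlier growth stages)."""
    for layer in layers[:upto]:
        for p in layer.parameters():
            p.requires_grad = False

# Toy stand-in for the 9-layer stack (real layers are Transformer blocks).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(9)])

# Constructive growth, stage 2: layers 1-3 are trained and frozen, the rest
# remain trainable. The monolithic baseline in this repo skips this entirely
# and trains all 9 layers simultaneously from scratch.
freeze_blocks(layers, upto=3)
trainable = [i + 1 for i, layer in enumerate(layers)
             if all(p.requires_grad for p in layer.parameters())]
print(trainable)  # [4, 5, 6, 7, 8, 9]
```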
Model architecture (controlled study)
- Type: decoder-only Transformer (GPT-like)
- Layers: 9
- Hidden size: d_model = 1024
- Heads: n_head = 32
- Vocabulary size: 65,536
- Context length used in training: 1024
- Embedding: frozen 16-bit (n_embed = 16), deterministically expanded to d_model
Parameter count
- Total: ≈181.6M
- Frozen: ≈1.0M (embedding-related)
- Trainable: ≈180.6M

(Counts follow the paper's controlled-study table.)
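The frozen count follows directly from the embedding shape; a quick arithmetic check (the `requires_grad` split in the trailing comments is a hypothetical usage sketch, assuming the checkpoint marks the embedding table as non-trainable):

```python
# Frozen parameters = the 16-bit embedding table itself.
vocab_size, n_embed = 65_536, 16
frozen = vocab_size * n_embed
print(f"{frozen:,}")  # 1,048,576 -> the ~1.0M frozen parameters listed above

# With a loaded model (hypothetical usage, assuming requires_grad marks the split):
# trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
# frozen    = sum(p.numel() for p in model.parameters() if not p.requires_grad)
```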
Embedding definition (16-bit / n_embed=16)
- vocab_size = 65,536
- Each token ID id ∈ [0, 65535] is represented as a 16-bit binary vector (0/1 components).
- This 16-dim vector is expanded to d_model = 1024 by simple repetition (repeat_interleave-style expansion).
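This definition can be reproduced in a few lines. The least-significant-bit-first ordering below is an assumption inferred from the example output shown later in this card; the paper's exact bit convention may differ.

```python
import torch

vocab_size, n_embed, d_model = 65_536, 16, 1024
scale = d_model // n_embed  # 64

# Bit i (least-significant first) of every token id -> strictly 0/1 table.
ids = torch.arange(vocab_size)
table = ((ids.unsqueeze(1) >> torch.arange(n_embed)) & 1).float()  # (65536, 16)

# Deterministic expansion: repeat each bit `scale` times along the feature dim.
expanded = table.repeat_interleave(scale, dim=1)  # (65536, 1024)

print(table[65].tolist())  # 65 = 64 + 1, so bits 0 and 6 are set
print(tuple(expanded.shape))
```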
Tokenizer
Canonical tokenizer repository:
- https://huggingface.co/Bochkov/bvv241-2-3
(collection: https://huggingface.co/collections/Bochkov/tokenizers)
Note: this model repository may include additional embedding-related artifacts; for strict reproducibility, prefer loading the tokenizer bundled with this model repository.
Intended use
Research / analysis of:
- emergent semantics with minimal, frozen token embeddings
- monolithic vs constructive (layer-wise) training regimes
- controlled comparisons across embedding substrates (UNICODE vs 16-bit vs trainable)
Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.
How to use (Transformers)
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).to("cuda")

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")],
                      dtype=torch.long, device="cuda")
outputs = model.generate(inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0].tolist()))
# Write a short poem about the ocean. The poem was published in 1830 by the painter John Brown and was included in the collection

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")],
                      dtype=torch.long, device="cuda")
outputs = model.generate(inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(outputs[0].tolist()))
# Question: What is the capital of India?
# Answer:Mumbai
# </s>
```
Verify the 16-bit frozen binary embeddings (sanity check)
The model uses a frozen nn.Embedding(vocab_size=65536, n_embed=16) whose values are strictly binary (0/1). Each 16-dim vector is then deterministically expanded to d_model=1024 via repeat_interleave with a repetition factor of scale = 1024 / 16 = 64.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k)
                  for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)

# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)
print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)
tid = ids[0]

# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())
uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())
is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))
```
Expected output highlights (example):
- vocab_size: 65536
- config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 9, 'n_head': 32, 'scale': 64}
- token_embeddings.weight shape: (65536, 16)
- text='A'
- ids: [65]
- tokens: ['A']
- 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
- unique values in e16: [0.0, 1.0]
- is strictly binary (0/1): True
- expanded embedding shape: (1024,)
- expanded embedding first 128 values: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
- is binary globally (0/1): True
- non-binary entries: 0
🧑‍🔬 Citation & Concept
If you use this model or the underlying concepts in your research, please cite our work:
```bibtex
@article{bochkov2025emergent,
  title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
  author={Andrey Bochkov},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=Odh8IynO1o},
}

@misc{bochkov2025growingtransformersmodularcomposition,
  title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate},
  author={A. Bochkov},
  year={2025},
  eprint={2507.07129},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.07129},
}
```