Growing Transformers β€” Frozen 16-bit Baseline (Monolithic, 181M)

This repository contains growing-transformers-model-frozen-16-bit-baseline-monolyth-181m, a monolithic baseline model accompanying the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained monolithically end-to-end (all Transformer layers trained simultaneously from scratch), without any constructive / layer-wise growth procedure. The key constraint is that the token embedding layer is frozen and uses an extremely small 16-dimensional binary embedding (n_embed = 16): each token is mapped to a 16-bit vector derived from its token ID (because vocab_size = 65,536 = 2^16). The 16-dim vector is deterministically expanded to the model hidden size (d_model = 1024) by repetition, as described in the paper.

This repository serves as a clean baseline to isolate the effect of monolithic training vs constructive growth, under the same frozen-embedding substrate.


Primary comparison (why this repo exists)

This model is intended to be compared to the constructive-growth counterpart:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive, layer-wise growth; same 16-bit frozen embedding idea)

What is identical

  • Same controlled-study Transformer stack architecture (9 layers, d_model=1024, n_head=32)
  • Same tokenizer family / vocabulary size (65,536)
  • Same 16-bit frozen embedding definition

What differs

  • Training procedure:
    • This repo: monolithic training (end-to-end; no staged growth)
    • Constructive repo: trained in stages (1–3, then 4–6, then 7–9), freezing previously trained layers
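The two regimes can be sketched as a freezing schedule over the layer stack. This is an illustrative sketch only, not the actual training code: nn.Linear stands in for the real Transformer blocks, and only the freezing logic follows the 1–3 / 4–6 / 7–9 staging described above.

```python
import torch.nn as nn

# Illustrative stand-in: 9 "layers" (the real model uses Transformer blocks).
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(9)])

def set_trainable(layer: nn.Module, flag: bool) -> None:
    for p in layer.parameters():
        p.requires_grad = flag

# Monolithic (this repo): all 9 layers trainable for the entire run.
for layer in layers:
    set_trainable(layer, True)

# Constructive (counterpart repo): three stages; once a stage finishes,
# its layers stay frozen while the next stage trains.
stages = [range(0, 3), range(3, 6), range(6, 9)]
for stage in stages:
    for i, layer in enumerate(layers):
        set_trainable(layer, i in stage)  # only the current stage trains
    # ... run the training loop for this stage here ...

# After the final stage, layers 0-5 are frozen and layers 6-8 are trainable.
```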

Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: frozen 16-bit / n_embed=16, deterministically expanded to d_model

Parameter count

  • Total: β‰ˆ181.6M
  • Frozen: β‰ˆ1.0M (embedding-related)
  • Trainable: β‰ˆ180.6M

(Counts follow the paper’s controlled-study table.)
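The frozen count follows directly from the embedding shape, assuming (as the table above implies) that the frozen ≈1.0M consists only of the 65,536 × 16 embedding table:

```python
# Frozen parameters = the embedding table itself
# (assumption: no other frozen weights contribute to the count).
vocab_size, n_embed = 65_536, 16
frozen = vocab_size * n_embed
print(frozen)        # 1048576
print(frozen / 1e6)  # 1.048576 -> the ~1.0M frozen figure above
```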


Embedding definition (16-bit / n_embed=16)

  • vocab_size = 65,536
  • Each token ID id ∈ [0, 65535] is represented as a 16-bit binary vector (0/1 components).
  • This 16-dim vector is expanded to d_model=1024 by simple repetition (repeat_interleave-style expansion).
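A minimal sketch of this mapping in plain Python. The least-significant-bit-first bit order is an assumption here; it matches the printed 16-dim vector for token ID 65 in the sanity-check section below, but should be confirmed against the checkpoint's actual embedding table.

```python
def token_id_to_embedding(token_id: int, n_embed: int = 16, d_model: int = 1024) -> list[float]:
    """Frozen embedding: 16-bit binary code of the token ID, expanded to d_model."""
    # Bits of the token ID, least-significant bit first (assumed bit order).
    bits = [float((token_id >> i) & 1) for i in range(n_embed)]
    scale = d_model // n_embed  # 1024 // 16 = 64: each bit repeated 64 times
    return [b for b in bits for _ in range(scale)]

e = token_id_to_embedding(65)  # 'A' -> id 65 under a Unicode-char tokenizer
print(len(e))   # 1024
print(e[:16])   # bit 0 of 65 is 1, so the first 64 values are all 1.0
```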

Tokenizer

Canonical tokenizer repository:

Note: This model repo may include additional embedding-related artifacts; for strict reproducibility, prefer loading the tokenizer from this repo.


Intended use

Research / analysis of:

  • emergent semantics with minimal, frozen token embeddings
  • monolithic vs constructive (layer-wise) training regimes
  • controlled comparisons across embedding substrates (UNICODE vs 16-bit vs trainable)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The poem was published in 1830 by the painter John Brown and was included in the collection

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Mumbai
#    </s><

Verify the 16-bit frozen binary embeddings (sanity check)

The model stores token embeddings in a frozen nn.Embedding of shape (65536, 16) whose values are strictly binary (0/1). Each 16-dim vector is then deterministically expanded to d_model=1024 via repeat_interleave with a repeat factor (scale) of 64.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Show embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)

# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)

tid = ids[0]

# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())

uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())

is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • vocab_size: 65536
  • config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 9, 'n_head': 32, 'scale': 64}
  • token_embeddings.weight shape: (65536, 16)
  • text='A'
  • ids: [65]
  • tokens: ['A']
  • 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • unique values in e16: [0.0, 1.0]
  • is strictly binary (0/1): True
  • expanded embedding shape: (1024,)
  • expanded embedding first 128 values: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • is binary globally (0/1): True
  • non-binary entries: 0

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}