Growing Transformers β€” Frozen 16-bit Baseline (Monolithic, 181M)

This repository contains growing-transformers-model-frozen-16-bit-baseline-monolyth-181m, a monolithic baseline model accompanying the papers:

πŸ“š Paper (Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate) -

πŸ“š Paper (Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations) -

It is part of the comparative-study collection:
https://huggingface.co/collections/Bochkov/growing-transformers-layer-wise-expansion-comparative-study

Code:
https://github.com/AVBochkov/PGT


What this model is (in one paragraph)

This is a 9-layer decoder-only Transformer trained monolithically end-to-end (all Transformer layers trained simultaneously from scratch), without any constructive / layer-wise growth procedure. The key constraint is that the token embedding layer is frozen and uses an extremely small 16-dimensional binary embedding (n_embed = 16): each token is mapped to a 16-bit vector derived from its token ID (because vocab_size = 65,536 = 2^16). The 16-dim vector is deterministically expanded to the model hidden size (d_model = 1024) by repetition, as described in the paper.

This repository serves as a clean baseline to isolate the effect of monolithic training vs constructive growth, under the same frozen-embedding substrate.


Primary comparison (why this repo exists)

This model is intended to be compared to the constructive-growth counterpart:

  • Bochkov/growing-transformers-model-16-bit-1-9-181m
    (constructive, layer-wise growth; same 16-bit frozen embedding idea)

What is identical

  • Same controlled-study Transformer stack architecture (9 layers, d_model=1024, n_head=32)
  • Same tokenizer family / vocabulary size (65,536)
  • Same 16-bit frozen embedding definition

What differs

  • Training procedure:
    • This repo: monolithic training (end-to-end; no staged growth)
    • Constructive repo: trained in stages (1–3, then 4–6, then 7–9), freezing previously trained layers
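The two regimes can be sketched as a freezing schedule over the layer stack. This is an illustrative sketch only, not the actual training code: nn.Linear stands in for the real Transformer blocks, and only the freezing logic follows the 1–3 / 4–6 / 7–9 staging described above.

```python
import torch.nn as nn

# Illustrative stand-in: 9 "layers" (the real model uses Transformer blocks).
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(9)])

def set_trainable(layer: nn.Module, flag: bool) -> None:
    for p in layer.parameters():
        p.requires_grad = flag

# Monolithic (this repo): all 9 layers trainable for the entire run.
for layer in layers:
    set_trainable(layer, True)

# Constructive (counterpart repo): three stages; once a stage finishes,
# its layers stay frozen while the next stage trains.
stages = [range(0, 3), range(3, 6), range(6, 9)]
for stage in stages:
    for i, layer in enumerate(layers):
        set_trainable(layer, i in stage)  # only the current stage trains
    # ... run the training loop for this stage here ...

# After the final stage, layers 0-5 are frozen and layers 6-8 are trainable.
```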

Model architecture (controlled study)

  • Type: decoder-only Transformer (GPT-like)
  • Layers: 9
  • Hidden size: d_model = 1024
  • Heads: n_head = 32
  • Vocabulary size: 65,536
  • Context length used in training: 1024
  • Embedding: frozen 16-bit / n_embed=16, deterministically expanded to d_model

Parameter count

  • Total: β‰ˆ181.6M
  • Frozen: β‰ˆ1.0M (embedding-related)
  • Trainable: β‰ˆ180.6M

(Counts follow the paper’s controlled-study table.)
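The frozen count follows directly from the embedding shape, assuming (as the table above implies) that the frozen ≈1.0M consists only of the 65,536 × 16 embedding table:

```python
# Frozen parameters = the embedding table itself
# (assumption: no other frozen weights contribute to the count).
vocab_size, n_embed = 65_536, 16
frozen = vocab_size * n_embed
print(frozen)        # 1048576
print(frozen / 1e6)  # 1.048576 -> the ~1.0M frozen figure above
```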


Embedding definition (16-bit / n_embed=16)

  • vocab_size = 65,536
  • Each token ID id ∈ [0, 65535] is represented as a 16-bit binary vector (0/1 components).
  • This 16-dim vector is expanded to d_model=1024 by simple repetition (repeat_interleave-style expansion).
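A minimal sketch of this mapping in plain Python. The least-significant-bit-first bit order is an assumption here; it matches the printed 16-dim vector for token ID 65 in the sanity-check section below, but should be confirmed against the checkpoint's actual embedding table.

```python
def token_id_to_embedding(token_id: int, n_embed: int = 16, d_model: int = 1024) -> list[float]:
    """Frozen embedding: 16-bit binary code of the token ID, expanded to d_model."""
    # Bits of the token ID, least-significant bit first (assumed bit order).
    bits = [float((token_id >> i) & 1) for i in range(n_embed)]
    scale = d_model // n_embed  # 1024 // 16 = 64: each bit repeated 64 times
    return [b for b in bits for _ in range(scale)]

e = token_id_to_embedding(65)  # 'A' -> id 65 under a Unicode-char tokenizer
print(len(e))   # 1024
print(e[:16])   # bit 0 of 65 is 1, so the first 64 values are all 1.0
```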

Tokenizer

Canonical tokenizer repository:

Note: This model repo may include additional embedding-related artifacts; for strict reproducibility, prefer loading the tokenizer from this repo.


Intended use

Research / analysis of:

  • emergent semantics with minimal, frozen token embeddings
  • monolithic vs constructive (layer-wise) training regimes
  • controlled comparisons across embedding substrates (UNICODE vs 16-bit vs trainable)

Not intended as a general-purpose assistant model. Outputs may be unreliable and the model may reflect biases present in the training data.


How to use (Transformers)


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m")
model = AutoModelForCausalLM.from_pretrained("Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m", trust_remote_code=True).to('cuda')

inputs = torch.tensor([tokenizer.encode("Write a short poem about the ocean. ")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=50,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Write a short poem about the ocean. The poem was published in 1830 by the painter John Brown and was included in the collection

inputs = torch.tensor([tokenizer.encode("Question: What is the capital of India?\nAnswer:")], dtype=torch.long, device='cuda')

outputs = model.generate(
    inputs, 
    max_new_tokens=10,
    do_sample=False
)
print(tokenizer.decode(outputs[0].tolist()))
#Question: What is the capital of India?
#Answer:Mumbai
#    </s><

Verify the 16-bit frozen binary embeddings (sanity check)

The model stores token embeddings in a frozen nn.Embedding of shape (65536, 16) whose values are strictly binary (0/1). Each 16-dim vector is then deterministically expanded to d_model=1024 via repeat_interleave with a repeat factor (scale) of 64.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/growing-transformers-model-frozen-16-bit-baseline-monolyth-181m"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

print("vocab_size:", tokenizer.vocab_size)
print("config:", {k: getattr(model.config, k) for k in ["vocab_size", "n_embed", "d_model", "n_layer", "n_head", "scale"]})

# --- 1) Show embedding matrix shape (should be 65536 x 16) ---
W = model.token_embeddings.weight.detach().cpu()
print("token_embeddings.weight shape:", tuple(W.shape))  # (65536, 16)

# --- 2) Tokenize 'A' and show its token id (should be 65 for a unicode-char tokenizer) ---
text = "A"
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(f"text={text!r}")
print("ids:", ids)
print("tokens:", tokens)

tid = ids[0]

# --- 3) Print the 16-dim vector and verify it is binary (0/1) ---
e16 = W[tid]  # shape: (16,)
print("16-dim embedding for token id", tid, ":", e16.tolist())

uniq = torch.unique(e16)
print("unique values in e16:", uniq.tolist())

is_binary = torch.all((e16 == 0) | (e16 == 1)).item()
print("is strictly binary (0/1):", is_binary)

# --- 4) Show deterministic expansion to d_model=1024 via repeat_interleave ---
scale = model.config.scale  # should be 1024 // 16 = 64
e1024 = e16.repeat_interleave(scale)  # shape: (1024,)
print("expanded embedding shape:", tuple(e1024.shape))
print("expanded embedding first 128 values:", e1024[:128].tolist())

# --- 5) Global check: all embedding weights are exactly 0/1 ---
is_binary_global = torch.all((W == 0) | (W == 1)).item()
num_non_binary = torch.numel(W) - torch.sum((W == 0) | (W == 1)).item()
print("is binary globally (0/1):", is_binary_global)
print("non-binary entries:", int(num_non_binary))

Expected output highlights (example):

  • vocab_size: 65536
  • config: {'vocab_size': 65536, 'n_embed': 16, 'd_model': 1024, 'n_layer': 9, 'n_head': 32, 'scale': 64}
  • token_embeddings.weight shape: (65536, 16)
  • text='A'
  • ids: [65]
  • tokens: ['A']
  • 16-dim embedding for token id 65 : [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • unique values in e16: [0.0, 1.0]
  • is strictly binary (0/1): True
  • expanded embedding shape: (1024,)
  • expanded embedding first 128 values: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
  • is binary globally (0/1): True
  • non-binary entries: 0

πŸ§‘β€πŸ”¬ Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

@article{bochkov2025emergent,
      title={Emergent Semantics Beyond Token Embeddings: Transformer {LM}s with Frozen Visual Unicode Representations},
      author={Andrey Bochkov},
      journal={Transactions on Machine Learning Research},
      issn={2835-8856},
      year={2025},
      url={https://openreview.net/forum?id=Odh8IynO1o},
      note={}
}

@misc{bochkov2025growingtransformersmodularcomposition,
      title={Growing Transformers: Modular Composition and Layer-wise Expansion on a Frozen Substrate}, 
      author={A. Bochkov},
      year={2025},
      eprint={2507.07129},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.07129}, 
}