Upload folder using huggingface_hub
- LICENSE.MD +54 -0
- NOTICE.MD +13 -0
- README.MD +46 -0
- __init__.py +11 -0
- config.json +44 -0
- config.py +222 -0
- generation_config.json +5 -0
- gpt.py +900 -0
- hook_utils.py +182 -0
- model.safetensors +3 -0
- modeling_circuitgpt.py +127 -0
- special_tokens_map.json +3 -0
- tokenizer.json +0 -0
- tokenizer.model +3 -0
- tokenizer_config.json +6 -0
LICENSE.MD
ADDED
@@ -0,0 +1,54 @@
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.
“License” shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

“Licensor” shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

“Legal Entity” shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, “control” means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

“You” (or “Your”) shall mean an individual or Legal Entity exercising permissions granted by this License.

“Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

“Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

“Work” shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

“Derivative Works” shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

“Contribution” shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution.”

“Contributor” shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

2. Grant of Copyright License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

3. Grant of Patent License.
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

4. Redistribution.
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a “NOTICE” text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

5. Submission of Contributions.
Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

6. Trademarks.
This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty.
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

8. Limitation of Liability.
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability.
While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS
NOTICE.MD
ADDED
@@ -0,0 +1,13 @@
Copyright 2025 OpenAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an “AS IS” BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
README.MD
ADDED
@@ -0,0 +1,46 @@
## Sparse Model from Gao et al. 2025

Weights for a sparse model from Gao et al. 2025, used for the qualitative results in the paper (related to bracket counting and variable binding). Weights for all of the other models used in the paper, as well as lightweight inference code, are available at https://github.com/openai/circuit_sparsity. In the context of that repo, this model is `csp_yolo2`.

This is a runnable, standalone Hugging Face implementation of one of those models.

The snippet below loads the locally converted HF model and tokenizer and runs a tiny generation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

if __name__ == "__main__":
    PROMPT = "def square_sum(xs):\n return sum(x * x for x in xs)\n\nsquare_sum([1, 2, 3])\n"
    tok = AutoTokenizer.from_pretrained("circuit-sparsity", trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        "circuit-sparsity",
        trust_remote_code=True,
        torch_dtype="auto",
    )
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    inputs = tok(PROMPT, return_tensors="pt", add_special_tokens=False)["input_ids"].to(
        model.device
    )

    with torch.no_grad():
        out = model.generate(
            inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
            return_dict_in_generate=False,
        )

    print("=== Prompt ===")
    print(PROMPT)
    print("\n=== Generation ===")
    print(tok.decode(out[0], skip_special_tokens=True))
```

## License

This project is licensed under the [Apache License 2.0](LICENSE.MD).
__init__.py
ADDED
@@ -0,0 +1,11 @@
from .config import CircuitGPTConfig
from .modeling_circuitgpt import CircuitGPTForCausalLM
from .tokenizer_simple2k import Simple2KTokenizerFast, export_simple2k_tokenizer

__all__ = [
    "CircuitGPTConfig",
    "CircuitGPTForCausalLM",
    "Simple2KTokenizerFast",
    "export_simple2k_tokenizer",
]
config.json
ADDED
@@ -0,0 +1,44 @@
{
  "activation_type": "gelu",
  "afrac": 0.25,
  "afrac_loctypes": "attn_in,attn_out,mlp_in,mlp_out,mlp_neuron,attn_v,attn_k,attn_q",
  "architectures": [
    "CircuitGPTForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "config.CircuitGPTConfig",
    "AutoModelForCausalLM": "modeling_circuitgpt.CircuitGPTForCausalLM"
  },
  "bias": true,
  "bigram_table_rank": null,
  "block_size": 1024,
  "bos_token_id": null,
  "d_head": 16,
  "d_mlp": 8192,
  "d_model": 2048,
  "d_pos_emb": 32,
  "dropout": 0.0,
  "dropout_cat_pos_emb": false,
  "enable_bigram_table": true,
  "eos_token_id": 2047,
  "flash": true,
  "is_decoder": true,
  "learnable_bigram_table": true,
  "ln_bias": true,
  "max_position_embeddings": 1024,
  "model_type": "circuitgpt",
  "n_head": 128,
  "n_layer": 8,
  "pad_token_id": null,
  "residual_activation_type": "identity",
  "rms_norm": true,
  "sink": true,
  "sinusoidal_cat_pos_emb": false,
  "tie_word_embeddings": false,
  "tied_unembed": false,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "unembed_rank": null,
  "use_position_embeddings": true,
  "vocab_size": 2048
}
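A few relationships in this config are worth spelling out. The sketch below is illustrative only: it assumes you run it from inside the downloaded checkpoint folder, and it simply restates values from the JSON above together with the abs-top-k rule from `gpt.py`.

```python
# Illustrative check of the dimensions implied by config.json
# (assumes the current directory is the downloaded checkpoint folder).
import json

with open("config.json") as f:
    cfg = json.load(f)

# 128 heads of width 16 reassemble into the 2048-dim residual stream.
assert cfg["n_head"] * cfg["d_head"] == cfg["d_model"] == 2048
# The MLP hidden size is the usual 4x expansion of d_model.
assert cfg["d_mlp"] == 4 * cfg["d_model"] == 8192
# The EOS id is the last vocab entry, matching the CircuitGPTConfig default.
assert cfg["eos_token_id"] == cfg["vocab_size"] - 1 == 2047
# With afrac = 0.25, the abs-top-k masking in gpt.py keeps int(0.25 * width)
# activations at each listed location type, e.g. 2048 of the 8192 MLP neurons.
print(int(cfg["afrac"] * cfg["d_mlp"]))  # 2048
```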
config.py
ADDED
@@ -0,0 +1,222 @@
from __future__ import annotations

from typing import Any

from transformers import PretrainedConfig


class CircuitGPTConfig(PretrainedConfig):
    """
    Minimal Hugging Face config wrapper around the circuit_sparsity GPTConfig.
    Only the fields exercised by the Neuronpedia runs are exposed.
    """

    model_type = "circuitgpt"

    def __init__(
        self,
        vocab_size: int = 2048,
        block_size: int = 256,
        n_layer: int = 8,
        n_head: int = 8,
        d_model: int = 1024,
        d_mlp: int | None = None,
        d_head: int | None = None,
        dropout: float = 0.0,
        bias: bool = True,
        ln_bias: bool = True,
        rms_norm: bool = True,
        activation_type: str = "gelu",
        residual_activation_type: str = "identity",
        tied_unembed: bool = False,
        unembed_rank: int | None = None,
        afrac: float | None = None,
        afrac_loctypes: str = "attn_in,attn_out,mlp_in,mlp_out",
        flash: bool = True,
        use_position_embeddings: bool = False,
        sink: bool = False,
        enable_bigram_table: bool = False,
        learnable_bigram_table: bool = False,
        bigram_table_rank: int | None = None,
        dropout_cat_pos_emb: bool = False,
        sinusoidal_cat_pos_emb: bool = False,
        d_pos_emb: int | None = None,
        auto_map: dict[str, str] | None = None,
        **kwargs: Any,
    ) -> None:
        # Drop unsupported/sensitive keys that may be present in a loaded config.
        for key in [
            "afrac_ste",
            "afrac_ste_only_non_neurons",
            "afrac_approx",
            "rtopk",
            "mup",
            "mup_width_multiplier",
            "grad_checkpointing",
            "enable_fp8_linear",
            "scale_invariance",
            "cat_pos_emb",
        ]:
            kwargs.pop(key, None)
        d_mlp = d_mlp or 4 * d_model
        d_head = d_head or d_model // n_head

        # Avoid duplicate kwargs when loading from a config dict.
        bos_token_id = kwargs.pop("bos_token_id", None)
        eos_token_id = kwargs.pop("eos_token_id", vocab_size - 1)
        pad_token_id = kwargs.pop("pad_token_id", None)

        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            pad_token_id=pad_token_id,
            **kwargs,
        )

        self.vocab_size = vocab_size
        self.block_size = block_size
        self.max_position_embeddings = block_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.d_model = d_model
        self.d_mlp = d_mlp
        self.d_head = d_head
        self.dropout = dropout
        self.bias = bias
        self.ln_bias = ln_bias
        self.rms_norm = rms_norm
        self.activation_type = activation_type
        self.residual_activation_type = residual_activation_type
        self.tied_unembed = tied_unembed
        self.unembed_rank = unembed_rank
        self.afrac = afrac
        self.afrac_loctypes = afrac_loctypes
        self.flash = flash
        self.use_position_embeddings = use_position_embeddings
        self.d_pos_emb = d_pos_emb
        self.sink = sink
        self.enable_bigram_table = enable_bigram_table
        self.learnable_bigram_table = learnable_bigram_table
        self.bigram_table_rank = bigram_table_rank
        self.dropout_cat_pos_emb = dropout_cat_pos_emb
        self.sinusoidal_cat_pos_emb = sinusoidal_cat_pos_emb
        self.is_decoder = True
        # Provide explicit auto_map entries so AutoModel/AutoConfig can locate
        # the custom classes when trust_remote_code=True on the Hub.
        self.auto_map = auto_map or {
            "AutoConfig": "config.CircuitGPTConfig",
            "AutoModelForCausalLM": "modeling_circuitgpt.CircuitGPTForCausalLM",
        }

    # ---------------------------------------------------------------------
    # Conversion helpers
    # ---------------------------------------------------------------------
    @classmethod
    def from_circuit_config(cls, circuit_config: "GPTConfig") -> "CircuitGPTConfig":  # type: ignore[name-defined]
        config_dict: dict[str, Any] = {
            "vocab_size": circuit_config.vocab_size,
            "block_size": circuit_config.block_size,
            "n_layer": circuit_config.n_layer,
            "n_head": circuit_config.n_head,
            "d_model": circuit_config.d_model,
            "d_mlp": circuit_config.d_mlp,
            "d_head": circuit_config.d_head,
            "dropout": circuit_config.dropout,
            "bias": circuit_config.bias,
            "ln_bias": circuit_config.ln_bias,
            "rms_norm": circuit_config.rms_norm,
            "activation_type": circuit_config.activation_type,
            "residual_activation_type": circuit_config.residual_activation_type,
            "tied_unembed": circuit_config.tied_unembed,
            "unembed_rank": circuit_config.unembed_rank,
            "afrac": circuit_config.afrac,
            "afrac_loctypes": circuit_config.afrac_loctypes,
            "flash": circuit_config.flash,
            "use_position_embeddings": circuit_config.d_pos_emb is not None,
            "d_pos_emb": getattr(circuit_config, "d_pos_emb", None),
            "sink": getattr(circuit_config, "sink", False),
            "enable_bigram_table": getattr(circuit_config, "enable_bigram_table", False),
            "learnable_bigram_table": getattr(circuit_config, "learnable_bigram_table", False),
            "bigram_table_rank": getattr(circuit_config, "bigram_table_rank", None),
            "dropout_cat_pos_emb": getattr(circuit_config, "dropout_cat_pos_emb", False),
            "sinusoidal_cat_pos_emb": getattr(circuit_config, "sinusoidal_cat_pos_emb", False),
        }
        return cls(**config_dict)

    def to_circuit_config(self) -> "GPTConfig":  # type: ignore[name-defined]
        from circuit_sparsity.gpt import GPTConfig as CircuitConfig

        config_kwargs: dict[str, Any] = dict(
            vocab_size=self.vocab_size,
            block_size=self.block_size,
            n_layer=self.n_layer,
            n_head=self.n_head,
            d_model=self.d_model,
            dropout=self.dropout,
            bias=self.bias,
            ln_bias=self.ln_bias,
            rms_norm=self.rms_norm,
            activation_type=self.activation_type,
            residual_activation_type=self.residual_activation_type,
            tied_unembed=self.tied_unembed,
            unembed_rank=self.unembed_rank,
            afrac=self.afrac,
            afrac_loctypes=self.afrac_loctypes,
            flash=self.flash,
            afrac_ste=False,
            afrac_ste_only_non_neurons=False,
            afrac_approx=False,
            rtopk=False,
            mup=False,
            mup_width_multiplier=None,
            grad_checkpointing=False,
            enable_fp8_linear=False,
            scale_invariance=False,
            d_mlp=self.d_mlp,
            d_head=self.d_head,
            enable_sparse_kernels=False,
            enable_bigram_table=self.enable_bigram_table,
            learnable_bigram_table=self.learnable_bigram_table,
            bigram_table_rank=self.bigram_table_rank,
            d_pos_emb=self.d_pos_emb
            if self.d_pos_emb is not None
            else (self.d_model if self.use_position_embeddings else None),
            sink=self.sink,
            dropout_cat_pos_emb=self.dropout_cat_pos_emb,
            sinusoidal_cat_pos_emb=self.sinusoidal_cat_pos_emb,
        )
        return CircuitConfig(**config_kwargs)

    def to_dict(self) -> dict[str, Any]:
        base = super().to_dict()
        data = {
            "vocab_size": self.vocab_size,
            "block_size": self.block_size,
            "n_layer": self.n_layer,
            "n_head": self.n_head,
            "d_model": self.d_model,
            "d_mlp": self.d_mlp,
            "d_head": self.d_head,
            "dropout": self.dropout,
            "bias": self.bias,
            "ln_bias": self.ln_bias,
            "rms_norm": self.rms_norm,
            "activation_type": self.activation_type,
            "residual_activation_type": self.residual_activation_type,
            "tied_unembed": self.tied_unembed,
            "unembed_rank": self.unembed_rank,
            "flash": self.flash,
            "afrac": self.afrac,
            "afrac_loctypes": self.afrac_loctypes,
            "use_position_embeddings": self.use_position_embeddings,
            "d_pos_emb": self.d_pos_emb,
            "sink": self.sink,
            "enable_bigram_table": self.enable_bigram_table,
            "learnable_bigram_table": self.learnable_bigram_table,
            "bigram_table_rank": self.bigram_table_rank,
            "dropout_cat_pos_emb": self.dropout_cat_pos_emb,
            "sinusoidal_cat_pos_emb": self.sinusoidal_cat_pos_emb,
            "auto_map": self.auto_map,
        }
        base.update(data)
        return base
generation_config.json
ADDED
@@ -0,0 +1,5 @@
{
  "_from_model_config": true,
  "eos_token_id": 2047,
  "transformers_version": "4.49.0"
}
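This file only pins the EOS token (2047, i.e. vocab_size - 1); sampling parameters such as temperature and top_p are supplied at call time, as in the README example. A minimal sketch of reading it back, where the local path is an assumption:

```python
# Minimal sketch: load the shipped generation defaults from the checkpoint folder.
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained(".")  # "." = the downloaded checkpoint directory
print(gen_cfg.eos_token_id)  # 2047
```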
gpt.py
ADDED
@@ -0,0 +1,900 @@
| 1 |
+
"""
|
| 2 |
+
Full definition of a GPT Language Model, all of it in this single file.
|
| 3 |
+
References:
|
| 4 |
+
1) the official GPT-2 TensorFlow implementation released by OpenAI:
|
| 5 |
+
https://github.com/openai/gpt-2/blob/master/src/model.py
|
| 6 |
+
2) huggingface/transformers PyTorch implementation:
|
| 7 |
+
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/modeling_gpt2.py
|
| 8 |
+
"""
|
| 9 |
+
|
| 10 |
+
import math
|
| 11 |
+
from dataclasses import dataclass
|
| 12 |
+
from typing import Literal
|
| 13 |
+
|
| 14 |
+
import torch
|
| 15 |
+
import torch.nn as nn
|
| 16 |
+
import torch.nn.functional as F
|
| 17 |
+
|
| 18 |
+
# has to be down here to avoid loading cuda too early
|
| 19 |
+
from .hook_utils import (
|
| 20 |
+
hook_namespace,
|
| 21 |
+
hook_save,
|
| 22 |
+
torch_recompute_preserving_hook_context,
|
| 23 |
+
)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
def sample_top_k(*, n: int, k: int, shape: tuple[int, ...]):
|
| 27 |
+
"""Fallback sampler used only when sparse kernels are enabled."""
|
| 28 |
+
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
|
| 29 |
+
return torch.randn(shape, device=device, dtype=torch.float32)
|
| 30 |
+
|
| 31 |
+
|
| 32 |
+
class AbsTopK(nn.Module):
|
| 33 |
+
def __init__(self, k):
|
| 34 |
+
super().__init__()
|
| 35 |
+
self.k = k
|
| 36 |
+
|
| 37 |
+
def forward(self, x):
|
| 38 |
+
vals, inds = torch.topk(x.abs(), self.k, dim=-1, sorted=False)
|
| 39 |
+
ret = torch.zeros_like(x)
|
| 40 |
+
ret.scatter_(-1, inds, x.gather(-1, inds))
|
| 41 |
+
return ret
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def barrier():
|
| 45 |
+
# stub
|
| 46 |
+
pass
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
class LayerNorm(nn.Module):
|
| 50 |
+
"""LayerNorm but with an optional bias. PyTorch doesn't support simply bias=False"""
|
| 51 |
+
|
| 52 |
+
def __init__(self, ndim, bias):
|
| 53 |
+
super().__init__()
|
| 54 |
+
self.weight = nn.Parameter(torch.ones(ndim))
|
| 55 |
+
self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
|
| 56 |
+
|
| 57 |
+
def forward(self, input):
|
| 58 |
+
return F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
class CausalSelfAttention(nn.Module):
|
| 62 |
+
def __init__(self, config):
|
| 63 |
+
super().__init__()
|
| 64 |
+
assert config.d_model % config.n_head == 0
|
| 65 |
+
# key, query, value projections for all heads, but in a batch
|
| 66 |
+
self.c_attn = config.Linear(
|
| 67 |
+
config.d_model, 3 * config.d_head * config.n_head, bias=config.bias
|
| 68 |
+
)
|
| 69 |
+
# output projection
|
| 70 |
+
self.c_proj = config.Linear(config.d_head * config.n_head, config.d_model, bias=config.bias)
|
| 71 |
+
# regularization
|
| 72 |
+
self.attn_dropout = nn.Dropout(config.dropout)
|
| 73 |
+
self.resid_dropout = nn.Dropout(config.dropout)
|
| 74 |
+
self.n_head = config.n_head
|
| 75 |
+
self.d_head = config.d_head
|
| 76 |
+
self.d_model = config.d_model
|
| 77 |
+
self.dropout = config.dropout
|
| 78 |
+
|
| 79 |
+
self.config = config
|
| 80 |
+
# flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
|
| 81 |
+
self.flash = hasattr(torch.nn.functional, "scaled_dot_product_attention") and config.flash
|
| 82 |
+
|
| 83 |
+
if self.flash:
|
| 84 |
+
self.attn_imp = (
|
| 85 |
+
SDPAWithSink(config.n_head) if config.sink else F.scaled_dot_product_attention
|
| 86 |
+
)
|
| 87 |
+
|
| 88 |
+
if not self.flash:
|
| 89 |
+
print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
|
| 90 |
+
# causal mask to ensure that attention is only applied to the left in the input sequence
|
| 91 |
+
self.register_buffer(
|
| 92 |
+
"bias",
|
| 93 |
+
torch.tril(torch.ones(config.block_size, config.block_size)).view(
|
| 94 |
+
1, 1, config.block_size, config.block_size
|
| 95 |
+
),
|
| 96 |
+
)
|
| 97 |
+
|
| 98 |
+
def forward(self, x):
|
| 99 |
+
B, T, C = x.size() # batch size, sequence length, embedding dimensionality (d_model)
|
| 100 |
+
|
| 101 |
+
x = self.config.maybe_activation_sparsity(x, "attn_in")
|
| 102 |
+
x = hook_save("act_in", x)
|
| 103 |
+
|
| 104 |
+
if self.config.debug_nans:
|
| 105 |
+
assert x.isfinite().all(), "nan in input"
|
| 106 |
+
|
| 107 |
+
# calculate query, key, values for all heads in batch and move head forward to be the batch dim
|
| 108 |
+
q, k, v = self.c_attn(x).split(self.n_head * self.d_head, dim=2)
|
| 109 |
+
|
| 110 |
+
k = self.config.maybe_activation_sparsity(k, "attn_k")
|
| 111 |
+
q = self.config.maybe_activation_sparsity(q, "attn_q")
|
| 112 |
+
v = self.config.maybe_activation_sparsity(v, "attn_v")
|
| 113 |
+
|
| 114 |
+
k = hook_save("k", k) # (B, T, n_head * d_head)
|
| 115 |
+
q = hook_save("q", q) # (B, T, n_head * d_head)
|
| 116 |
+
v = hook_save("v", v) # (B, T, n_head * d_head)
|
| 117 |
+
|
| 118 |
+
k = k.view(B, T, self.n_head, self.d_head).transpose(1, 2) # (B, nh, T, hs)
|
| 119 |
+
q = q.view(B, T, self.n_head, self.d_head).transpose(1, 2) # (B, nh, T, hs)
|
| 120 |
+
v = v.view(B, T, self.n_head, self.d_head).transpose(1, 2) # (B, nh, T, hs)
|
| 121 |
+
|
| 122 |
+
if self.config.debug_nans:
|
| 123 |
+
assert q.isfinite().all(), "nan in query"
|
| 124 |
+
assert k.isfinite().all(), "nan in key"
|
| 125 |
+
assert v.isfinite().all(), "nan in value"
|
| 126 |
+
|
| 127 |
+
attention_scale = 1.0 / math.sqrt(k.size(-1))
|
| 128 |
+
|
| 129 |
+
# causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
|
| 130 |
+
if self.flash:
|
| 131 |
+
# efficient attention using Flash Attention CUDA kernels
|
| 132 |
+
y = self.attn_imp(
|
| 133 |
+
q,
|
| 134 |
+
k,
|
| 135 |
+
v,
|
| 136 |
+
dropout_p=self.dropout if self.training else 0,
|
| 137 |
+
is_causal=True,
|
| 138 |
+
scale=attention_scale,
|
| 139 |
+
)
|
| 140 |
+
else:
|
| 141 |
+
# manual implementation of attention
|
| 142 |
+
att = (q @ k.transpose(-2, -1)) * attention_scale
|
| 143 |
+
att = att.masked_fill(
|
| 144 |
+
self.bias[:, :, :T, :T] == 0, torch.finfo(att.dtype).min
|
| 145 |
+
) # float("-inf"))
|
| 146 |
+
|
| 147 |
+
att = F.softmax(att, dim=-1)
|
| 148 |
+
att = self.attn_dropout(att)
|
| 149 |
+
y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
|
| 150 |
+
|
| 151 |
+
if self.config.debug_nans:
|
| 152 |
+
assert y.isfinite().all(), "nan in attention output"
|
| 153 |
+
|
| 154 |
+
y = (
|
| 155 |
+
y.transpose(1, 2).contiguous().view(B, T, self.n_head * self.d_head)
|
| 156 |
+
) # re-assemble all head outputs side by side
|
| 157 |
+
|
| 158 |
+
# y = self.config.maybe_activation_sparsity(y)
|
| 159 |
+
y = hook_save("y", y) # (B, T, n_head * d_head)
|
| 160 |
+
|
| 161 |
+
# output projection
|
| 162 |
+
y = self.resid_dropout(self.c_proj(y))
|
| 163 |
+
|
| 164 |
+
if self.config.debug_nans:
|
| 165 |
+
assert y.isfinite().all(), "nan in attention output 2"
|
| 166 |
+
|
| 167 |
+
y = self.config.maybe_activation_sparsity(y, "attn_out")
|
| 168 |
+
return y
|
| 169 |
+
|
| 170 |
+
|
| 171 |
+
class MLP(nn.Module):
|
| 172 |
+
def __init__(self, config):
|
| 173 |
+
super().__init__()
|
| 174 |
+
self.config = config
|
| 175 |
+
self.c_fc = config.Linear(config.d_model, config.d_mlp, bias=config.bias)
|
| 176 |
+
self.act_fn = {
|
| 177 |
+
"gelu": nn.GELU(),
|
| 178 |
+
"relu": nn.ReLU(),
|
| 179 |
+
}[config.activation_type]
|
| 180 |
+
self.c_proj = config.Linear(config.d_mlp, config.d_model, bias=config.bias)
|
| 181 |
+
self.dropout = nn.Dropout(config.dropout)
|
| 182 |
+
|
| 183 |
+
def forward(self, x):
|
| 184 |
+
x = self.config.maybe_activation_sparsity(x, "mlp_in")
|
| 185 |
+
x = hook_save("act_in", x)
|
| 186 |
+
|
| 187 |
+
if self.config.debug_nans:
|
| 188 |
+
assert x.isfinite().all(), "nan in mlp input"
|
| 189 |
+
|
| 190 |
+
x = self.c_fc(x)
|
| 191 |
+
|
| 192 |
+
if self.config.debug_nans:
|
| 193 |
+
assert x.isfinite().all(), "nan in mlp after c_fc"
|
| 194 |
+
|
| 195 |
+
x = self.act_fn(x)
|
| 196 |
+
x = self.config.maybe_activation_sparsity(x, "mlp_neuron")
|
| 197 |
+
x = hook_save("post_act", x)
|
| 198 |
+
|
| 199 |
+
if self.config.debug_nans:
|
| 200 |
+
assert x.isfinite().all(), "nan in mlp after act"
|
| 201 |
+
|
| 202 |
+
x = self.c_proj(x)
|
| 203 |
+
|
| 204 |
+
if self.config.debug_nans:
|
| 205 |
+
assert x.isfinite().all(), "nan in mlp after c_proj"
|
| 206 |
+
x = self.dropout(x)
|
| 207 |
+
|
| 208 |
+
x = self.config.maybe_activation_sparsity(x, "mlp_out")
|
| 209 |
+
return x
|
| 210 |
+
|
| 211 |
+
|
| 212 |
+
class SDPAWithSink(nn.Module):
|
| 213 |
+
"""
|
| 214 |
+
Adds a learnable denominator-only term ("attention sink") to SDPA by
|
| 215 |
+
concatenating a dummy KV slot whose logit is b and whose V is zero.
|
| 216 |
+
"""
|
| 217 |
+
|
| 218 |
+
def __init__(self, num_heads: int, init_logit: float = 0.0):
|
| 219 |
+
super().__init__()
|
| 220 |
+
shape = (num_heads,)
|
| 221 |
+
self.sink_logit = nn.Parameter(torch.full(shape, init_logit))
|
| 222 |
+
|
| 223 |
+
def forward(
|
| 224 |
+
self,
|
| 225 |
+
q: torch.Tensor, # (B, H, Lq, D)
|
| 226 |
+
k: torch.Tensor, # (B, H, Lk, D)
|
| 227 |
+
v: torch.Tensor, # (B, H, Lk, Dv)
|
| 228 |
+
*,
|
| 229 |
+
dropout_p: float = 0.0,
|
| 230 |
+
is_causal: bool = False,
|
| 231 |
+
scale: float | None = None,
|
| 232 |
+
) -> torch.Tensor:
|
| 233 |
+
B, H, Lq, D = q.shape
|
| 234 |
+
_, _, Lk, _ = k.shape
|
| 235 |
+
Dv = v.size(-1)
|
| 236 |
+
|
| 237 |
+
# 1) Prepend a dummy KV slot (always visible)
|
| 238 |
+
k_sink = torch.zeros((B, H, 1, D), dtype=q.dtype, device=q.device)
|
| 239 |
+
v_sink = torch.zeros((B, H, 1, Dv), dtype=v.dtype, device=v.device)
|
| 240 |
+
k_aug = torch.cat([k_sink, k], dim=2) # (B,H,Lk+1,D)
|
| 241 |
+
v_aug = torch.cat([v_sink, v], dim=2) # (B,H,Lk+1,Dv)
|
| 242 |
+
|
| 243 |
+
# 2) Build shifted causal allow-mask over keys (columns 1..), always allow col 0 (sink)
|
| 244 |
+
# allow: 1 where attending is allowed, 0 where disallowed
|
| 245 |
+
# For real keys: allow[i, j+1] = 1 if j <= i else 0 (lower-triangular)
|
| 246 |
+
allow = torch.zeros((Lq, Lk + 1), dtype=torch.bool, device=q.device)
|
| 247 |
+
allow[:, 0] = True # sink column always on
|
| 248 |
+
# lower-triangular for real keys shifted by +1
|
| 249 |
+
real = torch.ones((Lq, Lk), dtype=torch.bool, device=q.device).tril()
|
| 250 |
+
allow[:, 1:] = real
|
| 251 |
+
|
| 252 |
+
# Broadcast to (B,H,Lq,Lk+1)
|
| 253 |
+
allow = allow.view(1, 1, Lq, Lk + 1).expand(B, H, Lq, Lk + 1)
|
| 254 |
+
|
| 255 |
+
# 3) Turn it into an additive mask. 0 for allowed, -inf for disallowed
|
| 256 |
+
neg_inf = torch.finfo(q.dtype).min
|
| 257 |
+
base_mask = torch.where(
|
| 258 |
+
allow,
|
| 259 |
+
torch.zeros((), dtype=q.dtype, device=q.device),
|
| 260 |
+
torch.full((), neg_inf, dtype=q.dtype, device=q.device),
|
| 261 |
+
) # (B,H,Lq,Lk+1)
|
| 262 |
+
|
| 263 |
+
# 4) Add learnable sink bias b to column 0 (per head or shared)
|
| 264 |
+
if self.sink_logit.numel() == H:
|
| 265 |
+
b = self.sink_logit.to(dtype=q.dtype, device=q.device).view(1, H, 1, 1) # (1,H,1,1)
|
| 266 |
+
else:
|
| 267 |
+
b = self.sink_logit.to(dtype=q.dtype, device=q.device).view(1, 1, 1, 1) # (1,1,1,1)
|
| 268 |
+
|
| 269 |
+
sink_bias_mask = torch.zeros((1, 1, 1, Lk + 1), dtype=q.dtype, device=q.device)
|
| 270 |
+
sink_bias_mask[..., 0] = 1.0
|
| 271 |
+
attn_mask = base_mask + sink_bias_mask * b # (B,H,Lq,Lk+1)
|
| 272 |
+
|
| 273 |
+
# 5) SDPA with our custom mask; keep is_causal=False to avoid double-masking
|
| 274 |
+
out = F.scaled_dot_product_attention(
|
| 275 |
+
q,
|
| 276 |
+
k_aug,
|
| 277 |
+
v_aug,
|
| 278 |
+
attn_mask=attn_mask,
|
| 279 |
+
dropout_p=dropout_p,
|
| 280 |
+
is_causal=False, # important
|
| 281 |
+
scale=scale,
|
| 282 |
+
)
|
| 283 |
+
return out
|
| 284 |
+
|
| 285 |
+
|
| 286 |
+
class Block(nn.Module):
|
| 287 |
+
# block exactly satisfies the invariant that forward = forward_mlp_block . forward_attn_block
|
| 288 |
+
def __init__(self, config):
|
| 289 |
+
super().__init__()
|
| 290 |
+
self.config = config
|
| 291 |
+
|
| 292 |
+
self.ln_1 = (
|
| 293 |
+
nn.RMSNorm(config.d_model)
|
| 294 |
+
if config.rms_norm
|
| 295 |
+
else LayerNorm(config.d_model, bias=config.ln_bias)
|
| 296 |
+
)
|
| 297 |
+
self.attn = CausalSelfAttention(config)
|
| 298 |
+
self.ln_2 = (
|
| 299 |
+
nn.RMSNorm(config.d_model)
|
| 300 |
+
if config.rms_norm
|
| 301 |
+
else LayerNorm(config.d_model, bias=config.ln_bias)
|
| 302 |
+
)
|
| 303 |
+
self.mlp = MLP(config)
|
| 304 |
+
|
| 305 |
+
def forward_attn_block(self, x):
|
| 306 |
+
x = hook_save("resid_in", x)
|
| 307 |
+
|
| 308 |
+
if self.config.debug_nans:
|
| 309 |
+
assert x.isfinite().all(), "nan in blk input"
|
| 310 |
+
|
| 311 |
+
with hook_namespace("attn"):
|
| 312 |
+
if self.config.grad_checkpointing:
|
| 313 |
+
x = x + hook_save(
|
| 314 |
+
"resid_delta",
|
| 315 |
+
torch_recompute_preserving_hook_context(
|
| 316 |
+
lambda x: self.attn(self.ln_1(x)), x, use_reentrant=False
|
| 317 |
+
),
|
| 318 |
+
)
|
| 319 |
+
else:
|
| 320 |
+
x = x + hook_save("resid_delta", self.attn(self.ln_1(x)))
|
| 321 |
+
|
| 322 |
+
if self.config.residual_activation_type == "relu":
|
| 323 |
+
x = torch.relu(x)
|
| 324 |
+
x = self.config.maybe_activation_sparsity(x, "resid_post_attn")
|
| 325 |
+
|
| 326 |
+
return x
|
| 327 |
+
|
| 328 |
+
def forward_mlp_block(self, x):
|
| 329 |
+
x = hook_save("resid_mid", x)
|
| 330 |
+
with hook_namespace("mlp"):
|
| 331 |
+
if self.config.grad_checkpointing:
|
| 332 |
+
x = x + hook_save(
|
| 333 |
+
"resid_delta",
|
| 334 |
+
torch_recompute_preserving_hook_context(
|
| 335 |
+
lambda x: self.mlp(self.ln_2(x)), x, use_reentrant=False
|
| 336 |
+
),
|
| 337 |
+
)
|
| 338 |
+
else:
|
| 339 |
+
x = x + hook_save("resid_delta", self.mlp(self.ln_2(x)))
|
| 340 |
+
|
| 341 |
+
if self.config.residual_activation_type == "relu":
|
| 342 |
+
x = torch.relu(x)
|
| 343 |
+
x = self.config.maybe_activation_sparsity(x, "resid_post_mlp")
|
| 344 |
+
return x
|
| 345 |
+
|
| 346 |
+
def forward(self, x):
|
| 347 |
+
x = self.forward_attn_block(x)
|
| 348 |
+
x = self.forward_mlp_block(x)
|
| 349 |
+
return x
|
| 350 |
+
|
| 351 |
+
|
| 352 |
+
class CausalSelfAttentionCatPosEmb(CausalSelfAttention):
|
| 353 |
+
def __init__(self, config):
|
| 354 |
+
# initialize base attention with standard shapes, we'll override projections
|
| 355 |
+
super().__init__(config)
|
| 356 |
+
assert config.d_model % config.n_head == 0
|
| 357 |
+
# key, query, value projections for all heads, but in a batch
|
| 358 |
+
self.c_attn = config.Linear(
|
| 359 |
+
config.d_model_in, 3 * config.d_head * config.n_head, bias=config.bias
|
| 360 |
+
)
|
| 361 |
+
# output projection
|
| 362 |
+
self.c_proj = config.Linear(config.d_head * config.n_head, config.d_model, bias=config.bias)
|
| 363 |
+
# regularization
|
| 364 |
+
self.attn_dropout = nn.Dropout(config.dropout)
|
| 365 |
+
self.resid_dropout = nn.Dropout(config.dropout)
|
| 366 |
+
self.n_head = config.n_head
|
| 367 |
+
self.d_head = config.d_head
|
| 368 |
+
self.d_model_in = config.d_model_in
|
| 369 |
+
self.d_model = config.d_model
|
| 370 |
+
self.dropout = config.dropout
|
| 371 |
+
self.config = config
|
| 372 |
+
# flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
|
| 373 |
+
self.flash = hasattr(torch.nn.functional, "scaled_dot_product_attention") and config.flash
|
| 374 |
+
|
| 375 |
+
if not self.flash:
|
| 376 |
+
print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
|
| 377 |
+
# causal mask to ensure that attention is only applied to the left in the input sequence
|
| 378 |
+
self.register_buffer(
|
| 379 |
+
"bias",
|
| 380 |
+
torch.tril(torch.ones(config.block_size, config.block_size)).view(
|
| 381 |
+
1, 1, config.block_size, config.block_size
|
| 382 |
+
),
|
| 383 |
+
)
|
| 384 |
+
|
| 385 |
+
def forward(self, x, pos_emb_to_cat):
|
| 386 |
+
# Broadcast pos emb over batch if provided as shape [1, T, C]
|
| 387 |
+
if pos_emb_to_cat is not None and pos_emb_to_cat.size(0) == 1 and x.size(0) != 1:
|
| 388 |
+
pos_emb_to_cat = pos_emb_to_cat.expand(x.size(0), -1, -1)
|
| 389 |
+
x = torch.cat([x, pos_emb_to_cat], dim=-1)
|
| 390 |
+
return super().forward(x)
|
| 391 |
+
|
| 392 |
+
|
| 393 |
+
class MLPCatPosEmb(MLP):
|
| 394 |
+
def __init__(self, config):
|
| 395 |
+
# initialize base MLP, we'll override the projections to match cat shapes
|
| 396 |
+
super().__init__(config)
|
| 397 |
+
self.config = config
|
| 398 |
+
self.c_fc = config.Linear(config.d_model_in, config.d_mlp, bias=config.bias)
|
| 399 |
+
self.act_fn = {
|
| 400 |
+
"gelu": nn.GELU(),
|
| 401 |
+
"relu": nn.ReLU(),
|
| 402 |
+
}[config.activation_type]
|
| 403 |
+
self.c_proj = config.Linear(config.d_mlp, config.d_model, bias=config.bias)
|
| 404 |
+
self.dropout = nn.Dropout(config.dropout)
|
| 405 |
+
|
| 406 |
+
def forward(self, x, pos_emb_to_cat):
|
| 407 |
+
# Broadcast pos emb over batch if provided as shape [1, T, C]
|
| 408 |
+
if pos_emb_to_cat is not None and pos_emb_to_cat.size(0) == 1 and x.size(0) != 1:
|
| 409 |
+
pos_emb_to_cat = pos_emb_to_cat.expand(x.size(0), -1, -1)
|
| 410 |
+
x = torch.cat([x, pos_emb_to_cat], dim=-1)
|
| 411 |
+
x = super().forward(x)
|
| 412 |
+
return x
|
| 413 |
+
|
| 414 |
+
|
| 415 |
+
class BlockCatPosEmb(Block):
|
| 416 |
+
# block exactly satisfies the invariant that forward = forward_mlp_block . forward_attn_block
|
| 417 |
+
def __init__(self, config):
|
| 418 |
+
# initialize base Block to get ln_1/ln_2 and other invariants
|
| 419 |
+
super().__init__(config)
|
| 420 |
+
self.ln_p1 = (
|
| 421 |
+
nn.RMSNorm(config.d_pos_emb)
|
| 422 |
+
if config.rms_norm
|
| 423 |
+
else LayerNorm(config.d_pos_emb, bias=config.ln_bias)
|
| 424 |
+
)
|
| 425 |
+
self.ln_p2 = (
|
| 426 |
+
nn.RMSNorm(config.d_pos_emb)
|
| 427 |
+
if config.rms_norm
|
| 428 |
+
else LayerNorm(config.d_pos_emb, bias=config.ln_bias)
|
| 429 |
+
)
|
| 430 |
+
self.attn = CausalSelfAttentionCatPosEmb(config)
|
| 431 |
+
self.mlp = MLPCatPosEmb(config)
|
| 432 |
+
|
| 433 |
+
def forward_attn_block(self, x, p):
|
| 434 |
+
x = hook_save("resid_in", x)
|
| 435 |
+
|
| 436 |
+
if self.config.debug_nans:
|
| 437 |
+
assert x.isfinite().all(), "nan in blk input"
|
| 438 |
+
|
| 439 |
+
with hook_namespace("attn"):
|
| 440 |
+
if self.config.grad_checkpointing:
|
| 441 |
+
x = x + hook_save(
|
| 442 |
+
"resid_delta",
|
| 443 |
+
torch_recompute_preserving_hook_context(
|
| 444 |
+
lambda x, p: self.attn(self.ln_1(x), self.ln_p1(p)),
|
| 445 |
+
x,
|
| 446 |
+
p,
|
| 447 |
+
use_reentrant=False,
|
| 448 |
+
),
|
| 449 |
+
)
|
| 450 |
+
else:
|
| 451 |
+
x = x + hook_save("resid_delta", self.attn(self.ln_1(x), self.ln_p1(p)))
|
| 452 |
+
|
| 453 |
+
if self.config.residual_activation_type == "relu":
|
| 454 |
+
x = torch.relu(x)
|
| 455 |
+
x = self.config.maybe_activation_sparsity(x, "resid_post_attn")
|
| 456 |
+
|
| 457 |
+
return x
|
| 458 |
+
|
| 459 |
+
def forward_mlp_block(self, x, p):
|
| 460 |
+
x = hook_save("resid_mid", x)
|
| 461 |
+
with hook_namespace("mlp"):
|
| 462 |
+
if self.config.grad_checkpointing:
|
| 463 |
+
x = x + hook_save(
|
| 464 |
+
"resid_delta",
|
| 465 |
+
torch_recompute_preserving_hook_context(
|
| 466 |
+
lambda x, p: self.mlp(self.ln_2(x), self.ln_p2(p)),
|
| 467 |
+
x,
|
| 468 |
+
p,
|
| 469 |
+
use_reentrant=False,
|
| 470 |
+
),
|
| 471 |
+
)
|
| 472 |
+
else:
|
| 473 |
+
x = x + hook_save("resid_delta", self.mlp(self.ln_2(x), self.ln_p2(p)))
|
| 474 |
+
|
| 475 |
+
if self.config.residual_activation_type == "relu":
|
| 476 |
+
x = torch.relu(x)
|
| 477 |
+
x = self.config.maybe_activation_sparsity(x, "resid_post_mlp")
|
| 478 |
+
return x
|
| 479 |
+
|
| 480 |
+
def forward(self, x, pos_emb_to_cat):
|
| 481 |
+
x = self.forward_attn_block(x, pos_emb_to_cat)
|
| 482 |
+
x = self.forward_mlp_block(x, pos_emb_to_cat)
|
| 483 |
+
return x
|
| 484 |
+
|
| 485 |
+
|
| 486 |
+
@dataclass
|
| 487 |
+
class GPTConfig:
|
| 488 |
+
block_size: int = 1024
|
| 489 |
+
vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency # TODO: FLAG FOR ACHY
|
| 490 |
+
n_layer: int = 12
|
| 491 |
+
n_head: int = 12
|
| 492 |
+
d_head: int | None = None # defaults to d_model // n_head
|
| 493 |
+
d_model: int = 768
|
| 494 |
+
dropout: float = 0.0
|
| 495 |
+
bias: bool = (
|
| 496 |
+
True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
|
| 497 |
+
)
|
| 498 |
+
ln_bias: bool = (
|
| 499 |
+
True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster
|
| 500 |
+
)
|
| 501 |
+
rms_norm: bool = False # use RMSNorm instead of LayerNorm
|
| 502 |
+
residual_activation_type: Literal["identity", "relu"] = "identity"
|
| 503 |
+
activation_type: Literal["gelu", "relu"] = "gelu"
|
| 504 |
+
afrac: float | None = None # fraction of activations to keep
|
| 505 |
+
afrac_loctypes: str = "attn_in,attn_out,mlp_in,mlp_out"
|
| 506 |
+
debug_nans: bool = False
|
| 507 |
+
tied_unembed: bool = True
|
| 508 |
+
|
| 509 |
+
tokenizer_name: str = "tinypython_2k"
|
| 510 |
+
|
| 511 |
+
grad_checkpointing: bool = True
|
| 512 |
+
d_mlp: int | None = None # multiplier for MLP hidden layer size
|
| 513 |
+
|
| 514 |
+
enable_bigram_table: bool = False
|
| 515 |
+
learnable_bigram_table: bool = False
|
| 516 |
+
d_pos_emb: int | None = None
|
| 517 |
+
dropout_cat_pos_emb: bool = False
|
| 518 |
+
sinusoidal_cat_pos_emb: bool = False
|
| 519 |
+
enable_sparse_kernels: bool = False
|
| 520 |
+
|
| 521 |
+
flash: bool = True
|
| 522 |
+
sink: bool = False
|
| 523 |
+
|
| 524 |
+
@property
|
| 525 |
+
def cat_pos_emb(self):
|
| 526 |
+
return self.d_pos_emb is not None
|
| 527 |
+
|
| 528 |
+
@property
|
| 529 |
+
def d_model_in(self):
|
| 530 |
+
return self.d_model + self.d_pos_emb if self.cat_pos_emb else self.d_model
|
| 531 |
+
|
| 532 |
+
def __post_init__(self):
|
| 533 |
+
assert self.d_model % self.n_head == 0
|
| 534 |
+
assert self.residual_activation_type in ["identity", "relu"]
|
| 535 |
+
assert self.activation_type in ["gelu", "relu"]
|
| 536 |
+
|
| 537 |
+
if self.d_mlp is None:
|
| 538 |
+
self.d_mlp = 4 * self.d_model
|
| 539 |
+
if self.d_head is None:
|
| 540 |
+
self.d_head = self.d_model // self.n_head
|
| 541 |
+
|
| 542 |
+
@property
|
| 543 |
+
def Linear(self):
|
| 544 |
+
return nn.Linear
|
| 545 |
+
|
| 546 |
+
def maybe_activation_sparsity(self, x, loctype):
|
| 547 |
+
if self.afrac is not None and loctype in self.afrac_loctypes.split(","):
|
| 548 |
+
|
| 549 |
+
def keep_abstopk(x, k):
|
| 550 |
+
ret = torch.zeros_like(x)
|
| 551 |
+
_, topk_inds = torch.topk(x.abs(), k, dim=-1, sorted=False)
|
| 552 |
+
ret.scatter_(-1, topk_inds, x.gather(-1, topk_inds))
|
| 553 |
+
return ret
|
| 554 |
+
|
| 555 |
+
x = keep_abstopk(
|
| 556 |
+
x,
|
| 557 |
+
k=int(self.afrac * x.shape[-1]),
|
| 558 |
+
)
|
| 559 |
+
|
| 560 |
+
return x
|
| 561 |
+
|
| 562 |
+
|
| 563 |
+
class GPT(nn.Module):
|
| 564 |
+
    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        if config.cat_pos_emb:
            block_cls = BlockCatPosEmb
        else:
            block_cls = Block

        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.vocab_size, config.d_model),
                wpe=nn.Embedding(config.block_size, config.d_pos_emb or config.d_model),
                drop=nn.Dropout(config.dropout),
                h=nn.ModuleList([block_cls(config) for _ in range(config.n_layer)]),
                ln_f=nn.RMSNorm(config.d_model)
                if config.rms_norm
                else LayerNorm(config.d_model, bias=config.ln_bias),
            )
        )

        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        self.register_buffer(
            "final_logits_bias", torch.zeros(config.vocab_size, dtype=torch.float32)
        )

        if self.config.enable_bigram_table:
            if self.config.learnable_bigram_table:
                # HACK: low rank to fit in mem
                self.bigram_table = nn.Parameter(
                    torch.zeros(config.vocab_size, config.vocab_size, dtype=torch.float32)
                )
            else:
                self.register_buffer(
                    "bigram_table",
                    torch.zeros(config.vocab_size, config.vocab_size, dtype=torch.float32),
                )
        else:
            self.bigram_table = None

        # Never tie embeddings/unembed to avoid accidental aliasing in exports.
        config.tied_unembed = False

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith("c_proj.weight"):
                if p.is_sparse:
                    num_nonzero = p._nnz()
                    p._values().data = (
                        sample_top_k(n=p.numel(), k=num_nonzero, shape=(num_nonzero,))
                        * 0.02
                        / math.sqrt(2 * config.n_layer)
                    )
                else:
                    torch.nn.init.normal_(p, mean=0.0, std=0.02 / math.sqrt(2 * config.n_layer))

        # If requested, initialize positional embeddings with fixed sinusoids and freeze
        if config.cat_pos_emb and config.sinusoidal_cat_pos_emb:
            assert config.d_pos_emb is not None, (
                "sinusoidal_cat_pos_emb requires cat_pos_emb (d_pos_emb must be set)"
            )
            with torch.no_grad():
                T = config.block_size
                D = config.d_pos_emb
                device = self.transformer.wpe.weight.device
                dtype = self.transformer.wpe.weight.dtype
                positions = torch.arange(T, device=device, dtype=dtype).unsqueeze(1)  # [T, 1]
                d_half = max(1, D // 2)
                # periods from 4 tokens up to block_size tokens (log-spaced)
                T_float = float(T)
                p_min = 4.0
                p_max = max(p_min, T_float)
                periods = torch.logspace(
                    math.log10(p_min), math.log10(p_max), steps=d_half, device=device, dtype=dtype
                )
                freqs = 2 * math.pi / periods  # [d_half]
                angles = positions * freqs  # [T, d_half]
                sinv = torch.sin(angles)
                cosv = torch.cos(angles)
                enc = torch.cat([sinv, cosv], dim=1)  # [T, 2*d_half]
                if enc.shape[1] < D:
                    pad = torch.zeros(T, D - enc.shape[1], device=device, dtype=dtype)
                    enc = torch.cat([enc, pad], dim=1)
                elif enc.shape[1] > D:
                    enc = enc[:, :D]
                self.transformer.wpe.weight.copy_(enc)
                self.transformer.wpe.weight.requires_grad_(False)

        # report number of parameters
        print("number of parameters: %.2fM" % (self.get_num_params() / 1e6,))

    def get_num_params(self, non_embedding=True):
        """
        Return the number of parameters in the model.
        For non-embedding count (default), the position embeddings get subtracted.
        The token embeddings would too, except due to the parameter sharing these
        params are actually used as weights in the final layer, so we include them.
        """
        n_params = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n_params -= self.transformer.wpe.weight.numel()
        return n_params

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None, include_resid_mid=False):
        device = idx.device
        b, t = idx.size()

        assert t <= self.config.block_size, (
            f"Cannot forward sequence of length {t}, block size is only {self.config.block_size}"
        )

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx)  # token embeddings of shape (b, t, d_model)
        pos_emb = self.transformer.wpe.weight[:t].unsqueeze(0)  # position embeddings of shape (1, t, d_pos)
        if self.config.cat_pos_emb:
            x = self.transformer.drop(tok_emb)
        else:
            x = self.transformer.drop(tok_emb + pos_emb)

        if self.config.debug_nans:
            assert x.isfinite().all(), "nan in initial post-embedding"

        if self.config.enable_bigram_table:
            # add bigram table to the logits bias
            additional_logits_bias = F.embedding(idx, self.bigram_table, padding_idx=-1)
            additional_logits_bias = additional_logits_bias.to(x.dtype)
        else:
            additional_logits_bias = None

        if self.config.cat_pos_emb:
            pos_emb_to_cat = pos_emb
            if self.config.dropout_cat_pos_emb:
                pos_emb_to_cat = self.transformer.drop(pos_emb)
        else:
            pos_emb_to_cat = None

        return self.forward_tail(
            x,
            n=0,
            targets=targets,
            additional_logits_bias=additional_logits_bias,
            include_resid_mid=include_resid_mid,  # this is hacky; we should just switch to using hooks
            pos_emb_to_cat=pos_emb_to_cat,
        )

    def forward_tail(
        self,
        x,
        n,
        targets=None,
        additional_logits_bias=None,
        include_resid_mid=False,
        pos_emb_to_cat=None,
    ):
        hs = []
        blks = list(self.transformer.h)

        if include_resid_mid:
            blks = list_join(
                [
                    [
                        blk.forward_attn_block,
                        blk.forward_mlp_block,
                    ]
                    for blk in blks
                ]
            )

        assert n <= len(blks)

        for i, block_fn in enumerate(blks[n:]):
            global curlayer
            curlayer = i
            with hook_namespace(f"{i // 2}") if include_resid_mid else hook_namespace(f"{i}"):
                hs.append(x)
                if self.config.cat_pos_emb:
                    x = block_fn(x, pos_emb_to_cat)
                else:
                    x = block_fn(x)

        x = hook_save("final_resid", x)
        x = self.transformer.ln_f(x)

        logits = (
            self.lm_head(x)
            + self.final_logits_bias
            + (additional_logits_bias if additional_logits_bias is not None else 0)
        )
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1
            )
        else:
            loss = torch.zeros(1, device=x.device)

        return logits, loss, hs  # hs is deprecated in favor of hook stuff

    def crop_block_size(self, block_size):
        # model surgery to decrease the block size if necessary
        # e.g. we may load the GPT2 pretrained model checkpoint (block size 1024)
        # but want to use a smaller block size for some smaller, simpler model
        assert block_size <= self.config.block_size
        self.config.block_size = block_size
        self.transformer.wpe.weight = nn.Parameter(self.transformer.wpe.weight[:block_size])
        for block in self.transformer.h:
            if hasattr(block.attn, "bias"):
                block.attn.bias = block.attn.bias[:, :, :block_size, :block_size]

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = (
                idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size :]
            )
            # forward the model to get the logits for the index in the sequence
            logits, _, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, -1:]] = -float("Inf")
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

    def is_mlp_param(self, p):
        return id(p) in list_join(
            [
                [
                    id(self.transformer.h[i].mlp.c_fc.weight),
                    id(self.transformer.h[i].mlp.c_proj.weight),
                ]
                for i in range(self.config.n_layer)
            ]
        )

    def is_param_embed(self, p):
        return p is self.transformer.wte.weight or p is self.transformer.wpe.weight

    def is_attn_param(self, p):
        return id(p) in list_join(
            [
                [
                    id(self.transformer.h[i].attn.c_attn.weight),
                    id(self.transformer.h[i].attn.c_proj.weight),
                ]
                for i in range(self.config.n_layer)
            ]
        )

    def is_bias(self, p):
        return id(p) in list_join(
            [
                [
                    id(self.transformer.h[i].attn.c_attn.bias),
                    id(self.transformer.h[i].attn.c_proj.bias),
                    id(self.transformer.h[i].mlp.c_fc.bias),
                    id(self.transformer.h[i].mlp.c_proj.bias),
                ]
                for i in range(self.config.n_layer)
            ]
        )

    def is_ln_param(self, p):
        return id(p) in list_join(
            [
                [
                    id(self.transformer.h[i].ln_1.weight),
                    id(self.transformer.h[i].ln_2.weight),
                ]
                for i in range(self.config.n_layer)
            ]
        ) + [
            id(self.transformer.ln_f.weight),
        ]

    def is_sparse_param(self, p, dense_embeddings=None, dense_unembed=None, dense_biases=None):
        # if these params aren't specified, then still give answers, but only for uncontroversial params

        if dense_embeddings is None:
            assert p is not self.transformer.wte.weight and p is not self.transformer.wpe.weight
        if dense_unembed is None:
            assert p is not self.lm_head.weight
        if dense_biases is None:
            assert not self.is_bias(p)

        if p is self.transformer.wte.weight or p is self.transformer.wpe.weight:
            return not dense_embeddings
        if p is self.lm_head.weight:
            return not dense_unembed
        if self.is_bias(p):
            return not dense_biases

        return id(p) in list_join(
            [
                [
                    id(self.transformer.h[i].attn.c_attn.weight),
                    id(self.transformer.h[i].attn.c_proj.weight),
                    id(self.transformer.h[i].mlp.c_fc.weight),
                    id(self.transformer.h[i].mlp.c_proj.weight),
                ]
                for i in range(self.config.n_layer)
            ]
        )


def list_join(xss: list[list]) -> list:
    """monadic join for lists"""
    return [x for xs in xss for x in xs]
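For orientation, here is a minimal sampling sketch against the GPT module above. The config dataclass and most of its fields are defined earlier in gpt.py and are not part of this hunk, so the GPTConfig name, the import path, and the field values below are assumptions; only generate() and its arguments come from the code above.

# Hypothetical usage sketch: GPTConfig and the constructor arguments are assumed,
# since the config dataclass is defined earlier in gpt.py and not shown here.
import torch

from gpt import GPT, GPTConfig  # import path depends on how the repo is vendored

config = GPTConfig(
    vocab_size=2048,
    block_size=256,
    n_layer=4,
    n_head=4,
    d_model=256,
)
model = GPT(config).eval()

prompt = torch.zeros((1, 1), dtype=torch.long)  # a single placeholder token id
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=32, temperature=0.8, top_k=40)
print(out.shape)  # (1, 33): the prompt plus 32 sampled tokens; generate() re-feeds the full sequence (no KV cache)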
hook_utils.py
ADDED
@@ -0,0 +1,182 @@
"""Self-contained subset of :mod:`circuit_sparsity.hook_utils` for inference builds.
|
| 2 |
+
|
| 3 |
+
The full module has no exotic dependencies, but mirroring the definitions here
|
| 4 |
+
keeps the trimmed :mod:`circuit_sparsity.inference.gpt` module hermetic and easy to vendor. The
|
| 5 |
+
implementations below are copied with minor tweaks for readability so that code
|
| 6 |
+
written against :func:`hook_recorder`, :func:`hook_namespace`, and
|
| 7 |
+
:func:`torch_recompute_preserving_hook_context` behaves identically in both the
|
| 8 |
+
training and inference configurations.
|
| 9 |
+
"""
|
| 10 |
+
|
| 11 |
+
from __future__ import annotations
|
| 12 |
+
|
| 13 |
+
import re
|
| 14 |
+
from contextlib import contextmanager
|
| 15 |
+
from functools import partial
|
| 16 |
+
|
| 17 |
+
import torch
|
| 18 |
+
import torch.utils.checkpoint
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
class HookContext:
|
| 22 |
+
"""State container used by the hook helpers."""
|
| 23 |
+
|
| 24 |
+
def __init__(self) -> None:
|
| 25 |
+
self._reset()
|
| 26 |
+
self.curintervtransformer = lambda x: x
|
| 27 |
+
|
| 28 |
+
def _reset(self) -> None:
|
| 29 |
+
self.curcontext = None
|
| 30 |
+
self.curname = ""
|
| 31 |
+
self.curregex = None
|
| 32 |
+
self.curinterventions = None
|
| 33 |
+
self.save_grads = None
|
| 34 |
+
|
| 35 |
+
def _get_interventions(self):
|
| 36 |
+
return self.curintervtransformer(
|
| 37 |
+
self.curinterventions if self.curinterventions is not None else {}
|
| 38 |
+
)
|
| 39 |
+
|
| 40 |
+
@contextmanager
|
| 41 |
+
def hook_recorder(self, regex: str = ".*", interventions=None, save_grads: bool = False):
|
| 42 |
+
"""Record tensors that pass through hooks matching ``regex``."""
|
| 43 |
+
|
| 44 |
+
assert self.curcontext is None, "reentrancy not allowed!"
|
| 45 |
+
|
| 46 |
+
try:
|
| 47 |
+
self.curcontext = {}
|
| 48 |
+
self.curregex = re.compile(regex)
|
| 49 |
+
self.curname = ""
|
| 50 |
+
self.curinterventions = interventions
|
| 51 |
+
self.save_grads = save_grads
|
| 52 |
+
|
| 53 |
+
yield self.curcontext
|
| 54 |
+
finally:
|
| 55 |
+
self._reset()
|
| 56 |
+
get_context()._reset()
|
| 57 |
+
|
| 58 |
+
@contextmanager
|
| 59 |
+
def hook_intervention_transform(self, intervention_transformer):
|
| 60 |
+
oldintervention_transformer = self.curintervtransformer
|
| 61 |
+
|
| 62 |
+
def compose(f, g):
|
| 63 |
+
return lambda x: f(g(x))
|
| 64 |
+
|
| 65 |
+
self.curintervtransformer = compose(
|
| 66 |
+
intervention_transformer,
|
| 67 |
+
self.curintervtransformer,
|
| 68 |
+
)
|
| 69 |
+
|
| 70 |
+
try:
|
| 71 |
+
yield
|
| 72 |
+
finally:
|
| 73 |
+
self.curintervtransformer = oldintervention_transformer
|
| 74 |
+
|
| 75 |
+
@contextmanager
|
| 76 |
+
def hook_namespace(self, name: str):
|
| 77 |
+
"""Temporarily push ``name`` onto the hook namespace stack."""
|
| 78 |
+
|
| 79 |
+
oldname = self.curname
|
| 80 |
+
self.curname = self.curname + name + "."
|
| 81 |
+
|
| 82 |
+
try:
|
| 83 |
+
yield
|
| 84 |
+
finally:
|
| 85 |
+
self.curname = oldname
|
| 86 |
+
|
| 87 |
+
def hook_save(self, name: str, tensor: torch.Tensor) -> torch.Tensor:
|
| 88 |
+
"""Optionally record ``tensor`` using the current namespace."""
|
| 89 |
+
|
| 90 |
+
curinterventions = self._get_interventions()
|
| 91 |
+
if curinterventions is not None:
|
| 92 |
+
key = self.curname + name
|
| 93 |
+
if key in curinterventions:
|
| 94 |
+
tensor = curinterventions[key](tensor)
|
| 95 |
+
|
| 96 |
+
if self.curcontext is not None and self.curregex.match(self.curname + name):
|
| 97 |
+
self.curcontext[self.curname + name] = tensor
|
| 98 |
+
|
| 99 |
+
if self.curcontext is not None and self.save_grads and tensor.requires_grad:
|
| 100 |
+
|
| 101 |
+
class _Grad(torch.autograd.Function):
|
| 102 |
+
@staticmethod
|
| 103 |
+
def forward(ctx, input_tensor):
|
| 104 |
+
return input_tensor
|
| 105 |
+
|
| 106 |
+
@staticmethod
|
| 107 |
+
def backward(ctx, grad_output):
|
| 108 |
+
self.curcontext[self.curname + name + ".grad"] = grad_output
|
| 109 |
+
return grad_output
|
| 110 |
+
|
| 111 |
+
if self.curregex.match(self.curname + name + ".grad"):
|
| 112 |
+
tensor = _Grad.apply(tensor)
|
| 113 |
+
|
| 114 |
+
return tensor
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
def set_context(new_context: HookContext) -> None:
|
| 118 |
+
global context
|
| 119 |
+
context = new_context
|
| 120 |
+
|
| 121 |
+
|
| 122 |
+
def get_context() -> HookContext:
|
| 123 |
+
global context
|
| 124 |
+
return context
|
| 125 |
+
|
| 126 |
+
|
| 127 |
+
def torch_recompute_preserving_hook_context(f, *xs, use_reentrant=None):
|
| 128 |
+
"""Wrapper around :func:`torch.utils.checkpoint` that propagates hooks."""
|
| 129 |
+
|
| 130 |
+
oldcontext = get_context()
|
| 131 |
+
curcontext = HookContext()
|
| 132 |
+
curcontext.curcontext = (
|
| 133 |
+
dict(oldcontext.curcontext) if oldcontext.curcontext is not None else None
|
| 134 |
+
)
|
| 135 |
+
curcontext.curregex = oldcontext.curregex
|
| 136 |
+
curcontext.curname = oldcontext.curname
|
| 137 |
+
curcontext.curinterventions = (
|
| 138 |
+
dict(oldcontext.curinterventions) if oldcontext.curinterventions is not None else None
|
| 139 |
+
)
|
| 140 |
+
curcontext.save_grads = oldcontext.save_grads
|
| 141 |
+
|
| 142 |
+
is_recompute = False
|
| 143 |
+
|
| 144 |
+
def _f(curcontext: HookContext, *xs):
|
| 145 |
+
initcontext = get_context()
|
| 146 |
+
nonlocal is_recompute
|
| 147 |
+
|
| 148 |
+
set_context(curcontext)
|
| 149 |
+
try:
|
| 150 |
+
res = f(*xs)
|
| 151 |
+
|
| 152 |
+
if not is_recompute and oldcontext.curcontext is not None:
|
| 153 |
+
oldcontext.curcontext |= curcontext.curcontext
|
| 154 |
+
finally:
|
| 155 |
+
set_context(initcontext)
|
| 156 |
+
is_recompute = True
|
| 157 |
+
return res
|
| 158 |
+
|
| 159 |
+
res = torch.utils.checkpoint.checkpoint(
|
| 160 |
+
partial(_f, curcontext), *xs, use_reentrant=use_reentrant
|
| 161 |
+
)
|
| 162 |
+
|
| 163 |
+
return res
|
| 164 |
+
|
| 165 |
+
|
| 166 |
+
context = HookContext()
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def hook_recorder(*a, **k):
|
| 170 |
+
return get_context().hook_recorder(*a, **k)
|
| 171 |
+
|
| 172 |
+
|
| 173 |
+
def hook_namespace(*a, **k):
|
| 174 |
+
return get_context().hook_namespace(*a, **k)
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
def hook_save(*a, **k):
|
| 178 |
+
return get_context().hook_save(*a, **k)
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
def hook_intervention_transform(*a, **k):
|
| 182 |
+
return get_context().hook_intervention_transform(*a, **k)
|
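A minimal sketch of how the helpers above compose: hook_namespace prefixes keys, hook_save records (and optionally rewrites) tensors, and hook_recorder collects everything matching a regex. The tiny_forward function, the key names, and the import path are illustrative.

# Usage sketch for the hook helpers above; tiny_forward and the key names are illustrative.
import torch

from hook_utils import hook_namespace, hook_recorder, hook_save  # adjust import to your packaging


def tiny_forward(x):
    with hook_namespace("0"):
        x = hook_save("pre", x)       # recorded under the key "0.pre"
        x = hook_save("post", x * 2)  # recorded under the key "0.post"
    return x


# Replace "0.pre" with zeros and record everything under the "0." namespace.
interventions = {"0.pre": lambda t: torch.zeros_like(t)}
with hook_recorder(regex=r"0\..*", interventions=interventions) as acts:
    out = tiny_forward(torch.ones(3))

print(sorted(acts))  # ['0.post', '0.pre']
print(out)           # zeros, because the intervention replaced "0.pre" before the multiply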
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c7a63c67d78a1e8da5fc3ead1338fe0f18994d48deedeb4374a6b3ad1d350724
size 1676511896
modeling_circuitgpt.py
ADDED
@@ -0,0 +1,127 @@
from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Sequence

import torch
from torch import nn
from transformers.generation.utils import GenerationMixin
from transformers.modeling_utils import PreTrainedModel
from transformers.utils.generic import ModelOutput

from .config import CircuitGPTConfig
from .gpt import GPT
from .hook_utils import hook_recorder


@dataclass
class CircuitGPTCausalLMOutput(ModelOutput):
    loss: torch.Tensor | None = None
    logits: torch.Tensor | None = None
    activations: dict[str, torch.Tensor] | None = None


def _activations_regex(keys: Sequence[str]) -> str:
    escaped = (re.escape(k) for k in keys)
    return "^(" + "|".join(escaped) + ")$"


class CircuitGPTPreTrainedModel(PreTrainedModel):
    config_class = CircuitGPTConfig
    base_model_prefix = "circuit_model"
    circuit_model: GPT

    def __init__(self, config: CircuitGPTConfig, *inputs, **kwargs) -> None:
        super().__init__(config, *inputs, **kwargs)

    def get_input_embeddings(self) -> nn.Module:
        return self.circuit_model.transformer.wte  # type: ignore[return-value]

    def set_input_embeddings(self, value: nn.Module) -> None:
        self.circuit_model.transformer.wte = value  # type: ignore[assignment]

    def get_output_embeddings(self) -> nn.Module:
        return self.circuit_model.lm_head  # type: ignore[return-value]

    def set_output_embeddings(self, new_embeddings: nn.Module) -> None:
        self.circuit_model.lm_head = new_embeddings  # type: ignore[assignment]


class CircuitGPTForCausalLM(CircuitGPTPreTrainedModel, GenerationMixin):
    """
    Hugging Face-compatible wrapper around `circuit_sparsity.gpt.GPT`.
    All math happens inside the original module so parity is guaranteed.
    """

    def __init__(self, config: CircuitGPTConfig, circuit_model: GPT | None = None) -> None:
        super().__init__(config)

        if circuit_model is None:
            self.circuit_model = GPT(config.to_circuit_config())
            self.post_init()
        else:
            self.circuit_model = circuit_model

    # ------------------------------------------------------------------
    # Constructors
    # ------------------------------------------------------------------
    @classmethod
    def from_circuit_model(cls, circuit_model: GPT) -> "CircuitGPTForCausalLM":
        config = CircuitGPTConfig.from_circuit_config(circuit_model.config)
        return cls(config, circuit_model=circuit_model)

    # ------------------------------------------------------------------
    # Forward
    # ------------------------------------------------------------------
    def forward(
        self,
        input_ids: torch.Tensor,
        labels: torch.LongTensor | None = None,
        output_activations: Sequence[str] | None = None,
        return_dict: bool | None = None,
        use_cache: bool | None = None,
        output_attentions: bool | None = None,
        output_hidden_states: bool | None = None,
        **kwargs,
    ) -> CircuitGPTCausalLMOutput:
        # Ignore HF generation kwargs we don't use; surface any unknowns.
        remaining_kwargs = {k: v for k, v in kwargs.items() if v is not None}
        if remaining_kwargs:
            unsupported = ", ".join(remaining_kwargs.keys())
            raise ValueError(f"Unsupported arguments for CircuitGPTForCausalLM: {unsupported}")

        if input_ids.size(-1) > self.config.block_size:
            raise ValueError(
                f"Sequence length {input_ids.size(-1)} exceeds block size {self.config.block_size}"
            )

        if output_activations:
            regex = _activations_regex(output_activations)
            with hook_recorder(regex=regex) as recorded:
                logits, loss, _ = self.circuit_model(input_ids, targets=labels)
            activations = {key: recorded[key] for key in output_activations if key in recorded}
        else:
            activations = None
            logits, loss, _ = self.circuit_model(input_ids, targets=labels)

        if labels is None:
            loss = None

        return CircuitGPTCausalLMOutput(
            loss=loss,
            logits=logits,
            activations=activations,
        )

    # ------------------------------------------------------------------
    # Generation helpers
    # ------------------------------------------------------------------
    def prepare_inputs_for_generation(self, input_ids: torch.Tensor, **kwargs):
        if input_ids.size(-1) > self.config.block_size:
            input_ids = input_ids[:, -self.config.block_size :]
        return {"input_ids": input_ids}

    def _reorder_cache(self, past, beam_idx):
        # No KV cache implemented; method exists for interface completeness.
        return past
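A sketch of how this wrapper is meant to be used for activation capture. The repo id below is a placeholder, and loading through AutoModelForCausalLM with trust_remote_code assumes config.json maps the auto class to modeling_circuitgpt.CircuitGPTForCausalLM (not visible in this diff); "final_resid" is the hook key saved by GPT.forward_tail above.

# Sketch: load the exported checkpoint and capture the final residual stream.
# The repo id is a placeholder, and trust_remote_code=True assumes config.json
# registers an auto_map pointing at modeling_circuitgpt.CircuitGPTForCausalLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "your-org/your-circuitgpt-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True).eval()

input_ids = tokenizer("def add(a, b):", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids, output_activations=["final_resid"])

print(out.logits.shape)                      # (batch, seq_len, vocab_size)
print(out.activations["final_resid"].shape)  # (batch, seq_len, d_model)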
special_tokens_map.json
ADDED
@@ -0,0 +1,3 @@
{
  "eos_token": "<|endoftext|>"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b42d53ab8a66da8fc53288442b1239e373697d630ad61fa66dc35350274617c7
size 24366
tokenizer_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "eos_token": "<|endoftext|>",
  "model_max_length": 256,
  "padding_side": "left",
  "tokenizer_class": "PreTrainedTokenizerFast"
}
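Since the tokenizer is declared as a plain PreTrainedTokenizerFast, it loads with AutoTokenizer. A small sketch; the repo id is a placeholder:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-org/your-circuitgpt-checkpoint")  # placeholder id
print(tok.eos_token)         # "<|endoftext|>"
print(tok.model_max_length)  # 256, the maximum context length configured above
print(tok.padding_side)      # "left", so batched prompts stay right-aligned for generation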