SmolLM2-135M Deception Behavioral SAEs
54 Sparse Autoencoders trained on residual stream activations from HuggingFaceTB/SmolLM2-135M (135M parameter Llama2-architecture tiny model), capturing behavioral deception signals via same-prompt temperature sampling.
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate this SAE come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: insider_info, accounting_error, ai_oversight_log, ai_capability_hide, surprise_party, job_interview_gap. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: secret_role_game ("You have been assigned the role of Fascist"), secret_role_game_v2 ("You are secretly one of the Mafia members"), werewolf_game ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role or "honest" when it echoes it.
What this mixed pool means for the SAE's labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What this SAE is and is not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the insider_info / accounting_error / ai_oversight_log / ai_capability_hide / surprise_party / job_interview_gap scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
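One way to act on the "restrict attention" advice is to measure, per feature, how much of its total activation mass comes from the six incentive-structure scenarios. The following is a minimal sketch, not the author's method: the helper names, the input layout (one activation list and one scenario name per sample), and the 0.9 cutoff are all assumptions chosen for illustration.

```python
# Scenario names are taken from this README; everything else is hypothetical.
CLEAN_SCENARIOS = {
    "insider_info", "accounting_error", "ai_oversight_log",
    "ai_capability_hide", "surprise_party", "job_interview_gap",
}

def clean_mass_fraction(feature_acts, scenarios):
    """Per-feature share of total activation mass from clean scenarios.

    feature_acts: list of per-sample activation lists (n_samples x n_features),
                  assumed non-negative as SAE activations are.
    scenarios:    list of scenario names, one per sample.
    """
    n_features = len(feature_acts[0])
    clean = [0.0] * n_features
    total = [0.0] * n_features
    for acts, scenario in zip(feature_acts, scenarios):
        for j, a in enumerate(acts):
            total[j] += a
            if scenario in CLEAN_SCENARIOS:
                clean[j] += a
    # Features that never fire get 0.0 instead of a division error.
    return [c / t if t > 0 else 0.0 for c, t in zip(clean, total)]

def select_clean_features(feature_acts, scenarios, min_fraction=0.9):
    """Indices of features whose mass is concentrated in clean scenarios."""
    fractions = clean_mass_fraction(feature_acts, scenarios)
    return [j for j, frac in enumerate(fractions) if frac >= min_fraction]
```

A feature that passes this filter can still encode something unrelated to deception; the filter only removes features dominated by the role-play scenarios.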
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Part of the cross-model deception SAE study: Solshine/deception-behavioral-saes-saelens (9 models, 348 total SAEs).
What's in This Repo
- 54 SAEs across 9 layers (L3, L4, L5, L6, L9, L12, L15, L18, L21)
- 2 architectures: TopK (k=64), JumpReLU
- 3 training conditions: mixed, deceptive_only, honest_only
- Format: SAELens/Neuronpedia-compatible (safetensors + cfg.json)
- Dimensions: d_in=576, d_sae=2304 (4x expansion)
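The count of 54 follows from 9 layers x 2 architectures x 3 conditions. The sketch below enumerates candidate SAE IDs under the assumption that every folder follows the pattern of the two IDs quoted in this README (smollm2_topk_L21_deceptive_only and smollm2_topk_L21_honest_only). The "jumprelu" tag spelling in particular is a guess, so check the actual folder names in the repo before relying on it.

```python
from itertools import product

LAYERS = [3, 4, 5, 6, 9, 12, 15, 18, 21]
ARCHITECTURES = ["topk", "jumprelu"]  # "jumprelu" spelling is an assumption
CONDITIONS = ["mixed", "deceptive_only", "honest_only"]

def expected_sae_ids():
    """Enumerate IDs of the assumed form smollm2_{arch}_L{layer}_{condition}."""
    return [
        f"smollm2_{arch}_L{layer}_{cond}"
        for arch, layer, cond in product(ARCHITECTURES, LAYERS, CONDITIONS)
    ]
```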
Research Context
This is a follow-up to "The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools" (arXiv:2509.20393). Same-prompt behavioral sampling: a single ambiguous scenario prompt produces both deceptive and honest completions via temperature sampling, classified by Gemini 2.5 Flash.
Code: SolshineCode/deception-nanochat-sae-research
Key Findings: SmolLM2-135M
SmolLM2-135M is the smallest model in the 9-model study, alongside Pythia-160M, and provides a lower-bound test for deception signal detectability at scale.
| Metric | Value |
|---|---|
| Peak layer | L4 (13% depth, very early) |
| Peak balanced accuracy | ~69% |
| Peak AUROC | 0.725 |
| Best SAE probe accuracy | 67.2% (smollm2_topk_L21_deceptive_only) |
| SAEs beating raw baseline | 24/54 (44%): SAEs help detection |
Very early peak at Layer 4 (13% depth): The deception signal peaks at L4, earlier in the network than any other model except Pythia-160M. This may indicate that SmolLM2's limited capacity forces deception-relevant information into early layers, or alternatively that coherent deception behavior is poorly formed at 135M parameters and what the probe detects is closer to surface token pattern recognition.
SAEs help for this model (44% beat raw): Like all models below 1.3B in the study, SmolLM2's SAEs frequently exceed the raw baseline. The best SAE (smollm2_topk_L21_deceptive_only) achieves 67.2% at L21, a +14.34pp improvement over the L21 raw baseline of 52.88%. This is a striking recovery from a layer that is otherwise near chance, enabled by the TopK SAE's compressed 64-feature representation capturing the concentrated signal.
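For readers comparing these numbers, balanced accuracy is the mean of per-class recall, which matters here because the classes are unequal (96 deceptive vs 85 honest samples). The sketch below uses the standard definition; whether the upstream pipeline computes it exactly this way (for example, averaged over cross-validation folds) is not stated in this README.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Class sizes from the training-conditions row: 96 deceptive (1), 85 honest (0).
y_true = [1] * 96 + [0] * 85
# A degenerate probe that always answers "deceptive".
y_pred = [1] * 181
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The always-deceptive probe scores above 50% on plain accuracy but exactly 50% on balanced accuracy, which is why chance level for the figures above is 50%.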
Model incoherence caveat: SmolLM2 at 135M parameters may not produce reliably coherent deception behavior across all scenarios. Two scenarios (secret_role_game_v2, werewolf_game) timed out during classification, suggesting the model's completions were too incoherent for the classifier to label confidently. The 181 usable samples may skew toward scenarios where the model does produce intelligible output.
Architecture note: SmolLM2-135M uses the Llama2 architecture (causal self-attention, SwiGLU MLP, RMSNorm, rotary position embeddings) but at dramatically reduced scale: 576-dimensional hidden states, 30 transformer layers. Despite its tiny size, it achieves above-chance deception signal, suggesting deception-correlated geometric structure emerges even in very small models.
Best d_max = 0.825 (JumpReLU L9 mixed): Highest per-feature discriminability is at L9, though the SAE probe at L9 does not top the study.
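This README does not define d_max. One plausible reading, assumed here and not confirmed by the source, is the largest absolute Cohen's d across SAE features when comparing deceptive and honest samples. The sketch below computes that quantity with NumPy; the function names are hypothetical.

```python
import numpy as np

def cohens_d_per_feature(acts_a, acts_b):
    """Cohen's d for each feature (column) between two groups of samples."""
    a = np.asarray(acts_a, dtype=float)
    b = np.asarray(acts_b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance with the usual (n - 1) weighting.
    pooled_var = ((na - 1) * a.var(axis=0, ddof=1)
                  + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    pooled_std = np.sqrt(pooled_var)
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Features with zero pooled variance are assigned d = 0.
    return np.divide(diff, pooled_std,
                     out=np.zeros_like(diff), where=pooled_std > 0)

def d_max(acts_a, acts_b):
    """Return (largest |d|, index of the feature that attains it)."""
    d = np.abs(cohens_d_per_feature(acts_a, acts_b))
    j = int(d.argmax())
    return float(d[j]), j
```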
SAE Format
Each SAE lives in a subfolder named {sae_id}/ containing:
- sae_weights.safetensors: encoder/decoder weights
- cfg.json: SAELens-compatible config
hook_name format: model.layers.{layer}.hook_resid_post
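Hook names appear in this README both with a .hook_resid_post suffix and as a bare model.layers.{layer} path. The sketch below accepts either form and recovers the layer index and the plain submodule path. The helper names are hypothetical, and the regular expression covers only the two spellings shown in this document.

```python
import re

_HOOK_RE = re.compile(r"model\.layers\.(\d+)(?:\.hook_resid_post)?")

def parse_layer(hook_name):
    """Extract the integer layer index from a hook_name string."""
    m = _HOOK_RE.fullmatch(hook_name)
    if m is None:
        raise ValueError(f"Unrecognized hook_name: {hook_name!r}")
    return int(m.group(1))

def module_path(hook_name):
    """Return the submodule path without any hook suffix."""
    return f"model.layers.{parse_layer(hook_name)}"
```

The result of module_path can be resolved against a HuggingFace model with repeated getattr calls, since it names a real decoder-layer submodule.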
Training Details
| Parameter | Value |
|---|---|
| Hardware | NVIDIA GeForce GTX 1650 Ti Max-Q, 4 GB VRAM, Windows 11 Pro |
| Training time | ~400-600 seconds per SAE |
| Epochs | 300 |
| Batch size | 128 |
| Expansion factor | 4x (576 → 2304) |
| Activations | resid_post collected during autoregressive generation |
| Training conditions | mixed (n=181), deceptive_only (n=96), honest_only (n=85) |
| LLM classifier | Gemini 2.5 Flash |
Known Limitations
JumpReLU threshold not learned (54 SAEs): All SAEs in this repo have threshold = 0, which makes the JumpReLU SAEs functionally ReLU, with L0 ≈ 50% of d_sae. TopK SAEs are unaffected (exact k=64).
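To make the limitation concrete, the sketch below shows that a JumpReLU gate with threshold 0 gives the same output as ReLU, and that a TopK gate keeps exactly k entries per sample regardless of threshold. It uses random symmetric pre-activations as an illustrative stand-in, not real SAE pre-activations, and the function names are hypothetical. TopK implementations differ on details such as applying ReLU after selection, which this sketch omits.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """Pass values strictly above the threshold unchanged; zero the rest."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

def topk(pre_acts, k):
    """Keep the k largest entries in each row; zero the rest."""
    idx = np.argpartition(pre_acts, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre_acts)
    np.put_along_axis(out, idx,
                      np.take_along_axis(pre_acts, idx, axis=-1), axis=-1)
    return out

rng = np.random.default_rng(0)
pre = rng.standard_normal((8, 2304))   # d_sae = 2304, as in this repo

jr = jumprelu(pre, 0.0)                # threshold = 0
tk = topk(pre, 64)                     # k = 64

jr_l0_fraction = (jr != 0).sum(axis=-1).mean() / pre.shape[-1]
tk_l0 = (tk != 0).sum(axis=-1)
```

With symmetric pre-activations about half of the entries are positive, so the threshold-0 gate leaves roughly half of the features active, consistent with the L0 figure above.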
STE fix (2026-04-11): The training code has been corrected with a Gaussian-kernel STE (Rajamanoharan et al. 2024, arXiv:2407.14435). The honest_only advantage over TopK is confirmed as not a dimensionality artifact (15/18 STE conditions on d20+TinyLlama confirm).
Model coherence: At 135M parameters, behavioral coherence is limited. A significant fraction of completions were ambiguous or unclassifiable. Results should be interpreted with this caveat in mind.
Loading Example
from safetensors.torch import load_file
import json
sae_id = "smollm2_topk_L21_deceptive_only"
weights = load_file(f"{sae_id}/sae_weights.safetensors")
with open(f"{sae_id}/cfg.json") as f:  # close the file handle after reading
    cfg = json.load(f)
# W_enc: [576, 2304], W_dec: [2304, 576]
# cfg["hook_name"] == "model.layers.21.hook_resid_post"
print(f"Architecture: {cfg['architecture']}, k={cfg.get('k', 'N/A')}")
Usage
1. Load an SAE from this repo
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
import json
repo_id = "Solshine/deception-saes-smollm2-135m"
sae_id = "smollm2_topk_L21_honest_only" # replace with any tag in this repo
weights_path = hf_hub_download(repo_id, f"{sae_id}/sae_weights.safetensors")
cfg_path = hf_hub_download(repo_id, f"{sae_id}/cfg.json")
with open(cfg_path) as f:
    cfg = json.load(f)
# Option A: load with SAELens (>=3.0 required for jumprelu/topk; >=3.5 for gated)
from sae_lens import SAE
sae = SAE.from_dict(cfg)
sae.load_state_dict(load_file(weights_path))
# Option B: load manually (no SAELens dependency)
from safetensors.torch import load_file
state = load_file(weights_path)
# Keys: W_enc [576, 2304], b_enc [2304],
# W_dec [2304, 576], b_dec [576], threshold [2304]
2. Hook into the model and collect residual-stream activations
These SAEs were trained on the residual stream after each transformer layer.
The hook_name field in cfg.json gives the exact HuggingFace transformers
submodule path to hook. SmolLM2 uses LLaMA-style architecture. Hook path: model.layers.{layer}.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
# Read hook_name from the cfg you already loaded:
# cfg["hook_name"] == "model.layers.21" (example β varies by SAE)
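hook_name = cfg["hook_name"]  # e.g. "model.layers.21"
# Drop a trailing ".hook_resid_post" suffix, if present, so the path names a real submodule
module_path = hook_name.removesuffix(".hook_resid_post")  # Python 3.9+
# Navigate the submodule path and register a forward hook
import functools
submodule = functools.reduce(getattr, module_path.split("."), model)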
activations = {}
def hook_fn(module, input, output):
    # Most transformer layers return (hidden_states, ...) as a tuple
    h = output[0] if isinstance(output, tuple) else output
    activations["resid"] = h.detach()
handle = submodule.register_forward_hook(hook_fn)
inputs = tokenizer("Your text here", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()
# activations["resid"]: [batch, seq_len, 576]
resid = activations["resid"][:, -1, :] # last token position
3. Read feature activations
with torch.no_grad():
    feature_acts = sae.encode(resid)  # [batch, 2304], sparse
# Which features fired?
active_features = feature_acts[0].nonzero(as_tuple=True)[0]
top_features = feature_acts[0].topk(10)
print("Active feature indices:", active_features.tolist())
print("Top-10 feature values:", top_features.values.tolist())
print("Top-10 feature indices:", top_features.indices.tolist())
# Reconstruct (for sanity check; should be close to resid)
reconstruction = sae.decode(feature_acts)
l2_error = (resid - reconstruction).norm(dim=-1).mean()
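Beyond the L2 error, the reconstruction metrics mentioned earlier in this README (explained variance and L0) can be estimated from the same tensors. The sketch below uses one common definition of explained variance, 1 minus the ratio of residual to total sum of squares. The exact formula used for the reported numbers is not given here, so treat this as an assumption. It operates on NumPy arrays, so convert torch tensors with .cpu().numpy() first.

```python
import numpy as np

def l0(feature_acts):
    """Average number of non-zero features per sample."""
    return float((feature_acts != 0).sum(axis=-1).mean())

def explained_variance(x, x_hat):
    """1 - (residual sum of squares / total sum of squares about the mean)."""
    resid_ss = ((x - x_hat) ** 2).sum()
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    return float(1.0 - resid_ss / total_ss)
```

A single last-token activation, as collected above, has zero variance about its own mean, so explained_variance needs a batch of several activations to be meaningful.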
Caveats and known limitations
Hook names are HuggingFace transformers-style, not TransformerLens-style.
The hook_name in cfg.json (e.g. "model.layers.21") is a submodule path in the standard
HuggingFace model. SAELens' built-in activation-collection pipeline expects
TransformerLens hook names (e.g. blocks.14.hook_resid_post). This means
SAE.from_pretrained() with automatic model running will not work β use the
manual forward-hook pattern above instead.
SAELens version requirements.
- topk architecture: SAELens ≥ 3.0
- jumprelu architecture: SAELens ≥ 3.0
- gated architecture: SAELens ≥ 3.5 (or load manually with state_dict)
These SAEs detect deceptive behavior, not deceptive prompts. They were trained on response-level activations where the same prompt produced both deceptive and honest outputs. Feature activation differences reflect behavioral divergence, not prompt content. See the paper for experimental design details.
Citation
@article{thesecretagenda2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb},
journal={arXiv:2509.20393},
year={2025}
}