LLM Safety From Within: Detecting Harmful Content with Internal Representations
Abstract
SIREN is a lightweight guard model that leverages internal layer features from LLMs to improve harmful content detection efficiency and performance.
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
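To make the method concrete, here is a minimal sketch of the detection pipeline the abstract describes: per-layer linear probes score hidden states for harmfulness, and the per-layer scores are combined with adaptive layer weights. All names, shapes, and the softmax weighting are illustrative assumptions; the paper's actual probe training and weighting scheme may differ.

```python
# Hedged sketch of a SIREN-style detector (assumed details, not the paper's code):
# one linear probe per internal layer, combined via softmax layer weights.
import numpy as np

rng = np.random.default_rng(0)
num_layers, hidden_dim = 4, 8

# Hypothetical hidden states for one input: one vector per internal layer.
hidden_states = rng.normal(size=(num_layers, hidden_dim))

# One linear probe (weight vector + bias) per layer. In the paper these are
# trained on labeled harmful/benign data; here they are random placeholders.
probe_w = rng.normal(size=(num_layers, hidden_dim))
probe_b = rng.normal(size=num_layers)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-layer harmfulness scores: sigmoid(w_l . h_l + b_l) for each layer l.
layer_scores = sigmoid(np.einsum("ld,ld->l", probe_w, hidden_states) + probe_b)

# Adaptive layer weighting, sketched as a softmax over learned logits
# (placeholders here) so the weights are positive and sum to 1.
layer_logits = rng.normal(size=num_layers)
weights = np.exp(layer_logits) / np.exp(layer_logits).sum()

# Final harmfulness score in [0, 1]; the base LLM is never modified.
harmfulness = float(weights @ layer_scores)
```

Because the probes and weights are the only trainable pieces, the parameter count is tiny relative to a generative guard model, which is consistent with the abstract's efficiency claim.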
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Segment-Level Coherence for Robust Harmful Intent Probing in LLMs (2026)
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints (2026)
- Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming (2026)
- DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation (2026)
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation (2026)
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense (2026)
- Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning (2026)
The idea of pulling safety signals from all internal layers is clever, but I'm skeptical about how robust the layer weighting is to shifts in the safety taxonomy. Since the per-layer weights come from linearly probed signals tuned on a fixed validation set, would the detector struggle on unseen harmful cues that map to different internal patterns? An explicit cross-taxonomy transfer ablation would help confirm whether the gains come from real cross-layer signals or just a favorable alignment with the chosen safety set. BTW, the arxivlens breakdown helped me parse the method details; it clearly lays out how the probes and adaptive weighting slot into a plug-and-play guard: https://arxivlens.com/PaperView/Details/llm-safety-from-within-detecting-harmful-content-with-internal-representations-2646-42190233
Get this paper in your agent:
hf papers read 2604.18519
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 4
Datasets citing this paper: 0
Spaces citing this paper: 0