arxiv:2604.18519

LLM Safety From Within: Detecting Harmful Content with Internal Representations

Published on Apr 20 · Submitted by Difan Jiao on Apr 27

Abstract

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.

AI-generated summary

SIREN is a lightweight guard model that leverages internal layer features from LLMs to improve harmful content detection efficiency and performance.
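To make the method concrete, here is a minimal sketch, assuming last-token hidden states are read from every layer of a frozen LLM, of how per-layer linear probes can be combined through softmax-normalized, learnable layer weights. The `LayerWeightedProbe` class, its names, and its shapes are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of a SIREN-style detector, not the authors' code.
# Assumptions: last-token hidden states are collected from every layer
# of a frozen LLM; each layer gets its own linear probe; probe scores
# are mixed via softmax-normalized, learnable layer weights.
import torch
import torch.nn as nn

class LayerWeightedProbe(nn.Module):
    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # One linear probe per layer; the paper selects "safety neurons"
        # via linear probing, while this sketch probes the full state.
        self.probes = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_layers)]
        )
        # Learnable per-layer weights, a stand-in for the paper's
        # adaptive layer-weighted strategy, letting training emphasize
        # the most safety-relevant layers.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, hidden_dim)
        per_layer = torch.stack(
            [p(h).squeeze(-1) for p, h in zip(self.probes, hidden_states)]
        )  # (num_layers, batch) of per-layer harmfulness logits
        weights = torch.softmax(self.layer_logits, dim=0)
        # Weighted sum across layers -> probability of harmful content.
        return torch.sigmoid((weights[:, None] * per_layer).sum(dim=0))
```

With a Hugging Face transformer, the stacked per-layer states can come from a forward pass with `output_hidden_states=True`; since the LLM itself stays frozen and only the probes and layer weights train, the trainable-parameter count stays small, consistent with the 250-fold reduction reported above.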

Community



The idea of pulling safety signals from all internal layers is clever, but I'm skeptical about how robust the layer weighting is to shifts in the safety taxonomy. Since the per-layer weights come from L1-probed signals tuned on a fixed validation set, would the detector struggle when it sees unseen harmful cues that map to different internal patterns? An explicit cross-taxonomy transfer ablation would help confirm whether the gains come from real cross-layer signals or just a favorable alignment with the chosen safety set. By the way, the arXivLens breakdown helped me parse the method details; it clearly lays out how the probes and adaptive weighting slot into a plug-and-play guard: https://arxivlens.com/PaperView/Details/llm-safety-from-within-detecting-harmful-content-with-internal-representations-2646-42190233
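For reference, a hypothetical version of that transfer check could look like the sketch below. The L1-penalized probes, the AUROC-derived layer weights, and the `cross_taxonomy_auroc` helper are all illustrative assumptions, not details taken from the paper:

```python
# Hypothetical cross-taxonomy transfer check, not from the paper.
# feats_a / feats_b: lists of (n_samples, hidden_dim) arrays, one per
# layer, holding last-token hidden states; y_a / y_b: harmful labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def cross_taxonomy_auroc(feats_a, y_a, feats_b, y_b):
    probes, weights = [], []
    for X in feats_a:
        # Sparse (L1-penalized) probe per layer, fit on taxonomy A only.
        probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        probe.fit(X, y_a)
        probes.append(probe)
        # Weight each layer by how well its probe separates taxonomy A
        # (a simple stand-in for the paper's adaptive weighting).
        weights.append(roc_auc_score(y_a, probe.predict_proba(X)[:, 1]))
    w = np.array(weights)
    w /= w.sum()
    # Combined detector score evaluated on the unseen taxonomy B.
    score_b = sum(wi * p.predict_proba(X)[:, 1]
                  for wi, p, X in zip(w, probes, feats_b))
    return roc_auc_score(y_b, score_b)
```

If the combined AUROC on taxonomy B falls well below the individual probes' own transfer AUROCs, that would suggest the gains come from alignment with the chosen safety set rather than from genuinely cross-layer signal.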


Get this paper in your agent:

hf papers read 2604.18519

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 4

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 2