LLM Safety From Within: Detecting Harmful Content with Internal Representations
Abstract
SIREN is a lightweight guard model that leverages internal layer features from LLMs to improve harmful content detection efficiency and performance.
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.
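To make the method concrete, here is a minimal sketch of the detection pipeline the abstract describes: per-layer linear probes score hidden states for harmfulness, and the per-layer scores are combined with adaptive layer weights. All names, shapes, and the softmax weighting are illustrative assumptions; the paper's actual probe training and weighting scheme may differ.

```python
# Hedged sketch of a SIREN-style detector (assumed details, not the paper's code):
# one linear probe per internal layer, combined via softmax layer weights.
import numpy as np

rng = np.random.default_rng(0)
num_layers, hidden_dim = 4, 8

# Hypothetical hidden states for one input: one vector per internal layer.
hidden_states = rng.normal(size=(num_layers, hidden_dim))

# One linear probe (weight vector + bias) per layer. In the paper these are
# trained on labeled harmful/benign data; here they are random placeholders.
probe_w = rng.normal(size=(num_layers, hidden_dim))
probe_b = rng.normal(size=num_layers)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-layer harmfulness scores: sigmoid(w_l . h_l + b_l) for each layer l.
layer_scores = sigmoid(np.einsum("ld,ld->l", probe_w, hidden_states) + probe_b)

# Adaptive layer weighting, sketched as a softmax over learned logits
# (placeholders here) so the weights are positive and sum to 1.
layer_logits = rng.normal(size=num_layers)
weights = np.exp(layer_logits) / np.exp(layer_logits).sum()

# Final harmfulness score in [0, 1]; the base LLM is never modified.
harmfulness = float(weights @ layer_scores)
```

Because the probes and weights are the only trainable pieces, the parameter count is tiny relative to a generative guard model, which is consistent with the abstract's efficiency claim.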
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Segment-Level Coherence for Robust Harmful Intent Probing in LLMs (2026)
- Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints (2026)
- Predict, Don't React: Value-Based Safety Forecasting for LLM Streaming (2026)
- DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation (2026)
- Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation (2026)
- TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense (2026)
- Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning (2026)
The idea of pulling safety signals from all internal layers is clever, but I'm skeptical about how robust the layer weighting is to shifts in the safety taxonomy. Since the per-layer weights come from linearly probed signals tuned on a fixed validation set, would the detector struggle on unseen harmful cues that map to different internal patterns? An explicit cross-taxonomy transfer ablation would help confirm whether the gains come from real cross-layer signals or just a favorable alignment with the chosen safety set. BTW, the arxivlens breakdown helped me parse the method details; it clearly lays out how the probes and adaptive weighting slot into a plug-and-play guard: https://arxivlens.com/PaperView/Details/llm-safety-from-within-detecting-harmful-content-with-internal-representations-2646-42190233
Get this paper in your agent:
hf papers read 2604.18519
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 4
Datasets citing this paper: 0
Spaces citing this paper: 0