New Technique Shows Gaps in LLM Safety Screening
Attackers Can Flip Safety Filters Using Short Token Sequences
A few stray characters, sometimes as small as "oz" or generic as "=coffee" may be all it takes to steer past an AI system's safety checks. HiddenLayer researchers have found a way to identify short token sequences that can cause guardrail models to misclassify malicious prompts as harmless.
A few stray characters, sometimes as small as "oz" or generic as "=coffee" may be all it takes to steer past an AI system's safety checks. HiddenLayer researchers have found a way to identify short token sequences that can cause guardrail models to misclassify malicious prompts as harmless.