/FIELD NOTE

A Working Taxonomy of LLM Jailbreak Techniques

3 February 2026 // 14 min read // Basalt Cyber Defense Division

A jailbreak is any technique that gets a model to produce output its safety training was meant to refuse. For most product teams jailbreaks feel like an abstract reputational risk, but in practice they map directly to concrete harms: a hijacked agent leaking data, a support bot generating defamatory content, or a guardrail that fails the moment it is tested adversarially. At Basalt Cyber we maintain a living taxonomy of jailbreak classes so that our red team coverage is systematic rather than anecdotal. This article shares that taxonomy and shows how to turn it into a repeatable test suite.

Why a taxonomy matters

Without a structured model of attack classes, jailbreak testing becomes a grab-bag of whatever prompts happened to be popular on social media last week. That gives you uneven coverage and no way to know what you have not tested. A taxonomy lets you reason about coverage, assign owners, and measure whether a new model release closes a whole class of attacks or merely a handful of specific prompts. The classes below are not mutually exclusive; the strongest real-world attacks chain several together.

Role-play and persona attacks

The oldest and still most common class asks the model to adopt a persona that is exempt from its rules: a fictional character, an unrestricted twin, a developer-mode alter ego, or an actor reading a script. The framing reattributes responsibility, so the model treats the harmful output as belonging to a character rather than to itself. Variants nest the request inside a story, a game, or a hypothetical, putting distance between the literal ask and the refusal trigger.

Token smuggling and obfuscation

This class hides the disallowed request from the safety classifier while keeping it legible to the model. Techniques include leetspeak, inserted separators between letters, Base64 and other encodings, homoglyphs, and unicode tricks. The model is competent enough to decode the obfuscation and act on it, but a surface-level filter sees only noise. Obfuscation is rarely a complete attack on its own; it is the carrier that smuggles a payload past input inspection.

Crescendo and multi-turn escalation

Crescendo attacks never ask for the harmful thing directly. They open with benign, on-topic questions and escalate gradually, each turn building on the model's own previous answers until the conversation arrives somewhere it would have refused in a single message. Because the model is anchored to its earlier compliant responses, it is reluctant to suddenly object. Multi-turn attacks are particularly dangerous because single-prompt testing misses them entirely.

Many-shot jailbreaking

Long context windows enabled a powerful technique: fill the prompt with dozens or hundreds of fabricated examples of the assistant happily complying with harmful requests, then ask the real question. The model's in-context learning generalises from the pattern and complies. The more examples, the higher the success rate, which is why this class scaled with context length. Defending it requires limits on how much untrusted demonstration content can dominate the window.

Payload splitting and obfuscated assembly

Here the attacker breaks a disallowed instruction into fragments that are individually harmless, then asks the model to concatenate, decode, or execute them. No single part trips a filter, but the assembled whole is the attack. This overlaps with obfuscation and with code-execution style requests where the model is asked to "run" or "complete" something that resolves to the payload.

Low-resource language attacks

Safety training is heavily skewed toward high-resource languages, primarily English. Translating a disallowed request into a low-resource language, or into a mix of languages, frequently slips past guardrails that were never tuned for that distribution. The model retains enough capability to answer, while the safety layer was never trained on the input. This class is a reminder that multilingual coverage is a security property, not just a feature.

Cipher and encoding attacks

Closely related to obfuscation but worth its own category, cipher attacks instruct the model to communicate in a transformed alphabet, a substitution cipher, ROT13, or a custom encoding, so that both the request and the response evade text-based filtering. The model is asked to think and reply inside the cipher, keeping the entire exchange opaque to monitoring that operates on plain text.

Building a test suite from the taxonomy

The value of the taxonomy is operational. Treat each class as a coverage requirement and build seed cases for all of them, then mutate and combine.

  • Maintain at least one representative payload per class, versioned alongside your application.
  • Include multi-turn scenarios, not just single prompts, so crescendo attacks are exercised.
  • Test in several languages, including low-resource ones relevant to your users.
  • Combine classes, for example role-play plus obfuscation plus payload splitting, because real attacks chain.
  • Score outcomes consistently and track them across model and prompt changes so regressions are visible.

Automating this turns jailbreak resistance into a measurable, repeatable property rather than a one-off audit. Our AI red teaming practice runs exactly this kind of taxonomy-driven suite against client systems, and we tie the findings back to the broader controls in our LLM security work.

A note on defense

No single guardrail defeats every class, and chasing individual prompts is a losing game. The durable defenses are architectural: constrain what the model can actually do, separate untrusted content from privileged instructions, keep humans in the loop for high-impact actions, and monitor for the patterns these classes produce. Jailbreak testing tells you how leaky your guardrails are; it does not replace limiting the blast radius of a successful breach.

Takeaway

Jailbreaks are not random. They fall into recognisable classes: role-play, obfuscation, crescendo, many-shot, payload splitting, low-resource language, and cipher attacks. Build your test coverage against those classes, chain them the way real attackers do, and measure resistance over time so a model or prompt change cannot quietly reopen a door you thought you had closed.