Prompt injection & jailbreak defence

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN

BLEEDING EDGEESTABLISHED

BLEEDING EDGE

TRAJECTORY— Stalled

Defences against adversarial prompt injection and jailbreak attacks that attempt to bypass AI system guardrails. Includes input sanitisation and prompt security layers; distinct from general cybersecurity which protects infrastructure rather than AI-specific attack vectors.

OVERVIEW

Prompt injection and jailbreak defences are emerging security practices addressing adversarial attacks on Large Language Models. These attacks exploit the inherent flexibility of natural language interfaces, allowing attackers to override model instructions or extract sensitive information. Unlike output filtering (content safety), these defences focus on protecting the input layer—validating, sanitising, or detecting malicious prompts before they reach the model. By early 2024, the landscape had shifted from pure research toward production deployment: major cloud providers (Microsoft, AWS) began shipping defence tools, vendors like Lakera operationalised detection APIs, and the research community validated multiple defence approaches (fine-tuning, structured queries, cryptographic signing). Yet critical assessments persisted—security researchers continued to demonstrate guardrail evasions and attack transferability, suggesting that while effective against known patterns, no single solution remained fundamentally robust against adaptive adversaries.

CURRENT LANDSCAPE

By mid-April 2026, prompt injection and jailbreak defence exhibited operational maturity combined with deepening recognition of fundamental architectural limits. Market consolidation solidified through 2025-2026 with Check Point's integration of Lakera (completed September 2025), creating converged network and AI security capabilities. Lakera Guard maintained performance leadership with independently validated metrics: 98%+ detection rates, sub-50ms latency, <0.5% false positive rates across 100+ languages, with named Fortune 500 deployments (notably Dropbox) confirming production viability. Microsoft's expansion of Prompt Shields into Azure AI Foundry and Global Secure Access signalled major infrastructure vendor commitment to network-level integration. Systematic evidence from February 2026 survey of 128 academic studies documented attack methods evolution from simple direct injection to sophisticated multimodal approaches achieving >90% success, with defense mechanisms showing 95% effectiveness against known patterns but acknowledged gaps in standardized evaluation and limited robustness against novel vectors. March-April 2026 developments crystallized architectural understanding: peer-reviewed research established the "defense trilemma" proving fundamental mathematical impossibility of wrapper-based defenses achieving simultaneous continuity, utility preservation, and security; large-scale arena evaluation (464 participants, 272K attacks on 13 frontier models) revealed significant robustness variance (Claude Opus 0.5% vs Gemini 8.5% ASR) with intelligence uncorrelated to safety; empirical attack taxonomy demonstrated composite obfuscation+semantic attacks reaching 97.6% success against intent-aware defenses; real-world telemetry confirmed first documented indirect injection ad-review evasion with 22 active attack techniques in production; and inference-time jailbreak research exposed surgical removal of refusal patterns from model hidden states. Independent competitive assessments confirmed Lakera Guard and ProtectAI as best-in-class, though cost-scaling (pricing-per-call models) drove adoption toward open-source alternatives. Vectra AI and security researchers documented prompt injection as OWASP LLM01 with 50-84% attack success rates and critical CVEs (Microsoft Copilot CVSS 9.3, GitHub CVSS 9.6, Cursor CVSS 9.8). Industry consensus shifted toward acceptance that prompt injection is a structural problem requiring new LLM design paradigms rather than a solvable technical challenge through filters or wrappers. The practice remained operationally essential with multiple competing vendors, defense-in-depth best practices, and enterprise deployments, while technically unresolved against adaptive adversaries and exhibiting persistent architectural constraints—sustaining bleeding-edge classification through mid-2026.

TIER HISTORY

ResearchJan-2023 → Jul-2023

Bleeding EdgeJul-2023 → present

EVIDENCE (80)

Open-Source Project Hits 800+ Stars by Enforcing AI Agent Rules Outside the PromptNotable Repositories2026-04-26

— Open-source project Caliber reached 810 GitHub stars and 101 forks by April 26, 2026, demonstrating community adoption of API-layer guardrails for agents. Addresses setup drift and deterministic policy enforcement.

AI threats in the wild: The current state of prompt injections on the webIndustry Reports2026-04-23

— Google Threat Intelligence empirical study of real-world prompt injections across 2-3 billion web pages (Common Crawl), using coarse-to-fine filtering methodology; finds attackers have not yet productionized advanced research at scale.

Move agent rules out of the prompt, violations drop to zeroIndustry Reports2026-04-22

— Empirical research showing prompt-based policy fails (20-62% violation rate); symbolic guardrails via API validators achieve 0% unsafe execution. Cites Carnegie Mellon research and 698 production incidents.

Check Point to Integrate AI Defense Plane with Google CloudProduct Launches2026-04-22

— Check Point's AI Defense Plane GA with Google Cloud Gemini integration shows ecosystem maturity with three-layer runtime protection (control, governance, runtime detection of prompt injection and data leakage).

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing UtilityResearch Papers2026-04-21

— Systematic research on non-probabilistic (symbolic/rule-based) defense approach for agents. Finds 74% of real-world policies can be guaranteed symbolically. Presents alternative to alignment-based guardrails with concrete safety proofs.

Generative AI Data Governance – Amazon Bedrock Guardrails - AWSProduct Launches2026-04-20

— AWS offers GA safeguards including explicit 'prompt attack detection' to block 'prompt injections and jailbreaks'; provides specific metrics (88% harmful content blocking, 99% automated reasoning accuracy) and names six enterprise customers adopting Bedrock Guardrails.

The OWASP Top 10 for LLM Applications: What Developers Shipping AI Features Need to KnowIndustry Reports2026-04-20

— OWASP framework positioned prompt injection as #1 LLM risk with no clean fix; discusses both direct and indirect injection variants with real CVEs and defence strategies.

The Alignment Tax: When Safety Features Make Your AI Product WorseOpinion2026-04-20

— Analysis of guardrail false positive rates and effectiveness tradeoffs; cites OR-Bench empirical study finding 0.878 Spearman correlation between safety score and over-refusal, proposes calibration and technical patterns for reducing false positives.

HISTORY

2023-H1: Prompt injection and jailbreak defence emerged as distinct practice area. Research benchmarks (JailbreakBench) and detection techniques (GradSafe) established evaluation foundations. Early-stage detection APIs (Geiger) launched as credit-based services. Security community consensus formed that no single robust defence existed, defining the practice's research-stage positioning.
2023-H2: Research defences matured with indirect-injection benchmarks (BIPIA) and structured-query systems (StruQ) demonstrating effectiveness at scale. Fine-tuning approaches (Jatmo) proved practical deployment feasibility. Lakera Guard reached enterprise preview, showing commercial product viability. Security analysis highlighted attack transferability across models, cementing research-stage classification and signalling long-term arms race dynamics.
2024-Q1: Defence tools moved to production: Microsoft Prompt Shields and AWS Bedrock Guardrails launched in preview/GA; NIST published formal taxonomy of attack types; PINT Benchmark enabled vendor comparison. Research continued with novel approaches (Anthropic's many-shot disclosure, cryptographic signing proposals). Critical assessments persisted: independent guardrail evasion (Mindgard) and expert scepticism about universal robustness kept the practice in bleeding-edge tier despite increased deployment activity.
2024-Q3: Enterprise-scale deployment accelerated with Dropbox publishing detailed Lakera Guard case study (7x latency gains); Lakera raised $20M Series A signalling sustained market validation. Evaluation methodology matured: StrongREJECT benchmark addressed prior flaws, InjecGuard exposed critical over-defense failures, and indirect-injection firewalls proposed novel approaches. Despite production viability, independent security testing continued revealing limitations and evasions, maintaining bleeding-edge classification.
2024-Q4: Vendor ecosystem maturation with AWS pricing reduction (85% cut for Bedrock Guardrails) and Lakera feature expansion (custom detectors, Citi Ventures participation). Empirical evidence of persistent vulnerabilities crystallized: systematic analysis showed 56% of 36 LLMs vulnerable to prompt injection with correlation to model size; jailbreak-tuning research exposed data poisoning vectors bypassing existing defenses; Emergent Mind flagged general-purpose defense as unresolved challenge. Microsoft invested in robustness evaluation (Adaptive Prompt Injection Challenge). Despite widening adoption, research demonstrated that no single defense solved the problem across model families or attack vectors, maintaining bleeding-edge classification.
2025-Q1: Market consolidation with Check Point's acquisition of Lakera (March 2025) signalling major infrastructure vendor entry. Lakera achieved Gartner AI TRiSM recognition; independent Palit Benchmark validated Lakera Guard and ProtectAI as leading solutions. Dropbox case study confirmed production viability (7x latency improvement). Novel CaMeL defense proposed provably-secure architecture. Concurrent Mindgard disclosure exposed Meta Prompt Guard evasion, reinforcing that no single defense prevented adaptive attacks. Market matured operationally while technical limitations persisted, maintaining bleeding-edge positioning.
2025-Q2: Deployment maturity deepened with research advances (JailbreaksOverTime continuous learning reducing false negatives 4%→0.3%) and enterprise adoption validation (Lakera customers Dropbox, COEUS Health via CB Insights). Independent assessments revealed persistent limitations: Palo Alto's Unit 42 analysis showed guardrails defeated by evasion tactics across platforms; Azure OpenAI documentation exposed false positive issues blocking legitimate agentic prompts; HiddenLayer's dataset critique highlighted evaluation methodology gaps; Lakera's AI Model Risk Index confirmed no model achieved universal security (Claude Sonnet 23.86% vs Llama 4 91.88% risk). Practice remained operationally viable yet fundamentally unresolved against adaptive adversaries, sustaining bleeding-edge tier.
2025-Q3: Research maturity accelerated with novel defense architectures (PromptSleuth using semantic intent invariance, FrameShield via activation disentanglement, AlignSentinel reducing false positives via attention-based classification). Vendor productization continued (Glean GA with 97.8% direct-injection, 90% indirect-injection accuracy). Critical assessments confirmed persistent vulnerabilities in production: ABV field report documented real-world Perplexity Comet data exfiltration (August 2025); VerSprite's multi-platform testing found no model completely immune to document-embedded injection across NotebookLM, Gemini, ChatGPT-4o, Copilot. Despite advanced research and widening deployment, evidence showed defenses remained evadable by determined adversaries, sustaining bleeding-edge classification and signalling arms race dynamics.
2025-Q4: Year-end consolidation revealed maturation paradox: peer-reviewed research confirmed limitations in existing defenses across GPT-3.5, GPT-4, Llama, and Vicuna; web agent benchmarking (WAInjectBench) exposed detector blind spots against imperceptible perturbations; yet Microsoft's Prompt Shield integration into Global Secure Access signalled major infrastructure vendor commitment. Systematic evaluation of jailbreak attacks showed safety filters detect nearly all synthetic attacks but optimization gaps remain in production systems. Independent security analysis emphasized persistent real-world agent compromises via indirect injection despite favorable synthetic benchmark results, reframing prompt injection defense as a risk management and privilege-scoping problem rather than a technical elimination challenge. Practice remained operationally deployed at scale while fundamental technical resolution remained unsolved, with industry consensus shifting toward acceptance of residual risk and defense-in-depth strategies. Bleeding-edge classification sustained, reflecting mature operational deployment with unresolved technical foundations.
2026-Jan: Academic research consolidation advanced with systematic literature review (88 studies) extending NIST taxonomy and causal analysis identifying direct jailbreak drivers across 35k attempts. Novel procedural detection approaches (RLM-JB) achieved 92.5-98% recall with <2% false positives. Industry analysis confirmed prompt injection remained OWASP#1 vulnerability (76-90% ASR) with security researcher consensus that fool-proof prevention remains unsolved. Critical assessments intensified: Bruce Schneier argued fundamental unsolvability due to LLM architectural constraints; multi-lab research (OpenAI, Anthropic, Google DeepMind) demonstrated adaptive attacks bypassing 12 published defenses at >90% rates, prompting CISA and NCSC advisory actions. Trend reflected mature operational deployment coexisting with deepening recognition of architectural rather than tactical limitations—continued bleeding-edge positioning reflecting technical stalemate despite intensive research effort.
2026-Feb: Comprehensive systematic review of 128 studies confirmed attack success >90% with defenses effective up to 95% against known patterns, but highlighted standardized evaluation gaps. Microsoft expanded Prompt Shields to Azure AI Foundry and Global Secure Access (February 2026). Lakera Guard maintained 98%+ detection, <50ms latency, <0.5% false positives across 100+ languages. Vectra AI documented prompt injection as OWASP LLM01 with 50-84% success rates and critical CVEs (Microsoft Copilot CVSS 9.3, GitHub CVSS 9.6, Cursor CVSS 9.8). Jailbreak generalization testing showed GPT-5 and Grok 4 both evadable within 30 minutes. Critical analyses emphasized fundamental architectural unsolvability. Market consolidation matured with Check Point Lakera integration completed. Despite widened operational deployment and security vendor commitment, architectural limitations persisted—sustaining bleeding-edge classification.
2026-Mar: Systematic evaluation of frontier model robustness via large-scale arena testing (464 participants, 272K attacks across 41 real-world agent scenarios) revealed significant variance—Claude Opus 0.5% ASR vs Gemini 8.5%—and critical finding that capability does not correlate with safety. Comprehensive SoK (UCLA/NTU/NVIDIA) analyzing 78 papers established that no single defense achieves simultaneous trustworthiness, utility, and low-latency operation; identified critical gaps in context-dependent agent task coverage. Real-world telemetry from Unit 42 documented first confirmed AI ad-review evasion via indirect injection; 22 distinct attack techniques identified in production, confirming active weaponization beyond theoretical PoCs. Novel research advances proposed MCP-level supply-chain defenses (ShieldNet with 0.995 F1 on 10K+ malicious tool variants) and inference-time jailbreak techniques exploiting geometric constraints in model alignment rather than surface-level prompting.
2026-Apr: Fundamental theoretical constraints crystallized through peer-reviewed research establishing the "defense trilemma"—no continuous, utility-preserving input preprocessing can achieve simultaneous safety and performance; formally verified in Lean 4 and empirically validated. Empirical attack taxonomy across 250 crafted prompts showed composite obfuscation+semantic attacks achieving 97.6% success against intent-aware defenses, while inference-time jailbreak research demonstrated surgical removal of refusal patterns from model hidden states, exposing RLHF alignment as structurally fragile rather than merely tactically deficient. A countervailing architectural signal emerged: symbolic/runtime guardrails moving enforcement out of the prompt achieved near-zero violation rates (vs. 20-62% for prompt-based policies), validated by ClawGuard (4-30x attack reduction at tool-call boundaries) and peer-reviewed symbolic guardrails research (74% of real-world policies encodable deterministically); RSAC 2026 demonstrated 76% on-device bypass success against Apple Intelligence, while Google's large-scale Common Crawl scan of 2-3 billion web pages found prompt injection attack content present but not yet productized at scale—confirming active threat presence with contested organizational defenses.