The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
AI using reinforcement learning or adversarial techniques to generate edge-case and fault-finding test scenarios. Includes fuzz testing augmented with LLMs and RL-based test case evolution; distinct from standard test generation which aims for coverage rather than fault discovery.
Adversarial test generation applies reinforcement learning, mutation, and adversarial techniques to systematically discover faults and edge cases that coverage-oriented testing misses. The practice spans LLM red teaming, AI-augmented fuzzing, and RL-driven test case evolution. It is advancing at an accelerating pace, with production deployments at Fortune 500 enterprises in finance and healthcare and investment validation in OpenAI's $86M acquisition of Promptfoo in March 2026. Research results remain strong: IEEE S&P 2026 accepted PILOT, which discovered 51 CLI vulnerabilities, 33 of them already patched; GoldenFuzz found 5 new hardware vulnerabilities; and CVPR 2026 papers advance multimodal model fuzzing with a 36% improvement over prior methods. However, a critical gap persists between vendor maturity and practitioner capability: most organisations lack the operational expertise, threat models, and deployment patterns to embed continuous adversarial testing into CI/CD. Regulatory pressure is rising, with EU AI Act Article 15 now mandating resilience testing for high-risk systems. Threat actors are moving faster still (AI-powered fuzzing is reported to deliver 400% coverage improvements), which makes adoption urgent on both organisational and strategic grounds.
Autonomous adversarial testing entered production at scale in June 2026, with autonomous AI agents deployed as security tools. In one deployment, depthfirst discovered 21 confirmed zero-day vulnerabilities in FFmpeg (1.5M lines of code) at $1,000 total cost, with reproducible proofs-of-concept and 9 CVEs assigned (CVE-2026-39210 through CVE-2026-39218), signalling that cost-effective autonomous vulnerability discovery is no longer a research prototype but an operational capability. CovRL combined LLM-based mutation with coverage-guided reinforcement learning on JavaScript engines, discovering 48 real bugs (39 novel, 11 CVEs). FireCompass achieved Gartner analyst recognition for autonomous AI platforms, with Fortune 500 adoption and documented performance: agents outperform manual red teams 60–70% of the time on internal evaluations, reported benchmark accuracy is 100%, and cost falls from $2,400–$10,000 per manual pentesting engagement to about $1,000 per autonomous scan.
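To make the idea of coverage-guided mutation concrete, here is a minimal sketch of the general loop. It is not CovRL's or depthfirst's implementation: the target program, the branch names, the byte-level mutator, and the planted fault are all hypothetical, and a plain random mutator stands in for the LLM- or RL-driven mutation described above.

```python
import random
from typing import Optional, Set, Tuple


def toy_target(data: bytes) -> Set[str]:
    """Hypothetical program under test: returns the branches it executed
    and raises on one deeply nested input (the planted fault)."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add("b1")
        if len(data) > 1 and data[1] == ord("U"):
            covered.add("b2")
            if len(data) > 2 and data[2] == ord("Z"):
                raise RuntimeError("planted fault reached")
    return covered


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte with a random value."""
    buf = bytearray(seed)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(max_iters: int = 200_000, rng_seed: int = 0) -> Tuple[Optional[bytes], int]:
    """Keep any input that reaches a new branch and mutate from that corpus.
    Returns (crashing input or None, number of iterations used)."""
    rng = random.Random(rng_seed)
    corpus = [b"AAAA"]
    seen = set(toy_target(corpus[0]))
    for i in range(max_iters):
        candidate = mutate(rng.choice(corpus), rng)
        try:
            covered = toy_target(candidate)
        except RuntimeError:
            return candidate, i + 1  # fault found
        if not covered <= seen:  # candidate reached a branch not seen before
            seen |= covered
            corpus.append(candidate)
    return None, max_iters


if __name__ == "__main__":
    crash, iters = fuzz()
    print(f"crashing input: {crash!r} after {iters} iterations")
```

The coverage feedback is what makes the search tractable in this toy: each newly reached branch is kept as a stepping stone, so the three-byte prefix is found one byte at a time rather than by guessing all three bytes at once.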
Vendor maturity accelerated in Q1 2026. F5's AI Red Team (January 2026) now deploys to Fortune 500 enterprises in regulated sectors with 10,000+ attack techniques. More significantly, OpenAI's $86M acquisition of Promptfoo in March 2026 signals mainstream platform consolidation — Promptfoo reached 350,000 developers and 25%+ Fortune 500 adoption within two years, with automated red-teaming of 50+ vulnerability types built into CI/CD workflows. Specialist vendors (PyRIT, Robust Intelligence, HiddenLayer) proliferate; analyst reports project the market expanding from $680M in 2025 to $8.92B by 2034 at 34% CAGR, with prompt injection attacks surging 340% in enterprise deployments.
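As a quick arithmetic check on the market projection, the sketch below computes the compound annual growth rate implied by the two endpoints quoted above. It assumes 2025 is the base year and 2034 the end year (nine compounding periods); the analyst reports may use a different base year or rounding, so treat the result as a sanity check, not a correction.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at `rate` for `years` periods."""
    return start_value * (1 + rate) ** years


START_MUSD = 680.0    # $680M in 2025, from the text
END_MUSD = 8920.0     # $8.92B by 2034, from the text
YEARS = 2034 - 2025   # assumption: nine compounding periods

rate = implied_cagr(START_MUSD, END_MUSD, YEARS)
projected = project(START_MUSD, 0.34, YEARS)

if __name__ == "__main__":
    print(f"implied CAGR: {rate:.1%}")
    print(f"2034 value at exactly 34%: ${projected / 1000:.2f}B")
```

Under these assumptions the implied rate is roughly 33%, close to the quoted 34%, while compounding at exactly 34% for nine years would give about $9.5B. The quoted figures are therefore consistent to within rounding.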
Real-world deployments show operational maturity across domains. Multi-agent systems now discover protocol-level bugs (the Agora framework found 15 previously unknown vulnerabilities in production consensus implementations). Multi-agent adversarial arenas run continuously in production (15 agents on one system with 91.8% detection rates across 3,200+ attempts and cryptographic proof-of-integrity). High-stakes red-teaming of LLM applications (legal brief validators, therapeutic agents) demonstrates the practice operationalised with explicit safety thresholds. Infrastructure wins accumulate: IEEE S&P 2026 accepted PILOT (51 CLI vulnerabilities across 43 real-world programs, 33 patched); Anthropic's Frontier Red Team discovered 500+ zero-days, including 22 Firefox vulnerabilities in a 2-week collaboration with Mozilla; Google's OSS-Fuzz discovered 3,818 vulnerabilities across major open-source projects, driving active remediation across the ecosystem. Academic researchers using fuzzing independently discover critical vulnerabilities (e.g., a Chrome WebNN GPU vulnerability in March 2026) that escaped years of normal development and security processes.
Critical capability gaps persist despite market expansion and operational advances. Same-model code generation and same-model test generation suffer systematic blind spots: empirical research (SAGA) showed that 50% of AI-generated test suites failed to detect known errors and that 84% of verifiers were themselves flawed, which supports the case for adversarial and mutation-guided testing. Cisco's multi-turn adversarial evaluation of 15 frontier models (36,076 attacks in total) found that every model failed a meaningful share of sustained attacks, with attack success rates of 7.89–88.30% on multi-turn sequences compared to 2.19–64.91% on single-turn prompts, revealing vulnerabilities that benchmark-based safety claims miss. An empirical study of 13 open-source AI pentesting frameworks found that 8 hallucinate results: they stop at decodable strings and never reach actual vulnerability chains, producing false security findings without ground-truth validation. Meanwhile, only 16% of organisations have ever red-teamed AI models, yet 74% have experienced AI security breaches. EU AI Act Article 15 mandates resilience testing for high-risk systems, a regulatory forcing function, but deployment guidance remains sparse and tool reliability is empirically questionable. Enterprise adoption is shifting from static review cycles to continuous adversarial testing (Microsoft's RAMPART in CI/CD, OpenAI's EVMbench), but orchestration complexity (state management, tool integration, and evidence validation) remains the primary gap between research capability and enterprise-ready deployment.
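The attack success rate (ASR) figures above are, in the usual definition, the share of attack attempts that succeed, computed separately for single-turn and multi-turn conditions. The sketch below shows that calculation on a handful of made-up records. The Attempt type, the model name, and the data are hypothetical and are not drawn from Cisco's evaluation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Attempt:
    model: str        # hypothetical model identifier
    multi_turn: bool  # True for a sustained multi-turn attack
    succeeded: bool   # True if the attack elicited the unsafe behaviour


def attack_success_rate(attempts: Iterable[Attempt], multi_turn: bool) -> Optional[float]:
    """Percentage of attempts in the chosen condition that succeeded,
    or None if there are no attempts in that condition."""
    subset = [a for a in attempts if a.multi_turn == multi_turn]
    if not subset:
        return None
    return 100.0 * sum(a.succeeded for a in subset) / len(subset)


# Illustrative records only: 4 single-turn and 4 multi-turn attempts.
records = [
    Attempt("model-a", False, True),
    Attempt("model-a", False, False),
    Attempt("model-a", False, False),
    Attempt("model-a", False, False),
    Attempt("model-a", True, True),
    Attempt("model-a", True, True),
    Attempt("model-a", True, True),
    Attempt("model-a", True, False),
]

single_asr = attack_success_rate(records, multi_turn=False)
multi_asr = attack_success_rate(records, multi_turn=True)

if __name__ == "__main__":
    print(f"single-turn ASR: {single_asr:.2f}%")
    print(f"multi-turn ASR:  {multi_asr:.2f}%")
```

Reporting the two conditions side by side, per model, is what exposes the gap the paragraph describes: a model can look robust on single-turn prompts and still fail often under sustained attack.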
— depthfirst autonomous AI agent discovered 21 confirmed zero-day vulnerabilities in FFmpeg (1.5M LOC) with reproducible PoCs and 9 CVE assignments at $1,000 total cost. Demonstrates cost-effective autonomous adversarial test generation at scale.
— Gartner-recognized autonomous AI platform for adversarial exposure validation with Fortune 500 adoption, 100% benchmark accuracy, and agents outperforming manual red teams 60-70% of the time.
— Enterprise shift from static to continuous adversarial testing (Microsoft RAMPART in CI/CD, OpenAI EVMbench). Documents operational maturation and tool integration patterns for agentic AI security.
— LLM+RL coverage-guided fuzzing for JavaScript engines discovered 48 real bugs (39 novel, 11 CVEs) without post-processing for syntax errors. Evidence of production-ready adversarial test generation.
— LLM framework alternates between code generation and adversarial test generation (targeting runtime failures, not coverage); 3-7% improvements on CodeContests, MBPP, LiveCodeBench benchmarks via execution-derived signals.
— Neuro-symbolic pipeline combining LLM extraction, Datalog reasoning, and SMT solving re-discovered CVE-class vulnerabilities (including CVSS-9.8 curl bug) with reproducible PoCs and 4-5 novel bugs in libarchive.
— Multi-agent system with dedicated TestGen agent discovered 15 previously unknown protocol-level logic bugs in production consensus implementations (Raft, EPaxos, HotStuff, BullShark).
— Cisco evaluated 36,076 multi-turn vs single-turn attacks on 15 frontier models; multi-turn ASR 7.89-88.30% vs single-turn 2.19-64.91%. Validates need for adversarial testing beyond benchmark-style evaluation.
2024-Q3: OWASP initiates standardized red teaming methodologies; Miami University releases AiR-TK open-source toolkit with 25+ adversarial attacks; HARM framework advances automated RL-based test generation for LLMs. Regulatory mandates (Biden EO, EU AI Act) drive industry adoption. Tools for fuzzing and vulnerability assessment (ZAP add-on) enable practical adversarial testing workflows.
2025-Q1: Mutation-based fuzzing achieves 95%+ jailbreak success rates on production LLMs (TurboFuzzLLM). LLM-assisted fuzzer generation for automotive protocols advances domain-specific application (SAE International research). Adversarial testing frameworks extend to industrial control systems (AAG). Critics highlight growing concerns about evaluation rigor in the field.
2025-Q2: Ecosystem maturation accelerates: CyberArk releases FuzzyAI (1.3k GitHub stars); Meta publishes AutoPatchBench (136 fuzzing-discovered vulnerabilities) as part of CyberSecEval 4. RL-augmented fuzzing extends to robotics (GzFuzz, 25 crashes detected) and autonomous agents (RedTeamCUA, 60% attack success on computer-use agents). LLM-directed fuzzing advances efficiency (RandLuzz, 2.1x-4.8x speedups). Adoption barriers persist: practitioners report that traditional testing fails on AI systems, and the gap between research results and deployment guidance remains the core blocker.
2025-Q3: Research methodologies mature across domains: LLAMAFUZZ extends LLM-augmented fuzzing to structured data; AdverTest demonstrates two-agent adversarial RL for fault detection (8.56%+ improvements); MetAdv brings hybrid virtual-physical testing to autonomous driving (ACM recognition); LLAMA targets smart contract security (91% coverage). Novel training advances: UTRL outperforms frontier models on test quality. Enterprise adoption signals: Pentera (1200+ customers) commits to agentic red teaming. Core tension remains: tools advance but deployment guidance gap persists.
2025-Q4: Market validation accelerates with Gartner recognition of Adversarial Exposure Validation ($2.5B projected by 2026, 45% adoption). Real-world deployments documented: FuzzyAI used in AWS Bedrock security assessments; ATGen RL framework achieves 60% improvements over baseline LLM test generation. Threat actor adoption surfaces: AI-powered fuzzing shows 400% coverage and 280% bug discovery improvements. Tool ecosystem matures: specialized AI pentesting vendors (PyRIT, Robust Intelligence, HiddenLayer) gain visibility. Challenge persists: despite research advances and market momentum, deployment guidance for CI/CD integration remains the adoption bottleneck.
2026-Jan: Enterprise adoption accelerates: F5 releases AI Red Team with 10,000+ attack techniques and deploys to Fortune 500 enterprises in regulated sectors (finance, healthcare). Research advances continue with frequency-aware adversarial perturbations for vision system testing (IFAP). Threat landscape solidifies: practitioners assess adversarial ML attacks as operational risks today with escalating sophistication; deployment maturity follows market demand.
2026-Feb: Research methodologies mature across domains: SAFuzz advances semantic-guided fuzzing for detecting vulnerabilities in LLM-generated code (85.7% precision); AdverTest introduces two-agent adversarial loop for unit test generation (8.56% improvement over LLMs). Test suite robustness elevated: SWE-ABS framework strengthens benchmarks via mutation-driven adversarial testing, exposing inflated success metrics. Production CI/CD integration: Wireshark's automated fuzz job discovers memory safety bugs in real-world code. Practice transitions from research validation to operationalized methodology with deployment guidance patterns emerging.
2026-Mar: Vendor maturity and real-world deployment validate market category. OpenAI acquires Promptfoo (350K developers, 25% Fortune 500 adoption) for $86M; platform integrates 50+ adversarial test types into CI/CD. Research breakthroughs across domains: PILOT (IEEE S&P) discovers 51 CLI vulnerabilities; GoldenFuzz (NDSS) finds 5 critical hardware flaws; VIPL publishes 10 CVPR papers on vision-language adversarial attack generation (36% SOTA improvement); EACL 2026 demonstrates adaptive black-box optimization raising danger scores from 0.09 to 0.79 on production LLMs. Production systems demonstrate operational maturity: multi-agent adversarial arenas achieve 91.8% detection rates with continuous evolution; DeepTeam framework handles high-stakes multi-agent red-teaming (legal, therapeutic); AdvJudge-Zero fuzzer bypasses AI-judge safety mechanisms with 99% success rate via logit-gap analysis. Enterprise operationalization documented: internal adversarial simulation labs with CI/CD integration using CleverHans, Torchattacks, and IBM ART frameworks. Regulatory drivers surface: EU AI Act Article 15 mandates resilience testing. Critical perspective emerges: 540% year-over-year surge in prompt injection exploits; traditional security testing fails on non-deterministic AI systems; deployment guidance gap remains primary adoption barrier despite vendor proliferation.
2026-Apr: Market category confirmed at scale ($680M expanding to $8.92B by 2034 at 34% CAGR) as production red-teaming reaches landmark results: Anthropic Frontier Red Team, AISLE, and XBOW collectively discovered 500+ zero-days and 1,000+ vulnerabilities across major organisations, while a solo PhD researcher using fuzzing uncovered a critical CVSS-rated Chrome WebNN GPU vulnerability. Technical breakthroughs accumulate across the month: TEMPLATEFUZZ achieves 98.2% attack success on 12 open-source and 5 commercial LLMs; MASFuzzer demonstrates multidimensional API fuzzing for deep vulnerability discovery; CrowdStrike advances feedback-guided fuzzing methodology; the ARES adaptive red-teaming framework achieves a 0.97 safety rate on StrongReject using compositional attack generation. AI security agents crossed from assistants to autonomous hackers: Project Glasswing identified thousands of zero-days, and Claude Opus 4.6/Kimi K2.5 generate working exploits autonomously. Gartner recognizes Adversarial Exposure Validation as a mature category; BreachLock reports 40,000+ engagements with Fortune 100 adoption. However, tool reliability remains problematic: an empirical study of 13 open-source AI pentesting frameworks found that 8 hallucinate results, stopping at decodable strings without reaching actual vulnerability chains, which undermines trust in automated adversarial testing outputs. The adoption gap persists: only 16% of organisations have red-teamed AI systems despite 74% having experienced AI security breaches, and prompt injection attacks surged 340% in enterprise deployments.
2026-May: Operational maturity solidifies with large-scale production deployments. Mozilla's agent-based fuzzing with Claude Mythos Preview discovered 271 Firefox vulnerabilities (180 sec-high); continuous adversarial pentesting across 28 companies found 2,000 vulnerabilities (44.6% critical/high); AdvNet exposed critical kernel bugs across 27 protocol implementations. Votal AI launched an RLHF-trained adversarial attacker with 100K+ attack prompts across 185+ named techniques. Multi-agent LLM fuzzing systems now deliver production results at scale: FuzzingBrain V2 found 29 zero-days with 2 assigned CVEs; a four-principles harness generation framework confirmed 42 bug reports and 3 CVEs across 23 OSS projects; FuzzAgent reported 102 confirmed vulnerabilities with 78 upstream fixes; semantic fuzzing for agent skill specifications uncovered 26 previously unknown exploitables in production systems. OWASP released its AI Testing Guide v1 with methodologies covering evasion, poisoning, extraction, and prompt injection, providing the first authoritative practitioner standard. Orchestration complexity — state management, tool integration, and evidence validation — remains the primary gap between research capability and enterprise-ready deployment.
2026-Jun: Autonomous adversarial testing crossed a cost threshold: depthfirst discovered 21 confirmed zero-day vulnerabilities (9 CVEs) in 1.5M-LOC FFmpeg at $1,000 total cost, while CovRL's LLM+RL coverage-guided fuzzer found 48 real JavaScript engine bugs (11 CVEs) without post-processing. FireCompass reached GA with Gartner recognition and Fortune 500 adoption, documenting autonomous agents outperforming manual red teams 60–70% of the time. Cisco's multi-turn adversarial evaluation of 15 frontier models (36,076 attacks) confirmed ASR of up to 88.3% on sustained multi-turn sequences — validating that adversarial testing must go beyond single-turn benchmarks to surface real vulnerability chains.