Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain, plotted on a scale from Bleeding Edge to Established.

Penetration testing assistance

GOOD PRACTICE

TRAJECTORY

Advancing

AI that assists penetration testers by suggesting attack vectors, automating reconnaissance, and identifying exploitation paths. Includes AI-guided vulnerability exploitation and attack chain planning; distinct from vulnerability scanning, which identifies weaknesses without attempting exploitation.

OVERVIEW

AI-assisted penetration testing has crossed into mainstream operational practice, moving from point-in-time engagements to continuous, agentic validation architectures. Frontier LLMs (Gemini 3 Pro, Claude Opus 4.5) achieve ~70% autonomous exploitation success on diverse targets, and peer-reviewed research isolates where capability ends: exploitation reaches 90% when agents are given ground-truth reconnaissance, but autonomous reconnaissance plateaus at 50%, which limits end-to-end autonomy. Market adoption signals are unambiguous: 87% of security leaders are actively planning or piloting agentic AI pentesting, 95% expect it to displace traditional manual services, and YesWeHack's June 2026 launch shows enterprise customers (Dassault Systèmes, Sanofi, multiple CAC 40 firms) in production with same-day autonomous testing. The structural tension is not whether AI adds value but where the autonomy boundary lies and how to embed it sustainably. Full end-to-end automation without human validation remains infeasible for two reasons: reconnaissance gaps require hybrid architectures, and detection now outpaces organizational remediation velocity (AI findings resolve at 38.4% versus 77.3% for traditional vulnerabilities). Production success requires human-in-the-loop orchestration, continuous revalidation, and treating deployment as a governance problem, not a tooling problem.

CURRENT LANDSCAPE

The vendor ecosystem has consolidated around established platforms shipping autonomous pentesting in production. Pentera (938+ enterprise customers, Gartner Representative Vendor, 525-600% documented ROI) expanded in June 2026 with an MCP (Model Context Protocol) server that enables AI agent orchestration to trigger pentesting directly in SecOps workflows. YesWeHack launched Agentic Pentest (GA June 2026), with same-day autonomous testing already deployed at Dassault Systèmes, Sanofi, and multiple CAC 40 companies. RidgeBot v7.0 (available on the AWS and Azure Marketplaces) added autonomous compromise simulation for Windows Active Directory. AWS Security Agent reached GA on 31 March 2026 at $50/task-hour and was extended to repository code review in May 2026 and to threat modeling in June. CyCognito expanded continuous AI pentesting to 60+ AI infrastructure categories (MCP servers, RAG systems, Ollama, MLflow), documenting attack chains across AI tools and physical security systems. FireCompass deployed at a Fortune 500 technology firm, reporting an 11x cost reduction ($5K→<$1K per app), lead time compressed from 2+ weeks to 1 day, and coverage expanded from 10% to 99%.

Frontier LLM capability has matured measurably. Peer-reviewed benchmarking shows Gemini 3 Pro and Claude Opus 4.5 achieving ~70% autonomous exploitation success on diverse 300-server environments. Decoupling reconnaissance from exploitation reveals the hard constraint: given ground-truth vulnerability context, agents reach 90% exploitation success, but autonomous reconnaissance alone plateaus at 50% because of telemetry-parsing and tool-output interpretation failures. Stanford research documents 80% of human testers finding critical RCEs that every tested AI agent missed, illustrating capability boundaries in novel contexts. A six-layer governance framework (ownership validation, network-level scoping, isolation, validation, observability, data residency) has emerged as a production requirement rather than a guideline, as reflected in the Cloud Security Alliance's 2026 agentic pentesting best practices.
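Two of the six governance requirements named above, ownership validation and network-level scoping, can be pictured as a pre-flight gate that an agent must pass before it touches a target. The following is a minimal sketch under stated assumptions: the `Engagement` type, the `is_in_scope` function, and the example address ranges (reserved documentation ranges) are hypothetical illustrations and are not taken from the CSA guidance or from any vendor product.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Engagement:
    """Hypothetical record of what a client has authorised for testing."""
    owner_verified: bool                      # ownership validation outcome
    allowed_networks: tuple[str, ...]         # CIDR ranges explicitly in scope
    excluded_networks: tuple[str, ...] = ()   # carve-outs inside allowed ranges


def is_in_scope(engagement: Engagement, target: str) -> bool:
    """Return True only if the target may be tested under this engagement."""
    # Ownership validation: refuse everything until ownership is confirmed.
    if not engagement.owner_verified:
        return False
    addr = ip_address(target)
    # Exclusions win over allowances, so a carve-out can never be overridden.
    if any(addr in ip_network(net) for net in engagement.excluded_networks):
        return False
    # Network-level scoping: the target must fall inside an allowed range.
    return any(addr in ip_network(net) for net in engagement.allowed_networks)


# Illustrative engagement using reserved documentation address ranges.
engagement = Engagement(
    owner_verified=True,
    allowed_networks=("192.0.2.0/24",),
    excluded_networks=("192.0.2.128/25",),
)

print(is_in_scope(engagement, "192.0.2.10"))    # True: inside the allowed range
print(is_in_scope(engagement, "192.0.2.200"))   # False: inside the carve-out
print(is_in_scope(engagement, "198.51.100.5"))  # False: never authorised
```

The gate is deliberately deny-by-default: an unverified owner or an unlisted address is rejected without further evaluation. A production deployment would presumably enforce the same rule at the network layer as well, rather than relying on the agent to police itself.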

A structural remediation gap has emerged as the limiting factor in adoption. Cobalt's PTaaS data from thousands of engagements reveals a 2:1 remediation deficit: AI/LLM vulnerabilities are resolved at 38.4% versus 77.3% for traditional web vulnerabilities, indicating that detection at scale now outpaces organizational capacity to remediate AI-specific findings. Large-scale deployment data (6.8M findings across 1,000+ organizations) shows cloud vulnerabilities growing 44x while testing coverage grew 1.23x, a structural supply-demand imbalance. The OWASP Autonomous Penetration Testing Standard (APTS v0.1.0) codifies four autonomy levels with explicit human-oversight requirements, signaling industry consensus that full autonomy remains infeasible. The practice's maturation is evidenced less by capability, which has already crossed into production effectiveness, than by the recognition that autonomous pentesting is a governance and orchestration problem, not a raw capability problem.
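The ratios quoted above can be checked with a few lines of arithmetic. This is a rough sketch using only the figures in the text. It assumes the two resolution rates are measured over comparable populations of findings, and that the unresolved share is simply 100 minus the resolved share. The variable and function names are illustrative.

```python
# Figures quoted in the text (percent of findings resolved).
AI_LLM_RESOLVED = 38.4
TRADITIONAL_RESOLVED = 77.3

# Growth multiples quoted in the text.
CLOUD_VULN_GROWTH = 44.0
TEST_COVERAGE_GROWTH = 1.23


def unresolved_share(resolved_pct: float) -> float:
    """Assumed complement: whatever is not resolved remains open."""
    return 100.0 - resolved_pct


# How much more often traditional findings get resolved than AI/LLM findings.
resolved_gap = TRADITIONAL_RESOLVED / AI_LLM_RESOLVED

# How much larger the open backlog share is for AI/LLM findings.
backlog_gap = unresolved_share(AI_LLM_RESOLVED) / unresolved_share(TRADITIONAL_RESOLVED)

# How far vulnerability growth has outrun testing coverage growth.
growth_gap = CLOUD_VULN_GROWTH / TEST_COVERAGE_GROWTH

print(round(resolved_gap, 2))  # 2.01
print(round(backlog_gap, 2))   # 2.71
print(round(growth_gap, 2))    # 35.77
```

Read this way, the "2:1 deficit" describes resolution rates. Under the same assumptions, the share of findings left open is about 2.7 times larger for AI/LLM vulnerabilities than for traditional ones.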

TIER HISTORY

Research: Jan-2023 → Jan-2023
Bleeding Edge: Jan-2023 → Apr-2025
Leading Edge: Apr-2025 → Jul-2025
Good Practice: Jul-2025 → present

EVIDENCE (121)

— YesWeHack Agentic Pentest GA launch with named enterprise customers (Dassault Systèmes, Sanofi, multiple CAC 40 companies) delivering same-day autonomous testing across web, mobile, APIs with zero-false-positive triage option and EU/APAC region support.

— Large-scale Cobalt PTaaS remediation data reveals critical adoption barrier: AI/LLM vulnerability resolution at 38.4% versus 77.3% for APIs—a 2:1 deficit indicating detection capability outpaces organizational remediation capacity despite tool maturity.

— Empirical two-stage evaluation framework isolates exploitation success (90% with ground-truth context) from autonomous reconnaissance (50% success), identifying telemetry parsing and tool-output interpretation as critical bottlenecks limiting end-to-end autonomy.

— Practitioner analysis of three AI pentesting market segments (autonomous platforms, AI-native web testers, BAS) with critical assessment: Stanford study shows 80% of human testers found critical RCE missed by all tested AI agents, underscoring hybrid human-AI model necessity.

— Fortune 500 technology company deployment: cost reduced 11x ($5K→<$1K per app), lead time compressed from 2+ weeks to 1 day, coverage expanded 10%→99%; demonstrates quantified ROI of continuous autonomous pentesting at scale with <2% false positive rate.

— Structured vendor analysis (Simbian, XBOW, Horizon3, Pentera, Sprocket, BreachLock, NetSPI, Bishop Fox, Praetorian, Synack) evaluated on autonomy depth, surface breadth, reasoning transparency, cadence, pricing, and closed-loop defense integration—mapping market consolidation and adoption drivers.

— CSA/Synack governance framework for agentic pentesting identifying six technical requirements (ownership validation, network-level scoping, isolation, validation, observability, data residency) and organizational guardrails reflecting maturity of human-in-the-loop production deployment patterns.

— CyCognito continuous AI pentesting expansion to AI-native infrastructure (60+ model categories: MCP, RAG, Ollama, MLflow) with documented attack chains showing exposure across AI tools, security systems, and physical infrastructure—evidence of practice expanding beyond traditional network pentesting.

HISTORY

  • 2023-H1: Initial research prototypes (PentestGPT, ChatGPT-based studies) and early commercial offerings (RidgeBot, vPenTest) emerged; academic and vendor exploration alongside practitioner critique of limitations; LLMs showed promise for vulnerability identification (20/28 in academic testing) but struggled with context persistence and data confidentiality; analyst consensus positioned automated pen testing as supplementary to manual testing rather than replacement.
  • 2023-H2: Research-backed systems (AutoPT, PentestGPT peer-reviewed publication) demonstrated quantified improvements (228.6% completion gains, 41% benchmarks); Pentera scaling to 800+ customers and $1B valuation; RidgeBot GA on Azure Marketplace. False positive burden documented (81% of IT pros report >20% cloud false positives). Deployment remained on-premise-focused due to data sensitivity and provider constraints. Automated tools confirmed as augmentation, not replacement.
  • 2024-Q1: Market expanded with new LLM-based tools (PentestAI, ZeroThreat) and comparative studies (GPT-4o vs GPT-4 Turbo on real-world exploitation). RidgeBot showed active production deployment against real vulnerabilities (Ivanti CVEs). New entrants claimed significant performance gains (98% accuracy, 10x speedup) but fundamental constraints persisted: on-premise-only deployment, false positive burden, and consensus that human expertise remains essential for complex attack chains.
  • 2024-Q2: Empirical research (AutoPenBench, June 2024) quantified limits of autonomous agents—21% success on simple tasks, 1/33 real-world—validating human-in-the-loop architectures (64% success). Peer-reviewed studies tested full pentesting workflows with mixed risk/benefit signals. Vendor ecosystem expanded geographically (RidgeBot in Japan). Critical assessments from established firms (NCC Group) reinforced that AI augments but cannot replace human judgment. Architecture and deployment constraints remained unchanged.
  • 2024-Q3: Vendor ecosystem matured with product integrations (RidgeBot 4.3.3 with Tenable/Rapid7, Bugcrowd CASPT launch). Market validation continued: MarketsandMarkets projected PTaaS market growth to $301M by 2029 (20.5% CAGR) with AI/ML as key driver. Community-driven benchmarking efforts (AI-Pentest-Benchmark) provided open-source evaluation tools. Critical assessments documented persistent limitations: GPT-4 success rates at 42.7% on web vulnerabilities. Consensus held: AI augments pentesting but human expertise essential for complex attack planning and contextual judgment.
  • 2024-Q4: USENIX Security 2024 published peer-reviewed PentestGPT paper demonstrating 228.6% task-completion gains and real-world effectiveness with 6,500+ GitHub stars confirming community adoption. Ethical hacker adoption surged: Bugcrowd survey of 1,300 practitioners showed 77% AI integration and 71% perceive value increase (vs 21% in 2023). RidgeBot 5.0 GA introduced Web API testing capabilities, expanding vendor ecosystem. However, organizational integration gaps persisted: ISACA survey found only 35% of cybersecurity teams involved in enterprise AI implementation, and benchmark research (Drexel/arxiv) confirmed both GPT-4o and Llama 3.1 fall short of autonomous end-to-end pentesting. Market maturation evident but human-in-the-loop architecture remained dominant constraint.
  • 2025-Q1: Analyst recognition accelerated: Pentera achieved Gartner Representative Vendor status in 2025 Adversarial Exposure Validation (AEV) market guide, signaling mature analyst coverage. Vendor ecosystem continued product evolution: RidgeBot 5.2 launched RidgeGen, a specially trained GenAI module for enhanced validation. Real-world deployment metrics emerged from production environments: Penligent.ai documented 2.8-day MTTR (vs 7-day industry average) with sub-3% false positive rates in CI/CD pipelines. Gartner predicted 60% organizational adoption of automated pentesting tools by 2025, yet Horizon3 survey of 50,000+ real penetration tests revealed persistent barriers: 36% of CISOs delay patching due to inability to distinguish exploitable vulnerabilities; 41% report pentest report unreliability. Ethical hacker adoption remained high (77% using AI tools) but skepticism persisted: only 22% believe AI outperforms humans, 30% doubt AI replicates human creativity. Architecture remained human-in-the-loop; full automation remained unachieved.
  • 2025-Q2: Vendor ecosystem matured toward scale and orchestration: RidgeSphere GA enabled centralized management of hundreds of RidgeBot deployments for MSSPs, while Pentera 7 GA introduced distributed attack orchestration across remote sites with AI-based pattern identification for recurring weaknesses. Research advanced: PentestGPT v2 achieved 91% task completion on CTF benchmarks and 4/5 host compromise on GOAD Active Directory (39-49% relative improvement) through Tool and Skill Layer with 38 typed security tools. Enterprise adoption metrics strengthened: Pentera survey of 500 CISOs showed 50%+ now use software-based pentesting as primary method for uncovering exploitable gaps, averaging $187,000 annual spend. Practitioner methodologies evolved: EPAM published comprehensive guide documenting shift toward self-hosted local models due to third-party AI data risks, addressing key deployment constraint. Critical assessments remained balanced: A16z analysis highlighted Unpatched AI autonomous tool discovering 100+ Microsoft vulnerabilities while questioning whether current platforms adequately address cloud-native environments. Human-in-the-loop architecture solidified as standard; full autonomy remained unfeasible.
  • 2025-Q3: Government adoption accelerated with NSA/Horizon3.ai deploying NodeZero to 200 defense contractors, conducting 20,000+ pentesting hours and identifying 50,000 vulnerabilities (70% mitigated); single test breached file share with 3M+ sensitive nuclear files in 5 minutes, demonstrating both capability and false-positive hazard. Commercial adoption solidified: Pentera reached 1200+ enterprise customers; vendor ecosystem articulated vision for natural language-driven and agentic testing. Critical assessments reinforced limitations: Outpost24 documented AI's role in triage/validation/reporting versus human-essential functions (threat modeling, creative design, ethics); autonomous agents remained far from end-to-end pentesting. Human-in-the-loop model confirmed as industry standard; full automation remained unrealistic.
  • 2025-Q4: Peer-reviewed research presented landmark evidence: ARTEMIS multi-agent framework outperformed 9 of 10 human professionals in live enterprise pentesting with 82% valid vulnerability discovery rate, demonstrating human-competitive capabilities in controlled environments. Vendor ecosystem matured toward multi-cloud and orchestration: RidgeBot achieved GA on AWS and Azure Marketplaces; Aikido Attack launched autonomous pentesting with AI-driven remediation. Analyst validation strengthened: 525-600% ROI documented for Pentera across 1,000+ enterprise customers. Skepticism persisted alongside hype: vendor critical assessments argued current tools function as "expensive vulnerability scanning" rather than true pentesting; false positives and automation bias remained deployment barriers. Named customer adoption broadened: Sycuan Casino Resort deployed Pentera in regulated hospitality sector. Full autonomy achieved only in benchmarks; real-world complexity confirmed human-in-the-loop as structural requirement.
  • 2026-Jan: Venture capital momentum accelerated: Novee Series B launch ($51.5M) introduced new AI pentesting platform claiming 55% advantage over frontier LLMs on web exploitation; Google Cloud AI Agent Trends showed 52% of execs have agents in production with 46% adoption in security operations, but education sector lagged at 6% red-teaming adoption. Pentera expanded geographically into Asia-Pacific (938+ customers reported in Japan). Industry maturation reflected in continuous testing shift: PlexTrac adoption by Fortune 500 companies (Expedia, Mandiant, Deloitte, KPMG) signaled platform ecosystem consolidation. Deployment barriers (false positives, data sensitivity, on-premise-only constraints) remained unchanged, confirming human-in-the-loop as persistent structural requirement.
  • 2026-Feb: Research advanced technical foundations: systematic literature review of 28 LLM-based pentesting systems introduced Task Difficulty Assessment (TDA) mechanism to distinguish capability gaps (Type A) from complexity barriers (Type B), signaling maturation toward architectural solutions beyond simple prompt engineering. Practitioner safety thinking crystallized around six concrete requirements for autonomous agents (ownership validation, network-level scoping, isolation, validation, observability, data residency), highlighting operational barriers to unrestricted deployment. Vendor discourse shifted toward evidence-driven workflows and scoping discipline—distinguishing academic breakthroughs from production-ready commercial tools. Deployment constraints persisted: false positives, data sensitivity, on-premise-only architecture remained structural requirements for human-in-the-loop model.
  • 2026-Mar/Apr: Autonomous capabilities reached production scale: XBOW published results from 1,060 fully autonomous vulnerabilities (#1 HackerOne leaderboard), 48-step exploit chains, cryptographic breaks in 17.5 minutes; Wiz Research + Irregular empirical study documented AI solving 9/10 real-world-inspired CTF challenges at <$12K cost with strong pattern recognition but limitations in enumeration; AWS Security Agent achieved GA (31 March) with multi-step agentic attacks priced at $50/task-hour, with named customers (LG CNS, HENNGE, Wayspring) reporting 50%+ faster testing and ~30% cost savings. Independent practitioner validations emerged: Anthropic-Claude red team discovered 11 high-severity Firefox vulnerabilities; Shannon AI pentester demonstrated 72% accuracy on SaaS production with honest assessment of business logic gaps; CREST-certified practitioners noted compliance implications and quality gaps relative to manual testing. Market adoption accelerated: SANS survey shows 67% of red team operators now use AI tools (up from 18% in 2023), a 3.7x adoption increase; Synack + Omdia survey of 200 U.S. security leaders showed 87% actively planning/using agentic AI, 95% expect displacement of traditional services, 93% emphasize guardrails needed, but only 32% of attack surfaces are currently tested. OWASP published structured Q2 2026 AI and agentic red teaming landscape framework; Pentera earned Frost Radar Leader designation for automated security validation; analysis of 39+ open-source AI pentesting agents revealed a critical lab-to-real gap (GPT-4 exploits 87% of one-day CVEs but only 13% of real CVEs), with ARTEMIS and XBOW named as top performers. Deployment barriers and human-in-the-loop architecture remain unchanged as structural requirements.
  • 2026-Late Apr: Major vendor ecosystem maturity signals confirmed: Microsoft released AI Red Teaming Agent as GA in Azure Foundry with NIST-aligned governance framework integrating PyRIT; CERT-EU documented internal deployment of AI-powered pentesting pipeline with concrete output (exploitation timeline now negative seven days). Independent research (Escape.tech, April 30) benchmarked five AI pentesting tools, finding tool orchestration matters more than model choice—detection rates 1–9 vulnerabilities on identical test app. Attacker operationalization confirmed: Google Threat Intelligence documented APT31 operational use of AI-driven vulnerability discovery (HexStrike with Gemini, February 2026). Critical maturity gap identified: LangWatch documented that existing automated red teaming tools (PyRIT, PAIR, TAP, Crescendo) fail in production due to shallow multi-turn attack simulation—0% vulnerability detection in banking agent testing vs 50-turn approaches revealing system prompt leakage and auth flaws. Infrastructure focus emphasized: Strobes and Hadrian research confirms system architecture (tool execution, context management, validation guardrails) is 80% of pentesting effectiveness; orchestration matters more than model capability. Market consolidation: Pentera at $100M ARR with $250M total capital; autonomous offensive security testing market forecast $2.1B (2025) → $15.8B (2034) at 27% CAGR. Human-in-the-loop architecture remains structural requirement despite increasing vendor capability; framework for integrating AI-augmented tools into continuous validation workflows emphasizes evidence-driven scoping over full autonomy.
  • 2026-May: Infrastructure and orchestration emerge as the dominant deployment theme: IBM X-Force Red's X-Frame introduces human-in-the-loop AI-augmented adversary simulation to match AI-enabled threats, while Strobes engineering confirms system architecture accounts for 80% of AI pentesting effectiveness versus 20% for model choice. LangWatch Scenario open-source framework ships CI/CD-integrated red teaming addressing the production gap where existing tools (PyRIT, PAIR, TAP) returned 0% detection rates on real banking agents. AWS Security Agent extends beyond task-level testing to full repository code review (GA May 2026), enabling context-aware analysis of entire codebases for systemic design-phase vulnerabilities, and launches automated verification script generation (May 22, 2026) to streamline remediation validation. Capability limits quantified: EPAM hands-on evaluation shows AWS Security Agent detecting only 35-38% of known vulnerabilities on realistic targets (Shannon 17-33%), with three primary gaps in custom logic understanding, multi-step exploit execution, and real-world error handling; frontier LLMs show 10-50% false positive rates on vulnerability detection and only 4-8% coverage on black-box testing (Agentic Security Newsletter analysis, May 2026), keeping autonomous claims well below production thresholds. Architectural differentiation proven: Ken Huang benchmark isolates system design—RidgeGen achieves 0% hallucination versus Shannon's 63% unconfirmed findings on identical Juice Shop target using identical LLM backend, proving orchestration matters more than model. 
    Large-scale deployments confirm capability-at-scale: Anthropic's Project Glasswing (Mythos Preview) across ~50 named organizations scanning 1,000+ projects discovered 10,000+ high/critical vulnerabilities with 90.6% independent validation accuracy, shifting narrative from discovery bottleneck to remediation bottleneck; Doyensec's independent side-by-side comparison of Aikido Attack and XBOW establishes a gold-standard validation methodology for AI pentesting maturity. Real-world autonomous pentesting demonstrated: Secure.com autonomous pipeline discovered 21 vulnerabilities (7 critical) across 3 live production environments in one weekend at $18/hour continuous cost. Standards maturation: OWASP publishes Autonomous Penetration Testing Standard (APTS) v0.1.0 with 173 requirements across 8 governance domains and four autonomy levels, codifying transition from research to operational deployment requiring formal assurance frameworks. Governance barriers surface: CyberCX analysis of 7,500+ pentesting engagements shows AI systems deployed with 2× the severe vulnerability rate of web applications, revealing governance-pace misalignment driving adoption risk; only 38% of AI-discovered vulnerabilities achieve remediation (vs broader metrics), creating structural bottleneck despite detection capability at scale. Academic advancement: APT-Agent (University of Queensland, CSIRO Data61) achieves 84.29% end-to-end exploitation success on Metasploitable 2 by addressing hallucination and context memory, advancing technical foundations. Market scale confirmed at 27% CAGR with autonomous offensive security testing market projected to reach $15.8B by 2034; attacker AI adoption accelerates with 70+ open-source offensive AI tools now catalogued (versus fewer than 5 pre-2023) and APT31 confirmed using AI-driven vulnerability discovery operationally.
  • 2026-Jun: Enterprise adoption reaches a new production scale: YesWeHack Agentic Pentest GA launches with named enterprise customers (Dassault Systèmes, Sanofi, multiple CAC 40 companies) achieving same-day autonomous testing across web, mobile, and APIs. Empirical capability mapping advances with a peer-reviewed two-stage framework isolating exploitation success (90% with ground-truth reconnaissance context) from autonomous reconnaissance (50%), confirming telemetry parsing and tool-output interpretation as the primary bottlenecks to end-to-end autonomy; a 19-LLM benchmark (300 diverse servers) shows frontier models (Gemini 3 Pro, Claude Opus 4.5) at ~70% autonomous exploitation success with detailed failure-mode analysis. FireCompass Fortune 500 deployment documents 11x cost reduction ($5K→<$1K per app), 2-week-to-1-day lead time compression, and coverage expansion from 10%→99%. Critical remediation deficit quantified: Cobalt PTaaS data shows AI/LLM vulnerability resolution at 38.4% versus 77.3% for traditional vulnerabilities—a 2:1 deficit confirming detection capability now outpaces organizational remediation capacity, and CSA governance framework formalizes six technical requirements (ownership validation, scoping, isolation, validation, observability, data residency) as production prerequisites for agentic deployment. Narrative solidifies: scope clarity (attack-path validation versus control-effectiveness validation) and human-in-the-loop orchestration are validated as the non-negotiable production standard.

TOOLS