Penetration testing assistance


MATURITY

Good Practice (on a scale from Bleeding Edge to Established)

TRAJECTORY

↑ Advancing

AI that assists penetration testers by suggesting attack vectors, automating reconnaissance, and identifying exploitation paths. Includes AI-guided vulnerability exploitation and attack chain planning; distinct from vulnerability scanning which identifies weaknesses without attempting exploitation.

OVERVIEW

AI-assisted penetration testing has crossed from experimental tooling into a proven practice with mainstream adoption: 87% of security leaders are actively planning, piloting, or using agentic AI for pentesting, and 95% expect it to displace traditional services. The category, distinct from vulnerability scanning in that it actively plans and executes exploitation paths, now supports human testers with automated reconnaissance, attack vector suggestion, and multi-step attack chain discovery across established platforms. Pentera serves 938+ enterprise customers; AWS launched Security Agent at GA (31 March 2026) with context-aware multi-step agentic attacks; and XBOW demonstrated end-to-end autonomous pentesting at production scale (1,060 real vulnerabilities on HackerOne, 48-step exploit chains). Empirical research confirms both promise and limitations: Wiz Research documented AI reliably solving 9 of 10 realistic challenges with strong multi-step reasoning, while also showing blind spots in creative enumeration and business-logic exploitation. The defining tension is no longer whether AI adds value but where the autonomy boundary lies. Full end-to-end automation remains infeasible: false positive rates, business-logic complexity, and data sensitivity keep human-in-the-loop architectures as the production standard. The question facing security teams is not adoption but orchestration: integrating AI pentesting into continuous validation workflows without overestimating what automation can deliver.

CURRENT LANDSCAPE

The vendor ecosystem has consolidated around a handful of platforms with real production footprints. Pentera, now a Gartner Representative Vendor for Adversarial Exposure Validation, reports 938+ customers, geographic expansion into Asia-Pacific, and analyst-documented ROI of 525-600%. RidgeBot has reached GA on both the AWS and Azure Marketplaces. AWS Security Agent achieved GA (31 March 2026) at $50/task-hour with multicloud support, and expanded in May 2026 to include full repository code review. Newer entrants such as Synack's Sara and Aikido Attack ship autonomous pentesting with human-expert validation or integrated AI-driven remediation. Venture funding continues: Novee emerged from stealth with $51.5M in Series B capital. Independent practitioners report production deployments with mixed signals: XBOW reached #1 on the HackerOne leaderboard with 1,060 vulnerabilities found fully autonomously; the Shannon AI pentester was validated on a production SaaS application with 72% accuracy and a candid assessment of its business-logic gaps; and a red-team effort using Anthropic's Claude discovered 11 high-severity vulnerabilities in Mozilla Firefox.

Government and enterprise adoption confirms operational maturity. The NSA's rollout of NodeZero across 200 defense contractors yielded 20,000+ pentesting hours and 50,000 identified vulnerabilities (70% mitigated). Fortune 500 firms including Expedia, Mandiant, Deloitte, and KPMG have adopted platforms for continuous testing workflows. A Synack and Omdia survey of 200 U.S. security leaders found that 87% are actively planning or piloting agentic AI, 95% expect it to displace traditional services, and 93% say comprehensive guardrails are critical.

Deployment barriers remain structural despite adoption momentum. EPAM's May 2026 hands-on evaluation of six commercial AI pentesting agents revealed a critical capability gap: AWS Security Agent, the best performer, found only 35-38% of known vulnerabilities on realistic targets, while Shannon found 17-33%. Practitioners consistently characterize current outputs as "expensive vulnerability scanning" rather than true pentesting. Architecture has emerged as the primary differentiator: Ken Huang's benchmark showed RidgeGen at 0% hallucination versus 63% unconfirmed findings for Shannon on identical applications, indicating that system design (belief state, validation pipelines, orchestration) matters more than underlying model capability. Practitioners have codified six safety requirements (ownership validation, network-level scoping, isolation, validation, observability, and data residency) that current tools do not fully satisfy. A review of 28 LLM-based systems confirms a split between transient capability gaps and deeper complexity barriers around business logic and context that automation cannot yet overcome. A remediation gap compounds the problem: only 21.1% of serious AI/LLM pentest findings are resolved (versus 73.5% for web), indicating that detection at scale now outpaces organizations' ability to fix AI-specific vulnerabilities. Human-in-the-loop architecture is not a transitional compromise; it is the operational standard, formalized by OWASP's May 2026 Autonomous Penetration Testing Standard (APTS), which codifies four autonomy levels with explicit human-oversight and governance requirements for production deployment.
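Two of the six safety requirements named above, ownership validation and network-level scoping, can be expressed as a small pre-flight check, with a human sign-off gating exploitation. The sketch below is illustrative only: `Engagement`, `authorize_target`, and the `human_approved` flag are hypothetical names chosen for this example and do not come from OWASP APTS or from any tool mentioned here.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class Engagement:
    """Hypothetical record of what a test is allowed to touch."""
    owner_verified: bool  # ownership validation was completed out of band
    allowed_networks: list[str] = field(default_factory=list)  # CIDR ranges in scope
    human_approved: bool = False  # a human signed off on exploitation


def authorize_target(engagement: Engagement, target_ip: str, exploit: bool) -> bool:
    """Return True only if the requested action stays inside the agreed scope."""
    if not engagement.owner_verified:
        return False
    addr = ip_address(target_ip)
    in_scope = any(addr in ip_network(cidr) for cidr in engagement.allowed_networks)
    if not in_scope:
        return False
    # In-scope reconnaissance may proceed; exploitation also needs human sign-off.
    return engagement.human_approved or not exploit
```

The check is deliberately deny-by-default: an unverified owner or an out-of-scope address fails regardless of any approval flag, which is one way to keep the human-in-the-loop decision from being bypassed by the agent itself.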

TIER HISTORY

Research: Jan-2023 → Jan-2023

Bleeding Edge: Jan-2023 → Apr-2025

Leading Edge: Apr-2025 → Jul-2025

Good Practice: Jul-2025 → present

EVIDENCE (95)

Agentic AI (Sara) | Product Launches | 2026-05-14

— Synack launched Sara autonomous red agent for continuous vulnerability discovery with human expert validation reducing false positives and confirming exploitability; combines AI-driven reconnaissance with 1,500+ security researcher validation layer.

AWS Security Agent now supports full repository code reviews | Product Launches | 2026-05-12

— AWS Security Agent added full repository code review capability, performing context-aware analysis of entire codebases and surfacing systemic vulnerabilities beyond pattern-matching scope; GA release extends autonomous pentesting to design-phase validation.

Manual vs Automated Penetration Testing: Which is Better? | Opinion | 2026-05-12

— Netragard critical assessment: AI pentesting relies on pre-existing tools and data and cannot think, adapt, or create novel attack paths; distinguishes human novelty discovery and contextualized threat intelligence from automated tool-based approaches.

OWASP APTS Marks a Turning Point for Autonomous Pentesting | Industry Reports | 2026-05-08

— OWASP Autonomous Penetration Testing Standard (APTS) v0.1.0 published May 2026; 173 requirements across 8 governance domains with four autonomy levels; marks transition from research to operational deployment requiring formal assurance frameworks.

120+ Penetration Testing Statistics for 2026 | Adoption Metrics | 2026-05-07

— Critical maturity signal: only 21.1% of serious AI/LLM pentest findings are resolved (vs 73.5% web, 75.5% API); global market $2.74B (2025)→$7.41B (2034) at 11.60% CAGR; 70%+ adoption of PTaaS; shows strong adoption but remediation gap for AI-specific findings.

Automated Penetration Testing: Are AI Agents Ready? | Research Papers | 2026-05-06

— EPAM hands-on evaluation of six AI pentesting agents against realistic targets: AWS Security Agent found 35-38%, Shannon 17-33%, others found fewer; identifies three primary capability gaps (custom logic, multi-step exploits, real-world inconsistencies) contradicting vendor hype with concrete evidence.

Why Your Agentic AI Pentester Is Probably Just a Fancy Scanner | Opinion | 2026-05-04

— Ken Huang benchmark isolates architecture as differentiator: RidgeGen 0% hallucination rate vs Shannon 63% unconfirmed findings on identical Juice Shop target; system design (belief state, verification, orchestration) drives performance gap more than underlying model.

Benchmarking AI Pentesting Tools: A Practical Comparison | Research Papers | 2026-04-30

— Independent benchmark of five AI pentesting tools (Escape, Claude, Shannon, Strix, PentAGI) against 20-vulnerability web app; detection rates 1–9 vulnerabilities, shows tool orchestration matters more than model choice.
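The benchmark figures in this list are recall-style detection rates: vulnerabilities found divided by vulnerabilities known to exist in the target. A minimal sketch of that arithmetic follows, using the 20-vulnerability target and the 1 to 9 detection range from the entry above. The function name `detection_rate` is an illustrative choice, not something defined by any of the cited benchmarks.

```python
def detection_rate(found: int, known: int) -> float:
    """Fraction of known vulnerabilities that a tool detected."""
    if known <= 0:
        raise ValueError("known must be positive")
    if not 0 <= found <= known:
        raise ValueError("found must be between 0 and known")
    return found / known


# Range reported for the 20-vulnerability web app: 1 to 9 detections.
low = detection_rate(1, 20)
high = detection_rate(9, 20)
print(f"{low:.0%} to {high:.0%}")  # prints "5% to 45%"
```

Expressed this way, the 1 to 9 range on a 20-vulnerability target sits in the same band as the 17-33% and 35-38% figures from the EPAM evaluation, although the targets differ and the numbers are not directly comparable.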

HISTORY

2023-H1: Initial research prototypes (PentestGPT, ChatGPT-based studies) and early commercial offerings (RidgeBot, vPenTest) emerged; academic and vendor exploration alongside practitioner critique of limitations; LLMs showed promise for vulnerability identification (20/28 in academic testing) but struggled with context persistence and data confidentiality; analyst consensus positioned automated pen testing as supplementary to manual testing rather than replacement.
2023-H2: Research-backed systems (AutoPT, PentestGPT peer-reviewed publication) demonstrated quantified improvements (228.6% completion gains, 41% benchmarks); Pentera scaling to 800+ customers and $1B valuation; RidgeBot GA on Azure Marketplace. False positive burden documented (81% of IT pros report >20% cloud false positives). Deployment remained on-premise-focused due to data sensitivity and provider constraints. Automated tools confirmed as augmentation, not replacement.
2024-Q1: Market expanded with new LLM-based tools (PentestAI, ZeroThreat) and comparative studies (GPT-4o vs GPT-4 Turbo on real-world exploitation). RidgeBot showed active production deployment against real vulnerabilities (Ivanti CVEs). New entrants claimed significant performance gains (98% accuracy, 10x speedup) but fundamental constraints persisted: on-premise-only deployment, false positive burden, and consensus that human expertise remains essential for complex attack chains.
2024-Q2: Empirical research (AutoPenBench, June 2024) quantified limits of autonomous agents—21% success on simple tasks, 1/33 real-world—validating human-in-the-loop architectures (64% success). Peer-reviewed studies tested full pentesting workflows with mixed risk/benefit signals. Vendor ecosystem expanded geographically (RidgeBot in Japan). Critical assessments from established firms (NCC Group) reinforced that AI augments but cannot replace human judgment. Architecture and deployment constraints remained unchanged.
2024-Q3: Vendor ecosystem matured with product integrations (RidgeBot 4.3.3 with Tenable/Rapid7, Bugcrowd CASPT launch). Market validation continued: MarketsandMarkets projected PTaaS market growth to $301M by 2029 (20.5% CAGR) with AI/ML as key driver. Community-driven benchmarking efforts (AI-Pentest-Benchmark) provided open-source evaluation tools. Critical assessments documented persistent limitations: GPT-4 success rates at 42.7% on web vulnerabilities. Consensus held: AI augments pentesting but human expertise essential for complex attack planning and contextual judgment.
2024-Q4: USENIX Security 2024 published peer-reviewed PentestGPT paper demonstrating 228.6% task-completion gains and real-world effectiveness with 6,500+ GitHub stars confirming community adoption. Ethical hacker adoption surged: Bugcrowd survey of 1,300 practitioners showed 77% AI integration and 71% perceive value increase (vs 21% in 2023). RidgeBot 5.0 GA introduced Web API testing capabilities, expanding vendor ecosystem. However, organizational integration gaps persisted: ISACA survey found only 35% of cybersecurity teams involved in enterprise AI implementation, and benchmark research (Drexel/arxiv) confirmed both GPT-4o and Llama 3.1 fall short of autonomous end-to-end pentesting. Market maturation evident but human-in-the-loop architecture remained dominant constraint.
2025-Q1: Analyst recognition accelerated: Pentera achieved Gartner Representative Vendor status in 2025 Adversarial Exposure Validation (AEV) market guide, signaling mature analyst coverage. Vendor ecosystem continued product evolution: RidgeBot 5.2 launched RidgeGen, a specially trained GenAI module for enhanced validation. Real-world deployment metrics emerged from production environments: Penligent.ai documented 2.8-day MTTR (vs 7-day industry average) with sub-3% false positive rates in CI/CD pipelines. Gartner predicted 60% organizational adoption of automated pentesting tools by 2025, yet Horizon3 survey of 50,000+ real penetration tests revealed persistent barriers: 36% of CISOs delay patching due to inability to distinguish exploitable vulnerabilities; 41% report pentest report unreliability. Ethical hacker adoption remained high (77% using AI tools) but skepticism persisted: only 22% believe AI outperforms humans, 30% doubt AI replicates human creativity. Architecture remained human-in-the-loop; full automation remained unachieved.
2025-Q2: Vendor ecosystem matured toward scale and orchestration: RidgeSphere GA enabled centralized management of hundreds of RidgeBot deployments for MSSPs, while Pentera 7 GA introduced distributed attack orchestration across remote sites with AI-based pattern identification for recurring weaknesses. Research advanced: PentestGPT v2 achieved 91% task completion on CTF benchmarks and 4/5 host compromise on GOAD Active Directory (39-49% relative improvement) through Tool and Skill Layer with 38 typed security tools. Enterprise adoption metrics strengthened: Pentera survey of 500 CISOs showed 50%+ now use software-based pentesting as primary method for uncovering exploitable gaps, averaging $187,000 annual spend. Practitioner methodologies evolved: EPAM published comprehensive guide documenting shift toward self-hosted local models due to third-party AI data risks, addressing key deployment constraint. Critical assessments remained balanced: A16z analysis highlighted Unpatched AI autonomous tool discovering 100+ Microsoft vulnerabilities while questioning whether current platforms adequately address cloud-native environments. Human-in-the-loop architecture solidified as standard; full autonomy remained unfeasible.
2025-Q3: Government adoption accelerated with NSA/Horizon3.ai deploying NodeZero to 200 defense contractors, conducting 20,000+ pentesting hours and identifying 50,000 vulnerabilities (70% mitigated); single test breached file share with 3M+ sensitive nuclear files in 5 minutes, demonstrating both capability and false-positive hazard. Commercial adoption solidified: Pentera reached 1200+ enterprise customers; vendor ecosystem articulated vision for natural language-driven and agentic testing. Critical assessments reinforced limitations: Outpost24 documented AI's role in triage/validation/reporting versus human-essential functions (threat modeling, creative design, ethics); autonomous agents remained far from end-to-end pentesting. Human-in-the-loop model confirmed as industry standard; full automation remained unrealistic.
2025-Q4: Peer-reviewed research presented landmark evidence: ARTEMIS multi-agent framework outperformed 9 of 10 human professionals in live enterprise pentesting with 82% valid vulnerability discovery rate, demonstrating human-competitive capabilities in controlled environments. Vendor ecosystem matured toward multi-cloud and orchestration: RidgeBot achieved GA on AWS and Azure Marketplaces; Aikido Attack launched autonomous pentesting with AI-driven remediation. Analyst validation strengthened: 525-600% ROI documented for Pentera across 1,000+ enterprise customers. Skepticism persisted alongside hype: vendor critical assessments argued current tools function as "expensive vulnerability scanning" rather than true pentesting; false positives and automation bias remained deployment barriers. Named customer adoption broadened: Sycuan Casino Resort deployed Pentera in regulated hospitality sector. Full autonomy achieved only in benchmarks; real-world complexity confirmed human-in-the-loop as structural requirement.
2026-Jan: Venture capital momentum accelerated: Novee Series B launch ($51.5M) introduced new AI pentesting platform claiming 55% advantage over frontier LLMs on web exploitation; Google Cloud AI Agent Trends showed 52% of execs have agents in production with 46% adoption in security operations, but education sector lagged at 6% red-teaming adoption. Pentera expanded geographically into Asia-Pacific (938+ customers reported in Japan). Industry maturation reflected in continuous testing shift: PlexTrac adoption by Fortune 500 companies (Expedia, Mandiant, Deloitte, KPMG) signaled platform ecosystem consolidation. Deployment barriers (false positives, data sensitivity, on-premise-only constraints) remained unchanged, confirming human-in-the-loop as persistent structural requirement.
2026-Feb: Research advanced technical foundations: systematic literature review of 28 LLM-based pentesting systems introduced Task Difficulty Assessment (TDA) mechanism to distinguish capability gaps (Type A) from complexity barriers (Type B), signaling maturation toward architectural solutions beyond simple prompt engineering. Practitioner safety thinking crystallized around six concrete requirements for autonomous agents (ownership validation, network-level scoping, isolation, validation, observability, data residency), highlighting operational barriers to unrestricted deployment. Vendor discourse shifted toward evidence-driven workflows and scoping discipline—distinguishing academic breakthroughs from production-ready commercial tools. Deployment constraints persisted: false positives, data sensitivity, on-premise-only architecture remained structural requirements for human-in-the-loop model.
2026-Mar/Apr: Autonomous capabilities reached production scale: XBOW published results from 1,060 fully autonomous vulnerabilities (#1 HackerOne leaderboard), 48-step exploit chains, cryptographic breaks in 17.5 minutes; Wiz Research + Irregular empirical study documented AI solving 9/10 real-world-inspired CTF challenges at <$12K cost with strong pattern recognition but limitations in enumeration; AWS Security Agent achieved GA (31 March) with multi-step agentic attacks priced at $50/task-hour, with named customers (LG CNS, HENNGE, Wayspring) reporting 50%+ faster testing and ~30% cost savings. Independent practitioner validations emerged: Anthropic-Claude red team discovered 11 high-severity Firefox vulnerabilities; Shannon AI pentester demonstrated 72% accuracy on SaaS production with honest assessment of business logic gaps; CREST-certified practitioners noted compliance implications and quality gaps relative to manual testing. Market adoption accelerated: SANS survey shows 67% of red team operators now use AI tools (up from 18% in 2023), a 3.7x adoption increase; Synack + Omdia survey of 200 U.S. security leaders showed 87% actively planning/using agentic AI, 95% expect displacement of traditional services, 93% emphasize guardrails needed, but only 32% of attack surfaces are currently tested. OWASP published structured Q2 2026 AI and agentic red teaming landscape framework; Pentera earned Frost Radar Leader designation for automated security validation; analysis of 39+ open-source AI pentesting agents revealed a critical lab-to-real gap (GPT-4 exploits 87% of one-day CVEs but only 13% of real CVEs), with ARTEMIS and XBOW named as top performers. Deployment barriers and human-in-the-loop architecture remain unchanged as structural requirements.
2026-Late Apr: Major vendor ecosystem maturity signals confirmed: Microsoft released AI Red Teaming Agent as GA in Azure Foundry with NIST-aligned governance framework integrating PyRIT; CERT-EU documented internal deployment of AI-powered pentesting pipeline with concrete output (exploitation timeline now negative seven days). Independent research (Escape.tech, April 30) benchmarked five AI pentesting tools, finding tool orchestration matters more than model choice—detection rates 1–9 vulnerabilities on identical test app. Attacker operationalization confirmed: Google Threat Intelligence documented APT31 operational use of AI-driven vulnerability discovery (HexStrike with Gemini, February 2026). Critical maturity gap identified: LangWatch documented that existing automated red teaming tools (PyRIT, PAIR, TAP, Crescendo) fail in production due to shallow multi-turn attack simulation—0% vulnerability detection in banking agent testing vs 50-turn approaches revealing system prompt leakage and auth flaws. Infrastructure focus emphasized: Strobes and Hadrian research confirms system architecture (tool execution, context management, validation guardrails) is 80% of pentesting effectiveness; orchestration matters more than model capability. Market consolidation: Pentera at $100M ARR with $250M total capital; autonomous offensive security testing market forecast $2.1B (2025) → $15.8B (2034) at 27% CAGR. Human-in-the-loop architecture remains structural requirement despite increasing vendor capability; framework for integrating AI-augmented tools into continuous validation workflows emphasizes evidence-driven scoping over full autonomy.
2026-May: Infrastructure and orchestration emerge as the dominant deployment theme: IBM X-Force Red's X-Frame introduces human-in-the-loop AI-augmented adversary simulation to match AI-enabled threats, while Strobes engineering confirms system architecture accounts for 80% of AI pentesting effectiveness versus 20% for model choice. The LangWatch Scenario open-source framework ships CI/CD-integrated red teaming, addressing the production gap where existing tools (PyRIT, PAIR, TAP) returned 0% detection rates on real banking agents. AWS Security Agent extends beyond task-level testing to full repository code review (GA May 2026), enabling context-aware analysis of entire codebases for systemic design-phase vulnerabilities. Capability limits quantified: EPAM hands-on evaluation shows AWS Security Agent detecting only 35-38% of known vulnerabilities on realistic targets (Shannon 17-33%), with three primary gaps in custom logic understanding, multi-step exploit execution, and real-world error handling. Architectural differentiation demonstrated: Ken Huang benchmark isolates system design, with RidgeGen achieving 0% hallucination versus Shannon's 63% unconfirmed findings on an identical Juice Shop target using an identical LLM backend, indicating orchestration matters more than model. Standards maturation: OWASP publishes Autonomous Penetration Testing Standard (APTS) v0.1.0 with 173 requirements across 8 governance domains and four autonomy levels, codifying the transition from research to operational deployment requiring formal assurance frameworks; Netragard's critical practitioner assessment reinforces that AI tools remain unable to think, adapt, or discover novel attack paths, distinguishing tool-based automation from human creativity. Synack launches Sara with a human-expert validation layer aimed at reducing false positives. Critical remediation signal: Bright Defense analysis shows only 21.1% of serious AI/LLM pentest findings are resolved (versus 73.5% for web), indicating detection outpaces remediation capacity. Market scale confirmed at 27% CAGR with autonomous offensive security testing market projected to reach $15.8B by 2034; attacker AI adoption accelerates with 70+ open-source offensive AI tools now catalogued (versus fewer than 5 pre-2023) and APT31 confirmed using AI-driven vulnerability discovery operationally.
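The market forecasts quoted in this history and in the evidence list are compound annual growth rates (CAGR). As a rough check on how such figures relate to their start and end values, the sketch below applies the standard formula, CAGR = (end / start)^(1 / years) - 1, to two pairs quoted in the text. The assumption of nine compounding periods (2025 to 2034) is mine; the source reports do not state their period convention.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures in USD billions as quoted in the text; nine periods is an assumption.
pentest_market = cagr(2.74, 7.41, 9)      # global pentesting market, 2025 to 2034
autonomous_market = cagr(2.1, 15.8, 9)    # autonomous offensive security testing

print(f"pentesting market: {pentest_market:.1%}")
print(f"autonomous testing market: {autonomous_market:.1%}")
```

Under the nine-period assumption the first pair works out to roughly 11.7%, close to the quoted 11.60%, and the second to roughly 25%, somewhat below the quoted 27%. The difference may simply reflect a different base year or period count in the original forecasts, so the quoted rates are best read alongside their endpoints rather than in isolation.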

TOOLS

AWS Security Agent, Microsoft PyRIT, Scenario (LangWatch), Pentera, Palo Alto Prisma AIRS