The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail
AI augmenting static and dynamic application security testing to identify vulnerabilities in code before deployment. Includes LLM-augmented SAST/DAST tools and AI-powered vulnerability explanation; distinct from general code review which focuses on quality rather than security.
AI-augmented security code review has crossed from experiment to real deployment at forward-leaning organisations, but the bottleneck has shifted from tooling to organisational capacity. LLM-enhanced SAST/DAST tools now layer contextual understanding and automated remediation on top of rule-based scanners, targeting vulnerability detection rather than general code quality. Empirical evidence from April 2026 validates the hybrid approach: Google's peer-reviewed study of LLM+SAST achieved 89.5% precision with 91% false-positive reduction on 25 production projects, and independent testing of 7 AI code review tools confirmed that context-aware tools (CodeRabbit, Greptile) catch 11-12 of 14 planted bugs versus 7 for context-blind approaches. The vendor ecosystem is genuinely mature: multiple GA products, analyst recognition, and documented remediation speedups measured in multiples, not percentages.
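The hybrid layering described above, a rule-based scanner for detection with an LLM pass for contextual triage, can be sketched in miniature. Everything here is illustrative: `Finding`, `triage`, and `fake_llm` are hypothetical names, and the stub heuristic stands in for a real model call that would see the snippet plus surrounding file context.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    rule_id: str      # e.g. a scanner rule like "python.sqli.taint"
    file: str
    line: int
    snippet: str      # code excerpt around the finding

def triage(findings: list[Finding],
           llm_is_exploitable: Callable[[Finding], float],
           threshold: float = 0.5) -> list[Finding]:
    """Second-pass triage: keep only findings the LLM judges likely
    exploitable in context. The scanner remains the source of truth
    for detection; the LLM only filters and re-ranks its output."""
    return [f for f in findings if llm_is_exploitable(f) >= threshold]

# Stand-in for a real model call; a trivial heuristic for the demo.
def fake_llm(f: Finding) -> float:
    return 0.9 if "execute(" in f.snippet else 0.1

findings = [
    Finding("sqli", "app.py", 42,
            'cur.execute("SELECT * FROM t WHERE id=" + uid)'),
    Finding("sqli", "tests/fixtures.py", 7,
            'query = "SELECT 1"  # constant, no tainted input'),
]
kept = triage(findings, fake_llm)
print([f.file for f in kept])  # → ['app.py']
```

The design point is that false-positive reduction comes from the second pass discarding findings that are syntactically suspicious but contextually inert, which is why the cited studies report large precision gains without changing the underlying scanner.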
The defining tension at the leading edge is a velocity mismatch tied to deployment reality. Real production deployments show AI-generated code carries 2.74x higher vulnerability rates than human-written code, with practitioners reporting 67% increases in code review time and 118% spikes in security findings despite 31% productivity gains. A practitioner case study documents a 6-hour Amazon production outage (6.3M lost orders) traced directly to inadequate security review of AI-generated code; incidents of this kind reportedly affect one in five organisations. Independent benchmarks confirm both sides of the paradox: hybrid SAST-LLM architectures cut false positives dramatically, yet neither AI nor rule-based tools catch every runtime-validated vulnerability. Organisations with dedicated AppSec programmes and tiered review governance (low-risk vs high-risk code paths) are extracting measurable value. Most lack the institutional capacity to absorb AI-accelerated code velocity, leaving tool maturity stranded ahead of organisational readiness.
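Tiered review governance of the kind these organisations run can be expressed as a small routing policy. This is a minimal sketch under assumptions: the path prefixes, tier names, and `review_policy` function are hypothetical, though the strictest tier deliberately mirrors the two-reviewer, senior-sign-off rules adopted after the incidents discussed here.

```python
from dataclasses import dataclass

# Hypothetical high-risk paths: code touching auth, payments, infra,
# or schema migrations gets the strictest review tier.
HIGH_RISK_PREFIXES = ("auth/", "payments/", "infra/", "migrations/")

@dataclass
class PullRequest:
    files: list[str]
    ai_generated: bool

def review_policy(pr: PullRequest) -> str:
    """Route a PR to a review tier based on risk and provenance."""
    high_risk = any(f.startswith(HIGH_RISK_PREFIXES) for f in pr.files)
    if high_risk and pr.ai_generated:
        # Mirrors post-incident resets: AI code on critical paths
        # needs two humans plus senior sign-off.
        return "two-human-reviewers + senior sign-off"
    if high_risk:
        return "one-human-reviewer + AI assist"
    return "AI-review + spot-check"

print(review_policy(PullRequest(["auth/token.py"], ai_generated=True)))
# → two-human-reviewers + senior sign-off
```

The value of the tiering is that it spends scarce human review hours only where provenance and blast radius demand them, which is what lets AppSec capacity keep pace with AI-accelerated code volume.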
The vendor ecosystem has consolidated around a handful of scaled players. Checkmarx One serves 865+ large enterprises with documented 50%+ vulnerability density reductions; GitHub Copilot Autofix reports 3x faster remediation overall, reaching 12x for SQL injection fixes, and has achieved roughly 80% adoption among new GitHub Advanced Security customers. Snyk, DeepSource, Anthropic, and Datadog have all shipped GA or preview security-review capabilities in early 2026, with Datadog releasing an open-source AI-native SAST tool in April, broadening the competitive field. Forrester's Q3 2025 SAST Wave recognised this maturity, rating Checkmarx 5/5 for AI-powered risk prioritisation.
Independent testing from April 2026 shows heterogeneous tool quality. Context-aware tools (CodeRabbit, Greptile) caught 11-12 of 14 planted bugs in an 80K-line production codebase; tools lacking full-codebase understanding caught 5-9. The hybrid SAST-LLM approach achieves 89.5% precision and 91% false-positive reduction on 25 real projects, but it remains the exception, and even the strongest tools have blind spots: every tested tool missed runtime-validated vulnerabilities, and practitioners documented nested validation failures in which code approved by Copilot, CodeRabbit, and Snyk together still contained XXE vulnerabilities. Tenzai's January 2026 audit found 69 vulnerabilities across five AI coding platforms, none flagged by conventional scanners. The AI tools are themselves attack surface: a CVSS 9.6 prompt-injection flaw in GitHub Copilot Chat enables data exfiltration via PR descriptions.
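One reason pattern-matching tools miss vulnerabilities that only runtime validation exposes can be shown with a toy. This is not any vendor's engine; it is a deliberately naive rule matcher, with hypothetical names throughout, that fires on a dangerous call written literally but goes blind the moment the unsafe value flows through a variable.

```python
import re

# Toy "pattern-matching SAST": flags the dangerous XML-parser setting
# only when the unsafe flag appears literally at the call site.
RULE = re.compile(r"setFeature\(\s*feature_external_ges\s*,\s*True\s*\)")

DIRECT = "parser.setFeature(feature_external_ges, True)"
INDIRECT = (
    "allow_external = True  # set far from the call site\n"
    "parser.setFeature(feature_external_ges, allow_external)"
)

def scan(code: str) -> bool:
    """Return True if the rule fires anywhere in the code."""
    return bool(RULE.search(code))

print(scan(DIRECT))    # → True: the literal pattern is caught
print(scan(INDIRECT))  # → False: same behaviour, hidden by one variable
```

Both snippets enable the same dangerous behaviour at runtime, but only one matches the rule; dataflow-aware analysis or runtime validation is needed to close that gap, which is the failure mode the benchmarks above keep documenting.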
Organisational absorptive capacity remains the binding constraint. Deployment data shows that organisations whose production codebases reached 46% AI-generated code saw 23.7% increases in vulnerabilities, concentrated in AI-written modules. Review velocity mismatches compound the problem: practitioners report 67% increases in review time per PR despite productivity gains, and security findings per review jumped 118% (a 2.74x vulnerability rate in AI code). OX Security's analysis of 300+ repositories surfaced 500,000+ security alerts, far exceeding AppSec team triage capacity. Developers review 200-400 lines per hour; AI-generated pull requests routinely exceed what that budget can absorb. The tooling works; governance and organisational capacity do not yet keep pace.
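The capacity arithmetic above is worth making explicit. A minimal sketch, using the 200-400 lines-per-hour figure from this section; the function names and the 10-hours-per-week budget are illustrative assumptions, not survey data.

```python
# Back-of-envelope capacity model: human review speed caps how much
# AI-generated code a team can actually absorb per week.

def review_hours(pr_lines: int, lines_per_hour: int = 300) -> float:
    """Reviewer-hours needed for one PR at mid-range review speed."""
    return pr_lines / lines_per_hour

def weekly_capacity_lines(reviewers: int, hours_each: float = 10,
                          lines_per_hour: int = 300) -> int:
    """Maximum lines a team can review per week (assumed budget)."""
    return int(reviewers * hours_each * lines_per_hour)

# A 1,500-line AI-generated PR costs ~5 reviewer-hours...
print(review_hours(1500))        # → 5.0
# ...and a 4-person team reviewing 10h/week absorbs at most 12,000 lines.
print(weekly_capacity_lines(4))  # → 12000
```

Against that ceiling, an assistant emitting tens of thousands of lines a week saturates the team regardless of tool quality, which is exactly the velocity mismatch this section describes.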
— Snyk integrates Claude into AI Security Platform for discovery, prioritization, and automated remediation; deployed to Glasswing orgs May 2026, broad rollout through 2026 confirms ecosystem maturity.
— Amazon 6-hour outage (6.3M lost orders, March 5 2026) traced to AI-generated code deployed without security review; triggered 90-day code safety reset across 335 critical systems requiring 2 reviewers and senior sign-off for AI code.
— Large-scale empirical study (100 B2B teams, 23,847 PRs, 12 months) finds AI-only approval increases defect escape 46%; hybrid (AI comments + mandatory human review) achieves best outcomes, validating need for human oversight.
— OX Security analysis shows 865K+ alerts/year (71-88% false positives), engineers spend 6.1h/week on findings (~$20k/dev/year wasted), 22% of teams disabled tools due to alert fatigue.
— VibeEval security assessment of 1,514 live AI-generated applications shows 81% contain critical/high vulnerabilities, demonstrating systemic security gaps that security-focused code review must address.
— Enterprise survey (500 engineers/leaders, March 2026) documents 89% experienced AI code incidents, 25% suffered complete outages, 79% adopted automated security gates; validates scale of deployment and adoption need.
— 12-month production benchmark (47 repos, 12.4M LOC) shows human reviewers caught 41% more critical security bugs (17.2 per 1000 LOC vs 12.2) with 0% false positives vs 12% for AI toolchain.
— Developer survey (2,847 respondents) shows review now exceeds writing (11.4h/week reviewing vs 9.8h/week writing), establishing security verification bottleneck as operational reality at scale.
2023-H1: AI-augmented SAST tools (DeepCode AI, AI-CodeWise) launched and gained adoption; concurrent research documented vulnerabilities in AI-generated code, establishing the maturity paradox: tools work but users must understand limitations.
2023-H2: Real-world deployment challenges surfaced: practitioner research revealed critical blind spots in SAST tools (false negatives underestimated), empirical studies confirmed 29.5% of Copilot Python code and 24.2% of JavaScript contained security weaknesses, and adoption surveys showed 40% of organizations avoid SAST tools due to false positives. Vendor perspectives shifted toward risk-based approaches as legacy tool limitations became undeniable.
2024-Q1: GitHub launches Code Scanning Autofix (public beta), Snyk releases DeepCode AI hybrid system targeting AI-generated code. Vendor consolidation tightens (Snyk/GitHub dominate). Critical adoption gap emerges: 75% of developers falsely believe AI-generated code is more secure than human-written, yet 80% bypass security policies to use AI tools, suggesting tools create compliance friction rather than security confidence.
2024-Q2: Production autofix tools enter GA (Snyk Agent Fix, GitHub Autofix); independent research confirms vulnerability persistence: 62% of C code vulnerable across 9 LLMs, 7,703 real-world GitHub files show 4,241 CWE instances. Enterprise adoption stalls: only 20% ran POCs, 58% cite security as barrier, AppSec teams 5x more risk-aware than developers. Maturity asymmetry confirmed: tools advance technically but organizational readiness and institutional confidence remain weak.
2024-Q3: GitHub expands Copilot Autofix free to all public repositories; Snyk Code confirmed as market leader in developer surveys; Checkmarx announces AI-specific IDE security tools. Yet institutional skepticism deepens: Checkmarx survey finds 80% of AppSec managers concerned AI introduces more threats than it fixes. Vendors document the paradox: autofix tooling advances while organizational risk tolerance lags, creating a net-negative security posture despite feature maturity.
2024-Q4: Snyk Code reaches $100M ARR with 3,100+ customers; IDE integration matures (Snyk DeepCode AI Fix, GitHub REST APIs for Autofix). Yet institutional confidence regresses: tool adoption falls 11.3% YoY, training investment down 17.8%. Independent analysis debunks vendor claims: GitHub quality study tested only simple CRUD tasks; developers using AI tools become 19% slower due to verification costs. The bleeding-edge plateau is confirmed—mature feature parity but fragile organizational readiness.
2025-Q2: GitHub GA of security campaigns with Copilot Autofix (April 2025) achieves 10% to 55% remediation rate improvement; Snyk reports 245% QoQ DAST ARR growth post-Probely acquisition. Yet independent analysis (RedMonk, Ghost Security) documents persistent skepticism: AI code review tools lack project context and produce 91%+ false positives in traditional SAST, raising questions about whether tool sophistication translates to real security gains. Vendor feature maturity continues but institutional confidence remains qualified by unresolved effectiveness questions.
2025-Q3: Ecosystem consolidation continues: Snyk maintains leadership (governance enhancements September 2025), GitHub expands Copilot code review to Xcode, Forrester recognizes AI-native SAST maturity. Yet independent evidence documents persistent vulnerabilities: Veracode confirms 45% of AI-generated code contains flaws (Java 71%), Stack Overflow survey shows developer trust collapsed to 29% despite 80% adoption, production incidents reveal hardcoded secrets and compliance gaps. Organizational adoption barriers persist: only 20% conduct POCs, security fears cited by 58%, developer-AppSec misalignment deepens. Feature maturity and adoption skepticism coexist at Q3 2025.
2025-Q4: Vendor consolidation crystallizes: Checkmarx One scales to 865+ enterprises ($150M ARR), GitHub Copilot Autofix reaches ~80% newcomer adoption with 3-12x remediation speedups across vulnerability types. Forrester Wave Q3 2025 recognizes AI SAST maturity; analyst consensus confirms vendor leadership. Yet independent validation reveals dual reality: InfoWorld's November study shows hybrid SAST-LLM achieves 89.5% precision (91% false positive reduction), validating AI enhancement potential; simultaneously, Legit Security's CVE-2025-62453 disclosure exposes GitHub Copilot Chat vulnerability itself (CVSS 9.6), proving AI assistants require equivalent security hardening. OX Security's "Army of Juniors" analysis documents real-world friction: 300+ repositories show AI velocity outpaces review capacity (500K+ alerts), non-technical deployments lack security knowledge, ten anti-patterns recur consistently. Feature maturity is genuine and quantified; adoption friction and governance constraints remain the limiting factors.
2026-Jan: Market adoption reaches mainstream (84% of developers using AI-assisted code review per Zylos), yet critical vulnerabilities persist undetected: Pixee.ai/Tenzai testing finds 69 vulnerabilities across 5 AI coding platforms with zero detection by traditional scanners, contradicting vendor comprehensiveness claims. Copilot Autofix deployment velocity confirms (3-12x remediation acceleration), but practitioner analysis documents code review breakdown at AI velocity: AppSec teams overwhelmed by 500K+ alerts in 300-repository analyses, with organizational readiness limiting adoption despite mature tooling. ICSE 2026 research provides rigorous empirical evaluation of SAST tools. AppSec stakeholder sentiment (StackHawk survey) documents adoption challenges. The practice exhibits deployed feature maturity (remediation automation, IDE integration, governance workflows) but faces organizational capacity constraints as AI code generation outpaces security review, remediation, and governance infrastructure.
2026-Feb: Vendor ecosystem expands with GA announcements (Snyk reachability analysis, DeepSource AI Review Engine, Anthropic Claude Code Security) while security vulnerabilities in AI tools surface (CVE-2026-21516 in Copilot for JetBrains, CVE-2026-21257 in Visual Studio integration). Research documents AI code review limitations: ProjectDiscovery benchmark shows 24 vulnerabilities missed by AI-only static review, validated by runtime testing. Multi-model iterative review shows promise (3-5x bug detection improvement), suggesting architectural advances in AI review workflows. Tension persists between tool maturity and deployment velocity—organizational security review capacity remains the bottleneck.
2026-Mar: Deployment evidence crystallizes the maturity paradox. Atlassian's peer-reviewed study of 1,900+ production repositories documents 5.75% performance gap favouring human reviewers (44.45% vs 38.7% resolution rate), revealing fundamental limitations in business logic and architecture-level assessment. DryRun Security's controlled testing of Claude Code, OpenAI Codex, and Google Gemini found 87% of generated PRs shipped vulnerabilities (143 total across 30 PRs, zero fully secure applications), with systemic failures in access control and state validation. Endor Labs achieved $15M ARR (131% YoY growth) with multi-agent SAST filtering 92% false positives, serving OpenAI, Atlassian, Snowflake alongside AI-native customers. Real-world deployment data: organizations reaching 46% AI code generation saw 23.7% vulnerability increases with 62% of findings concentrated in AI-written code; Veracode's controlled testing of 100+ LLMs on 80 security tasks documented quantified failure rates by language. Snyk Code confirmed GA availability with SaaS and local engine deployment options and Jira/Slack integrations, signalling broad ecosystem embedding. The practice shows deployed adoption alongside documented limitations: tools work but organizational capacity and tool completeness remain mismatched against AI code generation velocity.
2026-Apr: Vendor ecosystem broadened with Datadog releasing an open-source AI-native SAST tool, and peer-reviewed research (Google SAST-Genius) confirming hybrid LLM+SAST achieves 89.5% precision with 91% false-positive reduction on 25 production projects. Independent empirical testing of 7 AI code review tools showed context-aware tools (CodeRabbit 12/14, Greptile 11/14 planted bugs) outperform context-blind approaches, while production deployment data reinforced the velocity paradox: AI-generated code carries 2.74x higher vulnerability rates, driving 67% increases in review time and 118% spikes in security findings despite productivity gains—with a documented Amazon 6-hour outage (6.3M lost orders) attributed to inadequate review of AI-generated code. Uber's uReview case study confirmed production-scale security review is achievable (90% of 65K weekly diffs, 75% comment usefulness, multi-stage false-positive filtering), but simultaneously, prompt injection vulnerabilities were disclosed in Claude Code, Gemini CLI, and GitHub Copilot enabling credential exfiltration—proving AI review tools require equivalent security hardening as the code they review. Semgrep shipped AI-powered detection for logic flaws (IDORs, broken authorization) previously invisible to pattern-matching SAST, while industry surveys document the trust paradox at scale: 52% developer adoption but only 4% trusting AI output.
2026-May: Deployment reality and organizational constraint evidence solidifies. Large-scale empirical study (PanDev, 100 B2B teams, 23,847 PRs) finds AI-only approval increases defect escape rate 46%, while hybrid approaches (AI comments + mandatory human review) achieve best outcomes, validating the necessity of human oversight. VibeEval's assessment of 1,514 live AI-generated applications shows 81% contain critical/high vulnerabilities. Enterprise survey (Qodo/Censuswide, 500 engineers) documents 89% experienced AI incidents, 25% suffered complete outages—validating the maturity paradox at scale. ByteIota's developer survey (2,847 respondents) shows review now exceeds writing (11.4h/week reviewing AI code vs 9.8h/week writing), establishing verification bottleneck as operational reality. Snyk integrates Claude into security platform (May 9 announcement), embedding AI reasoning across vulnerability detection-remediation pipeline for broad rollout through 2026. False positive crisis persists: OX Security data shows 865K+ alerts/year (71-88% false positives), engineers waste ~$20k/dev/year on false positive triage, 22% of teams disabled tools due to alert fatigue. Independent benchmark (contrarian case study) shows human reviewers caught 41% more critical security bugs (17.2 per 1000 LOC) than Copilot + SonarQube combined with 0% false positives vs 12% for AI tools. Practice maturity is unambiguous (mature tooling, ecosystem integration, documented remediation gains), but organizational capacity constraints—review velocity mismatches, alert fatigue, bottlenecked AppSec teams—remain binding. The transition to leading-edge is confirmed by deployment evidence and scale, but sustainable adoption awaits governance and organizational readiness advances.