AI-assisted code review with suggestions

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN

BLEEDING EDGEESTABLISHED

LEADING EDGE

TRAJECTORY— Stalled

AI that reviews pull requests and annotates code with improvement suggestions for human reviewers to accept or reject. Includes PR review bots and automated code quality comments; distinct from auto-approve which removes the human decision step.

OVERVIEW

AI-assisted code review uses language models to annotate pull requests with quality, correctness, and style suggestions while keeping approval authority with human reviewers. The practice occupies a leading-edge position defined by adoption ubiquity offset by narrow value and persistent barriers to deployment success.

When properly configured and filtered for signal quality, suggestion-based code review can improve review time by 35% while maintaining defect escape rates (PanDev's analysis of 23,847 PRs across 100 teams found 2.4% defects with −38% review time in suggestion-only mode). Hyperscale deployments at Uber and GitHub have proved architecture can work at tens of thousands of PRs per week. Yet deployment at scale reveals hard constraints: comment signal-to-noise ratio drives adoption (75% usefulness needed to sustain developer engagement); human reviewers still identify 41% more critical bugs than AI toolchain; and code churn is doubling despite adoption. The binding constraint is not tooling availability but organizational workflow redesign, quality filtering infrastructure, and verification governance—particularly now that code-generation velocity has outpaced review capacity and created systemic risk (Amazon's March 5 outage traced to unreviewed AI code, forcing mandatory senior sign-off on 335 critical systems).

CURRENT LANDSCAPE

The vendor ecosystem has consolidated around platform incumbents, with specialist tools reaching hyperscale. GitHub's Copilot code review has processed 60 million reviews (10× growth since April 2025 launch), now handling more than one in five reviews on the platform. CodeRabbit reached 10,000+ customers with doubled revenue (Series C planned) and announced enterprise deployments across US, Japan, and India. Uber deployed uReview analyzing 90% of ~65,000 weekly diffs at hyperscale, implementing multi-stage filtering to suppress false positives; 75% of posted comments are marked useful by engineers, with 65%+ addressed. GitHub shipped metrics API breakdowns for Copilot code review by comment type (security, bug_risk), enabling outcome measurement per suggestion category at organizational scale.

Yet May 2026 evidence reveals persistent gaps in capability and deployment economics. PanDev's 12-month empirical study of 100 B2B teams (23,847 PRs) found that suggestion-only code review achieved −38% review time and 2.4% defect escape, outperforming AI-only and hybrid-strict modes—but this favorable outcome required explicit configuration and false-positive suppression. Independent benchmark research found human reviewers identify 41% more critical bugs than AI toolchain (Copilot 2.1 + SonarQube combined), achieve 0% false positives on critical issues versus 12% for AI, and cover 94% of OWASP Top 10 versus 66% for AI. Code churn is doubling despite adoption (GitClear: up to 9× higher code churn with AI tools), and MSR 2026's analysis of 33,707 AI-authored PRs showed agents merge simple changes quickly but fail to converge in iterative review. A critical new threat emerged: academic researchers disclosed prompt-injection vulnerabilities in Claude Code, Copilot, and Gemini code review agents, enabling credentials harvest via crafted PR titles and comments—with zero infrastructure requirements and severity (CVSS 9.4) exceeding vendor security bounty floors by orders of magnitude.

Organizational responses reflect deployment friction. Amazon's March 5 outage (6.3 million lost orders traced to unreviewed AI code) triggered formalized governance: mandatory senior engineer sign-off on all AI-assisted code across 335 critical systems, 90-day safety reset, stricter review gates. Individual practitioner deployments document the verification overhead: Jesse Hopkins' month-long local model experiment showed 93% false positives week 1, improved to 3-5 per 71 valid flags by week 4 with prompt retuning, but required mandatory human sign-off before production. Market penetration remains high (80% developer adoption of AI code review), but effectiveness remains bounded: tools detect only 42-48% of runtime bugs, false-positive rates range 5-15%, and suggestion adoption depends entirely on signal quality (developers ignore feedback at 29-45% hallucination rates). Architecture and business-logic validation remain gaps AI review cannot fill, and security blind spots persist (hardcoded secrets, missing input validation, CORS misconfigurations) despite prompt tuning across model scales.

TIER HISTORY

ResearchSep-2023 → Sep-2023

Bleeding EdgeSep-2023 → Apr-2025

Leading EdgeApr-2025 → present

EVIDENCE (99)

AI Writes 41% of Code Now — But Code Churn Is Doubling in 2026Research Papers2026-05-09

— Synthesis of MSR 2026, GitClear, and DORA datasets: AI code velocity creates quality-velocity paradox. MSR analysis of 33,707 agent PRs shows high merge rate for simple changes but iterative review failures; GitClear finds 9× higher code churn. Core constraint: code generation nearly free, but review and maintenance remain expensive. Establishes post-merge rework and churn problem that code review suggestion tools must address.

uReview: Scalable, Trustworthy GenAI for Code Review at UberCase Studies2026-05-08

— Uber's production uReview system processes 90% of ~65,000 weekly diffs (hyperscale deployment). Multi-stage architecture with pluggable assistants, post-processing filters to suppress false positives, and feedback loop: 75% of posted comments marked useful; 65%+ addressed by engineers. Demonstrates industrial code review suggestion system with quality controls for false-positive management at scale.

Copilot code review comment types now in usage metrics APIProduct Launches2026-05-08

— GitHub extended metrics API with Copilot code review suggestion breakdowns by type (security, bug_risk) and adoption rates (suggestions applied vs. posted). Enables enterprises to measure outcome adoption per suggestion category, signaling product maturity and outcome tracking for suggestion-based review workflows.

AI Code Review: Does It Actually Help? (Data from 100 Teams)Adoption Metrics2026-05-07

— PanDev's 12-month empirical study across 100 B2B teams (23,847 PRs) comparing review configurations: AI-assisted suggestion mode achieved −38% review time with 2.4% defect escape (vs 2.8% baseline), outperforming hybrid-strict and AI-only modes. Key finding: suggestion-based review adds value only when defect escape is tracked alongside time metrics; context-dependent bugs (architecture, business logic) remained gaps.

Testing AI code | Enterprise Quality Assurance for AI Code — Amazon Lost 6.3 Million OrdersCase Studies2026-05-07

— Named Fortune 5 outage (March 5, 2026): AI-assisted code shipped without proper review triggered 6-hour North American checkout failure, 6.3M lost orders. Amazon's structural response: mandatory senior sign-off on AI code, stricter review gates across 335 critical systems, 90-day safety reset. Direct evidence of organizational deployment challenges and policy response to code review capacity gaps.

Developer Ran AI Code Reviews for 30 Days—Here's What BrokeCase Studies2026-05-06

— Jesse Hopkins' month-long deployment of 7B-parameter local model: week 1 showed 93% false positives, improved to 3-5 per 71 valid flags by week 4 with rewritten prompts. Model caught cross-file logic gaps humans missed but required validation before production. Final pattern: renamed tool from 'code reviewer' to 'static analysis assistant,' enabling senior sign-off as mandatory step. Documents practical adoption feedback loop and false-positive reduction pattern.

CodeRabbit×Claude Codeで自動コードレビューを仕組み化するCase Studies2026-05-01

— Classmethod's practitioner integration of CodeRabbit with Claude Code: quick setup (5 min), suggestion-based workflow with `/coderabbit:review` command, automatic fixes via Claude. Detects security/concurrency/error-handling/design issues. Honest limitation: per-file context only; review time 7-30 min/PR slower than lighter tools. Documents practical deployment pattern and tradeoffs of tool integration.

Contrarian View: You Should Not Use GitHub Copilot 2.1 and SonarQube 10.5 for 2026 Code Reviews – Human Reviewers Are More AccurateResearch Papers2026-04-29

— Independent benchmark across 47 production repos (12.4M LOC): human reviewers identified 41% more critical bugs (17.2 vs 12.2 per 1000 LOC), achieved 0% false positives vs 12% for AI toolchain, covered 94% OWASP vs 66%. AI achieves 60% faster review but misses 34% of vulnerabilities. Critical limitation evidence on AI code review accuracy, compliance risk, and hidden remediation costs.

HISTORY

2023-H2: First generation of LLM-powered PR review tools deployed in production (Beko, Ant Group research). GitHub, Amazon, and IDE vendors introduced or previewed code review features. Industry surveys showed 23% current adoption vs. 90% planned adoption, signaling high growth intent but current-stage friction. Case studies documented both capability (73.8% useful reviews, improved bug detection) and cost (increased closure time, token consumption, false positives).
2024-Q1: Ecosystem matured with new tooling (ThinkReview, Greptile, CodiumAI PR-Agent refinements) and GitHub Copilot Enterprise enhancements for PR context. Organic GitHub adoption visible in developer conversations. Research challenged capability claims: LLMs handle simple changes but fail on semantic complexity; reviewers anchored to AI suggestions, missing bugs elsewhere; noise-to-signal ratios remained high. Controlled experiments with developers found no time savings despite high comment acceptance rates. Critical bottleneck shifted from feasibility to signal quality and practical utility in real review workflows.
2024-Q2: Vendor ecosystem expanded with Amazon Q Developer GA (April), signaling major cloud platform commitment. Academic research consolidated the landscape (TOSEM roadmap on MCR's AI evolution), identifying human-AI symbiosis as the future model while acknowledging persistent capability gaps. Open-source tooling matured (PR-Agent active development, 3,729+ commits). Industry adoption surveys showed 63% of enterprises piloting or deploying by mid-2023, trending toward 75% by 2028, but analyst caution remained about actual productivity impact (coding is only 20% of development lifecycle, not the 50% vendors claim). Early-stage research continued exploring LLM-based agents for comprehensive risk prediction. The practice remained in equilibrium: rising adoption driven by vendor integration and ecosystem maturity, but value proposition contested by capability research and practitioner friction on signal quality.
2024-Q3: Code review adoption shifted from experimental to normalized, with >97% of enterprise developers reporting AI tool use and GitHub Copilot ranking #2 globally. Independent tooling achieved venture scale: CodeRabbit raised $16M Series A with 600+ paying organizations and Fortune 500 pilots. However, independent evaluations revealed accuracy gaps (CodeWhisperer 31.1%, Copilot 46.3%, ChatGPT 65.2% correct code generation). Duolingo case study showed 67% reduction in review time with Copilot, but enterprise surveys revealed adoption-confidence gap: 38% of US developers report active organizational encouragement despite near-universal individual adoption. Critical journalism and vendor-agnostic assessments highlighted persistent false-positive rates and signal quality concerns. The practice consolidated around normalized integration into enterprise tooling, but value proposition remained contested—velocity gains were balanced by noise, accuracy concerns, and unvalidated productivity claims at scale.
2024-Q4: Major cloud vendors moved from preview to GA: AWS launched Amazon Q Developer code review capability (December 2024), marking end-of-year consolidation of enterprise platform commitment. Empirical evidence from peer-reviewed ICSE 2025 study (published December 24) documented real production deployments across 238 practitioners: 73.8% comment resolution but +2.5h PR closure time, providing balanced evidence of adoption with quantified trade-offs. GitHub Copilot adoption remained strong (64.5% user retention in JetBrains survey), with Copilot and Amazon Q at 31.1% accuracy on code generation. Critical research on LLM code verification (December 2024) revealed fundamental limitations: models incorrectly validate faulty and vulnerable code, with only 25-69% improvement via guided intervention. Specialized tooling ecosystem matured (CodeAnt, CodeRabbit) with claimed large-scale deployments, but technical evaluation of Amazon Q showed strength in security detection with gaps in broad quality analysis. The practice entered end-year in normalized, mature equilibrium: ubiquitous adoption and vendor integration alongside persistent signal quality concerns and empirical evidence that velocity gains trade off PR closure time, accuracy gaps remain unresolved, and autonomous code assessment reliability is limited.
2025-Q1: Vendor platform expansion continued: AWS released new /review agent features in Amazon Q Developer (Feb 2025). Real-world case studies from independent organizations (Apriorit) documented 20% development cycle improvements, but critical Q1 2025 evidence mounted: industry analysis showed AI code produces 1.7x more issues with net -12% productivity impact; practitioner reports documented consistently incorrect AI suggestions and confidence-accuracy mismatches; peer-reviewed research (CHASE 2025) confirmed adoption barriers around trust and context limitations. Practice remains normalized with broad adoption but with mounting evidence that velocity gains trade off against review cycle friction and signal quality gaps.
2025-Q2: Deployment reached 51% of enterprise pull requests (Jellyfish analysis of 2M+ PRs, 260% YoY growth from 14% baseline). GitHub expanded platform features with custom-instructions for Copilot code review (June 2025). However, Q2 evidence revealed emerging deployment-confidence gaps: 84% of developers use AI code review but only one-third trust accuracy; 42% of developers report AI generates half or more of their code, yet one-third don't review pre-deployment (Cloudsmith report). Practitioner evidence showed velocity gains offset by verification overhead: experienced developers 19% slower with early-2025 AI tools (Metr study). Adoption surveys showed 92% face pressure to adopt AI tools with 66% concerned about job displacement. The quarter solidified market normalization but highlighted production-safety blind spots: AI-generated code volume now outpaces review capacity and human verification costs.
2025-Q3: Adoption plateaued while effectiveness concerns intensified. GitHub shipped copilot-instructions.md GA for organizational customization (Aug 2025); AWS released interactive Amazon Q Developer code review features (Sep 2025), signaling vendor focus on feature parity and enterprise configuration. However, deployment data reversed earlier headlines: Jellyfish analysis of 400 companies showed agents on only 22% of reviews with 18% resulting in code changes. Large-scale empirical study of 16 AI code review tools found 0.9-19.2% effectiveness vs. 60% for humans. Google DORA survey revealed adoption-trust paradox: 90% use AI but only 24% trust it "a lot" (Sep 2025). Bain report documented only 10-15% productivity gains with METR evidence showing developers 5-19% slower due to verification overhead. The quarter marked consolidation around platform ubiquity without commensurate effectiveness gains. Code review effectiveness, not capacity, emerged as the binding constraint.
2025-Q4: Adoption surge met deepening trust crisis. Code review agent adoption climbed to 51.4% in engineering teams (Oct 2025, from 14.8% at year start), with 90% of teams using AI assistance and 41% of code output AI-generated/assisted. Vendor consolidation accelerated: AWS deprecated CodeGuru Reviewer and consolidated code review into Amazon Q Developer GA (Nov 2025). Yet sentiment data painted a concerning picture: Stack Overflow's 49,000-developer survey (Dec 2025) revealed 80% adoption but trust fell to 29%, with 45% frustrated by "almost right" AI code requiring rework. Practitioner evidence documented unintended consequences: GitClear analysis found 8x code duplication and 37.6% vulnerability increase when same AI model reviewed its own output (Qodo, Dec 2025); team case study showed initial metric gains evaporated as AI review eroded mentorship and code homogeneity (Dec 2025). The quarter crystallized the effectiveness paradox: infrastructure and tooling achieved ubiquity, but signal quality, trust, and verification overhead remained unresolved. The binding constraint shifted from availability to deployment maturity.
2026-Jan: Bottleneck revealed in first-month research cascade. Independent empirical studies quantified the scalability crisis: MSR 2026 analysis of 33,707 AI-authored PRs found 28.3% instant-merge rate but sustained high review effort in iterative cases, with top 20% of highest-effort PRs consuming 69% of total labor (mining-challenge.2026). MSR 2026 peer-reviewed study showed AI PRs generate positive reviewer sentiment but carry higher redundancy and lower code reuse, masking technical debt (arxiv.2601.21276). Sonar developer survey (1,100+, Jan 2026) documented verification paralysis: 72% daily AI tool use, but 96% doubt correctness and only 48% verify before commit; 38% report AI review verification harder than human review. Baytech synthesis (Jan 2026) quantified productivity paradox: METR 2025 RCT shows experienced developers 19% slower due to verification overhead despite psychological belief in 20% gains. Market analysis (Zylos, Jan 2026) confirmed mainstream penetration: 84% developer adoption, 20% of enterprises using AI to review 10-20% of PRs, but leading tools detect only 42-48% of runtime bugs with 5-15% false-positive rates. Practitioner frameworks emerged (Sancho, Jan 2026) proposing hybrid three-confidence-dimension model to address review overload (AI PRs regularly exceed 2,000 LOC, far above human cognitive limits of 200-400 LOC/hour). Month consolidated the mature-stage diagnosis: infrastructure ubiquity achieved, but signal quality, review capacity scaling, and workflow adaptation remain unresolved. Binding constraint identified as hybrid-model organizational design, not tool capability.
2026-Feb: Systematic reliability research published in arXiv preprint (Feb 28) reveals fundamental failure modes: LLMs frequently misclassify correct code as non-compliant, with higher misjudgment rates under detailed prompts requiring explanations. Real-world deployment data continues to show paradoxical outcomes: AWS product page reports BT Group at 37% and NAB at 50% suggestion acceptance in production (escalating to 60% with codebase customization), yet industry aggregation (10x.pub synthesis, Feb 11) quantifies the "40% code review quality deficit" with 1.7x higher issue density in AI-reviewed code, PR sizes 18% larger, and incidents 24% higher. Senior engineers report 3.5x longer verification cycles when reviewing AI suggestions. Code review time increased ~91% despite AI adoption (Faros AI, Feb 2026 analysis), contradicting DORA improvement claims. Comparative tool testing (Manus, Feb 13) across 9 platforms found significant variance in detection quality on security-critical logic (RBAC, auth, middleware). Month crystallized the maturity plateau: tooling ubiquity (AWS, GitHub, CodeRabbit all scaling) coexists with unresolved signal-quality gaps and emerging organizational concern about unintended consequences—mentorship erosion, junior pipeline collapse (60% drop in entry-level hiring since 2022), and senior engineer burnout from extended verification cycles. The core tension remains unresolved: code generation velocity now outpaces both review capacity and organizational ability to maintain code quality standards and engineering culture.
2026-Mar: Market maturation and organizational response converge. GitHub reported 60 million Copilot code reviews since April 2025 launch (10x growth), handling >20% of all reviews on platform with agentic memory and repository context, while CodeRabbit announced 10,000+ customers with doubled revenue and $86M total funding (Series C planned). Large-scale empirical research definitively quantified the bottleneck: LinearB analysis of 8.1M PRs from 4,800 teams showed AI code waits 4.6x longer for review start but reviews 2x faster once started (net slowdown); acceptance gap of 51.7 percentage points (32.7% AI vs 84.4% human) directly mirrors trust gap (96% distrust, 48% verify). Independent tool comparison (Cotera) found Copilot achieves 64% actionable suggestion rate, CodeRabbit 58%, with critical limitation: lack of codebase context awareness. HubSpot case study documented organizational adaptation: internal Sidekick agent evolved from infrastructure-heavy Kubernetes approach to focused "Judge Agent" filtering low-value feedback before publication, reducing latency 90% and establishing feedback quality (not volume) as constraint. Amazon formalized verification response: mandatory senior engineer sign-off on all AI-assisted code following March 5 outage, reflecting 96% correctness distrust and 48% verification gap. Production incident case study documented 12 subtle vulnerabilities missed by traditional review, requiring AI-specific security checklist to achieve 94% detection at 47% review time cost. Atlassian peer-reviewed study of 1,900+ repos found AI tools resolve only 38.70% of security issues vs 44.45% human, with critical blind spots in business logic and architecture-level risks. Quality-speed tradeoff crystallized: cycle time dropped 24% but defect density increased 1.7x. Month confirmed leading-edge diagnosis: tooling ubiquity and vendor consolidation achieved, but review capacity saturation, quality-speed tradeoffs unresolved, and organizational adaptation (workflow redesign, verification infrastructure, governance) critical differentiator between successful and failing deployments. The binding constraint shifted from tool availability to organizational design capacity.
2026-Apr: Tool capability evaluation reaches maturity with convergent benchmarking and incident evidence. Independent benchmarks quantified effectiveness limits: Martian's 200,000+ PR analysis across 17 tools shows 50-60% F1 scores with CodeRabbit leading at 51.2%; Entelligence's evaluation on 67 real production bugs finds even best tools (Entelligence 47.2%, CodeRabbit 33%, Copilot 22.6%) miss >50% of real bugs—establishing that current tools cannot be relied upon for comprehensive review. Formal verification research (Z3 SMT solver on 3,500 artifacts) reveals critical generation-review asymmetry: models catch their own vulnerabilities 78.7% of the time in review mode despite generating them 55.8% by default, validating AI code review value but also confirming organizational need for multi-layer verification. Real-world deployment metrics widen the concern: Fortune 500 financial services (40+ engineers with Claude Code and Copilot) reported 52% code review time increase despite 30% PR volume increase, with senior engineers spending 6-8 hours/week on AI-generated reviews (up from 4-5 hours), forcing fundamental workflow redesign. GitHub shipped new API metrics (total_merged_reviewed_by_copilot, median_minutes_to_merge) signaling production maturity and enabling independent ROI measurement. Trust erosion incidents mounted: Copilot injected promotional text into 1.5M+ PR descriptions without developer control, documenting integrity risks in code review surfaces; Branch8's managed engineering firm (200+ teams) published post-incident governance requiring 3% engineering capacity overhead to safely operate AI-assisted review. CodeRabbit analysis of 470 GitHub PRs confirmed quality paradox: AI code produces 1.7x more issues (10.83 vs 6.45 per PR), 2.74x higher security vulnerabilities, 3x worse readability, with 75% manual review adoption yet incidents still surging 23.5%. Late-month evidence (April 15-28) reinforces constraints: JetBrains empirical research (800-dev longitudinal study presented at ICSE 2026) confirms workflow shifts reveal adoption-value gaps; Google deployment case study documents 75% of new code AI-generated with engineers spending 11 minutes reviewing each changelist focused on security and architecture; Cloudflare case study shows multi-agent orchestration (7 specialized reviewers) deployed across tens of thousands of PRs; Black Duck's OSSRA report identifies critical governance failure (only 24% comprehensive review rate correlating with 107% vulnerability surge); practitioner analysis confirms effectiveness ceiling at 50-60% with material post-merge defect rates. GitHub's expanded metrics API (active/passive user tracking) signals feature maturity and enterprise adoption readiness. Month consolidated April evidence into clear picture: tooling capability hitting hard limits (50-60% effectiveness ceiling), organizational costs mounting (52% review time increase), governance gaps widening (24% review rate), and structural gaps (generation-review asymmetry, business-logic blind spots) requiring governance and workflow redesign rather than tool improvement. The practice approaches plateau: ubiquitous adoption without commensurate value delivery remains the defining constraint.
2026-May: Evidence cascade reveals both positive deployment outcomes and critical safety/capability gaps. PanDev Metrics' comprehensive empirical study of 100 B2B teams (23,847 PRs over 12 months) documented that suggestion-only code review configuration achieved −38% review time (2.6 hours vs 4.2 baseline) with 2.4% defect escape (vs 2.8% baseline, slight improvement)—providing balanced evidence that properly configured suggestion-based review can deliver value when signal quality is prioritized over volume. Hyperscale deployment confirmation: Uber's uReview system processes 90% of ~65,000 weekly diffs in production with multi-stage filtering achieving 75% comment usefulness and 65%+ action rate, demonstrating that industrial-scale code review suggestion systems can operate at hyperscale. GitHub's metrics API expansion (May 2026) added suggestion type breakdowns and adoption measurement per category, enabling organizations to measure outcome ROI per suggestion type. Critical safety disclosure: independent researchers published prompt-injection vulnerability affecting Claude Code, Copilot, and Gemini code review agents (CVSS 9.4 critical), enabling credentials harvest via crafted PR titles/comments with zero infrastructure requirements—exposing systemic vulnerability in code review agents with access to repository secrets. Negative signal reinforced: independent benchmark across 47 production repos showed human reviewers identify 41% more critical bugs than AI toolchain (Copilot 2.1 + SonarQube), achieve 0% false positives vs 12% for AI on critical issues, and cover 94% OWASP Top 10 vs 66% for AI—establishing lower capability ceiling than tool vendors claim. Code churn signal worsened: MSR 2026 analysis of 33,707 AI-authored PRs combined with GitClear dataset (up to 9× higher code churn) and practitioner reports (66% of developers report AI output "almost correct but still flawed") establish post-merge rework burden offsetting pre-merge time savings. Practitioner deployment patterns: Jesse Hopkins' month-long local model experiment improved false-positive rate from 93% (week 1) to 3-5 per 71 valid flags (week 4) with prompt tuning, but established mandatory human sign-off as required gate; CodeRabbit×Claude Code integration case study documents practical workflow (setup 5 min, review 7-30 min/PR) with honest limitation acknowledgment (per-file vs project-wide context, slower than lighter tools). Month consolidated May evidence into clear picture: suggestion-based review can work when signal quality is ruthlessly prioritized (filtering, tier-appropriate deployment, verification gates), hyperscale operations are achievable (Uber, GitHub), but baseline AI code review effectiveness remains below human reviewers and security vulnerabilities in code review agents themselves pose new risks. The binding constraint shifted from tooling availability to organizational deployment sophistication: teams must implement quality filtering, governance overlays, and verification infrastructure to extract value.