Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.


AI Maturity by Domain

[Interactive chart: one dot per domain marking the weighted maturity of its practices, on an axis from Bleeding Edge to Established]

Verification — fact-checking, citations & source quality

LEADING EDGE

TRAJECTORY: Advancing

AI that verifies citations, validates factual claims, assesses source quality, and ranks source reliability. Includes automated reference checking and misinformation detection; distinct from research retrieval which finds information rather than verifying it.

OVERVIEW

AI-powered verification has split into two sharply different realities. Specialized tools -- purpose-built for citation checking, claim detection, and source assessment -- now run in production at forward-leaning newsrooms, academic libraries, and government programmes. General-purpose LLMs, by contrast, remain systematically unreliable verifiers; independent benchmarks consistently find 30-70% of their citations distorted, fabricated, or unsupported. That gap defines the practice's leading-edge position: proven capability exists in dedicated systems, but the broader market has not yet adopted them at scale -- and major platforms have largely dismantled their fact-checking infrastructure, shifting responsibility to specialist providers.

June 2026 evidence sharpens the bifurcation further. Large-scale empirical studies document persistent failures: an international audit scanning 2.5 million medical articles identified 4,046 fabricated citations across 2,810 papers, with publishers taking action in fewer than 2% of flagged cases. Stanford researchers measured chatbot performance on 2,100 factual questions and found that retrieval failures drive 70%+ of errors. An independent benchmark of election fact-checking (3,136 prompts, expert-judged) found that major chatbots fail verification 90% of the time, cite state-controlled media, and misattribute claims. Yet specialized tools continue to prove themselves: a Columbia-led verification system has been deployed at scale; Amazon's adaptive fact-checking protocol lifts accuracy from 60.8% to 90.9%; multi-model verification in regulated sectors cuts hallucination rates by 61% (from 8.3% to 3.2%); and Factiverse operates multilingual fact-checking across 114 languages. Google and OpenAI announced mainstream verification infrastructure (SynthID watermarking, C2PA Content Credentials) reaching billions of Search users. The maturity question is no longer whether these tools work, since deployment evidence is now substantial and quantified. It is whether resource-constrained organizations and developing markets can access the specialized tooling that has proven effective, and whether the broader market will adopt verification as table stakes rather than an optional enhancement.

CURRENT LANDSCAPE

Specialized verification vendors continue to expand their institutional footprint while demonstrating measurable deployment results. An international audit (Columbia University, University of Eastern Finland, Tel Aviv Sourasky Medical Center) scanning 2.5 million PubMed articles identified 4,046 fabricated citations across 2,810 papers. The result demonstrates both the system's detection capability and the critical adoption gap: publishers had taken action on fewer than 2% of flagged articles by scan time. Factiverse deployed multilingual fact-checking across 114 languages, with fine-tuned compact models outperforming LLM baselines, and maintains operational deployments with NRK and Viestimedia that achieve 95% transcription accuracy in live broadcast analysis. In academic libraries, Scite continues to spread (Old Dominion University, University of Pretoria), while specialized tools reach ~87% accuracy on scientific claim verification. Full Fact processes hundreds of thousands of sentences daily across 30 countries and 40+ fact-checking organizations. INRA.AI deployed citation validation and claims less than 0.1% hallucination through multi-layer verification. As of June 2026, Full Fact continues to detect AI-generated disinformation at scale, including deepfakes, synthetic media, and fabricated narratives.

Major platforms moved verification infrastructure into the mainstream. In May 2026, Google announced the integration of SynthID watermarking and C2PA Content Credentials into Search, Lens, AI Mode, and Chrome, reaching billions of users globally. OpenAI became a C2PA Conforming Generator and adopted SynthID watermarking for ChatGPT outputs; multiple vendors (Kakao, ElevenLabs, NVIDIA) integrated SynthID. Google launched an AI Content Detection API for partners (Shutterstock, Snap, Canva). Amazon Science developed an adaptive verification methodology (an audit-then-score protocol) that lifts fact-checking accuracy from 60.8% to 90.9% by treating ground truth as an evolving process rather than a fixed dataset. These advances signal industry-wide recognition that verification infrastructure is essential rather than optional.

Quantified verification ROI is emerging in regulated sectors. A large-scale enterprise study (480 million outputs across legal, financial services, and healthcare, Jan-Apr 2026) documented that multi-model verification architectures reduce hallucination rates by 61% (from 8.3% to 3.2%), with the highest impact in legal document processing. This contrasts sharply with the verification failures documented in civic fact-checking: an independent benchmark (Forum AI, 3,136 election-related prompts, expert judges) found that major chatbots fail verification 90% of the time, with 36% containing factual errors and pervasive citation of state-controlled media. Stanford's evaluation of six chatbots on 2,100 factual questions revealed that retrieval failures, not reasoning failures, drive 70%+ of errors.
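To make the idea of a multi-model verification architecture concrete, here is a minimal Python sketch. It is not the architecture used in the study cited above, whose design the source does not describe. It assumes each verifier is a callable returning one of the labels "supported", "refuted", or "unsure", and that a claim is escalated to a human reviewer when agreement falls below a threshold. The function names, labels, and the 0.75 threshold are all hypothetical. A small helper also reproduces the 61% relative reduction from the 8.3% and 3.2% rates quoted above.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# A verifier is any callable that maps a claim to a verdict label.
# In a real system each one would wrap a different model; here they are stubs.
Verifier = Callable[[str], str]


def consensus_verdict(
    claim: str,
    verifiers: Sequence[Verifier],
    min_agreement: float = 0.75,  # assumed threshold, not taken from the source
) -> Tuple[str, float]:
    """Return the majority verdict, or escalate when the verifiers are split."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    votes = Counter(verify(claim) for verify in verifiers)
    verdict, count = votes.most_common(1)[0]
    share = count / len(verifiers)
    # Low agreement or an "unsure" majority goes to a human reviewer.
    if verdict == "unsure" or share < min_agreement:
        return "needs_human_review", share
    return verdict, share


def relative_reduction(before: float, after: float) -> float:
    """Relative drop between two rates, e.g. hallucination rates in percent."""
    return (before - after) / before


if __name__ == "__main__":
    # Stub verifiers standing in for three independent models (hypothetical).
    stubs = [lambda c: "supported", lambda c: "supported", lambda c: "refuted"]
    print(consensus_verdict("Example claim", stubs))
    print(f"{relative_reduction(8.3, 3.2):.1%}")  # 61.4%
```

With three stub verifiers split two to one, agreement is about 67%, below the assumed threshold, so the claim is routed to human review. That matches the pattern the section describes: automated checks narrow the workload, and people decide the contested cases.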

The broader adoption constraint remains fundamentally unchanged: single-model fact-checking is unreliable (67% disagreement among frontier models on the same 1,000 claims), and general-purpose LLMs continue to produce lower-quality outputs than specialized tools. A Cornell study of 2+ million papers found that LLM-written papers achieve higher writing complexity scores but lower journal acceptance rates. This is the verification signal problem: polished prose no longer correlates with research quality, forcing peer reviewers and journal editors to invest heavily in manual verification. Where verification succeeds, the pattern is consistent: human review in the loop, bounded domain expertise, and dedicated tooling rather than general-purpose LLMs. The adoption barrier is not technical capability; it is market literacy and resource access in developing regions.
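The disagreement figure above can be made concrete. The Python sketch below shows one way to measure how often several models disagree on the same claims, along with Krippendorff's alpha for nominal labels, a standard chance-corrected agreement statistic. The source does not state how its disagreement figure was computed. The definition used here (a claim counts as disputed when the models do not all return the same label), the function names, and the sample labels are assumptions for illustration only.

```python
from collections import Counter
from itertools import permutations
from typing import Sequence


def disagreement_rate(labels_per_claim: Sequence[Sequence[str]]) -> float:
    """Share of claims on which the models did not all return the same label."""
    disputed = sum(1 for labels in labels_per_claim if len(set(labels)) > 1)
    return disputed / len(labels_per_claim)


def krippendorff_alpha_nominal(labels_per_claim: Sequence[Sequence[str]]) -> float:
    """Krippendorff's alpha for nominal labels (1.0 = perfect agreement)."""
    # Coincidence matrix: every ordered pair of labels given to the same claim,
    # weighted by 1 / (number of labels on that claim - 1).
    coincidences: Counter = Counter()
    for labels in labels_per_claim:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used, so alpha is undefined in theory;
        # this sketch reports 1.0 for simplicity.
        return 1.0
    return 1 - observed / expected


if __name__ == "__main__":
    # Hypothetical verdicts from three models on four claims.
    sample = [
        ["true", "true", "true"],
        ["true", "false", "true"],
        ["false", "false", "false"],
        ["true", "false", "unsure"],
    ]
    print(disagreement_rate(sample))
    print(round(krippendorff_alpha_nominal(sample), 3))
```

The raw disagreement rate is easy to read but ignores agreement that would occur by chance. Alpha corrects for that, which is why reliability thresholds are usually stated in terms of alpha rather than percentage agreement.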

Academic and corporate publishers have pivoted to deploying AI verification infrastructure in response to a quantified research integrity crisis. In 2025, publishers documented 80,000+ fabricated citations inserted into metadata, 129 paper retractions due to AI-generated content, and widespread undisclosed AI use in manuscript generation. By April 2026, Wiley, Elsevier, and Springer Nature had integrated AI-powered detection tools into standard manuscript screening and peer review workflows. The crisis extends beyond publishing: independent researchers conducting cross-team citation verification exercises (April 2026) found only 20% of AI-generated citations verifiable, with patterns including source fabrication, context misattribution, and blended hallucinations in which real fragments are incorrectly combined.
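The failure patterns named above suggest a simple first-pass triage, sketched here in Python. It is an illustration under stated assumptions and does not represent any publisher's or vendor's actual tool. It assumes a trusted local index of known references and compares only title, authors, and year. The `Citation` class, the `triage` function, and the three result labels are hypothetical. Metadata matching of this kind can flag likely fabrications and blended citations, but it cannot detect context misattribution, which requires reading the cited source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Citation:
    title: str
    authors: Tuple[str, ...]
    year: int


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def triage(candidate: Citation, index: List[Citation]) -> str:
    """Classify a citation against a trusted index of known references.

    Returns "verified" when title, authors, and year all match a record,
    "blended" when the title exists but other fields differ (real fragments
    combined incorrectly), and "not_found" when no record has that title.
    """
    title = normalise(candidate.title)
    authors = {normalise(a) for a in candidate.authors}
    title_seen = False
    for record in index:
        if normalise(record.title) != title:
            continue
        title_seen = True
        record_authors = {normalise(a) for a in record.authors}
        if authors == record_authors and record.year == candidate.year:
            return "verified"
    return "blended" if title_seen else "not_found"


if __name__ == "__main__":
    # Placeholder records, not real publications.
    index = [Citation("Example Study A", ("A. Author", "B. Author"), 2021)]
    print(triage(Citation("example study a", ("B. Author", "A. Author"), 2021), index))
    print(triage(Citation("Example Study A", ("C. Author",), 2019), index))
    print(triage(Citation("Example Study Z", ("A. Author",), 2021), index))
```

Anything other than "verified" would go to a human reviewer. A "not_found" result may simply mean the index is incomplete, so it is a prompt for checking and not proof of fabrication.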

The case against relying on general-purpose models for verification keeps getting stronger. A January 2026 Verifing benchmark found only 61% of ChatGPT citations verifiable. An Enago meta-analysis of biomedical literature documented one in five AI-generated references as entirely fabricated, with nearly half containing serious bibliographic errors. PAN's analysis of 11,000 ChatGPT-generated citations found a 31% error rate. A Washington State University peer-reviewed study (2026) tested ChatGPT on 719 scientific hypotheses and found only 60% above-chance accuracy, with only 16.4% accuracy identifying false statements and 73% consistency on repeated queries. These are not outlier findings; they converge with Stanford's EMNLP audit, which showed only 51.5% citation recall across four generative search engines, and with NeurIPS peer-review audits. Google's April 2026 FACTS Benchmark found even the best-performing model (Gemini 3 Pro) at only 69% factual accuracy; Google AI Overviews, serving 2 billion monthly users, show 91% answer accuracy, but over 50% of those answers are "ungrounded", meaning the citations do not actually support the claims.

The high-stakes consequences of unverified citations are accelerating enforcement. A global database now documents 1,227 cases of AI hallucination in courts across 30+ countries (811 in the US alone), with 1,022 fabricated citations, 323 false quotes, and 492 misrepresented holdings. Courts are imposing penalties ranging from $15K to $109K (the highest, in March 2026), along with disqualifications and fee sanctions. Q1 2026 saw at least $145,000 in documented court sanctions, with named cases including Couvrette v. Wisnovsky (Oregon: $15.5K), William Ghiorso (Oregon Court of Appeals: $10K), and Whiting v. City of Athens (Sixth Circuit: $15K per attorney). Attorney AI adoption has tripled (11% to 30% year-over-year), forcing institutional adoption of verification workflows. The New Orleans City Attorney implemented a department-wide AI disclosure and annual compliance certification policy (March 2026), signaling that verification infrastructure is now table stakes for legal compliance.

Broader adoption faces structural headwinds. Major platforms dismantled their fact-checking infrastructure in early 2025, shifting verification responsibility onto specialist providers and end users. Fact-checking interventions that do work—human review, accuracy prompts, fact-check labels—reduce false belief by 25-28% and misinformation sharing by 2.6-6.3%, but only for specific user segments and at the cost of sustained institutional investment. Enterprise and government buyers often lack the literacy to distinguish precision verification tools from general-purpose chatbots. Resource constraints in developing markets limit access to the specialized tooling that has proven effective elsewhere; funding cuts (e.g., Google's 2025 funding withdrawal from fact-checking projects) can rapidly reverse deployment progress.

TIER HISTORY

Research: Jan-2023 → Jan-2023
Bleeding Edge: Jan-2023 → Jan-2025
Leading Edge: Jan-2025 → present

EVIDENCE (130)

— Cornell study of 2M+ papers shows LLM-written papers achieve higher writing complexity but lower journal acceptance—demonstrates verification gap: traditional quality signals no longer distinguish AI-assisted content.

— Independent benchmark (6,000 questions, 42 topics): measures factuality and hallucination across frontier models; top performer (Claude Fable 5) achieves 61% accuracy, reveals persistent factuality weaknesses.

— Peer-reviewed Factiverse deployment (114 languages claim detection, 28 languages veracity): fine-tuned compact models outperform LLM baselines on multilingual verification at scale with strong efficiency gains.

— AI.cc study (480M outputs, legal/financial/healthcare): multi-model verification reduces hallucination from 8.3% to 3.2% (61% reduction). Demonstrates quantified ROI of verification architecture in regulated sectors.

— Flagship deployment: Columbia University-led international team deployed AI verification system scanning 2.5M PubMed articles, identified 4,046 fabricated citations across 2,810 articles, demonstrates adoption gap (no publisher action in 98.4% of flagged cases).

— Stanford study (2,100 factual questions, 6 chatbots): retrieval failures drive 70%+ of errors; identifies systematic verification gaps in production AI systems including false-premise detection paradox.

— Amazon Science: audit-then-score protocol for AI-generated research reports lifts verification accuracy from 60.8% to 90.9%, demonstrates dynamic adaptive verification outperforms traditional fact-checking systems.

— Google mainstream GA: SynthID watermarking and C2PA Content Credentials in Search, Lens, Chrome; verification reached consumer scale (billions of Search users); AI Content Detection API launched for partners.

HISTORY

  • 2023-H1: AI fact-checking tools show significant limitations with general-purpose LLMs (ChatGPT 50% accuracy, phantom citations). Dedicated fact-checking organizations adopt AI for workflow acceleration. Specialized tools (Scite, Factiverse) emerge for citation analysis and claim verification, but human review remains essential.

  • 2023-H2: Specialized tools advance toward ecosystem integration: ScienceOpen deploys Scite badges across 75M+ scholarly articles; Factiverse secures R&D funding for explainability. Multimodal fact-checking emerges as research frontier. Practitioner skepticism documented in developing-market newsrooms. Ghost citation problem in scientific publishing becomes mainstream concern.

  • 2024-Q1: Fine-tuned models demonstrate superiority over GPT-4 on claim detection in multilingual contexts; Factiverse production pipeline operational across 90+ languages on Google Cloud Platform. Institutional adoption accelerates (Purdue University reference checking). Production tool failures persist: Google Fact Check shows only 15.8% retrieval on real claims. RAG systems continue generating unfaithful citations. Research benchmarks (FEVER) show steady progress with 63% top accuracy. Domain-specific approaches (Climinator for climate claims) show promise.

  • 2024-Q2: Full Fact and Factchequeado deploy AI-assisted fact-checking at scale in high-stakes elections (UK general election, US political events). Factiverse expands FactiSearch to 52 languages and ecosystem partnerships. Citation accuracy research reveals critical gap: frontier LMs achieve only 4–18% accuracy on scientific citations vs. 70% human. Qualitative study across 29 global fact-checking organizations identifies persistent barriers: tool transparency, resource constraints, platform dependencies. Specialized tools continue maturing (Scite integration in Article Galaxy platform) while production reliability gaps remain.

  • 2024-Q3: Specialized verification tools demonstrate production readiness: Factiverse deployed live fact-checking in US presidential debate (flagging 234 claims in 90 minutes) and expanded beyond newsrooms to analyst and consultant workflows; Scite adoption spreads geographically with institutional trials in Central Europe. Research benchmarks (CiteME) quantify the LM citation accuracy gap at 4–18% versus 70% human baseline. Nature publishes systematic analysis of LLM factuality challenges. Bifurcation between specialized tools and general-purpose models sharpens—production systems show deployable capability in bounded domains, but continued structural failures in citation accuracy, coverage, and knowledge currency persist.

  • 2024-Q4: Crisis in LM-generated citations becomes quantified: GhostCite analysis of 2.2M citations finds hallucination rates of 14–95% across domains, with 80.9% increase in fabricated citations in 2025. CiteME benchmark confirms frontier LMs at 4.2–18.5% accuracy vs. 69.7% human on scientific citations. ChatGPT citations show a 76.5% inaccuracy rate. Specialized tools scale globally: Factiverse expands to major financial institutions (Norwegian bank partnerships) with 80% claim veracity accuracy; CheckMate launches WhatsApp-based public fact-checking in Singapore (2,700 users). MIT CSAIL develops ContextCite for source attribution tracing. Structural bifurcation now complete: specialized fine-tuned tools show production-grade deployment in bounded verification workflows; general-purpose LLMs confirm unreliability for citation and claim verification, forcing practitioners to design tool-augmented human-review systems rather than autonomous AI verification.

  • 2025-Q1: Specialized verification tools continue institutional adoption (Scite at major universities, Factiverse with 5,000+ users) while platforms dismantle fact-checking: Meta replaces fact-checkers with community notes (January 2025), Google removes credibility warnings from search. Tow Center study documents systemic citation failures across eight AI search engines (>60% inaccurate). LLM performance on citation screening shows high variance. Full Fact's production platform operates at scale, explicitly rejecting autonomous "truth-determination" by AI. Platform rollback driven by regulatory fear and business incentives for unmoderated AI—marking sharp divide between committed verification practitioners and platform optimization for engagement over accuracy.

  • 2025-Q2: Specialized verification tools demonstrate sustained deployment: Klaipeda University and Albert Einstein College of Medicine deploy Scite institutional subscriptions; Factiverse enters government contracts for disinformation detection via NATO accelerator. Research advances (FactIR benchmark from production logs, SemanticCite citation verification) provide empirical grounding for next-generation systems. Practitioner reviews identify persistent limitations: missing studies, centrist bias in academic tool results. Newsroom integration across Der Spiegel, Full Fact, Maldita continues at scale with emphasis on evidence retrieval and verdict transparency. Field trajectory confirms: specialized tools operationalize successfully in bounded domains; general-purpose LLM verification remains unreliable; AI operates best as human-review acceleration, not autonomous authority.

  • 2025-Q3: Production deployments expand and citation hallucination crisis intensifies. Full Fact processes 300,000+ sentences daily, supporting 40+ fact-checking organizations across 30 countries and 12 national elections; Factiverse achieves 95% transcription accuracy in live broadcast analysis with NORDIS partnership. Simultaneously, Harvard Kennedy School formalizes AI hallucinations as a distinct misinformation category, documenting 46% of Americans using AI for information seeking. NewsGuard audit finds 11 leading chatbots repeating false claims 20% of the time; NeurIPS 2025 peer-review crisis surfaces 100+ hallucinated citations across 53 papers. Biomedical research quantifies ChatGPT citation accuracy at 7.5% initially (rising to 77.5% post-verification). The window consolidates the bifurcation: specialized tools demonstrate production-scale real-world deployment with measurable impact on electoral integrity and newsroom efficiency; general-purpose LLM citation and claim verification shows persistent and documented systemic failure. Adoption boundary remains clear: tools operationalize within bounded verification workflows supported by human review; autonomous AI verification continues to be unreliable.

  • 2025-Q4: Specialized verification tools announce product expansion and demonstrate comparative accuracy advantages. INRA.AI deployed citation validation achieving <0.1% hallucination through six-layer verification, directly outperforming frontier LMs (18–55% hallucination); Factiverse announced spring 2026 product launches (Web, Live, API) with production deployments across NRK, Viestimedia, and NATO governments. JMIR Mental Health peer-reviewed study (November 2025) quantified GPT-4o citation fabrication at 6–29% with inverse correlation to topic familiarity. The window strengthens the leading-edge boundary: specialized tools show measurable product maturity and competitive metrics; general-purpose LLM verification documented with higher precision, confirming structural unreliability. Platform-level fact-checking remains dismantled; bottleneck to broader adoption is not technical but market education and resource constraints in developing regions.

  • 2026-Jan: Institutional adoption of specialized verification tools accelerates while convergent evidence quantifies citation hallucination crisis. Old Dominion University Libraries launches Scite and Consensus trials through VIVA consortium, expanding library-level adoption. Verifing benchmark (Jan 2026) shows only ~61% ChatGPT citations verifiable vs. 39% unverifiable; Enago meta-analysis documents 19.9% complete fabrication and 45.4% serious errors in AI-generated references across PMC studies. Originality.ai benchmark shows specialized tools achieving ~86.7% accuracy on scientific claims, outperforming GPT-4o. User survey (Eight Oh Two) finds 85% of AI users double-check answers but demand stronger verification and citation integrity. NeurIPS 2025 confirms 100+ hallucinated citations across 53 papers. Evidence convergence across multiple independent sources reinforces leading-edge tier: specialized tools show production deployment and measurable accuracy advantages, while general-purpose LLM verification remains systematically unreliable with high-precision quantification.

  • 2026-Feb: Research advances and institutional adoption reinforce bifurcation. CiteAudit introduces comprehensive benchmark and multi-agent framework for detecting hallucinated citations in scientific writing. PAN analysis of 11,000+ ChatGPT-generated citations quantifies 31% error rate (19% misattribution, 12% hallucination) across business research contexts. CoreProse synthesis documents citation fabrication drivers and mitigation strategies including RAG limitations. Scite adoption expands to University of Pretoria library integration. These developments show specialized verification tools continuing production deployment while general-purpose LLM citation failures remain quantified across independent benchmarks—reinforcing the leading-edge boundary of production deployment in bounded workflows with acknowledged limitations in autonomous verification.

  • 2026-Mar: Legal citation hallucination reached enforcement scale: practitioners documented 1,100+ AI-fabricated citations in court proceedings with courts imposing $15K-$31K fines and case disqualifications; attorney AI adoption tripled (11% to 30% YoY), intensifying the compliance pressure for verification workflows. Stanford EMNLP audit found only 51.5% citation recall across four generative search engines (nearly half of claims unsupported), and a Washington State University peer-reviewed study confirmed ChatGPT at only 60% above-chance accuracy on scientific hypotheses with just 16.4% accuracy identifying false statements. ACL FEVER 2026 advanced multimodal fact-checking via the REVEAL framework; Promptfoo integrated factuality evaluation as standard development infrastructure in healthcare, finance, and legal domains; and Full Fact's multi-year Arab Fact-Checking Network deployment (25+ organizations, 145+ fact checks published) demonstrated scaled adoption in resource-constrained markets while highlighting funding-dependency risk. Specialized tooling continues to mature and widen its lead over general-purpose LLMs for citation verification.

  • 2026-Apr: Citation hallucination crisis now quantified globally and driving institutional policy. Damien Charlotin (HEC Paris) database documents 1,227 hallucination cases across courts in 30+ countries (811 US), with precise taxonomy: 1,022 fabricated citations, 323 false quotes, 492 misrepresented holdings, accumulating ~5-6 new cases daily. Penalties escalated with $109K sanction in March 2026 marking highest AI-related legal penalty in US history. New Orleans City Attorney issued department-wide disclosure and annual certification policy. An independent cross-team exercise (April 2026) found only 20% of AI-generated citations verifiable, with patterns including Chimera, Context Swap, and Complete Fiction hallucinations. Google AI Overviews, serving 2B+ monthly users, showed 91% answer accuracy but over 50% of answers ungrounded, with citations not supporting stated claims. Full Fact continued producing large-scale operational fact-checks of AI-generated disinformation including deepfakes, synthetic videos, and fabricated imagery, demonstrating production-scale verification capability against emerging threats. Verification research advances clarify critical gaps: Clemson audit of 10 commercial LLMs shows 5-fold variation in citation fabrication (11.4%-56.8%); Kamiwaza AI study (172B tokens, 35 models) quantifies context-length effects (1.19% best-case at 32K, >10% at 200K); ACL EACL 2026 research isolates harmful factuality hallucination (HFH) treatable via prompt engineering (50% mitigation). World Bank case study demonstrates modular validation + citation requirements achieve perfect faithfulness in synthesis; Stanford evaluation across 15 LLMs shows 233% improvement with curated evidence RAG. The window consolidates: general-purpose LLM verification remains systematically unreliable with quantified hallucination across independent benchmarks; specialized tools demonstrate measurable detection and mitigation capability; institutional demand for verification infrastructure accelerates.

  • 2026-Jun: Multiple independent large-scale studies converged to sharpen the bifurcation between general-purpose LLMs and specialized verification tooling. An international audit (Columbia University-led, scanning 2.5M PubMed articles) identified 4,046 fabricated citations across 2,810 papers, with publishers taking action on fewer than 2% of flagged cases; a Cornell study of 2M+ papers showed LLM-written papers achieve higher writing complexity but lower journal acceptance, undermining traditional quality signals. Lenz.io's study of 1,000 real fact-checking claims across five frontier LLMs found 67% disagreement (Krippendorff's alpha 0.639, below reliability threshold) with 34% showing polar opposition, confirming single-model verification as structurally unreliable. AA-Omniscience independent benchmark (6,000 questions, 42 topics) found the top performer (Claude Fable 5) at only 61% factual accuracy. Stanford researchers tested six chatbots on 2,100 factual questions and found retrieval failures drive 70%+ of errors; Forum AI's election benchmark (3,136 prompts, expert-judged) found major chatbots fail verification 90% of the time. On the solution side: Google launched SynthID watermarking and C2PA Content Credentials into Search, Lens, and Chrome, reaching billions of users; OpenAI became a C2PA Conforming Generator; Amazon's adaptive audit-then-score protocol lifted fact-checking accuracy from 60.8% to 90.9%; multi-model verification (480M outputs across legal, financial, healthcare) reduced hallucination 61% (8.3%→3.2%); Factiverse deployed fine-tuned compact models across 114 languages, outperforming LLM baselines on multilingual verification. AFP's InVID-WeVerify plugin reached 159,000+ users across 224 countries. The verification pattern that works (specialized tooling, bounded domain, human review in the loop) is unchanged; the adoption gap is organizational and resource-based, not technical.