Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE → ESTABLISHED

Domain-specific RAG & cross-corpus question answering

LEADING EDGE

TRAJECTORY

Stalled

AI that performs retrieval-augmented generation over proprietary domain-specific corpora and answers questions across multiple knowledge bases. Includes specialised embedding and retrieval for technical domains; distinct from enterprise search, which targets general internal documentation.

OVERVIEW

Domain-specific RAG applies retrieval-augmented generation to specialized knowledge corpora (research databases, technical documentation, and proprietary knowledge bases) and to question answering that spans several such corpora. Unlike general-purpose QA, which retrieves from broad web indexes, domain-specific RAG requires precise embedding and ranking tailored to technical terminology, domain conventions, and structured data formats. The core tension is irreducible engineering complexity: domain-specific systems deliver higher accuracy and relevance for expert queries but demand custom embedding models, corpus curation, meticulous deployment discipline, and continuous production monitoring. By mid-2026, the field has achieved operational maturity: major cloud vendors ship production infrastructure with agentic retrieval for cross-corpus reasoning, practitioners have documented critical failure modes with mitigation patterns, and multiple regulated-industry deployments (legal, medical, financial) confirm viability for carefully scoped, continuously curated applications. However, no generic solution exists. Cross-domain generalization remains structurally brittle: vector search dilution in large corpora, domain-specific architectures that fail on adjacent domains, and retriever-generation misalignment persist as unresolved challenges. Production deployments succeed through discipline: domain-specific embeddings, hybrid retrieval (sparse+dense), cross-encoder reranking, rigorous evaluation frameworks, and acceptance of the per-domain tuning burden. Broader adoption remains blocked by this irreducible per-domain engineering investment and by architectural brittleness that affects all approaches equally.
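
To make the hybrid retrieval (sparse+dense) idea above concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking. The document IDs, the function name, and the constant k=60 are illustrative assumptions, not details taken from any deployment cited on this page.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a sparse (BM25) and a dense (vector) retriever.
sparse_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents seen by only one retriever are kept but ranked lower. In a fuller pipeline the fused list would typically be passed to a cross-encoder reranker, which is not shown here.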

CURRENT LANDSCAPE

By mid-June 2026, domain-specific RAG has consolidated into mature production practice with proven deployments in regulated domains but no universal solution. Vendor infrastructure matured significantly: Azure Build 2026 released Foundry IQ Serverless with agentic retrieval achieving 46–54% evidence recall improvement and 34% token cost reduction; Google Gemini Enterprise Platform shipped agentic RAG in public preview with a 34% factuality accuracy improvement over standard RAG via iterative multi-agent workflows and "sufficient context" classification layers (93% accuracy in determining whether retrieved evidence suffices); Amazon Bedrock and Microsoft continue expanding knowledge sources (Work IQ, Fabric Ontology, OneLake integration), enabling cross-source orchestration. Architectural consensus shifted from single-pass RAG toward domain-aware adaptive strategies: research demonstrates that query decomposition improves structured domains (DevOps +0.04 overall, +0.17 MRR) but degrades multi-hop reasoning precision, validating the necessity of domain-specific architecture selection rather than universal approaches. Vector search at scale emerged as a critical failure mode: June 2026 research identified vector search dilution (degradation from 75% accuracy at 54 documents to <40% at 1,128 documents and 88,907 chunks), solved via domain-scoped retrieval using organizational metadata. Cross-corpus robustness research introduced the cross-query consistency (CQC-RAG) framework, showing that semantically equivalent queries retrieve different results in multi-document corpora, causing 15%–20% retrieval performance degradation (over 40% in extreme cases), addressable via query rewriting and confidence stability filtering.
Production deployments confirm domain-specific viability with rigorous implementation: LinkedIn Hiring Assistant (1.3B+ profiles) achieved +24% Kappa quality improvement via custom MUSE embeddings; a health insurance RAG system showed 91% attribution accuracy and a 6% claims accuracy improvement; a Russian corporate document system improved Top-1 hit rate from 62% to 88% via hybrid search (BM25+vector RRF) and cross-encoder reranking. However, architectural pathologies block broader adoption: legal domain analysis identified three structural mismatches (mereological blindness, which treats text as flat chunks instead of a legal hierarchy; diachronic blindness, which ignores temporal ordering and precedent supersession; and causal opacity, which loses institutional provenance), applicable to any regulated domain (finance, medicine) requiring hierarchical and temporal reasoning. Enterprise RAG failures continue despite infrastructure maturity: Gartner tracks a 50% abandonment rate for GenAI projects, with the root cause traced to retrieval layer collapse (not the model), driven by multi-hop dependencies, conflicting regional policies, and structured data trapped in PDFs. Field consensus by June 2026 remains unchanged: domain-specific RAG delivers measurable results in narrowly-scoped, continuously curated applications with meticulous implementation (domain-adapted embeddings, hybrid retrieval, reranking, evaluation discipline), but broader adoption remains blocked by per-domain engineering investment, architectural brittleness, and unresolved retriever-generation misalignment in multi-step reasoning.
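
The domain-scoped retrieval mentioned above as the answer to vector search dilution can be sketched as a metadata filter applied before similarity ranking. This is only an illustration under assumptions: the chunk records, the "domain" field, and the two-dimensional vectors are hypothetical, and the cited research may implement scoping differently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scoped_search(query_vec, chunks, domain, top_k=3):
    """Rank only the chunks whose metadata matches the requested domain."""
    candidates = [c for c in chunks if c["domain"] == domain]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c["vector"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]


# Hypothetical chunk store; real systems would hold embeddings from a model.
chunks = [
    {"id": "hr-1", "domain": "hr", "vector": [0.9, 0.1]},
    {"id": "eng-1", "domain": "engineering", "vector": [0.8, 0.2]},
    {"id": "eng-2", "domain": "engineering", "vector": [0.1, 0.9]},
]

result = scoped_search([1.0, 0.0], chunks, domain="engineering", top_k=2)
print(result)
```

In this toy data the "hr-1" chunk is the closest vector to the query overall, yet it never competes because it sits outside the requested domain. That is the intended effect: shrinking the candidate pool so unrelated but superficially similar chunks cannot crowd out in-domain evidence.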

TIER HISTORY

Research: Jun-2023 → Apr-2024
Bleeding Edge: Apr-2024 → Oct-2025
Leading Edge: Oct-2025 → present

EVIDENCE (137)

— Production domain-specific RAG at billion-scale (1.3B+ profiles) using MUSE custom domain embeddings; achieves +4% HRR, −5% false positive rate, +24% Kappa quality improvement—evidence of leading-edge deployment maturity.

— Cross-query consistency framework addressing cross-corpus robustness: demonstrates semantically equivalent queries retrieve different results in multi-document corpora; solution achieves +4.76 pp EM on TriviaQA, +9.12 pp on MuSiQue.

— Multi-turn RAG evaluated across four distinct domains (finance, cloud docs, government, Wikipedia) with significant cross-domain performance variation; demonstrates domain-specific retrieval strategy selection is essential rather than universal.

— Peer-reviewed research identifying vector search dilution as critical failure mode in large cross-corpus RAG systems; proposes domain-scoped retrieval solution validated across 5 LLM backbones, 6 corpora, and named Wyoming DOTD deployment.

— Landmark analysis identifying three architectural pathologies (mereological, diachronic, causal blindness) showing RAG failures in legal domain are structural mismatches, not confabulation—directly applicable to regulated domains requiring hierarchy and temporal dynamics.

— Google Agentic RAG now in public preview on Gemini Enterprise Agent Platform with iterative multi-agent workflow; 34% factuality accuracy improvement vs standard RAG, FramesQA benchmark 90.1% on multi-corpus scenarios.

— Critical assessment of enterprise RAG failures: identifies failure occurs in retrieval layer (not generation); documents Enterprise RAG Gold Standard benchmark for realistic failure modes (multi-hop dependencies, conflicting regional policies, structured data in PDFs).

— Empirical comparison across domain-specific RAG (DevOps KB) and multi-hop reasoning benchmarks showing architecture effectiveness depends on domain characteristics; validates necessity of adaptive retrieval strategies rather than universal approaches.
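
The cross-query consistency finding listed in the evidence above (semantically equivalent queries retrieving different results) can be checked with a simple overlap measure. The sketch below is not the CQC-RAG method itself; it is an assumed, simplified stand-in that scores agreement between result sets with Jaccard overlap and keeps only documents that most paraphrases retrieve. All names and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Overlap of two result lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def consistency_score(results_per_paraphrase):
    """Mean pairwise Jaccard overlap across the paraphrases' result lists."""
    pairs = list(combinations(results_per_paraphrase, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def stable_documents(results_per_paraphrase, min_fraction=0.6):
    """Keep documents retrieved by at least min_fraction of the paraphrases."""
    n = len(results_per_paraphrase)
    if n == 0:
        return []
    counts = Counter(doc for result in results_per_paraphrase for doc in set(result))
    return sorted(doc for doc, c in counts.items() if c / n >= min_fraction)


# Hypothetical top-3 results for three paraphrases of the same question.
results = [
    ["d1", "d2", "d3"],
    ["d1", "d2", "d4"],
    ["d1", "d5", "d2"],
]

score = consistency_score(results)
stable = stable_documents(results)
print(score, stable)
```

A low score flags queries whose retrieved evidence depends on wording rather than meaning, and the stable subset is one plausible input for the kind of confidence stability filtering the research describes.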

HISTORY

  • 2023-H1: Foundational RAG research (parametric + non-parametric memory hybrids) gains traction. DeepPavlov releases production-grade KBQA system for structured knowledge bases. ACL 2023 publications showcase dense retrieval specialization and cross-encoder knowledge distillation. Early empirical studies on knowledge-graph RAG effectiveness across domains. No major enterprise deployments yet; practice remains in research and proof-of-concept phase.

  • 2023-H2: RAG field consolidates with comprehensive surveys and benchmarks. PrimeQA (IBM), RobustQA (multi-domain benchmark), and MiRAGE (evaluation framework for specialized corpora) advance tooling ecosystem. Research reveals persistent gaps: domain adaptation remains challenging across finance, medicine, law; corpus incompleteness limits retrieval-only approaches; embedding models struggle with domain terminology. Practitioner adoption hindered by testing complexity, immature tooling, and high customization costs. Cloud vendors provide RAG templates but target general documentation; specialized domain applications remain high-engineering-overhead endeavours.

  • 2024-Q1: Cloud vendors (Microsoft Azure) release production RAG tooling and agentic retrieval features, signaling ecosystem maturity. Research community consolidates foundations with comprehensive 353-paper surveys mapping RAG taxonomy. Practitioners immediately encounter real-world deployment barriers: vector search returns inconsistent results; indexing large documents fails at token limits; enterprise systems at scale show degraded accuracy. Domain-specific financial RAG research demonstrates techniques (chunking, query expansion, embedding fine-tuning) but requires significant customization. Signal balance: positive news on tooling and research consolidation offset by concrete evidence of deployment blockers—parameter tuning cannot resolve fundamental retrieval brittleness at enterprise scale.

  • 2024-Q2: Cloud vendors scale infrastructure aggressively: Azure announces 11x index capacity, 6x storage, 2x throughput improvements with Fortune 500 production adopters (OpenAI, KPMG, PETRONAS). Fine-tuning approaches gain traction: financial RAG and Adobe products demonstrate improved accuracy via embedding fine-tuning, iterative reasoning, and query expansion. However, cross-domain generalization remains poor (only 41.3% of RAG answers are preferred over human reference answers on a multi-domain benchmark). Scalability barriers emerge: Azure vector search fails beyond 1000 items; domain-specific fine-tuning doesn't transfer; retrieval degrades with corpus size. Signal balance: vendor investment and targeted successes offset by evidence of persistent cross-domain brittleness and hidden scalability cliffs.

  • 2024-Q3: Practitioner deployments mature in niche, well-scoped domains. Ophthalmology case study demonstrates 70K clinical documents with 54.5% accuracy and hallucination reduction via domain-specific RAG; vocal training system shows successful application to ultra-specialized corpora. SMART-SLIC framework advances KG+vector-store approaches to reduce LLM hallucinations without fine-tuning. However, fundamental failure modes persist: eight critical KG-RAG failure points identified (intent understanding, context alignment); EuroPython talk argues RAG success is highly domain-dependent with no universal solution; production deployments reveal seven recurring failure points (missing content, ranking, format, extraction) requiring extensive testing and optimization. By Q3 2024, domain-specific RAG is operationally viable for narrowly-scoped, curated corpora but remains fragile, parameter-sensitive, and failure-prone for broader applications or cross-domain generalization.

  • 2024-Q4: Cloud vendors release advanced RAG infrastructure—Microsoft announces multimodal RAG, agentic retrieval features, and expanded knowledge base sources (November 2024). Practitioner case studies show strong results in bounded domains: CMU/Pittsburgh domain case study achieves 42.21% F1 (vs 5.45% baseline) with 1,800+ documents; domain-adapted hotel customer service system reduces hallucinations significantly through fine-tuning. Research confirms persistent cross-domain brittleness: EMNLP 2024 benchmark (LFRQA, 26K queries, 7 domains) again finds only 41.3% of RAG answers preferred over human baselines. Critical analysis documents nine RAG limitations (retrieval quality, latency, scalability, transparency, bias, domain brittleness). By year-end, domain-specific RAG achieves reliable results only in narrowly-scoped, carefully curated use cases; broader enterprise deployment remains blocked by parameter sensitivity, scalability cliffs, and fundamental retrieval brittleness. No novel solutions emerged in Q4; vendors invested in infrastructure while practitioners continued workarounds.

  • 2025-Q1: Ecosystem maturity consolidates with AWS RAG Evaluation GA (March 2025) enabling systematic assessment of domain-specific applications. Real-world deployments expand across domains: automotive industry (CIKM 2025) achieves +1.79 factual correctness improvement from pilot RAG system; financial, biomedical, and cybersecurity corpora demonstrate 31-42% precision gains via token-aware chunking and domain-specific embeddings. Field-service management and business intelligence production deployments confirm practical adoption. However, comparative evaluations reveal limitations: large-scale study (20K queries, 400K KB) shows RAG underperforms DoRA on accuracy and latency in accuracy-critical domains. Practitioner reports document persistent barriers: semantic mismatches, outdated knowledge bases, domain-specific fine-tuning complexity. Optimal chunk sizing varies dramatically by domain (5 tokens for cybersecurity vs. 20+ for finance), requiring extensive per-domain experimentation. Signal balance: platform maturity and demonstrated production deployments offset by evidence that broader enterprise adoption remains constrained by customization intensity and parameter sensitivity. Domain-specific RAG capability proven for carefully scoped, expert-managed domains; generalization barriers unresolved.

  • 2025-Q2: Enterprise benchmarking validates domain-specific RAG applicability: EKRAG benchmark (May 2025) spans diverse corporate documents; SimRAG (April 2025) demonstrates self-training domain adaptation for science and medicine without labeled data. Competition-validated SIGIR 2025 LiveRAG winner achieves first place on 15M-document cross-corpus evaluation. However, Q2 2025 again surfaces persistent limitations: multi-document evaluation analysis documents hallucinations at scale and faithfulness-helpfulness tensions; KG-RAG failure modes emerge under knowledge incompleteness; production knowledge-base systems show semantic mismatch gaps (knowledge base searches returning 15 irrelevant results for specific queries). Signal balance: enterprise applicability for well-scoped domains confirmed through research benchmarks and vendor feature expansion, but production implementation friction and multi-domain generalization barriers persist. Broader enterprise adoption remains blocked by parameter sensitivity, evaluation complexity, and domain-specific tuning requirements.

  • 2025-Q3: Domain-specific RAG achieves production maturity with measurable deployment gains: embedding fine-tuning (Coxwave Align, July 2025) delivers 12% accuracy improvement and 6x training speedup; research frameworks (RAGen, QuARK, healthcare KG-RAG) advance automated domain adaptation via semantic chunking and knowledge graph integration (+13% on financial domain). However, critical accuracy plateau emerges: production systems stall at 75% accuracy (25% error unacceptable in finance/telecom/healthcare); vector-only RAG insufficient, graph-based semantic layers required (>95% vs. vector-only on policy domain). Cross-corpus studies reveal model-dependent effectiveness: small LLMs benefit (+22.87%) while large models underperform; no universal corpus selection; intelligent routing remains unsolved. Signal balance: embedding fine-tuning and KG integration enable gains in narrowly-scoped domains, but production accuracy ceiling and corpus selection brittleness block broader adoption in accuracy-critical applications.

  • 2025-Q4: Platform vendors scale infrastructure investment: Azure AI Search releases agentic retrieval (public preview) and AWS makes RAG Evaluation generally available. Adoption reaches 60% of production AI applications. However, real deployments expose persistent friction: Azure AI Search text splitting fails silently, GPT-4.1 agents invoke retrieval tools inconsistently, and configuration precision remains critical. Evaluation becomes a tier-1 concern: the RAGalyst framework demonstrates context-dependent performance across domains with no universal configuration. Production deployments show architectural variation: local-first RAG (SalesWorx, October 2025) using LlamaIndex and Qdrant achieves stability without vendor lock-in. Hallucination reduction claims (70-90%) coexist with evidence of narrower practical applicability; cross-domain generalization remains poor; the accuracy plateau at 75% persists. Domain-specific RAG achieves mainstream tooling and enterprise scale, but production readiness demands meticulous tuning, careful corpus curation, and acceptance of deployment complexity.

  • 2026-Jan: Critical architectural limitations emerge at scale: CorpusQA benchmark reveals standard RAG collapse on 10M token corpora (1.20% vs. 55.85% at 128K), driving bifurcation into caching for small corpora and agentic hypergraphs for complex reasoning. Domain-specific deployments (immunogenicity, policy QA) expose persistent retriever-generation misalignment despite coherent LLM outputs. Transferable RAG frameworks demonstrate cross-domain gains via domain routing but confirm domain-specific fine-tuning remains essential. Field consensus solidifies: monolithic RAG is obsolete; architecture must match corpus scale and reasoning complexity. Evaluation and risk assessment become mission-critical in high-stakes domains (healthcare, finance) where "lost in the middle" and outdated knowledge bases create tangible deployment risks.

  • 2026-Feb: Platform vendors accelerate infrastructure: Azure adds agentic retrieval portal support with expanded knowledge sources (OneLake, SharePoint, Web). Practitioner deployments succeed in narrowly-scoped domains with specialized architectures: product catalogs, resume search, field service management via caching/vector search; security maturity advances with sparse attention defenses against knowledge poisoning attacks. However, critical limitations surface: inferential QA research reveals current pipelines fail at indirect evidence reasoning; documented real-world failures expose tangible risks—healthcare harmful guidance from outdated KBs, finance compliance violations, telecom $2.3M in service credits for incorrect RAG outputs. Retriever-generation misalignment persists as unresolved deployment challenge. Field consensus crystallizes around realistic assessment: domain-specific RAG delivers value only in carefully curated, continuously managed applications; broader adoption remains blocked by architectural brittleness, retriever reliability gaps in specialized domains, and irreducible per-domain tuning burden.

  • 2026-Mar: EACL 2026 introduced the UDCG evaluation metric achieving 36% improvement in cross-domain QA correlation, while CrossRAG advanced multilingual retrieval across fragmented corpora. Amazon Bedrock Knowledge Adaptive QA and Meta AI/Google DeepMind's ART unsupervised dense retriever training pushed deployment infrastructure forward in regulated domains (medical, legal, financial). A production GraphRAG deployment at scale reduced hallucination by 72% serving millions of interactions; a multi-stage memory retrieval system achieved 96-100% single-session accuracy on cross-corpus QA at 115K+ token scale. These advances address architectural brittleness incrementally but do not resolve the core production accuracy ceiling; per-domain tuning burden and retriever-generation misalignment remain the primary adoption constraints.

  • 2026-Apr: Microsoft released Octen-Embedding-0.6B, a domain-adapted model for legal, finance, healthcare, and code that outperforms larger generic embeddings (0.7241 vs. 0.7139 on domain benchmarks), signalling an ecosystem shift toward specialized embeddings as standard practice. BEIR benchmark data quantified the domain-specialization advantage: fine-tuned embeddings improve retrieval by a relative +29% in medical (48%→62%), +34% in code (44%→59%), and +24% in legal (46%→57%), confirming that general models systematically fail on specialized domains. A practitioner postmortem found that document quality scoring alone improved search accuracy from 62% to 89% with no embedding or retrieval changes, establishing corpus curation as a first-order variable. Production failure modes gained sharper documentation: Azure AI Search analysis named three causes of silent vector drift (embedding model version mismatch, incremental corpus updates without re-embedding, inconsistent chunking), and a cross-industry postmortem of 12+ RAG deployments showed naive chunking and wrong embeddings as the primary precision failures; domain-adapted approaches raised precision from 54% to 81%. Practitioner evidence confirmed the viability threshold: RAG is justified for corpora above 500-1000 items with domain-specific entities but overengineered for smaller homogeneous datasets, reinforcing that domain-specific RAG succeeds only with meticulous specialization and continuous index governance.
Peer-reviewed production deployments demonstrate viability in high-stakes domains: Uber's Genie on-call copilot for security/privacy policies achieved 27% relative improvement in acceptable answers and 60% reduction in incorrect advice via agentic RAG; legal RAG system in Brazilian Portuguese (PROPOR 2026) deployed with metrics across 184,895 audited answers (81.7% legislation resolution, 47.1% jurisprudence resolution, 6.5% hallucination correction rate); systematic evaluation of medical RAG pipeline components (retrieval strategies, domain-specific embeddings) published peer-reviewed findings. SIGIR 2026 benchmark (ConflictQA) exposed cross-corpus QA brittleness: knowledge conflicts between textual documents and knowledge graphs reduce accuracy without explicit handling. April 2026 evidence synthesis: domain-specific RAG deployment succeeds across legal, medical, and security domains when architecture, embeddings, and corpus curation are meticulous; production readiness demands continuous monitoring (index drift), systematic evaluation (domain-specific benchmarks), and acceptance that generalization failures are structural, not fixable via retrieval optimization alone. Field consensus: domain-specific RAG is operationally mature for narrowly-scoped, high-stakes domains but remains a disciplined engineering practice, not a generic solution.

  • 2026-May: Retrieval infrastructure research and documented failure modes reinforced the discipline-over-defaults thesis. Biomedical RAG benchmarking (BioASQ, 5 retrieval strategies) confirmed cross-encoder reranking as the empirically superior retrieval approach (0.827 composite score, 0.852 contextual precision), establishing domain-specific strategy selection as mandatory rather than optional. A health-system-scale semantic search deployment (1.68M patients, 166M clinical notes, Qwen3 embeddings, 94.6% clinical QA accuracy at 237ms latency and $4K/month) demonstrated production viability at genuine enterprise scale. May 2026 research synthesis reinforced critical findings: (1) a peer-reviewed clinical embedding benchmark showed that domain context variables (corpus type, query format) explain 49% of the variance in retrieval performance, rivaling model choice (47.6%) and revealing that MTEB leaderboard rankings do not transfer to specialized clinical domains; (2) an industrial automotive QA study (AAAI 2026) confirmed RAG as the most cost-efficient and effective adaptation method for domain-specific closed-domain QA across both premium and open-source models; (3) a medical domain case study across 6 specialties demonstrated hybrid BM25+dense retrieval achieving 86% recall@5 vs 71% for BM25 alone (p<0.001), confirming complementary strengths; (4) leading-edge agentic retrieval research (BRIGHT-Pro benchmark) introduced aspect-aware evaluation and multi-step evidence portfolio construction for cross-corpus reasoning, shifting the paradigm from single-shot relevance matching to iterative evidence gathering.
Simultaneously, OCR robustness benchmarking revealed that high-accuracy OCR still causes downstream RAG failures due to structural and semantic errors across 11 document types, while a practitioner case study documented a silent embedding model mismatch (ONNX-quantized indexing vs. API queries) that degraded retrieval below similarity thresholds for months without detection. A critical production audit of 50 live deployments exposed seven failure modes with high prevalence (context gaslighting 76% finance, citation fabrication 81% legal, low-confidence drift 62% medical) under adversarial testing, reinforcing that sophisticated architectures and evaluation frameworks remain insufficient without rigorous deployment discipline. The consistent signal: domain-specific RAG delivers measurable results in industrial, medical, and financial domains when retrieval strategy, embedding consistency, corpus structure, and agentic reasoning patterns are treated as co-equal engineering concerns rather than defaults; production readiness demands meticulous tuning, continuous evaluation, and acceptance of fundamental cross-corpus brittleness.

  • 2026-Jun: Domain-specific RAG maturation reaches full operational consensus. Multiple production deployments confirm specialized domain viability: materials science RAG achieved 85.6% accuracy (4x baseline improvement); clinical off-guideline QA showed 56%→82% accuracy jump; Goodyear's agentic TechGraphRAG integrated 2,100+ academic papers with internal knowledge graphs; LinkedIn Hiring Assistant (1.3B+ profiles) achieved +24% Kappa quality improvement and −5% false positive rate via custom MUSE domain embeddings at billion scale; a Russian corporate document system improved Top-1 from 62% to 88% via hybrid BM25+vector RRF search and cross-encoder reranking. Vendor infrastructure advanced materially: Azure Build 2026 Foundry IQ Serverless agentic retrieval delivered 46–54% evidence recall improvement and 34% token cost reduction; Google Gemini Enterprise shipped agentic RAG in public preview achieving 34% factuality improvement over standard RAG with FramesQA benchmark at 90.1% on multi-corpus scenarios. Critical benchmarking and failure-mode research sharpened the discipline-over-defaults thesis: CQC-RAG demonstrated semantically equivalent queries retrieve different results in multi-document corpora (15–40% degradation), addressable via cross-query consistency filtering; vector search dilution at corpus scale (75% accuracy at 54 documents degrading to <40% at 1,128 documents) validated domain-scoped retrieval as solution; legal domain analysis identified three structural architectural pathologies (mereological, diachronic, causal blindness) that are not fixable via retrieval tuning alone—applicable to any regulated domain requiring hierarchy and temporal reasoning. 
Negative signals matched the positive ones: Stanford documented legal hallucinations persisting at 17-34% despite RAG (58-82% without); Fortune 500 post-mortems revealed catastrophic production failures ($12.2M trading loss, $4.1M legal settlement) when RAG was removed; enterprise root-cause analysis traced failures to retrieval layer collapse under multi-hop dependencies and conflicting regional policies, not generation model quality. Consensus unchanged: no generic solution exists; success requires domain-adapted embeddings, hybrid retrieval, cross-encoder reranking, and continuous corpus governance as co-equal engineering concerns.
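
The history above names embedding model version mismatch and inconsistent chunking as causes of silent vector drift. One way to surface such drift early is to record the settings used at index time and compare them with the settings in effect at query time. The sketch below assumes a hypothetical manifest with made-up field names and values; it does not cover the third named cause (incremental corpus updates without re-embedding), which would need per-document tracking.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IndexManifest:
    """Settings recorded with an index (all field names are hypothetical)."""
    embedding_model: str
    embedding_version: str
    chunk_size: int
    chunk_overlap: int


def find_drift(index_manifest, query_manifest):
    """Return the names of settings that differ between index and query time."""
    return [
        f.name
        for f in fields(index_manifest)
        if getattr(index_manifest, f.name) != getattr(query_manifest, f.name)
    ]


def check_compatible(index_manifest, query_manifest):
    """Raise instead of serving queries against a mismatched index."""
    drift = find_drift(index_manifest, query_manifest)
    if drift:
        raise ValueError("index/query configuration drift in: " + ", ".join(drift))


# Hypothetical case: index built with a quantized local model, queries
# embedded through a hosted API variant of the same model.
built = IndexManifest("example-embedder", "onnx-int8", 512, 64)
serving = IndexManifest("example-embedder", "hosted-api", 512, 64)

drifted_fields = find_drift(built, serving)
print(drifted_fields)
```

Failing loudly at startup or in a scheduled check is a deliberate choice here: the failure described in the history went unnoticed for months precisely because nothing raised an error.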

TOOLS