Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in one or two domains — delivered to your inbox while the index updates in the background.

Pick a domain below to explore its practices

BLEEDING EDGE

⌨️ SOFTWARE ENGINEERING
✍️ CONTENT & MARKETING
🔬 RESEARCH & KNOWLEDGE
⚖️ LEGAL, COMPLIANCE & RISK
🎧 CUSTOMER OPERATIONS
🏛️ AI GOVERNANCE & SAFETY
📊 DATA & ANALYTICS
🛡️ IT OPERATIONS & SECURITY
🎯 PRODUCT & DESIGN
💼 SALES & REVENUE
🎬 CREATIVE & GENERATIVE MEDIA
👁️ COMPUTER VISION & SENSING
💹 FINANCE & ACCOUNTING
🔄 OPERATIONS & PROCESS AUTOMATION
🚗 AUTONOMOUS SYSTEMS & VEHICLES
🦾 PHYSICAL AI & ROBOTICS
🎓 EDUCATION & LEARNING
PERSONAL EFFECTIVENESS

LEADING EDGE

⌨️ SOFTWARE ENGINEERING
✍️ CONTENT & MARKETING
🔬 RESEARCH & KNOWLEDGE
⚖️ LEGAL, COMPLIANCE & RISK
🎧 CUSTOMER OPERATIONS
🏛️ AI GOVERNANCE & SAFETY
📊 DATA & ANALYTICS
🛡️ IT OPERATIONS & SECURITY
🎯 PRODUCT & DESIGN
💼 SALES & REVENUE
🎬 CREATIVE & GENERATIVE MEDIA
👁️ COMPUTER VISION & SENSING
💹 FINANCE & ACCOUNTING
🔄 OPERATIONS & PROCESS AUTOMATION
👥 PEOPLE & TALENT
🚗 AUTONOMOUS SYSTEMS & VEHICLES
🦾 PHYSICAL AI & ROBOTICS
🎓 EDUCATION & LEARNING
PERSONAL EFFECTIVENESS

GOOD PRACTICE

⌨️ SOFTWARE ENGINEERING
✍️ CONTENT & MARKETING
🔬 RESEARCH & KNOWLEDGE
⚖️ LEGAL, COMPLIANCE & RISK
🎧 CUSTOMER OPERATIONS
🏛️ AI GOVERNANCE & SAFETY
📊 DATA & ANALYTICS
🛡️ IT OPERATIONS & SECURITY
🎯 PRODUCT & DESIGN
💼 SALES & REVENUE
🎬 CREATIVE & GENERATIVE MEDIA
👁️ COMPUTER VISION & SENSING
💹 FINANCE & ACCOUNTING
🔄 OPERATIONS & PROCESS AUTOMATION
👥 PEOPLE & TALENT
🚗 AUTONOMOUS SYSTEMS & VEHICLES
🦾 PHYSICAL AI & ROBOTICS
🎓 EDUCATION & LEARNING
PERSONAL EFFECTIVENESS

ESTABLISHED

⌨️ SOFTWARE ENGINEERING
✍️ CONTENT & MARKETING
🛡️ IT OPERATIONS & SECURITY
🎯 PRODUCT & DESIGN
💹 FINANCE & ACCOUNTING
👥 PEOPLE & TALENT

🔬 Research & Knowledge

AI for finding, synthesising, verifying, and preserving organisational knowledge. Mostly leading-edge: literature review, competitive intelligence, and knowledge management tools are maturing quickly with five practices actively advancing. The main constraint is hallucination risk — fact-checking and source verification still require human oversight in high-stakes contexts.

14 practices: 2 good practice, 11 leading edge, 1 bleeding edge

Where AI Stands in Research & Knowledge

Research and knowledge management is the domain where AI's promise and its limits are most nakedly visible. The fourteen practices we track span the full maturity spectrum -- from enterprise search and meeting intelligence, which have crossed into proven organizational infrastructure, to multi-step autonomous deep research, which remains stubbornly experimental despite hundreds of millions in venture funding. The arc across all of them bends the same way: retrieval and triage work; synthesis and verification do not yet clear the bar for unsupervised use.

The numbers tell a story of adoption without resolution. Enterprise search and RAG now underpin 70-80% of large organizations' AI deployments, with 92% of adopters reporting ROI within twelve months. Meeting intelligence has reached $300M+ ARR at Gong and 25 million users at Otter.ai. Email summarization ships to three billion Gmail users by default. Perplexity alone has 50 million monthly active users and $450M ARR. These are not pilots. They are infrastructure.

Yet the reliability ceiling has not moved. Microsoft Research's DELEGATE-52 benchmark shows every frontier model corrupts roughly 25% of content in extended document workflows. A new AutoResearchBench evaluation finds Claude Opus 4.6 at 9.39% accuracy on multi-step research tasks, GPT-5.4 at 7.44%, and Gemini 3.1 Pro at 7.93% -- not on obscure challenges, but on basic closure and state-tracking operations that research workflows demand. Citation accuracy remains the worst-performing capability across the board: a 5,000-prompt benchmark across five frontier models measured 12.4% average hallucination on citation tasks, and retrieval grounding is the only technique that materially reduces it (75-90% reduction versus 5-15% for prompt engineering alone). The domain's central paradox is that organizations are deploying these tools at enormous scale while knowing they fail on the tasks that matter most.
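A citation hallucination rate like the 12.4% figure above can be pictured as a simple set comparison: each citation a model emits either resolves against a verified source registry or it does not. The sketch below is illustrative only — the citation format and the source registry are assumptions, not the benchmark's actual methodology.

```python
# Minimal sketch: scoring citation hallucination against a verified source set.
# The DOI-style identifiers and the registry are illustrative assumptions.

def hallucination_rate(cited: list[str], verified: set[str]) -> float:
    """Fraction of emitted citations that do not resolve to a known source."""
    if not cited:
        return 0.0
    fabricated = sum(1 for c in cited if c not in verified)
    return fabricated / len(cited)

verified_sources = {"doi:10.1000/alpha", "doi:10.1000/beta"}
model_citations = ["doi:10.1000/alpha", "doi:10.9999/ghost"]  # one fabricated
print(f"{hallucination_rate(model_citations, verified_sources):.1%}")  # 50.0%
```

The hard part in practice is not the arithmetic but populating the verified set — which is why retrieval grounding, where every citable claim is tied to a retrieved document, is the only technique the benchmark found to materially reduce the rate.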

What's New, 2026-04-22 to 2026-05-06

No practices changed tier or trend in this cycle. The domain's structure held: two practices at good practice (enterprise search and meeting intelligence), eleven at leading edge, and one at bleeding edge (deep research). This stability is itself informative. Despite heavy vendor activity and substantial new evidence across all fourteen practices, nothing moved the needle on the fundamental constraints that keep most of this domain at leading-edge rather than mainstream.

The most consequential new evidence concerns the reliability floor. AutoResearchBench -- the first large-scale benchmark designed to test multi-step research end-to-end -- found all frontier models scoring below 10% on tasks requiring state tracking and constraint verification, revealing that the bottleneck in autonomous research is not retrieval access but architectural closure. Separately, a Stanford HAI analysis documented that models fail 86% of the time when users assert false premises, a finding with direct implications for any research tool where queries contain embedded assumptions. In domain-specific RAG, new biomedical benchmarking confirmed cross-encoder reranking as empirically superior (0.827 composite score), while a health-system deployment covering 1.68 million patients demonstrated 94.6% clinical QA accuracy at 237ms latency -- proving that domain-specific RAG works at genuine enterprise scale when engineering discipline is applied.

In verification, the legal enforcement arc steepened. The Federal Court of Australia mandated AI disclosure and citation verification, joining a growing list of jurisdictions where verification is no longer optional. A global database now tracks 1,348 hallucination cases across 30+ countries, with sanctions reaching $109,700 in a single Oregon case. Sullivan & Cromwell -- one of the world's most elite law firms -- publicly acknowledged failing to catch AI hallucinations in a bankruptcy filing, demonstrating that governance maturity gaps persist even in well-resourced environments. On the tools side, Gartner published its inaugural Magic Quadrant for Competitive & Market Intelligence Platforms, positioning Crayon, Klue, and AlphaSense as Leaders -- a validation milestone for competitive intelligence as a strategic category.

Key Tensions

  • The reliability floor is architectural, not incremental. AutoResearchBench's finding that frontier models score 7-9% on multi-step research tasks is not a benchmarking curiosity -- it identifies fundamental defects in closure (state tracking, constraint verification) and evidence aggregation. These failures compound across steps, as the HORIZON benchmark separately confirmed through cumulative error degradation after 20+ actions. Incremental model improvements have not moved this floor. Princeton-backed analysis documented that eighteen months of capability gains yielded zero reliability improvement for production agents. Organizations planning autonomous research workflows should budget for human verification at every synthesis step, not as a temporary measure but as a structural requirement.

  • Citation accuracy is the domain's weakest link, and enforcement is catching up. Across five frontier models tested on 5,000 prompts, citation accuracy measured at 12.4% hallucination -- worse than factual recall or code reference tasks. In legal filings alone, 1,348 documented hallucination cases have generated sanctions up to $109,700. The Federal Court of Australia now mandates disclosure and manual verification. NIST AI 600-1 designates confabulation as a Tier 1 risk requiring pre-deployment testing. The gap between citation generation capability and citation verification infrastructure is the single largest liability exposure in the domain. Organizations deploying any research AI in regulated contexts face a rapidly narrowing window to implement verification workflows before enforcement overtakes them.

  • Enterprise RAG works -- but only as a discipline, not a product. The evidence is increasingly precise about what separates successful RAG deployments from failed ones. A mid-market case study showed accuracy jumping from 62% to 94% through chunking, hybrid search, and re-ranking changes -- with zero model changes. Document quality scoring alone improved search accuracy from 62% to 89% in a separate postmortem. Biomedical benchmarking confirmed cross-encoder reranking as the empirically superior retrieval strategy. A health system processed 166 million clinical notes at 94.6% accuracy. But dense retrievers drop from 66.4% to 5-27.9% recall on high-similarity enterprise corpora, and silent embedding model mismatches can degrade retrieval for months without detection. The practice rewards meticulous engineering and punishes defaults.

  • Meeting intelligence is legally fragile despite commercial success. Gong, Otter.ai, and Fireflies.ai collectively serve over 50 million users with documented productivity gains (77% higher revenue per rep at Gong, 18% email reading time reduction in Microsoft's trial). Yet active class-action litigation threatens the category's consent frameworks. Brewer v. Otter.ai alleges wiretapping and biometric privacy violations; Cruz v. Fireflies.ai (May 2026) targets voiceprint collection under BIPA. Stanford, Oxford, and Tufts have blocked AI bots from meetings entirely. Real-meeting transcription accuracy runs at 8-12% word error rate versus 2.7% on clean audio. The category's expansion beyond revenue operations -- into board rooms, regulated industries, multilingual teams -- runs directly into litigation risk, accuracy degradation, and institutional rejection that no vendor has structurally resolved.

  • Organizational readiness, not model capability, is the binding constraint. Deloitte found 60% AI adoption but only 40% data management maturity. Stanford's AI Index confirmed 88% organizational AI usage alongside an emerging "presence versus execution gap." Cisco's survey shows 85% of enterprises piloting agents but only 5% trusting them for production. Only 23% of organizations scale AI past pilot stage, and just 6% report measurable EBIT impact. Knowledge management infrastructure -- taxonomies, ontologies, governed data -- is now recognized as foundational to enterprise AI, yet 80% of enterprises planning knowledge graph adoption stall before production due to ontology design complexity and entity resolution challenges. The domain's maturity is bottlenecked not by what AI can do but by whether organizations can build the governance, data quality, and process redesign required to use it safely.
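The "discipline, not a product" point about enterprise RAG is concrete at the retrieval layer: hybrid search typically merges a lexical (BM25-style) ranking with a dense-embedding ranking before any reranking happens. One common merge strategy is reciprocal rank fusion (RRF). The sketch below uses hand-written rankings as stand-ins for real index queries; the document IDs and the k=60 constant are illustrative assumptions, though k=60 is a conventional default for RRF.

```python
# Sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Each ranked list contributes 1/(k + rank) to a document's fused score,
# so documents that rank well in BOTH lexical and dense retrieval rise.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_tax_2024", "doc_onboarding", "doc_expenses"]   # BM25-style hits
dense   = ["doc_expenses", "doc_tax_2024", "doc_travel"]       # embedding hits
print(rrf_fuse([lexical, dense])[0])  # doc_tax_2024
```

In the deployments cited above, the fused top-k would then pass through a cross-encoder reranker — the step the biomedical benchmarking singles out as where accuracy gains concentrate. Defaults at any of these stages (chunking, fusion, reranking) are exactly where the 62%-versus-94% accuracy gap opens up.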

Top 10 Evidence Items

  1. AI Hallucination Rate Benchmarks 2026: 5-Model Study (adoption-metric) — The 12.4% citation hallucination rate across five frontier models on 5,000 prompts is the single statistic that most precisely defines the domain's central liability: organizations are deploying at scale while the worst-performing capability is the one their users trust most. https://www.digitalapplied.com/blog/ai-model-hallucination-rate-benchmarks-2026-study

  2. The Most Expensive Hallucination of 2026: A Court Filing Goes Sideways (case-study) — A $109,700 sanction for 15 fabricated case citations is not an outlier but a data point in an accelerating enforcement curve; the article's 12-line Python verifier using CourtListener also demonstrates that mechanized citation checking is already feasible, making the continued failure to deploy it an organizational choice rather than a technical constraint. https://dev.to/gabrielanhaia/the-most-expensive-hallucination-of-2026-a-court-filing-goes-sideways-1d3b

  3. AI Hallucinations in Law Firms: What Lawyers Must Know (2026) (case-study) — The escalation from ~2 incidents/week in early 2025 to 2-3/day by late 2025 -- including Sullivan & Cromwell's acknowledged failure -- demonstrates that governance maturity gaps persist even in well-resourced environments with established verification procedures, directly supporting the summary's claim that the enforcement window is narrowing. https://www.getvoibe.com/resources/ai-hallucinations-law-firms/

  4. Federal Court of Australia issues practice note on AI-generated submissions (case-study) — Mandatory disclosure and manual citation verification from a federal judiciary, not a regulator, signals that verification requirements are entering jurisdictions that cannot be opted out of, making this a harder constraint than the voluntary frameworks most organizations have been managing against. https://ienvi.com.au/federal-court-of-australia-issues-practice-note-on-ai-generated-submis-0d99054a/

  5. Health System Scale Semantic Search Across Unstructured Clinical Notes (research-paper) — A children's hospital processing 166 million clinical notes at 94.6% accuracy and 237ms latency for $4K/month is the strongest available counter-evidence to the reliability narrative: domain-specific RAG does work at genuine enterprise scale when engineering discipline -- not default settings -- is applied. https://arxiv.org/abs/2604.25605

  6. Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation (research-paper) — The empirical confirmation that cross-encoder reranking achieves 0.827 composite score versus inferior alternatives on BioASQ provides the domain-specific evidence underpinning the summary's claim that RAG rewards meticulous engineering choices; organizations choosing retrieval strategies by default rather than benchmark are accepting preventable accuracy degradation. https://arxiv.org/abs/2605.02520v1

  7. Seventh Circuit Class Action on AI Meeting Assistants and Biometric Privacy: Governance Lessons from the Fireflies.AI Lawsuit (news-coverage) — Cruz v. Fireflies.ai (May 2026) targeting voiceprint collection under BIPA is the sharpest illustration that meeting intelligence's commercial success has outpaced its consent infrastructure, and statutory damages exposure -- not just reputational risk -- is now the category's binding legal constraint. https://natlawreview.com/article/ai-meeting-assistants-and-biometric-privacy-governance-lessons-firefliesai-lawsuit

  8. Your AI Notetaker May Already Be Breaking the Law (news-coverage) — The Brewer v. Otter.ai consolidated class action with documented design-level consent failures in a 25-million-user product is the domain's clearest example of scale enabling rather than insulating against legal exposure; it reframes the meeting intelligence litigation arc from isolated suits to a category-level governance failure. https://www.reworked.co/digital-workplace/lawsuit-ai-notetaker-liability-risk-management/

  9. AlphaSense Named a Leader in Inaugural Gartner Magic Quadrant for Competitive and Market Intelligence (industry-report) — Gartner's first MQ for this category is a maturity milestone that institutionalizes competitive intelligence as a buying category rather than a discretionary tool, signaling that the window for differentiation through vendor selection is closing as enterprise procurement standardizes around the Leaders quadrant. https://www.globenewswire.com/de/news-release/2026/04/24/3280934/0/en/alphasense-named-a-leader-in-inaugural-gartner-magic-quadrant-for-competitive-and-market-intelligence.html

  10. Deloitte State of AI in the Enterprise 2026: Mid-Market Execution Gap (industry-report) — The 20-point gap between AI adoption (60%) and data management maturity (40%) across 3,235 surveyed leaders is the structural proof that the domain's constraint is organizational readiness rather than model capability -- the bottleneck the summary identifies as the binding limit on whether any of the fourteen practices can move from pilot to production. https://mybusinessfuture.com/en/deloitte-ai-enterprise-report-execution-gap/
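Evidence item 2's point that mechanized citation checking is already feasible can be made concrete. The sketch below extracts reporter-style US case citations with a simplified regex and checks them against a local set; in a real verifier that set would be replaced by a lookup against a service such as CourtListener. The regex pattern, the sample filing, and the local registry are all illustrative assumptions, not the article's actual 12-line script.

```python
import re

# Simplified reporter-citation pattern ("410 U.S. 113", "999 F.3d 12345").
# Real citation grammars are far richer; this covers a narrow illustrative slice.
CITATION_RE = re.compile(
    r"\b\d{1,4}\s+(?:U\.S\.|F\.(?:2d|3d|4th)|S\.\s?Ct\.)\s+\d{1,5}\b"
)

def check_citations(text: str, known: set[str]) -> list[str]:
    """Return citations found in the filing that fail verification."""
    return [c for c in CITATION_RE.findall(text) if c not in known]

filing = "Compare 410 U.S. 113 with the fabricated 999 F.3d 12345."
known_citations = {"410 U.S. 113"}  # stand-in for a real lookup service
print(check_citations(filing, known_citations))  # ['999 F.3d 12345']
```

A check this cheap, run before filing, is the "organizational choice rather than a technical constraint" that item 2 describes: the failure mode documented in items 2-4 is not that verification is hard, but that it is skipped.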