The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.
A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.
AI that generates assessment questions, quizzes, and examinations at specified difficulty levels, covering defined topics. Includes distractor generation and difficulty calibration; distinct from interview question generation in HR, which targets hiring rather than education.
AI-generated assessment questions have reached leading-edge maturity: the technical bar is cleared (peer-reviewed parity with expert items in specialized domains), vendor tooling is normalized into major platforms (Canvas New Quizzes, used by 78% of institutions, now integrates AI question authoring), and deployment at scale is underway (800+ schools, 500K AI-generated questions in production). But formative use and high-stakes institutional assessment remain starkly bifurcated. Formative use (study aids, practice quizzes, low-stakes classroom review) is approaching ubiquity: 28-76% of educators use generative tools depending on survey and context, with documented learning gains and established classroom workflows. High-stakes examination stays locked behind unresolved barriers: distractor quality flaws persist (weak distractors, design mismatches), consistency gaps limit reliability (lower discrimination indices on pediatric MCQs, inconsistency across LLM runs), domain-specific accuracy risks are documented (49.6% of medical chatbot responses problematic, with gaps in neurology and dermatology), and governance frameworks remain absent. The shift from bleeding-edge to leading-edge reflects proven capability and production scale; the gap to mainstream deployment awaits quality assurance standardization, domain-specific validation, and institutional governance. These barriers are tractable but not yet resolved.
Platform-level normalization is now evident: Canvas New Quizzes (78% of institutions, 32M quizzes created in 2025) integrates AI question authoring via IgniteAI; MangoApps released document-to-quiz generation in April 2026; specialized SaaS platforms (QuizMaker, which reports 10,000+ schools, plus QuizMagic, ConductExam, Edzo, PressPrimer, and Quizify) address K-12, higher education, and self-hosted ecosystems; a June 2026 market survey documents 10 mature platforms supporting standards alignment, adaptive difficulty, multi-modal items, and LMS integration across K-12, higher education, corporate training, and professional certification contexts; and enterprise platforms are embedding the capability as standard.
Institutional deployments at scale are underway: AssessPrep operates across 800+ schools in 85+ countries with 4M+ assessments delivered and 500K AI-generated questions; SchoolAI reports 500K personalized learning sessions in six months with documented classroom gains (McNulty Academy students became 'more intentional with explanations'). Khan Academy's Khanmigo reached 1 million U.S. students and rolled out at Phillips Academy Andover; its May 2026 announcement of a new "Assessments" product signals a transition to structured assessment design with psychometrics and norming. Pearson Study Prep, deployed at scale (62,000+ students, Fall 2025), reported 90% higher proficiency rates and 60% higher mastery with goal-setting, validating formative adoption with measurable learning gains. Independent adoption has reached beyond traditional education markets: Japan's AI Passport Quiz App recorded 10 million uses in ~16 months (launched May 2024), demonstrating grassroots adoption of AI-generated assessment in professional certification contexts. Major testing vendor PSI (ETS subsidiary) demonstrates production-grade governance: 77.4% of AI-generated items meet psychometric thresholds at parity with human-authored items, using multi-agent AI validation with expert SME review.
The U.S. Department of Education (IES) invested $3.6M over 3 years (2024-2027) in AI-enhanced scenario-based assessment authoring, signaling policy-level recognition of both opportunity and need. Practitioner workflows are now established: educators across K-12 use Gen AI tools (Conker, Quizizz, Formative, MagicSchool) in year-long classroom deployments, generating pre-assessments and analyzing item performance, and the production workflow norm requires teacher review of AI-generated questions before deployment. Speed and cost gains are proven: the OECD documents 10x reductions in question paper creation; educators report 60-80% time savings; teachers cite assessment creation as their top AI use case (76% in LATAM contexts); and a UK survey shows 76% of teachers use AI, with quiz generation explicitly listed as a common use case.
But formative scaling masks persistent barriers to high-stakes deployment. A RAND survey of 4,200 K-12 teachers shows only 38% rate AI assessment questions for higher-order thinking as good/excellent; 42% say the questions need significant editing. Peer-reviewed research documents classical MCQ design flaws (weak distractors, lower discrimination indices on AI-generated items vs. human-authored ones); a randomized trial in medical education found AI items rated "easy" at twice the rate of expert items despite equivalent performance. Medical accuracy research (June 2026) reveals systemic quality risks: a BMJ Open study found 49.6% of medical chatbot responses problematic (19.6% highly problematic); a Penn State study on medical questions shows ChatGPT-4o at 84.6% accuracy but Llama3-8b at roughly 50%, with domain-specific weakness in neurology and dermatology. Critical expert assessment identifies risks specific to medical education (fact extraction errors, hallucination, and curriculum mismatches) that render AI-generated high-stakes content dangerous without human review.
Fundamental reliability concerns have also emerged. The Stanford AI Index (May 2026) documents 22-94% hallucination rates across 26 frontier models, with models most overconfident precisely where they are wrong (the hard-easy effect). This is a critical failure mode for assessment, because it strikes where human supervisors most need reliable oversight. A Stanford news study (May 2026) found that models collapse from >95% accuracy to 19% (GPT-5) when false premises are introduced, signaling an exam validity risk: AI cannot reliably detect flawed question premises. O'Malley's synthesis of four independent 2026 peer-reviewed medical education studies shows mixed results: Claude 3.5 Sonnet scored 86% in expert evaluation, but all four studies independently conclude that machine judgment needs human oversight and that human-in-the-loop review remains non-negotiable. A critical limitation emerged in May 2026: empirical research documents a validity trade-off in which AI-assisted assessments boost observable performance by 30 points but collapse assessment reliability (Cronbach's α drops from 0.87 to 0.31), degrading diagnostic validity and ability discrimination. An integrity vulnerability is also documented: a large-scale study of 95,000+ students at 20 universities shows 37% use AI on assignments and 9% have used it to cheat, motivating urgent assessment redesign. Concerns about research reliability were exposed when a widely cited meta-analysis claiming "large positive" ChatGPT effects (262 peer-reviewed citations) was retracted for methodological discrepancies, revealing premature claims circulating in the field. Governance remains the binding constraint: copyright liability is unresolved, evaluation methods are unstandardized, integrity frameworks are absent, and research validation gaps undermine institutional confidence. Assessment design experts argue the solution is not surveillance or detection but structural redesign toward process portfolios, in-class components, and oral defenses that make learning visible rather than attempting to lock questions against misuse.
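The reliability figure quoted above, Cronbach's α, measures how consistently the items of a test rank the same students. As a rough illustration of what a drop from a high to a low α means, the sketch below computes α with the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total scores). The two response matrices are hypothetical and invented for this example; they do not reproduce the data or the 0.87 and 0.31 values from the cited research.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: one row per student, one column per item."""
    n_students = len(scores)
    n_items = len(scores[0])
    if n_students < 2 or n_items < 2:
        raise ValueError("need at least 2 students and 2 items")
    # Sample variance of each item (column) across students.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each student's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 responses (5 students x 4 items), invented for illustration.
# In `consistent`, stronger students get every item that weaker students get.
consistent = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# In `noisy`, correctness on one item says little about the others.
noisy = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]

alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
print(f"consistent items: alpha = {alpha_consistent:.2f}")  # about 0.80
print(f"noisy items:      alpha = {alpha_noisy:.2f}")       # negative, i.e. unreliable
```

A low or negative α means the total score no longer separates stronger from weaker students, which is the sense in which diagnostic validity degrades even when raw scores rise.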
— 2026 market survey documenting 10 AI platforms: standards alignment, adaptive difficulty, multi-modal items, LMS integration, analytics; represents mature ecosystem across K-12, higher education, corporate training, test prep.
— K-12 teacher workflow using Gen AI to generate pre-assessment questions, design lessons via ALDO framework, and analyze item difficulty (Q1 46.67% correct, Q5 20%); demonstrates practical AI-assisted question generation in classroom practice.
— Year-long classroom deployment guide with 4 tools (Conker, Quizizz, Formative, MagicSchool); documents limitation that AI-generated questions require review, establishing production workflow for formative assessment use.
— AssessPrep deployment metrics: 800+ schools, 4M+ assessments delivered, 500K AI-generated questions, demonstrating production-scale institutional adoption across international curricula (IB DP/MYP, Cambridge IGCSE, A-Level, Edexcel).
— BMJ Open study (Tiller et al.): 49.6% of medical chatbot responses problematic (30% somewhat, 19.6% highly); hallucinated citations across all models; signals critical quality risk for AI-generated medical exam content without human review.
— Peer-reviewed study with real student performance data from Japanese junior high EFL classroom: LLM-generated grammar exercises show pedagogically sound question design; cloze tasks showed highest cognitive load, demonstrating successful formative deployment with learning outcome analysis.
— Penn State study of four chatbots on 212 medical questions: ChatGPT-4o 84.6%, Llama3-8b ~50%; domain-specific weakness in neurology/dermatology; physician review warns against overreliance, highlighting domain expertise requirement for assessment validity.
— Japanese AI Passport Quiz App reached 10 million uses in ~16 months (launched May 2024), generating true/false questions for AI literacy certification; demonstrates independent, grassroots adoption of AI-generated assessment at scale outside US/UK markets.
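Several of the evidence items above rely on classical item analysis: item difficulty (the share of students answering correctly, as in the "Q1 46.67% correct, Q5 20%" figures) and the discrimination index (how much better high scorers do on an item than low scorers). The sketch below shows one common way to compute both, using the upper-lower group method with the conventional 27% groups. All response data, total scores, and function names are hypothetical and invented for illustration; the class size of 15 is only an assumption that happens to be consistent with the quoted percentages.

```python
def item_difficulty(item_responses):
    """Proportion of students who answered the item correctly (0/1 responses)."""
    return sum(item_responses) / len(item_responses)


def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-lower discrimination index: p(correct | top group) - p(correct | bottom group)."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank students by total test score, highest first. Python's sort is stable,
    # so students with tied scores keep their original order.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(item_responses[i] for i in upper) / group_size
    p_lower = sum(item_responses[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical class of 15 students, listed from highest to lowest total score.
total_scores = [5, 5, 4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 0, 0]
# Q1: answered correctly by the 7 strongest students (7/15 correct).
q1 = [1] * 7 + [0] * 8
# Q5: 3/15 correct, but scattered across strong, middling, and weak students.
q5 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]

for name, item in (("Q1", q1), ("Q5", q5)):
    p = item_difficulty(item)
    d = discrimination_index(item, total_scores)
    print(f"{name}: difficulty = {p:.2%}, discrimination = {d:.2f}")
```

In this made-up data Q1 separates strong from weak students cleanly, while Q5 is both hard and non-discriminating. An item like Q5 is the kind a teacher review step would be expected to flag before reuse.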
2023-H1: Early open-source projects (Quizify, AI-Quiz-Generator) and consumer tools (Conker.ai, QwizLab) demonstrated proof-of-concept that generative models could create quizzes and exams from text; institutional deployments remained minimal.
2023-H2: Khan Academy deployed Khanmigo with quiz generation at scale (thousands of users); research validated AI performance on degree-level mathematics exams; iQS was deployed in university Moodle systems. Adoption surveys showed educators saw utility but lacked confidence; accuracy concerns and slow institutional trust remained the primary barriers.
2024-Q1: Consumer adoption accelerated (Conker: 600k quizzes, QuizFlex: 10k+ educators, QuizGeniusAI: 300+ educators). Research identified specific gaps: EDM 2024 study found GPT-4 achieved 70% validity for question stems but only 37% for distractors/misconceptions. Pearson VUE's high-stakes testing research showed time-saving benefits but persistent issues with cognitive level calibration. Critical assessment from AQA highlighted unresolved IP, bias, and reliability concerns—signalling adoption barriers in formal assessment despite technical progress.
2024-Q2: Ecosystem maturation continued: Khan Academy expanded Khanmigo with free AI question generators for all teachers; forms.app's comparative analysis identified 12+ mature quiz generation platforms across the market. Questgen and other tools demonstrated multi-format generation capability. However, critical integrity research from the University of Reading (June 2024) revealed that AI-generated exam submissions went undetected 94% of the time and earned grades half a grade boundary higher than real students, a stark demonstration of assessment integrity risk. Institutional resistance intensified: the American Board of Radiology formally banned AI-generated exam content, citing copyright and integrity concerns. The tension sharpened: widespread consumer adoption and vendor expansion collided with accumulating evidence of both technical limitations (distractor quality, cognitive calibration) and integrity risks (undetectable submissions, grade inflation), creating a widening gap between bleeding-edge tooling and high-stakes institutional deployment readiness.
2024-Q3: Vendor consolidation and geographic scaling: Khan Academy made Khanmigo free to teachers in 49 countries via a Microsoft partnership (Aug 2024), then integrated it into Canvas LMS for U.S. educators (Sept 2024). Research highlighted evaluation as the key bottleneck: an AAAI 2024 conference paper documented that question quality assessment methods limit full integration into education; an arXiv review catalogued AI's capabilities (Bloom's-aligned questions) and gaps (ethical, accuracy, consistency). Project failure rates remained high (RAND: 80% of AI projects fail), and deployment risks intensified, with documented guardrail failures in production systems, benchmark validity critiques, and evidence that 80%+ of enterprise AI projects were abandoned. The practice reached a critical inflection: consumer adoption was evident and accelerating, but institutional deployment remained blocked by evaluation uncertainty, integrity concerns, and widespread deployment failures across AI systems generally.
2024-Q4: Adoption continued: the HMH 2024 survey showed 50% of educators using GenAI tools (5x increase YoY) with assessment creation among top use cases; Khan Academy's efficacy study demonstrated a ~350K student user base with measurable learning gains, confirming scale and real classroom impact. Vendor consolidation persisted: ABMS boards piloted AI question generation for medical certification (American Board of Anesthesiology), though the pilot was paused due to copyright concerns. Real-world deployment risks materialized: the NSW Education Standards Authority's October 2024 HSC exam included an AI-generated image, sparking student complaints about authenticity and suitability and illustrating quality acceptance barriers in high-stakes contexts. A critical finding emerged: only 37% of enterprises reported GenAI applications production-ready, with quality and governance as top barriers, a structural constraint directly limiting institutional adoption of question generation tools. The practice remained bifurcated: consumer and formative assessment adoption accelerated toward ubiquity (50%+ educator usage), yet high-stakes institutional assessment deployment remained blocked by unresolved evaluation methods, integrity concerns, copyright ambiguity, and enterprise-wide deployment readiness gaps.
2025-Q1: Accuracy research reinforced the barriers: a BBC study found 51% of AI responses contained significant factual errors; a Columbia study documented >60% incorrect answers to news questions, including fabricated citations. Medical education research (BMC Medical Education) validated methods for measuring question quality. Enterprise adoption data remained grim: 95% of AI pilots failed to deliver measurable value (MIT study); 42% of businesses abandoned AI initiatives in early 2025, up from 17% six months prior. Formative assessment adoption continued at scale, but institutional deployment remained constrained by accuracy risks, evaluation method immaturity, and enterprise-wide AI adoption failure rates exceeding 90%.
2025-Q2: Technical research advanced incrementally: NAACL 2025 introduced ConQuer framework with 4.8% quality improvements and 77% pairwise win rate over baselines. However, institutional adoption barriers deepened: Chinese exam authorities disabled AI tools during gaokao to prevent cheating, a concrete signal of assessment integrity concerns. Enterprise adoption constraints persisted: only 37% of enterprises believed GenAI applications production-ready, with quality and governance as top barriers. Formative assessment adoption remained strong (~50% educator usage), but high-stakes institutional deployment remained blocked by unresolved assessment integrity, governance, and production-readiness constraints.
2025-Q3: Real-world deployment pilots exposed persistent implementation friction: University of Iowa Khanmigo pilot (spring 2025) showed sub-weekly usage and zero teaching impact due to manual Canvas integration requirements; Michigan Virtual's large-scale K-12 pilot (1,700+ participants) documented value but emphasized need for intentional support. Technical quality validation advanced: peer-reviewed radiology education research confirmed acceptable psychometric properties of AI-generated MCQs. However, institutional barriers deepened: university unit chairs described exam design as a "wicked problem" with impossible trade-offs; student surveys showed 75% skepticism about AI accuracy. New vendor entries (Qzzr, others) signaled growing market, but formative assessment remained dominant use case; high-stakes institutional deployment remained blocked by governance ambiguity, implementation friction, and user acceptance barriers.
2025-Q4: Breakthrough technical validation confirmed question generation maturity in specialized domains: peer-reviewed Chest journal study found AI-generated MCQs (ChatGPT-o1) statistically noninferior to expert questions in mechanical ventilation education; large-scale field study across 1,700 students showed AI questions comparable to expert-created ones by psychometric analysis. Khan Academy's Khanmigo reached 1 million U.S. students, demonstrating rapid production scaling. Market adoption sustained (Quizgecko 854K+ monthly visits). However, integrity vulnerabilities persisted: criminal justice education study documented ChatGPT consistency issues (80% accuracy but unreliable across accounts), revealing assessment gaming risks. Governance, copyright, and user acceptance barriers remained binding constraints. The practice reached critical inflection: technical maturity validated, formative adoption accelerating toward ubiquity, yet institutional assessment deployment remained blocked by unresolved security and policy frameworks despite proven capability.
2026-Jan: OECD analysis documented AI "item factories" achieving 10x speed and cost reductions in exam question creation, while identifying "crutch effect" risk where AI-assisted practice improves scores but reduces independent performance when removed. Vendor ecosystem continued scaling: Eklavvya reported 99% time reduction in question paper creation and 95% reduction in paper leakage incidents; CogniGuide launched instant MCQ generation from documents. Research platforms advanced: EduQuest hybrid system achieved 82% difficulty classification accuracy and 71% higher student engagement with 78% time savings for educators. Engagement patterns documented: 40-60% higher quiz completion rates with AI-powered tools. The bifurcation persisted: formative assessment scaling with institutional adoption gains, yet high-stakes institutional deployment remained constrained by unresolved integrity, governance, and learning outcome trade-offs despite demonstrated technical capability and efficiency gains.
2026-Feb: Educator adoption surveys documented continued scaling across the globe. Macquarie University's 2026 open dataset surveyed educators on AI integration in assessment; LATAM higher education reported 92% student and 79% faculty AI engagement with 50% student support for AI-assisted feedback (though only 19% of faculty had deployed it). Systematic literature review of 103 AQG studies documented clear technical trend toward Transformer models but identified critical gap: educator acceptance remains largely understudied and evaluation methods lack standardization. Vendor ecosystem sustained momentum: TurinQ reported 50,000+ learners with multi-format generation and AI-based grading. Institutional deployment continued: Phillips Academy Andover committed to Khanmigo rollout in spring term for integrated tutoring and quiz generation, though student skepticism persisted about AI reliability. Practitioner guides enumerated 8+ mature platforms (Quizizz, Kahoot, Conker, ProProfs, QuestionWell, Twee, Gibbly, Formative) with documented adoption barriers (data privacy, bias risks). The practice remained bifurcated: technical tooling matured and vendor scaling accelerated, formative assessment adoption widened, yet institutional barriers (user skepticism, acceptance gaps, unresolved evaluation standards) continued constraining high-stakes deployment despite proven technical capability and efficiency gains.
2026-Mar: Production-scale evidence confirmed question generation maturity in enterprise contexts: Coursera's international survey (4,200 educators across 5 countries) reported 28% of faculty actively using AI to draft exams, up from isolated pockets a year prior; e-Assessment Association survey identified item generation as the most frequently used AI application in assessment across assessment organizations globally — a signal of mainstream adoption within the professional assessment sector. Peer-reviewed quality validation advanced: medical education comparative study (March 2026) found Gemini and Copilot MCQs achieved high inter-rater agreement on Bloom's taxonomy and learning outcome alignment, though pediatric study simultaneously documented AI-generated MCQs showing lower discrimination indices and higher proportion of difficulty mismatches vs. human questions. VitalSource field deployment (200+ undergraduates) confirmed classroom impact: distributed AI practice questions yielded 2% average exam score gains with letter-grade improvements at 25th percentile, validating formative assessment value in production. LATAM adoption data: 76% of teachers use AI tools for creating teaching materials (highest use case across surveyed practices); 92% student AI adoption in higher education. Production governance evidence: Khan Academy's Khanmigo optimization documented 64 completed A/B experiments (March 2026) testing iterative quiz generation improvements; University of Jyväskylä research on human-AI co-creation showed ~50% of AI-generated MCQs acceptable without editing through hybrid prompting and human revision. Persistent limitations remained: Washington State University study found ChatGPT only 60% above-chance on true/false accuracy (2025 iteration), 16.4% accuracy identifying false statements—demonstrating consistency gaps limiting high-stakes reliability. 
The practice showed clear bifurcation: formative assessment adoption accelerating toward ubiquity (28-76% faculty adoption depending on survey/context), with documented classroom learning gains; yet institutional assessment deployment remained constrained by unresolved consistency, governance, and integrity concerns despite production-scale optimization evidence and technical capability validation.
2026-Apr: Platform-level normalization advances with enterprise adoption milestones and persistent quality caveats. AssessPrep reports 800+ schools across 85+ countries with 5M+ student submissions and 500K AI-generated questions, providing hard deployment scale in production; MangoApps 2026 Winter Release adds document-to-quiz generation, signalling continued embedding of question generation into enterprise platforms as standard capability. A RAND survey of 4,200 K-12 teachers, however, finds only 38% rate AI assessment questions for higher-order thinking as good/excellent, with 42% requiring significant editing — confirming formative adoption breadth does not yet translate to quality confidence for complex assessment. Peer-reviewed caution on MCQ design flaws (hallucination, fact extraction errors, curriculum mismatches) reinforces that high-stakes deployment without systematic human review remains inadvisable. Governance maturation signals emerge: K-12 districts establish explicit AI assessment frameworks (Alexandria City Schools, Niles Township implement red/yellow/green policies); major testing vendor PSI/ETS releases structured AI test development product with SME review and psychometric rigor guidance. International high-stakes deployment models appear: Kazakhstan's national testing center deploys hybrid AI/expert item generation achieving 97.5% acceptance rates. Institutional productivity gains documented: NYC STEM institute reports 30 hours/week savings on question paper prep after adopting AI-assisted platform. Emerging capability: interactive assessment generation shows stronger exam performance signals and engagement gains while reducing per-instance cost to <$1. Transparency gaps persist: academic audit of 20 educational AI tools finds 80% fail to disclose generative mechanisms and 0 reveal training data sources, indicating maturation gap between capability and informed institutional decision-making.
2026-May: Production-scale deployment evidence strengthens: Pearson Study Prep (62,000+ students) documents 90% higher mastery rates; PSI Exams confirms 77.4% of AI-generated items meet psychometric thresholds at parity with human-authored content; Khan Academy announces a new "Assessments" product with psychometrics and norming, signaling transition from tutoring to structured high-stakes assessment. K-12 district pilots at scale confirm classroom deployment: Connecticut, Utah, Michigan, and NYC pilots document AI-generated exit questions and assessment feedback reaching students across named districts. A Cornell study of 95,000 students at 20 universities (published in Science) finds 37% use GenAI on assignments and 9% to cheat, motivating urgent structural redesign toward process portfolios and AI-adaptive question formats. A critical reliability finding emerged: empirical research documents AI-assisted assessments boosting observable performance 30 points but collapsing assessment reliability (Cronbach's α from 0.87 to 0.31), degrading diagnostic validity; simultaneously, a widely-cited ChatGPT learning meta-analysis (262 peer-reviewed citations) was retracted for methodological discrepancies, underscoring premature claims circulating in the field.
2026-Jun: Medical accuracy and reliability research deepens the case for mandatory human review: BMJ Open study (Tiller et al.) finds 49.6% of medical chatbot responses problematic or highly problematic with fabricated citations; Penn State study on 212 medical questions documents ChatGPT-4o at 84.6% but other models at ~50%, with domain-specific weakness in neurology and dermatology. Stanford AI Index analysis confirms a systemic reliability gap: hallucination rates 22-94% across 26 frontier models, with overconfidence on exactly the hardest items—a critical failure mode for oversight. Deployment at production scale continues to broaden: AssessPrep confirms 800+ schools and 500K AI-generated questions across international curricula; Japan's AI Passport Quiz App reached 10 million uses in 16 months; QuizMaker reports 10,000+ schools. A June 2026 market survey documents 10 mature platforms supporting standards alignment, adaptive difficulty, multi-modal items, and LMS integration across K-12, higher education, corporate training, and certification contexts; K-12 practitioner workflows using Gen AI for pre-assessment generation and item difficulty analysis (ALDO framework) are now documented as established classroom practice. The bifurcation holds: formative adoption and vendor scale are established, but medical and high-stakes domain deployment requires structural human-in-the-loop safeguards that governance frameworks have not yet standardized.