Question & exam generation



LEADING EDGE

AI that generates assessment questions, quizzes, and examinations at specified difficulty levels and covering defined topics. Includes distractor generation and difficulty calibration; distinct from interview question generation in HR, which targets hiring rather than education.

OVERVIEW

AI-generated assessment questions have reached leading-edge maturity. The technical bar is cleared (peer-reviewed parity with expert items in specialized domains), vendor tooling is normalized into major platforms (Canvas New Quizzes, used by 78% of institutions, integrates AI question authoring), and deployment at scale is underway (800+ schools, 500K+ sessions in months). Formative adoption and high-stakes institutional assessment nonetheless remain starkly bifurcated. Formative use (study aids, practice quizzes, low-stakes classroom review) has crossed into ubiquity: 28-76% of educators adopt generative tools depending on context, with documented learning gains. High-stakes examination stays locked behind unresolved barriers: distractor quality flaws persist (weak distractors, design mismatches), consistency gaps limit reliability (lower discrimination indices on pediatric MCQs, inconsistency across LLM runs), integrity risks are documented (AI-generated assessments indistinguishable from human work), and governance frameworks remain absent. The shift from bleeding edge to leading edge reflects proven capability and production scale; the gap to mainstream deployment awaits quality assurance standardization, integrity frameworks, and institutional governance, barriers that are tractable but not yet resolved.

CURRENT LANDSCAPE

Platform-level normalization is now evident. Canvas New Quizzes (78% of institutions, 32M quizzes created in 2025) integrates AI question authoring via IgniteAI; MangoApps released document-to-quiz generation in April 2026; and enterprise platforms are embedding the capability as standard. Institutional deployments at scale are underway: AssessPrep operates across 800+ schools in 85+ countries with 5M+ student submissions and reports 92% outcome improvement; SchoolAI shows 500K personalized learning sessions in six months. Khan Academy's Khanmigo reached 1 million U.S. students and rolled out at Phillips Academy Andover, and the May 2026 announcement of a new "Assessments" product signals a transition to structured assessment design with psychometrics and norming. Pearson Study Prep, deployed at scale (62,000+ students, Fall 2025), achieved 90% higher proficiency rates and 60% higher mastery with goal-setting, validating formative adoption with measurable learning gains.

Major testing vendor PSI (an ETS subsidiary) demonstrates production-grade governance: 77.4% of AI-generated items meet psychometric thresholds, at parity with human-authored items, using multi-agent AI validation with expert SME review. The U.S. Department of Education (IES) invested $3.6M over 3 years (2024-2027) in AI-enhanced scenario-based assessment authoring, signaling policy-level recognition of both opportunity and need. Speed and cost gains are proven: the OECD documents 10x speed and cost reductions in question paper creation; educators report 60-80% time savings; teachers cite assessment creation as their top AI use case (76% in LATAM contexts).

But formative scaling masks persistent barriers to high-stakes deployment. A RAND survey of 4,200 K-12 teachers shows only 38% rate AI assessment questions for higher-order thinking as good/excellent; 42% need significant editing. Peer-reviewed research documents classical MCQ design flaws (weak distractors, lower discrimination indices on AI-generated items vs. human-authored). Expert critical assessment identifies architectural risks specific to medical education: fact extraction errors, hallucination, and curriculum mismatches that render AI-generated high-stakes content dangerous without human review. A critical limitation emerged in May 2026: empirical research documents a validity trade-off in which AI-assisted assessments boost observable performance by 30 points but collapse assessment reliability (Cronbach's α drops from 0.87 to 0.31), degrading diagnostic validity and ability discrimination. An integrity vulnerability is also documented: AI-generated assessments are indistinguishable from human work, creating systemic fairness risks in evaluation.

Research reliability concerns were exposed when a widely-cited meta-analysis (262 peer-reviewed citations, claiming "large positive" ChatGPT effects) was retracted for methodological discrepancies, revealing premature claims circulating in the field. Governance remains the binding constraint: copyright liability is unresolved, evaluation standards are unstandardized, integrity frameworks are absent, and research validation gaps undermine institutional confidence.
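
For readers less familiar with the reliability statistic cited above, Cronbach's α for a test of k items is k/(k-1) × (1 − sum of per-item score variances / variance of students' total scores). The sketch below is a minimal illustration of that formula only: the function name `cronbach_alpha` and the toy score matrices are assumptions made up for this example, and it does not reproduce the cited study's data or its 0.87 and 0.31 figures.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a students-by-items score matrix.

    scores: list of rows, one per student, each a list of item scores.
    Uses sample variance (n - 1 denominator) for items and totals alike.
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least 2 students and 2 items")
    k = len(scores[0])
    # Transpose rows into per-item columns.
    item_columns = list(zip(*scores))
    item_variance_sum = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Toy data (hypothetical): every student scores the same on all three
# items, so the items agree perfectly and alpha is close to 1.
consistent = [[0, 0, 0], [1, 1, 1], [1, 1, 1], [0, 0, 0]]

# Toy data (hypothetical): item scores do not move together,
# so alpha is close to 0.
inconsistent = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]

if __name__ == "__main__":
    print(round(cronbach_alpha(consistent), 3))
    print(round(cronbach_alpha(inconsistent), 3))
```

A drop of the kind reported above means item scores stop moving together across students, so the total score separates stronger from weaker students less dependably.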

TIER HISTORY

Research: Jan-2023 → Jan-2023

Bleeding Edge: Jan-2023 → Mar-2026

Leading Edge: Mar-2026 → present

EVIDENCE (92)

Publisher Withdraws Study Claiming ChatGPT Boosts Learning | News Coverage | 2026-05-06

Springer Nature retracted a widely-cited meta-analysis (262 peer-reviewed citations) for discrepancies in analysis; this signals research reliability concerns and premature claims in the AI education field, and is critical context for tier classification uncertainty.

Pearson Data Shows AI-Powered Practice Boosts Student Proficiency by 90% | Adoption Metrics | 2026-05-05

Large-scale production deployment: Pearson Study Prep's AI-adaptive practice questions with 62,000+ higher ed students achieved 90% higher mastery rates and 60% higher proficiency with goal-setting, demonstrating measurable learning gains at scale.

Scenario-Based Assessment in the Age of Generative AI | Industry Reports | 2026-05-01

U.S. Department of Education (IES/NCER) awarded a $3.6M, 3-year grant to develop an AI-enhanced scenario-based assessment authoring tool; signals federal policy recognition of both opportunity and need for scaled AI assessment generation.

AI Test Development Platform - PSI Exams | Product Launches | 2026-04-30

Production-grade national licensure deployment: 77.4% of AI-generated items met psychometric thresholds vs. 75.5% human-authored; multi-agent AI validation with expert SME review ensures accuracy and accountability at scale.

Podcast - Managing the Future of Work - Harvard Business School | Product Launches | 2026-04-25

Khan Academy announces 'Assessments' product with psychometrics, norming, and AI-powered question design including open-ended questions and narrative feedback; a transition from tutoring focus to structured assessment capability.

AI Impact on Decision-Making: Trade-offs Between Performance and Assessment Validity | Opinion | 2026-04-25

Empirical research documents a critical trade-off: AI-assisted assessments boost observable performance by 30 points but collapse assessment reliability (Cronbach's α: 0.87→0.31), degrading diagnostic validity and ability discrimination.

Outsmarting AI in the classroom - ASU News | Case Studies | 2026-04-21

GAMED.AI interactive assessment generation system deployed in a university NLP course: students showed stronger in-class exam performance and higher engagement; games were generated in <1 minute at <$1 per instance, illustrating emerging interactive assessment capability.

PrepAI | AI-Assisted Assessment Platform for Educators | Case Studies | 2026-04-20

Named institutional deployment at NYC's leading STEM institute: professors saved 30 hours/week on question paper preparation with a 20% operational cost reduction; demonstrates formative assessment productivity gains in higher education.

HISTORY

2023-H1: Early open-source projects (Quizify, AI-Quiz-Generator) and consumer tools (Conker.ai, QwizLab) demonstrated proof-of-concept that generative models could create quizzes and exams from text; institutional deployments remained minimal.
2023-H2: Khan Academy deployed Khanmigo with quiz generation at scale (thousands of users); research validated AI performance on degree-level mathematics exams; iQS was deployed in university Moodle systems. Adoption surveys showed educators saw utility but lacked confidence; accuracy concerns and slow institutional trust remained primary barriers.
2024-Q1: Consumer adoption accelerated (Conker: 600k quizzes, QuizFlex: 10k+ educators, QuizGeniusAI: 300+ educators). Research identified specific gaps: EDM 2024 study found GPT-4 achieved 70% validity for question stems but only 37% for distractors/misconceptions. Pearson VUE's high-stakes testing research showed time-saving benefits but persistent issues with cognitive level calibration. Critical assessment from AQA highlighted unresolved IP, bias, and reliability concerns—signalling adoption barriers in formal assessment despite technical progress.
2024-Q2: Ecosystem maturation continued: Khan Academy expanded Khanmigo with free AI question generators for all teachers; forms.app's comparative analysis identified 12+ mature quiz generation platforms across the market. Questgen and other tools demonstrated multi-format generation capability. However, critical integrity research from University of Reading (June 2024) revealed AI-generated exam submissions achieved 94% undetection rate and grades 0.5 boundaries higher than real students—a stark demonstration of assessment integrity risk. Institutional resistance intensified: American Board of Radiology formally banned AI-generated exam content, citing copyright and integrity concerns. The tension sharpened: widespread consumer adoption and vendor expansion collided with accumulating evidence of both technical limitations (distractor quality, cognitive calibration) and integrity risks (undetectable submissions, grade inflation), creating a widening gap between bleeding-edge tooling and high-stakes institutional deployment readiness.
2024-Q3: Vendor consolidation and geographic scaling: Khan Academy made Khanmigo free to teachers in 49 countries via a Microsoft partnership (Aug 2024), then integrated it into Canvas LMS for U.S. educators (Sept 2024). Research highlighted evaluation as the key bottleneck: an AAAI 2024 conference paper documented that question quality assessment methods limit full integration into education; an arXiv review catalogued AI's capabilities (Bloom's-aligned questions) and gaps (ethical, accuracy, consistency). Project failure rates remained high (RAND: 80% of AI projects fail), and deployment risks intensified, with documented guardrail failures in production systems, benchmark validity critiques, and evidence that 80%+ of enterprise AI projects were abandoned. The practice reached a critical inflection: consumer adoption was evident and accelerating, but institutional deployment remained blocked by evaluation uncertainty, integrity concerns, and widespread deployment failures across AI systems generally.
2024-Q4: Adoption continued: the HMH 2024 survey showed 50% of educators using GenAI tools (5x increase YoY) with assessment creation among top use cases; Khan Academy's efficacy study demonstrated a ~350K student user base with measurable learning gains, confirming scale and real classroom impact. Vendor consolidation persisted: ABMS boards piloted AI question generation for medical certification (American Board of Anesthesiology), though the pilot was paused due to copyright concerns. Real-world deployment risks materialized: NSW Education Authority's October 2024 HSC exam included an AI-generated image, sparking student complaints about authenticity and suitability and illustrating quality acceptance barriers in high-stakes contexts. A critical finding emerged: only 37% of enterprises reported GenAI applications production-ready, with quality and governance as top barriers, a structural constraint directly limiting institutional adoption of question generation tools. The practice remained bifurcated: consumer and formative assessment adoption accelerated toward ubiquity (50%+ educator usage), yet high-stakes institutional assessment deployment remained blocked by unresolved evaluation methods, integrity concerns, copyright ambiguity, and enterprise-wide deployment readiness gaps.
2025-Q1: Accuracy research strengthened the evidence behind these barriers: a BBC study found 51% of AI responses contained significant factual errors; a Columbia study documented >60% incorrect answers to news questions, including fabricated citations. Medical education research (BMC Medical Education) provided validation of question quality measurement methods. Enterprise adoption data remained grim: 95% of AI pilots failed to deliver measurable value (MIT study); 42% of businesses abandoned AI initiatives in early 2025, up from 17% six months prior. Formative assessment adoption continued at scale, but institutional deployment remained constrained by accuracy risks, evaluation method immaturity, and enterprise-wide AI adoption failure rates exceeding 90%.
2025-Q2: Technical research advanced incrementally: NAACL 2025 introduced ConQuer framework with 4.8% quality improvements and 77% pairwise win rate over baselines. However, institutional adoption barriers deepened: Chinese exam authorities disabled AI tools during gaokao to prevent cheating, a concrete signal of assessment integrity concerns. Enterprise adoption constraints persisted: only 37% of enterprises believed GenAI applications production-ready, with quality and governance as top barriers. Formative assessment adoption remained strong (~50% educator usage), but high-stakes institutional deployment remained blocked by unresolved assessment integrity, governance, and production-readiness constraints.
2025-Q3: Real-world deployment pilots exposed persistent implementation friction: University of Iowa Khanmigo pilot (spring 2025) showed sub-weekly usage and zero teaching impact due to manual Canvas integration requirements; Michigan Virtual's large-scale K-12 pilot (1,700+ participants) documented value but emphasized need for intentional support. Technical quality validation advanced: peer-reviewed radiology education research confirmed acceptable psychometric properties of AI-generated MCQs. However, institutional barriers deepened: university unit chairs described exam design as a "wicked problem" with impossible trade-offs; student surveys showed 75% skepticism about AI accuracy. New vendor entries (Qzzr, others) signaled growing market, but formative assessment remained dominant use case; high-stakes institutional deployment remained blocked by governance ambiguity, implementation friction, and user acceptance barriers.
2025-Q4: Breakthrough technical validation confirmed question generation maturity in specialized domains: peer-reviewed Chest journal study found AI-generated MCQs (ChatGPT-o1) statistically noninferior to expert questions in mechanical ventilation education; large-scale field study across 1,700 students showed AI questions comparable to expert-created ones by psychometric analysis. Khan Academy's Khanmigo reached 1 million U.S. students, demonstrating rapid production scaling. Market adoption sustained (Quizgecko 854K+ monthly visits). However, integrity vulnerabilities persisted: criminal justice education study documented ChatGPT consistency issues (80% accuracy but unreliable across accounts), revealing assessment gaming risks. Governance, copyright, and user acceptance barriers remained binding constraints. The practice reached critical inflection: technical maturity validated, formative adoption accelerating toward ubiquity, yet institutional assessment deployment remained blocked by unresolved security and policy frameworks despite proven capability.
2026-Jan: OECD analysis documented AI "item factories" achieving 10x speed and cost reductions in exam question creation, while identifying "crutch effect" risk where AI-assisted practice improves scores but reduces independent performance when removed. Vendor ecosystem continued scaling: Eklavvya reported 99% time reduction in question paper creation and 95% reduction in paper leakage incidents; CogniGuide launched instant MCQ generation from documents. Research platforms advanced: EduQuest hybrid system achieved 82% difficulty classification accuracy and 71% higher student engagement with 78% time savings for educators. Engagement patterns documented: 40-60% higher quiz completion rates with AI-powered tools. The bifurcation persisted: formative assessment scaling with institutional adoption gains, yet high-stakes institutional deployment remained constrained by unresolved integrity, governance, and learning outcome trade-offs despite demonstrated technical capability and efficiency gains.
2026-Feb: Educator adoption surveys documented continued scaling across the globe. Macquarie University's 2026 open dataset surveyed educators on AI integration in assessment; LATAM higher education reported 92% student and 79% faculty AI engagement with 50% student support for AI-assisted feedback (though only 19% of faculty had deployed it). Systematic literature review of 103 AQG studies documented clear technical trend toward Transformer models but identified critical gap: educator acceptance remains largely understudied and evaluation methods lack standardization. Vendor ecosystem sustained momentum: TurinQ reported 50,000+ learners with multi-format generation and AI-based grading. Institutional deployment continued: Phillips Academy Andover committed to Khanmigo rollout in spring term for integrated tutoring and quiz generation, though student skepticism persisted about AI reliability. Practitioner guides enumerated 8+ mature platforms (Quizizz, Kahoot, Conker, ProProfs, QuestionWell, Twee, Gibbly, Formative) with documented adoption barriers (data privacy, bias risks). The practice remained bifurcated: technical tooling matured and vendor scaling accelerated, formative assessment adoption widened, yet institutional barriers (user skepticism, acceptance gaps, unresolved evaluation standards) continued constraining high-stakes deployment despite proven technical capability and efficiency gains.
2026-Mar: Production-scale evidence confirmed question generation maturity in enterprise contexts: Coursera's international survey (4,200 educators across 5 countries) reported 28% of faculty actively using AI to draft exams, up from isolated pockets a year prior; an e-Assessment Association survey identified item generation as the most frequently used AI application in assessment across assessment organizations globally, a signal of mainstream adoption within the professional assessment sector. Peer-reviewed quality validation advanced: a medical education comparative study (March 2026) found Gemini and Copilot MCQs achieved high inter-rater agreement on Bloom's taxonomy and learning outcome alignment, though a pediatric study simultaneously documented AI-generated MCQs showing lower discrimination indices and a higher proportion of difficulty mismatches vs. human questions. VitalSource field deployment (200+ undergraduates) confirmed classroom impact: distributed AI practice questions yielded 2% average exam score gains with letter-grade improvements at the 25th percentile, validating formative assessment value in production. LATAM adoption data: 76% of teachers use AI tools for creating teaching materials (highest use case across surveyed practices); 92% student AI adoption in higher education. Production governance evidence: Khan Academy's Khanmigo optimization documented 64 completed A/B experiments (March 2026) testing iterative quiz generation improvements; University of Jyväskylä research on human-AI co-creation showed ~50% of AI-generated MCQs acceptable without editing through hybrid prompting and human revision. Persistent limitations remained: a Washington State University study found ChatGPT (2025 iteration) only 60% above-chance on true/false accuracy, with 16.4% accuracy identifying false statements, demonstrating consistency gaps limiting high-stakes reliability. The practice showed clear bifurcation: formative assessment adoption accelerating toward ubiquity (28-76% faculty adoption depending on survey/context), with documented classroom learning gains; yet institutional assessment deployment remained constrained by unresolved consistency, governance, and integrity concerns despite production-scale optimization evidence and technical capability validation.
2026-Apr: Platform-level normalization advances with enterprise adoption milestones and persistent quality caveats. AssessPrep reports 800+ schools across 85+ countries with 5M+ student submissions and 500K AI-generated questions, providing hard deployment scale in production; MangoApps 2026 Winter Release adds document-to-quiz generation, signalling continued embedding of question generation into enterprise platforms as standard capability. A RAND survey of 4,200 K-12 teachers, however, finds only 38% rate AI assessment questions for higher-order thinking as good/excellent, with 42% requiring significant editing — confirming formative adoption breadth does not yet translate to quality confidence for complex assessment. Peer-reviewed caution on MCQ design flaws (hallucination, fact extraction errors, curriculum mismatches) reinforces that high-stakes deployment without systematic human review remains inadvisable. Governance maturation signals emerge: K-12 districts establish explicit AI assessment frameworks (Alexandria City Schools, Niles Township implement red/yellow/green policies); major testing vendor PSI/ETS releases structured AI test development product with SME review and psychometric rigor guidance. International high-stakes deployment models appear: Kazakhstan's national testing center deploys hybrid AI/expert item generation achieving 97.5% acceptance rates. Institutional productivity gains documented: NYC STEM institute reports 30 hours/week savings on question paper prep after adopting AI-assisted platform. Emerging capability: interactive assessment generation shows stronger exam performance signals and engagement gains while reducing per-instance cost to <$1. Transparency gaps persist: academic audit of 20 educational AI tools finds 80% fail to disclose generative mechanisms and 0 reveal training data sources, indicating maturation gap between capability and informed institutional decision-making.
2026-May: Production-scale deployment evidence strengthens with Pearson Study Prep (62,000+ students) documenting 90% higher mastery rates and PSI Exams confirming 77.4% of AI-generated items meet psychometric thresholds at parity with human-authored content — while Khan Academy announces a new "Assessments" product with psychometrics and norming, signaling transition from tutoring to structured assessment. A critical reliability finding emerged: empirical research documents AI-assisted assessments boosting observable performance 30 points but collapsing assessment reliability (Cronbach's α from 0.87 to 0.31), degrading diagnostic validity; simultaneously, Springer Nature retracted a widely-cited ChatGPT learning meta-analysis (262 citations) for methodological discrepancies, underscoring premature claims circulating in the field.
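
Several entries above compare AI-generated and human-authored MCQs by their discrimination indices. As a rough illustration of what that statistic measures, the sketch below computes the classical upper-lower discrimination index: the proportion of high scorers answering an item correctly minus the proportion of low scorers doing so. The function name, the 27% group fraction (a common convention in classical test theory, assumed here rather than taken from any cited study), and the toy data are all illustrative assumptions.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index D for a single item.

    item_correct: list of 0/1 flags, one per student, for this item.
    total_scores: list of each student's total test score (same order).
    fraction: share of students placed in each of the upper and lower
              groups (0.27 is an assumed, conventional choice).
    Returns D in [-1, 1]; higher means the item better separates
    high scorers from low scorers.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("inputs must have one entry per student")
    n = len(total_scores)
    if n < 2:
        raise ValueError("need at least 2 students")
    group_size = max(1, round(n * fraction))
    # Rank student indices from lowest to highest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    lower = ranked[:group_size]
    upper = ranked[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower


# Toy data (hypothetical): ten students with total scores 1..10.
totals = list(range(1, 11))
# Only the stronger half answers this item correctly.
sharp_item = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
# Correct answers on this item are unrelated to overall ability.
noisy_item = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    print(discrimination_index(sharp_item, totals))
    print(discrimination_index(noisy_item, totals))
```

An item with D near zero or below, like the second toy item, does little to distinguish stronger from weaker students, which is the kind of weakness the pediatric MCQ comparison reports for AI-generated items.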

TOOLS

Conker.ai, QwizLab, Quizify, QuizFlex, QuizGeniusAI