Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

[Chart: each domain plotted along a maturity axis running from BLEEDING EDGE to ESTABLISHED]

Automated grading & assessment

LEADING EDGE

TRAJECTORY

Stalled

AI that grades essays, written work, short answers, and problem sets with rubric-based evaluation and feedback. Includes holistic scoring and partial credit assessment; distinct from formative feedback which guides learning rather than evaluating performance.

OVERVIEW

Automated grading is a practice split in two. For code and objective assessment, forward-leaning universities have proven the model works—Gradescope spans 500+ institutions with 13,000+ instructors, and deployments routinely show 70%+ time savings with improved consistency. That half of the practice functions as a mature, production-grade capability. Essay and open-ended writing assessment tells a different story. LLMs can now match or exceed human inter-rater agreement on certain benchmarks (QWK 0.87 vs 0.77 human), yet every large-scale deployment has surfaced reliability failures, bias concerns, or equity objections that block adoption beyond pilots. Recent 2026 research clarified that LLM consistency reflects averaging of human raters rather than independent judgment, and that human-in-the-loop frameworks are now the institutional norm, not a fallback. Survey evidence shows only 4% of teachers use AI for grading despite 80% using AI broadly—revealing adoption barriers are organizational and trust-based rather than technical. The result is a stable bifurcation: objective assessment is institutionally established while essay grading remains experimental, with barriers shifting from technical feasibility to fairness assurance, organizational governance, and stakeholder acceptance. This tension defines the practice's leading-edge position.
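
For readers unpacking the agreement figures above, the sketch below shows how quadratic weighted kappa (QWK), the rater-agreement metric behind the 0.87 (AI) versus 0.77 (human) comparison, is typically computed. The 1-6 rubric scale and the toy score lists are illustrative only and are not drawn from any cited study.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between two raters on an
# ordinal essay rubric. Illustrative values; not any vendor's implementation.
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two raters on an ordinal scale.
    1.0 = perfect agreement; 0.0 = chance-level agreement."""
    n = max_score - min_score + 1
    a = np.asarray(rater_a) - min_score
    b = np.asarray(rater_b) - min_score

    # Observed confusion matrix between the two raters.
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1

    # Expected matrix under independence (outer product of marginals),
    # scaled to the same total count as the observed matrix.
    expected = np.outer(np.bincount(a, minlength=n),
                        np.bincount(b, minlength=n)) / len(a)

    # Quadratic penalty: disagreements weighted by squared score distance.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Toy example on a 1-6 essay rubric: near-agreement yields a high QWK.
human = [4, 3, 5, 2, 4, 6, 3, 5]
model = [4, 3, 4, 2, 5, 6, 3, 5]
print(round(quadratic_weighted_kappa(human, model, 1, 6), 2))
```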

CURRENT LANDSCAPE

Gradescope anchors the objective-assessment side of the market, with 13,000+ instructors across 500+ universities using it for exam and assignment grading. Seoul National University's large-scale math deployment cut TA workload by 70%; the University of Florida's three-year rollout across seven colleges demonstrated 71% consistency improvements and 76% time savings. Spring 2026 pilots at Chico State, UIUC, and the University of York confirm continued institutional expansion, with Turnitin's ecosystem serving as the principal vendor through LTI integrations and product additions like Clarity. Montgomery County Public Schools (Maryland) documented 80% essay grading time reduction and 19% writing score improvement, while vendor ecosystems like PrepareBuddy demonstrate production scale with 500 submissions graded in 2 hours across 200+ institutions. Budget trends confirm institutional commitment: global higher education institutions now allocate 18–24% of IT budgets to AI learning tools in 2026 (up from ~9% in 2023–2024), with adaptive assessment and automated grading identified as principal procurement drivers.

Essay grading with LLMs presents a sharper challenge. Pearson's Intelligent Essay Assessor operates at production scale, routing hundreds of millions of responses through hybrid human-AI scoring. Patent analysis and technical research from 2001–2026 show three evolutionary phases: rule-based feature engineering (Phase 1), deep learning with embeddings (Phase 2), and transformer ensembles with multimodal OCR (Phase 3). Hybrid human-machine pipelines achieve 19.8% accuracy gains when humans review ~30% of low-confidence cases.

Critical research from May 2026, however, exposes fundamental limitations. A PRISMA scoping review of 46 LLM-based argumentative essay scoring studies (2022–2025) documents field fragmentation, insufficient grounding in argumentation theory, and fragile validity claims across datasets and prompting conditions. Stanford's May 2026 bias research on 600 middle school essays resubmitted with varying demographic labels found consistent, directional bias across all four AI models examined: Black students received more praise emphasizing 'leadership'; Hispanic/ELL students triggered grammar corrections; white students received structural feedback on argument quality; and female students received feedback in an affectionate tone. This asymmetric feedback creates unequal learning opportunities despite equivalent baseline performance. Edexia's analysis confirmed that the 0.87 AI inter-rater QWK reflects averaging of multiple human raters rather than independent judgment.

Deployment experience tells the same story. San Diego USD's deployment of Writable grading software showed 50% time savings but triggered equity audits after ETS analysis revealed a -1.16 point bias for Asian American students, union resistance, and teacher manual grade-correction workflows. A multi-institutional UK trial (Jisc, April 2026) across 15 universities confirmed that efficiency gains are real but erode as academic oversight intensifies, with students preferring human feedback. The K-12 AI-in-education market has reached $7.57B with 46% year-over-year growth, yet 65% of teachers report implementation difficulties. Critically, a global survey of 11,500 educators in April 2026 found that while 80% use AI broadly, only 4% use it for grading, revealing that adoption barriers are organizational and trust-based rather than technical. Institutional deployments prioritize human oversight: the e-Assessment Association's 2026 finalists show consistent patterns where AI generates candidate scores and feedback while institutional staff maintain control of final marks. The barriers are structural: bias mitigation requires continuous human review that eliminates promised time savings, fairness auditing demands transparency tools vendors have not yet implemented, and the gap between consistent scoring and valid assessment remains unresolved.
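
The human-in-the-loop pattern referenced throughout this entry can be made concrete with a small sketch: the model grades every submission, but only the most confident scores are released automatically, while the lowest-confidence share (roughly 30% in the hybrid pipelines cited above) is queued for human markers. The data model, the confidence values, and the score_essay placeholder below are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of confidence-based human-in-the-loop routing for automated
# grading. LLMGrade, the 30% review share, and score_essay() are assumptions
# made for illustration only.
import random
from dataclasses import dataclass

@dataclass
class LLMGrade:
    submission_id: str
    score: float        # candidate score against the rubric
    confidence: float   # calibrated confidence in the score, 0-1

def score_essay(submission_id: str) -> LLMGrade:
    # Placeholder for a real model call; random values stand in for a grade.
    return LLMGrade(submission_id,
                    score=random.uniform(1, 6),
                    confidence=random.uniform(0.3, 0.95))

def route(grades: list[LLMGrade], review_share: float = 0.30):
    """Release the most confident grades automatically; send the
    lowest-confidence share (default ~30%) to human markers."""
    ranked = sorted(grades, key=lambda g: g.confidence)
    cutoff = int(len(ranked) * review_share)
    needs_review, auto_released = ranked[:cutoff], ranked[cutoff:]
    return auto_released, needs_review

if __name__ == "__main__":
    batch = [score_essay(f"essay-{i}") for i in range(10)]
    released, queued = route(batch)
    print(f"auto-released: {len(released)}, queued for human review: {len(queued)}")
```

In practice the review share and the confidence calibration are institution-specific policy choices rather than technical constants, which is precisely where the governance and trust barriers described above arise.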

TIER HISTORY

Research: Jan-2015 → Jan-2015
Bleeding Edge: Jan-2015 → Jan-2016
Leading Edge: Jan-2016 → present

EVIDENCE (149)

— Gartner/Pearson/Coursera data showing higher education budget reallocation: 18–24% of IT budgets now devoted to AI learning tools (up from 9% two years prior). Adaptive assessment identified as primary procurement driver, not content delivery.

— Peer-reviewed conference paper directly examining AI's dual role in educational assessment, balancing efficiency gains against equity and bias concerns.


— e-Assessment Association's 2026 award program finalists document six real institutional deployments across sectors (higher ed, K-12, professional assessment). Named organizations with specific outcomes and metrics. Shows adoption breadth and consistent focus on human oversight.

— Patent and innovation research mapping AEG evolution through 3 phases (2001-2026), technical clusters, accuracy metrics, and geographic IP shifts. ~15M test-takers scored; 19.80% accuracy gain from hybrid human-machine pipelines.

— Stanford study showing consistent bias in AI feedback systems: essays attributed to Black students received more praise, Hispanic/ELL students received grammar corrections, white students received structural critique. Demonstrates fairness limitations in deployed systems.

— Journalism reporting Stanford peer-reviewed research on systematic bias in AI writing feedback by student race/gender/achievement, documenting unequal learning opportunities.

— Critical scoping review of 46 AAES studies (2022-2025) following PRISMA. Documents fragmentation, insufficient argumentation theory grounding, fairness/transparency gaps, sensitivity to prompting and learner proficiency. Concludes LLM systems lack validity and accountability for high-stakes assessment.

— Comprehensive news roundup with multiple strong adoption and policy signals: universities disabling AI detection (Curtin, Vanderbilt, UCLA, Cal State LA, Yale, Johns Hopkins, Northwestern) due to false-positive bias; 134 state AI-in-education bills across 31 states; Khanmigo learning gains (34% improvement vs. traditional tutoring per NBER).

HISTORY

  • 2015: Automated grading emerged from research labs into early commercial products and institutional pilots. Turnitin released its NLP-based scoring engine; Notre Dame and University of Illinois deployed institution-wide or course-level graders. Peer-reviewed research validated improvements in writing accuracy and consistency, though adoption barriers around accuracy and fairness remained.

  • 2016: Vendor momentum accelerated with Gradescope's $2.6M Series A round, following deployment across computer science and exam-grading use cases. Institutional deployments expanded to UMass and other large programs. Research community validated techniques and classroom impacts; adoption barriers shifted from technical feasibility to fairness, transparency, and cost-benefit analysis.

  • 2017: Ecosystem matured with expanded vendor offerings (Turnitin Revision Assistant, Blackboard participation grading) and open-source tools (GatorGrader). Deployment scale grew (M-Write at 2,000 students, Croydon College 40,000 submissions); however, critical research emerged on limitations—low precision in Chinese AWE systems and gaming vulnerability in major essay scorers. Faculty concerns about transparency hardened adoption barriers.

  • 2018: Ecosystem consolidation with Turnitin's acquisition of Gradescope (600+ institutional deployments). Code assessment emerged as a distinct, mature subdomain with 127+ documented systems. Essay scoring faced intensifying professional opposition: NCTE issued formal position statement opposing machine grading, citing inability to assess logic and argumentation. Empirical research documented specific failures—5% rejection rates in production AES systems and vulnerability of neural models to adversarial input. Practice bifurcated sharply: objective and code assessment advancing; writing assessment stalled by fairness and transparency concerns.

  • 2019: Code assessment solidified as institutional standard (Purdue 1,600-enrollment course deployment; Autolab adopted at scale). Academic research confirmed bifurcation: IJCAI survey concluded AES "far from solved" after 50 years; comprehensive studies identified persistent vulnerabilities (gaming via sophisticated nonsense, inability to assess creativity, documented biases). Real-world deployment failures emerged (Utah statewide essay scoring criticized for bias and gaming). Essay and writing assessment consensus shifted from feasibility to trustworthiness questions, cementing a two-tier practice: leading-edge for code/objective assessment with proven ROI; experimental and contested for open-ended writing due to accuracy, fairness, and systemic vulnerabilities.

  • 2020: COVID-19 pandemic accelerated Gradescope adoption for remote assessment; University of Leeds reported 60x usage growth and strong faculty satisfaction with digital grading. However, 2020 exposed critical limitations in high-stakes algorithmic grading: the UK's Ofqual A-Level algorithm and International Baccalaureate's algorithm both failed, producing biased outcomes and triggering widespread backlash. University of Texas at Austin discontinued its GRADE algorithm after 7 years due to bias concerns. Research showed both technical improvements (explainable AES with SHAP for transparency) and systemic vulnerabilities (Edgenuity platform gamed by students through keyword injection). The practice remained bifurcated: objective/code assessment strengthened through pandemic-driven adoption; essay/writing assessment faced mounting skepticism about bias, gaming vulnerability, and fairness in high-stakes contexts.

  • 2021: Ecosystem consolidation continued with Gradescope expansion across major institutions (Purdue, NC State, others) and new vendor entrants (Microsoft Azure Automatic Grading Engine). Research advanced code and objective assessment maturity: empirical studies showed autograder deployment improved student satisfaction and learning outcomes; technical research reduced computational costs of AES models. However, adoption barriers persisted: CHI 2021 research revealed students distrust autograders even at ~90% accuracy, perceiving unfairness despite accuracy; this trust gap remained a critical impediment to broader adoption in high-stakes assessment contexts.

  • 2022-H1: Research community intensified focus on fairness and accuracy trade-offs in AES systems. A comprehensive study of 9 AES methods on 25,000+ essays confirmed a core dilemma: prompt-specific models achieved higher accuracy but showed greater demographic bias; traditional machine learning models (SVM with engineered features) proved fairer than neural networks, challenging the assumption that more sophisticated models would improve all dimensions of performance. Systematic review of 125 studies (2016-2020) synthesized evidence on benefits (scaling, efficiency, bias reduction) and drawbacks (suppression of innovation, gaming vulnerability). Empirical evidence from CS education showed autograding yielded measurable gains (higher scores with lower variance), while critical assessments of production systems like Pigai highlighted gaps in context-aware feedback. The bifurcation sharpened: code and objective assessment continued expanding with vendor consolidation; essay grading remained contested, with research treating accuracy-fairness trade-offs as fundamental rather than solvable.

  • 2022-H2: Institutional deployment of code and objective assessment continued expanding globally. Aalto University piloted Gradescope for paper-based assignment grading (mathematics, engineering); Hanyang University integrated Gradescope into CS courses, reducing exam grading from 2 weeks to automated assessment; Western University reported consistent faculty adoption growth since 2020 rollout. Research and ecosystem documentation confirmed maturity of programming autograding: survey of tool formats documented prevalence and diversity of solutions due to platform demand. Fairness concerns persisted: systematic review found minimal evidence that data-driven technologies effectively mitigate teacher biases, with risks of perpetuating algorithmic inequities. By end of 2022, the bifurcation remained stable: code and objective assessment were institutional standard with proven ROI and adoption momentum; essay and writing assessment remained contested due to unresolved fairness and bias trade-offs.

  • 2023-H1: LLM-based essay grading emerged as a new approach. ChatGPT demonstrated feasibility for exam grading (70% agreement with humans within 10 points on 463 Master's responses); educators deployed Azure OpenAI and Copyleaks tools for production assessment. Systematic reviews of programming autograding documented maturity and diversity of code assessment tools (121 papers analyzed). Fairness remained the limiting factor: research surveys identified persistent bias risks, accuracy-fairness trade-offs, and algorithmic inequities despite vendor claims of bias-mitigation features. Institutional deployment of Gradescope and objective assessment continued; Rose-Hulman's adoption study documented post-pandemic sustainability of technology integration. By end of H1, essay grading with LLMs showed technical promise but bias concerns remained unresolved, maintaining the bifurcation: code/objective assessment production-ready; essay assessment experimental and contested.

  • 2023-H2: LLM-based grading and feedback continued expanding. Turnitin announced expanded offerings including AI-powered grading features in October 2023. Research on GPT-4's consistency as a text rater validated LLM reliability for certain assessment contexts. Azure OpenAI released production-grade tools for programming test scoring with partial-credit logic. Generative AI-based smart grading tools emerged for knowledge-grounded answer evaluation. Fifty-year historical review of automated essay scoring identified persistent challenges in feedback quality and assessment validity. Code and objective assessment remained institutional standard with expanding LLM applications; essay assessment bifurcation persisted between promise and fairness concerns, with vendor momentum but limited evidence of bias mitigation at scale.

  • 2024-Q1: LLM-based essay grading entered empirical validation phase with comparative studies showing closed-source models (GPT-4, o1) achieving r=.74 alignment with human teachers; ACER e-Write reported 170K+ annual K-12 sittings. Institutional adoption of Gradescope continued (University of Iowa replacing Scantron; Aalto, Hanyang deployments). However, critical evidence emerged on reliability risks: GitHub Classroom autograder failures in February–March 2024; research showing AI detection tools exhibit high false positive rates (27%) and fairness concerns. Innovation in workflow automation (gradetools) addressed efficiency gaps. Code and objective assessment consolidated as institutional standard; essay grading with AI showed technical promise but persistent fairness-accuracy trade-offs and reliability concerns limited high-stakes adoption.

  • 2024-Q2: LLM essay grading empirical testing accelerated with mixed results. Positive signals: UC Irvine study (1,800 essays) showed 89% ChatGPT agreement within one point in some contexts; IU researchers reported 44% lower error than humans on short-answer grading. Negative signals exposed real deployment failures: Texas Education Agency statewide STAAR deployment triggered equity audits after zero-score spikes; University of Delaware Gradescope pilot achieved <20% student usage despite 300 courses created; experimental evidence showed ChatGPT grade inconsistency (78-100 on same essay). Vendor momentum continued (Turnitin AI features, Azure OpenAI tools), but deployment experience revealed gap between research promise and production reliability. Code and objective assessment remained institutional standard; essay grading bifurcation sharpened between vendor claims and deployment reality.

  • 2024-Q3: Institutional Gradescope adoption continued expanding (University of Nebraska-Lincoln fall 2024 pilot, Indiana University production deployments in large Calculus/math courses, Swarthmore new feature rollout), confirming persistent vendor momentum and institutional reliance on code/objective assessment infrastructure. LLM essay grading research turned critical: IJCAI 2024 survey reassessed the field as "largely unsolved despite 50+ years," while University of Alberta empirical study found ChatGPT and Llama assign lower scores than humans with poor correlation, contradicting positive narratives. Systematic review of Automated Writing Evaluation (19 studies, 2016-2020) documented persistent adoption barrier—students distrust AI feedback despite positive perceptions of efficiency. Evidence showed practice remained bifurcated: code and objective assessment consolidated as institutional standard with proven deployment ROI; essay and writing assessment remained contested between research capability gains and real-world reliability/fairness failures.

  • 2024-Q4: LLM essay grading research matured with empirical evidence of both capability and limitations. German comparative study found GPT models (especially o1) achieving r=.74 alignment with human teachers on multidimensional essay scoring but exhibiting leniency bias requiring refinement. Broader research consensus emerged: EMNLP 2024 critical reflection on AES field identified narrow focus on benchmark metrics without solving fundamental problems; Advance HE analysis highlighted persistent gap between research promise and actual university-level adoption after 58 years. ChatGPT systematic review documented stricter grading and inconsistency on subjective tasks, reinforcing deployment concerns. Vendor ecosystem continued expanding: Examino commercial platform claimed 450,000+ papers graded across 25+ subjects; University of Connecticut piloted Gradescope bubble-sheet scanner replacing legacy Scantron system. By year-end 2024, bifurcation held firm: code/objective assessment solidified as production-ready with institutional rollouts; essay grading remained at inflection point—LLMs demonstrated technical feasibility but real-world deployment constraints and fairness limitations prevented high-stakes adoption momentum.

  • 2025-Q1: Multimodal essay grading research revealed scaling limitations: EssayJudge benchmark (ACL Findings 2025) showed 18 state-of-the-art MLLMs exhibit significant gaps in discourse-level trait assessment, tempering optimism about larger models automatically solving accuracy. Adoption research shifted focus from technical feasibility to societal barriers: Technology Acceptance Model study identified mixed exam formats (70% MC/30% short-answer) as highest-acceptance condition; University of Twente longitudinal study framed bias, transparency, and explainability as central ethical prerequisites for teacher/student acceptance, not peripheral concerns. Code/objective assessment continued institutional expansion: University of Delaware Spring 2025 pilot showed growing Gradescope adoption across assignment types and bubble-sheet scanning. Bifurcation now clear: objective/code assessment institutionally viable with proven ROI; essay assessment technically advancing but adoption blockers are ethical and organizational (trust, fairness, human oversight) rather than technical, favoring human-in-the-loop and hybrid approaches over full automation.

  • 2025-Q2: Research momentum accelerated with focus on foundational improvements and critical appraisals. PERSUADE corpus research (25,996 essays) investigated AES accuracy enhancement via feedback-oriented annotations, advancing methodology on large-scale K-12 datasets. Geographic expansion continued with Indonesian essay scoring systems using transfer learning (IndoBERT). UMD research advanced ensemble learning for constructed-response reading assessment, extending automation to more subjective domains. Critical voice persisted: educators highlighted fundamental barriers of essay reduction to numeric scores and Pearson competitors' failures. Systematic reviews of algorithmic bias synthesized benefits (efficiency, scaling) against persistent fairness concerns, reinforcing that adoption blockers remain organizational and ethical rather than technical. Code/objective assessment maintained institutional dominance; essay grading research continued advancing but real-world deployment remained constrained by fairness requirements and human-oversight expectations.

  • 2025-Q3: Institutional adoption of objective/code assessment accelerated: University of Florida completed 3-year Gradescope pilot rollout across seven colleges showing 71% consistency improvements and 76% time savings; community college research documented positive learning outcomes from auto-grader feedback. Vendor momentum continued with Turnitin Clarity GA in July. Essay grading research advanced technically (unsupervised methods, rubric refinement, multimodal benchmarks) but real-world constraints hardened: ACL 2025 benchmark revealed MLLMs exhibit significant gaps in discourse-level assessment; critical analysis documented chatbot bias and fundamental unreliability. Bifurcation remained stable between production-ready objective/code assessment and contested essay grading with unresolved bias, interpretability, and reliability barriers.

  • 2025-Q4: Global institutional adoption of objective/code assessment continued expanding with Gradescope reaching 500+ universities (13,000+ instructors) and Seoul National University demonstrating large-scale math exam automation with 70% TA workload reduction. K-5 IES-funded research on MI Write AEE in Delaware showed strong predictive validity but highlighted implementation barriers including student usability challenges and feedback misalignment, requiring sustained teacher training. Essay grading research continued but independent empirical evidence surfaced accuracy variability across domains and fairness gaps for non-native speakers. Bifurcation hardened: objective/code assessment solidified globally; essay grading remained methodologically advancing but constrained by real-world reliability and equity concerns.

  • 2026-Jan: Institutional adoption of code and objective assessment continued accelerating into 2026, with major universities launching Spring pilots (Chico State, UIUC) and integrating Gradescope across STEM and humanities courses. Turnitin ecosystem expanded with LTI migrations (Jyväskylä, others) and competitive shifts (Sheridan College transitioning to Copyleaks). Essay grading research matured with balanced evidence: Stanford SCALE Initiative consolidated academic syntheses; EasyClass AI synthesis of 2024-2025 studies confirmed proportional bias in AI systems and fundamental accuracy-fairness trade-offs. Product momentum in essay grading continued (EssayGrader 3.0 with custom rubrics and LMS integration, EasyClass K-12 claims) but independent evidence documented limitations—accuracy variability across domains, leniency bias on weak essays, struggles with nuance. Bifurcation held firm: code/objective assessment production-ready with global momentum; essay grading technically advancing but real-world adoption blocked by fairness and reliability barriers, favoring hybrid human-in-the-loop approaches.

  • 2026-Feb: LLM essay grading research advanced with mode-specific rubric optimization (CARO framework) and clarity on consistency-vs-accuracy (Edexia analysis showing 0.87 AI inter-rater QWK exceeds 0.77 human but reflects averaging, not superior judgment). Vendor ecosystem momentum continued (Pearson Intelligent Essay Assessor production deployment on hundreds of millions). Critical assessment intensified: Inside Higher Ed analysis documented that AI measurement validity crisis forces institutional control measures (proctoring, oral defenses) that widen equity gaps. K-12 market reached $7.57B (46% YoY growth) but with 65% teacher implementation concerns and equity risks. Code/objective assessment continued institutional expansion; essay grading remained blocked by fundamental tensions between automated consistency, assessment validity, and equity.

  • 2026-Mar: Major LMS ecosystem acceleration with Instructure (Canvas) releasing IgniteAI Grading Assistance (March 21), enabling AI-generated scores and feedback for written assignments aligned to teacher rubrics. Large-scale real-world evidence emerged: UC Irvine production deployment on ~800 calculus students using OCR-conditioned LLMs demonstrated practicality with documented failure modes and rubric-design principles. Government-scale deployment confirmed with Janison's NAPLAN assessment spanning Australia's national K-12 program. Critical negative signals surfaced: a Connecticut investigation of the Amity Regional HS grading deployment revealed semantic reasoning failures, student resistance (150+ petition), and accuracy concerns despite $19k vendor spending; independent analysis documented vendor claims (90% accuracy) obscuring reality (40% exact-score agreement). Market sizing: online exam software projected to reach $15.86B by 2030 (12.6% CAGR 2025-2026), with automated grading/evaluation identified as a key growth driver. Bifurcation remained firm: objective/code assessment solidified as institutional standard with expanded LMS integration and proven deployment ROI; essay grading confronted a widening gap between technical capability and real-world reliability, fairness, and user acceptance barriers.

  • 2026-Apr: Real-world deployment evidence sharpened the bifurcation. Objective assessment gains: Montgomery County Public Schools documented 80% essay grading time reduction, 95% human-rater correlation, and 19% writing score improvement; PrepareBuddy's RAG-based batch grader processed 500 submissions in 2 hours across 200+ institutions; a UK Jisc trial across 15 universities confirmed efficiency gains are real but erode as human oversight intensifies. Bias and equity concerns intensified around essay grading: San Diego USD's Writable deployment triggered governance scrutiny after ETS documented a -1.16 point gap for Asian American students alongside union resistance and teacher manual correction workflows, while a UAE mixed-methods study (400 students) found severe SEND learner disadvantage (d=0.76–1.12). Research confirmed systematic LLM scoring biases — overvaluing short essays, penalizing minor errors, and yielding only moderate holistic agreement (QWK ~0.6) — reinforcing that human-in-the-loop models remain the institutional standard for high-stakes essay assessment. A global survey of 11,500 educators underscored the adoption gap: 80% use AI broadly but only 4% use it for grading, indicating trust and governance barriers rather than awareness. Research advances included neuro-symbolic approaches combining GPT-4o with rubric-aligned explanations and formal logic rules for ENEM essay scoring, improving transparency while matching accuracy — a signal that interpretability work is advancing in parallel with deployment.

  • 2026-May: Critical research on bias and field maturity clarified deployment constraints. Stanford researchers (Marked Pedagogies, nominated for LAK 2026 best paper) submitted 600 middle school essays to four AI models and resubmitted each 12+ times with varying demographic labels (race, gender, motivation, disability). Consistent patterns emerged across all models: Black students received praise emphasizing 'leadership' or 'power'; Hispanic/ELL students triggered grammar corrections; white students received structural critique; female students received affectionate tone; high-achieving students got critical refinement, unmotivated students got encouragement. Asymmetric feedback creates unequal learning opportunities despite equivalent content quality. A critical PRISMA scoping review (46 AAES studies, 2022–2025) documents LLM essay assessment field fragmentation: research insufficiently grounded in argumentation theory, datasets non-comparable, evaluation methods inconsistent, validity claims fragile across prompt diversity and learner proficiency. Patent and technical analysis (2001–2026) shows three phases of AES evolution, with Phase 3 (2021–2026) hybrid human-machine pipelines achieving 19.8% QWK gains and 25.6% accuracy improvements through human review of ambiguous cases. Budget trends show higher education allocating 18–24% of IT budgets to AI learning tools (up from 9%), with adaptive assessment and automated grading as principal drivers. e-Assessment Association 2026 finalists across higher ed and K-12 show institutional pattern: AI generates candidate scores and feedback while staff maintain control of final marks. Bifurcation firm: objective/code assessment institutionally established with budget support and proven ROI; essay grading remains blocked by systematic bias, validity concerns, and equity requirements that reverse promised efficiency gains through human oversight.