Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGEESTABLISHED

Adversarial, bias & fairness testing

BLEEDING EDGE

TRAJECTORY

Stalled

AI that tests models for demographic bias, fairness violations, and adversarial robustness through systematic red-teaming and probing. Includes automated fairness audits and adversarial prompt generation; distinct from hallucination detection which tests factual accuracy rather than fairness or robustness.

OVERVIEW

Adversarial, bias, and fairness testing -- systematic red-teaming, automated fairness audits, and robustness probing of AI systems -- remains firmly bleeding-edge: mature tooling meets immature adoption discipline. The ecosystem expanded to 10+ vendor platforms, frontier labs operate formal red teams (Microsoft, Anthropic, OpenAI documented 12+ months of operational programmes), and US/EU regulatory frameworks now mandate adversarial testing by August 2026. Yet deployment discipline is uneven: 72% of organisations still report inadequate evaluation protocols; 54% of deployed systems contain undetected biases despite toolkit maturity. The practice faces two foundational tensions. First, adversarial robustness research hit a confirmed ceiling around 90% due to imperceptible perturbation constraints, while attack success rates persist at 86-97% across vectors, revealing a hard limit in current methodologies. Second, fairness testing itself faces a mathematical constraint: satisfying statistical parity, equalized odds, and calibration simultaneously is provably impossible, forcing organisations to choose trade-offs without regulatory clarity on which metric matters. Evaluation methodology has shifted from one-shot audits to continuous testing (CI/CD-integrated red-teaming now emerging as standard), and testing scope has broadened from models to agents (tool-level attacks, multi-agent emergent bias). The hard problems remain organisational: usability friction, ROI demonstration, practitioner knowledge gaps, and bridging the measurement-practice divide in high-stakes domains like hiring, healthcare, and emergency response.

CURRENT LANDSCAPE

Regulatory enforcement is accelerating adversarial testing adoption. The EU AI Act (August 2, 2026 enforcement for Annex III high-risk systems; Article 15 mandates resistance to adversarial input manipulation) converges with NIST AI RMF (Measure 2.5–2.7 for diversity and robustness testing), White House voluntary commitments, and emerging US state law (Connecticut SB 5, effective October 2026, explicitly requires "adversarial resilience testing" for automated employment tools). Frontier lab red-teaming programmes have operationalized from research to governance: Anthropic's Frontier Red Team (15 researchers, policy division) documents constitutional classifiers blocking 95% of jailbreaks; Microsoft's 12-month red-teaming engagement discovered seven new agentic failure modes (supply chain compromise, goal hijacking, visual attacks on computer-use agents, session contamination); OpenAI and Google maintain parallel programmes with $35-55K bug bounty allocations. Methodological advances include ICML 2026 work on adversarial training preventing catastrophic overfitting (SORA), semantic-level vulnerability discovery revealing model-specific attack profiles (MAP-Elites framework across 4 frontier models), and rigorous evaluation standards for detectors (Gate AI: 16 benchmarks, 12,111 samples with cross-validation to eliminate threshold tuning artifacts). Testing scope broadened from models to agents: distinct tool-level attack surface (MCP/plugin compromise, data-level prompt injection) requires synthetic-world methodology (intercepting tool calls to test without rebuilding architecture). Agent evaluation frameworks emerged: Microsoft ASSERT (80-90% human annotator agreement) and responsible AI red-teaming guides distinguish fairness testing from security testing.

Yet critical gaps persist. Regulatory clarity on fairness metrics lags requirement: the mathematical impossibility of satisfying statistical parity, equalized odds, and calibration simultaneously remains unresolved in EU AI Act Article 10, forcing organisations to choose trade-offs without guidance. Fairness testing in multi-agent systems has no regulatory framework (emergent bias from agent interactions falls outside coverage entirely). Practitioner research documents consistent ML engineering agent failures on fairness constraints despite explicit prompting, and clinical fairness research reveals only 1 of 63 fairness metrics actually measures patient outcome benefit. Attack success rates remain stubbornly high (86-97% across vectors, with single-turn testing inadequate proxy for iterative resilience: Cisco research showed 7.89%-88.30% multi-turn vs. 2.19%-64.91% single-turn ASR across cohort). Organisational adoption barriers persist despite regulatory deadlines: usability friction, ROI demonstration challenges, measurement-practice divergence, and integration overhead remain primary constraints in high-stakes domains (hiring, insurance, emergency dispatch).

TIER HISTORY

ResearchJan-2019 → Jan-2020
Bleeding EdgeJan-2020 → present

EVIDENCE (25)

— Technical analysis distinguishing agent-level testing (tool-based attacks) from model-level testing; proposes synthetic-world methodology intercepting tool calls to test real agent behavior without rebuilding architecture.

— Regulatory synthesis showing convergence across EU AI Act Article 9, NIST AI RMF, and White House executive order on adversarial testing as baseline compliance requirement; explicitly covers bias and discrimination testing as legal obligation.

— Microsoft AI Red Team operational research documenting 7 new agentic failure modes (supply chain compromise, goal hijacking, visual attacks, context contamination) discovered through 12 months of red-teaming engagements.

— Critical analysis of fairness testing gaps in EU AI Act: mathematical impossibility of satisfying all three fairness metrics simultaneously, inability to audit emergent bias in multi-agent systems, and delayed harmonized standards. Negative signal on regulatory clarity.

— Empirical study finding ML engineering agents consistently underperform manual baselines in both predictive quality and fairness despite fairness-oriented prompts, revealing evaluation gaps in automated ML systems.

— ICML 2026 paper addressing catastrophic overfitting in efficient adversarial training; proposes PertAlign metric and SORA adaptive method achieving state-of-the-art robustness across datasets and architectures with single hyperparameter.

— MAP-Elites framework discovers model-specific semantic-level vulnerabilities across 4 frontier models (GPT-4o-mini, Claude Sonnet, Gemini 2.0, Llama); reveals distinct attack profiles and interpretable strategies vs. token gibberish.

— Practitioner guide establishing bias and discrimination as explicit harm category in frontier lab red teaming; identifies public benchmark evaluation awareness (19.8% vs. 2.0% private) as critical signal for testing methodology.

HISTORY

  • 2019: Fairness and adversarial robustness testing emerges from academic research with early toolkit releases (Fairlearn, IBM AI Fairness 360) and government stakeholder engagement. Real-world deployments reveal significant fairness gaps despite mitigation efforts; practitioner surveys document gaps between research and industrial practice.

  • 2020: Vendor toolkit ecosystem matures with production-grade fairness assessment and mitigation systems. High-profile deployment failures in medicine, content moderation, and examinations highlight critical limitations in bias and fairness testing despite toolkit availability. Adversarial robustness research advances techniques but reveals 35%+ of defenses vulnerable to novel attack patterns. Data diversity and test coverage emerge as primary barriers to systematic fairness deployment.

  • 2021: Vendor tooling consolidation continues (IBM AI Fairness 360 GA, Meta Casual Conversations dataset for fairness evaluation). Research shows both progress (scalable adversarial testing achieving 57%+ discrimination reduction, individual fairness algorithms) and critical gaps (NeurIPS Adversarial GLUE benchmark reveals 90% of attack methods invalid, all models fail robustness). Emerging risks highlighted: fairness audit manipulation ("D-hacking") exposes regulatory vulnerabilities. Healthcare deployments demonstrate efficacy and persistent data diversity gaps. Organizational adoption of fairness toolkits remains inconsistent despite vendor maturity.

  • 2022-H1: Vendor tooling advances (Meta releases 500+ demographic term datasets for NLP fairness, IBM deploys fairness testing to 10M-impression advertising campaigns). Adversarial robustness research achieves 10x speedup in evaluation but reveals adaptive defenses do not improve over static methods. Critical practitioner research (ACM FAccT) shows fairness is a socio-technical challenge; bias detection methodology review exposes implementation inconsistencies. Real-world advertising deployment demonstrates bias identification but incomplete mitigation; organizational adoption gaps persist.

  • 2022-H2: Adversarial training methodology advances (IBM CAT algorithm), but critical research shows adversarial evaluation methods themselves introduce fairness risks and unstable rankings. Comprehensive field survey identifies 100+ fairness testing papers but field methodology reveals inconsistencies. Real-world documented failures (Amazon recruitment, COMPAS, Apple Card, healthcare) expose incomplete bias testing despite toolkit maturity. Governance challenges surface: internal AI ethics monitoring questioned after high-profile exit from major vendor. Research connects fairness interventions to distribution-shift robustness. Organizational adoption barriers persist despite vendor maturity and research momentum.

  • 2023-H1: Vendor tooling continues consolidation (Fairlearn sociotechnical integration, Meta VRS for housing ad equity, Holistic AI library release). Government backing accelerates: NIH NCATS awards $700K for clinical bias detection tools, attracting 200+ participants and signaling high-stakes institutional adoption. Critical research reveals fundamental methodology flaws: adversarial training reduces robust accuracy in small-sample regimes (ICLR 2023); vision-language models exhibit high vulnerability to adversarial evasion (NeurIPS 2023). Fairness remains socio-technical with persistent barriers in organizational integration despite advancing tooling and government momentum.

  • 2023-H2: Vendor ecosystem consolidation accelerates: Microsoft transitions fairness dashboards to unified Responsible AI platform; open-source toolkit expansion (mlr3fairness for R community). Real-world deployment cases emerge: UK government documents Advai's adversarial stress-testing across multiple sectors; Meta VRS achieves equitable housing ad distribution with privacy-enhanced auditing. Critical research exposes robustness limitations: multimodal models (MiniGPT-4, LLaVA) show 60%+ adversarial vulnerability; robustness certification methods themselves reveal methodological constraints. Fairness-testing adoption remains hindered by socio-technical barriers and measurement inconsistency despite vendor product maturity and government institutional support.

  • 2024-Q1: Production deployment of adversarial + bias testing in generative AI (Adobe Firefly with three-tiered human impact assessment). Government-backed fairness tooling advances (CMU SEI AIR tool for DoD bias auditing). Critical research reveals methodological gaps: standard fairness benchmarks show zero correlation with realistic bias in deployed contexts; widely used clinical ML models fail to correct dataset bias; adversarial training robustness plateaus at fundamental limits (~90%) due to human perception constraints. Early evidence suggests practice is broadening to multimodal/T2I systems but encountering systematic testing limitations.

  • 2024-Q2: Ecosystem maturity continues with specialized fairness tools (FairX benchmarking toolkit, DispaRisk proactive risk assessment framework). Adversarial testing methodology research advances (RL-based autonomous driving robustness evaluation at ICST 2024). Critical signals on fundamental limitations persist: peer review of IEEE S&P 2024 defense papers uncovers mathematically impossible claims and systematic code bugs, revealing systemic failures in robustness evaluation; superhuman Go AI research demonstrates that even in narrow domains, current defenses fail against newly trained adversaries; ICML scaling law analysis confirms robustness plateaus at ~90% with human performance as theoretical ceiling.

  • 2024-Q3: Government-backed fairness tools reach deployment maturity with UK government publishing FairNow methodology for conversational AI bias assessment; UK DWP conducts fairness impact assessment on fraud detection algorithm. Critical research reinforces limitations: ICML paper demonstrates adversarial robustness scaling laws plateau at ~90% due to imperceptible perturbations becoming invalid images; research identifies gaps between GenAI fairness assessment methods and regulatory goals with case studies of discriminatory deployed systems; usability study reveals fairness testing tools remain difficult to adopt despite ecosystem growth. Tooling ecosystem expands with Miami University's open-source AiR-TK robustness testing kit release.

  • 2024-Q4: Vendor ecosystem consolidation accelerates with Google publishing official adversarial testing guides for generative AI and specialized tooling emergence (BiasAlert for LLM bias detection, ViLBias for multimodal bias with 40,945 annotated pairs). Academic research advances attack and defense methodologies (NeurIPS CAA for tabular models, uncertainty-aware adversarial training) while confirming 90% robustness ceiling. Government deployment expands: UK FairNow reaches scale for conversational AI fairness assessment. However, organizational adoption barriers intensify: AI project deployment declined from 55.5% (2021) to 47.4% (2024) with ROI challenges; practitioner research reveals persistent usability and integration gaps despite vendor maturity. Practice broadens to multimodal systems but encounters fundamental technical constraints and organizational ROI limitations.

  • 2025-Q1: Vendor ecosystem expands into commercial SaaS (FairPlay fairness-as-a-service) and inference-time methodologies emerge (OpenAI compute scaling approach). Critical gap analysis reveals enterprise testing practices severely lag toolkit availability: 72% of organizations report inadequate evaluation protocols, 54% of deployed systems have undetected biases, 30% fail operationally despite lab validation. Real-world deployment audits (RisCanvi criminal justice tool) uncover transparency and fairness flaws. Year-over-year adoption of adversarial testing practices increasing (BSIMM15 report) but constrained by measurement-practice divergence and persistent organizational adoption barriers despite methodological and tooling advances.

  • 2025-Q2: Vendor governance integration advances (IBM-Fairly AI partnership for compliance mapping and automated red-teaming). Adversarial testing methodologies expand with frameworks for agent robustness (RedTeamCUA with 42-60% attack success rates) and targeted attacks on tabular models (sigma-binary achieving 90%+ success with minimal perturbations). Domain-specific gaps surface: healthcare fairness research identifies scarcity of clinical AI fairness assessment studies and disconnect between regulatory requirements and assessment methods; criminal justice analysis (COMPAS) documents how AI exacerbates racial bias despite fairness testing. Auditing methodology research exposes systemic weaknesses: current approaches predominantly technical, overwhelmingly one-shot assessments, scarce community participation. Adversarial defense evaluation itself compromised with 90%+ success rates across malware detection and other domains, signaling fundamental brittleness in evaluation methodologies themselves.

  • 2025-Q3: Vendor governance integration advances with specialized commercial offerings (Qualitest adversarial red-team testing service). Fairness testing expands to emerging architectures (RAG/SLMs) but reveals persistent bias vulnerabilities; metamorphic testing identifies one-third breakage rates on demographic perturbations. Practitioner research documents continuing knowledge gaps despite tooling maturity (inconsistent practices, fairness deprioritized). Critical research exposes methodological brittleness: audit study evaluations reveal fairness interventions exhibiting ~10% disparity despite metric parity; adversarial robustness evaluation inconsistencies documented in AccuracyBench framework. Hiring domain evidence shows significant discrimination in Fortune 500-scale AI adoption (98.4% of Fortune 500 using AI hiring). However, testing remains socio-technical challenge with adoption constrained by organizational barriers, practitioner knowledge gaps, and measurement-practice divergence.

  • 2025-Q4: Adversarial testing tooling consolidates with research frameworks addressing fragmentation (AdversariaLLM: 12 attacks, 7 benchmarks with 28% success-rate improvements; AdvERSEM for semantic-level robustness testing). Bias testing benchmarking advances with BiasFreeBench comparing eight mitigation techniques across LLMs. Critical research signals stagnation: Phare V2 benchmark finds no correlation between model capability and bias resistance, with newer models underperforming 1.5-year-old baselines on jailbreak defense—contradicting hardware-scaling narratives. Auditing tools expand with standardized bias detection protocols and datasets. However, foundational tension deepens: tooling ecosystem expands yet real-world deployments remain inadequately tested; testing methodologies advance while fundamental limits (90% robustness ceiling, imperceptible perturbations, defense brittleness) persist; adversarial testing necessity recognized but adoption barriers (usability, integration friction, ROI demonstration) continue constraining deployment.

  • 2026-Feb: Vendor ecosystem accelerates product innovation with consent-driven fairness benchmarking (Sony FHIBE with Nature publication, 10,000+ cross-cultural images) and commercial auditing platform launch (Paritas with regulatory compliance mapping). Red-teaming market growth signals intensify ($1.43B in 2024, $4.8B projected by 2029) with autonomous agent frameworks and culturally adaptive methodologies emerging. Research documents persistent limitations: empirical fairness audits of open LLMs succeed methodologically but organizational deployment discipline remains uneven; adversarial robustness benchmarks for object detection reveal transformer architecture transferability gaps; attack success rates persist at high levels (89.6% roleplay, 97% multi-turn). Regulatory adoption accelerates with Colorado AI Act compliance signals in employment auditing. Critical organizational gaps persist: 72% report inadequate protocols, 54% of deployed systems contain undetected biases despite toolkit maturity. Foundational tension deepens: regulatory mandates and vendor products proliferate while real-world testing adequacy and organizational adoption remain fundamentally constrained by usability, integration friction, ROI demonstration barriers, and measurement-practice divergence.

  • 2026-Apr: Regulatory enforcement escalates: Massachusetts AG secures $2.5M settlement against a student loan company for AI model disparate impact, and CFPB establishes disparate impact testing (AIR, regression, matched-pair) as an operationalized compliance requirement for all AI-driven lending decisions. Market consolidation accelerates with Red Hat's acquisition of Chatterbox Labs' AIMI platform as adversarial ML market grows from $1.64B (2025) to $2.09B (2026, 28% CAGR). A University of Washington study of 528 recruiter-LLM pairs found recruiters mirrored biased AI in ~90% of severe-bias cases despite human oversight, demonstrating that human-in-the-loop review without systematic measurement provides insufficient fairness protection. Adversarial defense tooling advances at scale: Anthropic's Constitutional Classifiers reduced jailbreak attack success from 86% to 4.4%, and an industry-wide leaderboard now benchmarks 14 frontier models across five standardized adversarial suites; Eightfold's independently audited bias assessment across 29M+ candidate records passed all three audit categories including intersectional analysis, setting a public compliance benchmark for employment AI.

  • 2026-May: Methodological breadth expands across high-stakes domains: a multi-domain red-teaming framework evaluated 11 medical LLMs across 690 clinically grounded scenarios, and a bias audit of emergency police dispatch tested 11 frontier LLMs across 19,800 cases in 15 scenarios and multiple languages, finding demographic bias amplified up to 2x in Mandarin. Specification-following audits using adversarial multi-turn scenarios quantified generational improvement in model compliance (Claude: 15%→2% violation rate), and an algorithmic fairness testing study of 34 real auto insurers found all failed demographic parity with 16 exhibiting statistically significant disparate impact, validating statistical fairness testing as a production compliance methodology.

  • 2026-Jun: Regulatory convergence on adversarial testing as a baseline compliance requirement sharpened: EU AI Act Article 9, NIST AI RMF, and White House executive order now jointly mandate bias and discrimination testing as legal obligation, while Microsoft's AI Red Team published a taxonomy of seven new agentic failure modes (supply chain compromise, goal hijacking, visual attacks, context contamination) from 12 months of operational engagements. A methodological shift from model-level to agent-level red-teaming gained traction, with synthetic-world approaches intercepting tool calls to test real agent behavior; simultaneously, empirical research confirmed ML engineering agents consistently underperform manual baselines on fairness constraints despite explicit fairness prompting, and critics documented the mathematical impossibility of satisfying all three EU AI Act fairness metrics simultaneously—leaving regulatory compliance criteria unresolved.