Perly Consulting │ Beck Eco

The State of Play

A living index of AI adoption across industries — where established practice meets the bleeding edge
UPDATED DAILY

The AI landscape doesn't move in one direction — it lurches. Some techniques leap from experiment to table stakes in a single quarter; others stall against regulatory walls, technical ceilings, or organisational inertia that no amount of hype can dislodge. Knowing which is which is the hard part. The State of Play cuts through the noise with a rigorously maintained index of AI techniques across every major business domain — classified by maturity, evidenced by real-world adoption, and updated daily so you always know where you stand relative to the field. Stop guessing. Start knowing.

The Daily Dispatch

A daily newsletter distilling the past two weeks of movement in a domain or two — delivered to your inbox while the index updates in the background.

AI Maturity by Domain

Each dot marks the weighted maturity of practices within a domain — hover for a brief summary, click for more detail

DOMAIN
BLEEDING EDGE ←→ ESTABLISHED

Synthetic data generation

LEADING EDGE

TRAJECTORY

Stalled

AI that generates realistic but artificial datasets for testing, training, and privacy-preserving data sharing. Includes tabular, text, and image synthetic data; distinct from data augmentation which modifies real data rather than generating from scratch.

OVERVIEW

Synthetic data generation promises to break the deadlock between data access and data privacy, but after seven years of development the practice remains experimental, with production use confined to a handful of high-governance verticals. The core idea (generating artificial datasets that preserve the statistical properties of real data across tabular, text, and image domains) has attracted substantial vendor investment and regulatory attention. NVIDIA's acquisition of Gretel and Microsoft's integration of synthetic data into Phi-4 training signal genuine commercial confidence. Yet independent research from EPFL and Max Planck has formalised hard limits: for many use cases the trade-off between fidelity and privacy cannot be overcome algorithmically. Vendor consolidation reinforces the caution; multiple funded startups have shut down or been acqui-hired, and surviving companies are pivoting toward platform embedding rather than standalone tools. Where synthetic data works (fraud detection in banking, clinical trial augmentation in pharma, QA in regulated software), it works within tightly bounded conditions. Broader enterprise scaling remains blocked not by a lack of tooling but by unresolved privacy-validation standards, relational-data quality gaps, and a growing model-collapse risk that turns synthetic convenience into a systemic liability.
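The core idea is easy to sketch. Below is a toy, stdlib-only illustration of what a tabular synthesizer does in principle: fit a statistical model of the real columns (here two Gaussians plus their correlation) and sample entirely new rows from the fit. This is also what separates the practice from augmentation, which perturbs real rows rather than generating from scratch. Production tools such as CTGAN learn far richer deep generative models; everything named here is illustrative, not any vendor's method.

```python
import random
import statistics

def fit_and_sample(real_rows, n_samples, seed=0):
    """Fit per-column Gaussians plus the pairwise correlation of two numeric
    columns, then sample brand-new rows that preserve those statistics.
    A toy stand-in for what tabular synthesizers do at scale."""
    rng = random.Random(seed)
    xs = [r[0] for r in real_rows]
    ys = [r[1] for r in real_rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample Pearson correlation of the two columns
    n = len(real_rows)
    rho = sum((x - mx) * (y - my) for x, y in real_rows) / ((n - 1) * sx * sy)
    synthetic = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Correlated standard normals via the 2x2 Cholesky factor
        x = mx + sx * z1
        y = my + sy * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)
        synthetic.append((x, y))
    return synthetic

# Hypothetical "real" table: income roughly tracks age, with some noise
real = [(30 + i, 40_000 + 900 * i + 500 * ((i * 7) % 5)) for i in range(40)]
fake = fit_and_sample(real, 1000)
```

No synthetic row is a copy of a real one, yet the generated table reproduces the means, spreads, and the age-income correlation of the original; the privacy and fidelity questions discussed throughout this entry are about how far that property can be pushed on real, messy, relational data.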

TIER HISTORY

Research: Jan-2019 → Jan-2021
Bleeding Edge: Jan-2021 → May-2026
Leading Edge: May-2026 → present

EVIDENCE (116)

— UK Financial Conduct Authority multi-stakeholder project deploying fully synthetic datasets with money laundering typologies for AML testing and innovation compliance.

— Authoritative EU regulatory framework defining synthetic data techniques, use cases, governance requirements, and privacy/fairness implications, issued by a data protection authority.

— FDA framework and regulatory guidance on synthetic data and digital twin acceptance in medical device validation with deployment pathways by domain.

— CHIMERA framework demonstrating high-quality synthetic data (9K samples) outperforms larger models on reasoning tasks via data-centric design; validates quality-over-scale paradigm.

— Google Simula framework achieving independent control over quality, diversity, and complexity in synthetic data; signals major platform vendor confidence in practice maturity.

— Healthcare synthetic data market growth from $658M (2025) to $5.88B (2033) at 31.5% CAGR; adoption in clinical trials, AI training, and privacy-preserving analytics.

— Qualitest + Synthesized platform deployment at multi-billion-dollar insurer: 60% faster test data production, 28M+ rows secured, 100% referential integrity, zero security waivers; full enterprise production deployment.

— Named practitioner panel (Laurion Capital, T. Rowe Price, Jupiter Research Capital) on synthetic data in quantitative finance; documents operational deployment with clear scope limitations and ontology bias framing.

HISTORY

  • 2019: CTGAN (Conditional Tabular GAN) published at NeurIPS as foundational tabular synthetic data method; open-source ecosystem emerging via Synthetic Data Vault project; privacy-preserving variants (DP-auto-GAN) researched; multi-domain adoption signals (education systems, DoD funding) indicate growing interest but limited production deployment.
  • 2021: Microsoft released SmartNoise with DP-CTGAN and PATE-CTGAN synthesizers, bringing differential privacy to mainstream tooling. First real-world deployments were validated in genomics (Gretel.ai + Illumina GWAS replication) and climate science (62% accuracy improvement in ML emulators). Regulatory attention peaked with the EDPS convening 170+ experts on privacy-utility trade-offs. Production quality concerns surfaced: community reports of out-of-bounds value generation and training inconsistency persisted, limiting broader adoption despite growing vendor and academic activity.
  • 2022-H1: Cloud vendors entered mainstream support: AWS launched synthetic image generation in SageMaker Ground Truth (June 2022). Enterprise adoption broadened to JPMorgan, John Deere, American Express in early pilots for fraud detection and model training. Specialized vendors scaled rapidly: MOSTLY AI raised $25M Series B funding; Gretel.ai expanded genomics deployments. However, critical research exposed fundamental limitations: tabular data benchmarks revealed evaluation metrics don't capture quality gaps; privacy auditing framework found synthetic data quality-leakage trade-off and differential privacy failures under membership inference attacks. Healthcare domain showed promise with GAN-based patient glucose data replication. Adoption remained strongest in privacy-critical and data-constrained domains despite growing evidence of quality and privacy risks.
  • 2022-H2: Production validation emerged in niche domains: healthcare systems deployed GAN-based glucose monitoring, industrial manufacturing used synthetic vision data achieving 15% improvement in sim-to-real gaps. Research on privacy-preserving recommendation systems and regulatory analysis (Canadian Privacy Commissioner) highlighted privacy-utility tensions as fundamental trade-offs rather than solvable problems. Gartner's 60% synthetic data prediction by 2024 contrasted sharply with growing skepticism about quality and privacy guarantees. By year-end, early adoption solidified in regulated domains (genomics, healthcare, manufacturing) but broader enterprise use remained blocked by quality inconsistency, privacy vulnerabilities, and lack of privacy-utility evaluation standards.
  • 2023-H1: Market analyst forecasts amplified hype (Gartner 60% by 2024, MarketsandMarkets $2.1B by 2028), but empirical deployment evidence revealed harsh tradeoffs. UK Financial Conduct Authority established Synthetic Data Expert Group, cautiously exploring use cases in financial services. ICML 2023 research documented 40+ concrete enterprise deployment challenges (generation, infrastructure, governance, compliance) and showed naive synthetic data approaches fail for minority classes, requiring ensemble methods. Critical new risk emerged: recursive training on synthetic data causes irreversible model degradation and distribution collapse. Adoption remained confined to privacy-critical domains despite growing market positioning; broader enterprise use blocked by unresolved governance and quality-utility frameworks.
  • 2023-H2: Tooling ecosystem matured with competing open-source frameworks (MOSTLY AI SDK 749 stars, hitsz-ids/SDG 2.4k stars) and ongoing SDV development. Regulatory pathways validated: research confirmed differentially private synthetic data could meet GDPR/CCPA standards. Empirical studies revealed contingent benefits—synthetic data enhanced models only with scarce real data but degraded performance with excessive use, requiring careful real-synthetic data orchestration. Model degradation risk ("Curse of Recursion") gained wider visibility through tech journalism, highlighting structural limits to training on synthetic outputs. By year-end, market forecasts of 60% adoption by 2024 contrasted with reality: enterprise deployments remained narrow (genomics, healthcare, early pilots); broader scaling blocked by unresolved governance, quality consistency, privacy-utility frameworks, and model degradation risks.
  • 2024-Q1: Government and vendor validation accelerated despite regulatory scrutiny. EU Digital Finance Platform deployed Synthesized's synthetic data for production data hub with JRC validation of distribution accuracy and confidentiality. Major vendors advanced integrations: Gretel partnered with Microsoft Azure and AWS for MLOps workflows; MOSTLY AI released v200 with enhanced generator architecture. However, critical legal analysis emerged: EU data protection official warned that GDPR compliance gaps persist during synthesis phases, with personal data processing risks and regulatory grey areas. Ethical discourse intensified around bias, privacy, and 2030 predictions of synthetic data dominance, highlighting governance challenges. By quarter-end, early regulatory adoption (EU financial services) coexisted with growing caution about compliance and ethical risks, suggesting adoption confined to high-stakes regulated domains with strong legal review.
  • 2024-Q2: Deployment evidence diversified across healthcare, vision, and text domains; cloud vendor integrations matured (Azure, AWS workflows). Oncology research validated synthetic data for survival analysis with structured comparative evaluation (CART methods achieving 88%-98% accuracy). Face recognition systems achieved competitive performance with synthetic training data (DCFace, GANDiffFace) enabling privacy preservation and bias reduction. LLM-driven approaches advanced: persona-guided synthetic surveys outperformed generic methods; large-scale synthetic text (OAK, 500M tokens) addressed data scarcity. However, fundamental technical barriers clarified: model collapse proven unavoidable with synthetic-only training, requiring careful real-synthetic mixing. Adoption remained bifurcated—regulated domains with governance capacity continued pilots; enterprise rollout blocked by GDPR compliance gaps and absence of harmonized quality standards. Adoption confined to privacy-critical and data-scarce domains despite vendor ecosystem maturation.
  • 2024-Q3: Cloud vendor ecosystem matured (Google Cloud integration), market growth projected to $1.788B by 2030. Regulatory frameworks developing (Singapore PDPC guidelines); institutional readiness surveys launched (UK Data Service). Critical research intensified: Stanford/Harvard documented model collapse at scale, Lausanne/EPFL revealed synthetic data unsuitable as real-data replacement, Hastings Center identified persistent privacy/accuracy/bias risks. Enterprise adoption confined to high-governance domains; broader rollout blocked by absence of standardized evaluation and fundamental technical barriers.
  • 2024-Q4: Vendor ecosystem continued to expand (Google Cloud BigQuery integration, MOSTLY AI text synthesis platform). Healthcare validation advanced (SYNTHEMA EU project for AML/SCD). However, on-the-ground adoption remained constrained: only 2% of enterprises were production-ready for GenAI, with 48% blocked by privacy/security concerns. Relational data synthesis still lacked fidelity; model collapse containment was proven theoretically but unresolved in practice. Analyst forecasts (75% by 2026) diverged sharply from enterprise readiness surveys; production adoption remained confined to high-governance domains requiring careful real-synthetic data mixing.
  • 2025-Q1: Research breakthrough on model collapse mitigation: ICLR 2025 studies showed theoretical path forward—maintaining constant real-data proportion in training loops prevents collapse and ensures convergence. However, adoption barriers hardened. Regulatory uncertainty deepened: Canadian legal analysis found ambiguity in privacy law treatment; privacy metric research showed technical validation insufficient for regulatory compliance. Legal scholarship identified systemic risk: synthetic data contamination could entrench incumbents with access to pre-2022 uncontaminated data. Healthcare domain validated utility for rare disease research. Cloud ecosystem deepened (Azure AI Foundry integration). Enterprise production adoption remained confined to high-governance domains; broader scaling blocked by unresolved regulatory frameworks and privacy-legal sufficiency gaps.
  • 2025-Q2: Vendor ecosystem expanded: AWS Bedrock synthetic data strategy (April), continued Azure/Google Cloud integrations. Model collapse research advanced to multi-modal systems (VLMs, diffusion models); ICML empirical work confirmed accumulation strategy prevents collapse. However, legal barriers intensified: June analysis documented re-identification studies (99.98% success), FTC/CPPA enforcement rising, revealing regulatory-legal gap—synthetic data claims insufficient for compliance. Image synthesis survey (USENIX Security) benchmarked privacy-utility tradeoffs. Healthcare domain narrowed to high-governance settings; fidelity-utility-privacy tradeoffs fundamentally incompatible. Enterprise adoption remained vertically narrow. Practice trajectory: visible technical progress on collapse mitigation, but structural adoption barriers—regulatory ambiguity, privacy validation insufficiency, relational synthesis quality limitations—blocking mainstream enterprise scaling.
  • 2025-Q3: Collapse prevention research matured with practical solutions: ICML 2025 papers (September) from University of Chicago (verifier-based filtering) and Google/USC (curation-based convergence) established theoretical and empirical pathways to iterative synthetic training without degradation. Real-world deployment signals expanded: Gretel internal case (July) achieved 10x experimentation velocity and 1000x training token reduction; UK government/healthcare orgs (Ministry of Justice, NHS England, DfE, ONS) piloted synthetic generation (August) in regulated domains. Cloud vendor confidence grew with AWS Bedrock enterprise templates (April, reiterated in context). However, adoption barriers remained: enterprise use confined to high-governance domains (genomics, pharma, finance, government); broader scaling blocked by regulatory ambiguity, absent privacy validation frameworks, and relational data synthesis limitations. Practice trajectory: technical maturation on collapse solutions and expanded real-world signals in constrained domains, but mainstream scaling blocked by legal-regulatory insufficiency rather than algorithmic gaps.
  • 2025-Q4: Deployment signals expanded beyond research into operational enterprise and government adoption. Financial services: MIT research cited >60% synthetic data use in AI applications (2024); banks deployed for QA and regulatory compliance (fraud detection, anti-money laundering, payment testing). Federal government: GDIT/AWS partnership for disability fraud detection PoC with agency validation of data fidelity. Pharmaceutical research: FDA/EMA joint guidance (January 2026), EHDS Regulation (March 2025), landmark PLOS Digital Health study (2025) validating synthetic data as external control arms in single-arm trials. Market research platforms operational: Qualtrics reported 90% satisfaction, 73% researcher adoption, 39% using as complete replacement. Vendor adoption forecasts: 75% of businesses expected to use GenAI for synthetic customer data by 2026 (from <5% in 2023). Adoption remained vertically concentrated (finance, pharma, government, market research, agentic AI) in data-scarce and privacy-critical domains. Practice trajectory: technical solutions maturing with operational deployments in bounded regulated domains, but mainstream enterprise scaling blocked by incomplete regulatory frameworks, relational data synthesis limitations, and absence of standardized privacy-utility validation frameworks acceptable to regulators.
  • 2026-Jan: Ecosystem crossed into mainstream market awareness with regulatory maturation (EDPB, NIST, FCA guidance), but real-world adoption remained concentrated in high-governance sectors. Gartner Peer Community surveys documented broad organizational awareness: 84% text, 54% image, 53% tabular adoption, concentrated in QA/compliance use cases. Critical new research: model collapse mitigation strategies (accumulation, verification-based filtering) proven effective; however, AI-generated content pollution accelerating (30-40% of web text AI-originated). Security research exposed a compliance-spoofing attack surface: fake synthetic audit reports (SOC 2, ISO 27001) can bypass auditors. Enterprise QA adoption confirmed practical barriers: complex systems require generators with the "right depth and realism" (UBS), and relational data synthesis quality remained fundamentally limited. Market research showed deepening adoption (39% using synthetic data as a complete replacement), but expert consensus warned it is unsuitable for capturing real-world behavioral change. Barriers: security attack surface expansion, content pollution, relational synthesis limitations, governance gaps. Adoption remained vertically concentrated (finance, pharma, government).
  • 2026-Feb: Market consolidation accelerated with ecosystem bifurcation: NVIDIA acquired Gretel, Scale AI reached $14B valuation, Microsoft integrated synthetic data in Phi-4 training; but independent research (EPFL, Max Planck) formalized fundamental limits showing many use cases are poor problem fits. Vendor ecosystem fragile—critical failures (Datagen shutdown after $70M, Synthesis AI dissolved, AI.Reverie acqui-hired) alongside surviving companies requiring platformization strategy. Deployment evidence domain-specific: Qualtrics fine-tuned synthetic model achieved 12x accuracy vs. GPT/Gemini on attitudinal surveys, but confined to trained use cases. New security risk: attackers generating convincing synthetic compliance reports to bypass audits. Model collapse solutions (accumulation, verification-filtering) validated in research but operational deployment remained limited. Relational synthesis fidelity fundamental blocker (UBS: "maintaining generators not straightforward"). AI-generated content pollution (30-40% of web text) creating systemic degradation risk. Enterprise adoption remained concentrated in high-governance verticals despite regulatory framework maturation (FDA/EMA pharma guidance, GDPR/DORA architectures).
  • 2026-Mar: Deployment and research signals continued to differentiate evidence quality. Healthcare market data (DataM Intelligence) projects sector growth from $657M (2025) to $5.88B (2033); McGill University neuro-oncology case study validates synthetic data enabling cross-institution collaboration on sensitive research. Statistical foundations strengthened: peer-reviewed research (Ahmad Abdel-Azim et al., Statistical Science) documented synthetic data pitfalls (model misspecification biases, attenuated uncertainty, generalization failures), while empirical privacy work (Tari/Iamnitchi) quantified privacy-fidelity trade-off (81% authorship attribution on real Instagram posts vs. 16.5-29.7% on synthetic). Critical assessment intensified: LGT analysis of "Habsburg AI" failure mode (model collapse compounding errors) and risk of data poisoning from small amounts of false/biased synthetic training data. Production maturity signals emerged: Ministry of Testing framework defines Four Dimensions validation for synthetic data (statistical fidelity, query pattern reproduction, system behavior replication, edge case coverage), indicating bleeding-edge practitioners moving from adoption to systematic validation. Institutional recognition of broader implications: ERC-funded large-scale social science project (York University SYNDATA) launched January 2026 to investigate societal/ethical consequences. Adoption remained vertically concentrated (healthcare, finance, pharma, government, market research) constrained by relational data synthesis quality limits and unresolved regulatory-legal frameworks despite technical progress on model collapse mitigation.
  • 2026-Apr: Deployment evidence expanded across test data, healthcare, finance, and agentic AI domains; research maturation continued. Qualitest + Synthesized delivered 60% faster test data production at multi-billion-dollar insurer with 100% referential integrity; Simsurveys validated market research deployment at scale with KL divergence benchmarks (0.039–0.006) enabling studies in minutes at 10x cost reduction; Verisma healthcare deployment demonstrated privacy-first QA model training on synthetic data only. Market expansion accelerated: test data generation $1.96B (2025) → $2.52B (2026, 28.3% CAGR); AI synthetic data $1.97B → $2.75B (40% CAGR) → $10.48B by 2030 (39.7% CAGR); healthcare-specific $657M → $5.88B (31.5% CAGR to 2033). Systematic review (101 papers) revealed evaluation method immaturity and negligible domain expert involvement (3.96%), indicating quality assurance remains bleeding-edge. Research advances on model collapse: RL-based synthetic data generation (Llama/Qwen frontier models) showed curriculum learning improves performance; mechanistic analysis confirmed accumulate paradigm prevents collapse mathematically (vs. replace paradigm). Critical assessments intensified: Stanford research documented synthetic data unsuitable for rare events/causal inference; policy analysis identified regulatory vacuum (GDPR lacks synthetic data clause), agentic feedback loop risks, and false fairness masking structural disparities; practitioner analysis quantified model collapse via replace paradigm and web contamination risk (74% of new content AI-generated). Adoption remained vertically concentrated (finance, pharma, healthcare, government, market research, agentic AI) in data-scarce and privacy-critical domains; mainstream enterprise scaling blocked by relational synthesis limitations, incomplete governance frameworks, and absent privacy-utility validation standards.
  • 2026-May: Regulatory frameworks hardened in high-governance sectors: the UK FCA published its Synthetic Data and Anti-Money Laundering project report deploying fully synthetic AML datasets for regulatory testing, and the EU Data Protection Supervisor issued authoritative guidance on synthetic data governance requirements and privacy implications. FDA pathways for medical device validation using synthetic data and digital twins received detailed analysis, confirming acceptance criteria by domain. On the research side, Google's Simula framework achieved independent control over quality, diversity, and complexity in synthetic generation—validating data-centric approaches over scale—and the CHIMERA framework demonstrated 9K high-quality synthetic samples outperforming larger models on reasoning tasks. Healthcare market growth projections ($658M to $5.88B by 2033 at 31.5% CAGR) reflect concentrated adoption in clinical trials and privacy-preserving analytics; fundamental limitations on rare events, causal inference, and relational synthesis continue to bound mainstream enterprise deployment.
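The "accumulate vs replace" distinction that recurs throughout the history above can be shown in a few lines. The sketch below is a toy simulation (a 1-D Gaussian standing in for a generative model), not any cited paper's experimental setup: replacing training data with each generation's synthetic output drives the fitted spread toward zero, while accumulating the original real data alongside synthetic samples keeps it stable.

```python
import random
import statistics

def iterate_training(mode, n=20, gens=600, seed=7):
    """Simulate iterative training on synthetic data with a 1-D Gaussian
    'model'. Each generation fits mean/stdev to its training set, then
    samples a fresh synthetic set from the fit.
      replace    -> train only on the previous synthetic set (collapse)
      accumulate -> always mix the original real data back in (stable)
    Returns the final spread; shrink toward 0 is the collapse signature."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]
    data = list(real)
    for _ in range(gens):
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        data = synthetic if mode == "replace" else real + synthetic
    return statistics.stdev(data)

collapsed = iterate_training("replace")     # spread decays toward zero
preserved = iterate_training("accumulate")  # real-data anchor keeps the spread
```

The mechanism is general: fitting loses a little tail information every generation, and without a fixed real-data anchor those losses compound. This is the same intuition behind the "maintain a constant real-data proportion" results credited above to ICLR/ICML 2025 work, reduced to a minimal demonstration.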
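Statistical-fidelity validation of the kind referenced above (the Four Dimensions framework, the Simsurveys KL-divergence benchmarks) typically starts with a distributional distance between real and synthetic columns. This is a minimal histogram-based sketch; the binning and smoothing choices are ours for illustration, not those of any cited benchmark.

```python
import math
from collections import Counter

def kl_divergence(real, synthetic, bins=10):
    """Histogram-based KL(real || synthetic) over a shared binning.
    Near 0 means the synthetic column closely matches the real one;
    large values flag a fidelity gap. Production validation suites
    layer many more metrics on top of checks like this."""
    lo = min(min(real), min(synthetic))
    hi = max(max(real), max(synthetic))
    width = (hi - lo) / bins or 1.0
    def hist(xs):
        counts = Counter(min(int((x - lo) / width), bins - 1) for x in xs)
        # Laplace smoothing so empty synthetic bins don't blow up the log
        return [(counts.get(b, 0) + 1) / (len(xs) + bins) for b in range(bins)]
    p, q = hist(real), hist(synthetic)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

real = [i / 100 for i in range(100)]           # uniform on [0, 1)
close = [i / 100 + 0.001 for i in range(100)]  # near-perfect synthetic copy
far = [i / 400 for i in range(100)]            # squeezed into [0, 0.25)
```

Calling `kl_divergence(real, close)` yields a value near zero while `kl_divergence(real, far)` is much larger, which is the shape of the published benchmark numbers: low divergence as evidence the generator reproduced the real distribution.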