AI Evaluation · June 23, 2026 · 6 min read

Synthetic Data Passes the PII Test. It May Still Corrupt Your Model Validation.

Synthetic data solves the data-access problem for PII-free model validation in BFSI, but it introduces a distributional fidelity risk that RBI examiners and internal audit are unlikely to overlook.

Key takeaways

Synthetic data removes PII exposure from model validation pipelines, but it introduces a second, less-examined risk: the synthetic dataset itself may not faithfully represent the real distribution it is meant to substitute.
LLM-prompt-driven and GAN-based synthesis both produce plausible-looking data; neither guarantees that population-level statistical structure — correlations, tail behaviour, class imbalances — is preserved without explicit measurement.
Differential privacy provides a mathematically grounded privacy guarantee, but a low epsilon value does not imply high distributional fidelity — the two properties are independent and must be evaluated separately.
TVD, KS-statistic, and PSI are the three practical metrics for quantifying whether a synthetic dataset is a valid substitute for real data in a model validation context; each captures a different type of divergence.
RBI Model Risk Guidelines require that validation be conducted on data that is representative of live conditions; if the synthetic dataset is never tested for fidelity, the validation evidence is structurally incomplete and audit-exposed.

Why PII Exposure Breaks Model Validation Pipelines

Credit-scoring and fraud-detection models at Indian BFSI firms live on transaction histories, bureau data, and behavioural sequences that are almost entirely composed of personal data as defined under the Digital Personal Data Protection Act 2023. The moment a Model Risk team pulls a validation dataset from a production data store — even into a sandboxed environment — it triggers data-handling obligations that most organisations have not yet operationalised cleanly. Consent traceability, purpose limitation, and data-minimisation requirements under DPDP do not disappear because the dataset is labelled internal. They apply at the point of processing, which a model validation run plainly is.

The downstream consequence is practical and immediate. Validation teams either operate on samples so aggressively anonymised that they are no longer statistically representative, or they accept informal risk positions that no one has formally signed off. Neither option is defensible once an RBI examiner asks for the data lineage behind a validation report. Synthetic data entered this conversation as the structural fix: generate a statistically equivalent dataset with no real individuals in it, and the PII problem is eliminated at the source. That framing is correct as far as it goes. The problem is that it stops exactly where the harder question begins.

How LLM-Prompt-Driven and GAN-Based Synthesis Work Mechanically

Two synthesis approaches dominate enterprise practice. Generative Adversarial Networks train a generator and a discriminator in an adversarial loop: the generator learns to produce samples that the discriminator cannot distinguish from real ones. If training converges well, the generator approximates the joint probability distribution of the training data, although convergence is not guaranteed and failure modes such as mode collapse can leave rare segments under-represented. For tabular financial data (loan applications, transaction records, bureau features), CTGAN, a conditional GAN, extends this to mixed-type columns and conditional relationships, and TVAE offers a variational-autoencoder alternative for the same kind of data, so that synthetic obligors can reflect realistic correlations between income band, credit utilisation, and default probability.
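
For reference, the adversarial loop in the original GAN formulation can be written as the minimax game below, where G is the generator, D the discriminator, p_data the real data distribution, and p_z the noise prior. This is the generic textbook objective, shown as an assumption about what "adversarial loop" means here; tabular generators typically modify the loss and add conditioning, so it should not be read as the objective of any specific product.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```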

LLM-prompt-driven generation works differently. A large language model is given a schema, a statistical profile of the real dataset, and sometimes a small number of real examples as context, then prompted to produce synthetic rows that conform to specified distributional parameters. This is faster to configure and does not require training a bespoke generative model, but the fidelity guarantee is weaker: the LLM is pattern-matching against its training distribution and the prompt constraints, not directly optimising against the real data's joint distribution. For structured tabular data with complex conditional dependencies — exactly the type that drives credit and fraud models — this matters a great deal.
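
A minimal sketch of why matching a statistical profile is not the same as matching the joint distribution. The data is made up, and the "generator" is a deliberately crude stand-in (an independent shuffle of one column), not an actual LLM pipeline: it reproduces each column's marginal exactly while discarding the dependence between columns.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)

# Hypothetical "real" data: credit utilisation falls as income rises.
income = [rng.gauss(50, 15) for _ in range(5000)]
utilisation = [0.9 - 0.006 * x + rng.gauss(0, 0.05) for x in income]

# Stand-in generator: keeps every column's marginal distribution intact
# but ignores the relationship between columns.
synth_income = income[:]
synth_utilisation = utilisation[:]
rng.shuffle(synth_utilisation)

real_corr = pearson(income, utilisation)
synth_corr = pearson(synth_income, synth_utilisation)
marginals_identical = sorted(utilisation) == sorted(synth_utilisation)

print(f"real corr={real_corr:.2f}, synthetic corr={synth_corr:.2f}, "
      f"marginals identical={marginals_identical}")
```

Every per-column check passes on this synthetic set, yet the income-utilisation relationship a credit model depends on is gone, which is why fidelity needs to be tested at the joint level as well as column by column.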

How Differential Privacy Quantifies the Protection

Differential privacy provides a formal mathematical guarantee about information leakage. The core idea is that the output of a computation — including the trained parameters of a generative model — should be approximately the same whether or not any single individual's record is included in the input dataset. The epsilon parameter quantifies how much the output is permitted to vary: a lower epsilon means a stronger privacy guarantee. In practice, achieving a low epsilon requires injecting calibrated noise into the training process, typically through the DP-SGD optimiser applied during GAN training.
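
Stated formally, a randomised mechanism M (here, the training procedure that outputs the generative model) satisfies epsilon-differential privacy if the inequality below holds for every pair of datasets D and D' that differ in one individual's record and for every set S of possible outputs. The symbols are the standard ones rather than anything defined in this article; in practice DP-SGD is usually reported in the relaxed (epsilon, delta) form, which adds a small delta to the right-hand side.

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[M(D') \in S\bigr]
```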

The critical point for BFSI model risk managers is that differential privacy and distributional fidelity are separate properties, and a guarantee on one certifies nothing about the other. A model trained with epsilon equal to one provides a rigorous privacy bound. It says nothing about whether the synthetic data it produces accurately reflects the marginal distributions of income, default rates, or fraud prevalence in the real population. You can have high privacy and low fidelity, or low privacy and high fidelity, or any other combination. If anything, the noise required to reach a low epsilon tends to erode fidelity, most visibly in sparse segments such as fraud cases. Conflating the two is the most common and consequential error in how synthetic data programmes are scoped internally. The privacy team signs off on epsilon; the model validation team assumes fidelity has been established; neither checks the other's assumption.
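
A minimal sketch of why an epsilon sign-off does not report fidelity. It uses a much simpler mechanism than DP-SGD on a GAN (the Laplace mechanism on a four-bin histogram with made-up counts), so treat it as an illustration under those assumptions: the same release procedure yields very different fidelity depending on epsilon, and only a measurement reveals how much.

```python
import random

def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def laplace_noise(rng, scale):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_histogram(counts, epsilon, rng):
    """Laplace mechanism on bin counts (sensitivity 1), clamped and renormalised."""
    noisy = [max(c + laplace_noise(rng, 1 / epsilon), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0:
        return [1 / len(counts)] * len(counts)
    return [c / total for c in noisy]

rng = random.Random(1)
counts = [9000, 800, 150, 50]  # hypothetical bins; the last is a rare segment
true_p = [c / sum(counts) for c in counts]

mean_tvd = {}
for epsilon in (0.01, 1.0, 10.0):
    runs = [tvd(true_p, dp_histogram(counts, epsilon, rng)) for _ in range(200)]
    mean_tvd[epsilon] = sum(runs) / len(runs)
    print(f"epsilon={epsilon}: mean TVD={mean_tvd[epsilon]:.4f}")
```

The rare bin is where the damage concentrates: noise that is negligible against 9,000 records can swamp a bin of 50.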

How TVD, KS-Statistic, and PSI Evaluate Synthetic Data Quality

Distributional fidelity must be measured explicitly, using metrics that capture different aspects of divergence between the real and synthetic populations. Three are operationally important in a model validation context.

Total Variation Distance measures the largest difference in probability that the real and synthetic distributions assign to any event, that is, any set of outcomes for a given feature. For a categorical or binned feature it equals half the sum of the absolute differences between the two sets of category proportions. It is bounded between zero and one and is interpretable as the largest probability mass that the two distributions disagree on. A TVD below 0.1 on categorical credit-bureau features, or on continuous ones after binning, is a reasonable starting threshold, though the acceptable bound should be calibrated to the feature's predictive weight in the downstream model.
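
A minimal sketch of TVD for a single categorical column, computed as half the sum of absolute differences between category proportions. The column values and counts are invented for illustration.

```python
from collections import Counter

def total_variation_distance(real, synthetic):
    """TVD between the empirical distributions of two categorical samples."""
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)  # a category missing from one side counts as 0
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)

# Hypothetical employment-type column: 60/30/10 in real, 70/25/5 in synthetic.
real = ["salaried"] * 60 + ["self_employed"] * 30 + ["retired"] * 10
synthetic = ["salaried"] * 70 + ["self_employed"] * 25 + ["retired"] * 5

tvd_value = total_variation_distance(real, synthetic)
print(round(tvd_value, 3))
```

Here the proportions differ by 0.10, 0.05, and 0.05, so the TVD is 0.1, right at the starting threshold mentioned above.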

The Kolmogorov-Smirnov statistic compares the empirical cumulative distribution functions of a real and a synthetic column and reports the largest vertical gap between them. It is most sensitive to differences around the centre of the distribution and comparatively insensitive in the tails, the low-probability, high-consequence region where fraud signals and extreme credit events concentrate. Because fraud detection models are trying to learn from precisely these tail events, a KS-statistic that looks acceptable for the column as a whole can mask serious fidelity failure in exactly the segment that matters most. Pair it with an explicit comparison of upper quantiles or tail-segment counts.
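
A minimal sketch of how a KS-statistic that looks acceptable can still hide a missing tail. The transaction amounts are simulated, and the `quantile` helper is a simple illustrative implementation: the synthetic column matches the body of the real distribution but drops the 1% of extreme values entirely.

```python
import random
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest vertical gap between the two ECDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in a + b:  # the largest gap is always attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def quantile(sample, q):
    """Simple empirical quantile (no interpolation)."""
    s = sorted(sample)
    return s[min(int(q * len(s)), len(s) - 1)]

rng = random.Random(42)
body = [rng.lognormvariate(7, 0.5) for _ in range(9900)]

# Real: 1% extreme amounts. Synthetic: same body, extreme tail missing.
real = body + [rng.uniform(50_000, 200_000) for _ in range(100)]
synthetic = body + [rng.lognormvariate(7, 0.5) for _ in range(100)]

ks = ks_statistic(real, synthetic)
real_p995 = quantile(real, 0.995)
synth_p995 = quantile(synthetic, 0.995)
print(f"KS={ks:.3f}, real p99.5={real_p995:,.0f}, synthetic p99.5={synth_p995:,.0f}")
```

The KS-statistic stays near 0.01 because only 1% of the mass moved, while the 99.5th percentile differs by more than an order of magnitude between the two columns.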

Population Stability Index was developed for credit model monitoring and measures whether the distribution of a variable has shifted between two populations, using a symmetric relative-entropy formulation. PSI below 0.1 is conventionally stable, 0.1 to 0.2 requires investigation, above 0.2 indicates a significant shift. Applying PSI to compare real training data against synthetic training data gives model risk teams a metric that is already embedded in their RBI-aligned monitoring vocabulary, which simplifies internal audit conversations considerably.
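
A minimal sketch of PSI in its usual binned form, the sum over bins of (actual share minus expected share) times the log of their ratio. The decile binning on the real sample, the floor for empty bins, and the simulated credit scores are all assumptions made for illustration rather than a prescribed method.

```python
import math
import random
from bisect import bisect_right

def psi(expected, actual, bins=10, floor=1e-6):
    """PSI of `actual` against `expected`, with bin edges at `expected`'s quantiles."""
    ref = sorted(expected)
    # Interior cut points at the reference sample's quantiles (deciles when bins=10).
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            counts[bisect_right(edges, x)] += 1
        # Floor empty bins so the logarithm stays defined.
        return [max(c / len(sample), floor) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p, q))

rng = random.Random(7)
real = [rng.gauss(650, 60) for _ in range(5000)]      # hypothetical bureau scores
faithful = [rng.gauss(650, 60) for _ in range(5000)]  # same distribution
shifted = [rng.gauss(700, 60) for _ in range(5000)]   # mean drifted upward

psi_faithful = psi(real, faithful)
psi_shifted = psi(real, shifted)
print(f"faithful PSI={psi_faithful:.3f}, shifted PSI={psi_shifted:.3f}")
```

Against the conventional bands quoted above, the faithful sample lands in the stable range and the shifted one well past the significant-shift threshold.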

How This Maps to Audit-Ready Validation Under RBI and DPDP Act

RBI's Model Risk Guidelines, aligned in principle with the Federal Reserve's SR 11-7 framework, require that model validation be conducted on data that is representative of the conditions under which the model will operate. The guidelines call for independent validation, documentation of data quality, and evidence that the validation process itself is not compromised by input data deficiencies. If a validation team substitutes synthetic data for real data but produces no evidence that the synthetic data is distributionally equivalent, they have introduced an untested assumption into the validation chain. That assumption is not visible in the model's performance metrics. It surfaces only when an examiner asks what the validation dataset actually was and whether its representativeness was verified.

The DPDP Act 2023 does not mandate synthetic data, but it creates the conditions that make synthetic data operationally attractive. Purpose limitation and data-minimisation obligations make it harder to justify broad access to real customer data for validation activities that could, in principle, be conducted on a statistically equivalent synthetic substitute. Organisations that invest in synthetic data programmes therefore need a parallel investment in synthetic data assurance: documented generation methodology, fidelity test results at the column and joint-distribution level, differential privacy parameters, and a clear statement of which model validation conclusions rest on synthetic inputs.
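
One way to make that assurance evidence concrete is a versioned record that travels with each synthetic dataset. The sketch below is hypothetical throughout: the class name, field names, and example values are illustrative assumptions, not a schema taken from RBI guidance or the DPDP Act.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class SyntheticDataAssuranceRecord:
    """Hypothetical record; field names are illustrative, not a regulatory schema."""
    synthetic_dataset_version: str
    generation_method: str               # e.g. "CTGAN" or "LLM prompt"
    dp_epsilon: Optional[float]          # None if no DP mechanism was applied
    column_fidelity: dict                # per-column metrics such as PSI and KS
    joint_fidelity: dict                 # e.g. correlation-matrix differences
    validation_artefact_versions: tuple  # validation reports relying on this dataset

    def fingerprint(self) -> str:
        """Stable SHA-256 over the record, so later edits are detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

# Made-up example values.
record = SyntheticDataAssuranceRecord(
    synthetic_dataset_version="synth-v3",
    generation_method="CTGAN",
    dp_epsilon=1.0,
    column_fidelity={"income": {"psi": 0.04, "ks": 0.02}},
    joint_fidelity={"max_abs_corr_diff": 0.05},
    validation_artefact_versions=("credit-pd-validation-v7",),
)
amended = replace(record, dp_epsilon=3.0)
print(record.fingerprint() != amended.fingerprint())
```

Storing the fingerprint alongside each validation report gives an examiner a direct link between a validation conclusion and the exact synthetic dataset, and fidelity evidence, it rested on.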

Commercial generation platforms — including Mostly AI, Gretel, and AWS Clean Rooms with synthetic data features — provide generation pipelines and, in some cases, basic univariate fidelity summaries. What they do not provide is generation-quality testing at the level that a model risk validation report requires: multivariate fidelity assessment, tail-distribution analysis, PSI against the real validation baseline, and an audit trail that links synthetic dataset version to model validation artefact version. That gap is not a criticism of those platforms' core capability; it is a scope boundary. Generation and assurance of the generated output are different functions, and treating the first as a substitute for the second is the error that will eventually surface in an RBI model risk examination.

A synthetic dataset that passes a privacy audit but fails a distributional fidelity check is not a safe input — it is an unacknowledged assumption baked into every subsequent validation artefact. The organisations that will handle regulatory scrutiny well are those that treat synthetic data not as an endpoint but as an input that requires its own validation discipline, applied before the model validation cycle begins.

By Qapitol · AI assurance & governance