Reports that, after the FTC took action, this vendor deleted three million face images it had received from OkCupid show how much extremely sensitive data had been retained and stored for the vendor’s training purposes, at risk of a leak.
As if on cue to underscore the challenge, the next incident involved Mercor,[1] a $10 billion AI company supplying biometric training data to major players like OpenAI and Meta.
According to reports, Mercor suffered a significant breach linked to a supply chain attack on the open-source LiteLLM library, exposing gigabytes of sensitive identity documents and facial biometrics. This event not only jeopardizes the privacy of the individuals whose faces were stolen but also raises concerns about the integrity of identity verification (IDV) systems that rely heavily on real biometric data for training.
What is Synthetic Data?
Synthetic data refers to artificially generated information that mimics real-world datasets without containing any actual personal or sensitive details. It is created using advanced algorithms such as generative adversarial networks (GANs) or other machine learning models designed to replicate statistical properties of authentic data while ensuring that no direct link back to individuals exists. In IDV technology, synthetic datasets can simulate facial features or document images needed for system tuning without risking exposure of genuine user information.
How Synthetic Data Enhances Accuracy
One might assume synthetic data compromises accuracy due to its artificial nature. However, if applied correctly, it enhances model performance by providing diverse scenarios that may be underrepresented in limited real datasets. Synthetic faces can cover various ethnicities, ages, lighting conditions, or angles more comprehensively than traditional collections allow. This helps algorithms generalize better when verifying identities across global populations.
Test Subject: Santa Claus
Suppose Santa Claus presents himself for ID verification when renting a convertible for his January trip to Miami. The “white-beard complex” is well known in the world of IDV: men with large white beards fail the “selfie” liveness check because light over-reflects off their complexion and beard. The appropriate fix is not to iterate against Santa’s real face image until the problem is overcome. A more effective solution, and one less damaging to Santa, is to tune the AI system by automatically generating synthetic images of white-bearded men across multiple ethnicities, ages, and face profiles. This improves the engine’s ability to recognize hundreds of white-bearded men in the future without using any of Santa’s real cheery smile (i.e., PII).
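The idea of covering white-bearded men across ethnicities, ages, and face profiles can be sketched as a generation plan: one request per combination of attributes, so that no combination is left out. This is a minimal sketch, not a real generator. The attribute values, the GenerationSpec class, and the build_plan function are all hypothetical names chosen for illustration, and the image synthesis step itself is out of scope here.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

# All attribute values below are illustrative placeholders, not a real taxonomy.
GROUPS = ["group_a", "group_b", "group_c", "group_d"]
AGE_BANDS = ["50-59", "60-69", "70+"]
PROFILES = ["frontal", "left_profile", "right_profile"]
LIGHTING = ["dim", "indoor", "direct_sun"]


@dataclass(frozen=True)
class GenerationSpec:
    """One request to a (hypothetical) synthetic face generator."""
    beard: str
    group: str
    age_band: str
    profile: str
    lighting: str


def build_plan(per_cell: int = 1) -> list[GenerationSpec]:
    """Return per_cell specs for every combination of the attribute axes."""
    plan = []
    for group, age, profile, light in product(GROUPS, AGE_BANDS, PROFILES, LIGHTING):
        spec = GenerationSpec("full_white", group, age, profile, light)
        plan.extend([spec] * per_cell)
    return plan


# Usage: build a plan and confirm every group gets the same number of requests.
plan = build_plan(per_cell=2)
per_group = Counter(spec.group for spec in plan)
print(len(plan), dict(per_group))
```

Enumerating the full grid, rather than sampling at random, is one simple way to guarantee that rare combinations (for example, an older subject in harsh light and in profile) appear in the tuning set at all.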
Protecting User Privacy: The Advantages of Using Synthetic Data Over Real Data
The FTC action and the Mercor breach illustrate in the most egregious way how reliance on real biometric databases creates vulnerabilities that hackers can exploit for deepfake creation or social engineering attacks. By contrast, synthetic data eliminates these risks because it contains no personally identifiable information (PII). Organizations adopting synthetic datasets reduce their attack surface significantly while still maintaining high standards for model tuning quality.
Legal and Ethical Considerations When Implementing Synthetic Data Solutions
While synthetic data offers promising benefits for both privacy preservation and accuracy improvement, companies must navigate complex legal frameworks governing biometric information use - such as GDPR in Europe - that impose strict controls over personal data processing and sharing practices. Transparency about how synthetic datasets are generated and validated is essential, as are rigorous testing protocols to ensure that the datasets do not inadvertently encode biases present in the source material used to generate them.
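One small piece of such a testing protocol could be a coverage check on the synthetic dataset's labels. The sketch below flags attribute values whose share of the dataset strays from an even split by more than a chosen tolerance. It assumes each synthetic image has a record of labels. The coverage_gaps function, the manifest layout, and the 10% tolerance are hypothetical, and a skew check like this is only a starting point, not a full bias audit.

```python
from collections import Counter


def coverage_gaps(records, attribute, expected_values, tolerance=0.10):
    """Return {value: share} for values whose share of the dataset differs
    from an even split by more than `tolerance`."""
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(record[attribute] for record in records)
    total = len(records)
    even_share = 1 / len(expected_values)
    gaps = {}
    for value in expected_values:
        # Values that never appear count as a share of zero.
        share = counts.get(value, 0) / total
        if abs(share - even_share) > tolerance:
            gaps[value] = round(share, 3)
    return gaps


# Hypothetical manifest: one dict of labels per synthetic image.
manifest = (
    [{"age_band": "60-69"}] * 6
    + [{"age_band": "70+"}] * 3
    + [{"age_band": "50-59"}] * 1
)
print(coverage_gaps(manifest, "age_band", ["50-59", "60-69", "70+"]))
```

Passing the expected values explicitly means a value that is missing from the dataset entirely is still reported, which a check based only on observed labels would miss.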
Looking to the Future: The Consumer Will Choose
Looking ahead, AI-driven identity verification will continue to evolve rapidly amid rising cybersecurity threats exemplified by incidents such as the Mercor breach. This will not be without consequence. Companies that continue to place IDV tools at the front end of their online business processes will prioritize (1) resilient solutions integrating cutting-edge techniques with (2) synthetic dataset augmentation strategies that strengthen end-user trust. It will take only one top-five trending news article to push them toward safer tools.
IDV trained on personal data is the next trans fat.
Yes, innovation must continue, both to improve realism within synthetics and to develop detection tools capable of distinguishing legitimate users from sophisticated deepfake and injection attacks derived from stolen biometrics. Ultimately, balancing privacy protection against fraud prevention remains paramount as organizations strive toward safer digital ecosystems where their customers’ rights are respected alongside seamless authentication experiences.
1. The Record, “Mercor confirms security incident tied to LiteLLM,” 2026.