In this installment of our 2025 Trends series, we explore a major shift in clinical research: the growing dependence on reliable, representative real-world data (RWD) and synthetic datasets to accelerate trials.
As synthetic data and digital twin models move from emerging tools to essential components of modern clinical trials, their success increasingly hinges on the quality and diversity of the real-world data they reflect. Tokenized RWD offers the foundation needed to build accurate, privacy-preserving synthetic datasets that strengthen trial design and better reflect real-world populations. [1]
In this post, we examine the critical role of tokenized real-world data, outline the types of synthetic models advancing clinical research and highlight best practices research teams can adopt to keep data-driven innovation scientifically sound, ethically grounded and operationally effective.
Synthetic data is computer-generated information that replicates the statistical properties of real-world data without exposing actual patient identities. Digital twins are virtual models of individual patients that let researchers simulate treatment plans, predict outcomes and assess adverse events with greater precision than traditional trial designs allow. Both approaches offer the potential to accelerate timelines, reduce costs and minimize patient burden. [2] However, their success depends heavily on the quality of the underlying data. High-quality tokenized real-world data, in which personally identifiable information (PII) is replaced with encrypted identifiers, is essential for creating synthetic models that are accurate, representative and compliant. Without diverse, longitudinal and well-validated data inputs, synthetic models can introduce bias, lose generalizability and become unreliable for clinical research.
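To make the idea of replacing PII with a token concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any vendor's method: the key, the choice of fields, the normalization rules and the function names normalize and tokenize are all hypothetical, and a keyed hash (HMAC-SHA256) stands in for whatever encryption a production tokenization service might use.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical secret key for illustration only. A real deployment would
# keep this in a key management system, never in source code.
SECRET_KEY = b"example-key-not-for-production"


def normalize(value: str) -> str:
    """Strip accents, lowercase and collapse whitespace so that trivial
    formatting differences do not change the resulting token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())


def tokenize(first_name: str, last_name: str, birth_date: str,
             key: bytes = SECRET_KEY) -> str:
    """Return a keyed, one-way token derived from identifying fields."""
    message = "|".join(normalize(v) for v in (first_name, last_name, birth_date))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


# The same person entered two different ways, plus a different birth date.
token_a = tokenize("María", "Lopez", "1980-04-12")
token_b = tokenize(" maria ", "LOPEZ", "1980-04-12")
token_c = tokenize("Maria", "Lopez", "1980-04-13")

print(token_a == token_b)  # same normalized inputs give the same token
print(token_a == token_c)  # a different birth date gives a different token
```

In this sketch the token depends only on the normalized fields and the key, so two sites that share the key would derive the same token for the same patient, while the name and birth date themselves never appear in the output.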
Tokenization supports the creation of synthetic match-controlled cohorts by linking structured data, such as diagnosis codes, with unstructured data, such as physician notes or imaging results. It also facilitates cross-institutional collaboration, preserving privacy and compliance without exposing protected health information (PHI). A referential data layer can enhance the precision, consistency and privacy protection of the tokenization processes used to generate high-quality synthetic data.
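As a rough illustration of how a shared token lets structured and unstructured records be linked without any PHI, consider the following Python sketch. The records, field names, token values and the link_by_token helper are all hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical, already-tokenized records from two separate sources.
diagnoses = [
    {"token": "tok_a1", "icd10": "E11.9"},
    {"token": "tok_b2", "icd10": "I10"},
    {"token": "tok_a1", "icd10": "I10"},
]
notes = [
    {"token": "tok_a1", "note": "Patient reports improved glucose control."},
    {"token": "tok_c3", "note": "Follow-up imaging scheduled."},
]


def link_by_token(structured, unstructured):
    """Group diagnosis codes and free-text notes under each patient token."""
    linked = defaultdict(lambda: {"icd10": [], "notes": []})
    for row in structured:
        linked[row["token"]]["icd10"].append(row["icd10"])
    for row in unstructured:
        linked[row["token"]]["notes"].append(row["note"])
    return dict(linked)


patients = link_by_token(diagnoses, notes)

# Keep only patients who have both structured and unstructured data.
complete = {t: rec for t, rec in patients.items() if rec["icd10"] and rec["notes"]}

for token, record in complete.items():
    print(token, record["icd10"], len(record["notes"]))
```

Only the token is used as the join key here, which is the property that allows institutions to combine records without exchanging names or other identifiers.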
When synthetic datasets and digital twins are built on tokenized, longitudinally linked RWD, they more effectively reflect the diversity and complexity of real-world patient populations. These models can simulate complex disease trajectories and predict varied treatment responses. In this way, tokenization improves the scientific validity of synthetic models and helps advance more efficient, inclusive and reliable clinical research.
While synthetic data and digital twins offer significant potential to improve clinical research, sourcing the underlying real-world data introduces several complex challenges.
First, data representativeness remains a critical concern. Synthetic models are only as reliable as the data they reflect, making it essential to ensure that datasets capture the full diversity of the intended patient populations. Without careful efforts to include underrepresented groups across dimensions such as race, ethnicity, gender, age and comorbidity profiles, synthetic datasets can perpetuate existing biases and lead to inaccurate modeling.
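One simple way to check representativeness is to compare each subgroup's share of a dataset against its share of the intended population. The Python sketch below does this for a single dimension. The cohort, the reference shares, the 5-percentage-point tolerance and the function names are assumptions made for illustration, not recommended values.

```python
from collections import Counter


def subgroup_shares(records, field):
    """Return each subgroup's share of the records for one field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(records, field, reference_shares, tolerance=0.05):
    """Return subgroups whose observed share falls short of the reference
    share by more than the tolerance, with the size of the shortfall."""
    observed = subgroup_shares(records, field)
    gaps = {}
    for group, expected in reference_shares.items():
        gap = observed.get(group, 0.0) - expected
        if gap < -tolerance:
            gaps[group] = round(gap, 3)
    return gaps


# Hypothetical cohort of 100 patients and hypothetical population shares.
cohort = (
    [{"age_band": "18-39"}] * 50
    + [{"age_band": "40-64"}] * 40
    + [{"age_band": "65+"}] * 10
)
reference = {"18-39": 0.35, "40-64": 0.40, "65+": 0.25}

flags = underrepresented(cohort, "age_band", reference)
print(flags)  # subgroups that fall short of the reference population
```

In this example the 65+ group makes up 10 percent of the cohort against 25 percent of the reference population, so it is flagged. The same check could be repeated for race, ethnicity, gender or comorbidity profiles.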
Second, robust validation and benchmarking processes are necessary to ensure that synthetic data maintains clinical and regulatory credibility. Research teams must develop standardized methods to compare synthetic cohorts against real-world benchmarks, evaluating attributes such as demographic similarity, clinical outcome patterns and predictive performance. Without consistent validation frameworks, it becomes difficult to assess whether synthetic data accurately mirrors the populations and outcomes it is intended to simulate.
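To illustrate what comparing a synthetic cohort against a real-world benchmark can look like, the Python sketch below computes two common similarity measures: the standardized mean difference for a continuous variable and the total variation distance for a categorical one. The data are invented, the function names are hypothetical, and the 0.1 threshold mentioned afterward is a widely used rule of thumb rather than a regulatory standard.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(real, synthetic):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(real) + variance(synthetic)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(synthetic) - mean(real)) / pooled_sd


def total_variation_distance(real, synthetic):
    """Half the summed absolute difference between category proportions.
    0 means identical distributions, 1 means no overlap."""
    categories = set(real) | set(synthetic)
    return 0.5 * sum(
        abs(real.count(c) / len(real) - synthetic.count(c) / len(synthetic))
        for c in categories
    )


# Invented example values for a real cohort and a synthetic cohort.
real_ages = [34, 45, 52, 61, 67, 70, 48, 55]
synthetic_ages = [36, 44, 50, 63, 65, 72, 47, 57]
real_sex = ["F", "F", "M", "M", "F", "M", "F", "M"]
synthetic_sex = ["F", "M", "M", "M", "F", "M", "F", "M"]

smd = standardized_mean_difference(real_ages, synthetic_ages)
tvd = total_variation_distance(real_sex, synthetic_sex)

print(f"SMD for age: {smd:.3f}")
print(f"TVD for sex: {tvd:.3f}")
```

A team might treat an absolute standardized mean difference below 0.1 as acceptable balance on a variable, but the appropriate thresholds and the full set of attributes to compare would need to be defined in the validation framework itself.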
Finally, regulatory compliance is an ongoing priority. As synthetic data solutions advance, research teams must ensure alignment with existing regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe and emerging agency-specific guidance related to data integrity and synthetic control arms. Ethical considerations, including transparency around synthetic data usage and maintaining robust privacy protections, are equally important to sustain public trust and meet regulatory expectations.
Together, these challenges highlight the need for careful planning, rigorous methodology and a commitment to high-quality, representative data sourcing in any synthetic modeling effort.
As synthetic data and digital twin technologies become increasingly integrated into clinical research, the importance of high-quality tokenized real-world data cannot be overstated. Tokenization enables the creation of accurate, privacy-preserving synthetic datasets by maintaining the continuity and richness of patient information across multiple sources without compromising confidentiality. [3]
Blending synthetic datasets with real-world insights improves the credibility and practical value of emerging models. When synthetic cohorts are built on validated, diverse real-world data, research teams can create more realistic simulations and generate evidence that reflects the patients they serve. This approach helps build trust with regulators, healthcare providers and patients by showing that synthetic methodologies are not only scientifically sound but also practical and reliable in real-world research.
Ultimately, success in synthetic model development hinges on a foundation of rigorous validation, ethical data stewardship and proactive regulatory alignment. Organizations that invest in these pillars today will not only accelerate their clinical research timelines but also position themselves as leaders in the next evolution of evidence generation and precision medicine.
References: