Healthcare organizations increasingly depend on diverse datasets for richer insights into patient health and outcomes. Beyond the clinical information contained in the EHR, a person’s health is also the product of real-world lifestyle and social factors. Thus, the ability to reliably link real-world and clinical datasets is becoming critically important to understanding the whole person.
For life science researchers designing and running clinical trials, use of accurately matched clinical and real-world data (RWD) can increase the likelihood of recruiting the right patients, create a diverse pool of participants, reduce costs, and improve the chances for successful launch.
Life science researchers must reach beyond clinical records and claims data to obtain a more complete picture of health outcomes. Sources of real-world data, such as wearable technology, lab results and other consumer data, can provide supplemental insights to the clinical record. Along with RWD, there is a growing need to address health equity. This means understanding social determinants of health (SDOH), such as where a patient lives, their ethnicity and their family structure.
The FDA recognizes the importance of RWD and is advocating its broader utilization for life science research. In 2021, the agency issued a draft guidance on real-world evidence, acknowledging the limitations of EHRs and claims data. In a 2022 draft guidance, the FDA recommended that developers of medical products incorporate diversity plans early in clinical development to reduce healthcare disparities1.
Pharmaceutical companies recognize the value of RWD at every stage of a clinical trial. Utilizing RWD means better data quality and increased clinical trial efficiency, which in turn impacts drug development costs – estimated at between $314 million and $2.8 billion for each new drug2. Further, drugs developed with the use of RWD have a better chance of coming to market. With RWD, a new drug has an 89% chance of launching, compared to 68% for a drug developed without RWD3.
There’s no question that RWD is essential to life science research for a variety of reasons. Every stage of a clinical trial, from study design to recruitment, execution, reporting and follow-up, may rely on RWD. The challenge lies in accurately matching data sourced from disparate data sources.
While it’s understood that combining disparate datasets is critical for more complete and meaningful insights, it does make linking more complex. Datasets from different sources may use different “recipes” with varying data formats and data points. These datasets may be either identified or de-identified.
De-identified data increases patient privacy and security, but removing identifiers hinders the ability to link to other data. Traditional tokenization makes it possible to protect privacy without altering the source data, but it has limitations. Advanced, interoperable tokenization methods make it possible to link disparate datasets with greater reliability.
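The core idea behind tokenization can be illustrated with a minimal sketch: identifying fields are normalized and run through a keyed one-way hash, so records for the same person produce the same token without the identifiers themselves ever leaving the source. The field choices, normalization rule, and secret key here are illustrative assumptions, not any vendor’s actual scheme.

```python
import hashlib
import hmac

def make_token(first: str, last: str, dob: str, key: bytes) -> str:
    """Derive a privacy-preserving token from normalized identifiers.

    Records with identical normalized identifiers yield identical tokens,
    so datasets can be joined on the token without exposing the PII itself.
    """
    # Normalize so trivial formatting differences don't change the token
    normalized = "|".join(s.strip().lower() for s in (first, last, dob))
    # Keyed hash (HMAC-SHA256) so tokens can't be reversed by simply
    # hashing candidate names and comparing
    return hmac.new(key, normalized.encode(), hashlib.sha256).hexdigest()

key = b"site-specific-secret"  # hypothetical shared secret
t1 = make_token("John", "Doe", "1988-01-15", key)
t2 = make_token(" john ", "DOE", "1988-01-15", key)  # formatting differs
print(t1 == t2)  # True: same person, same token
```

Because the hash is one-way, the token itself reveals nothing about the person, yet any two datasets tokenized with the same key and normalization rule can still be joined on it.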
First-generation tokenization typically uses one of two methods to connect data records: deterministic linking or probabilistic linking. These methodologies, established in the 1970s, rely on either direct matches between data points or statistical probabilities to establish links. However, due to the sheer volume and complexity of healthcare and life sciences data, achieving precise matches can be challenging.
Deterministic linking compares datasets to identify direct matches between data points.
Probabilistic linking relies on points of direct match, but also considers how frequently a data value occurs within a dataset.
Both methods use scoring to assess the likelihood of a match. Deterministic linking might match two records for John Doe, both age 35, as the same person if there is no other conflicting data present. In contrast, probabilistic linking would also consider the prevalence of the names “John” and “Doe” in the general population.
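The contrast between the two approaches can be sketched as follows. Deterministic linking is an all-or-nothing field comparison, while probabilistic linking accumulates evidence per field, weighting agreement on rare values more heavily than agreement on common ones. The weights, penalty, and frequency table below are illustrative assumptions, loosely in the spirit of classical probabilistic record linkage, not a production scoring model.

```python
import math

def deterministic_match(a: dict, b: dict, fields=("first", "last", "age")) -> bool:
    # A link is declared only if every compared field agrees exactly
    return all(a[f] == b[f] for f in fields)

def probabilistic_score(a: dict, b: dict, frequency: dict) -> float:
    """Sum per-field evidence weights; higher means a more likely match.

    Agreement on a rare value earns a larger weight than agreement on a
    common value like the surname "Smith", which could match by chance.
    """
    score = 0.0
    for f in ("first", "last", "age"):
        if a[f] == b[f]:
            # Rarer values are stronger evidence: weight ~ -log(frequency)
            score += -math.log(frequency.get((f, a[f]), 0.5))
        else:
            score -= 1.0  # disagreement penalty (illustrative)
    return score

rec1 = {"first": "john", "last": "doe", "age": 35}
rec2 = {"first": "john", "last": "doe", "age": 35}
# Hypothetical population frequencies for each value
freq = {("first", "john"): 0.03, ("last", "doe"): 0.01, ("age", 35): 0.02}

print(deterministic_match(rec1, rec2))              # True: every field agrees
print(round(probabilistic_score(rec1, rec2, freq), 1))
```

In practice the probabilistic score is compared against tuned upper and lower thresholds: above the upper bound is a link, below the lower bound is a non-link, and the band in between is flagged for review.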
The problem with both methods is a tendency to either under-link or over-link records. Under-linking occurs when different records for the same individual fail to match due to inconsistencies or errors in the data. For example, if John Doe also goes by John D. Doe and entered his phone number with a typo, this may look like two different people to a deterministic engine. In the healthcare context, under-linked data may also mean that critical health information, such as a drug allergy, is omitted – potentially with dire consequences.
Conversely, over-linking can incorrectly associate records that should remain separate. For example, John D. Doe and his son, John D. Doe, Jr., may be incorrectly matched as the same person if “Jr.” is omitted from one facility’s record but included in the other. Again, this may result in the omission (or incorrect inclusion) of critical health information.
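Both failure modes above can be reproduced with a naive exact-match rule. The records are hypothetical, but they mirror the John Doe examples: a typo splits one person into two (under-linking), and a dropped "Jr." collapses two people into one (over-linking).

```python
def exact_match(a: dict, b: dict) -> bool:
    # Naive deterministic rule: every field present in both records
    # must agree exactly
    return all(a[k] == b[k] for k in a.keys() & b.keys())

# Under-linking: same person, but a middle initial and a phone typo
john_a = {"name": "john doe", "phone": "555-0142"}
john_b = {"name": "john d. doe", "phone": "555-0412"}  # typo in phone
print(exact_match(john_a, john_b))  # False: one person looks like two

# Over-linking: father and son collide once "Jr." is dropped at one site
senior = {"name": "john d. doe", "dob": "1960-04-02"}
junior = {"name": "john d. doe"}  # facility omitted "Jr." and DOB
print(exact_match(senior, junior))  # True: two people look like one
```

The second case also shows why missing fields are dangerous: with the son's DOB absent, the only field the two records share happens to agree, and the rule has no conflicting evidence to stop the false link.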
While traditional tokenization methods have paved the way for data linking, advancements in interoperable tokenization help overcome these challenges.
Next-generation tokenization introduces a novel approach to data linkage. Instead of directly comparing datasets, records are first compared against a comprehensive, referenceable database. This “referential” database utilizes thousands of data points to confidently establish links between multiple records, significantly reducing the risk of both under-linking and over-linking.
With next-generation, or smart tokenization, the data “recipes” of disparate datasets are less important. The referential database serves as the intermediary that confirms two distinct records belong to the same individual. Datasets are securely matched behind a firewall, tokenized, and delivered in a de-identified, research-ready format.
The vast array of data in a referential database contains all known insights about a person – but the linkage is only as accurate as the underlying data. To effectively serve as the “source of truth” for data linking, the referential database needs to be updated frequently as records change.
Researchers need precise, research-ready data for reliable results. Interoperable tokenization methods represent a significant advancement in data linking technology, suitable for complex healthcare and life sciences research requirements. With next-generation tokenization, researchers can create a data-rich, longitudinal view of a patient’s journey, fully de-identified to meet HIPAA standards.
References:
1 21st Century Cures Act. H.R. 34, 114th Congress. 2016. https://www.gpo.gov/fdsys/pkg/BILLS-114hr34enr/pdf/BILLS-114hr34enr.pdf. Accessed June 2, 2023.
2 Wouters OJ, McKee M, Luyten J. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. JAMA. 2020;323(9):844-853. doi:10.1001/jama.2020.1166. https://pubmed.ncbi.nlm.nih.gov/32125404/.
3 The Economist. Real-World Data Trials - Summary. Accessed January 31, 2023. https://druginnovation.eiu.com/real-world-data-trials/.