Data Anonymization Techniques in a Clinical Trial

“Originally published on TrialSite News on June 28, 2021”

Clinical trial transparency is quickly emerging as a critical function within sponsor and academic organizations, as its importance in the pharmaceutical, biotechnology, and medical device industries increases. Understanding data anonymization techniques and tools is an important part of adhering to increasing regulation.

In the last five years alone, we have seen the introduction of numerous country-specific transparency regulations, industry guidelines, and requirements, such as EMA Policy 0070 and FDAAA 801.

In the near future, even more regulation is expected, such as the full implementation of the EU Clinical Trial Regulation and its associated Clinical Trial Information System (CTIS), along with increased scrutiny of and focus on data transparency.

These changing and increasing requirements force sponsors to adapt quickly, especially when it comes to data anonymization. Those who are not able or willing to do so face increased public pressure, potential penalties for non-compliance, and difficulty realizing the advantages that greater transparency can bring.

Greater transparency has the potential to increase:

Trust, while enhancing public perception
Awareness and interest from investigational sites and clinical trial participants
Consistency of information and data released to the public
The potential reuse of clinical documents and data, thereby utilizing existing research to shorten the drug development process

With greater transparency, protecting clinical trial participants’ privacy and confidential company information is of paramount importance. Strong internal processes for the redaction of documents and the anonymization of data are needed.

Anonymization vs. Pseudonymization: What’s the Difference?

De-identification can be accomplished in one of two ways: complete redaction or pseudonymization.

Based on the ISO standard definition, anonymization is the process in which identifiable information is irreversibly altered so that the person can no longer be directly or indirectly identified.

Therefore, anonymization can be thought of as a complete redaction or masking of information and data, where all identifying characteristics are deleted. After the transformation process, it is not possible to retrieve or re-identify that data.

Another approach to de-identification is pseudonymization, in which data that could be used for identification is replaced with a pseudonym. The anonymizer generates and retains a master key, usually a table or graph, which connects the subject and the pseudonym, meaning it is still possible to re-identify the subject if one knows the key.
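As a minimal sketch of the key-table idea described above, the Python below assigns each subject a random pseudonym and keeps the mapping separate from the shared data. The `P-` prefix, the field names, and the sample records are hypothetical, and a real process would also need to store the key table under strict access controls.

```python
import secrets

def build_key_table(subject_ids):
    """Map each distinct subject ID to a random pseudonym (the 'master key')."""
    key_table = {}
    used = set()
    for sid in subject_ids:
        if sid in key_table:
            continue  # the same subject always keeps the same pseudonym
        pseudonym = "P-" + secrets.token_hex(4)
        while pseudonym in used:  # regenerate in the unlikely event of a collision
            pseudonym = "P-" + secrets.token_hex(4)
        used.add(pseudonym)
        key_table[sid] = pseudonym
    return key_table

def pseudonymize(records, key_table, id_field="subject_id"):
    """Return copies of the records with the ID replaced by its pseudonym."""
    return [{**r, id_field: key_table[r[id_field]]} for r in records]

def reidentify(pseudonym, key_table):
    """Re-identification is possible only for whoever holds the key table."""
    for sid, p in key_table.items():
        if p == pseudonym:
            return sid
    return None

# Hypothetical sample data
records = [
    {"subject_id": "1001-004", "visit": 1},
    {"subject_id": "1001-004", "visit": 2},
    {"subject_id": "1002-017", "visit": 1},
]
key_table = build_key_table(r["subject_id"] for r in records)
shared = pseudonymize(records, key_table)  # safe to share only without key_table
```

Because the pseudonym is random rather than derived from the original ID, the shared records alone give no route back to the subject; the link exists only in `key_table`.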

The advantage of pseudonymization (replacement of identifiers with a pseudonym) vs. anonymization (redacting or masking identifiers) is that pseudonymization retains the utility of the data. This allows for meaningful secondary analyses and follow-on research while maintaining patient confidentiality.

Examples of anonymization include redacting the name, date of birth and other demographic identifiers of individuals associated with the conduct of the trial. Redaction of this information generally does not impact the utility of the data since it is not related to any medical or clinical information collected from trial participants.

Pseudonymization, in comparison, goes one step further by generalizing potentially identifying characteristics and important clinical information from other personal data, for example, generalizing a location such as a city to a region or country, or generalizing a discrete age to an age range. Pseudonymization therefore protects patient privacy while maintaining a level of data utility, so that the data may be used in follow-on research and analysis.
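The two generalizations mentioned above (city to country, exact age to age range) might be sketched in Python as follows. The lookup table, the 10-year band width, and the sample record are assumptions chosen for illustration, not values from any guideline.

```python
# Hypothetical lookup; a real one would come from a maintained geography reference.
CITY_TO_COUNTRY = {
    "Lyon": "France",
    "Marseille": "France",
    "Boston": "United States",
}

def generalize_age(age, width=10):
    """Replace an exact age with a band of the given width, e.g. 37 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_city(city, lookup=CITY_TO_COUNTRY, fallback="Other"):
    """Replace a city with its country; unknown cities fall back to a generic label."""
    return lookup.get(city, fallback)

record = {"age": 37, "city": "Lyon", "systolic_bp": 128}
generalized = {
    **record,
    "age": generalize_age(record["age"]),
    "city": generalize_city(record["city"]),
}
# generalized == {"age": "30-39", "city": "France", "systolic_bp": 128}
```

The clinical value (`systolic_bp`) is untouched, which is the sense in which generalization preserves utility for secondary analysis.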

There is a perpetual trade-off between patient privacy and data utility when choosing an approach to anonymization: the greater the data utility, the lower the data privacy, and vice versa. Because of this, a robust risk assessment process must be in place to assess the potential risk of re-identification. It is critically important to ensure patient confidentiality. However, balancing this while maximizing the utility of secondary research data will continue to be a challenge.

Sponsors and researchers continually face potential breaches in patient confidentiality and data when releasing clinical trial information to the public. A data breach can be more than just being able to identify an individual participant in a clinical trial. It could extend to divulging sensitive medical and lifestyle information that could lead to social boycott and stigmatization of individuals. If information of this nature becomes public, the potential impact on an individual could be personally, socially, and financially devastating. Further, the impact on the entire clinical trial process could be significant: individuals may become reluctant to participate in clinical trials, and trust in the industry and in a sponsor’s ability to keep clinical data secure may decline. Maintaining a patient’s anonymity not only protects their right to privacy but also helps curb data breaches and identity theft.

Avoiding these potential breaches requires that clinical information collected in a clinical trial be kept secure and be adequately anonymized before public disclosure. Adopting an anonymization process that broadly follows the steps outlined below can reliably reduce the risk of disclosing personal information.

Step #1: Identify and Classify Variables

Before proceeding with the anonymization of clinical information, particularly in documents where the data and information are presented in an unstructured format, it is important to define and classify direct and indirect identifying variables.

Direct identifying variables are commonly described as information that meets the following criteria:

Replicable: the variable is unlikely to vary frequently over time
Distinguishable: the individual patients may have distinct, recognizable results or values
Knowable: someone knows a variable, or variables associated with a particular individual

Other identifying variables that fall within the definition of personal information are indirectly identifying variables. These are defined as variables that may present a significant possibility of re-identifying an individual when combined with other available information, like demographic data. Still, these variables may be necessary to understand the clinical data, and therefore their anonymization must be carefully considered and justified.

Variables that do not present a high probability of re-identifying an individual, alone or in combination with other information, are not considered personal information and should not be transformed or redacted during the anonymization process.

Step #2: Measure the Re-Identification Risk

The overall risk of re-identification associated with the disclosure of clinical information is the product of the risk inherent to the data and the risk associated with the release context. While too involved to describe in detail here, a robust quantitative risk assessment process (potentially paired with a qualitative risk assessment process) is critical in evaluating re-identification risk. Once the data risk is measured, this risk measurement provides justification for any data transformation(s) that may be employed.

In a public release environment, the risk associated with the context of release is irreducible (i.e., once released, the information cannot be retracted), so the overall risk of re-identification is higher than for the release of information to a small group of select individuals (e.g., researchers). In this scenario, reducing the inherent risk associated with the data becomes the only way to bring the overall risk down to an acceptable level.
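To make the product relationship concrete, here is a small Python sketch. Treating the context risk of a public release as 1.0 and using 0.09 as the acceptance threshold are both assumptions made for illustration; a sponsor would set these values through its own risk assessment.

```python
def overall_risk(data_risk, context_risk):
    """Overall re-identification risk as the product of its two components."""
    for name, value in (("data_risk", data_risk), ("context_risk", context_risk)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return data_risk * context_risk

PUBLIC_CONTEXT_RISK = 1.0  # assumption: public release offers no contextual protection
THRESHOLD = 0.09           # illustrative acceptance threshold only

def acceptable(data_risk, context_risk, threshold=THRESHOLD):
    """True when the combined risk does not exceed the predefined threshold."""
    return overall_risk(data_risk, context_risk) <= threshold

# Same data, two release contexts (all numbers are made up):
controlled = acceptable(0.2, 0.25)               # restricted group of researchers
public = acceptable(0.2, PUBLIC_CONTEXT_RISK)    # public release
public_after = acceptable(0.05, PUBLIC_CONTEXT_RISK)  # after further anonymization
```

With these made-up numbers, the same data passes in the controlled context but fails for public release until the data risk itself is lowered.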

For a successful release of information to the public (i.e., ensuring patient privacy while maximizing data utility), the calculation of re-identification risk needs to reflect this environment. The data itself must be assessed quantitatively so that the overall risk can be measured against a predefined threshold. For example, one must consider the number of subjects that share a common indirect identifier, such as age. If this number is low for any particular value, that potential identifying variable may require anonymization.
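The group-size check described above can be sketched in a few lines of Python. The field name, the sample ages, and the minimum group size of 3 are hypothetical.

```python
from collections import Counter

def group_sizes(records, identifier):
    """Count how many subjects share each value of an indirect identifier."""
    return Counter(r[identifier] for r in records)

def values_needing_attention(records, identifier, min_group_size):
    """Return the values shared by fewer subjects than the minimum group size."""
    sizes = group_sizes(records, identifier)
    return sorted(v for v, n in sizes.items() if n < min_group_size)

# Hypothetical subjects: three aged 34, four aged 41, one aged 87
subjects = [{"age": a} for a in (34, 34, 34, 41, 41, 41, 41, 87)]
flagged = values_needing_attention(subjects, "age", min_group_size=3)
# flagged == [87]
```

Here the single 87-year-old stands out, so that age value is a candidate for generalization or another transformation.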

Step #3: Anonymize the Data

The methodology used to anonymize clinical information can have a significant impact on data utility. Therefore, it’s best not to anonymize variables that do not contribute to the risk of re-identification and to adopt methods that have the lowest impact on data utility.

Directly identifying variables, like name, initials, signature, job title, address, email address, and phone number, should be anonymized through the process of redaction.

Indirectly identifiable variables, like subject/patient assigned ID, city, state, postal code, demographic data, medical history, serious adverse events, dates, height, weight, and BMI, should be considered for transformation or generalization via pseudonymization, so long as the risk of re-identification has been mitigated.
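One way to apply these two kinds of rule consistently is a per-variable rule table, sketched below in Python. The variable names, the `[REDACTED]` marker, the 10-year age band, and the postal-code truncation are hypothetical choices for illustration, not a prescribed rule set.

```python
REDACTED = "[REDACTED]"

def age_band(age, width=10):
    """Generalize an exact age to a band, e.g. 62 -> '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def truncate_postal_code(code, keep=2):
    """Keep only the leading characters of a postal code and mask the rest."""
    return code[:keep] + "*" * max(len(code) - keep, 0)

# Hypothetical rule table: variable name -> transformation.
# Variables not listed here pass through unchanged.
RULES = {
    "name": lambda _: REDACTED,       # direct identifier: redact
    "email": lambda _: REDACTED,      # direct identifier: redact
    "phone": lambda _: REDACTED,      # direct identifier: redact
    "age": age_band,                  # indirect identifier: generalize
    "postal_code": truncate_postal_code,
}

def anonymize(record, rules=RULES):
    """Apply the matching rule to each variable and return a new record."""
    return {k: (rules[k](v) if k in rules else v) for k, v in record.items()}

record = {
    "name": "Jane Doe",
    "email": "jane@example.org",
    "age": 62,
    "postal_code": "02139",
    "adverse_event": "headache",
}
anonymized = anonymize(record)
```

Keeping the rules in one table makes it easier to review them, adjust them after a risk re-assessment, and reuse them across studies.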

A subsequent risk assessment on the anonymized data consistent with the previous approach is recommended to ensure the shareable data meets acceptable standards and a defined threshold for privacy protection. Adjustments to, or “fine-tuning” of the anonymization rules (redaction and/or pseudonymization) may be needed to ensure maximum data utility while maintaining the lowest possible acceptable level of re-identification risk.
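The fine-tuning loop can be illustrated with a short Python sketch that tries progressively wider age bands and stops at the narrowest one that satisfies a minimum group size. The candidate widths, the sample ages, and the group-size criterion are assumptions; a real assessment would consider several identifiers together.

```python
from collections import Counter

def band(age, width):
    """Return the (low, high) age band of the given width containing age."""
    low = (age // width) * width
    return (low, low + width - 1)

def tune_age_band(ages, min_group_size, widths=(5, 10, 20)):
    """Return the narrowest candidate width whose smallest band still holds
    at least min_group_size subjects, or None if no candidate qualifies."""
    if not ages:
        return None
    for width in widths:  # candidates are tried from narrowest to widest
        counts = Counter(band(a, width) for a in ages)
        if min(counts.values()) >= min_group_size:
            return width
    return None

# Hypothetical ages from a small study
ages = [31, 33, 34, 36, 38, 39, 52, 57, 58]
chosen = tune_age_band(ages, min_group_size=3)
# chosen == 10: 5-year bands leave one subject alone in 50-54,
# while 10-year bands give groups of 6 (30-39) and 3 (50-59)
```

Trying the narrowest width first reflects the goal of keeping as much data utility as the risk threshold allows.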

Who should support these steps?

Transparency subject matter experts (SMEs) are best suited to conduct anonymization and de-identification of clinical datasets and documents because of their knowledge, background, and expertise in:

Clinical trial transparency laws, regulations, and guidance
Clinical trial disclosure, including clinical trial registration and results summary posting
Management and execution of transparency activities, including policies, protocol, and submissions
Data privacy and the handling procedures for patient data

Transparency SMEs need to follow good data management and handling procedures, such as establishing an appropriate audit trail for any data transformations in anticipation of potential regulatory inspections or sponsor/vendor audits.

They should also seek to use a facilitating technology that is efficient, repeatable, and scalable and which allows them to measure risk and apply changes in an iterative fashion. This also allows for consistency in rules and methodology across the sponsor’s pipeline of products.

Additionally, establishing a robust, nimble, scalable, and enduring governance structure and process across all disclosure and transparency-related activities is essential. A responsible data-sharing culture starts with executive sponsorship and should be reinforced across the organization and embedded as part of its mission and vision.

Kelly Vaillant contributed to this article.
