TEDS Data Dictionary

Algorithm for Scrambling IDs

Contents of this page:

IDs used in TEDS
- Admin IDs
- Pseudonymous and anonymous dataset IDs
Scrambling IDs
Unscrambling IDs

IDs used in TEDS

Admin IDs

In everyday contacts with families, in the TEDS admin database, in the raw data, and for twin DNA samples kept in the lab, the following IDs are used to identify families and twins:

Name of ID	Identifies	Length	Structure	Fictional example	Comments
FamilyID	a family	4 or 5 digits	Numeric values between roughly 1200 and 36000	24501	Assigned at the time of recruitment and unaltered since. (May have names like XFamilid, AFamilid in old scripts.)
TwinOrder	a twin within a given family	1 digit	1=elder twin or 2=younger twin	2	Denotes the twin birth order. Generally named twin in scripts and datasets. (May have names like atwin, gtwin in old scripts and datasets.)
TwinID	a twin	7 or 8 digits	Comprises the FamilyID followed by the TwinOrder followed by two randomly generated digits.	24501273	Assigned at the time of recruitment and unaltered since. In rare cases where the birth order (TwinOrder) has been corrected, the value of TwinID has been left unchanged, so in these cases the 3rd-last digit does not match the value of TwinOrder.
Atempid2	a twin	5 or 6 digits	Comprises the FamilyID followed by the TwinOrder.	245012	May have names like Xtempid2, gtempid2 in old scripts.
TEDS ID	a twin	string of 7 or 8 characters	Comprises the letters "TD" followed by the FamilyID followed by the TwinOrder.	TD245012	Used to identify TEDS twin DNA samples stored in the lab.

These IDs (except for TwinOrder) can be directly linked to confidential information about individual TEDS families and twins. For this reason, the IDs above are not used in the TEDS analysis datasets. (Furthermore, identifying data such as names and postcodes are not included in the datasets.)

As noted in the table above, the main family and twin identifiers (FamilyID, TwinID) were assigned at the time of recruitment and 1st Contact, when family and twin details were first entered in the TEDS admin database. The twin birth order (TwinOrder) for each named twin was established in the 1st Contact study; in rare cases this was subsequently found to be incorrect and then corrected in the admin database, but FamilyID and TwinID have been left unchanged in all cases. The current and most reliable source of twin birth orders is the admin database; birth order digits in variables like TwinID and TEDS ID may be incorrect in rare cases. When new datasets are created, the correct birth order is taken from the admin database record for the purpose of making variables like twin, id_twin (see below).
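
The structure of the admin IDs in the table above can be illustrated with a short Python sketch. It is not part of any TEDS script: the function and class names are hypothetical, and the only facts assumed are the layouts given in the table (FamilyID, then the birth order digit, then two random digits for TwinID; "TD" plus FamilyID plus TwinOrder for the TEDS ID). The check at the end reflects the point made above, that the digit embedded in TwinID may disagree with the corrected TwinOrder held in the admin database.

```python
from typing import NamedTuple


class TwinIdParts(NamedTuple):
    family_id: int       # leading 4 or 5 digits
    order_digit: int     # 3rd-last digit: birth order as recorded at recruitment
    random_suffix: str   # last two digits, kept as text to preserve leading zeros


def split_twin_id(twin_id: int) -> TwinIdParts:
    """Split a TwinID into its three documented components."""
    text = str(twin_id)
    if not text.isdigit() or len(text) not in (7, 8):
        raise ValueError(f"TwinID should have 7 or 8 digits: {twin_id!r}")
    return TwinIdParts(int(text[:-3]), int(text[-3]), text[-2:])


def order_digit_matches(twin_id: int, admin_twin_order: int) -> bool:
    """True when the digit embedded in TwinID agrees with the admin TwinOrder."""
    return split_twin_id(twin_id).order_digit == admin_twin_order


def make_teds_id(family_id: int, twin_order: int) -> str:
    """Build the lab sample identifier: 'TD' + FamilyID + TwinOrder."""
    return f"TD{family_id}{twin_order}"


# Fictional example values taken from the table above.
parts = split_twin_id(24501273)
print(parts.family_id, parts.order_digit, parts.random_suffix)
print(make_teds_id(24501, 2))
print(order_digit_matches(24501273, 1))  # a mismatch flags a corrected birth order
```

In practice a mismatch is not an error in itself; it simply indicates one of the rare cases where the admin database value should be preferred.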

Pseudonymous and anonymous dataset IDs

For the purposes of the main TEDS analysis datasets, the IDs are 'scrambled' in order to protect the confidentiality of the data. The IDs in the main TEDS datasets are named id_fam (family ID) and id_twin (twin ID). This scrambling is done using an algorithm, devised by Tom Price, which is outlined below. The resulting IDs can be converted back to their original form by a process of 'unscrambling'. Therefore, data identified in this way are categorised as pseudonymous, not strictly anonymous. For all twins, the values of id_fam and id_twin are the same across different datasets, allowing variables to be merged longitudinally.

A further non-reversible and randomised encryption process is necessary to make the IDs truly anonymous. This encryption is a useful additional step in the construction of datasets that are to be shared with researchers; it significantly reduces any risk that the participants could be identified, by irreversibly and randomly modifying the family identifiers. The new IDs are named randomfamid (family ID) and randomtwinid (twin ID). In any given dataset, the encryption is made unique by re-computing the IDs, incorporating random number generation in the computation. Hence, the IDs created in this way differ from one dataset to another, making it impossible to merge with other datasets. This encryption process is not described further on this page. For longitudinal datasets, the data are first merged using identifiable or pseudonymous IDs before the final encryption step.
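
The general idea of dataset-specific random IDs can be sketched as follows. This is not the TEDS encryption procedure, which is not described on this page; it is a minimal illustration under stated assumptions. The function name assign_random_ids is hypothetical, the range 50000 to 70000 is taken from the summary table below, and randomtwinid is assumed to be the random family ID followed by the twin order digit. Because the lookup table exists only inside the function and is never stored, the mapping cannot be reversed, and a second run produces different IDs.

```python
import random


def assign_random_ids(rows, low=50000, high=70000, rng=None):
    """Return (randomfamid, randomtwinid) for each (family_key, twin) row.

    rows: list of (family_key, twin) pairs, where twin is 1 or 2.
    The family_key can be any existing identifier; it is not returned.
    """
    rng = rng or random.SystemRandom()
    families = sorted({family for family, _ in rows})
    # Sampling without replacement guarantees one distinct code per family.
    codes = rng.sample(range(low, high), len(families))
    lookup = dict(zip(families, codes))  # local only, discarded on return
    return [(lookup[family], lookup[family] * 10 + twin) for family, twin in rows]


# Made-up pseudonymous family IDs for two twin pairs.
rows = [(87654, 1), (87654, 2), (12345, 1), (12345, 2)]
for randomfamid, randomtwinid in assign_random_ids(rows):
    print(randomfamid, randomtwinid)
```

Twins from the same family still share a family ID within the dataset, so pair-level analysis remains possible even though merging across datasets does not.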

It is now TEDS policy to use anonymous (not pseudonymous) IDs in datasets provided to researchers. The main exception to this rule is where researchers need to merge their phenotypic dataset with genotypic data for analysis; in these cases, pseudonymous IDs are used. All analysis of genotypic data is done within KCL, and raw genotypic data are not shared externally; hence any dataset shared outside KCL will always be anonymous, not pseudonymous.

Datasets used within the LLC TRE, where they can be linked with NHS medical records, have a different twin identifier called STUDY_ID. This is a pseudonymous twin identifier, with long string values. The raw values of the identifier are stored in TEDS, and included as a variable in each dataset submitted to the LLC; the values are then irreversibly hashed by the LLC. The hashed values found in the STUDY_ID variable in datasets within the LLC are therefore different from the raw values held in TEDS. However, the hashing is carried out identically for every TEDS dataset, which means that STUDY_ID can still be used for linking TEDS datasets inside the LLC; it is therefore pseudonymous. The TEDS family identifier randomfamid, as described above, will also be made available to researchers within the LLC; its function is to act as a family 'grouping variable', enabling researchers to identify any pairs of twins related as siblings.
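
The reason identical hashing preserves linkage can be shown with a small sketch. The hashing method actually used by the LLC is not described on this page, so everything here is an assumption made for illustration: a keyed SHA-256 hash (HMAC) from the Python standard library, a hypothetical key, and made-up raw identifier strings. The point is only that the same raw value always yields the same hashed value, so two datasets hashed in the same way can still be joined.

```python
import hashlib
import hmac

# Hypothetical secret held only by the party that performs the hashing.
SECRET_KEY = b"hypothetical-key-held-by-the-hashing-party"


def hash_study_id(raw_value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministic one-way hash: same input and key give the same output."""
    return hmac.new(key, raw_value.encode("utf-8"), hashlib.sha256).hexdigest()


# Two datasets submitted with the same (made-up) raw identifier.
dataset_a = {"raw-id-0001": {"score": 12}}
dataset_b = {"raw-id-0001": {"height": 150}}

hashed_a = {hash_study_id(raw): row for raw, row in dataset_a.items()}
hashed_b = {hash_study_id(raw): row for raw, row in dataset_b.items()}

# Link on the hashed identifier; the raw values are no longer present.
linked = {
    key: {**hashed_a[key], **hashed_b[key]}
    for key in hashed_a.keys() & hashed_b.keys()
}
print(len(linked))
```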

This table summarises the pseudonymous and anonymous IDs used in TEDS datasets:

Name of ID	Type	Purpose	Identifies	Length	Structure	Fictional example
twin	-	to specify twin birth order within a pair	a twin within a given family	1 digit	1=elder twin or 2=younger twin	2
id_fam	Pseudonymous	Protection of confidentiality within the main TEDS datasets while allowing longitudinal data to be merged. Given the scrambling algorithm, it is possible to convert the IDs back to identifiable form.	a family	up to 6 digits	Numeric values between roughly 100 and 999999	87654
id_twin	Pseudonymous		a twin	up to 7 digits	Comprises the id_fam value followed by the twin value.	876542
STUDY_ID	Pseudonymous	The unique twin identifier used in all TEDS datasets within the LLC TRE, allowing TEDS datasets to be merged with the linked NHS medical record datasets. It will not be possible to trace the hashed values back to original twin identifiers, so for practical purposes the values are anonymous.	a twin	unknown	Hashed string values.	(not available)
randomfamid	Anonymous	Complete protection of confidentiality for shared datasets; randomly generated and unique to each dataset, so merging is impossible. The encryption of the IDs is irreversible.	a family	5 digits	Numeric values between roughly 50000 and 70000	54321
randomtwinid	Anonymous		a twin	6 digits	Comprises the randomfamid value followed by the twin value.	543212

Note that dataset variable twin is the same as the variable TwinOrder used in admin and in the raw data.

Scrambling IDs

Scrambling of IDs refers to the process of converting the original admin IDs (as used in the raw data) into the pseudonymous dataset IDs, as described above. Scrambling of IDs is therefore a routine part of dataset construction, and is included in the scripts used to make the TEDS datasets.

Because this data dictionary is widely shared, as are the TEDS analysis datasets, the actual algorithm for scrambling IDs (in the form of a syntax or script) is not shown here. This is to help protect the confidentiality of the TEDS twins and parents whose data are in the datasets. The detailed algorithm is no longer shared with researchers.

The essential properties of the scrambling algorithm are:

It is reversible.
It converts a value of FamilyID into a value of id_fam. Both values are unique to a specific TEDS family.
For twin data, the algorithm also creates a value of id_twin by appending the twin order (1=elder or 2=younger) to the end of the value of id_fam. Hence, each value of id_twin is unique to a specific TEDS twin.
The algorithm is guaranteed to work for the fixed range of FamilyID values (roughly 1200 to 36000). For ID values outside this range, the algorithm may not generate unique values of id_fam.
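
The one step that is described openly above, appending the twin order to id_fam to form id_twin, can be sketched in Python. The conversion of FamilyID into id_fam is deliberately not shown, and nothing here reveals it. The function names are hypothetical; the example values are the fictional ones from the summary table.

```python
def make_id_twin(id_fam: int, twin: int) -> int:
    """Append the twin order digit (1=elder, 2=younger) to id_fam."""
    if twin not in (1, 2):
        raise ValueError("twin must be 1 (elder) or 2 (younger)")
    return id_fam * 10 + twin


def split_id_twin(id_twin: int) -> tuple[int, int]:
    """Recover (id_fam, twin) from id_twin; the last digit is the twin order."""
    return divmod(id_twin, 10)


print(make_id_twin(87654, 2))
print(split_id_twin(876542))
```

Since id_fam is unique to a family and the final digit distinguishes the two twins, each id_twin built this way is unique to one twin, as stated in the list above.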

The scrambling algorithm is a fairly simple form of encryption, which can be encoded in a syntax or script. The mechanism of the algorithm includes a sequence of steps, which are not described here for reasons of data protection. The algorithm effectively achieves the aim of disguising an original value of FamilyID, such that it is not in any way recognisable from the value of id_fam into which it is converted: the length is typically different, and some or all of the component digits are different.

Unscrambling IDs

This is the reverse of the scrambling algorithm. It converts values of id_fam (or id_twin) back into the original values of FamilyID. This is effectively achieved by reversing the steps of the scrambling algorithm.

Unscrambling of IDs may be needed for specific admin or research purposes, for example to identify individual twins or families whose data are of interest, or to check anomalies in the data. This unscrambling may be done under the control of the TEDS data manager, but the process is not made available to researchers.