TEDS Data Dictionary

Data Processing to Create Analysis Datasets

Contents of this page:

  • Introduction
  • Cleaning and aggregating raw data
  • Sources of raw data
  • Exporting raw data
  • Processing by scripts

Introduction

This page describes, in general terms, how the TEDS analysis datasets have been created for each main study.

Every study starts with data collection, followed by data entry, then usually some data cleaning and aggregation of data. These processes are described elsewhere in the data dictionary, although a brief recap of data cleaning and aggregation is given on this page.

After a study has ended (in terms of data collection and data entry), the creation of the main dataset for analysis is a process that may be refined and repeated many times. The starting point is the raw data, in cleaned and aggregated form. The first step generally involves "exporting" the raw data from the source file (usually an Access database) into files that can be used by statistical software (usually SPSS). The remaining processing stages are then carried out using the statistical software; the instructions for the various processing steps are saved and stored in "scripts" (syntax files), so that they can be repeated when necessary without duplication of effort.

Each analysis dataset is typically re-made in a series of different versions, as data collection and data cleaning processes are finalised, and more commonly as the scripts are developed to improve the variables in the dataset. Generally, a new version of the analysis dataset is created for one of the following reasons:

  • More raw data have been collected for addition to the dataset
  • The raw data have been improved by cleaning or some new coding
  • Decisions are taken to restructure or even remove particular item variables
  • Syntax for new derived variables is added to the scripts
  • The scripts are updated in other ways to improve the dataset variables, e.g. correction of syntax errors, changes in value coding, etc.

Cleaning and aggregating raw data

As described elsewhere (see data entry and data files), raw data may be collected and entered in a variety of raw formats, and in many files.

There may be multiple files of raw data for a given data collection, generated by data entry processes. For example, in each of the early booklet studies, manual data entry typically generated different sets of files for different cohorts of twins. In later questionnaire studies, data entry by scanning was carried out in batches, generating a new file for every batch. In web studies, there were often two waves or two cohorts of data collection, carried out in successive years, each generating its own set of files. In such cases, permanently aggregating the data for each data collection (e.g. each questionnaire, or each web test) simplifies subsequent processes of dataset generation.

In booklet and questionnaire studies, the data from each questionnaire is usually aggregated into a table in the Access database for the given study. Sometimes two or more database tables are needed for a long questionnaire, because each Access table can hold no more than 255 fields.

In web studies, where the files often contain very large numbers of variables, Access database tables are unsuitable. In such cases, the data are aggregated into a single delimited text file, usually a csv file. Generally, one file is needed for each distinct web test in a given study.

Late returns of paper questionnaires are usually aggregated directly with the rest of the data by means of data entry directly into the Access database. In web studies, late returns do not happen because the web activities are halted prior to the final data download.

Aggregation, whether in an Access table or in a csv file, can be combined with processes of data cleaning. For example, duplicated questionnaires from the same participant must be eliminated, by checking for uniqueness; invalid participant IDs must also be eliminated, by checking against a lookup list; and it is essential to ensure that identical coding is used for every variable in every aggregated file.

Other data cleaning operations generally involve checking the set of values, or the range of values, recorded in every numeric field. Out-of-range or invalid values can either be corrected (if there is some record of the original raw data) or deleted. Missing values in the aggregated raw data are often replaced by coded values, typically -99 (missing) or -77 (not applicable). Direct identifiers, such as names, usernames or email addresses, are removed once they have served their purpose in data collection.

Text fields, where used, may also be checked for inappropriate values of various sorts, including identifiable text strings, which are then typically removed.

Sources of raw data

This is a brief recap of the storage of cleaned and aggregated raw data: see also the data files page.

Most of the cleaned and aggregated raw data are stored in an Access database file. There is a separate Access file for each main TEDS study. Within the Access database file, there are many tables, used for logically distinct data collections. Typically, the Access database will contain separate tables for twin data, parent data, teacher data and admin data. Access is a convenient format because multiple tables can be stored within the same database file; because the database can incorporate data-entry forms and other useful programs in addition to the data; and because it can enforce various data integrity rules to ensure that the data remain clean when updated.

The files of twin web/app data (collected in web studies from age 10 onwards, and via a phone app in TEDS21) are generally too large for convenient storage in Access. Each web data file has very many columns, and would have to be split into many database tables. Furthermore, the raw electronic data files are complete, in the sense that further data are not added manually, and there is no need for the data entry and data integrity facilities that Access provides. Instead, the data from each test/study, sometimes aggregated across two or more waves of data collection, is saved in a single data file. Such files may be periodically updated during the course of the study, as more web data are collected; but once the study ends these files become "static", and do not need further updating in any way.

Some of the raw data used to make analysis datasets are not study-specific, but consist of general family background data. These data include twin sexes, birth dates, birth orders, zygosities, and medical conditions (used to classify medical exclusions). These types of data are maintained in the TEDS admin database; they are continually updated as a result of new information from families, zygosity test results, and so on. These admin data are stored in the TEDS_ADMIN SQL Server database, running on the TEDS database server. This database is quite large and has many tables; the significant tables in this context are Twins (containing twin IDs, birth orders, sexes and birth dates), TwinMedicalCondition (containing twin medical exclusion and autism categories) and DNAprogress (containing DNA zygosity tests).

Exporting raw data

SPSS, which is used to process the data, cannot easily open data directly from Access database tables. It is therefore necessary to "export" the raw data by copying them into a more suitable file format. The .csv file format is chosen because it is platform-independent, can be read by many different software programs, has no limit on the number of rows or columns of data, and is non-proprietary (and therefore not subject to format changes with different versions, etc).

Generally, the data from each Access database table is exported into a separate csv file. A csv file consists of plain text, which can be opened in text editors like Notepad, and can also be opened in grid format by Excel. The csv format means that variables are comma-separated in rows of data, and each case starts on a new line. The first line of the file contains the column headings (or variable names).

Data are exported indirectly from Access tables using saved "queries" within each Access database file. Each query selects appropriate columns from the relevant table, hence excluding unwanted columns such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a form that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and boolean true/false columns (changed to integer 1/0 values).

Access allows the data from these queries to be exported and saved by a number of means. Choices involve the type of file to be saved (csv), inclusion of field names in the first row, and the file name and folder location for saving. These tasks may be automated within Access by means of macros, which are small programs that can be run on demand.

Background data from the TEDS admin database (SQL Server) can be exported in a similar way, if an Access front-end interface is used. The exported csv files from this source are stored in the \Export\ folder for the 1st Contact study.

The locations of all the files mentioned above are described in the data files page.

Before a new version of the analysis dataset is made, by running the scripts described below, it is not always necessary to re-export the raw data. Some new dataset versions incorporate changes made to the syntax in the scripts, but not changes in the raw data. Repeating the export of data from Access is therefore only necessary if (1) additional raw data have been collected and entered, or (2) the raw data have been updated by cleaning or other changes.

Processing by scripts

Creating one of the TEDS analysis datasets involves a fairly long sequence of operations, which are described under the headings below. These operations are carried out in statistical software (SPSS). The instructions for each operation, and the order in which the operations are performed, are encoded in proprietary syntax for SPSS; the syntax is saved in a set of "scripts" or syntax files. Each script is a plain text file, saved with a proprietary file extension (.sps for SPSS syntax).

Typically, for each dataset, the syntax is divided into several scripts, mainly in order to reduce the length of each script; each script generally contains a logically related set of functions, such as merging raw data sources, or creating scales; and this makes it easier to find a given piece of syntax, and to understand the operations being carried out. The scripts must be executed in a given order for each dataset, as the syntax operations follow each other in a logical sequence.

Each script contains not only syntax but comments. The comments give an explanation of the operations being carried out. The syntax and comments together form a crucial part of the documentation for each dataset, as they explain exactly how the dataset is constructed.

Different TEDS datasets have been created using slightly different syntax methods, sometimes with operations performed in different orders. However, they generally all involve the operations that are described under the headings below. Some examples of variations between the ways the datasets are constructed are:

  • The 1st Contact and 8 Year studies involved only parent booklets/questionnaires and no twin data. Hence, any steps involving merging on TwinID, or double entry of twin data, do not apply.
  • The in-home study has its own exclusion variable, based on information collected at the time of the visits. Hence, the usual across-age exclusion variables are not used.
  • At ages 7, 9, 10, 12 and 14, the studies involved teacher questionnaires. Hence, additional steps are needed to merge and process the teacher data (treated in essentially the same way as twin data).
  • At ages 10, 12, 14, 16, 18 and 21, the studies included twin web tests and (in some cases) parent web activities, in addition to the usual paper booklets/questionnaires. Hence, construction of these datasets involves additional files, typically one per web activity. The web activities were very varied and required different types of processing.
  • In TEDS21, parents and twins had a choice of completing the questionnaires via mobile app, on the web or on paper. Hence, there are separate files for the different sources of data, which must be made compatible, merged together, then cleaned to remove duplication for some participants.

Importing raw data

Before further processing can be carried out, the raw data must be "imported" into SPSS, from the files saved in the \Export\ folder for the given study (see exporting data above). Most of these are csv files. Importing data generally means opening a file of a foreign type (e.g. csv) and copying its data into an open SPSS data file. In SPSS, the syntax for importing a csv file also allows the name and data type of each variable to be specified.

The operation of importing a raw data file is generally followed by sorting the rows in ascending order of subject ID, sometimes by variable naming and recoding, and finally by saving the file in SPSS format so that it can be merged at a later stage.
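
For illustration, a minimal sketch of this kind of import syntax is shown below. The file paths and variable names are invented for the example, and the real scripts specify many more variables.

  * Sketch only: import a csv file of raw data (invented path and variable names).
  GET DATA
    /TYPE=TXT
    /FILE='U:\Export\parent_booklet.csv'
    /DELIMITERS=","
    /QUALIFIER='"'
    /ARRANGEMENT=DELIMITED
    /FIRSTCASE=2
    /VARIABLES=
      FamilyID F7.0
      item01 F1.0
      height F5.2 .
  * Sort by the subject identifier and save in SPSS format for later merging.
  SORT CASES BY FamilyID (A).
  SAVE OUTFILE='U:\Working\parent_booklet.sav'.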

Naming variables

In SPSS, variable naming is often done simultaneously with importing the raw data, because the syntax for importing a csv file allows the variable names to be specified at the same time. In some dataset scripts, the renaming of variables (from raw variable names to final analysis dataset variable names) is done at a later stage.

The naming of variables in TEDS datasets follows certain conventions, described in the variable naming page.
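
Where renaming is done at a later stage, the SPSS RENAME VARIABLES command is used. A trivial, hedged example (with invented names) might be:

  RENAME VARIABLES (q1_01 = gpqitem01) (q1_02 = gpqitem02).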

Formatting variables

To make variables easier to use, they are formatted in SPSS by modifying properties such as the "width", "decimals", "columns" and "measure".

Firstly, by using the SPSS FORMATS syntax command, or by specifying the format in syntax for importing a text file, each variable's "width" and "decimals" properties are set. For example, the SPSS format F1.0 would be used for a single-digit integer variable (width=1, decimals=0), and format F5.2 might be used for a decimal variable such as a twin's height (width=5, decimals=2). Setting the format does not alter the values stored in SPSS; it merely alters the default display of variable values in SPSS.

Secondly, by using the SPSS VARIABLE LEVEL syntax command, each variable can be described (in the SPSS "measure" property) as nominal, ordinal or scale. Conventionally in the TEDS datasets, integer-valued categorical variables are set to nominal or ordinal as appropriate, and this property makes it easier in SPSS to use the variables in analyses such as tabulations. Other variables set to measure=scale include quantitative measurements such as weights or ages, scales, composites and decimal-valued variables generally. String variables are nominal by default.

Further variable properties may also be set in SPSS syntax. These properties include the "columns" (set with the SPSS VARIABLE WIDTH syntax command), which determines the width of the column displayed in the SPSS data view. The default width is often appropriate, but may be changed when it is inconveniently wide or narrow. Another variable property that may be set is the "missing" property. For a variable containing a value such as 2=don't know or 9=not applicable, such a value can be set as "missing" in SPSS (using the MISSING VALUES syntax command); the result is that the value is conveniently treated as missing in any SPSS analysis, including computation of scales.
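
The short sketch below illustrates these commands together; the variable names, formats and missing value codes are invented for the example.

  * Sketch only: set display formats, measurement levels, column widths
  * and user-missing values for two invented variables.
  FORMATS gpqitem01 (F1.0) / gheight1 (F5.2).
  VARIABLE LEVEL gpqitem01 (ORDINAL) / gheight1 (SCALE).
  VARIABLE WIDTH gpqitem01 (6) / gheight1 (8).
  MISSING VALUES gpqitem01 (9).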

Recoding variables

In the raw data, missing values are generally denoted by the code value -99, and not-applicable values are denoted by the code value -77. These codes are not used in the analysis datasets, so these values are recoded to missing for all item variables (in SPSS, a system-missing or 'sysmis' value is used).

Often, the value coding used in the raw data differs from that used in the analysis datasets. For example, simple value codes of 1,2,3,... are often used in data entry for the raw data, but in the analysis dataset it might be more convenient to use value codes of 0,1,2,... (often to aid the computation of scales). Reverse-coding is sometimes necessary, to ensure that all related item variables are coded in a consistent way. The scripts therefore contain syntax for changing the value codes where necessary.

Occasionally, where booklet data have not already been fully cleaned in the raw state, there are attempts in the scripts to identify and recode invalid or compromised data values. Where identified, invalid values are generally recoded to missing, to prevent them being used in analysis. Examples of invalid values might include extreme or out-of-range quantitative measurements, such as ages or heights. Examples of compromised data might include inconsistent responses in multi-part questions.
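
A hedged sketch of such recoding is shown below; the variable names, value codes and plausible ranges are invented for the example, and do not correspond to any particular TEDS measure.

  * Sketch only: recode missing/not-applicable codes, shift value codes,
  * reverse-code an item, and remove implausible measurements.
  RECODE gpqitem01 gpqitem02 (-99 = SYSMIS) (-77 = SYSMIS).
  RECODE gpqitem01 gpqitem02 (1 = 0) (2 = 1) (3 = 2).
  RECODE gpqitem02 (0 = 2) (2 = 0).
  * Treat out-of-range heights (assumed to be in cm) as missing.
  IF (gheight1 < 50 OR gheight1 > 250) gheight1 = $SYSMIS.
  EXECUTE.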

Recoding and cleaning web/app data

The data collected in web (or app) studies, from age 10 onwards, are recoded for several purposes: to replace missing values, to identify item events such as timeouts, discontinues and branching, and to identify anomalies such as item crashes.

Furthermore, attempts have been made to identify instances of tests that were compromised by loss of data, malfunctioning of test rules, or random responding by twins. Such instances are flagged using test status variables, and are effectively excluded by setting the test variable values to missing.

The processes involved in recoding and cleaning the web data are described in more detail in a separate page.

Double entering parent/family data

In parent-reported questionnaire data, and in some admin data, there are items referring specifically to the elder and younger twins, and these occur in the same row of data for each family. Double entering these data consists of two main operations in the script: firstly, the data must be duplicated so that there is one row per twin, not just one row per family; secondly, the twin-specific variables must be "swapped over" in the duplicated data so that they correctly match the twin and co-twin for the relevant row of data. The script usually proceeds through the following stages (a minimal syntax sketch is given after the list):

  1. Start with the parent- and/or family-specific data, as imported from the raw data files. This is often done after merging all the family-specific data together (using FamilyID as the key variable), but before merging the twin/teacher data.
  2. This is usually the point at which the random variable is added (see also below).
  3. Within the family-specific data, if not already done, rename twin-specific items so that "elder twin" item variables have names ending in '1', and "younger twin" item variables have names ending in '2'.
  4. Compute the twin identifier Atempid2 by appending the digit 1 to the end of the FamilyID value.
  5. At this stage, the dataset represents the elder twins; each variable with name ending in '1' represents the elder twin as identified by Atempid2, while each variable with name ending in '2' represents the younger twin, i.e. the co-twin. Save this as the "top" half of the dataset.
  6. Now make an equivalent dataset for the younger twins, starting by re-computing the twin identifier Atempid2 so that it ends in 2 instead of 1.
  7. Also, for the second twin, reverse the value of the random variable (0 to 1, or 1 to 0) that has been derived for the elder twin.
  8. Now swap over the twin-specific variables: the variables having names ending in '1' must be swapped over with the variables having names ending in '2'. This is conveniently done by renaming the variables: for example, to swap variables var1 and var2, rename var1 to var1x, then rename var2 to var1, then rename var1x to var2.
  9. At this stage, the dataset represents the younger twins; each variable with name ending in '1' represents the younger twin as identified by Atempid2, while each variable with name ending in '2' represents the elder twin, i.e. the co-twin. Save this as the "bottom" half of the dataset.
  10. Combine the cases from the "top" and "bottom" datasets saved above, into a single dataset.
  11. Re-sort in ascending order of Atempid2.
  12. The family-specific data is now double entered. There is one row per twin, each twin being identified uniquely by Atempid2. The variables having names ending in '1' contain data for this twin (whether elder or younger). The variables having names ending in '2' contain data for this twin's co-twin.
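
To illustrate the stages above, here is a minimal sketch in SPSS syntax. The file names and the twin-specific variables (var1, var2) are placeholders rather than names used in the actual scripts, and FamilyID is assumed to be numeric so that Atempid2 can be formed arithmetically.

  * Sketch only: double enter per-family data (placeholder names).
  GET FILE='family_merged.sav'.
  * Add the random variable (see the next heading) and save the elder-twin half.
  COMPUTE random = RV.BERNOULLI(0.5).
  COMPUTE Atempid2 = (FamilyID * 10) + 1.
  SAVE OUTFILE='top.sav'.
  * Make the younger-twin half: change the ID, reverse random, swap twin items.
  COMPUTE Atempid2 = (FamilyID * 10) + 2.
  RECODE random (1 = 0) (0 = 1).
  RENAME VARIABLES (var1 = var1x).
  RENAME VARIABLES (var2 = var1).
  RENAME VARIABLES (var1x = var2).
  SAVE OUTFILE='bottom.sav'.
  * Combine the two halves and re-sort by the twin identifier.
  ADD FILES /FILE='top.sav' /FILE='bottom.sav'.
  SORT CASES BY Atempid2 (A).
  SAVE OUTFILE='family_double_entered.sav'.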

Creating the random variable

The variable named random has values 0 and 1. For a given twin, the value 0 or 1 is assigned randomly in this variable; however, co-twins must have opposite values. So if the elder twin has random=1, then the younger twin must have random=0, and vice versa. During analysis, if the dataset is filtered or selected on either random=1 or random=0, then exactly half the twins will be selected, but only one twin per pair is selected; and the selection will include an approximately 50:50 mix of elder and younger twins.

Computing this variable cannot be done directly in a dataset that already has one row per twin, because twin and co-twin values of random are not independent in different rows of the dataset. The random variable is often added to the dataset during double entry of the parent/family data (see above), because both processes involve combining "top" and "bottom" halves of the dataset. The processes involved are as follows:

  1. Start with a dataset containing the elder twin cases only. Elder twins are typically identified by variable Atempid2, with 1 as the final digit.
  2. Add the random variable, assigning the value 0 or 1 randomly to each elder twin.
  3. Save this as the "top" part of the dataset, representing the elder twins.
  4. Re-compute the twin identifier Atempid2, by changing the final digit from 1 to 2. This now identifies the younger twin cases.
  5. Modify the random variable, by recoding 1 to 0 and 0 to 1.
  6. Save this as the "bottom" part of the dataset, representing the younger twins.
  7. Combine the cases from the "top" and "bottom" datasets saved above, into a single dataset.

In the second step above, the random variable can be computed in the syntax using an appropriate SPSS random number generator function. In SPSS, a suitable command is COMPUTE random = RV.BERNOULLI(0.5), which generates approximately equal numbers of 0 and 1 values at random. Alternatively, a ready-made set of random values (one per family, matched against FamilyID) is available in a raw data file for importing and merging. The latter method was used historically in older versions of the datasets, whereas recent versions use a newly-computed value of random each time the dataset is made.

Merging data sources

In a given dataset, data typically come from several sources, e.g. from parent, twins and teachers, and from different booklets, questionnaires, web tests and admin sources. Separate raw data files are usually used to store the data from different sources; each raw data file is imported and saved as an SPSS data file (see importing raw data above). These files must be combined together, or "merged", into a single dataset. This must be done in such a way that all data relating to a given twin appears in the same row of the dataset; to do this, the files must be merged using a twin identifier as the key variable.

Many background variables that are the same across all datasets are conveniently merged in from a reference dataset containing such variables. This minimises the need to duplicate steps such as importing raw data and deriving the background variables in every dataset. Background variables treated in this way typically include zygosities and exclusion variables. See the background variables page for more details.

Different sets of data may identify twins in different ways, so sometimes it is necessary to modify the IDs so that they match up for merging. In raw twin and teacher data, twins are usually identified by a variable called TwinID. In raw parent data, each case is usually identified by a family identifier called FamilyID. As explained above, after double-entering the parent/family data, twins are identified by a variable called Atempid2. In most scripts, files are merged in this order:

  1. Merge all per-family data (parent and admin data) on FamilyID.
  2. Double enter the per-family data, resulting in a file where twins are identified by Atempid2.
  3. Merge all per-twin data (twin and teacher data) on TwinID.
  4. Replace TwinID with Atempid2 in the per-twin data.
  5. Merge the per-family data with the per-twin data on Atempid2.

The means of replacing TwinID with Atempid2 is provided by the raw data file (exported from the TEDS admin database) containing the TwinID, FamilyID and twin birth order for every TEDS twin. After importing this file, Atempid2 can be computed by appending the twin birth order (1 or 2) to the end of the FamilyID value. This file can then be merged with the per-twin data (on TwinID), hence providing the Atempid2 variable needed for merging with the per-family data.
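
The sketch below illustrates steps 4 and 5 of this merging order (assuming the per-twin data have already been merged on TwinID). The file names are placeholders, the lookup file is assumed to contain one row per twin with a numeric birth order variable (here called twinorder), and each file must be sorted by the relevant key variable before merging.

  * Sketch only: attach Atempid2 to the per-twin data and merge with the
  * double entered per-family data (placeholder file and variable names).
  GET FILE='twin_data.sav'.
  SORT CASES BY TwinID (A).
  * Look up FamilyID and twin birth order for each TwinID.
  MATCH FILES /FILE=* /TABLE='twin_lookup.sav' /BY TwinID.
  COMPUTE Atempid2 = (FamilyID * 10) + twinorder.
  SORT CASES BY Atempid2 (A).
  SAVE OUTFILE='twin_data2.sav'.
  * Merge per-family and per-twin data on the common twin identifier.
  MATCH FILES /FILE='family_double_entered.sav' /FILE='twin_data2.sav' /BY Atempid2.
  SAVE OUTFILE='all_merged.sav'.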

Before two datasets are merged, they must both be sorted in ascending order of the relevant key variable. The files must therefore be re-sorted if a new key variable is about to be used for the next merge. In the scripts, data files are usually sorted prior to saving, in steps preceding each merge.

Before merging, each set of data from a particular source is given a new flag variable to show the presence of data from that source. Sometimes these flag variables are already present in the imported raw data; if not, they are computed after importing the data into SPSS. These flag variables can be used later in the scripts to identify any cases without data from any source in the study; such cases can then be deleted.

Scrambling IDs

The family and twin identifiers used in the raw data are the same as those used in the TEDS administrative system and in all contacts with the families. These IDs are easily matched against identifiable individuals, and are never used in analysis datasets. In order to protect the identities of twins and their families, and to maintain confidentiality in the data, alternative "scrambled" (pseudonymous) IDs are used in the analysis datasets. The scrambling of the raw IDs is a key stage in dataset construction, and it is generally carried out at an early stage, once the files from all sources have been merged together.

The treatment of IDs is dealt with more fully in the scrambled IDs page.

Labelling variables

All variables to be retained in the analysis dataset are labelled, to help researchers identify what each variable represents, and to give information about value codes. While there is no rigid convention for labelling variables, where possible the following information is included in the variable label:

  • The study or approximate twin age
  • The source of data (e.g. parent, twin or teacher, booklet or web)
  • The name of the measure
  • For item variables, the item number where appropriate
  • For questionnaire items, a truncated or simplified version of the text of the question
  • Where appropriate, the nature of the item, e.g. response, score or response time for a web item.
  • Some indication of the coding, range of values, or units of measurement

Value labels are also added to categorical integer-valued variables, where there are three or more categories present. The use of value labels removes the need to specify all the values within the text of the variable label.
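
A hedged example of labelling syntax, with an invented variable name and labels, might look like this:

  * Sketch only: label an item variable and its values (invented names and labels).
  VARIABLE LABELS gpqitem01 'G parent booklet: behaviour item 1: twin is happy (0-2)'.
  VALUE LABELS gpqitem01 0 'Not true' 1 'Somewhat true' 2 'Certainly true'.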

Creating scales and composites

Scales, composites and other derived variables are generally added to the dataset after the initial stages of merging the data, and before labelling all the variables. Various methods are used to compute scales, but most involve calculation of the mean or sum of a set of related item variables. Sometimes, some of the item variables must be reverse-coded first (if this was not already done earlier in the scripts), so that all the relevant items are coded in a consistent way. Sometimes, the item variables must be transformed in other ways before computing scales: for example, some scales use standardised items; or sometimes, neutral responses like "don't know" must be removed before scaling.
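
As a hedged illustration, the sketch below reverse-codes one item and then computes a mean scale; the item names are invented, and the rule requiring at least 3 of the 5 items to be non-missing is an example rather than a fixed TEDS convention.

  * Sketch only: reverse-code one item, then compute a mean scale requiring
  * at least 3 of the 5 invented items to be non-missing.
  RECODE gpqitem05 (0 = 2) (2 = 0).
  COMPUTE gpbehav1 = MEAN.3(gpqitem01, gpqitem02, gpqitem03, gpqitem04, gpqitem05).
  VARIABLE LABELS gpbehav1 'G parent booklet: behaviour scale (mean of 5 items)'.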

For per-twin data (from teachers and twins), it is helpful to create scales before double entry, so that each scale variable only needs to be computed and labelled once. The new scale variables are then effectively copied and matched to the co-twins during double entry.

For per-family data (typically from parents), scales computed from twin-specific items will be correctly double entered if the item variables were previously double entered (see double entering parent/family data above). If this has not yet been done when the scales are computed, then the scales will subsequently have to be double entered along with the items.

Some derived variables may be referred to as "composites" rather than scales. Usually, the term "composite" implies that data from two or more different measures have been combined together, whereas a "scale" variable is generally computed from items of the same measure. Use of the term "composite" may also imply that the derivation is more complex than for a "scale", the latter typically involving a straightforward mean, sum or count. However, these distinctions are not strict.

Creating background variables

Many of the background variables are not derived in each separate dataset, but instead are merged in from a reference dataset as described above. This reduces the need to duplicate the same derivations in every dataset, hence shortening the processing and reducing the risk of errors. The background variables merged in this way generally include zygosities and exclusion variables. See the background variables page for details. See also the exclusions page.

However, some background variables must be derived separately for each dataset. These include twin ages, which are specific to each data collection, and sometimes study-specific exclusion variables.

Within each dataset, there are typically different twin age variables for different data collections. For example, within a given study, the data from parent booklets, teacher questionnaires, twin web tests and twin booklets may all have been returned on quite different dates. The two twins in a pair may often return data on different dates, hence their age variables are double entered where appropriate.
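
As a hedged example, a twin's age at questionnaire return might be computed from two date variables (the names here are invented) as follows:

  * Sketch only: compute age in years at booklet return from invented date variables.
  COMPUTE gpqage1 = DATEDIFF(gpqreturndate, dob1, "days") / 365.25.
  FORMATS gpqage1 (F5.2).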

Double entering twin data

Raw data from twin-specific sources, such as teacher questionnaires, twin tests and twin booklets, will contain one row of data per twin. Every data item relates to this twin. To double enter these data, the corresponding co-twin data items must be added to the same row of data for every twin. The data items must therefore be duplicated and matched to the co-twins.

Note that double entering the per-twin data, as described here, involves different processing from double entering the per-family data, as described above. Note also that double entering the per-family data is usually done at an early stage in the scripts, before the per-family data are merged with the per-twin data. However, double entering the per-twin data is more conveniently done later, once all the data sources have been merged together, after items have been labelled, and after scales have been computed.

In most cases, this double entry procedure comes after the script that scrambles the IDs. So the description below assumes that the twins are identified by the variable id_twin. The corresponding family identifier is id_fam. The value of id_twin comprises the twin birth order (1 or 2) appended to the value of id_fam. The same double entry procedure can be used with unscrambled IDs Atempid2 and FamilyID if necessary.

The exact method used for double entry can vary between scripts, but the procedure always follows some version of the steps below (a minimal syntax sketch is given after the list):

  1. Rename all item variables from twin-specific sources, so that the variable names end in '1'.
  2. Sort the dataset in ascending order of twin identifier (usually id_twin).
  3. At this stage, the dataset represents the data for the "index" twins, i.e. the data relating to the twin identified by id_twin in each row of data. Save this as the "left hand" part of the dataset.
  4. Now make an equivalent dataset for the co-twins. Start by dropping any per-family variables that do not require double entry, and also dropping variables that have already been double entered (for example, from the parent data).
  5. Rename all the twin-specific item variables so that the variable names end in '2' instead of '1'.
  6. Re-compute id_twin, to match the co-twin's id_twin value in each case: if the final digit of id_twin is 2, change it to 1, and vice versa.
  7. Sort the dataset in ascending order of the re-computed id_twin.
  8. At this stage, the dataset represents the data for the co-twins. Save this as the "right hand" part of the dataset.
  9. Merge the left and right hand parts, as saved above, using id_twin as the key variable. The dataset is now double entered.
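
The sketch below illustrates these steps. The file and variable names are placeholders, the twin items are assumed to have already been renamed to end in '1', and id_twin is assumed to be numeric with the twin birth order (1 or 2) as its final digit.

  * Sketch only: double enter per-twin data (placeholder names).
  GET FILE='all_merged.sav'.
  SORT CASES BY id_twin (A).
  SAVE OUTFILE='left.sav'.
  * Make the co-twin copy: drop variables that are already double entered,
  * rename twin items from '1' to '2', and flip the final digit of id_twin.
  DELETE VARIABLES random exclude1 gpbehav1 gpbehav2.
  RENAME VARIABLES (gtqvocab1 = gtqvocab2) (gtqage1 = gtqage2).
  DO IF MOD(id_twin, 10) = 1.
    COMPUTE id_twin = id_twin + 1.
  ELSE.
    COMPUTE id_twin = id_twin - 1.
  END IF.
  SORT CASES BY id_twin (A).
  SAVE OUTFILE='right.sav'.
  * Merge each twin's row with the matching co-twin row.
  MATCH FILES /FILE='left.sav' /FILE='right.sav' /BY id_twin.
  SAVE OUTFILE='dataset_double_entered.sav'.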

After double entry, it may be possible to reduce the size of the dataset by deleting any twin pairs that do not have any data in the study. This may be done by selecting on the flag variables that show the presence of data from each source. To maintain twin pairs and avoid unpaired "singletons", selection should be based on the double-entered flag variables for both twin and co-twin in each part of the data.
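
A hedged example of such a selection, using invented double-entered flag variable names, might be:

  * Sketch only: keep a row only if the twin or the co-twin has data from at
  * least one source (invented flag variable names).
  SELECT IF (pdata1 = 1 OR pdata2 = 1 OR tdata1 = 1 OR tdata2 = 1).
  EXECUTE.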