TEDS Data Dictionary

Processing the 21 Year Data

Introduction

This page describes how the 21 Year analysis dataset is created. The starting point is the collection of raw data files, already cleaned and aggregated as far as possible. These files are described in the 21 Year data files page. There are now three main sources of data used to build the 21 Year analysis dataset:

  1. The Access database file called 21yr.accdb. This provides the following:
    1. Twin paper booklet data (TEDS 21 phases 1 and 2)
    2. Parent paper booklet (TEDS 21 phase 1)
    3. General study admin data, such as booklet return dates
    This Access database provides data for the dataset via the exported csv text files that are saved in the \Export\ folder.
  2. Web (or web+app) cleaned raw data files, stored in the \web data files\ folder. There is one such file for each data collection: TEDS21 parent, twin phase 1 and twin phase 2; the twin g-game; and twin covid questionnaires phases 1, 2, 3 and 4.
  3. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database then importing them in the creation of every dataset, this is done separately in a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

These data collections for the 21 Year study have now ended. The web sites and apps have been closed, and no further late booklet returns are expected. The files above are therefore essentially static. Changes to the raw data files will only be made if there are decisions to improve data cleaning or modify data structures, for example. Re-exporting the csv files from the Access database should only be necessary after such a change.

Other than exporting updated files from the databases, most steps in building the dataset are carried out by SPSS scripts (syntax files). These scripts carry out a long sequence of steps that are described in outline below. The end result is the latest version of the dataset, which is an SPSS data file.

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 21 year data files are described in more detail on another page.

Exporting raw paper booklet and admin data

Exporting involves copying the cleaned and aggregated raw TEDS 21 paper booklet data from the Access database tables where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

In the event that any changes are made in the raw data stored in the Access database, for example to improve data cleaning, then the data should be re-exported into new versions of the csv files. The data stored in the database tables are in some cases exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. A query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The Access queries and tables used to export the data, and the resulting csv files, are listed below.

Query or table name | Source of data | Database table(s) involved | Exported file name
TwinPhase1Part1, TwinPhase1Part2 | TEDS 21 twin phase 1 paper questionnaires | TwinPhase1Part1, TwinPhase1Part2 | TwinPhase1Part1.csv, TwinPhase1Part2.csv
TwinPhase2Part1, TwinPhase2Part2 | TEDS 21 twin phase 2 paper questionnaires | TwinPhase2Part1, TwinPhase2Part2 | TwinPhase2Part1.csv, TwinPhase2Part2.csv
ExportParent | TEDS 21 parent phase 1 paper questionnaires | ParentPhase1 | ParentPhase1.csv
ExportTEDS21admin | return dates for paper questionnaires | TEDS21progress | TEDS21admin.csv

A convenient way of exporting these data files is to run a macro that has been saved for this purpose in the database. See the data files summary page and the 21 Year data files page for further information about the storage of the files mentioned above.
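The format conversions applied by the export queries (dates as dd.mm.yyyy, true/false as 1/0) can be illustrated with a short sketch. The dataset itself is built in SPSS, so the Python below is only an illustration of how such exported values might be read back; the column names FamilyID, ReturnDate and Returned and the sample rows are hypothetical, not taken from the real TEDS21admin.csv file.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical excerpt of an exported file; column names are illustrative only.
SAMPLE_CSV = "FamilyID,ReturnDate,Returned\n101,05.03.2018,1\n102,,0\n"

def parse_export_date(text):
    """Convert a dd.mm.yyyy string to a date; blank cells become None."""
    text = text.strip()
    return datetime.strptime(text, "%d.%m.%Y").date() if text else None

def parse_export_flag(text):
    """Convert a 1/0 cell to True/False; blank cells become None."""
    text = text.strip()
    return None if text == "" else text == "1"

def read_export(handle):
    """Read an exported csv file into a list of typed row dictionaries."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "FamilyID": int(row["FamilyID"]),
            "ReturnDate": parse_export_date(row["ReturnDate"]),
            "Returned": parse_export_flag(row["Returned"]),
        })
    return rows

admin_rows = read_export(io.StringIO(SAMPLE_CSV))
```

Writing dates in a single fixed format and booleans as 1/0 means the importing script needs only one parsing rule per column type, which is presumably why the queries normalise the values before export.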

Processing by scripts

Having exported and prepared the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Scripts 1a/b: Importing raw data files

Each of these two scripts has two main purposes: first, to import raw data from text files into SPSS; second, to merge them into larger files that combine different data collections.

These scripts are also used to derive some admin-based status variables, and to set basic variable properties.

The scripts are named U1a_import_TEDS21.sps and U1b_import_ggame_and_covid.sps and they deal with raw data from the TEDS21 data collections and the g-game and covid data collections respectively. They have been separated into two scripts because of their length, to make them easier to use.

Each of the scripts carries out the following steps in order, where appropriate.

  1. Open each raw text file into SPSS. Each raw file is a comma-delimited (csv) file.
  2. Sort by participant identifier (FamilyID for parent data or TwinID for twin data).
  3. Where there are two data files for the data collection, merge the variables together into a single file, using the participant identifier as the key variable. This is the case for paper data in TEDS21 twin phase 1 and in TEDS21 twin phase 2.
  4. Derive 'status' variables as a way of keeping track of the completion of each activity in the data collection. Each questionnaire, and the g-game, is divided into logical or named sections, and a status variable is used to measure the completion of each such section. This is generally done by counting the responses to items in each section.
  5. Delete any rows in which the status variables indicate that no meaningful data are present.
  6. Name or rename each variable. For each TEDS21 data collection, the equivalent variable must have the same name in the paper and web/app files.
  7. Set the visible width and number of decimal places for each item variable.
  8. In the TEDS21 paper booklet raw data, missing and not applicable responses are coded -99 and -77 respectively; recode these values to missing.
  9. If necessary, recode categoric variables so that the coding is consistent with equivalent variables from the other sources in the same data collection.
  10. In the g-game data, recode responses stored as text into numeric categories.
  11. For TEDS21, create a variable denoting the source of data (CMS app, CMS web, backup web, paper).
  12. For each TEDS21 data collection, aggregate the two files (paper data and web/app data) into a single file. Double check that no twins are duplicated after aggregation - eliminate duplicates if found.
  13. Where applicable, from web data, derive variables to measure the time spent completing each section. This is done by subtracting the start date-time from the end date-time (where these are recorded).
  14. Drop variables that are not to be retained in the dataset. These may include temporary variables used for counting responses, for example, and date-time variables that have now been used to derive durations.
  15. Merge files for different data collections together. This is done for phases 1 and 2 in TEDS21 (for twins); and the g-game data are merged with the four covid data files.
  16. Save working SPSS data files ready for processing by the next script. At this stage, there are 3 files: one for parents in TEDS21, one for twins in TEDS21, and one for twins in the g-game and the covid study phases.
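Several of the steps above (recoding -99 and -77 to missing, counting responses to derive a status variable, deleting rows without data, and deriving section durations) can be sketched as follows. This is a Python illustration of the logic only, not the SPSS syntax used in the scripts. The three-level status coding (0 = none, 1 = partial, 2 = complete), the sample rows and the use of minutes for durations are assumptions made for the example.

```python
from datetime import datetime

MISSING_CODES = {-99, -77}  # missing / not applicable codes in the paper data

def recode_missing(value):
    """Return None for the paper-booklet missing codes, else the value unchanged."""
    return None if value in MISSING_CODES else value

def section_status(responses):
    """Hypothetical status coding: 0 = no responses, 1 = partial, 2 = complete."""
    answered = sum(1 for r in responses if r is not None)
    if answered == 0:
        return 0
    return 2 if answered == len(responses) else 1

def duration_minutes(start, end):
    """Time spent on a section; None where either date-time was not recorded."""
    if start is None or end is None:
        return None
    return (end - start).total_seconds() / 60

# Illustrative rows: twin ID plus raw item responses for one section.
raw_rows = [
    {"TwinID": 1011, "items": [3, -99, 2, 4]},
    {"TwinID": 1012, "items": [-99, -99, -77, -99]},
    {"TwinID": 1021, "items": [1, 2, 3, 4]},
]

cleaned = []
for row in raw_rows:
    items = [recode_missing(v) for v in row["items"]]
    status = section_status(items)
    if status > 0:  # drop rows in which no meaningful data are present
        cleaned.append({"TwinID": row["TwinID"], "items": items, "status": status})

elapsed = duration_minutes(datetime(2018, 6, 1, 10, 0),
                           datetime(2018, 6, 1, 10, 12, 30))
```

In this sketch the second twin is dropped because every item was coded as missing or not applicable, which mirrors the role of the status variables in step 5.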

Script 2: Merging and recoding raw data files

The main purpose of this script (filename U2_merge.sps) is to merge various raw data files together, so as to create an initial dataset with one row of data per twin. The raw data files include those created in the previous set of scripts, plus admin-related data from other sources. The script also carries out some low-level processing of the raw data, such as recoding and renaming some raw item variables, and setting variable formats. The script carries out the following tasks in order:

  1. Import into SPSS the raw admin data file containing all twin IDs and birth orders.
  2. Merge this with the two twin files saved at the end of the previous scripts (twin TEDS21 data, and twin covid and g-game data). The files are merged using TwinID as the key variable.
  3. Recode item variables to ensure consistent patterns of coding across all the data collections.
  4. The TEDS21 and g-game twin data include QC items. Recode these from their raw responses into error flags.
  5. Create reversed versions of item variables, where these will be needed later for deriving scales.
  6. Set variable levels, where this was not already done in an earlier script.
  7. Double enter the main twin data flags (for each data collection), as follows:
    1. Compute the alternative twin identifier utempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Change the names of the twin data flag variables by appending the suffix 1.
    3. Sort in ascending order of utempid2 and save this file as the twin 1 part.
    4. Change the flag variable names by changing the ending from 1 to 2. Change the values of utempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of utempid2 and save with just the renamed variables as the twin 2 part.
    5. Merge the twin 1 and twin 2 parts using utempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
    6. Save this file as the aggregated twin data file.
  8. Import the TEDS21 family admin data file, containing booklet return dates.
  9. Merge this with the parent data file (containing TEDS21 parent data) that was saved in the earlier script, using FamilyID as the key variable.
  10. Recode item variables to ensure consistent patterns of coding across the dataset.
  11. Create reversed versions of item variables, where these will be needed later for deriving scales.
  12. Set variable levels (nominal, ordinal or scale) for all items.
  13. Double enter twin-specific items in the parent/family-based data as follows:
    1. Compute twin identifier utempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute utempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of utempid2 and save as the double entered family data file.
  14. Merge the aggregated twin data file with the double entered family data file, using utempid2 as the key variable. This dataset now contains all the raw data.
  15. Use the parent data flag and the double entered twin data flags to filter the dataset and delete any cases without any 21 Year data.
  16. Add a flag variable uteds21data to indicate the presence of any 21 year data for each twin pair (from parents and/or twins, in any data collection).
  17. Recode all data flag variables from missing to 0.
  18. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  19. Sort in ascending order of scrambled twin ID id_twin.
  20. Save the file and drop the raw ID variables.
  21. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  22. Use the data flags to filter the dataset and delete any cases without any 21 Year data.
  23. Save a working SPSS data file ready for the next script.
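The double entry of twin-specific items in the family data (step 13 above) can be sketched as below. This is a Python illustration rather than the SPSS syntax in U2_merge.sps. The variable names item_elder, item_younger, item1 and item2 are hypothetical, FamilyID is assumed to be numeric, and the way Random is generated here (a seeded 0/1 draw per family) is an assumption, since the method is not described on this page.

```python
import random

def double_enter_family(family_rows, seed=0):
    """Return two rows per family (elder then younger twin), sorted by utempid2."""
    rng = random.Random(seed)  # assumption: the real derivation of Random is not shown here
    out = []
    for fam in family_rows:
        rand = rng.randint(0, 1)
        elder = {
            "utempid2": fam["FamilyID"] * 10 + 1,  # FamilyID followed by twin order 1
            "Random": rand,
            "item1": fam["item_elder"],            # this twin's value
            "item2": fam["item_younger"],          # co-twin's value
        }
        younger = {
            "utempid2": fam["FamilyID"] * 10 + 2,  # FamilyID followed by twin order 2
            "Random": 1 - rand,                    # reversed for the younger twin
            "item1": fam["item_younger"],          # elder/younger values swapped
            "item2": fam["item_elder"],
        }
        out.extend([elder, younger])  # equivalent to combining the parts by adding cases
    return sorted(out, key=lambda r: r["utempid2"])

families = [
    {"FamilyID": 102, "item_elder": 1, "item_younger": 3},
    {"FamilyID": 101, "item_elder": 4, "item_younger": 2},
]
family_double = double_enter_family(families)
```

Each family row becomes two twin rows, so the result can be merged one-to-one with the aggregated twin data file on utempid2.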

Script 3: Clean and correct data

The purpose of this script (filename U3_clean.sps) is to clean the data, as far as is possible. Cases of apparently careless or random responses are identified, in the TEDS 21 twin questionnaires and in the twin g-game, based on responses to QC items, response times, patterns of uniform responding and (for the g-game) patterns of low scoring. In TEDS 21, response time for each theme was computed as the time interval between the recorded start and end of the theme. In the g-game, instead, the mean item response time was used. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Search for random responders in the TEDS21 twin data and in the g-game twin data, as follows.
    1. Identify patterns of uniform responding in the immediate vicinity of each QC item (same response in the QC item and in at least 3 of the 4 surrounding items). Exclude data from the theme/section if such a pattern is found, together with a QC error, in any measure within the theme.
    2. Identify patterns of rapid responding in each theme/section (mean response time per item answered is in the lowest 20% of the distribution). Exclude data from the theme if such a pattern is found, together with a QC error occurring in any measure within the theme; for g-game sub-tests, an additional criterion of low sub-test score was used.
    3. In each g-game sub-test, additionally exclude twins with extremes of rapid responding (roughly, below the 0.2th percentile of the distribution).
    4. Exclude across the entire questionnaire if two or more themes are excluded using the rules above (do this independently for phase 1 and phase 2 of TEDS21). Likewise in the g-game, exclude across the entire battery of 5 sub-tests if two or more sub-tests are excluded using the rules above.
    5. Where exclusions have been made, recode the relevant item data (within the theme, or across the entire questionnaire) to missing to ensure consistent use in analysis.
  3. Save a working SPSS data file ready for the next script.
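The uniform-responding rule and the questionnaire-level exclusion rule can be sketched as below. This is a Python illustration of the logic, not the SPSS syntax in U3_clean.sps. It assumes that the "4 surrounding items" are the two items before and the two items after the QC item, and that items near the start or end of a theme simply have fewer neighbours; neither detail is stated on this page.

```python
def uniform_near_qc(responses, qc_index, min_matches=3):
    """True if the QC response is repeated in at least `min_matches` of the
    (up to) four neighbouring items, assumed to be two before and two after."""
    qc_value = responses[qc_index]
    if qc_value is None:
        return False
    neighbours = [
        responses[i]
        for i in (qc_index - 2, qc_index - 1, qc_index + 1, qc_index + 2)
        if 0 <= i < len(responses)
    ]
    return sum(1 for r in neighbours if r == qc_value) >= min_matches

def exclude_theme(responses, qc_index, qc_error):
    """A theme is excluded only if uniform responding coincides with a QC error."""
    return qc_error and uniform_near_qc(responses, qc_index)

def exclude_questionnaire(theme_exclusions):
    """Exclude the whole questionnaire when two or more themes are excluded."""
    return sum(theme_exclusions) >= 2

def apply_exclusion(responses, excluded):
    """Recode item data to missing (None) where an exclusion applies."""
    return [None] * len(responses) if excluded else list(responses)

# Illustrative themes, each with the QC item at a known position.
theme_a = [2, 3, 3, 3, 3, 1]   # QC item at index 3, with a QC error
theme_b = [1, 4, 2, 3, 5, 2]   # QC item at index 3, with a QC error
flags = [exclude_theme(theme_a, 3, True), exclude_theme(theme_b, 3, True)]
theme_a_clean = apply_exclusion(theme_a, flags[0])
whole_questionnaire = exclude_questionnaire(flags)
```

Requiring both a QC error and a suspicious response pattern keeps the rule conservative: a twin who gives uniform but attentive answers, or who makes a single QC slip, is not excluded on that basis alone.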

Script 4: Derive new variables

The purpose of this script (filename U4_derive.sps) is to add derived variables, including scales, time- and age-related variables, and zygosity and exclusion variables. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive variables for individual twin ages when each data collection was carried out, based on start dates in electronic data or return dates for paper booklets.
  3. Add scales for questionnaire measures (TEDS 21 and covid).
  4. Add scores for cognitive measures (g-game).
  5. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  6. Save a working SPSS data file ready for the next script.
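As a rough illustration of the age and scale derivations, the Python sketch below computes a decimal age from a birth date and a data collection date, and a scale score as the mean of the non-missing items. Both rules are assumptions made for the example: the actual formulas are in U4_derive.sps and are not reproduced on this page, so the 365.25-day year, the rounding to 2 decimal places and the "at least half the items" requirement should not be read as the TEDS method.

```python
from datetime import date

def age_in_years(birth_date, event_date):
    """Decimal age at the event date; None if either date is missing.
    Assumption: age = elapsed days / 365.25, rounded to 2 decimal places."""
    if birth_date is None or event_date is None:
        return None
    return round((event_date - birth_date).days / 365.25, 2)

def scale_mean(items, min_fraction=0.5):
    """Mean of the non-missing items; None if too few items were answered.
    Assumption: the half-of-items threshold is illustrative only."""
    answered = [v for v in items if v is not None]
    if not items or len(answered) < min_fraction * len(items):
        return None
    return sum(answered) / len(answered)

twin_age = age_in_years(date(1996, 1, 1), date(2017, 1, 1))
full_scale = scale_mean([1, 2, None, 3])
sparse_scale = scale_mean([None, None, None, 4])
```

Once ages have been derived in this way, the date variables themselves are no longer needed, which is consistent with dropping them in step 5.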

Script 5: Label the variables

The purpose of this script (filename U5_label.sps) is simply to add variable labels and value labels to the variables in the dataset. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all variables.
  3. Add value labels to categorical variables, where appropriate (generally, for any numeric categorical variable having 3 or more categories).
  4. Save a working SPSS data file ready for the next script.

Script 6: Double entering the data

The purpose of this script (filename U6_double.sps) is to double-enter all the twin-specific data in the dataset. Note that a few variables (twin-specific variables from family-level data, and twin data flags) are already correctly double-entered at this stage (this was done in script 2). The script carries out the following tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 1 part: rename all twin-specific item and derived variables by adding 1 to the end of the name, then save the dataset.
  3. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the appropriate item and derived variables by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  4. Re-open the twin 1 part.
  5. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  6. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  7. Save an SPSS data file (filename u6double in the \working files\ subdirectory).
  8. Save another copy as the full 21 Year dataset, with filename Udb9456_full.
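The double entry procedure above (rename with suffix 1, make a copy with suffix 2 keyed by the co-twin's ID, then merge) can be sketched as below. This is a Python illustration of the logic, not the SPSS syntax in U6_double.sps. It assumes that id_twin is numeric with a final digit of 1 or 2, and the variable name score and the sample IDs are hypothetical. Leaving co-twin values as missing when the co-twin has no row is also an assumption.

```python
def cotwin_id(id_twin):
    """Swap the final digit of the twin ID between 1 and 2 to give the co-twin's ID."""
    stem, order = divmod(id_twin, 10)
    return stem * 10 + (3 - order)

def double_enter(rows, twin_vars):
    """rows: one dict per twin, holding 'id_twin' plus the twin-specific variables."""
    # Twin 1 part: each twin's own values, with suffix 1 added to the names.
    part1 = {r["id_twin"]: {v + "1": r.get(v) for v in twin_vars} for r in rows}
    # Twin 2 part: the same values with suffix 2, keyed by the co-twin's ID.
    part2 = {cotwin_id(r["id_twin"]): {v + "2": r.get(v) for v in twin_vars}
             for r in rows}
    merged = []
    for id_twin in sorted(part1):
        blank = {v + "2": None for v in twin_vars}  # co-twin has no row in the data
        merged.append({"id_twin": id_twin,
                       **part1[id_twin],
                       **part2.get(id_twin, blank)})
    return merged

twins = [
    {"id_twin": 50012, "score": 7},
    {"id_twin": 50011, "score": 9},
    {"id_twin": 60021, "score": 4},
]
double_entered = double_enter(twins, ["score"])
```

After double entry, each row carries the twin's own value (score1) alongside the co-twin's value (score2), so within-pair comparisons can be made without reshaping the dataset.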