TEDS Data Dictionary

Processing the 18 Year Data

Introduction

This page describes how the 18 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are three main sources of data for the 18 Year analysis dataset:

  1. Twin web test data. These are stored in .csv text files, one file per web activity (Perception, Bricks, Kings Challenge, Navigation and FFMP).
  2. The Access database file called 18yr.accdb. This provides the following:
    1. Twin questionnaire data
    2. Questionnaire study admin data
  3. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database and importing them each time a dataset is created, this is done once in a separate reference dataset containing all the background variables. That reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" or converted into files that can be used by SPSS; secondly, the exported data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 18 year data files are described in more detail on another page.

Exporting raw questionnaire and admin data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The 18 year questionnaire data and the administrative data, stored in the Access 18yr.accdb database file, are subject to occasional changes, even after the end of the study. These changes are sometimes due to late returns of data, and may sometimes be due to data cleaning or data restructuring changes. These data should therefore be re-exported before a new version of the dataset is created from the SPSS scripts. The data stored in the database tables are in some cases exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. A query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The query used to export the twin questionnaire data is called ExportTwinQnr (based on the TwinQnr table) and the csv file exported from this query is called TwinQnr.csv.
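The format changes applied by the export queries can be illustrated with a short Python sketch. The column names here are hypothetical; the conversions (dates to dd.mm.yyyy, true/false to 1/0) are those described above:

```python
from datetime import date

def to_export_date(d):
    """Render a date in the dd.mm.yyyy format used in the exported csv files."""
    return d.strftime("%d.%m.%Y")

def to_export_bool(value):
    """Render a true/false column as the 1/0 values used in the exported csv files."""
    return 1 if value else 0

# Example: one row of admin data, converted for export (field names illustrative).
row = {"ReturnDate": date(2013, 5, 9), "Returned": True}
exported = {"ReturnDate": to_export_date(row["ReturnDate"]),
            "Returned": to_export_bool(row["Returned"])}
print(exported)  # {'ReturnDate': '09.05.2013', 'Returned': 1}
```

Values in this form can be read into SPSS without any further type conversion.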

A convenient way of exporting these data files is to run a macro that has been saved for this purpose in the database. See the data files summary page and the 18 Year data files page for further information about the storage of the files mentioned above.

Preparing raw web data

Data from the twin web studies were stored on the web server during the course of each study. The Perception study took place in two waves (pilot and main study, including a retest exercise for a small number of twins); similarly, the Navigation study took place in two distinct waves (the aborted first wave, and an extended second wave); each of the other web studies (Bricks, Kings Challenge and FFMP) took place in a single wave. At the end of each study (or at the end of each wave for Perception and Navigation), the data were exported from the web server in two tab-delimited text files: an admin data file (containing various test status flags, start and end date/times, and some other items); and a main or "complete" data file (containing all item data from all activities in the study). These files were copied from the web server and initially stored as the primary raw data for each web study.

At a later date, for each of the 5 web studies at age 18+, the various files were aggregated together, so that there is now a single file for each web study. These twin web files contain too many variables to be conveniently imported into Access database tables alongside the admin and questionnaire data. Instead, they are stored separately as csv text files. When the original files were aggregated in this way, identifying fields other than IDs (e.g. names) were removed.

In the initial pilot wave of the Perception study, a small number of twins were asked to repeat the activities for the purpose of test-retest analysis. These twins therefore each had two sets of data in the original pilot data files. During processing, these have been separated so that only the first attempt by any given twin is included in the dataset. The second attempt, where it exists, has not been used in the dataset.
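Keeping only the first attempt per twin amounts to a simple deduplication, which can be sketched as follows (twin IDs and field names are illustrative, not the real file layout):

```python
def first_attempts(rows):
    """Keep only the first attempt per twin, assuming rows are in attempt order."""
    seen = set()
    kept = []
    for row in rows:
        if row["TwinID"] not in seen:
            seen.add(row["TwinID"])
            kept.append(row)
    return kept

rows = [
    {"TwinID": 101, "attempt": 1, "score": 17},
    {"TwinID": 102, "attempt": 1, "score": 22},
    {"TwinID": 101, "attempt": 2, "score": 19},  # retest: dropped
]
print(first_attempts(rows))  # only the attempt-1 rows remain
```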

Processing by scripts

Having exported and prepared the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename R1_merge.sps) is to merge various raw data files together, so as to create an initial dataset with one row of data per twin. The script also carries out some low-level processing of the raw data, such as recoding and renaming some raw item variables, and setting variable formats. The script carries out the following tasks in order:

  1. There are 7 raw csv data files of twin-level data: (1) the 18 year questionnaire data, (2) the admin data file of twin IDs and birth orders, plus the files for the 5 web studies (Perception, FFMP, Bricks, Kings Challenge and Navigation). For each of these files in turn, carry out the following actions where necessary:
    1. Import the csv file into SPSS.
    2. For the questionnaire data, recode -99 and -77 to missing in all items.
    3. Carry out basic recoding of other categorical variables where necessary. In the web activities, this includes some recoding of text variables into numeric categories.
    4. Where necessary, add status and data flags for web activities and their sub-tests.
    5. Set variable formats including numeric width and decimal places to display, and variable level (nominal/ordinal/scale).
    6. Rename variables.
    7. Sort by twin identifier TwinID and save for merging later.
  2. Using TwinID as the key variable, merge together all 7 twin data files as described above (18 year questionnaires, twin IDs and birth orders, and the 5 web study files).
  3. Double enter the main twin data flags (for the questionnaire and for each of the web studies), as follows:
    1. Compute the alternative twin identifier rtempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Change the names of the twin data flag variables by appending the suffix 1.
    3. Sort in ascending order of rtempid2 and save this file as the twin 1 part.
    4. Change the flag variable names by changing the ending from 1 to 2. Change the values of rtempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of rtempid2 and save with just the renamed variables as the twin 2 part.
    5. Merge the twin 1 and twin 2 parts using rtempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
    6. Save this file as the aggregated twin data file.
  4. There is just one file of family-based raw background data: the 18 year questionnaire admin data (including cohort and return dates). For this file, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary.
    6. Drop raw data variables that are not to be retained in the datasets.
  5. Double enter twin-specific items in the family-based data as follows:
    1. Compute twin identifier rtempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute rtempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of rtempid2 and save as the double entered family data file.
  6. Merge the aggregated twin data file with the double entered family data file, using rtempid2 as the key variable. This dataset now contains all the raw data.
  7. Use the double entered twin data flags to filter the dataset and delete any cases without any 18 Year data.
  8. Add a flag variable r18year to indicate the presence of any 18 year data for each twin pair (from the questionnaire and/or from the web studies).
  9. Recode all data flag variables from missing to 0.
  10. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  11. Sort in ascending order of scrambled twin ID id_twin.
  12. Save the file and drop the raw ID variables.
  13. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  14. Use the double entered twin data flags to filter the dataset and delete any cases without any 18 Year data.
  15. Save a working SPSS data file ready for the next script.
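The double-entry logic in tasks 3 and 5 above can be sketched in Python. The sketch below illustrates task 5 (double entering family-based data): each family row is split into an elder twin part and a younger twin part, with the Random variable reversed and twin-specific variables swapped in the younger part. The variable names (sex1, sex2, ownsex, cosex) are illustrative, not the real dataset names:

```python
def double_enter_family(families):
    """Sketch of double entering family-level data (script 1, task 5)."""
    out = []
    for fam in families:
        fid = fam["FamilyID"]
        # Elder twin part: rtempid2 = FamilyID followed by 1.
        out.append({"rtempid2": int(f"{fid}1"), "Random": fam["Random"],
                    "ownsex": fam["sex1"], "cosex": fam["sex2"]})
        # Younger twin part: rtempid2 ends in 2, Random reversed (assuming
        # 0/1 coding), and twin-specific variables swapped so that each row
        # describes that twin's own data and the co-twin's data.
        out.append({"rtempid2": int(f"{fid}2"), "Random": 1 - fam["Random"],
                    "ownsex": fam["sex2"], "cosex": fam["sex1"]})
    return sorted(out, key=lambda r: r["rtempid2"])

fams = [{"FamilyID": 5001, "Random": 0, "sex1": 1, "sex2": 2}]
print(double_enter_family(fams))
```

After combining the two parts and sorting by rtempid2, each twin appears as one row carrying both its own and its co-twin's values, which is what the subsequent merge in task 6 relies on.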

Script 2: Clean web data

The purpose of this script (filename R2_clean.sps) is to make the web item data easier to use, and to clean up anomalies in the raw web data. This is achieved largely by detecting and recoding anomalies such as missing data, item timeouts, interrupted items, and extreme outliers in item times. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. For Kings Challenge tests em, 2d and 3d, which have raw item responses comprising long and unwieldy strings, recode the item responses into simplified numeric categorical variables.
  3. For Navigation missions, for each attempt of each task, replace the two raw outcome item variables (completion and error) with a single outcome variable. Conventional response and score items do not exist in the Navigation data; the outcome variable takes the place of both, and it is coded in the same way as for response variables in other studies (with values -1, -2, -3 and -4 as outlined below).
  4. For each web test having an item timeout rule, identify timed out items and recode the response variables to -1.
  5. For each web test having a discontinue rule, identify discontinued items and recode their item response variables to -2.
  6. In Navigation tasks involving multiple attempts, detect later attempts skipped due to success in an earlier attempt, and recode their outcome variables to -3.
  7. Where appropriate, detect interrupted or crashed items (or, for Navigation, missions with no twin input) and recode the item response variables to -4.
  8. Identify extreme outliers in item response times, and recode them to missing. Do likewise with invalid item response times (e.g. where timed out).
  9. Detect invalid instances of twin tests, in which the test data have been severely compromised by missing data caused by interruptions, timeouts or other events. Where detected, recode the test status variable from 2 to 3.
  10. Detect further invalid instances of twin tests, in which twins appear to have responded randomly or without effort across many of the items. Where detected, recode the test status variable from 2 to 4.
    11. Where incomplete or invalid instances of twin tests have been detected (status variable value is 1, 3 or 4), recode the test data flag from 1 to 0, and recode the test item variables to missing.
  12. Save a working SPSS data file ready for the next script.
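The recoding conventions in tasks 4 to 8 above can be summarised in a Python sketch. The status flags, field names and the outlier threshold here are illustrative; they stand in for the test-specific rules applied by the script:

```python
# Negative codes used in place of missing item responses in the cleaned data.
TIMED_OUT, DISCONTINUED, SKIPPED, INTERRUPTED = -1, -2, -3, -4

def clean_item(item):
    """Return (response, response_time) after cleaning one web test item.
    `item` is a dict with a response, a response time 'rt', and boolean
    status flags; the field names are illustrative, not the raw layout."""
    if item.get("timed_out"):
        return TIMED_OUT, None       # task 4: timed out; time is invalid
    if item.get("discontinued"):
        return DISCONTINUED, None    # task 5: discontinue rule applied
    if item.get("skipped"):
        return SKIPPED, None         # task 6: later attempt not needed
    if item.get("interrupted"):
        return INTERRUPTED, None     # task 7: interrupted or crashed item
    rt = item["rt"]
    if rt is not None and rt > 300.0:
        rt = None                    # task 8: extreme outlier time (threshold illustrative)
    return item["response"], rt

print(clean_item({"timed_out": True, "response": 2, "rt": 30.0}))  # (-1, None)
print(clean_item({"response": 1, "rt": 4.2}))                      # (1, 4.2)
```

Coding all these anomalies as distinct negative values, rather than a single missing code, preserves the reason each response is absent while keeping the items easy to recode to missing for analysis.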

Script 3: Derive new variables

The purpose of this script (filename R3_derive.sps) is to add derived variables, including scales, time- and age-related variables. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Based on the questionnaire data, derive grade variables for specific subject A-levels and AS-levels.
  3. Derive an estimated UCAS score, based on grades in A- and AS-levels and other appropriate qualifications such as NVQ, Diploma and BTEC where recorded.
  4. In web questionnaires, add reversed versions of questionnaire items where needed.
  5. Add scales for questionnaire measures within web studies.
  6. For Navigation missions, derive accuracy scores, speed scores and total scores based on the outcomes of component tasks and attempts.
  7. For web activities generally, derive the duration of each activity from the start and end times, and derive the overall duration of the battery if completed.
  8. In web tests, derive mean or median item response times where these are potentially useful.
  9. In the FFMP data, convert all recorded heights and weights to metric units, and convert string responses for 'other allergies' into a coded numeric variable.
  10. Derive variables for individual twin ages when the 18 year questionnaires were returned, and when each web study battery was started.
  11. Drop any temporary variables that have been used to derive the new variables. Date and date-time variables are dropped at this point, having been used to derive ages.
  12. Save a working SPSS data file ready for the next script.
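The age derivation in task 10 above combines a twin's birth date with an event date (questionnaire return, or web battery start). A common way to express such an age in decimal years is sketched below; the exact formula used in the TEDS scripts may differ:

```python
from datetime import date

def age_in_years(birth_date, event_date):
    """Age in decimal years at an event, using the mean year length 365.25."""
    return round((event_date - birth_date).days / 365.25, 2)

# Illustrative dates, not drawn from the real data.
print(age_in_years(date(1996, 3, 15), date(2014, 6, 1)))  # 18.21
```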

Script 4: Label the variables

The purpose of this script (filename R4_label.sps) is simply to add variable labels and value labels to the variables in the dataset. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all variables.
  3. Add value labels to categorical variables, where appropriate (generally, for any numeric categorical variable having 3 or more categories).
  4. Save a working SPSS data file ready for the next script.

Script 5: Double entering the data

The purpose of this script (filename R5_double.sps) is to double-enter all the twin-specific data in the dataset. Note that a few variables (twin-specific variables from family-level data and twin data flags) are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all items from the twin questionnaire and from the twin web studies, plus any twin-specific derived variables created in the previous script. The script carries out the following tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 1 part: rename all twin-specific item and derived variables by adding 1 to the end of the name, then save the dataset.
  3. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the appropriate item and derived variables by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  4. Re-open the twin 1 part.
  5. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  6. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  7. Save an SPSS data file (filename r5double in the \working files\ subdirectory).
  8. Save another copy as the full 18 Year dataset, with filename Rdb9456_full.
  9. Because this file is extremely large and unwieldy, save also a reduced version (filename Rdb9456_reduced), dropping the very numerous item variables from the twin web tests.
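The rename-and-merge double entry in tasks 2 to 5 above can be sketched in Python. Each twin's row gains a suffix-1 copy of its own variables and a suffix-2 copy of the co-twin's variables, matched by flipping the final digit of id_twin; the variable name and ID values here are illustrative:

```python
def double_enter(rows, item_vars):
    """Sketch of the rename-and-merge double entry in script 5.
    `rows` maps id_twin -> dict of twin-specific variables."""
    def cotwin(tid):
        # Change the final digit from 1 to 2 or vice versa.
        return tid + 1 if tid % 10 == 1 else tid - 1
    out = {}
    for tid, row in rows.items():
        merged = {v + "1": row[v] for v in item_vars}                      # twin 1 part
        merged.update({v + "2": rows[cotwin(tid)][v] for v in item_vars})  # twin 2 part
        out[tid] = merged
    return out

rows = {90011: {"ucas": 320}, 90012: {"ucas": 280}}
print(double_enter(rows, ["ucas"]))
# {90011: {'ucas1': 320, 'ucas2': 280}, 90012: {'ucas1': 280, 'ucas2': 320}}
```

In SPSS this is achieved not row by row but by saving renamed copies of the whole file and merging them back together on id_twin, as the script tasks describe.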