TEDS Data Dictionary

Processing the 16 Year Data

Introduction

This page describes how the 16 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are several main sources of data for the 16 Year analysis dataset:

  1. Twin web test data. These are stored in .csv text files, one file per test/activity.
  2. The Access database file called 16yr.accdb. This database stores the following:
    1. Parent and twin Behaviour/LEAP Study booklet data
    2. Parent and twin LEAP-2 Study booklet data
    3. Twin examination results data
    4. Study admin data
    5. Web status (admin) data, including the brief parent web SES questionnaire.
  3. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database then importing them in the creation of every dataset, this is done separately in a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" from the Access database into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 16 year data files are described in more detail on another page.

Exporting raw questionnaire and admin data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The Behaviour/LEAP/LEAP-2 Study booklet data, the twin exam results data, and the administrative data, stored in the Access 16yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets; more recently they have occasionally been caused by data cleaning or restructuring. If such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts. The data stored in the database tables are in some cases exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. A query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries and tables used to export the data are as follows:

  1. Parent Behaviour/LEAP booklets: exported directly from tables Parent1 and Parent2 (no query used), into files Parent1.csv and Parent2.csv.
  2. Parent LEAP-2 booklets: exported directly from tables Leap2Parent1 and Leap2Parent2 (no query used), into files Leap2Parent1.csv and Leap2Parent2.csv.
  3. Twin Behaviour/LEAP booklets: from tables Child1, Child2 and Child3, into files Child1.csv, Child2.csv and Child3.csv. The first and third are exported directly from the tables; the second is exported via the query Export Child2.
  4. Twin LEAP-2 booklets: exported directly from tables Leap2Child1 and Leap2Child2 (no query used), into files Leap2Child1.csv and Leap2Child2.csv.
  5. 16 Year Study admin data (return dates, etc): via query Export 16yr Admin, from table yr16Progress, into file 16yrAdmin.csv.
  6. Twin GCSE/exam results questionnaire: via queries Export GCSE results and Export Other results, from tables GCSEresults and OtherResults, into files GCSEresults.csv and OtherExamResults.csv.
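The value-format conversions performed by the export queries can be sketched as follows. This is an illustrative Python sketch only (the real conversions happen inside the saved Access queries); the column names in the example row are hypothetical.

```python
from datetime import date

def to_export_value(value):
    """Convert a raw value to the SPSS-friendly format used in the
    exported csv files: true/false -> 1/0, dates -> dd.mm.yyyy."""
    if isinstance(value, bool):          # true/false columns
        return 1 if value else 0
    if isinstance(value, date):          # date columns
        return value.strftime("%d.%m.%Y")
    return value                         # other columns pass through

# Hypothetical admin row with a true/false flag and a return date
row = {"Returned": True, "ReturnDate": date(2011, 5, 9), "FamilyID": 1234}
exported = {k: to_export_value(v) for k, v in row.items()}
# exported == {"Returned": 1, "ReturnDate": "09.05.2011", "FamilyID": 1234}
```

The dd.mm.yyyy and 1/0 targets are taken from the description above; everything else in the sketch is illustrative.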

A convenient way of exporting these data files is to run a macro that has been saved for this purpose in the 16yr.accdb Access database. See the data files summary page and the 16 Year data files page for further information about the storage of these files.

Preparing raw web data

Data from the twin web tests were stored on the web server during the course of each cohort of the study (cohorts 1 and 2). For each test, the data for all twins who had completed the test were collected into an "analysis file" that was downloaded from the web server. These analysis files, one per web activity per cohort, were the original twin web data files. There were 32 such files, 16 per cohort (for the 16 activities: Ravens Matrices, Mill Hill Vocabulary, Reading Fluency, Passages, Figurative Language, Understanding Number, Number Sense, Dot Number, Number Line, PVT, Corsi Block, Reaction Times, and Environment and Wellbeing Questionnaires parts A, B, C and D). Additionally, there was one family-based file per cohort, containing the data from the brief parent web SES questionnaire (administered directly after consent) along with variables describing the status of the various web activities.

After the data collection had ended, for each of the 16 twin activities, the two cohort files were aggregated, so that there is now a single file for each web activity. Furthermore, the 4 twin web questionnaire files (questionnaires A, B, C, D) have been merged into a single file - this is convenient because there are relatively few variables from each questionnaire. Hence, there are now 13 files of raw twin web data. These twin web files contain too many variables to be conveniently imported into Access database tables alongside the admin and questionnaire data. Instead, they are stored separately as csv text files. When the original files were aggregated in this way, identifying fields other than IDs (e.g. names) were removed.
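The aggregation step described above (concatenating the two per-cohort files for an activity and dropping identifying fields) can be sketched like this. The file contents and column names here are hypothetical; the real files have many more columns.

```python
import csv
import io

# Hypothetical cohort files for one web activity, as csv text
cohort1 = "TwinID,Name,Item1\n10001,Alice,3\n"
cohort2 = "TwinID,Name,Item1\n20002,Bob,5\n"

DROP = {"Name"}  # identifying fields other than IDs are removed

def aggregate(*cohort_texts):
    """Concatenate the rows of the per-cohort files, dropping
    identifying columns, and sort by twin ID."""
    rows = []
    for text in cohort_texts:
        for row in csv.DictReader(io.StringIO(text)):
            rows.append({k: v for k, v in row.items() if k not in DROP})
    return sorted(rows, key=lambda r: r["TwinID"])

combined = aggregate(cohort1, cohort2)
# combined == [{'TwinID': '10001', 'Item1': '3'}, {'TwinID': '20002', 'Item1': '5'}]
```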

There was also a "family status" file downloaded from the server for each cohort. The family status file contained parent web data (consent, brief SES questionnaire) plus a status variable for each twin in each of the web activities. These original family status files have now been aggregated and stored within the Access database, as indicated above.

An additional processing stage was required for the Number Sense web test, in order to generate the Weber fraction scores. The input data for this processing was the analysis file for the Number Sense test. The output data was a csv text file containing the Weber fractions (one file per cohort). The processing involved firstly a Perl script, which copied and re-structured the test data into further text files; then these files were processed by an R script that computed the Weber fractions. Details of the mechanism of this processing have not been retained. The Weber fraction variables have now been merged with the item variables in the Number Sense data file, hence all original data from this test are now stored in the single web test file, which is one of the 13 mentioned above.

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename P1_merge.sps) is to merge various raw data files together, so as to create an initial dataset with one row of data per twin. The script also carries out some basic processing of the raw data, such as recoding and renaming the raw item variables, setting variable formats, and creating some basic derived variables including scrambled IDs. The script carries out these tasks in order:

  1. There are 6 files of family-based raw data: the parent Leap/Behaviour booklet data (2 files), the parent Leap-2 booklet data (2 files), the web family status data (containing parent consent dates and twin test status variables), and the 16 year general admin data (containing booklet return dates and other details). These raw data files all start in csv format, except for the web family status file which has already been aggregated into an SPSS file. For each of these 6 files in turn, carry out the following actions:
    1. Import the csv file into SPSS. This step involves assigning each variable a name and setting the displayed variable width and number of decimal places.
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary.
    6. Add reversed-coded versions of item variables where these will be needed for scales.
    7. Where appropriate, set coded values for responses such as "don't know" as missing values in SPSS (unlike system-missing values, these missing values are retained as distinct codes in the data but are ignored for computations such as correlations)
    8. Drop raw data variables that are not to be retained in the datasets.
    9. Save as an SPSS data file.
  2. Merge the 6 files of family-based data together using FamilyID as the key variable.
  3. Double enter twin-specific items in the family-based data as follows:
    1. Compute twin identifier ptempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute ptempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of ptempid2 and save as the double entered family data file.
  4. There are 21 files of twin-based raw data: twin Behaviour/Leap booklet data (3 files), twin Leap-2 booklet data (2 files), twin GCSE/exam results data (2 files), a file for each of the twin web activities (13 files), plus the admin data file containing twin IDs and birth orders. These raw data files start in csv format. For each of these files in turn, carry out the following actions:
    1. Import the csv file into SPSS. This step involves assigning each variable a name and setting the displayed variable width and number of decimal places.
    2. Sort in ascending order of twin identifier TwinID
    3. Where applicable, recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. Carry out basic recoding of categorical variables where necessary.
    5. Add reverse-coded versions of item variables where these will be needed for scales.
    6. Where appropriate, set coded values for responses such as "don't know" as missing values in SPSS (unlike system-missing values, these missing values are retained as distinct codes in the data but are ignored for computations such as correlations)
    7. For each variable, set the SPSS variable level (nominal/ordinal/scale)
    8. Save as an SPSS data file.
  5. Using TwinID as the key variable, merge together all 21 of the twin data files described above.
  6. Double enter the main twin data flags, as follows:
    1. Compute the alternative twin identifier ptempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Change the relevant variable names (twin sex, twin data flags) by adding "1" to the end of the name.
    3. Sort in ascending order of ptempid2 and save this file as the twin 1 part.
    4. Change the flag variable names by changing the ending from 1 to 2. Change the values of ptempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of ptempid2 and save with just the renamed variables as the twin 2 part.
    5. Merge the twin 1 and twin 2 parts using ptempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data, and the twin sexes can be used to derive sex/zygosity variables in a subsequent script.
  7. Merge this twin data file with the double entered parent data file, using ptempid2 as the key variable. This dataset now contains all the raw data.
  8. Use the appropriate data flag variables (parent web consent, parent booklets, and double-entered twin data flags), to filter the dataset and delete any twin pairs in which neither twin has any 16 Year data. Add the overall 16 Year data flag variable p16year.
  9. Recode all data flag variables from missing to 0.
  10. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  11. Sort in ascending order of scrambled twin ID id_twin.
  12. Save and drop the raw ID variables.
  13. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, autism, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  14. Use variable p16year to filter the dataset and delete cases added from the reference dataset that do not have 16 Year data.
  15. Save a working SPSS data file ready for the next script (filename p1merge in the \working files\ subdirectory).
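The double-entry of family-based data (step 3 above) can be sketched as follows. The real work is done in SPSS syntax; this Python sketch only illustrates the logic, and the twin-specific variable names (sdqcon1/sdqcon2) are hypothetical. It assumes the Random variable is coded 0/1.

```python
def double_enter_family(row):
    """Expand one family-based row into elder- and younger-twin rows
    (a minimal sketch of script 1, step 3)."""
    elder = dict(row)
    elder["ptempid2"] = int(str(row["FamilyID"]) + "1")   # append 1 for elder twin
    younger = dict(row)
    younger["ptempid2"] = int(str(row["FamilyID"]) + "2") # append 2 for younger twin
    younger["Random"] = 1 - row["Random"]                 # reverse Random (assumed 0/1)
    # swap elder/younger values in twin-specific variables
    younger["sdqcon1"], younger["sdqcon2"] = row["sdqcon2"], row["sdqcon1"]
    return [elder, younger]

# Hypothetical family row with one twin-specific item pair
family = {"FamilyID": 1234, "Random": 0, "sdqcon1": 3, "sdqcon2": 5}
rows = double_enter_family(family)
# rows[0]["ptempid2"] == 12341 and rows[1]["ptempid2"] == 12342
```

In the real script the two parts are saved as separate files, combined by adding cases, then sorted on ptempid2.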

Script 2: Cleaning the raw data

The purpose of this script (filename P2_clean.sps) is to attempt to detect and clean up anomalies in the raw data. This generally involves recoding of variables to correct anomalies where found, or in some cases recoding to missing in order to delete data found to be invalid or of dubious quality. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. In all web tests where item timeouts can occur, recode item response variable values to -1 for each timed out item. (In some tests, item timeouts were already coded with different values, while in other tests item timeouts can be detected from item answer times).
  3. In each test item where a timeout has been detected, recode the item answer time to missing. This enables a more meaningful mean answer time to be computed later for each test.
  4. In each web test where item timeouts can occur, compute (in temporary variables) the number of timed out items. This will be used later in the detection of tests with multiple timeouts.
  5. Compute the mean item answer time (seconds) for each web test in which such times were recorded.
  6. Compute the total time taken, in minutes, for each twin web activity. For the web environment and wellbeing questionnaires, time was not directly measured but can be derived from the start and end times.
  7. For each twin web activity having uniform item response formats, compute (in temporary variables) the amount of variability in the twin responses. This will be used later in the detection of random responders.
  8. Identify and flag tests that have apparently crashed (too few items answered) or that have had excessive numbers of item timeouts. Where detected, delete the test data, and change the value of the test status variable.
  9. Identify and flag twin activities where the responses are apparently random. Various indicators were used to identify these: any combination of very short mean answer times, low test scores, and little variability in item responses. Different criteria were used for each activity, by identifying extreme outliers among these indicators. Where random responders are detected, delete the test data, and change the value of the test status variable.
  10. In web tests using branching and/or discontinue rules, recode item response and item score variables for any items skipped due to these rules. Discontinued items are recoded to -2 in the response variables and 0 in the score variables. Items skipped due to upward branching are recoded to -3 in the response variables, and have item scores of 1.
  11. Any remaining missing item answers and scores in the cognitive web tests are now assumed to be the result of discrete item crashes; in these cases, the missing item responses are recoded to -4, and the missing item scores are recoded to 0.
  12. In each test, where the status flag has been recoded to 3 (compromised test) or 4 (random responder), exclude the test data by recoding the test data flag from 1 to 0 and all test items and scores to missing.
  13. Derive the twin web data flag pcwebdata1 based on the existence (or not) of any meaningful web data after making the exclusions above.
  14. Clean item responses in the puberty measure of web environment and wellbeing questionnaire part D as follows. In the menstruation date item, delete values if only the month, and not the year, was recorded. In the self-reported twin sex item, check for discrepancies with the twin sex recorded in the admin data; where these occur, delete values in all the gender-specific items of the puberty measure.
  15. In the Behaviour/LEAP and LEAP-2 Study booklet data, there are measures having items with yes/no responses, followed by further items that only apply if the initial response was yes. Check these sets of items for discrepancies between the response in the first item and the responses in the follow-up item(s). Where discrepancies occur, use recoding to correct them; different recoding rules apply according to the nature of the questions.
  16. In the Facebook measure of the twin Behaviour/LEAP Study booklet, there are multiple items with a 'no account' response. These are aggregated into a single consistent 'no account' response in the first item, then subsequent 'no account' responses are deleted.
  17. Save a working SPSS data file ready for the next script (filename pclean).
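The timeout handling in steps 2-5 above can be sketched as follows. This Python sketch is illustrative only (the real recoding is SPSS syntax), and the 30-second limit is a hypothetical example of a per-item time limit.

```python
TIMEOUT_SECS = 30.0   # hypothetical per-item time limit for one test

def clean_timeouts(responses, times):
    """Recode timed-out items: response -> -1, answer time -> None
    (missing). Return the cleaned data, the number of timeouts, and
    the mean answer time of the remaining items."""
    out_resp, out_time, n_timeouts = [], [], 0
    for r, t in zip(responses, times):
        if t is not None and t >= TIMEOUT_SECS:   # timeout detected from answer time
            out_resp.append(-1)
            out_time.append(None)
            n_timeouts += 1
        else:
            out_resp.append(r)
            out_time.append(t)
    valid = [t for t in out_time if t is not None]
    mean_time = sum(valid) / len(valid) if valid else None
    return out_resp, out_time, n_timeouts, mean_time

# Hypothetical 3-item test where the second item timed out
resp, times, n_to, mean_t = clean_timeouts([2, 4, 1], [3.5, 30.0, 4.5])
# resp == [2, -1, 1], n_to == 1, mean_t == 4.0
```

Setting the timed-out answer time to missing before averaging is what makes the mean answer time meaningful, as noted in step 3.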

Script 3: Create derived variables

The purpose of this script (filename P3_derive.sps) is to compute derived variables, including scales and composites and twin ages. See derived 16 Year variables for full details. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive variables for individual twin ages when the web tests were carried out and when the various booklets/questionnaires were returned.
  3. Three of the web tests (Mill Hill Vocabulary, Ravens Matrices, Understanding Number) had discontinue rules; and in many or all of the items there was a significant chance of guessing the correct answer. Hence derive "adjusted" test scores: for any item skipped due to a discontinue rule, the default score of 0 is replaced by the "chance" score that would be obtained, on average, by selecting an answer at random. The adjusted total score is then the sum of the adjusted item scores.
  4. Derive standardized cognitive and SES composites as follows:
    1. Apply a filter (exclude1=0 & exclude2=0) to remove exclusions
    2. Standardise the necessary component items and scores, reversed where necessary for SES
    3. Compute the mean of the appropriate standardised items/scores
    4. Standardise the mean, to make the final version of each composite
    5. Remove the filter
  5. Derive twin environment and wellbeing scales from the twin web questionnaire measures. These measures include BMI, Puberty, Chaos, Attachment, Parental Monitoring, Personality, Academic Self-Concept, School Engagement, PISA school environment measures, Victimisation, Life Satisfaction, Hopefulness, Gratitude, Curiosity, SHS, GRIT, Ambition and Optimism.
  6. Clean the heights, weights and BMIs by removing anomalous extreme outliers (by recoding extreme values to missing).
  7. Derive scales from the parent and twin Behaviour/Leap and Leap-2 booklet measures. These measures (behaviour, psychotic experiences, environment and others) include AQ, SDQ, ICUT, CASI, ARBQ, MFQ, SWAN, Conners, SANS, Paranoid Checklist, TEPS, CAPS, Grandiosity, Cognitive Disorganisation, Anhedonia, HCL, SHS, Life Satisfaction, EDDS, Insomnia and Sleep.
  8. Derive composites from the twin GCSE/exam result questionnaires. These composites include numbers of graded qualifications obtained, mean grades and total point scores. Some derive from GCSE results only, and some derive from GCSEs plus qualifications having equivalent grades.
  9. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  10. Save a working SPSS data file ready for the next script (filename p3derive).
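The "adjusted" score computation in step 3 above can be sketched as follows. This is an illustrative Python sketch; it assumes the coding from script 2 (discontinued items have response -2 and score 0), and the item count and number of answer options in the example are hypothetical.

```python
def adjusted_score(responses, scores, n_options):
    """Adjusted test score: items skipped by a discontinue rule
    (response coded -2, score 0) get the average 'chance' score
    1/n_options instead of 0."""
    total = 0.0
    for resp, score in zip(responses, scores):
        if resp == -2:                   # discontinued item
            total += 1.0 / n_options     # expected score from random guessing
        else:
            total += score
    return total

# Hypothetical 5-item test with 4 answer options, discontinued after item 3
total = adjusted_score([3, 1, 2, -2, -2], [1, 0, 1, 0, 0], n_options=4)
# total == 2.5  (2 correct + 2 * 0.25 chance credit)
```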

Scripts 4 and 5: Labelling the variables

The purpose of these two scripts (filenames P4_label_part1.sps, P5_label_part2.sps) is to label all the variables in the dataset, and to add value labels to integer-valued categorical variables having 3 or more categories. Because the number of variables in the dataset is very large, the labelling of the variables requires a very long script; hence this task has been split into two scripts for convenience. Each script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label variables.
  3. Add value labels for numeric categorical variables that have 3 or more categories.
  4. Save a working SPSS data file ready for the next script (filenames p4label1, p5label2).

Script 6: Double entering the data

The purpose of this script (filename P6_double.sps) is to double-enter all the twin-specific data in the dataset. Note that twin-specific item variables from the parent (behaviour/leap and leap-2) booklets and admin data are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all items from the twin behaviour/leap/leap-2 booklets, twin web tests and twin exam results; and all twin-specific derived variables added in script 3. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Rename all the twin item variables, so their names end in "1"; these will become the variables referring to the main twin (identified by id_twin in each row of data).
  3. Save this data file as the "left hand part".
  4. Modify the values of the id_twin variable, swapping values 1 and 2 in the final digit; this is so that the values will match the IDs of the co-twin.
  5. Sort in ascending order of the modified id_twin to allow merging.
  6. Rename all the twin item variables, so their names end in "2" instead of "1"; these will become the variables referring to the co-twin.
  7. Save this data file as the "right hand part", retaining only the renamed variables.
  8. Merge the "left" and "right" parts as saved above, using id_twin as the key variable. The dataset is now double entered.
  9. Delete any cases that do not have any 16 year data for at least one twin per pair, using the p16year variable as a filter.
  10. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  11. Save an SPSS data file (filename p6double in the \working files\ subdirectory).
  12. Save another copy as the full 16 Year dataset, with filename Pdb9456.
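The double-entry performed by this script can be sketched as follows. The actual processing is SPSS syntax (rename, re-sort, MATCH FILES); this Python sketch only illustrates the logic, and the twin variable name "score" and the example IDs are hypothetical.

```python
def cotwin_id(id_twin):
    """Swap 1 and 2 in the final digit of id_twin (step 4)."""
    s = str(id_twin)
    return int(s[:-1] + ("2" if s.endswith("1") else "1"))

def double_enter(rows, twin_vars):
    """Build the 'left' part (twin variables suffixed 1), the 'right'
    part (suffixed 2, keyed by the co-twin's ID), then merge on id_twin."""
    left = {r["id_twin"]: {f"{v}1": r[v] for v in twin_vars} for r in rows}
    right = {cotwin_id(r["id_twin"]): {f"{v}2": r[v] for v in twin_vars}
             for r in rows}
    merged = []
    for id_twin in sorted(left):
        merged.append({"id_twin": id_twin,
                       **left[id_twin],
                       **right.get(id_twin, {})})
    return merged

# Hypothetical twin pair with one twin-specific variable 'score'
rows = [{"id_twin": 50071, "score": 10}, {"id_twin": 50072, "score": 13}]
out = double_enter(rows, ["score"])
# out[0] == {"id_twin": 50071, "score1": 10, "score2": 13}
```

After the merge, each row carries the main twin's values in the "1" variables and the co-twin's values in the "2" variables, mirroring the result of merging the "left hand" and "right hand" parts.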