TEDS Data Dictionary

Processing the 14 Year Data

Contents of this page:

  Introduction
  Exporting raw data
  Preparing raw web data
  Processing by scripts

Introduction

This page describes how the 14 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection and data entry are taken for granted here.) There are three main sources of data for the 14 Year analysis dataset:

  1. Web test data. These are stored in .csv text files, one file per test/activity plus an overall family status file.
  2. Booklet/questionnaire data and study admin data. These are stored in tables in the Access database file called 14yr.accdb, and are exported into csv files for creation of the dataset (see exporting raw data below).
  3. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database and re-importing them every time a dataset is created, they are maintained in a separate reference dataset containing all the background variables; that reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes, which are described in more detail on this page. Firstly, the raw data must be converted into a convenient form that can be used by SPSS in building the dataset; this involves "exporting" data from databases into csv files, and for web data it involves a few additional steps. Secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 14 year data files are described in more detail on another page.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The booklet/questionnaire data and the administrative data, stored in the Access 14yr.accdb database file, are subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets; more recently, they have occasionally been caused by data cleaning or restructuring. If such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts. The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:

Query name        | Source of data              | Database table(s) involved | Exported file name
------------------|-----------------------------|----------------------------|----------------------
Export Parent     | parent booklets             | Parent                     | Parent.csv
Export SLQ        | SLQ questionnaire           | SLQdata                    | SLQ.csv
Export Teacher    | teacher questionnaires      | Teacher                    | Teacher.csv
Export Child      | twin booklets               | Child                      | TwinQuestionnaire.csv
Export 14yr admin | 14 year booklet admin data  | yr14Progress               | 14yrAdmin.csv

A convenient way of exporting these data files is to run the saved macro called export data in the 14yr.accdb Access database. See the data files summary page and the 14 Year data files page for further information about the storage of these files.
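
For illustration, here is a minimal SPSS sketch of how one of these exported files might be read; apart from FamilyID and the exported file name, the column names and formats below are hypothetical. A date column exported in dd.mm.yyyy format can be read with the EDATE10 input format, and an exported true/false column arrives as ready-made 1/0 values:

  * Sketch: import an exported csv file (hypothetical variable names).
  GET DATA
    /TYPE=TXT
    /FILE='14yrAdmin.csv'
    /DELIMITERS=","
    /QUALIFIER='"'
    /FIRSTCASE=2
    /VARIABLES=
      FamilyID F7.0
      ReturnDate EDATE10
      BookletReturned F1.0.
  EXECUTE.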

Preparing raw web data

Data from the twin web tests were stored on the web server during the course of each wave of the study (waves 1 and 2). For each test, the data for all twins that had completed the test were collected into an "analysis file" that was downloaded from the web server. These analysis files, one per web activity per wave, were the original twin web data files. There were 6 such files, 3 per wave (for the 3 activities Science, Vocabulary and Ravens Matrices). Additionally, there was one family-based file per wave: the "family status" file contains data describing the dates and status of the various web activities, including parent consent.

Subsequently, for each of these data files, the two wave files were aggregated together, so that there is now a single file for each web activity and a single family status file. Hence there are 4 aggregated web data files in total (3 activity files plus the family status file). These files contain too many variables to be conveniently imported into Access database tables alongside the admin and questionnaire data; instead, they are stored separately as csv text files. When the original files were aggregated in this way, identifying fields other than IDs (e.g. names) were removed.

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename N1_merge.sps) is to merge various raw data files together, so as to create a basic dataset with one row of data per twin. The script also carries out some basic processing of the raw data, such as recoding and renaming the raw item variables, setting variable formats, and creating some basic derived variables including scrambled IDs. The script carries out these tasks in order:

  1. There are 4 files of family-based raw data: the parent booklet data, the parent SLQ questionnaire data, the web family status data (containing parent consent dates and twin test status variables), and the 14 year general admin data (containing booklet return dates and other details). These raw data files all start in csv format. For each of these 4 files in turn, carry out the following actions (the first sketch after this list illustrates these actions):
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary. In the SLQ data, this includes the recoding of strings (language names) into numeric codes.
    6. Add reverse-coded versions of item variables where needed.
    7. Drop raw data variables that are not to be retained in the datasets.
    8. Save as an SPSS data file.
  2. Merge the 4 files of family-based data together using FamilyID as the key variable.
  3. Double enter twin-specific items in the family-based data as follows (see the second sketch after this list):
    1. Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
  4. There are 6 files of twin-based raw data: the admin data file containing twin IDs and birth orders, the file of twin booklet data, the file of teacher questionnaire data, and the 3 files of twin web activity data (for Science, Ravens and Vocabulary). These raw data files start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of twin identifier TwinID
    3. For paper questionnaire data files, recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. In the Science web test data file, wherever possible recode raw string response variables into equivalent numeric response variables.
    5. Add reverse-coded versions of item variables where needed.
    6. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    7. Save as an SPSS data file.
  5. Using TwinID as the key variable, merge together all 6 of the twin data files described above.
  6. Derive a flag variable ncwdata1 to show which twins have some web data, from at least one web test.
  7. Double enter the main twin and teacher booklet data flags, as follows:
    1. Compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Sort in ascending order of atempid2 and save this file as the twin 1 part.
    3. Change the flag variable names by changing the ending from 1 to 2. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the renamed variables as the twin 2 part.
    4. Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data, and the twin sexes can be used to derive sex/zygosity variables in a subsequent script.
  8. Merge this twin data file with the double entered parent data file, using atempid2 as the key variable. This dataset now contains all the raw data.
  9. Use the parent booklet data flag, the SLQ data flag, the parent web consent flag, and the double entered twin and teacher data flags, to filter the dataset and delete any twin pairs in which neither twin has any 14 Year data. Add the overall 14 Year data flag variable n14year.
  10. Recode all data flag variables from missing to 0.
  11. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  12. Sort in ascending order of scrambled twin ID id_twin.
  13. Save and drop the raw ID variables.
  14. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, autism, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  15. Use variable n14year to filter the dataset and delete cases added from the reference dataset that do not have 14 Year data.
  16. Save a working SPSS data file ready for the next script (filename n1merge in the \working files\ subdirectory).
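
As a concrete illustration of the per-file actions in step 1, the following sketch processes a hypothetical pair of parent booklet items (pitem01/pitem02, one renamed to npbhap1) and assumes a 1-3 response scale; the real script applies the same operations across the full variable lists:

  * Sketch of step 1, using hypothetical variable names.
  GET DATA /TYPE=TXT /FILE='Parent.csv' /DELIMITERS=","
    /QUALIFIER='"' /FIRSTCASE=2
    /VARIABLES=FamilyID F7.0 pitem01 F3.0 pitem02 F3.0.
  SORT CASES BY FamilyID (A).
  * Recode the default missing-value codes to system missing.
  RECODE pitem01 pitem02 (-99, -77 = SYSMIS).
  * Rename, then set the display format and measurement level.
  RENAME VARIABLES (pitem01 = npbhap1).
  FORMATS npbhap1 (F1.0).
  VARIABLE LEVEL npbhap1 (ORDINAL).
  * Add a reverse-coded version, assuming a 1-3 response scale.
  COMPUTE npbhap1r = 4 - npbhap1.
  * Drop unwanted raw variables and save as an SPSS data file.
  SAVE OUTFILE='parent.sav' /DROP=pitem02.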
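The double entry in step 3 can be sketched as follows, again with hypothetical names (nptwin1/nptwin2 standing for any twin-specific variable pair, and illustrative file names); the way the Random variable is generated here is illustrative only:

  * Sketch of step 3: double entering the family-based data.
  GET FILE='family.sav'.
  * Elder twin part: twin identifier ends in 1.
  COMPUTE atempid2 = (FamilyID * 10) + 1.
  COMPUTE Random = RV.BERNOULLI(0.5).
  SAVE OUTFILE='elder.sav'.
  * Younger twin part: identifier ends in 2, Random reversed.
  COMPUTE atempid2 = (FamilyID * 10) + 2.
  COMPUTE Random = 1 - Random.
  * Swap twin-specific values over by renaming the variable pair.
  RENAME VARIABLES (nptwin1 nptwin2 = nptwin2 nptwin1).
  SAVE OUTFILE='younger.sav'.
  * Combine the two parts by adding cases, then sort and save.
  ADD FILES /FILE='elder.sav' /FILE='younger.sav'.
  SORT CASES BY atempid2 (A).
  SAVE OUTFILE='familydouble.sav'.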

Script 2: Recoding web data

The purpose of this script (filename N2_recode.sps) is to make the web item data easier to use, particularly by recoding missing item responses and item scores, by identifying and recoding anomalies, and by identifying and excluding probable random responders. The recoding procedures are designed to be consistent with those used in the other web studies. The tasks carried out in this script are necessary for some of the derivations of new variables in the next script. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive variables to measure the variability in a twin's responses, in each web test that has a uniform response format across all items (Vocabulary and Ravens). Although these are temporary variables, designed for use in the next script, they must be computed here, before the item response recoding that follows.
  3. For each of the web tests, where appropriate, carry out the following recoding stages, illustrated in the sketch after this list. (The coding is generally numeric, but for a few Science test items the responses are coded in strings.)
    1. All three tests have discontinue rules, so determine the point at which each twin discontinued (if at all). Hence identify items that were skipped due to the discontinue rule. For these items, recode item responses from missing to -2.
    2. All three tests have item timeout rules, although no timeouts were observed in practice in the Vocabulary test. Identify the items that were timed out. For these items, recode item responses to -1.
    3. In all tests, assume that any remaining items with missing item responses and/or scores are crashed/interrupted/malfunctioned items. Identify such items, and recode item responses from missing to -4.
    4. For all items identified as discontinued, timed out or crashed, recode the scores to zero and the response times to missing where necessary.
  4. For each web test, where appropriate, carry out the following additional steps to identify and deal with compromised tests:
    1. Identify any cases that have been affected by a high proportion of item crashes and/or item timeouts (hence a relatively low proportion of items with meaningful answers).
    2. In the Science test, identify cases where item crashes prevented the discontinue rule from working successfully, resulting in a significant increase in test score.
    Flag such cases of compromised tests by recoding the status variable from 2 to 3. Identification of such test instances is described in more detail in the web data cleaning page.
  5. Recompute the total score for each test, to adjust for the small number of cases where item scores were recoded from 1 to 0 (due to a faulty discontinue rule).
  6. Recode anomalous very large item times to missing, so they do not distort the means. Then derive the mean item answer time for each web test.
  7. Derive temporary variables, for each web test, to measure the proportion of attempted items that were answered with very fast response times. These variables are used in the following step.
  8. In each web test, identify twins who appear to have answered randomly or without effort. The identification procedures are described in more detail in the web data cleaning page. Where identified, mark such instances of twin tests as exclusions by changing the test status variable value from 2 to 4.
  9. In each test, where the status flag has been recoded to 3 (compromised test) or 4 (random responder), exclude the test data by recoding the test data flag from 1 to 0 and all test items and scores to missing.
  10. Re-compute the twin web data flag ncwdata1 to take account of these new exclusions.
  11. Save a working SPSS data file ready for the next script (filename n2recode in the \working files\ subdirectory).
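
The recoding of skipped and crashed items in step 3 might look like the following sketch for the Ravens test, assuming hypothetical variable names: nrvans1 to nrvans30 for item responses, nrvsco1 to nrvsco30 for item scores, and ndisc for the item number at which the twin discontinued (0 if the twin never discontinued). Timed-out items would be recoded to -1 in the same way:

  * Sketch of the step 3 recoding, hypothetical Ravens variable names.
  DO REPEAT resp = nrvans1 TO nrvans30
         /scor = nrvsco1 TO nrvsco30
         /i = 1 TO 30.
    DO IF (ndisc > 0 AND i > ndisc AND MISSING(resp)).
      COMPUTE resp = -2.
      COMPUTE scor = 0.
    END IF.
  END REPEAT.
  * Treat any remaining missing responses as crashed items.
  RECODE nrvans1 TO nrvans30 (SYSMIS = -4).
  EXECUTE.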

Script 3: Creating derived variables

The purpose of this script (filename N3_derive.sps) is to compute derived variables, including scales and composites and twin ages. See derived 14 Year variables for full details. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive variables for individual twin ages when the web tests were started, when the various booklets/questionnaires were returned, and when the telephone language tests were reported as completed (note that the phone language test data are not in the dataset).
  3. All three web tests had discontinue rules, and in many or all of the items there was a significant chance of guessing the correct answer. Hence derive "adjusted" test scores: for any item skipped due to a discontinue rule, the default score of 0 is replaced by the "chance" score that would be obtained, on average, by selecting an answer at random (for example, 0.25 for an item with four response options). The adjusted total score is then the sum of the adjusted item scores.
  4. Derive the standardized cognitive (g) composite as follows (sketched after this list):
    1. Apply a filter (nexclude=0) to remove exclusions
    2. Standardise the necessary component scores
    3. Compute the mean of the appropriate standardised scores
    4. Standardise the mean, to make the final version of each composite
    5. Remove the filter
  5. Derive behaviour/environment scales (parent, teacher and child versions where appropriate) for the following measures: APSD, Conners, AQ, Victimisation, Puberty, Chaos, Parental Feelings and Discipline. In each case, the scale is derived as a mean of the relevant items, requiring at least half the items to be non-missing (also illustrated in the sketch after this list).
  6. Derive twin BMI from heights and weights reported in the twin booklet. Recode extreme outliers to missing in heights, weights and BMI.
  7. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  8. Save a working SPSS data file ready for the next script (filename n3derive).
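
Below is a hedged sketch of the composite in step 4 and the scale idiom in step 5, with hypothetical variable names (nrvtot, nvctot and nsctot for the three web test scores; napsd01 to napsd20 for a 20-item scale); the minimum-valid threshold of 2 for the composite is illustrative. DESCRIPTIVES with /SAVE creates Z-prefixed standardised versions of the named variables, and the MEAN.n function returns a mean only when at least n of its arguments are non-missing:

  * Sketch of steps 4 and 5, using hypothetical variable names.
  * Filter out exclusions, standardise, average, re-standardise.
  COMPUTE filter_ok = (nexclude = 0).
  FILTER BY filter_ok.
  DESCRIPTIVES VARIABLES=nrvtot nvctot nsctot /SAVE.
  COMPUTE ncgmean = MEAN.2(Znrvtot, Znvctot, Znsctot).
  DESCRIPTIVES VARIABLES=ncgmean /SAVE.
  RENAME VARIABLES (Zncgmean = ncg1).
  FILTER OFF.
  * A scale as the item mean, requiring at least half (10 of 20) items.
  COMPUTE napsdt1 = MEAN.10(napsd01 TO napsd20).
  EXECUTE.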

Script 4: Labelling the variables

The purpose of this script (filename N4_label.sps) is to label all the variables in the dataset, and to add value labels for categorical variables. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label variables.
  3. Add value labels to all integer-valued categorical variables having 3 or more categories (see the sketch after this list).
  4. Save a working SPSS data file ready for the next script (filename n4label).
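
A minimal sketch of steps 2 and 3, with hypothetical variable names and label text:

  * Sketch of steps 2 and 3, hypothetical names and labels.
  VARIABLE LABELS
    npbhap1 '14 Year parent booklet: example item'
    ncg1 '14 Year web tests: cognitive (g) composite'.
  VALUE LABELS npbhap1
    1 'Not true'
    2 'Somewhat true'
    3 'Very true'.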

Script 5: Double entering the data

The purpose of this script (filename N5_double.sps) is to double-enter all the twin-specific data in the dataset. Note that twin-specific item variables from the parent/admin data are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all items from the twin web tests, twin booklets and teacher questionnaires. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 2 part (for the co-twin) as follows (sketched after this list):
    1. Rename the appropriate item and derived variables (from the twin web tests, twin booklets and teacher questionnaires) by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  3. Re-open the data file saved at the end of the previous script: this already serves as the twin 1 part of the dataset.
  4. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  5. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  6. Save an SPSS data file (filename n5double in the \working files\ subdirectory).
  7. Save another copy as the full 14 Year dataset, with filename Ndb9456.
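
A minimal sketch of this double entry, assuming a single web test variable ncrvtot1 and illustrative file names:

  * Sketch of the co-twin double entry, hypothetical names.
  GET FILE='n4label.sav'.
  * Rename twin-specific variables so the suffix refers to the co-twin.
  RENAME VARIABLES (ncrvtot1 = ncrvtot2).
  * Flip the final digit of id_twin (1 becomes 2 and vice versa).
  COMPUTE id_twin = TRUNC(id_twin / 10) * 10 + (3 - MOD(id_twin, 10)).
  SORT CASES BY id_twin (A).
  SAVE OUTFILE='twin2part.sav' /KEEP=id_twin ncrvtot2.
  * Merge the co-twin variables back onto the twin 1 part.
  GET FILE='n4label.sav'.
  MATCH FILES /FILE=* /FILE='twin2part.sav' /BY id_twin.
  SAVE OUTFILE='n5double.sav'.

Because every row contributes a renamed copy keyed to its co-twin's id_twin, the merge attaches each twin's values to the co-twin's row, completing the double entry.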