TEDS Data Dictionary

Processing the 9 Year Data

Introduction

This page describes how the 9 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are two main sources of data for the 9 Year analysis dataset:

  1. Booklet data, and study admin data. These are stored in tables in the Access database file called 9yr.accdb.
  2. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting these source variables from the admin database and importing them afresh each time a dataset is created, the export and import are done once, separately, to build a reference dataset containing all the background variables. The reference dataset is then used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described on the data processing summary page. The raw 9 Year data files are described in more detail on another page.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored into csv files that can be read into SPSS. The process of exporting raw data is described in general terms on the data processing summary page.

The booklet data and the administrative data, stored in the Access 9yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of booklets; more recently, they have occasionally been caused by data cleaning or data restructuring. If any such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts.

The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:

| Query name | Source of data | Database table(s) involved | Exported file name |
| --- | --- | --- | --- |
| Export Parent1, Export Parent2, Export Parent3 | parent booklets | Parent1, Parent2, Parent3 | Parent1.csv, Parent2.csv, Parent3.csv |
| Export ChildBooklet | twin booklets | ChildBooklet | ChildBooklet.csv |
| Export Teacher | teacher booklets | Teacher | Teacher.csv |
| Export 9yr admin | 9 year admin data | yr9Progress | 9yrAdmin.csv |

A convenient way of exporting these data files is to run a macro that has been saved for this purpose within the Access database. See the data files summary page and the 9 Year data files page for further information about the storage of these files.
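
The reformatting done by the export queries (selecting columns, writing dates as dd.mm.yyyy and true/false values as 1/0) can be illustrated with a minimal Python sketch. This is not the Access query itself, only an illustration of the same logic. The function names and the column names ReturnDate, Returned and Comments are hypothetical, and the example row is made up.

```python
import csv
import io
from datetime import date


def format_for_spss(value):
    """Convert one exported cell to plain text (hypothetical helper)."""
    if isinstance(value, bool):   # true/false columns become 1/0
        return "1" if value else "0"
    if isinstance(value, date):   # date columns become dd.mm.yyyy
        return value.strftime("%d.%m.%Y")
    if value is None:             # empty cells stay empty
        return ""
    return str(value)


def export_rows(rows, columns):
    """Write only the selected columns as csv text, formatting each value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([format_for_spss(row[c]) for c in columns])
    return buffer.getvalue()


# Made-up example row; Comments stands in for a verbatim text field,
# which is left out simply by not selecting it.
rows = [{"FamilyID": 101, "ReturnDate": date(2004, 3, 7), "Returned": True,
         "Comments": "free text that is not exported"}]
csv_text = export_rows(rows, ["FamilyID", "ReturnDate", "Returned"])
print(csv_text)
```

In this sketch the verbatim text column is excluded by omitting it from the column list, which mirrors the way each query selects only the appropriate columns.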

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename i1_merge.sps) is to merge various raw data files together, so as to create a basic dataset with one row of data per twin. This script also does some basic variable formatting and recoding, and double-enters the twin-specific items from the parent booklet. The script carries out these tasks in order:

  1. There are 4 files of family-based raw data: the 3 files of parent booklet data and the file of 9 year admin data such as return dates. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary.
    6. Drop raw data variables that are not to be retained in the datasets.
    7. Save as an SPSS data file.
  2. In the 3 parent booklet data files, in addition to the steps mentioned above, transform and derive further variables as follows:
    1. Add reversed versions of behaviour items, where needed.
    2. Convert raw twin-pair neither/elder/younger/both items to twin-specific yes/no items
  3. Merge the 4 files of family-based data together using FamilyID as the key variable.
  4. Double enter twin-specific items in the family-based data as follows:
    1. Compute twin identifier itempid2 for the elder twin by appending 1 to the FamilyID. Save as the elder twin part of the family data.
    2. Re-compute itempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of itempid2 and save as the double entered family data file.
  5. There are 3 files of twin-based raw data: the file of child booklet data, the file of teacher booklet data, plus the admin data file containing twin IDs and birth orders. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of twin identifier TwinID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary (and add reversed versions of behaviour variables where needed)
    6. In the file of twin IDs and birth orders, compute the alternative twin identifier itempid2 as the FamilyID followed by the twin order (1 or 2).
    7. In the file of twin booklet data, convert test item responses into test item scores by recoding.
    8. Drop raw data variables that are not to be retained in the datasets.
    9. Save as an SPSS data file.
  6. Merge the 3 twin data files together using TwinID as the key variable.
  7. Double enter the twin and teacher data flags, as follows:
    1. Sort in ascending order of itempid2 and save as the twin 1 part. (Note that by this stage the twin variables already have names ending in 1.)
    2. Change the flag variable names by changing the ending from 1 to 2. Change the values of itempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of itempid2 and save with just the double entered variables as the twin 2 part.
    3. Merge the twin 1 and twin 2 parts using itempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
  8. Merge this twin data file with the double entered parent data file, using itempid2 as the key variable. This dataset now contains all the raw data.
  9. Use the parent data flag, and the double entered twin and teacher data flags, to filter the dataset and delete any cases without any 9 Year data. Add the overall 9 Year data flag variable inineyr.
  10. Save a working SPSS data file ready for the next script (filename i1merge in the \working files\ subdirectory).
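
The double entry of the family-based data (step 4 above) can be sketched in Python as follows. The real script is written in SPSS syntax; this is only an illustration of the logic, under stated assumptions: Random is assumed to be coded 0/1, twin-specific variables are assumed to come in pairs whose names end in 1 (elder twin) and 2 (younger twin), and the variable stem "talk" and the function name are hypothetical.

```python
def double_enter_family(family_rows, twin_specific):
    """Return two rows per family, one from each twin's point of view.

    twin_specific lists variable stems that exist as <stem>1 (elder twin)
    and <stem>2 (younger twin) in each family row (assumed convention).
    """
    doubled = []
    for row in family_rows:
        # Elder twin part: itempid2 is the FamilyID followed by 1.
        elder = dict(row)
        elder["itempid2"] = int(f"{row['FamilyID']}1")
        doubled.append(elder)

        # Younger twin part: itempid2 is the FamilyID followed by 2.
        younger = dict(row)
        younger["itempid2"] = int(f"{row['FamilyID']}2")
        younger["Random"] = 1 - row["Random"]  # assumes Random is coded 0/1
        for stem in twin_specific:
            # Swap elder and younger values, so that the variable ending
            # in 1 always refers to the twin identified by itempid2.
            younger[stem + "1"] = row[stem + "2"]
            younger[stem + "2"] = row[stem + "1"]
        doubled.append(younger)

    # Combine both parts and sort in ascending order of itempid2.
    return sorted(doubled, key=lambda r: r["itempid2"])


# Made-up example: one family with a hypothetical twin-specific item "talk".
family = [{"FamilyID": 12, "Random": 1, "talk1": 3, "talk2": 5}]
doubled = double_enter_family(family, ["talk"])
for r in doubled:
    print(r["itempid2"], r["Random"], r["talk1"], r["talk2"])
```

Each family row yields two twin rows, and in the younger twin's row the paired values are swapped, which corresponds to the renaming of variables described in step 4.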

Script 2: Deriving new variables

The main purpose of this script (filename i2_derive.sps) is to derive new variables including scales, composites and twin ages. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  3. Sort in ascending order of scrambled twin ID id_twin.
  4. Save and drop the raw ID variables.
  5. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, autism, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  6. Use variable inineyr to filter the dataset and delete cases added from the reference dataset that do not have 9 Year data.
  7. Derive best estimates of the ages of the twins when the parent, twin and teacher data were completed, using the full range of available dates in the raw data.
  8. Derive combined parent-teacher behaviour items, in which the more extreme score from the parent and teacher is used. These are the variables with names starting with impsy (for some psychopathy-related items) and imsdq (for selected SDQ items).
  9. Compute various scales from the behaviour items, from the parent booklet, the twin booklet and the teacher booklet (and in some cases, from the combined parent-teacher items described above). The behaviour scales include SDQ, APSD, proactive and reactive aggression, CAST and motivational scales.
  10. Compute various scales from environment measures, from the parent booklet, the twin booklet and the teacher booklet. These environment scales include chaos, parental feelings and discipline, and CEQ.
  11. Derive a total score variable for each of the 4 tests in the twin booklet.
  12. Derive standardised cognitive and academic achievement composites as follows:
    1. Apply a filter (exclude1=0 & exclude2=0) to remove exclusions
    2. Standardise the necessary component items and scores
    3. Compute the mean of the appropriate standardised items/scores
    4. Standardise the mean, to make the final version of each composite
    5. Remove the filter
  13. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  14. Save a working SPSS data file ready for the next script (filename i2derive).
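
The derivation of a standardised composite (step 12 above) can be sketched in Python as follows. The real script is written in SPSS syntax; this is only an illustration, and several details are assumptions not stated on this page: excluded twins are given a missing composite, the sample standard deviation (n-1 denominator) is used, and the mean is taken over whichever component scores are present. The function names and the component names test1 and test2 are hypothetical.

```python
from statistics import mean, stdev


def standardise(values, included):
    """Z-score values using the mean and SD of included, non-missing cases.

    Excluded or missing cases are returned as None (an assumption).
    """
    sample = [v for v, keep in zip(values, included) if keep and v is not None]
    m, sd = mean(sample), stdev(sample)
    return [(v - m) / sd if keep and v is not None else None
            for v, keep in zip(values, included)]


def composite(components, exclude1, exclude2):
    """Standardise the components, average them, then standardise the mean."""
    # Filter: keep only cases where exclude1 = 0 and exclude2 = 0.
    included = [e1 == 0 and e2 == 0 for e1, e2 in zip(exclude1, exclude2)]
    z_scores = [standardise(c, included) for c in components]
    means = []
    for scores in zip(*z_scores):
        present = [s for s in scores if s is not None]
        means.append(mean(present) if present else None)
    return standardise(means, included)


# Made-up scores for four twins; the fourth twin is excluded.
test1 = [10, 12, 14, 50]
test2 = [3, 5, 7, 1]
result = composite([test1, test2], exclude1=[0, 0, 0, 1], exclude2=[0, 0, 0, 0])
print(result)
```

Because the filter is applied before standardising, the excluded twin's extreme score of 50 does not affect the mean and standard deviation used for the other twins.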

Script 3: Labelling the variables

The purpose of this script (filename i3_label.sps) is to label all the variables in the dataset, and to add value labels where appropriate. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all the variables.
  3. Add value labels to all integer-valued categorical variables having 3 or more categories.
  4. Save a working SPSS data file ready for the next script (filename i3label).

Script 4: Double entering the data

The purpose of this script (filename i4_double.sps) is to double-enter all the twin-specific data in the dataset. Note that twin-specific item variables from the parent booklet are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all item and derived variables from the twin and teacher booklets. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the appropriate item and derived variables (from the twin and teacher booklets) by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  3. Re-open the data file saved at the end of the previous script: this already serves as the twin 1 part of the dataset.
  4. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  5. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  6. Save an SPSS data file (filename i4double in the \working files\ subdirectory).
  7. Save another copy as the main 9 Year dataset, with filename idoub945.
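
The double entry carried out by this script (steps 2 to 4 above) can be sketched in Python as follows. The real script is written in SPSS syntax; this is only an illustration of the rename, re-key and merge logic. It assumes that id_twin is an integer whose final digit is 1 or 2; the function names and the variable name score1 are hypothetical.

```python
def flip_twin_id(id_twin):
    """Change the final digit of a twin ID from 1 to 2 or vice versa."""
    family, order = divmod(id_twin, 10)
    return family * 10 + (3 - order)


def double_enter(rows, twin_vars):
    """Add co-twin copies (names ending in 2) of variables ending in 1."""
    # Twin 2 part: only the renamed variables, keyed by the co-twin's ID.
    twin2_part = {}
    for row in rows:
        renamed = {name[:-1] + "2": row[name] for name in twin_vars}
        twin2_part[flip_twin_id(row["id_twin"])] = renamed

    # Twin 1 part is the original data; merge in the twin 2 part on id_twin.
    merged = []
    for row in sorted(rows, key=lambda r: r["id_twin"]):
        out = dict(row)
        co_twin = twin2_part.get(row["id_twin"], {})
        for name in twin_vars:
            new_name = name[:-1] + "2"
            out[new_name] = co_twin.get(new_name)  # None if no co-twin row
        merged.append(out)
    return merged


# Made-up example: one twin pair with a hypothetical variable score1.
rows = [{"id_twin": 5002, "score1": 9}, {"id_twin": 5001, "score1": 7}]
merged = double_enter(rows, ["score1"])
for r in merged:
    print(r["id_twin"], r["score1"], r["score2"])
```

After the merge, each twin's row holds their own value in the variable ending in 1 and the co-twin's value in the variable ending in 2.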