TEDS Data Dictionary

Processing the 2 Year Data

Contents of this page:

  Introduction
  Exporting raw data
  Processing by scripts

Introduction

This page describes how the 2 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are two main sources of data for the 2 Year analysis dataset:

  1. Booklet data collected in the study, and booklet return dates (from admin data). These are stored in tables in the Access database file called 2yr.accdb.
  2. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database then importing them in the creation of every dataset, this is done separately in a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 2 year data files are described in more detail on another page.

Note that this page describes the current version (post-2013) of the 2 Year dataset, constructed using SPSS scripts. The earlier versions were constructed using SAS scripts. Variable names and variable coding have been retained, although some redundant variables have been dropped and a few new variables added. The new scripts incorporate additional cleaning steps, some minor corrections to the computation of some derived variables, and changes to the ways that exclusion variables are added.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

Over time, the study booklet data, stored in the Access 2yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes arose from late returns of booklets; more recently, they have occasionally resulted from data cleaning or data restructuring. Whenever such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts. The booklet data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves (except in the case of the ReturnDates table). Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries used to export the data are as follows:

Query or table name                           Source of data        Database table(s) involved   Exported file name
Export Adult                                  parent booklets       Adult                        Adult1.csv
Export Child1, Export Child2, Export Child3   twin booklets         Child1, Child2, Child3       Child1.csv, Child2.csv, Child3.csv
ReturnDates                                   booklet return dates  ReturnDates                  return_dates.csv

A convenient way of exporting these data files is to run a macro saved in the Access database. See the data files summary page and the 2 Year data files page for further information about the storage of these files.

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename B1_merge.sps) is to merge raw data files together and do some basic recoding. The script carries out these tasks in order:

  1. There are 2 files of family-based raw data: the file of parent booklet data plus the file of booklet return dates. These raw data files both start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary
    6. Transform some raw variables into more user-friendly compound variables where appropriate. For example, a time interval recorded in three raw data variables representing days, weeks and months is transformed into a single variable representing total number of days.
    7. In the parent booklet data file, add a data flag variable showing the presence of parent data.
    8. Drop raw data variables that are not to be retained in the datasets.
    9. Save as an SPSS data file.
  2. There are 4 files of twin-based raw data: the three files of child booklet data plus the admin data file containing twin IDs and birth orders. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of twin identifier TwinID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary
    6. Transform some raw variables into more user-friendly compound variables where appropriate.
    7. In the first twin booklet data file, add a data flag variable showing the presence of twin data.
    8. In the file of twin birth orders, compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
    9. Drop raw data variables that are not to be retained in the datasets.
    10. Save as an SPSS data file.
  3. Merge the 2 files of family-based data together using FamilyID as the key variable.
  4. Double enter the family-based data as follows:
    1. Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Save as the elder twin part of the family data.
    2. Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
  5. Merge the 4 twin data files together using TwinID as the key variable.
  6. Double enter the twin data flags as follows:
    1. Append 1 to the variable name, to denote twin 1. Sort in ascending order of atempid2 and save as the twin 1 part.
    2. Change the variable names by appending 2 instead of 1. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the double entered variables as the twin 2 part.
    3. Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered twin data flags can now be used to select twin pairs having data.
  7. Merge this twin data file with the double entered parent data file, using atempid2 as the key variable. This dataset now contains all the raw data.
  8. Use the parent data flag, and the double entered twin data flags, to filter the dataset and delete any cases without any 2 Year data. Add the overall 2 Year data flag variable btwoyear.
  9. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  10. Sort in ascending order of scrambled twin ID id_twin.
  11. Save and drop the raw ID variables.
  12. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate. Use btwoyear to filter the dataset and delete cases added from the reference dataset that do not have 2 Year data.
  13. Save a working SPSS data file ready for the next script (filename b1merge in the \working files\ subdirectory).
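The double entry of the family-based data (step 4 above) can be sketched in SPSS syntax roughly as follows. This is an illustrative sketch only: the file names and the example pair of twin-specific variables are assumptions, not the actual contents of B1_merge.sps.

```spss
* Illustrative sketch of double entering the family-based data.
* File names and the example variable pair (par1var, par2var) are assumed.

* Elder twin part: atempid2 = FamilyID followed by 1.
GET FILE = 'working files\familydata.sav'.
COMPUTE atempid2 = FamilyID * 10 + 1.
SORT CASES BY atempid2 (A).
SAVE OUTFILE = 'working files\family_twin1.sav'.

* Younger twin part: append 2 instead, reverse Random, and swap
* elder/younger values in twin-specific variables by renaming.
GET FILE = 'working files\familydata.sav'.
COMPUTE atempid2 = FamilyID * 10 + 2.
RECODE Random (0=1) (1=0).
RENAME VARIABLES (par1var par2var = par2var par1var).  /* example pair */
SORT CASES BY atempid2 (A).
SAVE OUTFILE = 'working files\family_twin2.sav'.

* Combine the two parts by adding cases, then sort.
ADD FILES FILE = 'working files\family_twin1.sav'
         /FILE = 'working files\family_twin2.sav'.
SORT CASES BY atempid2 (A).
SAVE OUTFILE = 'working files\familydouble.sav'.
```

Because each family row is duplicated with the twin-specific columns swapped, the resulting file has one row per twin, each carrying the family data from that twin's perspective.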

Script 2: Deriving new variables

The main purpose of this script (filename B2_derive.sps) is to create scales and other new derived variables. See 2 Year derived variables for more details of how the variables were computed. The script carries out these tasks in order:

  1. Open the dataset file b1merge saved by the last script.
  2. Clean some inconsistent or anomalous item variable values by recoding, as follows:
    1. In two-part questions, typically having a yes/no initial question followed by an if-yes part, remove inconsistencies by recoding the if-yes part to missing (or zero if appropriate) if the initial response was no.
    2. In the same types of questions, where affirmative responses were given in the if-yes part and the initial response was missing, recode the initial response to yes.
    3. If more than one response was given for the marital status question, recode the responses to missing. Combine the 8 raw response variables into a single nominal marital status variable.
  3. Compute new date variables, from day/month/year item variables. Reduce the number of date variables in the dataset by making a best estimate of the completion date for each booklet section, substituting missing dates with other available dates where necessary.
  4. Use the new date variables and the twin birth dates to compute twin age variables.
  5. Derive Behar behaviour scales.
  6. Derive MCDI language scores (word use, vocabulary, sentence complexity and grammar).
  7. Derive item and total scores for each of the Parca activities (the 5 parent-administered tests plus the parent-report measure). Add an overall total score for the 5 parent-administered tests.
  8. Derive standardised composite scores from the MCDI and Parca measure scores. These composites are equally-weighted means of scores that have been standardised on the non-excluded sample of twins.
  9. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  10. Save a working SPSS data file ready for the next script (filename b2derive in the \working files\ subdirectory).
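As an indication of what the cleaning and derivation steps look like in practice, the fragment below sketches the two-part question cleaning (step 2) and an age computation (step 4) in SPSS syntax. All variable names here (byes, bdetail, bparcdate, abirthdate, bparcage) are assumed for illustration and do not necessarily appear in B2_derive.sps.

```spss
* Illustrative cleaning of a two-part (yes/no plus if-yes) question.
* Assumed coding: 1 = yes, 2 = no.
IF (byes = 2) bdetail = $SYSMIS.                         /* if-yes part inconsistent with 'no' */
IF (MISSING(byes) AND NOT (MISSING(bdetail))) byes = 1.  /* infer 'yes' from the if-yes part */

* Illustrative twin age derivation from date variables.
COMPUTE bparcage = DATEDIFF(bparcdate, abirthdate, "days") / 365.25.
EXECUTE.
```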

Script 3: Label variables

The purpose of this script (filename B3_label.sps) is simply to label the variables and add value labels. The script carries out these tasks in order:

  1. Open the dataset file b2derive saved by the last script.
  2. Label all the variables.
  3. For all categorical variables having more than 2 categories, add value labels.
  4. Save a working SPSS data file ready for the next script (filename b3label in the \working files\ subdirectory).
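The labelling commands in this script take a simple, repetitive form, sketched below. The variable names and label texts shown are illustrative assumptions, not the actual contents of B3_label.sps.

```spss
* Illustrative labelling commands; names and label texts are assumed.
VARIABLE LABELS
  btwoyear 'Presence of any 2 Year data for this twin pair'
  bparcage 'Twin age in years at parent booklet completion'.
VALUE LABELS
  btwoyear 0 'No 2 Year data' 1 'Has 2 Year data'.
```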

Script 4: Double enter the twin data

The purpose of this script (filename B4_double.sps) is to double enter the twin variables from the child booklets, add a few new variables that require double entered data, and save the final dataset. Note that twin-related variables from the parent booklet were double entered in the first script above. The script carries out these tasks in order:

  1. Open the dataset file b3label saved by the last script.
  2. Rename all item and derived variables from the twin booklet, by appending 1 to the name.
  3. Sort in ascending order of id_twin and save as the twin 1 part of the data.
  4. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the twin booklet variables by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed twin booklet variables.
  5. Merge the twin 1 and twin 2 parts, using id_twin as the key variable. The dataset is now double entered.
  6. Derive twin-pair age difference variables (in some pairs, the two twin booklets were completed on different dates).
  7. Derive new exclusion variables, specific to the 2 Year study, based on twin age and age difference criteria.
  8. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  9. Save an SPSS data file (filename b4double in the \working files\ subdirectory).
  10. Save another copy as the main 2 Year dataset, with filename bdoub945.
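The double entry performed by this script can be sketched in SPSS syntax along the following lines. The file names and the example variable bbehar are illustrative assumptions; the actual script renames every item and derived variable from the twin booklet.

```spss
* Illustrative sketch of the twin double entry (Script 4); names are assumed.

* Twin 1 part: suffix 1 on each twin booklet variable.
GET FILE = 'working files\b3label.sav'.
RENAME VARIABLES (bbehar = bbehar1).   /* repeated for every twin booklet variable */
SORT CASES BY id_twin (A).
SAVE OUTFILE = 'working files\twin1part.sav'.

* Twin 2 part: suffix 2, with id_twin flipped to match the co-twin.
RENAME VARIABLES (bbehar1 = bbehar2).
DO IF (MOD(id_twin, 10) = 1).
  COMPUTE id_twin = id_twin + 1.   /* final digit 1 becomes 2 */
ELSE.
  COMPUTE id_twin = id_twin - 1.   /* final digit 2 becomes 1 */
END IF.
SORT CASES BY id_twin (A).
SAVE OUTFILE = 'working files\twin2part.sav' /KEEP = id_twin bbehar2.

* Merge the two parts: each row now holds both twins' values.
MATCH FILES FILE = 'working files\twin1part.sav'
           /FILE = 'working files\twin2part.sav'
           /BY id_twin.
SAVE OUTFILE = 'working files\b4double.sav'.
```

After the merge, every row carries the twin's own variables (suffix 1) alongside the co-twin's (suffix 2), which is what makes the pair-level derivations in steps 6 and 7 possible.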