TEDS Data Dictionary

Processing the 8 Year Data

Introduction

This page describes how the 8 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are two main sources of data for the 8 Year analysis dataset:

  1. Questionnaire data, and study admin data. These are stored in tables in the Access database file called 8yr.accdb.
  2. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database and importing them afresh each time a dataset is created, this is done once, separately, to build a reference dataset containing all the background variables. The background variables are then added, ready-made, from this reference dataset.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described on the data processing summary page. The raw 8 Year data files are described in more detail on another page.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The questionnaire data and the administrative data, stored in the Access 8yr.accdb database file, have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of questionnaires; more recently, changes have occasionally been caused by data cleaning or data restructuring. If any such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts.

The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:

Query name             Source of data          Database table(s) involved   Exported file name
Export Questionnaire   parent questionnaires   Questionnaire                Questionnaire.csv
Export 8yr admin       8 year admin data       yr8Progress                  8yrAdmin.csv

A convenient way of exporting these data files is to run a macro that has been saved for this purpose within the Access database. See the data files summary page and the 8 Year data files page for further information about the storage of these files.
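
The value formatting applied by the export queries can be illustrated with a minimal Python sketch. This is not the Access query itself: the row contents and the column names ReturnDate and Returned are hypothetical, and writing missing values as empty cells is an assumption. It only shows dates being written as dd.mm.yyyy and true/false values as 1/0.

```python
import csv
import io
from datetime import date

# Hypothetical rows standing in for the result of an export query.
# The column names are illustrative, not the real 8yr.accdb columns.
rows = [
    {"FamilyID": 1001, "ReturnDate": date(2003, 5, 7), "Returned": True},
    {"FamilyID": 1002, "ReturnDate": None, "Returned": False},
]

def format_value(value):
    """Convert one value to the text form written to the csv file."""
    if isinstance(value, bool):   # true/false columns become 1/0
        return "1" if value else "0"
    if isinstance(value, date):   # date columns become dd.mm.yyyy
        return value.strftime("%d.%m.%Y")
    if value is None:             # assumption: missing values become empty cells
        return ""
    return str(value)

def export_csv(rows):
    """Return the csv text for a list of row dictionaries."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(rows[0].keys())
    for row in rows:
        writer.writerow([format_value(v) for v in row.values()])
    return buffer.getvalue()

csv_text = export_csv(rows)
```

Converting values at export time means the SPSS scripts can read every column with a simple, fixed format.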

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename H1_merge.sps) is to merge various raw data files together, so as to create a dataset with one row of data per family. This script also carries out some basic processing of item variables. The script carries out these tasks in order:

  1. There are 2 files of family-based raw data: the file of questionnaire data and the file of 8 year admin data such as return dates. These raw data files both start in csv format. For each of these files in turn, carry out the following actions:
    1. Set the variable names
    2. Set the displayed width and number of decimal places for each variable
    3. Recode -99 values (signifying missing data) to SPSS "system missing" values, in all relevant item variables
    4. Set the variable level (nominal/ordinal/scale) for each variable
    5. Carry out basic recoding of item variables where needed
    6. In the questionnaire data file, add reversed versions of all the CAST items
    7. Sort in ascending order of family identifier FamilyID
  2. Merge together the 2 files described above, using FamilyID as the key variable.
  3. Filter the dataset to delete cases added from the admin data that do not have 8 Year data.
  4. Save a working SPSS data file ready for the next script (filename H1merge).
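
The logic of these steps can be sketched in Python, purely as an illustration (the real processing is done in SPSS syntax). The item names, the reversed-item suffix "r", the reversal formula (maximum minus value) and the has_data flag used for filtering are all assumptions made for this example; only FamilyID and the -99 missing code come from the description above.

```python
MISSING = -99  # code signifying missing data in the raw files

# Hypothetical family-level records; item names are illustrative only.
questionnaire = [
    {"FamilyID": 3, "item1": 2, "item2": -99},
    {"FamilyID": 1, "item1": 0, "item2": 1},
]
admin = [
    {"FamilyID": 1, "returned": 1},
    {"FamilyID": 2, "returned": 0},
    {"FamilyID": 3, "returned": 1},
]

def recode_missing(record, items):
    """Replace the -99 code with None (standing in for system missing)."""
    return {k: (None if k in items and v == MISSING else v)
            for k, v in record.items()}

def add_reversed(record, item, maximum):
    """Add a reversed copy of an item scored 0..maximum (assumed scoring)."""
    value = record[item]
    out = dict(record)
    out[item + "r"] = None if value is None else maximum - value
    return out

def merge_on_family(left, right):
    """Merge two record lists on FamilyID, in ascending FamilyID order."""
    merged = {}
    for record in left + right:
        merged.setdefault(record["FamilyID"], {}).update(record)
    return [merged[key] for key in sorted(merged)]

# Process the questionnaire file, flagging rows that have questionnaire data.
q = [add_reversed(recode_missing(r, {"item1", "item2"}), "item1", 2)
     for r in questionnaire]
q = [dict(r, has_data=1) for r in q]

# Merge with the admin file, then drop admin-only families.
merged = merge_on_family(q, admin)
dataset = [r for r in merged if r.get("has_data") == 1]
```

In this sketch, family 2 appears only in the admin data, so it is removed by the final filter.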

Script 2: Double entry

The purpose of this script (filename H2_double.sps) is to double enter the dataset. Note that the data are all parent-reported, and in the raw data items refer specifically to the elder twin (variable names ending in 1) and the younger twin (variable names ending in 2). In the raw data and the dataset used up to this point, there is one row of data per family. This script reorganises the data so that there is one row per twin, and duplicates the data for double entry, as described below. The script also scrambles the IDs. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the elder twin part of the dataset by carrying out the following steps.
    1. Add variable htwin, with value 1 denoting elder twin
    2. Add twin identifier htempid2, derived by appending 1 to the value of FamilyID
    3. Derive the random variable, assigning the value 0 or 1 randomly (but with equal probability) to each elder twin
    4. Save as the elder twin dataset
    5. Note that in this part of the dataset, as in the raw questionnaire data, variables with names ending in 1 denote the elder twin, while variables with names ending in 2 denote the younger twin.
  3. Now create the younger twin part of the dataset as follows.
    1. Change the value of htwin to 2, denoting the younger twin
    2. Change the value of htempid2, by changing the final digit from 1 to 2
    3. Reverse the values of the random variable, changing 1 to 0 and vice versa
    4. In all twin-specific item variables, swap over the elder and younger twin values by renaming the variables. As a result, variables with names ending in 1 now denote the younger twin, while variables with names ending in 2 denote the elder twin.
    5. Save as the younger twin dataset
  4. Combine the elder and younger twin parts together in the same dataset by adding cases.
  5. Sort in ascending order of htempid2, and save.
  6. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  7. Sort in ascending order of scrambled twin ID id_twin.
  8. Save a working SPSS data file ready for the next script (filename H2double), dropping the raw ID variables.
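
Double entry as described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPSS script: FamilyID is assumed to be an integer, the item stem "item" is hypothetical, and the ID scrambling step is omitted because its algorithm is described elsewhere.

```python
import random

def double_enter(families, twin_item_stems, seed=None):
    """Turn one row per family into one row per twin (double entered)."""
    rng = random.Random(seed)
    rows = []
    for fam in families:
        # Elder twin row: items ending in 1 are the elder twin's own values.
        elder = dict(fam)
        elder["htwin"] = 1
        elder["htempid2"] = fam["FamilyID"] * 10 + 1  # append digit 1
        elder["random"] = rng.randint(0, 1)           # 0 or 1, equal probability

        # Younger twin row: same family data, with twin items swapped over.
        younger = dict(elder)
        younger["htwin"] = 2
        younger["htempid2"] = fam["FamilyID"] * 10 + 2  # final digit 1 -> 2
        younger["random"] = 1 - elder["random"]         # reversed value
        for stem in twin_item_stems:
            younger[stem + "1"] = elder[stem + "2"]
            younger[stem + "2"] = elder[stem + "1"]

        rows.extend([elder, younger])
    # Combine both parts and sort in ascending order of twin identifier.
    return sorted(rows, key=lambda r: r["htempid2"])

# Hypothetical family record with one twin-specific item pair.
families = [{"FamilyID": 12, "item1": 5, "item2": 7}]
twins = double_enter(families, ["item"], seed=0)
```

After double entry, a variable ending in 1 always refers to the twin whose row it is, and a variable ending in 2 refers to the co-twin.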

Script 3: Deriving new variables

The purpose of this script (filename H3_derive.sps) is to compute scales and other derived variables from the item data. See derived 8 Year variables for full details. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, autism, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  3. Use variable heightyr to filter the dataset and delete cases added from the reference dataset that do not have 8 Year data.
  4. Derive the twin age when the questionnaire was completed by the parents. In cases where the date on the questionnaire is missing, use the return date logged in the admin data.
  5. Compute the CAST, Conners and RPQ scales.
  6. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  7. Save a working SPSS data file ready for the next script (filename H3derive).
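
The age derivation in step 4 can be sketched in Python. The fallback from questionnaire date to return date follows the description above; the function name, the conversion of days to years by dividing by 365.25, and the rounding to two decimal places are assumptions made for illustration only.

```python
from datetime import date

def twin_age(birth_date, questionnaire_date, return_date):
    """Age in years at questionnaire completion, or None if it cannot be derived."""
    # Prefer the date written on the questionnaire; otherwise fall back
    # to the return date logged in the admin data.
    completed = questionnaire_date if questionnaire_date is not None else return_date
    if birth_date is None or completed is None:
        return None
    # Assumption: decimal years, using an average year length of 365.25 days.
    return round((completed - birth_date).days / 365.25, 2)

# Usage with hypothetical dates.
age_from_questionnaire = twin_age(date(1995, 1, 1), date(2003, 1, 1), date(2004, 1, 1))
age_from_return_date = twin_age(date(1995, 1, 1), None, date(2003, 1, 1))
```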

Script 4: Labelling variables

The purpose of this script (filename H4_label.sps) is to label all the variables, add value labels where appropriate, and to save the final dataset with variables placed in systematic order. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Add variable labels to all variables.
  3. Add value labels to every integer-valued categorical variable having 3 or more categories.
  4. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  5. Save an SPSS data file (filename H4label in the \working files\ subdirectory).
  6. Save another copy as the main 8 Year dataset, with filename hdb9456.
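
The effect of ordering variables with a KEEP statement when saving can be sketched in Python: the saved dataset contains only the listed variables, in the order listed. The variable names other than id_twin and htwin are hypothetical, as is the function name.

```python
def keep(record, order):
    """Return a copy of the record with only the named variables, in that order."""
    return {name: record[name] for name in order}

# Hypothetical row with variables in no particular order.
row = {"scale_b": 3, "id_twin": 5001, "tempvar": 9, "htwin": 1, "scale_a": 7}

# Hypothetical systematic order: identifiers first, then scales.
variable_order = ["id_twin", "htwin", "scale_a", "scale_b"]

ordered_row = keep(row, variable_order)
```

Here tempvar is left out of the list and so does not appear in the result.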