TEDS Data Dictionary

Processing the 1st Contact Data

Introduction

This page describes how the 1st Contact analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, raw data cleaning and aggregation are taken for granted here.) There are two main sources of data for the 1st Contact analysis dataset:

  1. Booklet data collected in the study, booklet return dates, and Acorn code data. These are stored in tables in the Access database file called 1c.accdb.
  2. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Most of the source data for these background variables are stored in the TEDS admin database, where they are occasionally updated. These may be exported from the admin database then imported for dataset creation, but generally the background variables are available ready-made for merging from a separate reference dataset.

Converting raw data from these sources into the dataset involves two main processes: firstly, raw data must be "exported" from the databases into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 1st Contact data files are described in more detail on another page.

Note that this page describes the current version (post-2013) of the 1st Contact dataset, constructed using SPSS scripts. The earlier versions were constructed using SAS scripts. Variable names and variable coding have been retained, although some redundant variables have been dropped and a few new variables added. The new scripts incorporate additional cleaning steps, some minor corrections in the computation of some derived variables, and changes to the way that exclusion variables are added.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The study booklet data, stored in the Access 1c.accdb database file, have been subject to occasional changes, even after the end of data collection. The 1st Contact study has been extended more than once, to try to obtain data from families that did not respond first time around; hence new data have been added to the database. Further changes have occasionally been made due to data cleaning or data restructuring. Whenever these data are changed in any way, they should be re-exported before a new version of the dataset is created using the SPSS scripts. The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding any fields not needed for the dataset. The queries used to export the booklet data are as follows:

Query name            Database table         Exported file name
Export Part1          Part1                  Part1.csv
Export Part2          Part2                  Part2.csv
Export Part3          Part3                  Part3.csv
Export Part4          Part4                  Part4.csv
Export Part5          Part5                  Part5.csv
Export Part6          Part6                  Part6.csv
Export Part7          Part7                  Part7.csv
Export ReturnDates    FirstContactProgress   Returndates.csv

A convenient way of exporting these data files is to run a macro within the Access database. See the data files summary page and the 1st Contact data files page for further information about the storage of these files.
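Before running the SPSS scripts, it is worth checking that every expected csv export is present. The helper below is a hypothetical sketch (not part of the TEDS processing), using the file names from the table above:

```python
from pathlib import Path

# Expected csv exports from the 1c.accdb queries (see table above).
EXPECTED_EXPORTS = [
    "Part1.csv", "Part2.csv", "Part3.csv", "Part4.csv",
    "Part5.csv", "Part6.csv", "Part7.csv", "Returndates.csv",
]

def missing_exports(export_dir, expected=EXPECTED_EXPORTS):
    """Return the names of expected export files not found in export_dir."""
    present = {p.name for p in Path(export_dir).glob("*.csv")}
    return [name for name in expected if name not in present]
```

Running this before the scripts avoids a partial dataset being built from stale or missing exports.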

Processing by scripts

Having exported the raw data as described above, a new version of the dataset is made by running SPSS scripts (syntax files). The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon. The functions of each script are described below.

Script 1: Merging raw data sources

The main purpose of this script (filename A1_merge.sps) is to merge raw data files together, name the item variables, set variable display properties, and do some basic recoding and transforming of variables.

For further information about changes to names and coding of item variables, refer to the annotated booklet (pdf).

The script carries out these tasks in order:

  1. There are 8 files of family-based raw data: the seven files of booklet data plus the file containing Acorn data. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary
    6. Transform some raw variables into more user-friendly compound variables where appropriate. For example, a time interval recorded in three raw data variables representing days, weeks and months is transformed into a single variable representing total number of days.
    7. Drop raw data variables that are not to be retained in the datasets.
    8. Save as an SPSS data file.
  2. Merge together the 7 SPSS files of booklet data, using FamilyID as the key variable. Add the acontact variable to show that cases in this merged file all have 1st Contact data.
  3. Add (by merging) the SPSS file of Acorn data, again using FamilyID as the key variable. This creates a dataset containing all the raw data, and with one row of data per family.
  4. Using the acontact variable as a filter, delete cases (added from the Acorn data) that do not have 1st Contact booklet data.
  5. Save a working SPSS data file ready for the next script (filename a1merge in the \working files\ subdirectory).
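The steps above can be sketched in pandas terms. This is only an illustrative equivalent of A1_merge.sps (the real processing is done in SPSS), and the column name q1 is invented for illustration:

```python
from functools import reduce
import numpy as np
import pandas as pd

def read_part(path):
    """Import one raw csv file, sort by FamilyID, and recode the
    default missing codes (-99 missing, -77 not applicable) to NaN,
    mirroring the SPSS recode to system missing."""
    df = pd.read_csv(path).sort_values("FamilyID")
    return df.replace([-99, -77], np.nan)

def merge_booklet(parts, acorn):
    """Merge the booklet files one row per family, flag them with
    acontact, add the Acorn data, then drop Acorn-only cases."""
    merged = reduce(lambda a, b: a.merge(b, on="FamilyID"), parts)
    merged["acontact"] = 1  # these cases all have 1st Contact data
    merged = merged.merge(acorn, on="FamilyID", how="outer")
    # Acorn-only families get a missing acontact and are filtered out.
    return merged[merged["acontact"] == 1]
```

The outer merge followed by the acontact filter reproduces steps 3 and 4: Acorn rows are attached where they match, and families without booklet data are then deleted.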

Script 2: Double entering the data

The main purpose of this script (filename A2_double.sps) is to double-enter the data. The script carries out these tasks in order:

  1. Open the dataset file a1merge saved by the last script. This dataset so far contains just one row of data per twin pair, in which twin variables refer specifically to the older and younger twin.
  2. Convert this into a dataset for the set of elder twins as follows:
    1. Compute twin identifier atempid2: multiply family identifier FamilyID by 10, and add 1 (to denote elder twin).
    2. For a few cases, flagged by value 2 in item variable atwinord, where elder twin details are thought to have been recorded for the younger twins and vice versa, recompute atempid2 by changing the last digit from 1 to 2.
    3. Compute the random variable, assigning values 0 and 1 randomly (but with equal probability) to the elder twins.
    4. Save this dataset as the elder twin part.
  3. Now convert this same dataset into a dataset for the younger twin as follows:
    1. Re-compute twin identifier atempid2 by changing the final digit to 2 (to denote younger twins).
    2. For the few cases flagged by atwinord=2, where twin data need to be reversed, change the last digit of atempid2 to 1.
    3. Reverse the values of the random variable for the younger twin, by recoding 0 to 1 and vice versa.
    4. For all twin-specific item variables, swap the elder and younger twin values around by a process of variable renaming.
    5. Save this dataset as the younger twin part.
  4. Merge cases from the elder and younger twin parts, making a larger dataset containing one row of data for each twin. Sort in ascending order of atempid2 and save.
  5. Anonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page. Drop the raw ID variables.
  6. Sort in ascending order of id_twin (anonymised twin ID).
  7. Merge in essential background variables not already present, from the reference dataset of background variables. The background variables added here include twin sexes, medical exclusions and the standard exclusion variables (all already double entered) and twin birth date (for deriving ages). This merging is done using id_twin as the key variable.
  8. Save a working SPSS data file ready for the next script (filename a2double in the \working files\ subdirectory).
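The double-entry restructuring above can be sketched as follows. This is an illustrative pandas equivalent of A2_double.sps, not the script itself; the "_eld"/"_yng" suffixes are hypothetical names for twin-specific variables, and the atwinord=2 reversal and ID anonymisation steps are omitted for brevity:

```python
import numpy as np
import pandas as pd

def double_enter(pairs, twin_vars, seed=0):
    """Turn one row per twin pair into one row per twin.

    pairs     : DataFrame with FamilyID and <var>_eld / <var>_yng columns
    twin_vars : base names of the twin-specific variables to swap
    """
    rng = np.random.default_rng(seed)
    # Elder twin part: atempid2 = FamilyID * 10 + 1, random 0/1 flag.
    elder = pairs.copy()
    elder["atempid2"] = elder["FamilyID"] * 10 + 1
    elder["random"] = rng.integers(0, 2, len(elder))
    # Younger twin part: last digit 2, random flag reversed.
    younger = pairs.copy()
    younger["atempid2"] = younger["FamilyID"] * 10 + 2
    younger["random"] = 1 - elder["random"].to_numpy()
    # Swap elder and younger values for the twin-specific variables.
    for v in twin_vars:
        cols = [v + "_eld", v + "_yng"]
        younger[cols] = younger[cols[::-1]].to_numpy()
    out = pd.concat([elder, younger], ignore_index=True)
    return out.sort_values("atempid2", ignore_index=True)
```

After this step each twin has their own row, with the "_eld" columns always referring to the twin the row describes relative to their co-twin.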

Script 3: Cleaning the raw data

The main purpose of this script (filename A3_clean.sps) is to clean the raw data by recoding variables, where anomalies or inconsistencies can be detected. The script carries out these tasks in order:

  1. Open the dataset file a2double saved by the last script.
  2. Clean up inconsistent or incomplete responses in multiple-part questions, consisting of an initial question typically followed by one or more "if yes then" questions:
    1. Where the initial response is "no", but the follow-up responses are affirmative, remove the inconsistency by recoding the follow-up responses to missing. This assumes that the initial response is more likely to be accurate than the follow-up response.
    2. Where the initial response is missing, but the follow-up responses are affirmative, assume that the initial response should be "yes" and recode accordingly.
  3. Clean up inconsistent or incomplete responses in the six questions asking for the sex, relationship to twins and marital status of the respondent and partner:
    1. Where a response is missing but the likely response can be deduced from other related responses, recode the missing response accordingly. For example, if the responses show the respondent is female, the natural mother of the twins, and cohabiting with the other parent, it can usually be deduced that the partner is male, the father of the twins, and also cohabiting with the other parent.
    2. Where a single response is inconsistent with the other 5 responses, and the "correct" response can reasonably be deduced, recode the response accordingly. For example, if the respondent is female, the natural mother, and married to the other parent, while the partner is male and the natural father, then any different marital status recorded for the partner can reasonably be recoded to "married to the other parent".
    3. Where inconsistent responses cannot be resolved as above, and it is difficult to deduce the "correct" responses, recode inconsistent responses to missing.
  4. Compute the total numbers of older and younger siblings.
  5. Clean up inconsistent responses in the sibling data:
    1. Where initial responses show there to be fewer than 3 older siblings present, but details of a third older sibling are recorded, then remove the inconsistency by deleting the 3rd older sibling's details (recode them to missing). Repeat for similar sets of responses for the second and first older siblings, where initial responses show fewer than 2 or 1 older siblings respectively.
    2. Repeat for the younger siblings (details of up to 2 younger sibs may be recorded).
    3. If responses for any younger sibling show that the sibling is in fact older than the twins, then delete the responses relating to that sibling (recode them to missing).
    4. Repeat for older siblings who are apparently younger than the twins.
    5. If responses for any older or younger sibling show that their date of birth is within 300 days of the twins' date of birth, and the sibling's parents are the same as for the twins, then delete the responses relating to that sibling.
  6. Compute cleaned versions of various quantitative variables, removing outliers (generally more than 3.6 SD above or below the mean) and inappropriate zero values by recoding them to missing.
  7. Compute recoded, ordinal versions of some quantitative variables.
  8. For all new derived variables, as for item variables in the previous script, set the variable level (nominal/ordinal/scale), width and number of decimal places.
  9. Save a working SPSS data file ready for the next script (filename a3clean in the \working files\ subdirectory).
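Two of the cleaning rules above can be sketched in pandas. This is illustrative only (the real cleaning is done in SPSS, in A3_clean.sps), and the rules assume 1 = yes, 0 = no; column names in the usage below are invented:

```python
import numpy as np
import pandas as pd

def clean_followup(df, initial, followup):
    """Resolve inconsistent initial/follow-up question pairs (step 2)."""
    out = df.copy()
    # Initial "no" but affirmative follow-up: trust the initial answer
    # and recode the follow-up to missing.
    out.loc[(out[initial] == 0) & (out[followup] == 1), followup] = np.nan
    # Initial missing but affirmative follow-up: deduce a "yes".
    out.loc[out[initial].isna() & (out[followup] == 1), initial] = 1
    return out

def trim_outliers(series, sd=3.6, drop_zero=True):
    """Recode outliers beyond `sd` SDs of the mean, and optionally
    inappropriate zero values, to missing (step 6)."""
    s = series.astype(float)
    if drop_zero:
        s[s == 0] = np.nan
    z = (s - s.mean()) / s.std()
    s[z.abs() > sd] = np.nan
    return s
```

Note that trimming is a single pass here; the mean and SD are computed once, before the outliers are removed.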

Script 4: Adding new derived variables

The main purpose of this script (filename A4_derive.sps) is to compute various types of derived variables. For full details of how new variables are derived, see the 1st Contact derived variables page. The script carries out these tasks in order:

  1. Open the dataset file a3clean saved by the last script.
  2. Add variable sexdif to flag twin pairs having opposite sexes (needed for the zygosity algorithm).
  3. Use the zygosity algorithm to compute derived variables atempzyg, aalgzyg, aalg2zy based on item data in the 1st Contact zygosity questionnaire.
  4. Compute perinatal outlier exclusion flag variables aperi1, aperi2, aperi3, aperi4, aperi5, aperinat from various item variables relating to pregnancy and birth of the twins.
  5. Convert raw day/month/year item variables into date values in new variables.
  6. Derive the ages (when the 1st Contact booklet was completed) of the twins, respondent, partner, natural mother, and all siblings, from various date variables.
  7. Derive the age of the mother when her first child was born (amagechl), from various parent, twin and sibling items including their birth dates.
  8. Compute various other derived variables relating to the male and female parents, based on respondent and partner item variables. These derived variables include household type, and qualification and employment categories for the female and male parents.
  9. Derive composite variables for SES (ases), twin medical risk (atwmed1/2) and mother medical risk (amedtot). Each of these composites is derived from a range of other item and derived variables, and is standardised on the non-excluded sample of twin pairs (exclude1=0 & exclude2=0).
  10. For all new derived variables, as for other variables in previous scripts, set the variable level (nominal/ordinal/scale), width and number of decimal places.
  11. Drop all temporary and redundant variables that have been used in the computation of new derived variables. Date variables are dropped at this point, having been used to derive ages.
  12. Save a working SPSS data file ready for the next script (filename a4derive in the \working files\ subdirectory).
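The age derivation in step 6 amounts to differencing two date variables. A minimal illustrative sketch (the ages are actually derived in SPSS; the column names here are hypothetical):

```python
import pandas as pd

def derive_age_years(df, birth_col, event_col):
    """Age in decimal years at an event date (e.g. booklet return),
    derived from two date columns."""
    birth = pd.to_datetime(df[birth_col])
    event = pd.to_datetime(df[event_col])
    return (event - birth).dt.days / 365.25
```

The same calculation serves for twins, respondent, partner, natural mother and siblings, given the relevant pair of date variables; the date variables themselves are then dropped, as in step 11.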

Script 5: Adding variable and value labels

The main purpose of this script (filename A5_label.sps) is to add variable labels to all variables, and value labels where appropriate. For a full list of variables, including labels and descriptions of value coding, see the 1st Contact variables list page. The script carries out these tasks in order:

  1. Open the dataset file a4derive saved by the last script.
  2. Add a descriptive variable label to every variable in the dataset.
  3. For every categorical variable having 3 or more response categories, add value labels to describe the numbered categories.
  4. Place the variables into a logical and systematic order. The variable order generally follows the order in which respective items appear in the 1st Contact booklet; additional derived variables appear at the end of the dataset.
  5. Save a backup copy of this dataset (filename a5label) in the \working files\ subdirectory.
  6. Save another copy as the main 1st Contact dataset, with filename adb9456.
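Value labelling (step 3) attaches text labels to numbered categories. In SPSS this is done with VALUE LABELS commands in A5_label.sps; a rough pandas analogue maps the codes to a labelled categorical. The variable name and codes below are invented for illustration:

```python
import pandas as pd

# Hypothetical code-to-label mappings, one per categorical variable.
VALUE_LABELS = {"azygos": {1: "MZ", 2: "DZ", 3: "Unknown"}}

def label_values(df, labels=VALUE_LABELS):
    """Replace numeric category codes with labelled categoricals."""
    out = df.copy()
    for col, mapping in labels.items():
        if col in out:
            out[col] = out[col].map(mapping).astype("category")
    return out
```

Unlike SPSS, which keeps the numeric codes and displays the labels, this analogue replaces the codes outright, so it is suitable for display rather than further numeric recoding.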