TEDS Data Dictionary

Processing the In Home Data

Introduction

This page describes how the In Home analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are two main sources of data for the In Home analysis dataset:

  1. Data collected during the study, including admin data. These are stored in tables in the Access database file called inhome.accdb.
  2. Background data: twin sexes, zygosity variables, twin birth dates and some exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database then importing them in the creation of every dataset, this is done separately in a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw In Home data files are described in more detail on another page.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The study data, including admin data, stored in the Access inhome.accdb database file, have been subject to occasional changes, even after the end of the study. These changes have sometimes resulted from data cleaning or data restructuring. If these data have been changed in any way, then they should be re-exported before a new version of the dataset is created using the SPSS scripts.

The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:

Query name | Source of data | Database table(s) involved | Exported file name
(query not used; exported directly from table) | twin tests | Child | Child.csv
Export Parent | parent questionnaires | Parent | Parent.csv
Export PostVisit | post-visit questionnaires | PostVisit | PostVisit.csv
Export Inhome admin | In home study admin data | InhomeProgress | Inhome Admin.csv

A convenient way of exporting these data files is to run the saved macro called export raw data in the inhome.accdb Access database. See the data files summary page and the In Home data files page for further information about the storage of these files.
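
The format conversions performed by the export queries can be illustrated with a minimal Python sketch. This is not the Access query or macro itself: it only mimics the described behaviour (selecting columns, writing dates as dd.mm.yyyy and true/false values as 1/0). The column names VisitDate, Returned and Comments are hypothetical, invented for the illustration.

```python
import csv
import io
from datetime import date


def format_value(value):
    """Convert one cell to the form described for the export queries."""
    if isinstance(value, bool):       # true/false columns become 1/0
        return 1 if value else 0
    if isinstance(value, date):       # date columns become dd.mm.yyyy
        return value.strftime("%d.%m.%Y")
    return value


def export_rows(rows, columns):
    """Write only the selected columns of each row as csv text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([format_value(row[c]) for c in columns])
    return buffer.getvalue()


# Hypothetical rows: VisitDate, Returned and Comments are invented names.
rows = [
    {"FamilyID": 101, "VisitDate": date(1999, 3, 7), "Returned": True, "Comments": "free text"},
    {"FamilyID": 102, "VisitDate": date(1999, 11, 21), "Returned": False, "Comments": ""},
]

# The verbatim text column (Comments) is left out of the export.
csv_text = export_rows(rows, ["FamilyID", "VisitDate", "Returned"])
print(csv_text)
```

Leaving the text column out of the selected column list corresponds to the way each query excludes verbatim text fields.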

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purposes of this script (filename E1_merge.sps) are to import the raw data files into SPSS; to carry out basic item variable formatting such as naming, setting variable levels and recoding; and to merge the various data files together, so as to create a basic dataset with one row of data per twin. This script also double-enters the twin-specific items from the parent booklet, and converts the IDs into scrambled form. The script carries out these tasks in order:

  1. There are 3 files of twin-based raw data: the file of child test data from the score sheets, the file of post-visit questionnaire data completed by testers, plus the admin data file containing twin IDs and birth orders. These raw data files all start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of twin identifier TwinID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of variables where necessary.
    6. In the child test data file, check for completeness of the data present and derive a data flag.
    7. In the file of twin IDs and birth orders, compute the alternative twin identifier etempid2 as the FamilyID followed by the twin birth order (1 or 2).
    8. Drop any raw data variables that are not to be retained in the datasets.
    9. Save as an SPSS data file.
  2. Merge the 3 twin data files together using TwinID as the key variable.
  3. Double enter the twin and post-visit data flags, as follows:
    1. Sort in ascending order of etempid2 and save as the twin 1 part. (Note that by this stage the twin variables already have names ending in 1.)
    2. Change the data flag variable names by changing the ending from 1 to 2. Change the values of etempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of etempid2 and save with just the renamed variables as the twin 2 part.
    3. Merge the twin 1 and twin 2 parts using etempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
    4. Save this file of merged twin-level data.
  4. There are 2 files of family-based raw data: the file of parent questionnaire data and the file of in home admin data such as visit dates and 'low' flags. These raw data files both start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Carry out basic recoding of categorical variables where necessary. This includes recoding parent vocabulary responses into item scores.
    6. Drop any raw data variables that are not to be retained in the datasets.
    7. Save as an SPSS data file.
  5. Merge the 2 files of family-based data together using FamilyID as the key variable.
  6. Double enter the family-based data as follows:
    1. Compute twin identifier etempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable, randomly assigning the value 0 or 1 to each elder twin in the dataset. Save as the elder twin part of the family data.
    2. Re-compute etempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of etempid2 and save as the double entered family data file.
  7. Merge this double entered family data file with the twin data file, using etempid2 as the key variable. This dataset now contains all the raw data.
  8. Recode missing values to zero in all flag variables that show the presence or absence of data.
  9. Use the double entered twin data flags to filter the dataset and delete any twin pairs without any in-home twin test data. Compute the overall in-home data flag variable edata.
  10. Pseudonymise the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  11. Sort in ascending order of scrambled twin ID id_twin.
  12. Save the file, dropping the raw ID variables.
  13. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, all of which are already double entered where appropriate.
  14. Merge in additional variables from the 4 Year dataset (language and Parca scores) using id_twin as the key variable. Use variable edata to filter the dataset and delete cases added from the other datasets that do not have In Home data.
  15. Save a working SPSS data file ready for the next script (filename E1merge in the \working files\ subdirectory).
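
The recoding of default values and the double entry of the family-based data (steps 4 and 6 above) can be sketched in Python as follows. This is an illustrative sketch, not the SPSS syntax in E1_merge.sps. It assumes that FamilyID and etempid2 are numeric, so that "appending" a birth order means multiplying by 10 and adding 1 or 2. The item names pvocab, item1 and item2 are hypothetical.

```python
import random

MISSING_CODES = {-99, -77}   # -99 = missing, -77 = not applicable


def recode_missing(row):
    """Replace default codes with None (standing in for SPSS system-missing)."""
    return {k: (None if v in MISSING_CODES else v) for k, v in row.items()}


def cotwin_id(etempid2):
    """Swap the final digit (1 <-> 2) to obtain the co-twin's identifier."""
    family, order = divmod(etempid2, 10)
    return family * 10 + (3 - order)


def double_enter_family(family_rows, twin_specific, rng):
    """Return two rows per family, one per twin, sorted by etempid2.

    twin_specific lists (elder_name, younger_name) pairs of variables whose
    values are swapped over in the younger twin's row.
    """
    out = []
    for fam in family_rows:
        elder = recode_missing(fam)
        elder["etempid2"] = fam["FamilyID"] * 10 + 1   # elder twin: append 1
        elder["Random"] = rng.randint(0, 1)            # random 0 or 1
        younger = dict(elder)
        younger["etempid2"] = cotwin_id(elder["etempid2"])  # younger: append 2
        younger["Random"] = 1 - elder["Random"]        # reversed for the co-twin
        for name1, name2 in twin_specific:
            younger[name1], younger[name2] = elder[name2], elder[name1]
        out.extend([elder, younger])
    return sorted(out, key=lambda r: r["etempid2"])


# Hypothetical family-level rows; pvocab, item1 and item2 are invented names.
families = [
    {"FamilyID": 205, "pvocab": -99, "item1": 3, "item2": 5},
    {"FamilyID": 117, "pvocab": 12, "item1": -77, "item2": 4},
]
double_entered = double_enter_family(families, [("item1", "item2")], random.Random(0))
for row in double_entered:
    print(row)
```

Each family contributes two rows, the twins in a pair always have opposite values of Random, and the twin-specific items appear in swapped positions in the younger twin's row.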

Script 2: Deriving new variables

The purpose of this script (filename E2_derive.sps) is to derive scale and composite variables from the item data, and to derive useful background variables such as exclusion variables and twin ages. See derived In Home variables for full details of how each variable is derived. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive the age of the twins at the time of the visit.
  3. Derive a general-purpose exclusion variable (eexclude) for analysis of the In Home data.
  4. Derive test total scores for those measures where item scores are recorded in the data. These scores include the parent vocabulary, and twin tests BAS, Goldman Fristoe, Non-Word Repetition and Phonological Awareness.
  5. Derive a log-transformed version of the Goldman Fristoe total score.
  6. Where appropriate, derive total scores from related McCarthy test scores (Word Knowledge, Verbal Memory and Numerical Memory).
  7. Derive McCarthy "index scores" as the weighted totals of scores from groups of tests.
  8. Using variables copied from the 4 Year booklet dataset, create variables to categorise twins who are "low-language" (dllang1/2) and "low-Parca" (dlparca1/2). Note that these variables derived from existing 4 Year data are similar but not identical to the historical flag variables that have been imported into the dataset from the In Home admin data (dlowlan1/2 and dlowpar1/2 respectively).
  9. Derive corresponding flag variables to categorise "control" families in which neither twin is low-language or low-Parca. These variables are econtrola for definitions based on the historical twin labels, and econtrolb for definitions based on the variables derived from the 4 Year booklet data.
  10. Create standardised versions of test scores and composites derived so far. These are standardised on the distributions for "control" families, as defined using the 4 Year booklet variables (variable econtrolb). The steps taken were as follows:
    1. Apply a filter to remove exclusions (eexclude=0) and to include only control families (econtrolb=1).
    2. Determine the mean and standard deviation for each variable to be standardised.
    3. Remove the filter.
    4. For all cases (not just controls), compute the standardised version of each variable by subtracting the mean then dividing by the standard deviation.
  11. Derive an "articulation" composite as the mean of two standardised scores (log-transformed Goldman Fristoe, and Non-Word Repetition). Standardise this composite to the control distribution using the same method as described above.
  12. Derive language and non-verbal composites. There are two versions of each: a "long" version derived from a larger number of component scores, and a "short" version derived from a smaller number of component scores. Each composite is derived as a mean of the relevant standardised component scores. Each composite itself is then standardised to the control distribution using the same method as described above.
  13. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  14. Save a working SPSS data file ready for the next script (filename E2derive in the \working files\ subdirectory).
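
The standardisation to the control distribution (step 10 above) can be sketched in Python as follows. This is an illustrative sketch rather than the SPSS syntax in E2_derive.sps. It assumes that the standard deviation is the sample standard deviation (n - 1 denominator) and that missing scores are represented by None. The variable names score and zscore are hypothetical; eexclude and econtrolb are the variables named above.

```python
from statistics import mean, stdev


def standardise_to_controls(rows, var, out_var):
    """Standardise var for all cases, using non-excluded control cases as reference."""
    # Steps 1-2: filter to non-excluded controls, then find the mean and SD.
    reference = [
        r[var] for r in rows
        if r["eexclude"] == 0 and r["econtrolb"] == 1 and r[var] is not None
    ]
    m = mean(reference)
    sd = stdev(reference)   # sample SD (assumption: n - 1 denominator)
    # Steps 3-4: with the filter removed, standardise every case.
    for r in rows:
        r[out_var] = None if r[var] is None else (r[var] - m) / sd
    return m, sd


# Hypothetical data: three controls, one non-control, one excluded case, one missing.
rows = [
    {"eexclude": 0, "econtrolb": 1, "score": 8},
    {"eexclude": 0, "econtrolb": 1, "score": 10},
    {"eexclude": 0, "econtrolb": 1, "score": 12},
    {"eexclude": 0, "econtrolb": 0, "score": 14},
    {"eexclude": 1, "econtrolb": 1, "score": 100},
    {"eexclude": 0, "econtrolb": 0, "score": None},
]
m, sd = standardise_to_controls(rows, "score", "zscore")
print(m, sd, [r["zscore"] for r in rows])
```

Because the mean and standard deviation come only from the control cases, the standardised scores of controls have mean 0 and standard deviation 1, while other cases are expressed relative to that control distribution. A composite would be the mean of several such standardised scores, passed through the same function again.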

Script 3: Labelling the variables

The purpose of this script (filename E3_label.sps) is to label all the variables in the dataset, and to add value labels to integer-valued categorical items (those having 3 or more categories). The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all the variables.
  3. Add value labels for every integer-valued categorical variable (whether nominal or ordinal) having 3 or more different categories.
  4. Save a working SPSS data file ready for the next script (filename E3label in the \working files\ subdirectory).

Script 4: Double entering the twin data

The purpose of this script (filename E4_double.sps) is to double-enter all the twin-specific data in the dataset. This includes the twin test data and the post-visit ratings data. (Note that the few twin-specific items in the parent data have already been double entered, in the first script.) The post-visit data describe families, not twins, but these data are specific to testers who were linked to individual twins. Each family has two sets of post-visit data linked via the testers to the two twins, so these variables are specific to twins and must be double-entered.

The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the variables (from the twin test score sheets and post-visit data) by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  3. Re-open the data file saved at the end of the previous script: this already serves as the twin 1 part of the dataset.
  4. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  5. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  6. Save an SPSS data file (filename E4double in the \working files\ subdirectory).
  7. Save another copy as the main In Home dataset, with filename Edb9456.
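
The double entry steps above can be sketched in Python as follows. This is an illustrative sketch, not the SPSS syntax in E4_double.sps. It assumes that id_twin is numeric with a final digit of 1 or 2, as implied by the step that changes the final digit to match the co-twin. The variable name etest1 is hypothetical.

```python
def cotwin_id(id_twin):
    """Swap the final digit (1 <-> 2) so the ID matches the co-twin."""
    family, order = divmod(id_twin, 10)
    return family * 10 + (3 - order)


def double_enter_twins(rows, twin_vars):
    """Add co-twin copies (suffix 2) of the twin-specific variables (suffix 1)."""
    # Twin 2 part: rename suffix 1 -> 2 and key each row by the co-twin's ID.
    twin2_part = {}
    for r in rows:
        renamed = {name[:-1] + "2": r[name] for name in twin_vars}
        twin2_part[cotwin_id(r["id_twin"])] = renamed
    # Twin 1 part is the original data; merge in the twin 2 part by id_twin.
    merged = []
    for r in sorted(rows, key=lambda row: row["id_twin"]):
        new_row = dict(r)
        co = twin2_part.get(r["id_twin"], {})
        for name in twin_vars:
            name2 = name[:-1] + "2"
            new_row[name2] = co.get(name2)   # None if the co-twin has no row
        merged.append(new_row)
    return merged


# Hypothetical twin pair; etest1 is an invented variable name.
rows = [
    {"id_twin": 50012, "etest1": 7},
    {"id_twin": 50011, "etest1": 9},
]
merged = double_enter_twins(rows, ["etest1"])
for row in merged:
    print(row)
```

After the merge, each twin's row holds the twin's own value in the variable ending in 1 and the co-twin's value in the variable ending in 2.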