TEDS Data Dictionary

Processing the 10 Year Data

Introduction

This page describes how the 10 Year analysis dataset is created. The starting point is the raw data, in cleaned and aggregated form. (Prior processes of data collection, data entry, data cleaning and aggregation are taken for granted here.) There are three main sources of data for the 10 Year analysis dataset:

  1. Twin web test data. These are stored in .csv text files, one file per test/activity plus a family-based file containing web status information and the brief parent web questionnaire.
  2. Teacher questionnaire data and study admin data. These are stored in tables in the Access database file called 10yr.accdb.
  3. Background data: twin sexes, zygosity variables, twin birth dates, medical exclusion and overall exclusion variables. Some of these variables are from the 1st Contact dataset, but the source of most of the background variables is the TEDS admin database, where they may occasionally be updated. Rather than exporting the source variables from the admin database then importing them in the creation of every dataset, this is done separately in a reference dataset containing all the background variables. This reference dataset is used here to add the background variables, ready-made.

Converting raw data from these sources into the dataset involves two main processes: firstly, where appropriate, raw data must be "exported" from the Access database into files that can be used by SPSS; secondly, the data files are combined and restructured, using SPSS, into a form suitable for analysis. The latter involves a lengthy series of steps, which are saved and stored in SPSS scripts (syntax files).

General issues involved in creating TEDS datasets are described in the data processing summary page. The raw 10 year data files are described in more detail on another page.

Exporting raw data

Exporting involves copying the cleaned and aggregated raw data from the Access database where they are stored, into csv files that can be read into SPSS. The process of exporting raw data is described in general terms in the data processing summary page.

The data stored in the Access database file 10yr.accdb have been subject to occasional changes, even after the end of data collection. In earlier years, changes were caused by late returns of teacher questionnaires; more recently, changes have occasionally resulted from data cleaning or data restructuring. If any such changes have been made, the data should be re-exported before a new version of the dataset is created using the SPSS scripts.

The data stored in the database tables are exported indirectly, by means of saved "queries" (or views), rather than directly from the tables themselves. Each query selects appropriate columns from the relevant tables, excluding inappropriate data such as verbatim text fields. The queries also modify the format of the data values in some columns, so that they are saved in a format that can easily be read by SPSS; examples are date columns (changed to dd.mm.yyyy format) and true/false columns (changed to 1/0 values). The queries used to export the data are as follows:

Query name         | Source of data          | Database table(s) involved | Exported file name
Export Teacher     | teacher questionnaires  | Teacher                    | teacher.csv
Export 10yr admin  | 10 year admin data      | yr10Progress               | 10yrAdmin.csv
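
The format changes made by the export queries can be illustrated with a minimal Python sketch. This is not the Access query itself: the function name export_row and the column names in the example row are hypothetical, chosen only to show dates being written as dd.mm.yyyy, true/false values being written as 1/0, and verbatim text columns being left out.

```python
from datetime import date


def export_row(row, drop_columns=()):
    """Return a copy of one table row, formatted for reading into SPSS."""
    out = {}
    for name, value in row.items():
        if name in drop_columns:
            continue  # e.g. verbatim text fields are not exported
        if isinstance(value, bool):
            out[name] = 1 if value else 0  # true/false -> 1/0
        elif isinstance(value, date):
            out[name] = value.strftime("%d.%m.%Y")  # dd.mm.yyyy
        else:
            out[name] = value
    return out


# Hypothetical row: these column names are illustrative only.
row = {
    "TwinID": 12345,
    "ReturnDate": date(2005, 3, 7),
    "Returned": True,
    "Comments": "free text",
}
exported = export_row(row, drop_columns={"Comments"})
```

The boolean check comes before the other checks because Python treats True and False as integers; testing for bool first keeps ordinary numeric columns unchanged.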

A convenient way of exporting these data files is to run a macro that has been saved for this purpose within the Access database. See the data files summary page and the 10 Year data files page for further information about the storage of these files.

Preparing raw web data

Data from the twin web tests were stored on the web server during the course of each cohort of the study (cohorts 1 and 2). For each test, the data for all twins that had completed the test were collected into an "analysis file" that was downloaded from the web server. These analysis files, one per web activity per cohort, were the original twin web data files. There were 16 such files, 8 per cohort (for the 8 activities PIAT, Maths, Vocabulary, Picture Completion, Raven, General Knowledge, Maths and Reading Questionnaire, and Author Recognition). Additionally, there were two family- or parent-based files per cohort: one containing the data from the brief parent web questionnaire (administered directly after consent), and a "family status" file containing data describing the status of the various web activities.

Subsequently, for each of these data files, the two cohort files were aggregated together, so that there is now a single file for each web activity. At the same time, the data from the Maths and Reading Questionnaire were merged into a single file with the data from the Author Recognition task; and the data from the web status file were merged into a single file with the data from the parent web questionnaire. Hence there are 8 aggregated web data files in total. These files contain too many variables to be conveniently imported into Access database tables alongside the admin and questionnaire data. Instead, they are stored separately as csv text files. When the original files were aggregated in this way, identifying fields other than IDs (e.g. names) were removed.
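
As a rough illustration of the aggregation described above, the following Python sketch stacks the rows of two cohort files into one list and removes identifying fields. It is not the procedure that was actually used; the function name aggregate_cohorts and the column names TwinID, FirstName and score are hypothetical.

```python
import csv
import io


def aggregate_cohorts(cohort_files, identifying_fields):
    """Stack rows from several cohort files, dropping identifying fields."""
    rows = []
    for handle in cohort_files:
        for record in csv.DictReader(handle):
            rows.append(
                {k: v for k, v in record.items() if k not in identifying_fields}
            )
    return rows


# Two tiny in-memory "cohort files" with hypothetical columns.
cohort1 = io.StringIO("TwinID,FirstName,score\n1011,Ann,12\n")
cohort2 = io.StringIO("TwinID,FirstName,score\n2021,Bob,15\n")
combined = aggregate_cohorts([cohort1, cohort2], identifying_fields={"FirstName"})
```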

Processing by scripts

Having exported the raw data as above, a new version of the dataset is made by running the scripts described below. The scripts must be run strictly in sequence. To run each script, simply open it in SPSS, select all the text, and click on the Run icon.

Script 1: Merging raw data sources

The main purpose of this script (filename J1_merge.sps) is to merge various raw data files together, so as to create a basic dataset with one row of data per twin. The script also carries out some simple processing of the raw data, such as recoding and renaming the raw item variables, and creating some essential derived variables including scrambled IDs. The script carries out these tasks in order:

  1. There are 2 files of family-based raw data: firstly the file containing parent web questionnaire data plus web family status data, and secondly the 10 year admin data (containing teacher return dates). These raw data files start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of family identifier FamilyID
    3. Where applicable, recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Drop raw data variables that are not to be retained in the datasets.
    6. Save as an SPSS data file.
  2. Merge the 2 files of family-based data together using FamilyID as the key variable.
  3. Double enter twin-specific items in the family-based data as follows:
    1. Compute twin identifier atempid2 for the elder twin by appending 1 to the FamilyID. Compute the Random variable. Save as the elder twin part of the family data.
    2. Re-compute atempid2 for the younger twin by appending 2 to the FamilyID. Reverse the values of the Random variable. Swap over elder and younger twin values in any twin-specific variables in the family data (do this by renaming variables). Save as the younger twin part of the family data.
    3. Combine the elder and younger twin parts together by adding cases. Sort in ascending order of atempid2 and save as the double entered family data file.
  4. There are 9 files of twin-based raw data: the admin data file containing twin IDs and birth orders; the file of teacher questionnaire data; and the 7 files containing twin web data from the 7 web activities (PIAT, Maths, Ravens Matrices, Vocabulary, Picture Completion, General Knowledge, and the Maths and Reading Questionnaire which also includes the Author Recognition task). These raw data files start in csv format. For each of these files in turn, carry out the following actions:
    1. Import into SPSS
    2. Sort in ascending order of twin identifier TwinID
    3. In the teacher questionnaire file, recode default values of -99 (missing) and -77 (not applicable) to SPSS "system missing" values
    4. For each variable, change the name, set the displayed width and number of decimal places, and set the SPSS variable level (nominal/ordinal/scale)
    5. Save as an SPSS data file.
  5. Using TwinID as the key variable, merge together all 9 twin data files into a single file. Derive a twin web data flag to indicate whether data are present from at least one of the web activities.
  6. Double enter twin web and teacher data flags, as follows:
    1. Compute the alternative twin identifier atempid2 as the FamilyID followed by the twin order (1 or 2).
    2. Sort in ascending order of atempid2 and save as the twin 1 part.
    3. Change the flag variable names by changing the ending from 1 to 2. Change the values of atempid2 to match the co-twin (change the final digit from 1 to 2 or vice versa). Re-sort in ascending order of atempid2 and save with just the renamed variables as the twin 2 part.
    4. Merge the twin 1 and twin 2 parts using atempid2 as the key variable. The double entered data flags can now be used to select twin pairs having data.
    5. Sort in ascending order of atempid2 and save.
  7. Merge this twin data file with the double entered family data file, using atempid2 as the key variable. This dataset now contains all the raw data.
  8. Use the parent data flag, and the double entered twin and teacher data flags, to filter the dataset and delete any cases without any 10 Year data. Add the overall 10 Year data flag variable jtenyear.
  9. Transform the family and twin IDs; the algorithm for scrambling IDs is described on another page.
  10. Sort in ascending order of scrambled twin ID id_twin.
  11. Save the file, dropping the raw ID variables.
  12. Merge in essential background variables, from a separate reference dataset, using id_twin as the key variable. These include twin birth dates (for deriving ages), 1st Contact reference variables, twin sexes and zygosities, medical exclusions and overall exclusion variables, all of which are already double entered where appropriate.
  13. Use variable jtenyear to filter the dataset and delete cases added from the reference dataset that do not have 10 Year data.
  14. Save a working SPSS data file ready for the next script (filename j1merge in the \working files\ subdirectory).
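
The double entry of family-based data (step 3 above) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a translation of J1_merge.sps: it assumes integer FamilyID values, a Random variable coded 0/1 (so that "reversing" means 1 minus the value), and twin-specific variables named with a stem plus 1 for the elder twin and 2 for the younger twin. The function name double_enter_family and the variable stem login are hypothetical.

```python
import random


def double_enter_family(families, twin_stems, seed=0):
    """Return two rows per family (elder, younger), sorted by atempid2."""
    rng = random.Random(seed)
    rows = []
    for fam in families:
        rand = rng.randint(0, 1)  # assumed 0/1 coding of the Random variable
        # Elder twin part: FamilyID with 1 appended.
        elder = dict(fam, atempid2=fam["FamilyID"] * 10 + 1, random=rand)
        # Younger twin part: FamilyID with 2 appended, Random reversed.
        younger = dict(fam, atempid2=fam["FamilyID"] * 10 + 2, random=1 - rand)
        for stem in twin_stems:
            # Swap elder and younger values (the equivalent of renaming).
            younger[stem + "1"] = fam[stem + "2"]
            younger[stem + "2"] = fam[stem + "1"]
        rows += [elder, younger]
    return sorted(rows, key=lambda r: r["atempid2"])


families = [
    {"FamilyID": 202, "login1": "a", "login2": "b"},
    {"FamilyID": 101, "login1": "c", "login2": "d"},
]
family_rows = double_enter_family(families, ["login"])
```

After double entry, the variable ending in 1 always refers to the twin identified by atempid2 on that row, and the variable ending in 2 refers to the co-twin.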

Script 2: Recoding the web item data

The purpose of this script (filename J2_recode.sps) is to make the web item data easier to use, particularly by recoding missing item responses and item scores; and by identifying and excluding probable random responders. The recoding procedures are designed to be consistent with those used in the 16 year web study. The tasks carried out in this script are necessary for some of the derivations of new variables in the next script.

  1. Open the data file saved at the end of the previous script.
  2. Derive variables to measure the variability in a twin's responses, in each web test that has a uniform response format across all items. Although these are temporary variables, designed for use in the next script, it is necessary to compute them here before the item response re-coding that follows.
  3. For each of the 6 cognitive web tests (PIAT, Maths, Ravens, Vocabulary, Picture Completion, General Knowledge), where appropriate, carry out the following recoding stages:
    1. Identify the branching route followed by each twin in the test.
    2. For any item skipped due to upward branching, recode missing item responses to -3 (all 6 tests) and recode missing item scores to 1 (PIAT and Maths only; already done on the web server for other tests).
    3. Detect instances of twin tests in which branching errors have occurred, leading to items being erroneously skipped and credited. Where identified, mark these instances as test exclusions by changing the test status variable value from 2 to 3.
    4. For any item skipped due to a discontinue rule, recode missing item responses to -2, and recode missing item scores to 0 (all 6 tests).
    5. For any item forfeited due to a timeout rule (PIAT, Maths and Picture Completion only), recode item responses to -1 and item answer times to missing. (Item scores have already been set to 0 on the web server, hence they do not require recoding for this purpose.)
    6. Recode the few remaining missing item responses to -4, signifying that the item probably "crashed", i.e. the item malfunctioned during the web test (all 6 tests).
    7. Recode the few remaining missing item scores to 0.
    8. Identify instances of twin tests that appear to be seriously compromised by loss of data, due to repeated item timeouts and/or item crashes. Where identified, mark these instances as test exclusions by changing the test status variable value from 2 to 3.
  4. In the Author Recognition test data, recode missing item responses and scores to 0. (All items were presented on one web page; items not selected have missing values in the raw data.)
  5. In all web activities where item times were recorded, recode outlying (very high) times to missing, so as not to distort mean times. This includes item response times, item download times, and (for PIAT only) item reading times. Anomalous outliers are thought to have occurred particularly when an item was started just before midnight and completed just after midnight.
  6. Derive the mean item answer time for each web test.
  7. For twins who finished all of the web tests, derive the total time taken.
  8. Derive a total score for each of the twin web tests. For the Mathematics and Ravens tests, derive also sub-test category scores.
  9. In each web test, identify twins who appear to have answered randomly or without effort. The identification procedures are described in more detail in the web data cleaning page. Where identified, mark such instances of twin tests as exclusions by changing the test status variable value from 2 to 4.
  10. For any twin web test identified as an exclusion, either because of random responding (test status = 4) or because of loss of data (test status = 3), exclude the test data by recoding the test data flag (from 1 to 0) and all test items and scores (to missing).
  11. Re-compute the twin web data flag jcdata1 to take account of these new exclusions.
  12. Save a working SPSS data file ready for the next script (filename j2recode in the \working files\ subdirectory).
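
The recoding of missing item responses and scores (step 3 above) can be sketched in Python. This is a simplified illustration, not the SPSS syntax: it assumes that the sets of items affected by branching, discontinue rules and timeouts have already been identified for a given twin, and it applies the branching credit of 1 to any missing score (in the source this applies to PIAT and Maths only). The function name recode_items and its arguments are hypothetical; missing values are represented by None.

```python
def recode_items(responses, scores, branched, discontinued, timed_out):
    """Recode missing item responses and scores for one twin's test.

    branched, discontinued and timed_out are sets of item positions.
    """
    new_responses, new_scores = [], []
    for i, (resp, score) in enumerate(zip(responses, scores)):
        if i in branched:
            code, default_score = -3, 1  # skipped by upward branching: credited
        elif i in discontinued:
            code, default_score = -2, 0  # skipped by a discontinue rule
        else:
            code, default_score = -4, 0  # unexplained: item probably crashed
        if i in timed_out:
            resp = -1  # item forfeited due to a timeout rule
        elif resp is None:
            resp = code
        if score is None:
            score = default_score
        new_responses.append(resp)
        new_scores.append(score)
    return new_responses, new_scores


recoded_responses, recoded_scores = recode_items(
    responses=[None, 3, None, None, 2, None],
    scores=[None, 1, 0, None, 0, None],
    branched={0},
    discontinued={5},
    timed_out={2},
)
```

In this example the timed-out item keeps the score of 0 that it already had, matching the note above that timeout scores are set on the web server.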

Script 3: Deriving new variables

The purpose of this script (filename J3_derive.sps) is to derive scales, composites and other new variables from the item data. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Derive the ages of the twins when the twin web tests were started and when the teacher questionnaires were returned.
  3. For the PIAT, Maths, Ravens, Vocabulary and General Knowledge tests, derive "adjusted" test scores. First derive temporary "adjusted" item scores: for any item skipped due to a discontinue rule, the default score of 0 is replaced by the "chance" score that would be obtained, on average, by selecting an answer at random. The adjusted total score is then the sum of the adjusted item scores.
  4. Derive standardised cognitive and academic achievement composites as follows:
    1. Apply a filter (exclude1=0) to remove exclusions
    2. Standardise the necessary component items and scores
    3. Compute the mean of the appropriate standardised items/scores
    4. Standardise the mean, to make the final version of each composite
    5. Remove the filter
  5. Derive Maths and Literacy environment scales from items of the twin web questionnaire.
  6. Derive classroom environment scales from the CEQ and chaos items in the teacher questionnaire.
  7. Drop any temporary variables that have been used to derive the new variables. Date variables are dropped at this point, having been used to derive ages.
  8. Save a working SPSS data file ready for the next script (filename j3derive), dropping all temporary variables that were used in derivations.
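
Steps 3 and 4 above can be illustrated with a short Python sketch. It rests on several assumptions that are not stated on this page: that the chance score of an item is 1 divided by its number of answer options, that standardisation uses the sample standard deviation, and that excluded cases are left missing (None) in the composite. The function names adjusted_total, zscores and composite are hypothetical.

```python
from statistics import mean, stdev


def adjusted_total(scores, responses, n_options):
    """Sum item scores, giving discontinued items (response -2) the chance score."""
    total = 0.0
    for score, response, options in zip(scores, responses, n_options):
        total += 1 / options if response == -2 else score
    return total


def zscores(values, keep):
    """Standardise values using only the kept, non-missing cases."""
    used = [v for v, k in zip(values, keep) if k and v is not None]
    m, s = mean(used), stdev(used)
    return [
        (v - m) / s if k and v is not None else None
        for v, k in zip(values, keep)
    ]


def composite(components, exclude):
    """Standardise components, average them, then standardise the mean."""
    keep = [e == 0 for e in exclude]  # the filter: exclusion variable equal to 0
    standardised = [zscores(c, keep) for c in components]
    means = []
    for case in zip(*standardised):
        present = [v for v in case if v is not None]
        means.append(mean(present) if present else None)
    return zscores(means, keep)


adjusted = adjusted_total(
    scores=[1, 0, 0, 0], responses=[2, 3, -2, -2], n_options=[4, 4, 4, 5]
)
comp = composite(
    components=[[1, 2, 3, 4, 100], [2, 4, 6, 8, 0]],
    exclude=[0, 0, 0, 0, 1],
)
```

Standardising a second time, after averaging, gives the final composite a mean of 0 and a standard deviation of 1 among the non-excluded cases.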

Script 4: Labelling the variables

The purpose of this script (filename J4_label.sps) is to label all the variables, and to add value labels where appropriate. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Label all the variables.
  3. Add value labels to all integer-valued categorical variables having 3 or more categories.
  4. Save a working SPSS data file ready for the next script (filename j4label).

Script 5: Double entering the twin data

The purpose of this script (filename J5_double.sps) is to double-enter all the twin-specific data in the dataset. Note that twin-specific item variables from the family admin data are already correctly double-entered at this stage (this was achieved in script 1). The variables to be double entered in the current script are all items from the teacher questionnaire and the twin web tests. The script carries out these tasks in order:

  1. Open the data file saved at the end of the previous script.
  2. Create the twin 2 part (for the co-twin) as follows:
    1. Rename the appropriate item and derived variables (from the twin web tests and teacher questionnaires) by changing the suffix from 1 to 2.
    2. Modify the id_twin values so they will match the co-twin (change the final digit from 1 to 2 or vice versa).
    3. Re-sort in ascending order of id_twin and save as the twin 2 part, keeping only the renamed variables.
  3. Re-open the data file saved at the end of the previous script: this already serves as the twin 1 part of the dataset.
  4. Merge in the twin 2 part, using id_twin as the key variable. The dataset is now double entered.
  5. Place the dataset variables into a logical and systematic order (do this using a KEEP statement when saving the dataset).
  6. Save an SPSS data file (filename j5double in the \working files\ subdirectory).
  7. Save another copy as the full 10 Year dataset, with filename Jdoub945.
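
The double entry procedure described above can be sketched in Python. This is a minimal illustration, not the content of J5_double.sps: it assumes integer id_twin values ending in 1 or 2, and twin-specific variable names ending in 1. The function names co_twin_id and double_enter, and the variable name jscore1, are hypothetical.

```python
def co_twin_id(id_twin):
    """Change the final digit of the ID from 1 to 2, or vice versa."""
    base, order = divmod(id_twin, 10)
    return base * 10 + (3 - order)


def double_enter(rows, twin_vars):
    """Add each co-twin's values as variables ending in 2."""
    # Twin 2 part: renamed variables, keyed by the ID of the co-twin.
    twin2_part = {
        co_twin_id(row["id_twin"]): {v[:-1] + "2": row[v] for v in twin_vars}
        for row in rows
    }
    # Used when the co-twin has no row in the dataset.
    missing = {v[:-1] + "2": None for v in twin_vars}
    merged = []
    for row in sorted(rows, key=lambda r: r["id_twin"]):
        merged.append({**row, **twin2_part.get(row["id_twin"], missing)})
    return merged


rows = [
    {"id_twin": 5012, "jscore1": 7},
    {"id_twin": 5011, "jscore1": 9},
    {"id_twin": 6021, "jscore1": 4},
]
double_entered = double_enter(rows, ["jscore1"])
```

Each row of the result holds the twin's own value in the variable ending in 1 and the co-twin's value in the variable ending in 2, so that within-pair analyses can be run from either twin's row.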