TEDS Data Dictionary

Cleaning Raw Questionnaire Data

Contents of this page:

  • Introduction
  • Data entered manually at the keyboard
  • Optically scanned data

Introduction

This page describes data cleaning methods that have been applied to the TEDS booklet or questionnaire data, across all studies involving paper booklets (cleaning of web study data is described on another page). "Data cleaning" refers to the process of identifying and correcting (or removing) unclean data. Unclean data contain errors or anomalies that make them in some way unfit for use in analysis. Examples of unclean data are variables with invalid or apparently nonsensical values, duplicated rows of data, rows of data for unidentifiable individuals, and rows of 'blank' data. Unclean data may arise when participants record inappropriate data (usually accidentally, but potentially maliciously), when data are copied from the original paper booklet to an electronic record (data entry), or during some other data processing stage such as data transfer.

The process of data cleaning overlaps to some extent with the process of quality control during data entry (see the data entry page for details). This page describes general methods of checking and error-correction that were used both during quality control and subsequently in further attempts to clean the data. The type of data entry (generally either manual keying or optical scanning) determines to some extent the types of error that may occur and the types of data cleaning that might be needed. This page is therefore divided into sections relating to each of the main data entry methods that have been used.

In all studies, after thorough checking, the original uncleaned raw data files have been replaced by the cleaned and aggregated raw data files; the data files are discussed in detail on another page. The analysis datasets are constructed from the cleaned raw data. Cleaning of web data files is described further on the web data cleaning page.

While the general processes of data cleaning have been documented (as on this page), it has proved impossible to keep detailed records of every individual data cleaning action, many of which have been made by hand on a case-by-case basis.

Data entered manually at the keyboard

Data entered by NOP Numbers

Most of the data entry work in the early TEDS studies was contracted out to NOP Numbers, a commercial company. The booklets returned to TEDS were transported in batches to the NOP Numbers premises, where they were entered by manual keying using their customised software. The results from each batch of data entry were returned in formatted Excel spreadsheets, which have been retained as original raw data files.

NOP Numbers entered the following data:

  • 1st Contact booklets, for most of the 1995 and 1996 cohorts
  • 2 Year booklets, for the 1995 cohort
  • 3 Year booklets (all cohorts)
  • 4 Year booklets (all cohorts)
  • 7 Year interviews and booklets, for cohort 1

The data entry of the booklets (at least up to age 4) at NOP Numbers incorporated some quality control measures to ensure clean data entry. This quote is from the manager who oversaw data entry for most of the TEDS booklet data:

From memory the error rate we had at NOP Numbers was 0.2% for market research questionnaires (i.e. with column numbers). That is 2 key strokes in 1000. My recollection is that the TEDS data was slightly higher than this due to the free format and alpha work rather than purely numeric.

Computer checks were as follows:

  • The Data Entry manager set up a complete template of the job. This entails putting in checks for every column punched on the document. i.e. for gender that only a punch 1 or 2 was allowed and that it could be single punched only also.
  • After initial data entry 10% of all documents were 100% verified to check for systematic errors or particular problems that individuals may have. If any were encountered these were rectified in the verified data and the individual coached through the document again.
  • We had 1 Data Entry Manager, 1 Data Entry Deputy Manager and 3 supervisors for 25 hourly paid staff.

Some notes remaining from correspondence between TEDS and NOP Numbers during the 3 Year study show that corrections were made to some invalid numeric codes in categorical data items. There were also checks on anomalous dates, where the year recorded in one part of a booklet did not match the year recorded elsewhere in the same booklet. The notes also provided evidence of some checking of family and twin IDs. These notes demonstrate that anomalies discovered at TEDS in the raw data were fed back to NOP Numbers in order to improve the data entry process.

There is no record of measurements of the accuracy of the TEDS data entered at NOP, either for the booklets or for the 7 Year telephone interviews. Once the files of entered data were returned from NOP to TEDS, some data cleaning was clearly done, although contemporary records of this are very limited. It is apparent that some duplicates were removed (arising if the same booklet was entered twice, or if a family returned two booklets). Also, some invalid IDs were either removed or corrected - part of the checking process involved comparison of IDs, parent names and twin names from the booklets against those recorded in the TEDS admin database.

The data entered by NOP have more recently been subjected to thorough, basic cleaning checks as listed below. These checks and corrections were carried out directly on the cleaned raw data (stored in Access databases), by means of database queries; the original (uncleaned) raw data files have been left unaltered. Where data problems were found, they were corrected if possible (this was sometimes possible with IDs and dates, for example); otherwise, invalid data values were removed and replaced with the value -99, signifying missing data. These are the types of discrepancies that were checked (a sketch of such checks follows the list):

  • Invalid or duplicated family or twin IDs
  • Invalid values in numerically-coded categorical data
  • Out-of-range or infeasible values in quantitative data
  • Obviously invalid dates, e.g. a date of 31st April, or a booklet dated earlier than it was sent out
  • Missing responses that were not correctly encoded with the value -99
  • 'Not applicable' responses that were not correctly encoded with the value -77
  • Discontinue rules not correctly enforced in the 7 Year twin telephone tests
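
The following Python sketch is purely illustrative of the kinds of checks listed above; the actual checks were made with database queries on the Access databases, and the column names, item codes and ranges shown here (other than -99 and -77) are hypothetical.

    # Illustrative sketch only: the real checks were made with database queries
    # in Access. The column names, valid codes and ranges below are hypothetical,
    # apart from the TEDS codes -99 (missing) and -77 (not applicable).
    import pandas as pd

    MISSING = -99
    NOT_APPLICABLE = -77

    def basic_checks(df: pd.DataFrame) -> pd.DataFrame:
        # Duplicated family IDs (hypothetical column 'FamilyID') are flagged
        # for manual resolution rather than dropped automatically
        dups = df[df.duplicated(subset="FamilyID", keep=False)]
        if not dups.empty:
            print(f"{len(dups)} rows share a FamilyID - resolve by hand")

        # Invalid codes in a numerically-coded categorical item
        # (hypothetical item 'parmar' with valid codes 1-4)
        bad = ~df["parmar"].isin([1, 2, 3, 4, MISSING, NOT_APPLICABLE])
        df.loc[bad, "parmar"] = MISSING

        # Out-of-range values in a quantitative item (hypothetical height in cm)
        valid = df["height"].between(40, 220) | df["height"].isin([MISSING])
        df.loc[~valid, "height"] = MISSING

        # Obviously invalid dates, e.g. 31st April: unparseable dates become NaT
        dates = pd.to_datetime(df["booklet_date"], errors="coerce", dayfirst=True)
        print(f"{dates.isna().sum()} rows have impossible or missing dates")

        return df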

Some of the parent booklet data in these studies were collected purely for admin purposes, for example family contact details, sibling details, and contact details for friends and relatives (collected for tracing purposes). After data entry, these data were separated from the analysis data and incorporated into the TEDS admin database. There is no record of how these data might have been cleaned. Over the course of time, these admin data have continually been updated after new contacts with families, so it is impossible to trace any changes made in the data at the time of collection. The original uncleaned versions of these data do still exist in the raw data files, however.

Data entered in TEDS

The data entered by TEDS staff on TEDS premises, by manual keying, includes the following (see the data entry page for a more detailed list):

  • 1st Contact booklets, for the 1994 and early 1995 cohorts
  • 2 Year booklets, for the 1994 cohort
  • 4 Year in home visit data
  • 12 and 14 Year twin booklets, and parent NC/SLQ questionnaires
  • 16 year GCSE/exam results
  • 18 year questionnaires including A-level results
  • TEDS21 questionnaires, paper versions
  • TEDS26 questionnaires, paper versions

Late returns of booklets/questionnaires have often been entered in TEDS, if they were returned too late to be entered either by NOP (see above) or by scanning (see below). Usually, these late returns were entered directly into the Access databases containing the aggregated cleaned data (whereas the bulk of the data may have been entered into separate files first).

In all TEDS studies, family background and administrative data have been manually entered into the TEDS admin database. This includes the following items, which are routinely incorporated into the datasets from current admin data:

  • Twin sexes and birthdates (updated from family reports)
  • Twin zygosities (updated from DNA tests)
  • Medical exclusions (updated from parent reports of twin medical conditions)
  • Return dates for booklets, questionnaires and other data
  • Records of problems affecting the quality of twin tests (phone or web)

Generally, manual data entry in TEDS is done using 'forms' in Access databases. As described in the data entry page, a suitably designed Access database can help to control the accuracy of entered data in a number of ways. This usually means that very little data cleaning is required. For example, there should be no duplicated or invalid IDs, and no invalid or out-of-range data item values. Appropriate data values such as -99 (missing) and -77 (not applicable) can be inserted automatically or by default where needed.
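
As a rough picture of how such form-level controls work, the sketch below expresses a few entry rules in Python; it is not the actual Access configuration, and the field names, codes and ranges are invented for illustration.

    # Hypothetical rendering of the kind of entry rules an Access form can
    # enforce; the field names, codes and ranges are invented.
    RULES = {
        "twinsex":   {"allowed": {1, 2, -99}},        # single-choice item
        "zygosity":  {"allowed": {1, 2, 3, -99}},     # coded categorical item
        "birthyear": {"min": 1994, "max": 1996},      # numeric range check
    }

    def accept(field: str, value: int) -> bool:
        """Return True if the keyed value would be accepted by the entry form."""
        rule = RULES.get(field, {})
        if "allowed" in rule:
            return value in rule["allowed"]
        if "min" in rule:
            return rule["min"] <= value <= rule["max"]
        return True   # fields without an explicit rule accept any value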

In the TEDS26 study, where paper questionnaires were returned by the twins, these were entered by TEDS staff into the same Qualtrics system that was used by the twins for the web version of the questionnaire. This ensured that all questionnaire data were aggregated together in the data file downloaded from Qualtrics.

For some of the older data entered in TEDS (1st Contact, 2 Year and In Home), there were apparently fewer controls on accuracy at the time of data entry, and checking revealed some anomalies that required cleaning. The checks described above, for data entered by NOP, were all repeated for these data. In addition, the following corrections were made:

  • Removal of any identifying information
  • Corrections to data flags, which indicate the presence or absence of particular sections of data
  • Removal of redundant or temporary data (for example, temporary tables used during coding) that had been retained in the original Access databases
  • Removal of derived variables (these are now derived in the datasets, not in the raw data)
  • Corrections to some field/variable names, for greater compatibility with other software (e.g. removal of spaces, removal of numeric digits at the start of names); a sketch of such renaming follows this list
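
The last kind of correction can be pictured with a small renaming helper like the one below; this is a hypothetical illustration in Python, not the method actually used on the Access databases.

    import re

    def sanitise_name(name: str) -> str:
        # Replace spaces, which many packages do not allow in variable names
        name = name.replace(" ", "_")
        # Prefix names that start with a digit, which SPSS and others reject
        if re.match(r"^\d", name):
            name = "v" + name
        return name

    # e.g. sanitise_name("1st contact date") returns "v1st_contact_date"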

Optically scanned data

Since the 7 Year study, optical scanning has routinely been used for entering bulk quantities of booklets/questionnaires, where appropriate. The following have been scanned (by Group Sigma unless stated otherwise):

  • 7 Year parent booklets and twin score sheets (cohort 2 onwards)
  • 7 Year teacher questionnaires: scanned by NOP Numbers in cohort 2, then subsequently by Group Sigma
  • 8 Year CAST questionnaire
  • 9 Year booklets (parent, twin and teacher)
  • 10 Year teacher questionnaires
  • 12 Year parent and teacher booklets
  • 14 Year parent booklets
  • 16 Year behaviour/LEAP study parent and twin booklets
  • TEDS21 parent questionnaire (paper version)

The data scanned by Group Sigma were generally dealt with in large batches. Returned raw data files were checked for anomalies and errors as soon as possible, while the paper copies were still accessible. This means that, in addition to cleaning the data, the data entry itself could sometimes be corrected or improved. The checks were carried out by importing the raw text files into SPSS, then running scripts to check for various types of anomalies. Further scripts were used to aggregate the batches of data into a single file for each booklet, and Excel files were saved for purposes of ID checking. The following types of discrepancies were checked (a sketch of the aggregation and checks follows the list):

  • Invalid or missing IDs. To check validity, IDs were compared with a master list, and names were checked in case of doubt.
  • Duplicated data. This can occur when a single booklet is scanned twice, or when a subject returns more than one copy of a booklet. In the latter case, the rule is generally to keep the first copy returned.
  • Invalid data values. These are corrected where possible (by referring to the original paper booklet), or else recoded to missing.
  • Out of range values. These may be found in items such as dates, and heights and weights. They can often be corrected by referring to the original paper copy, unless the subject has written down a nonsensical value.
  • Entire booklets without any recorded responses. These are deleted from the data.
  • Discontinue rules in the 7 Year twin score sheets (for the telephone tests). These were corrected if not properly enforced by the tester.
  • Inconsistencies in multi-part questions. There were many of these in the 7 and 9 Year parent booklets, often of the type 'if yes, then ...'. These inconsistencies were corrected where possible, using the values -77 (not applicable) and -99 (missing) where appropriate.
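
A hedged Python sketch of the batch aggregation and of the blank-booklet, duplicate and ID checks is given below; the actual processing used SPSS scripts, and the file pattern, column names and return-date field here are hypothetical.

    # Illustrative only: the actual aggregation and checks used SPSS scripts.
    # The file pattern, column names and return-date field are hypothetical.
    import glob
    import pandas as pd

    def aggregate_batches(pattern: str = "batch_*.txt") -> pd.DataFrame:
        # Combine all scanned batches for one booklet into a single table
        batches = [pd.read_csv(f, sep="\t") for f in sorted(glob.glob(pattern))]
        df = pd.concat(batches, ignore_index=True)

        # Delete rows with no recorded responses at all (entirely blank booklets)
        item_cols = [c for c in df.columns if c not in ("TwinID", "ReturnDate")]
        df = df.dropna(subset=item_cols, how="all")

        # Where a subject returned more than one copy, keep the first one returned
        df = df.sort_values("ReturnDate").drop_duplicates(subset="TwinID", keep="first")
        return df

    def flag_unknown_ids(df: pd.DataFrame, master_ids: set) -> pd.DataFrame:
        # IDs absent from the master list are listed for checking against names
        return df[~df["TwinID"].isin(master_ids)]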