TEDS Data Dictionary

Data Files

Introduction

This page gives a general description of the various files used to store and process the phenotypic data for the TEDS datasets. For more details of the files used for a specific TEDS study, click one of the links on the left of this page.

The files involved in creating the TEDS phenotypic datasets fall into three categories:

  1. Raw data files
  2. Dataset files, used for analysis
  3. Syntax files (scripts), used to convert raw data into analysis datasets

These types of files, and their organisation in storage, are described in more detail below.

File storage

The primary copies of all the TEDS data files are securely stored within the KCL network, under KCL guidelines for storage of confidential data. For reasons of confidentiality and security, details of the network locations are not given here. The phenotypic data files, at least in their raw state, contain identifiable personal data relating to TEDS participants. Access to the file storage is therefore restricted to TEDS admin staff who require access for administering TEDS studies and processing the data. The file storage is specific to the TEDS study and is not accessible to other KCL staff.

Within the centralised TEDS file storage outlined above, all phenotypic data (raw data, scripts, datasets) are stored within a folder named \SYSTEM\.

Within the \SYSTEM\ folder, the files are organised in the following subfolders:

  • \SYSTEM\Rawdata\
    This folder contains the raw data collected during each of the main TEDS studies. These raw data have been cleaned and aggregated. Aggregation involved combining multiple files, for the same data collection, into a single file. Cleaning involved elimination of obvious data errors and anomalies, and elimination of unnecessary identifying variables. Original, uncleaned raw data files, such as those generated during initial phases of data entry, have not been retained.
  • \SYSTEM\Datasets\
    This folder contains the processed datasets, one per main TEDS study, containing variables ready for analysis. There are also sub-directories containing some of the working files that are produced during the creation of the analysis datasets. These are generally SPSS data files.
  • \SYSTEM\Scripts\
    This folder contains the syntax files (scripts) that are used to convert the raw data into the analysis dataset for each study.

Each of these three folders contains a set of sub-directories, one for each TEDS study: \1c\ (1st Contact), \2yr\ (2 Year), and so on. Each of these may contain further sub-directories, which are outlined below.
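The folder layout described above can be sketched in a few lines of Python. This is only an illustration: the network root is deliberately not given on this page, so the sketch uses the relative folder name SYSTEM, and the function name study_folders is hypothetical.

```python
from pathlib import PureWindowsPath

# Top-level folder and category names as given in the text above.
# The real network root is not documented here, so a relative path is used.
SYSTEM = PureWindowsPath("SYSTEM")
CATEGORIES = ("Rawdata", "Datasets", "Scripts")


def study_folders(study):
    """Return the three per-study folders for a study code such as '7yr'."""
    return {category: SYSTEM / category / study for category in CATEGORIES}


folders = study_folders("7yr")
for category, path in folders.items():
    print(f"{category}: {path}")
```

For the 7 Year study this lists SYSTEM\Rawdata\7yr, SYSTEM\Datasets\7yr and SYSTEM\Scripts\7yr.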

Backup copies of the phenotypic data files are stored separately on the TEDS archive server, located on the KCL network. This storage is devoted to TEDS data and is not shared; it is accessible only by TEDS admin staff. Files stored here include old versions of datasets, datasets merged and prepared for specific projects, and old scripts. This storage is also used for various backup files, including the genotypic data.

Raw data files

The raw data folder for each study (e.g. \SYSTEM\Rawdata\7yr\ for the 7 Year study) typically contains the following files and subfolders. For more specific details for a given study, follow the link at the top left of this page.

  • An Access database file (e.g. 7yr.accdb for the 7 Year study).
    This database contains the cleaned and aggregated raw data from booklets and questionnaires in the study. There are separate database tables for logically distinct data collections (e.g. parent, twin, teacher). Some larger questionnaires have data split between two or more tables, because an Access table can hold no more than 255 fields. In addition to the questionnaire data, the database typically also contains tables for admin data such as return dates. The Access database also contains queries and macros so that the data can be exported conveniently for building the analysis dataset.
  • \Export\ sub-directory.
    This folder contains csv files of cleaned raw data, which have been exported from the Access database. The scripts convert these files into the analysis dataset.
  • \web data files\ or similarly named subfolders, where appropriate.
    In studies involving data collection via the web, the raw data were not entered in the same way as for paper questionnaires, but were downloaded in large files from the web server. The original files have undergone cleaning and aggregation, but have not been added to the Access database because of their large size. The files stored here are typically delimited text files, with one file per web activity.
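The splitting of a large questionnaire across two or more Access tables, mentioned above, can be illustrated with a short Python sketch. It assumes that every table repeats a key field so that the parts can be rejoined later; the field names (id, item1, ...) and the function name split_fields are hypothetical and do not describe the actual TEDS tables.

```python
MAX_FIELDS = 255  # maximum number of fields in one Access table


def split_fields(fields, key, max_fields=MAX_FIELDS):
    """Split field names into groups of at most max_fields each.

    The key field is repeated at the start of every group (an assumption
    made here so that the resulting tables could be joined again).
    """
    others = [name for name in fields if name != key]
    step = max_fields - 1  # leave room for the key field in each group
    return [[key] + others[i:i + step] for i in range(0, len(others), step)]


# Hypothetical questionnaire with an ID and 600 item variables.
fields = ["id"] + [f"item{n}" for n in range(1, 601)]
tables = split_fields(fields, key="id")
print([len(table) for table in tables])  # [255, 255, 93]
```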

The Access files are maintained in the most recent Access format, and have been updated when newer versions have appeared. The contents of the tables have also been exported to the \Export\ subfolder in csv file format, providing further protection against future software compatibility issues.

The web data files, like the exported files from the database, are stored in delimited plain-text files, usually csv. Text files of this sort have the advantages that they can accommodate unlimited numbers of fields (variables), and that they are not dependent on proprietary software that might limit how they may be opened.
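As an illustration of this software independence, a delimited text file can be read with nothing more than the Python standard library. The sketch below is a minimal example under stated assumptions: the sample content and the variable names (id, item1, item2) are invented for illustration, and the function name read_delimited is hypothetical.

```python
import csv
import io


def read_delimited(text_stream, delimiter=","):
    """Read a delimited text stream into a list of dicts keyed by the header row."""
    return list(csv.DictReader(text_stream, delimiter=delimiter))


# Invented sample standing in for an exported csv file.
sample = io.StringIO("id,item1,item2\n1001,3,4\n1002,5,1\n")
rows = read_delimited(sample)
print(len(rows), rows[0]["item2"])
```

For a file using a different delimiter, such as a tab, the same function can be called with delimiter="\t". Note that all values are read as text and would need converting to numbers before analysis.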

Dataset files

The dataset folder for each study (e.g. \SYSTEM\Datasets\7yr\ for the 7 Year study) generally contains the following items. For more detail, follow the links at the top left of this page for each study.

  • The current version of the main analysis dataset for the study. This is an SPSS data file.
  • The \working files\ sub-directory.
    This folder contains various intermediate files created (by the scripts) during the processing of data to make the analysis dataset. These, too, are SPSS data files.

These dataset files are all constructed using SPSS scripts, and saved as SPSS data files, with the .sav file extension.

Each of these main dataset files contains all the available variables for the given study (or sometimes for a group of studies carried out before the next main study). Such dataset files are not distributed directly to researchers in this form. The dataset shared with a researcher will be tailored for their research project, containing only those variables needed for the planned analysis, and usually containing longitudinal data selected from more than one main study dataset. Furthermore, participant data in distributed datasets must not be identifiable, which generally involves transformation of the ID variables (see scrambling IDs).
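The general idea of tailoring a dataset can be sketched as follows. This is a hypothetical illustration only, not the procedure used by TEDS: the actual ID transformation is described on the scrambling IDs page and is not reproduced here. The sketch simply keeps the requested variables and swaps each original ID for an arbitrary new number; all names and values are invented.

```python
import random


def tailor_dataset(rows, keep_vars, id_var="id", seed=None):
    """Keep only the requested variables and replace each ID with a new arbitrary one.

    Illustrative only: random replacement IDs stand in for whatever
    transformation is really applied.
    """
    rng = random.Random(seed)
    original_ids = sorted({row[id_var] for row in rows})
    # Draw distinct six-digit replacement IDs, one per original ID.
    new_ids = rng.sample(range(100000, 1000000), len(original_ids))
    mapping = dict(zip(original_ids, new_ids))
    return [
        {id_var: mapping[row[id_var]],
         **{var: row[var] for var in keep_vars if var != id_var}}
        for row in rows
    ]


# Invented example rows with three item variables.
rows = [
    {"id": 1001, "item1": 3, "item2": 4, "item3": 0},
    {"id": 1002, "item1": 5, "item2": 1, "item3": 2},
]
shared = tailor_dataset(rows, keep_vars=["item1", "item2"], seed=1)
print(sorted(shared[0]))  # ['id', 'item1', 'item2']
```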

Syntax files

The scripts folder for each study (e.g. \SYSTEM\Scripts\7yr\ for the 7 Year study) contains a set of syntax files used to make the dataset.

Generally, the processing of the data involves very many lines of syntax, which is why the syntax has been split into several script files. The scripts must be run in the correct order, and usually the script file names contain numbering to make the order clear. Where possible, each script carries out a logically related set of processing steps, e.g. importing and merging data, labelling variables, creating scales, etc.
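The role of the numbering in the file names can be illustrated with a short Python sketch that puts a set of script names into run order. The file names below are hypothetical, as is the function name run_order; the sketch assumes that the first number in each name is the run-order number. Sorting on the number rather than on the raw text avoids placing script 10 before script 2.

```python
import re


def run_order(filenames):
    """Sort script file names by the first number appearing in each name.

    Names without any number are placed last, in alphabetical order.
    """
    def sort_key(name):
        match = re.search(r"\d+", name)
        number = int(match.group()) if match else float("inf")
        return (number, name)

    return sorted(filenames, key=sort_key)


# Hypothetical script names, deliberately out of order.
scripts = ["script10_scales.sps", "script2_label.sps", "script1_import.sps"]
print(run_order(scripts))
```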

The scripts are SPSS syntax files (with the .sps file extension). SPSS scripts are in fact plain text files, so they can be opened and read using a text editor such as Notepad. However, they must be saved with the correct file extension in order to be recognised by the SPSS software.

These scripts, like all the files described on this page, are accessible only to TEDS admin staff. They are not shared because they contain details of processing that sometimes relate to identifiability of the data. However, the processing details are broadly described in the "data processing" pages for each study, and parts of the syntax are published in the "derived variables" pages for each study, within this data dictionary.