The Components of GP2’s Fourth Data Release

By Hampton Leonard, Mike A. Nalls, Dan Vitale, Mathew Koretsky, Kristin Levine, Mary B. Makarious, Zih-Hua Fang, and Peter Heutink
Author(s)
  • Co-lead, Complex Disease Data Analysis

    Hampton Leonard, MS

    Data Tecnica International / National Institutes of Health | USA

    Hampton has a background in data science and machine learning, which she applies to large multi-omic datasets in the neurodegenerative disease space. She is passionate about investigating differences on both clinical and omic levels and how these differences can affect clinical trial outcomes.

  • Lead, Complex Disease Data Analysis

    Mike A. Nalls, PhD

    Data Tecnica International | USA

    Mike founded Data Tecnica in early 2017 after over a decade of experience in large dataset analytics and methods research in healthcare and other scientific fields. He has 350+ peer-reviewed publications (before the age of 40) in the field of applied statistics in large datasets, brain diseases, and genomics. He is a strong advocate of open science, collaboration, and transparency in ...

  • Working Group Participant

    Dan Vitale, MS

    Data Tecnica International | USA

    Dan is a data science consultant for Data Tecnica International, consulting primarily for the Laboratory of Neurogenetics at the National Institute on Aging of the National Institutes of Health. His work is focused on open science, automation and development of genetic analytic pipelines and software, and machine learning. 

  • Working Group Participant

    Mathew Koretsky, BSc

    National Institutes of Health | USA

    Mathew is a postbaccalaureate fellow at the Center for Alzheimer’s Disease and Related Dementias at the National Institutes of Health. His work focuses on the development of genetic data processing pipelines as well as applying machine learning and data science techniques to genomic datasets in the neurodegenerative disease space.

  • Working Group Participant

    Kristin Levine, MS

    Data Tecnica International | USA

    Kristin is a data scientist with Data Tecnica International, consulting primarily for the Center for Alzheimer's and Related Dementias (CARD) at the National Institutes of Health. A writer turned data scientist, she is passionate about open science, democratizing research tools, and making data as clear and accessible as possible.

  • Co-lead, Data and Code Dissemination

    Mary B. Makarious, BSc

    National Institutes of Health | USA

    Mary is a graduate student at the Laboratory of Neurogenetics (LNG) at the National Institutes of Health, National Institute on Aging under Drs. Andrew Singleton and Mike Nalls. She studied Bioinformatics and Neuroscience at Loyola University in Chicago, IL, USA before coming to work at LNG, where she has been for the past 2 years. Her work involves applying machine learning an...

  • Working Group Participant

    Zih-Hua Fang, PhD

    German Center for Neurodegenerative Diseases | Germany

    Zih-Hua completed her bachelor's in Taiwan, her PhD with the University of Wageningen and AgroParisTech in France, and her postdoc at the ETH Zurich. Her research interests and experience focus on animal breeding genetics and genomics. Zih-Hua has six years' experience in bioinformatics and statistical modeling and experience with short and long read whole genome sequencing.

  • Lead, Monogenic Data Analysis

    Peter Heutink, PhD

    German Center for Neurodegenerative Diseases | Germany

    Peter was trained as a molecular biologist at the University of Amsterdam and completed his PhD at the department of Clinical Genetics of the Erasmus Medical Center Rotterdam. In 1994, Peter began his own research group to focus on neurodegenerative diseases such as frontotemporal dementia and Parkinson’s disease. In 2003, he moved to the VU University Medical Center in Amste...

In February 2023, GP2 announced the fourth data release on the Terra platform in collaboration with AMP® PD. This release is packed full of new data and resources.

This release adds 2,583 new complex disease participants to the previous releases from the Complex and Monogenic Networks. The complex disease data now consists of a total of 17,485 genotyped participants (9,429 PD, 6,648 Controls, and 1,408 ‘Other’). The cohorts added to GP2 in this release are:

Genetically-determined ancestry of complex disease GP2 participants is broken into ten ancestry groups (the nine groups below plus a small number of Finnish Europeans); the table below details the genetically-determined ancestry of complex disease participants in this release that have passed quality control and been imputed. These numbers include samples from previous releases that have been reclustered using the new cluster file and gone through quality control along with the newly genotyped and shared samples unique to this current release.

Future data releases will continue to grow the diversity of participants available. You can check out our dashboard to see our progress.

An important update for this release is a minor change to our QC pipeline. To share more samples, we now remove genotyped variants that consistently underperform in quality before the data goes through the QC process. A list of these poorly performing variants can be found in the meta data directory of the Tier 2 buckets and on GitHub. We have also adopted a less stringent sample call rate threshold, lowering it from 0.98 to 0.95. As a result, more samples can be used for your analyses without sacrificing quality.
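To make the effect of the threshold change concrete, here is a minimal sketch in plain Python. It is not the GP2 pipeline: the function names, the toy genotypes, and the variant ID `var_bad` are all made up for illustration. It only shows the order of operations described above (drop the listed variants first, then apply the sample call rate threshold) and how a sample with a call rate between 0.95 and 0.98 is now retained.

```python
from typing import Dict, List, Optional, Set, Tuple

Genotype = Optional[int]  # alt-allele count (0, 1, 2); None means "no call"


def drop_variants(
    variant_ids: List[str],
    genotypes: Dict[str, List[Genotype]],
    excluded: Set[str],
) -> Tuple[List[str], Dict[str, List[Genotype]]]:
    """Remove excluded variants from every sample before sample-level QC."""
    keep = [i for i, vid in enumerate(variant_ids) if vid not in excluded]
    kept_ids = [variant_ids[i] for i in keep]
    kept = {sample: [calls[i] for i in keep] for sample, calls in genotypes.items()}
    return kept_ids, kept


def call_rate(calls: List[Genotype]) -> float:
    """Fraction of variants with a non-missing genotype call."""
    if not calls:
        return 0.0
    return sum(c is not None for c in calls) / len(calls)


def passing_samples(
    genotypes: Dict[str, List[Genotype]], min_call_rate: float = 0.95
) -> List[str]:
    """Samples whose call rate meets or exceeds the threshold."""
    return [s for s, calls in genotypes.items() if call_rate(calls) >= min_call_rate]


# Toy data (hypothetical): 40 ordinary variants plus one on the exclusion list.
variant_ids = [f"var{i}" for i in range(40)] + ["var_bad"]
genotypes = {
    "sample_A": [0] * 41,                     # every variant called
    "sample_B": [None] + [1] * 39 + [None],   # 1 miss among the 40 kept variants
    "sample_C": [None] * 4 + [2] * 36 + [0],  # 4 misses among the 40 kept variants
}

_, cleaned = drop_variants(variant_ids, genotypes, excluded={"var_bad"})
old_pass = passing_samples(cleaned, min_call_rate=0.98)
new_pass = passing_samples(cleaned, min_call_rate=0.95)
print("pass at 0.98:", old_pass)
print("pass at 0.95:", new_pass)
```

In this toy example, `sample_B` has a call rate of 39/40 = 0.975, so it fails the old 0.98 threshold but passes at 0.95, while `sample_C` (0.90) fails both.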

Another update for this release concerns the clinical data. The phenotype column (‘Phenotype’) has been updated to include only the labels ‘PD’, ‘Control’, or ‘Other’. An additional column, ‘other_pheno’, has been added to provide a more detailed diagnosis label for participants with an ‘Other’ label. Some of these more detailed diagnosis labels include unaffected and affected individuals via targeted PD genetic recruitment, SWEDDs, and prodromal PD, as well as other disorders such as essential tremor (ET), progressive supranuclear palsy (PSP), dementia with Lewy bodies (DLB), and multiple system atrophy (MSA).
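As a rough illustration of how the two columns fit together, the sketch below tallies participants by ‘Phenotype’ and then breaks the ‘Other’ group down by ‘other_pheno’. Only the two column names and the three top-level labels come from this post; the `sample_id` column, the rows, and the exact strings used in `other_pheno` are assumptions for the example, so check the README for the real values.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; only "Phenotype" and "other_pheno" are real column names.
clinical_csv = """sample_id,Phenotype,other_pheno
S1,PD,
S2,Control,
S3,Other,PSP
S4,Other,MSA
S5,Other,PSP
S6,PD,
"""

rows = list(csv.DictReader(io.StringIO(clinical_csv)))

# Top-level counts: every row should carry one of the three labels.
phenotype_counts = Counter(row["Phenotype"] for row in rows)

# Detailed diagnosis labels, restricted to participants labelled 'Other'.
other_breakdown = Counter(
    row["other_pheno"] for row in rows if row["Phenotype"] == "Other"
)

# Sanity check: flag any label outside the three expected values.
allowed = {"PD", "Control", "Other"}
unexpected = set(phenotype_counts) - allowed

print(dict(phenotype_counts))
print(dict(other_breakdown))
print("unexpected labels:", unexpected)
```

The same pattern applies to the real clinical file once it is loaded from the workspace; the sanity check is a cheap way to confirm a file follows the updated three-label scheme.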

Copy number variant (CNV) calls for all genotyped samples passing quality control (gene-level plus 250kb flanking regions) have been updated to include all samples in release 4. This data has been clustered using a custom GP2 genotype clustering file (available in the utils directories under both tier 1 and tier 2 data access). Both the cluster file and the pipeline used to predict the probabilistic CNV calls can be found on the GP2 GitHub for use with data outside of GP2. For more information regarding clustering using the custom GP2 genotype clustering file and the copy number variant probabilistic calls, please see “The Components of GP2’s Third Data Release” on the GP2 blog.
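For readers wondering what "gene-level plus 250kb flanking regions" means in coordinates, here is a small sketch. It assumes 1-based inclusive positions and a symmetric 250,000 bp flank on each side of the gene; the function names and the example coordinates are hypothetical and are not taken from the GP2 CNV pipeline.

```python
from typing import Tuple

FLANK_BP = 250_000  # 250 kb on each side of the gene


def cnv_window(gene_start: int, gene_end: int, flank: int = FLANK_BP) -> Tuple[int, int]:
    """Return the 1-based inclusive interval covering a gene plus its flanks."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    # Clip at position 1 so genes near the chromosome start stay in range.
    return max(1, gene_start - flank), gene_end + flank


def in_window(position: int, window: Tuple[int, int]) -> bool:
    """True if a variant position falls inside the window."""
    return window[0] <= position <= window[1]


# Hypothetical gene spanning 1,000,000-1,100,000 on some chromosome.
window = cnv_window(1_000_000, 1_100_000)
print(window)
print(in_window(800_000, window), in_window(1_400_000, window))
```

A variant at position 800,000 falls inside this window because of the upstream flank, while one at 1,400,000 lies beyond the downstream flank.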

More information on the structure of the complex disease genotype and clinical data is detailed in the blog post ‘The Components of GP2’s First Data Release’, as well as in the README, which has been updated for this release and is available on the official GP2 Terra workspaces. The monogenic PD WGS data is also detailed in the same README.