The Components of GP2’s Seventh Data Release

By Hampton Leonard, Mike A. Nalls, Dan Vitale, Mathew Koretsky, Kristin Levine, Mary B. Makarious, Lietsel Jones, Zih-Hua Fang, and J Solle
Author(s)
  • Co-lead, Complex Disease Data Analysis

    Hampton Leonard, MS

    Data Tecnica International / National Institutes of Health | USA

    Hampton has a background in data science and machine learning, which she applies to large multi-omic datasets in the neurodegenerative disease space. She is passionate about investigating differences on both clinical and omic levels and how these differences can affect clinical trial outcomes.

  • Lead, Complex Disease Data Analysis

    Mike A. Nalls, PhD

    Data Tecnica International | USA

    Mike founded Data Tecnica in early 2017 after over a decade of experience in large dataset analytics and methods research in healthcare and other scientific fields. He has 400+ peer-reviewed publications in the field of applied statistics in large datasets, brain diseases, and genomics. He is a strong advocate of open science, collaboration, and transparency in science.

  • Working Group Participant

    Dan Vitale, MS

    Data Tecnica International | USA

    Dan is a data science consultant for Data Tecnica, consulting primarily for the Laboratory of Neurogenetics and CARD at the National Institute on Aging of the National Institutes of Health. His work is focused on open science, automation and development of genetic analytic pipelines and software, and machine learning.

  • Working Group Participant

    Mathew Koretsky, BSc

    National Institutes of Health | USA

    Mat is a post-baccalaureate student at the National Institutes of Health. He is passionate about pipeline development and meaningful applications of computer science in the biomedical research space.

  • Working Group Participant

    Kristin Levine, MS

    Data Tecnica International | USA

    Kristin works with the Data Tecnica and National Institute on Aging (NIA) teams on data and code sharing plus real-world data analysis of biobanks and healthcare systems. She is also an accomplished writer, now applying her communication skills to scientific domains.

  • Co-lead, Data and Code Dissemination

    Mary B. Makarious, BSc

    National Institutes of Health | USA

    Mary is a graduate student participating in the NIH graduate partnership program in collaboration with the University College London. She is a rising star in biomedical data science, with a background in genomics, machine learning, and open science platforms. She is also passionate about increasing representation in research and empowering scientists to analyze their own data.

  • Lietsel Jones

    Data Tecnica International / National Institutes of Health | USA

    Lietsel is an analyst with Data Tecnica with a keen interest in the intersection between epidemiology and genetics. She is also a clinical data manager with GP2 working to collect and harmonize large clinical datasets from worldwide contributors.

  • Lead, Monogenic Data Analysis

    Zih-Hua Fang, PhD

    German Center for Neurodegenerative Diseases | Germany

    The lead of the monogenic data analysis efforts in GP2, they are making significant contributions to GP2’s efforts to study monogenic and familial Parkinson’s disease.

  • Co-lead, Operations and Compliance

    J Solle, MBA

    The Michael J. Fox Foundation | USA

    J is the implementation Program Lead for GP2, co-lead for the Operations & Compliance Working Group, and a member of the Operations Committee. J joined the Michael J. Fox Foundation in March 2021 and is the Director of Clinical Research, leading the implementation of GP2.

Overview

In April 2024, GP2 announced the seventh data release on the Terra and the Verily® Workbench platforms in collaboration with AMP® PD. This release includes more than 9,000 additional genotyped participants.

  • The genotype array data, including locally-restricted samples, now consists of a total of 54,180 genotyped participants (28,729 PD cases, 15,834 Controls, and 9,617 ‘Other’ phenotypes)
    • Excluding the locally-restricted samples, the data consists of 40,740 participants (20,507 PD cases, 11,841 Controls, and 8,392 ‘Other’ phenotypes)
  • 17,496 total individuals who have deep clinical phenotyping information also have matching genetic information
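
As a quick sanity check on the figures above, the per-phenotype counts can be summed and compared against the stated totals. The sketch below is only illustrative: the variable names are made up for this example, and the "locally-restricted" counts it derives are simple subtractions of the published numbers, not figures reported in the release itself.

```python
# Counts copied from the release summary above.
# Dictionary and variable names are hypothetical, chosen for this example only.
with_restricted = {"PD cases": 28_729, "Controls": 15_834, "Other": 9_617}
without_restricted = {"PD cases": 20_507, "Controls": 11_841, "Other": 8_392}

# Totals across phenotype groups.
total_with = sum(with_restricted.values())
total_without = sum(without_restricted.values())

# Derived by subtraction (an assumption of this sketch, not a reported figure):
# how many samples in each group are locally restricted.
restricted_only = {
    group: with_restricted[group] - without_restricted[group]
    for group in with_restricted
}
total_restricted = sum(restricted_only.values())

print(f"Total including locally-restricted samples: {total_with:,}")
print(f"Total excluding locally-restricted samples: {total_without:,}")
for group, count in restricted_only.items():
    print(f"Locally-restricted {group}: {count:,}")
print(f"Locally-restricted total: {total_restricted:,}")
```

Both sums agree with the totals quoted in the list, which is a useful check to repeat when comparing counts across releases.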

What’s New In This Release?

  • Additional genotyped and imputed samples
  • Additional clinical data for 4,911 individuals 

Locality-restricted GDPR samples via the Verily Viewpoint Workbench

We are continuing to pilot granting access to locally-restricted samples, otherwise known as samples governed by the General Data Protection Regulation (GDPR) policy, through our collaboration with the Verily Viewpoint Workbench.

At this time, as GP2 continues to roll out data sharing solutions for GDPR-protected data, release 7 data with regional restrictions will be available only to GP2 consortium members and partners. As testing and implementation continue in early 2024, this solution will become available to the broader research community. All release 7 samples can be found on the Verily Viewpoint Workbench (VWB), while all release 7 samples not governed by GDPR requirements can be found on the community workbench on Terra (like all previous releases). To gain access to the full release on VWB, you must:

  1. Have approved GP2 Tier 2 access
  2. Fill out the GDPR-governed sample request form 
  3. Be a GP2 consortium member (contributing cohort, GP2 partner, or project analyses team member)

Clinical Data

This release contains deep clinical phenotyping data for an additional 4,911 individuals. This information consists of:

  • Age at diagnosis and onset
  • Primary, current, and latest diagnoses
  • Cognitive exams such as the Mini-Mental State Examination (MMSE) and the Montreal Cognitive Assessment (MoCA)
  • Movement Disorder Society-Sponsored Revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS)
  • Detailed “other” phenotypes, such as Lewy body dementia (LBD)

In this release, each of the 17,496 individuals who have clinical information also has matching genetic information.

Individual-Level Data

We now capture the data from a total of 87 cohorts, 14 of which are new to this release. Please refer to the GP2 Cohort Dashboard for more information on the cohorts that have been shared.

Genetically-determined ancestry of array genotyped GP2 participants is broken into 11 ancestry groups; the table below details the genetically-determined ancestry of genotyped participants in this release that have passed quality control and been imputed. These numbers include samples from previous releases that have been reclustered using the new cluster file and gone through quality control along with the newly genotyped and shared samples unique to this current release.

Whole Genome Sequences called by DeepVariant-GLnexus

Additional GP2 sequencing data is expected to be released in Q3 2024.

Future data releases will continue to grow the diversity of participants available. You can check out our cohort dashboard to see our progress. Users who already have Tier 2 access can explore the data further on our cohort browser, which is described in more detail in a previous blog post.

As always, please refer to the README that accompanies each GP2 release for further details regarding pipelines, data, and analyses!