Introduction to Using Workspaces and GP2 Data

September 2, 2022

By GP2 Complex Disease Data Analysis Working Group

Graphic with a blue-to-green gradient background featuring the text: "How to Analyze GP2 Data on Terra" in bold white letters. Below the text, part of a computer screen is visible, displaying the "Welcome to Terra Community Workbench."

The data generated by both the Global Parkinson’s Genetics Program (GP2) and the Accelerating Medicines Partnership in Parkinson’s Disease (AMP® PD) are hosted on Terra, a cloud platform for bioinformatics developed by the Broad Institute in collaboration with Microsoft and Verily. Terra allows you to do analysis directly on the cloud, removing the need to download data to your personal computer. This allows us to adhere to high data protection and privacy standards and facilitates open science and reproducibility through structured shareable code and results.

Terra supports two types of analysis: Jupyter notebooks and WDL workflows. Notebooks are documents broken down into cells containing code snippets that allow you to run an analysis sequentially. Notebooks support a number of languages including Python, are useful for a range of analyses, and are great for visualizations and documentation. Eventually, you may find that you need more computing power or a longer run time for your analyses than a notebook can provide. In this case, you can use a Terra workflow instead. These are built using the WDL (pronounced “widdle”) language and are submitted just like batch jobs on any cloud server. Workflows are useful for high-power analyses such as alignment or GWAS with large sample sizes.

Example Notebook Structure

A screenshot of a Jupyter Notebook example, titled "Notebook Example." The first cell imports useful Python packages, specifically pandas as pd and numpy as np. The second cell defines a simple arithmetic operation x = 4 + 7 and prints the result, which is 11. The third cell prints the message "Notebooks are awesome for visualization and reproducibility of analyses."
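To make this concrete, here is a minimal sketch in Python of what the cells described above contain (taken directly from the screenshot; nothing beyond it is assumed):

# Cell 1: import useful Python packages
import pandas as pd
import numpy as np

# Cell 2: a simple arithmetic operation
x = 4 + 7
print(x)  # prints 11

# Cell 3: print a message
print("Notebooks are awesome for visualization and reproducibility of analyses")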

Example WDL Structure

A screenshot of a coding interface for a workflow titled "print-my-name." The script displayed is a simple example that defines a workflow called PrintName and a task called hello, which takes a first and last name as input and prints "Hello [first name] [last name]!"

When you get access to GP2 data, you also get access to the official GP2 Terra workspaces. Both the Tier 1 and Tier 2 workspaces will have the README file for the most recent GP2 data release. We highly recommend reading this as it provides explanations for all the files present in each release. Keep checking these workspaces often as more code resources, notebooks, and workflows related to GP2 data will be hosted there as they are developed.

A screenshot titled "Welcome to GP2 Tier 1!" provides instructions for setting up a workspace to access and analyze GP2 data. It outlines steps for creating a workspace, including cloning the workspace via the "snowman" menu (three dots in a circle) in the upper-right, selecting a billing project, and running a notebook titled "Getting Started with GP2 Tier 1 Data." The screenshot also mentions a GP2 Tier 1 GCP storage bucket and provides the storage path, as well as information about a notebook available for pulling data from the GCP bucket into the workspace.

To get started, you will need to submit a project idea to GP2 for review. Once you are approved, your institution will set up a billing project and workspace, or GP2 will set one up on your behalf. After the workspace is set up, you can start by making a new notebook for your analysis. To run a notebook, you will need to create a cloud environment. You can customize your cloud environment to suit your needs, but if you are planning on using all the currently available imputed complex disease GP2 data, make sure to request at least 100 GB of persistent disk space so you have enough room to work with the approximately 82 GB of data in release 2.0. You can always play around with the options to find what works best, but keep in mind that more resources come with higher costs.

Cloud Environment Example

A screenshot of a cloud environment configuration interface. It shows options for setting up a cloud compute profile, including selecting the number of CPUs, memory (GB), and whether to enable GPUs. It includes fields for a startup script and choosing the compute type (e.g., Standard VM). There is an option to enable autopause after a specified period of inactivity, and the user can select the location for the virtual machine. The interface also includes a section for configuring a persistent disk, where the user can specify the disk type and disk size in GB. Costs for running or paused cloud compute and disk storage are displayed at the top.

Once you have your cloud environment running and a new notebook set up, you are ready to begin your analysis. To set up your notebook, you may want to take advantage of some useful packages and variables provided below and on the GP2 official workspaces.

Useful Packages

A screenshot of a Python script for importing various packages and libraries. The code includes imports for system interaction (os, sys), data manipulation with Pandas (pandas), visualization with Seaborn (seaborn), NumPy for numerical operations (numpy), and Matplotlib for plotting (matplotlib). Additional imports include StringIO for handling string inputs, FireCloud API interaction (firecloud.api), and tools for working with Google Cloud services, such as bigquery for querying data and urllib.parse for building URLs. There are also imports for displaying HTML with IPython. The structure of the script sets up the environment for data analysis and visualization.
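As a rough sketch, the imports described above would look something like the following in Python (the exact aliases and module list in the official workspace notebooks may differ slightly):

# System interaction
import os
import sys

# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Handling string inputs as file-like objects
from io import StringIO

# FireCloud (Terra) API interaction
import firecloud.api as fapi

# Google Cloud services and URL building
from google.cloud import bigquery
import urllib.parse

# Displaying HTML output in the notebook
from IPython.display import display, HTML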

Environment Setup

A screenshot of a Python script for setting up billing project and workspace variables. The script uses environment variables to define the billing project ID, workspace namespace, workspace name, and workspace bucket. It assigns these variables using os.environ to access the corresponding environment settings: GOOGLE_PROJECT, WORKSPACE_NAMESPACE, WORKSPACE_NAME, and WORKSPACE_BUCKET. Additionally, the script fetches workspace attributes using fapi.get_workspace() and stores the result in the WORKSPACE_ATTRIBUTES variable, accessing the attributes in the JSON response. This setup is useful for managing cloud-based resources in a workspace environment.
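A sketch of that setup cell, reconstructed from the description above (the exact keys accessed in the fapi.get_workspace() JSON response are an assumption based on the standard FireCloud API response structure):

import os
import firecloud.api as fapi

# Billing project and workspace variables from the Terra environment
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

# Fetch workspace attributes via the FireCloud API
# (the 'workspace' -> 'attributes' path is an assumption about the JSON layout)
WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json()['workspace']['attributes']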

To look at the available data and move data from the cloud bucket into your workspace, you will use the gsutil command-line tool. For example, to list the data available in the gp2tier2 bucket, you can use a command like this:

A screenshot of a terminal command used to list the data available in the gp2tier2 Google Cloud Storage bucket. The command is: ! gsutil ls gs://gp2tier2/release2_06052022/ This command uses gsutil, a tool for interacting with Google Cloud Storage, to list the contents of the "gp2tier2" bucket, specifically within the "release2_06052022" directory.
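Typed out so you can copy it into a notebook cell (the leading "!" tells the notebook to run the line as a shell command):

! gsutil ls gs://gp2tier2/release2_06052022/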

This will list all the files available in the release 2 folder:

A screenshot of a terminal command output listing the contents of a Google Cloud Storage bucket directory, "gp2tier2/release2_06052022/". The listed directories and files include: README_release2_06052022.txt, clinical_data/, cnvs/ (copy number variants), imputed_genotypes/, meta_data/, raw_genotypes/, and summary_statistics/. Each line begins with "gs://gp2tier2/release2_06052022/", indicating the Google Cloud Storage path for each file or directory in this release of data.

You can change the path to look into each directory. For example, if you change the path to this:

A screenshot of a terminal command used to list the data available in a specific directory within a Google Cloud Storage bucket. The command is: ! gsutil ls gs://gp2tier2/release2_06052022/imputed_genotypes/ This command uses gsutil to list the contents of the "imputed_genotypes" directory within the "gp2tier2/release2_06052022" bucket.
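Typed out, the command is:

! gsutil ls gs://gp2tier2/release2_06052022/imputed_genotypes/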

It lists the folders for the imputed genotype data, separated by predicted ancestry:

A screenshot of a terminal command output listing the directories in the "imputed_genotypes" folder of the Google Cloud Storage bucket "gp2tier2/release2_06052022/". The directories include: AAC, AFR, AJ, AMR, CAS, EAS, EUR, and SAS. Each line starts with "gs://gp2tier2/release2_06052022/imputed_genotypes/" followed by the directory name.

Once you have decided on the files you need, you can copy them to your workspace like this:

First, make a directory to copy the files into.

A screenshot of a terminal command to create a new directory. The command is: ! mkdir gp2_data This command uses mkdir to create a new directory called "gp2_data." The background is a dark terminal window, with the command text displayed clearly.
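Typed out, the command is:

! mkdir gp2_data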

Then copy them into your new directory. In this example, we are copying over PD summary statistics from Nalls et al. 2019 without 23andMe data.

A screenshot of a terminal command used to copy a file from a Google Cloud Storage bucket to a local directory. The command is: ! gsutil cp gs://gp2tier2/release2_06052022/summary_statistics/META5_no23_with_rsid2.txt gp2_data/ This command uses gsutil cp to copy the file "META5_no23_with_rsid2.txt" from the "summary_statistics" directory in the Google Cloud Storage bucket to the local "gp2_data" directory. The file contains summary statistics from Nalls et al., 2019, excluding data from 23andMe.
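Typed out, the copy command is:

! gsutil cp gs://gp2tier2/release2_06052022/summary_statistics/META5_no23_with_rsid2.txt gp2_data/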

Check to make sure that everything was copied over correctly by taking a look at the file:

A screenshot of a terminal command used to check the contents of a file. The command is: ! head gp2_data/META5_no23_with_rsid2.txt This command uses head to display the first few lines of the file "META5_no23_with_rsid2.txt" located in the "gp2_data" directory, allowing a quick preview of its contents.

A screenshot showing the output of that command, displaying the first few lines of the data file. The table includes the following columns: MarkerName, Allele1, Allele2, Freq1, FreqSE, MinFreq, MaxFreq, Effect, StdErr, P-value, Direction, HetISq, HetChiSq, HetDf, HetPVal, and ID. The rows provide information on genetic markers, their allele frequencies, statistical effects, and p-values. Example rows: chr10:100000625 with allele1 "a" and allele2 "g," followed by frequency data, effect size, standard error, and p-value; and chr10:100000645 with allele1 "a" and allele2 "c," with similar data. This output represents summary statistics from a genetic study.
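Typed out, the preview command is below. If you then want to load the summary statistics into your notebook, a small pandas example is included as well (the tab separator is an assumption about the file's format and may need adjusting):

! head gp2_data/META5_no23_with_rsid2.txt

# Optional: load the summary statistics into a pandas DataFrame
# (sep='\t' is an assumption about the file's delimiter)
import pandas as pd
sumstats = pd.read_csv('gp2_data/META5_no23_with_rsid2.txt', sep='\t')
sumstats.head()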

Now your data is ready to be analyzed!


This was a short example of how to use workspaces and notebooks with GP2 data. For a more in-depth Terra tutorial, please see the GP2 Learning Management System and complete ‘Course 1: Using Terra to Access Data and Perform Analyses’.

Meet the authors

GP2 Complex Disease Data Analysis Working Group