Introduction to Using Workspaces and GP2 Data

September 2, 2022

By GP2 Complex Disease Data Analysis Working Group

Graphic with a blue-to-green gradient background featuring the text: "How to Analyze GP2 Data on Terra" in bold white letters. Below the text, part of a computer screen is visible, displaying the "Welcome to Terra Community Workbench."

The data generated by both the Global Parkinson’s Genetics Program (GP2) and the Accelerating Medicines Partnership in Parkinson’s Disease (AMP® PD) are hosted on Terra, a cloud platform for bioinformatics developed by the Broad Institute in collaboration with Microsoft and Verily. Terra allows you to do analysis directly on the cloud, removing the need to download data to your personal computer. This allows us to adhere to high data protection and privacy standards and facilitates open science and reproducibility through structured shareable code and results.

Terra supports two types of analysis: Jupyter notebooks and WDL workflows. Notebooks are documents broken down into cells containing code snippets that allow you to run an analysis sequentially. Notebooks support a number of languages including Python, are useful for a range of analyses, and are great for visualizations and documentation. Eventually, you may find that you need more computing power or a longer run time for your analyses than a notebook can provide. In this case, you can use a Terra workflow instead. These are built using the WDL (pronounced “widdle”) language and are submitted just like batch jobs on any cloud server. Workflows are useful for high-power analyses such as alignment or GWAS with large sample sizes.

Example Notebook Structure

A screenshot of a Jupyter Notebook example, titled "Notebook Example." The first cell imports useful Python packages, specifically pandas as pd and numpy as np. The second cell defines a simple arithmetic operation x = 4 + 7 and prints the result, which is 11. The third cell prints the message "Notebooks are awesome for visualization and reproducibility of analyses."
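To make this concrete, here is a minimal sketch in Python of what the cells described above contain (taken directly from the screenshot; nothing beyond it is assumed):

# Cell 1: import useful Python packages
import pandas as pd
import numpy as np

# Cell 2: a simple arithmetic operation
x = 4 + 7
print(x)  # prints 11

# Cell 3: print a message
print("Notebooks are awesome for visualization and reproducibility of analyses")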

Example WDL Structure

A screenshot of a coding interface for a workflow titled "print-my-name." The script displayed is a simple example that defines a workflow called PrintName and a task called hello, which takes a first and last name as input and prints "Hello [first name] [last name]!"

When you get access to GP2 data, you also get access to the official GP2 Terra workspaces. Both the Tier 1 and Tier 2 workspaces will have the README file for the most recent GP2 data release. We highly recommend reading this as it provides explanations for all the files present in each release. Keep checking these workspaces often as more code resources, notebooks, and workflows related to GP2 data will be hosted there as they are developed.

A screenshot titled "Welcome to GP2 Tier 1!" provides instructions for setting up a workspace to access and analyze GP2 data. It outlines steps for creating a workspace, including cloning the workspace via the "snowman" menu (three dots in a circle) in the upper-right, selecting a billing project, and running a notebook titled "Getting Started with GP2 Tier 1 Data." The screenshot also mentions a GP2 Tier 1 GCP storage bucket and provides the storage path, as well as information about a notebook available for pulling data from the GCP bucket into the workspace.

To get started, you will need to submit a project idea to GP2 for review. Once you are approved, your institution will set up a billing project and workspace, or GP2 will set one up on your behalf. After the workspace is set up, you can start by making a new notebook for your analysis. To run a notebook, you will need to create a cloud environment. You can customize your cloud environment to suit your needs, but if you are planning on using all the currently available imputed complex disease GP2 data, make sure to request at least 100 GB of persistent disk space so you have enough room to work with the approximately 82 GB of data in release 2.0. You can always play around with the options to find what works best, but keep in mind that more resources come with higher costs.

Cloud Environment Example

A screenshot of a cloud environment configuration interface. It shows options for setting up a cloud compute profile, including selecting the number of CPUs, memory (GB), and whether to enable GPUs. It includes fields for a startup script and choosing the compute type (e.g., Standard VM). There is an option to enable autopause after a specified period of inactivity, and the user can select the location for the virtual machine. The interface also includes a section for configuring a persistent disk, where the user can specify the disk type and disk size in GB. Costs for running or paused cloud compute and disk storage are displayed at the top.

Once you have your cloud environment running and a new notebook set up, you are ready to begin your analysis. To set up your notebook, you may want to take advantage of some useful packages and variables provided below and on the GP2 official workspaces.

Useful Packages

A screenshot of a Python script for importing various packages and libraries. The code includes imports for system interaction (os, sys), data manipulation with Pandas (pandas), visualization with Seaborn (seaborn), NumPy for numerical operations (numpy), and Matplotlib for plotting (matplotlib). Additional imports include StringIO for handling string inputs, FireCloud API interaction (firecloud.api), and tools for working with Google Cloud services, such as bigquery for querying data and urllib.parse for building URLs. There are also imports for displaying HTML with IPython. The structure of the script sets up the environment for data analysis and visualization.
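As a rough sketch, the imports described above would look something like the following in Python (the exact aliases and module list in the official workspace notebooks may differ slightly):

# System interaction
import os
import sys

# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Handling string inputs as file-like objects
from io import StringIO

# FireCloud (Terra) API interaction
import firecloud.api as fapi

# Google Cloud services and URL building
from google.cloud import bigquery
import urllib.parse

# Displaying HTML output in the notebook
from IPython.display import display, HTML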

Environment Setup

A screenshot of a Python script for setting up billing project and workspace variables. The script uses environment variables to define the billing project ID, workspace namespace, workspace name, and workspace bucket. It assigns these variables using os.environ to access the corresponding environment settings: GOOGLE_PROJECT, WORKSPACE_NAMESPACE, WORKSPACE_NAME, and WORKSPACE_BUCKET. Additionally, the script fetches workspace attributes using fapi.get_workspace() and stores the result in the WORKSPACE_ATTRIBUTES variable, accessing the attributes in the JSON response. This setup is useful for managing cloud-based resources in a workspace environment.
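A sketch of that setup cell, reconstructed from the description above (the exact keys accessed in the fapi.get_workspace() JSON response are an assumption based on the standard FireCloud API response structure):

import os
import firecloud.api as fapi

# Billing project and workspace variables from the Terra environment
BILLING_PROJECT_ID = os.environ['GOOGLE_PROJECT']
WORKSPACE_NAMESPACE = os.environ['WORKSPACE_NAMESPACE']
WORKSPACE_NAME = os.environ['WORKSPACE_NAME']
WORKSPACE_BUCKET = os.environ['WORKSPACE_BUCKET']

# Fetch workspace attributes via the FireCloud API
# (the 'workspace' -> 'attributes' path is an assumption about the JSON layout)
WORKSPACE_ATTRIBUTES = fapi.get_workspace(WORKSPACE_NAMESPACE, WORKSPACE_NAME).json()['workspace']['attributes']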

To look at the available data and move data from the cloud bucket into your workspace, you will use the gsutil command-line tool. For example, to list the data available in the gp2tier2 bucket, you can use a command like this:

A screenshot of a terminal command used to list the data available in the gp2tier2 Google Cloud Storage bucket. The command is: ! gsutil ls gs://gp2tier2/release2_06052022/ This command uses gsutil, a tool for interacting with Google Cloud Storage, to list the contents of the "gp2tier2" bucket, specifically within the "release2_06052022" directory.
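Typed out so you can copy it into a notebook cell (the leading "!" tells the notebook to run the line as a shell command):

! gsutil ls gs://gp2tier2/release2_06052022/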

This will list all the files available in the release 2 folder:

A screenshot of a terminal command output listing the contents of a Google Cloud Storage bucket directory, "gp2tier2/release2_06052022/". The listed directories and files include: README_release2_06052022.txt, clinical_data/, cnvs/ (copy number variants), imputed_genotypes/, meta_data/, raw_genotypes/, and summary_statistics/. Each line begins with "gs://gp2tier2/release2_06052022/", indicating the Google Cloud Storage path for each file or directory in this release of data.

You can change the path to look into each directory. For example, if you change the path to this:

A screenshot of a terminal command used to list the data available in a specific directory within a Google Cloud Storage bucket. The command is: ! gsutil ls gs://gp2tier2/release2_06052022/imputed_genotypes/ This command uses gsutil to list the contents of the "imputed_genotypes" directory within the "gp2tier2/release2_06052022" bucket.
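Typed out, the command is:

! gsutil ls gs://gp2tier2/release2_06052022/imputed_genotypes/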

It lists the folders for the imputed genotype data, separated by predicted ancestry:

A screenshot of a terminal command output listing the directories in the "imputed_genotypes" folder of the Google Cloud Storage bucket "gp2tier2/release2_06052022/". The directories include: AAC, AFR, AJ, AMR, CAS, EAS, EUR, and SAS. Each line starts with "gs://gp2tier2/release2_06052022/imputed_genotypes/" followed by the directory name.

Once you have decided on the files you need, you can copy them to your workspace like this:

First, make a directory to copy the files into.

A screenshot of a terminal command to create a new directory. The command is: ! mkdir gp2_data This command uses mkdir to create a new directory called "gp2_data." The background is a dark terminal window, with the command text displayed clearly.
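Typed out, the command is:

! mkdir gp2_data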

Then copy them into your new directory. In this example, we are copying over PD summary statistics from Nalls et al. 2019 without 23andMe data.

A screenshot of a terminal command used to copy a file from a Google Cloud Storage bucket to a local directory. The command is: ! gsutil cp gs://gp2tier2/release2_06052022/summary_statistics/META5_no23_with_rsid2.txt gp2_data/ This command uses gsutil cp to copy the file "META5_no23_with_rsid2.txt" from the "summary_statistics" directory in the Google Cloud Storage bucket to the local "gp2_data" directory. The file contains summary statistics from Nalls et al., 2019, excluding data from 23andMe.
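Typed out, the copy command is:

! gsutil cp gs://gp2tier2/release2_06052022/summary_statistics/META5_no23_with_rsid2.txt gp2_data/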

Check to make sure that everything was copied over correctly by taking a look at the file:

A screenshot of a terminal command used to check the contents of a file. The command is: ! head gp2_data/META5_no23_with_rsid2.txt This command uses head to display the first few lines of the file "META5_no23_with_rsid2.txt" located in the "gp2_data" directory, allowing a quick preview of its contents.

A screenshot showing the output of that command, displaying the first few lines of the data file. The table includes the following columns: MarkerName, Allele1, Allele2, Freq1, FreqSE, MinFreq, MaxFreq, Effect, StdErr, P-value, Direction, HetISq, HetChiSq, HetDf, HetPVal, and ID. The rows provide information on genetic markers, their allele frequencies, statistical effects, and p-values. Example rows: chr10:100000625 with allele1 "a" and allele2 "g," followed by frequency data, effect size, standard error, and p-value; and chr10:100000645 with allele1 "a" and allele2 "c," with similar data. This output represents summary statistics from a genetic study.
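Typed out, the preview command is below. If you then want to load the summary statistics into your notebook, a small pandas example is included as well (the tab separator is an assumption about the file's format and may need adjusting):

! head gp2_data/META5_no23_with_rsid2.txt

# Optional: load the summary statistics into a pandas DataFrame
# (sep='\t' is an assumption about the file's delimiter)
import pandas as pd
sumstats = pd.read_csv('gp2_data/META5_no23_with_rsid2.txt', sep='\t')
sumstats.head()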

Now your data is ready to be analyzed!


This was a short example of how to use workspaces and notebooks with GP2 data. For a more in-depth Terra tutorial, please see the GP2 Learning Management System and complete ‘Course 1: Using Terra to Access Data and Perform Analyses’.

Meet the authors

GP2 Complex Disease Data Analysis Working Group