GWAS summary data ecosystem
Developed at the MRC Integrative Epidemiology Unit (IEU) at the University of Bristol, this resource is a manually curated collection of complete GWAS summary datasets made available as open source files for download, or by querying a database of the complete data.
This project began as the underlying database for the MR-Base and LD Hub projects. These data now serve as an input source to a wider number of analytical tools that implement methods such as Mendelian randomization, fine mapping, colocalisation, GWAS visualisation etc. Please see the API page for a list of R and python packages that will connect to the data.
The database comprises mainly publicly available datasets, but also includes a number of private datesets whose access is controlled through OAuth2.0 authentication. Please contact us if you have datasets that you would like to add, whether public or private.
The GWAS VCF format
We have made all the public data available for download. We are using the GWAS VCF format to store the GWAS summary data to ensure alignment with the hg19 reference sequence, and to enable very fast querying. More information is available here: https://github.com/MRCIEU/gwas-vcf-specification and on biorxiv.
Citing this resource
Please ensure that the original data are cited whenever it is used. Pubmed IDs are made available through the meta data.
If you access the data through the OpenGWAS database API:
The MRC IEU OpenGWAS data infrastructure. Ben Elsworth, Matthew Lyon, Tessa Alexander, Yi Liu, Peter Matthews, Jon Hallett, Phil Bates, Tom Palmer, Valeriia Haberland, George Davey Smith, Jie Zheng, Philip Haycock, Tom R Gaunt, Gibran Hemani. bioRxiv 2020.08.10.244293v1. doi: 10.1101/2020.08.10.244293
The MR-Base platform supports systematic causal inference across the human phenome. Hemani G, Zheng J, Elsworth B, Wade KH, Baird D, Haberland V, Laurin C, Burgess S, Bowden J, Langdon R, Tan VY, Yarmolinsky J, Shihab HA, Timpson NJ, Evans DM, Relton C, Martin RM, Davey Smith G, Gaunt TR, Haycock PC, The MR-Base Collaboration. eLife 2018;7:e34408. doi: 10.7554/eLife.34408
If using GWAS-VCF files:
The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, Ben Elsworth, Tom R Gaunt, Gibran Hemani, Edoardo Marcora. bioRxiv 2020.05.29.115824; doi: https://doi.org/10.1101/2020.05.29.115824
There is not currently a single file to download everything - it would be extremely large (many terrabytes) and constantly out of date as we continually add new studies. To programmatically download a file you can use the
wget command. For example, if you know the ID of the GWAS dataset (
ukb-b-19953 in the example below) then the following commands will download that file and its index:
A list of all datasets, along with their meta-data and IDs, is available through the API, R or python packages, so it should be straightforward to create a subset of datasets to be downloaded through a loop. If doing so, please be considerate and avoid doing this type of activity in parallel as it could impact our servers and make the resource difficult to access for others.
Many people have contributed to the IEU GWAS database project. Main parties are:
- Ben Elsworth
- Chris Zheng
- Gibran Hemani
- Jon Hallett
- Matt Lyon
- Peter Matthews
- Philip Haycock
- Tessa Alexander
- Tom Gaunt
- Valeriia Haberland
- Yi Liu
Sincere thanks also go to the many GWAS consortia who have made the GWAS data that they generated publicly available, and many members of the IEU who have contributed to curating these data.
The IEU GWAS database project is built upon many important open source software projects.
Many thanks to the NHGRI-EBI GWAS catalog and the studies contributing data to OpenGWAS
- DCCT / EDIC
- BioBank Japan
- UK Biobank
- Sclerostin genetics consortium
- UK Biobank+INTERVAL+UK BiLEVE
- Biobank Japan Project