Reference data overview


Data | Provider | URL to the original file | Description
Human reference sequence | ENSEMBL | ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/ | Human reference DNA sequence GRCh37 (hg19) for chromosomes 1-22, X, Y and M.
Human genes, coding sequences and transcripts | ENSEMBL | ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/ | All annotated protein-coding genes for the human reference sequence GRCh37 (hg19). The annotation was done by ENSEMBL and includes annotations from RefSeq and CCDS as well as from ENSEMBL itself.
HapMap variants | ENSEMBL | ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/ | The goal of the International HapMap Project is to develop a haplotype map of the human genome, the HapMap, which will describe the common patterns of human DNA sequence variation (for more information about HapMap, see http://hapmap.ncbi.nlm.nih.gov/). Please note that there are 12 different files (tracks) to be downloaded, one for each population. It is recommended that you configure your workflows with the file from the population that best matches the ethnicity of the patient from whom the sample was taken. The population codes, which are part of the filename, are explained here: http://www.sanger.ac.uk/resources/downloads/human/hapmap3.html.


Data | Provider | URL to the original file | Description
Variants found by the 1000 Genomes Project | ENSEMBL | ftp://ftp.ensembl.org/pub/current_variation/gvf/homo_sapiens/ | The 1000 Genomes Project Phase 1 created an integrated map of genetic variation from 1092 human genomes [  et al., 2012]. Please note that there are 4 different files (tracks) to be downloaded, one for each population. It is recommended that you configure your workflows with the file from the population that best matches the ethnicity of the patient from whom the sample was taken. The population codes, which are part of the filename, are explained here: http://www.ensembl.org/Help/Faq?id=328.
COSMIC variants | Sanger Institute | ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/CosmicMutantExport_*.tsv.gz | The mutation data was obtained from the Sanger Institute Catalogue Of Somatic Mutations In Cancer web site, http://www.sanger.ac.uk/cosmic: Bamford et al. (2004), "The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website" [Bamford et al., 2004]. The COSMIC database is a human, curated database.
dbSNP variants | UCSC | http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp*.txt.gz | Human variants present in the Single Nucleotide Polymorphism Database (dbSNP), which includes small insertions, deletions and replacements as well as SNPs and MNVs. Please note that most variants in dbSNP are not validated and that anyone can submit data to dbSNP. The collection includes clinically relevant as well as common variants. The URL must be modified according to the file you would like to download: for example, to download snp141Common.txt.gz, replace "*" in the URL with "141Common" (for a full list of files, see http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/).
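The "*" substitution described for the dbSNP URL can be sketched in a few lines of Python. This is only an illustration: the helper name fill_wildcard is hypothetical and not part of any tool mentioned here, and the sketch builds the URL string without downloading anything.

```python
def fill_wildcard(url_template: str, value: str) -> str:
    """Replace the single "*" placeholder in a URL template with `value`."""
    # Refuse templates without exactly one placeholder, so a typo in the
    # template does not silently produce a wrong URL.
    if url_template.count("*") != 1:
        raise ValueError("expected exactly one '*' placeholder in the URL template")
    return url_template.replace("*", value)


# Template copied from the dbSNP row of the table.
DBSNP_TEMPLATE = "http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp*.txt.gz"

url = fill_wildcard(DBSNP_TEMPLATE, "141Common")
print(url)
# prints http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp141Common.txt.gz
```

The same helper works for any URL template in this overview that contains a single "*", provided the replacement value matches a file that actually exists on the provider's server.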


Data | Provider | URL to the original file | Description
dbSNP variants | UCSC | http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/snp*Common.txt.gz | Uniquely mapped variants that appear in at least 1% of the population or are 100% non-reference. The URL must be modified according to the file you would like to download: for example, to download snp141Common.txt.gz, replace "*" in the URL with "141" (for a full list of files, see http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/).
ClinVar database variants | NCBI | ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf/clinvar_00-latest.vcf.gz | ClinVar is designed to provide a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.
PhastCons Conservation Scores | UCSC | http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons/ | UCSC conservation track built from multiple alignments of 100 species, with evolutionary conservation measured using the phastCons algorithm from the PHAST package.
Human Gene Ontology (GO slim) file | EBI | http://www.ebi.ac.uk/QuickGO/GMultiTerm | Gene Ontology file in slim format (only high-level GO terms are annotated) for the GO categories Molecular Function, Biological Process and Cellular Component, annotated on human genes. The file was made using the QuickGO tool from the EBI (http://www.ebi.ac.uk/QuickGO/GMultiTerm).