How reads are downloaded
Reads are downloaded in SRA (.sra) format using the NCBI SRA-toolkit. These files are typically 2.5x smaller than an equivalent zipped FASTQ format file.
NCBI's prefetch utility is used for downloading the data, and the resulting file is then processed using 'fastq-dump' (see footnote 8.1).
Biological reads are imported. Technical reads are not. For paired reads, 2 biological reads are expected.
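The two-step process described above (prefetch, then fastq-dump) can be sketched outside the workbench as follows. This is a minimal illustration, not the workbench's actual invocation: the exact options the workbench passes are not documented here, so the flags shown (`--skip-technical`, `--split-files`) are assumptions chosen to mirror the behavior described, and the helper name `build_download_commands` is hypothetical. The path of the downloaded .sra file assumes the directory layout used by recent SRA Toolkit releases.

```python
from typing import List


def build_download_commands(run_accession: str, out_dir: str = ".") -> List[List[str]]:
    """Return the command lines for fetching and converting one SRA run.

    Hypothetical helper: it only builds the argument lists, it does not
    run anything, so it can be inspected without the SRA Toolkit installed.
    """
    # Step 1: prefetch downloads the run in .sra format.
    prefetch_cmd = ["prefetch", "--output-directory", out_dir, run_accession]

    # Assumption: prefetch stores the file as <out_dir>/<run>/<run>.sra,
    # which is the layout used by recent SRA Toolkit releases.
    sra_path = f"{out_dir}/{run_accession}/{run_accession}.sra"

    # Step 2: fastq-dump converts the .sra file to FASTQ.
    # --skip-technical keeps only biological reads;
    # --split-files writes each read of a spot to its own file (paired data).
    fastq_dump_cmd = [
        "fastq-dump",
        "--skip-technical",
        "--split-files",
        "--outdir", out_dir,
        sra_path,
    ]
    return [prefetch_cmd, fastq_dump_cmd]


if __name__ == "__main__":
    import shlex

    # Print the commands instead of executing them.
    for cmd in build_download_commands("SRR000001", "data"):
        print(shlex.join(cmd))
```

To run the commands for real, each list could be passed to `subprocess.run(cmd, check=True)`, provided the SRA Toolkit is installed and on the PATH.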
Sometimes runs in SRA cannot be downloaded. The affected runs are listed in a Problems panel together with a description of the problem. It is still possible to download the remaining runs.
The most common problems are:
- "The selected SRA reads contain no spots, and cannot be imported in the workbench.": The run has no associated sequencing data.
- "The selected SRA reads are dbGaP restricted.": For data protection reasons, you must request access to these reads. Access requests and downloads of restricted reads cannot be made from within the workbench, but you can follow the procedures described here: http://www.ncbi.nlm.nih.gov/books/NBK5295/.
- "The selected SRA reads are made with an unsupported sequencing platform.": For example, Complete Genomics reads consist of eight regions separated by gaps of variable lengths, and should be analyzed by specialist tools.
Show Metadata for Selection
Information about SRA entries of interest can be downloaded to a CLC Metadata Table without downloading the sequence data. Select the rows of interest in the results table and then click on Show Metadata for Selection. Sequence data can be downloaded later if desired.
The first columns of the resulting CLC Metadata Table contain the same database identifiers as in the original results table. Later columns contain details associated with the biosample, which are pulled from SRA. The columns to show in the table can be configured in the side panel, to the right.
Tips relating to retrieving sequence data later using the CLC Metadata Table:
- CLC Metadata Tables can be filtered so that only relevant rows are shown (Filtering tables).
- To copy just the accessions from the visible rows in the table, and retrieve these entries from SRA:
  - Select only the "Run Accession" column in the side panel settings. (It can be fastest to click on the "Deselect All" button at the bottom of the column listing in the side panel and then re-select Run Accession.)
  - Select all the rows and then go to Edit | Copy in the top level menu (or use Ctrl + C).
  - In Search for Reads in SRA, choose "Accession" in the search options area and paste the copied accessions into that field using Edit | Paste from the top level menu (or use Ctrl + V).
  - Remove the text "Run Accession" from the start of the pasted text and enter the text "OR" between each accession in the list.
  - Run the search at SRA.
- Downloading reads using Search for Reads in SRA creates a new CLC Metadata Table with the resulting sequence lists associated to the relevant rows. Sequence lists can be associated with any CLC Metadata Table you wish. For more details, see Associating data elements with metadata.
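The manual clean-up of the copied accessions (removing the "Run Accession" header and placing OR between the accessions) can also be done with a small script before pasting. The following is a minimal sketch, assuming the copied text contains the column header followed by one accession per line; the function name `accessions_to_query` is hypothetical and is not part of the workbench.

```python
def accessions_to_query(pasted_text: str, header: str = "Run Accession") -> str:
    """Turn a copied "Run Accession" column into an OR-separated search string.

    Assumption: the clipboard text holds the column header followed by
    one accession per line.
    """
    # Split into lines and trim surrounding whitespace.
    tokens = [line.strip() for line in pasted_text.splitlines()]
    # Drop empty lines and the column header.
    accessions = [t for t in tokens if t and t != header]
    # Join the remaining accessions with OR, as the Accession field expects.
    return " OR ".join(accessions)


if __name__ == "__main__":
    copied = "Run Accession\nSRR000001\nSRR000002\nSRR000003\n"
    print(accessions_to_query(copied))
```

The resulting string can then be pasted into the "Accession" field of Search for Reads in SRA.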
Footnotes
- 8.1 ('fastq-dump'): Downloading from SRA using Aspera is no longer supported. See https://github.com/ncbi/sra-tools/wiki/Avoid-using-ascp-directly-for-downloads