Disk space requirements
It is difficult to give general guidance on disk space requirements. Below, we describe some considerations for sequencing reads and mappings, as these are typically the largest datasets generated when working with NGS data.
- After import, the reads will be in a Sequence List. Details relating to the size of Sequence Lists are below. The original sequence file (e.g. FASTA, FASTQ) is not needed by the Workbench after import is complete.
- After import, reference sequences take up space. The original reference sequence files (e.g. FASTA, EMBL, GenBank) are not needed by the Workbench after import is complete.
- Reads mapped against a reference take up space within the mapping results. This is in addition to space taken up by the reads stored in a Sequence List.
- References also take up space within mapping results.
- Space is needed for temporary files created during analyses. Once an analysis is complete, the associated temporary files are deleted. Temporary files usually do not take up more space than the final result of the analysis. We highly recommend that temporary files are written to a location on the same machine that the Workbench is installed on. See Temporary data for further details.
The formula for disk space usage relating to read sequences in a Sequence List is:
Bytes per read: 28 + (length of read name) + 0.25 x (length of read)
Note that you can discard read names during import to save some space.
If quality scores are present and imported, add: 6 + (length of read)
As an example, a data set of 5.2 million 35 bp reads imported by CLC Genomics Workbench using the Discard sequence names option and including quality scores would take up:
5,244,764 x ( (28 + 0 + 0.25 x 35) + (6 + 35) ) = 389 MB
A stand-alone read mapping of these reads that included a 4.7 Mbp annotated reference sequence would take up 473 MB.
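The Sequence List formula above can be sketched in a few lines of Python to estimate space for other data sets. This is only an illustration: the function names are hypothetical and not part of the Workbench, and the conversion assumes that 1 MB means 1024 x 1024 bytes, which is the convention that reproduces the 389 MB figure in the example. It covers reads in a Sequence List only, not mappings or temporary files.

```python
def bytes_per_read(read_length, name_length=0, with_quality=False):
    """Estimated bytes used by one read in a Sequence List.

    name_length is 0 when sequence names are discarded during import.
    """
    size = 28 + name_length + 0.25 * read_length
    if with_quality:
        # Extra space when quality scores are present and imported.
        size += 6 + read_length
    return size


def sequence_list_bytes(num_reads, read_length, name_length=0, with_quality=False):
    """Estimated bytes for num_reads reads, assuming all reads share one
    length and one name length (a simplification for estimation)."""
    return num_reads * bytes_per_read(read_length, name_length, with_quality)


# The example above: 5,244,764 reads of 35 bp, names discarded,
# quality scores included.
total = sequence_list_bytes(5_244_764, 35, name_length=0, with_quality=True)

# Assumption: 1 MB = 1024 * 1024 bytes.
print(f"{total / 2**20:.0f} MB")
```

For data sets with variable read lengths, using the average read length and average name length should give a reasonable estimate, since the formula is linear in both.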