Disk space requirements

General guidance on disk space requirements is difficult to give. Below, we describe some considerations for sequencing reads and read mappings, as these are typically the largest datasets generated when working with NGS data.

The formula for disk space usage relating to read sequences in a Sequence List is:

Bytes per read: 28 + (length of read name) + 0.25 x (length of read)

Note that you can discard read names during import to save some space.

If quality scores are present and imported, add: 6 + (length of read)

As an example, a data set of 5.2 million 35 bp reads imported by CLC Genomics Workbench using the Discard sequence names option and including quality scores would take up:

5,244,764 x ( (28 + 0 + 0.25 x 35) + (6 + 35) ) = 5,244,764 x 77.75 = 407,780,401 bytes, or about 389 MB (taking 1 MB = 1,048,576 bytes)
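
The calculation above can be reproduced with a short script, which may be handy for estimating other data sets. This is only a sketch of the formula as stated in this section. The function names read_bytes and sequence_list_bytes are made up for illustration and are not part of any CLC software. It also assumes that 1 MB means 2**20 bytes, since that is the interpretation that reproduces the 389 MB figure.

```python
def read_bytes(read_length, name_length=0, with_quality=True):
    """Estimated bytes used by one read in a Sequence List.

    name_length=0 corresponds to discarding read names during import.
    """
    size = 28 + name_length + 0.25 * read_length
    if with_quality:
        # Extra cost when quality scores are imported.
        size += 6 + read_length
    return size


def sequence_list_bytes(num_reads, read_length, name_length=0, with_quality=True):
    """Estimated bytes for a Sequence List of equal-length reads."""
    return num_reads * read_bytes(read_length, name_length, with_quality)


# The example from the text: 5,244,764 reads of 35 bp,
# names discarded, quality scores included.
total = sequence_list_bytes(5_244_764, 35)
print(f"{total:,.0f} bytes")         # 407,780,401 bytes
print(f"{total / 2**20:.0f} MB")     # 389 MB (assuming 1 MB = 2**20 bytes)
```

Setting name_length to the typical length of the read names, or with_quality to False, gives a rough estimate for other import settings. Real data sets with reads of varying length would need the per-read formula summed over all reads rather than multiplied by a single length.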

A stand-alone read mapping of the same reads that included a 4.7 Mbp annotated reference sequence would take up 473 MB.