SAM and BAM mapping files

The CLC Genomics Workbench supports import and export of files in SAM (Sequence Alignment/Map) and BAM format, which are designed for storing large nucleotide sequence alignments. Read more and see the format specification at http://samtools.sourceforge.net/

The CLC Genomics Workbench includes support for importing SAM and BAM files from Complete Genomics.

Note! If you wish to import a SAM/BAM file as a sequence list without mapping information, please use the Standard import instead.

For a detailed explanation of the SAM and BAM files exported from CLC Genomics Workbench, please see SAM/BAM export format specification.

A SAM/BAM file that contains information associated with a mapping will include the read sequences, the names of the references used for the mapping, and information about the relationship between a given read sequence and the reference it mapped to. To import a mapping, you therefore need to provide the SAM/BAM file itself and also specify the reference sequences that are referred to within that file. The references can either be sequences already imported into the Workbench or, if appropriately recorded in the SAM/BAM file, be fetched from URLs specified in the SAM/BAM file.
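To see which references a SAM file refers to before importing it, you can inspect its header. The sketch below is not part of the Workbench. It assumes a text SAM file whose header follows the SAM format specification, where each @SQ line carries the reference name (SN), its length (LN), and optionally a URI for the sequence (UR). The function name parse_sq_lines and the example header are made up for illustration.

```python
def parse_sq_lines(sam_text):
    """Return one dict per @SQ header line with name, length, and optional URI."""
    references = []
    for line in sam_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first colon only, so URIs keep their own colons.
            tag, _, value = field.partition(":")
            fields[tag] = value
        references.append({
            "name": fields.get("SN"),
            "length": int(fields["LN"]) if "LN" in fields else None,
            "uri": fields.get("UR"),
        })
    return references


# Hypothetical header used only to exercise the function.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\tUR:http://example.org/chr1.fa\n"
    "@SQ\tSN:chrM\tLN:16569\n"
)

for ref in parse_sq_lines(example_header):
    print(ref)
```

A reference without a UR tag, such as chrM in this example, has no recorded download location and would have to be selected from sequences already in the Workbench.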

With the reference sequences, the read data, and the information about how the reads are associated with a particular reference, the Workbench builds up the mapping. You can build either a track-based mapping or a stand-alone mapping object. In the latter case, if there is only one reference sequence, the result will be a single read mapping or, where there is more than one reference sequence, a table of mappings.

Please note that mappings within the CLC Genomics Workbench do not allow an individual read sequence to map to more than one location. Because of this, in cases where a SAM/BAM file contains multiple alignment records for a single read, only one such record will be used to build the mapping.
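If you want to know in advance how many reads are affected by this, you can count the alignment records per read in a text SAM file. The following is a minimal sketch, not the Workbench's own logic; which record the Workbench keeps is not described here, so the code only reports the reads concerned. It relies on the FLAG bits defined in the SAM format specification (0x4 unmapped, 0x40 first segment, 0x80 last segment). The function name and the example records are hypothetical.

```python
from collections import Counter


def reads_with_multiple_records(sam_text):
    """Map (read name, mate number) to its record count, for counts above one."""
    counts = Counter()
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        columns = line.split("\t")
        qname, flag = columns[0], int(columns[1])
        if flag & 0x4:
            continue  # unmapped records carry no alignment
        # Mates of a pair share a name, so count them separately:
        # 1 = first segment, 2 = last segment, 0 = unpaired.
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        counts[(qname, mate)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical records: read1 has a primary and a secondary alignment.
example_records = "\n".join([
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t256\tchr1\t500\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
])

print(reads_with_multiple_records(example_records))
```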

To import a SAM or BAM file containing mapping data:

        File | Import | SAM/BAM Mapping Files

This will open a dialog where you select the SAM/BAM file to import as well as the reference sequences to be used (Figure 6.13).

When you select the reference sequence(s), two options exist:

  1. Select a matching reference sequence that has already been imported into the Workbench. Click on the "Find in folder" icon to locate the reference sequence.
  2. If the SAM/BAM file already contains information about where to find the reference sequence, tick the "Download references" box to automatically download the reference sequence.

The selected reference sequence(s) will be listed under "References in files" with "Name", "Length", and "Status". Whenever the correct reference sequence (with the correct name and sequence length) has been selected, the "Status" field will indicate this with an "OK". The name and length of your reference sequence must match exactly the names and lengths of the references specified in the SAM/BAM file. If there are inconsistencies between the names or lengths of the reference sequences being chosen and those recorded in the SAM/BAM file, an entry will appear in the "Status" column indicating this, for example "Length differs" or "Input missing" (see footnote 6.1).
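The name and length check described above can be sketched as follows. This only illustrates the matching rule and is not the Workbench's implementation. The status strings are taken from the text, but the exact conditions under which the Workbench reports each one are an assumption here, as are the function name and the example values.

```python
def reference_status(sam_references, selected_references):
    """Compare references named in the SAM/BAM file with the selected ones.

    Both arguments map a reference name to its sequence length.
    Returns a dict mapping each name from the SAM/BAM file to a status string.
    """
    status = {}
    for name, expected_length in sam_references.items():
        if name not in selected_references:
            # Assumption: no selected sequence carries exactly this name.
            status[name] = "Input missing"
        elif selected_references[name] != expected_length:
            status[name] = "Length differs"
        else:
            status[name] = "OK"
    return status


# Hypothetical values: lengths recorded in the file versus selected sequences.
in_file = {"chr1": 1000, "chr2": 500, "chrM": 16569}
selected = {"chr1": 1000, "chr2": 499}

for name, state in reference_status(in_file, selected).items():
    print(name, state)
```

Names are compared exactly, so under this sketch a selected sequence called "1" would not satisfy a reference recorded as "chr1".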



Footnotes

6.1
If you are using a CLC Genomics Server to import files located on the Server (rather than locally), then checks for corresponding reference names and lengths cannot be carried out, so nothing will be reported in this section of the Wizard. This means you can launch the import whether or not the specified reference set is correct. However, any inconsistencies will cause the import task to fail with a related error.