Demultiplex reads
Multiplexing techniques are often used when sequencing different samples in one sequencing run. One method is to tag the sequences with a unique identifier during the preparation of the sample for sequencing [Meyer et al., 2007].
With this technique, each sequence read will have a sample-specific tag, which is a specific sequence of nucleotides before and after the sequence of interest. This principle is shown in figure 23.13 (please refer to [Meyer et al., 2007] for more detailed information).
Figure 23.13: Tagging the target sequence. Figure from [Meyer et al., 2007].
The sample-specific tag - also called the barcode - can then be used to distinguish between the different samples when analyzing the sequence data.
The post-processing of the sequencing data to separate the reads into their corresponding samples based on their barcodes can be done using the demultiplexing functionality of the CLC Genomics Workbench. Using this tool, sequences are associated with a particular sample when they contain an exact match to a particular barcode. Sequences that do not contain an exact match to any of the barcode sequences provided are classified as not grouped and are put into a sequence list with the name "Not grouped".
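The exact-match rule described above can be illustrated with a minimal Python sketch. This is not the Workbench's implementation. It assumes, purely for illustration, that the barcode occupies the first nucleotides of each read and that all barcodes have the same length; the function name `demultiplex` is hypothetical.

```python
from collections import defaultdict

NOT_GROUPED = "Not grouped"

def demultiplex(reads, barcodes):
    """Group reads by an exact barcode match at the start of each read.

    reads:    iterable of nucleotide strings
    barcodes: dict mapping barcode sequence -> sample name
              (assumption: all barcodes have the same length)
    """
    barcode_length = len(next(iter(barcodes)))
    groups = defaultdict(list)
    for read in reads:
        tag = read[:barcode_length]
        if tag in barcodes:
            # Exact match: keep only the part after the barcode.
            groups[barcodes[tag]].append(read[barcode_length:])
        else:
            # No exact match to any barcode: the read stays ungrouped.
            groups[NOT_GROUPED].append(read)
    return dict(groups)

groups = demultiplex(
    ["AAAAAAGATTACA", "GGGGGGCCTA", "AAAATAGATT"],
    {"AAAAAA": "Sample1", "GGGGGG": "Sample2"},
)
```

In this sketch a single mismatch in the barcode (third read) is enough to send the read to the "Not grouped" list, which mirrors the exact-match behavior described above.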
Note that there is also an example using Illumina data here.
Before processing the data, you need to import it as described in Import high-throughput sequencing data.
Please note that demultiplexing is often carried out on the sequencing machine so that the sequencing reads are already separated according to sample. This is often the best option, if it is available to you. In such cases, the data will not need to be demultiplexed again after import into the CLC Genomics Workbench.
To demultiplex your data, please go to:
Toolbox | NGS Core Tools | Demultiplex Reads
This opens a dialog where you can specify the sequences to process.
When you click on the button labeled Next, you can then specify the details of how the demultiplexing should be performed. At the bottom of the dialog, there are three buttons, which are used to Add, Edit and Delete the elements that describe how the barcode is embedded in the sequences.
First, click Add to define the first element. This will bring up the dialog shown in figure 23.14.
Figure 23.14: Defining an element of the barcode system.
At the top of the dialog, you can choose which kind of element you wish to define:
- Linker. This is a sequence which should just be ignored - it is neither the barcode nor the sequence of interest. Following the example in figure 23.13, it would be the four nucleotides of the SrfI site. For this element, you simply define its length - nothing else.
- Barcode. The barcode is the stretch of nucleotides used to group the sequences. In this dialog, you simply need to specify the length of the barcode. The valid sequences for your barcodes need to be provided at a later stage in setting up this job.
- Sequence. This element defines the sequence of interest. You can define a length interval for how long you expect this sequence to be. The sequence part is the only part of the read that is retained in the output. Both barcodes and linkers are removed.
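To make the three element types concrete, the following Python sketch splits a read according to an ordered list of elements. It is an illustration only and makes several assumptions that are not taken from the Workbench: the layout contains exactly one Sequence element, the read spans the whole layout, and the names `split_read` and the tuple format are hypothetical.

```python
def split_read(read, layout):
    """Split a read into barcodes and the sequence of interest.

    layout: ordered list of elements, each one of
        ("linker", length)
        ("barcode", length)
        ("sequence", min_length, max_length)
    Assumption: exactly one "sequence" element; it takes whatever
    remains once all fixed-length elements are accounted for.
    Returns (barcodes, sequence), or None if the sequence length
    falls outside the allowed interval.
    """
    fixed = sum(e[1] for e in layout if e[0] != "sequence")
    seq_len = len(read) - fixed
    pos = 0
    barcodes = []
    sequence = None
    for element in layout:
        kind = element[0]
        if kind == "sequence":
            _, min_len, max_len = element
            if not (min_len <= seq_len <= max_len):
                return None
            sequence = read[pos:pos + seq_len]
            pos += seq_len
        else:
            length = element[1]
            if kind == "barcode":
                barcodes.append(read[pos:pos + length])
            # Linkers are skipped: only the position advances.
            pos += length
    return barcodes, sequence

# Linker (4 nt), barcode (6 nt), then the sequence of interest.
layout = [("linker", 4), ("barcode", 6), ("sequence", 5, 20)]
result = split_read("GCCC" + "AAAAAA" + "GATTACA", layout)
```

Only the sequence part is returned as output here, in line with the description above that barcodes and linkers are removed.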
Figure 23.15: Processing the tags as shown in the example of figure 23.13.
If you have paired data, the dialog shown in figure 23.15 will be displayed twice - once for each part of the pair.
In cases where paired reads are expected to be barcoded in the same way (see example below), you would set the parameters for read1 (wizard step 3) and read2 (wizard step 4) to be the same.
Read1 : -Linker-Barcode1-Sequence
Read2 : -Linker-Barcode1-Sequence
However, if read2 of the pair is not expected to be the same as read1 in the pair, it is necessary to adjust these settings accordingly. For example, it is possible that read2 does not contain any barcode sequence at all. In this case, you would simply set the sequence parameter for the mate and exclude the barcode and linker parameters.
Clicking Next will display a dialog as shown in figure 23.16.
Figure 23.16: Specifying the barcodes as shown in the example of figure 23.13.
Barcodes can be entered manually or imported from a properly formatted CSV or Excel file:
- Manually. The barcodes can be entered manually by clicking the Add button. You can edit the barcodes and the names by clicking the cells in the table. The barcode name is used when naming the results.
- Import from CSV or Excel. To import a file of barcodes, click the Import button. The input format consists of two columns: the first contains the barcode sequence, the second contains the name of the barcode. An acceptable CSV file would contain columns of information that look like:
"AAAAAA","Sample1"
"GGGGGG","Sample2"
"CCCCCC","Sample3"
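If you want to check a barcode file before importing it, the two-column format above can be read with Python's standard `csv` module. This is a small sketch outside the Workbench; the function name `read_barcodes` is hypothetical, and skipping rows with fewer than two columns is an assumption of this example.

```python
import csv
import io

def read_barcodes(handle):
    """Read a two-column barcode file: barcode sequence, then barcode name."""
    barcodes = {}
    for row in csv.reader(handle):
        if len(row) < 2:
            # Assumption: blank or incomplete rows are skipped.
            continue
        sequence = row[0].strip().upper()
        name = row[1].strip()
        barcodes[sequence] = name
    return barcodes

# The example file content shown above, held in memory for illustration.
example = io.StringIO(
    '"AAAAAA","Sample1"\n'
    '"GGGGGG","Sample2"\n'
    '"CCCCCC","Sample3"\n'
)
barcodes = read_barcodes(example)
```

For a file on disk, `open(path, newline="")` can be passed in place of the in-memory example.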
The Preview column will show a preview of the results by running through the first 10,000 reads.
At the top, you can choose to search on both strands for the barcodes (this is needed for some 454 protocols where the MID is located at either end of the read).
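One way to picture a search on both strands is to test the read as given and, if no barcode is found, test its reverse complement. The Python sketch below illustrates that idea only; how the Workbench performs this search internally is not described here, and the function names, the barcode-at-start assumption, and the equal-length barcodes are all assumptions of this example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_barcode_either_strand(read, barcodes):
    """Look for an exact barcode match at the start of either strand.

    barcodes: dict mapping barcode sequence -> sample name
              (assumption: all barcodes have the same length)
    Returns (sample_name, read_in_matching_orientation) or None.
    """
    length = len(next(iter(barcodes)))
    for candidate in (read, reverse_complement(read)):
        sample = barcodes.get(candidate[:length])
        if sample is not None:
            return sample, candidate
    return None

barcodes = {"AAAAAC": "Sample1"}
forward_hit = find_barcode_either_strand("AAAAACGATT", barcodes)
reverse_hit = find_barcode_either_strand("AATCGTTTTT", barcodes)
```

In this sketch both calls resolve to the same sample and the same orientation of the read, because the second read is the reverse complement of the first.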
Click Next to specify the output options. First, you can choose to create a list of the reads that could not be grouped. Second, you can create a summary report showing how many reads were found for each barcode (see figure 23.17).
Figure 23.17: An example of a report showing the number of reads in each group.
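The kind of per-barcode tally shown in such a report can be sketched in a few lines of Python. This is an illustration of the counting only, not the Workbench's report format; the function name `summarize` and the percentage column are assumptions of this example.

```python
def summarize(groups):
    """Return (group name, read count, percentage of all reads) rows.

    groups: dict mapping group name -> list of reads
    """
    total = sum(len(reads) for reads in groups.values())
    rows = []
    for name, reads in groups.items():
        count = len(reads)
        percent = 100.0 * count / total if total else 0.0
        rows.append((name, count, round(percent, 1)))
    return rows

rows = summarize({"Sample1": ["A", "C", "G"], "Not grouped": ["T"]})
for name, count, percent in rows:
    print(f"{name}\t{count}\t{percent}%")
```

A large share of reads in the "Not grouped" row is usually a sign that the barcode layout or the barcode list should be checked.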
There is also an option to create subfolders for each sequence list. This can be handy when the results need to be processed in batch mode.
A new sequence list will be generated for each barcode containing all the sequences where this barcode is identified. Both the linker and barcode sequences are removed from each of the sequences in the list, so that only the target sequence remains. This means that you can continue the analysis by doing trimming or mapping. Note that you have to perform separate mappings for each sequence list.