Adapter trimming
Clicking Next will allow you to specify adapter trimming.
In order to trim for adapters, you have to create an adapter list first that must be supplied to the trim tool. A new adapter trim list can be created here:
File | New | Trim Adapter List
This will create a new empty trim adapter list. Add the adapter(s) that you would like to use for trimming by clicking on the button Add Row () found at the bottom of the View Area (figure 22.2). Provide the name and sequence of the adapter that should be trimmed away and adjust the parameters if relevant.
Guideline to simplify the creation of trim adapter list in some common cases.
Sequencing adapters When you are analyzing sequencing data, the adapters must be trimmed off before you proceed with further analysis. The removal of adapters is often done directly on the sequencing machine, but in some cases, some adapters remain on the sequenced reads. The presence of remaining adapters can lead to misleading results, so we recommend to trim them off the reads.
If you have Illumina sequencing data for which adapter sequences have not been trimmed, or have been trimmed incompletely, the adapter sequences can be removed within the CLC Genomics Workbench using the Illumina adapter sequences that can be found here:
http://support.illumina.com/downloads/illumina-customer-sequence-letter.html
With Illumina adapters, the 5'end adapter is not sequenced, so adapters will always be found on the 3'end and thus need to be searched for on the Minus strand.
TruSeq adapters are ligated to the DNA fragments using TA overhangs. When creating the Trim Adapter List with TruSeq adapters, the Universal adapter can be entered in forward orientation and searched on the Minus strand. The Universal adapter already has a "T" in the end from the T-overhang in the TA ligation, so no additional "T" needs to be added to this sequence. However, the TruSeq Index adapter must be reverse-complemented: to simplify the list, only copy and reverse complement the part of the adapter that is common to all adapter indexes. Then paste it in the Trim Adapter list and add a "T" to the right-hand end of the reverse c-complemented adapter sequence. Finally set the strand to Minus. A more detailed step-by step description on how to remove TruSeq adapters can be found here https://secure.clcbio.com/helpspot/index.php?pg=kb.page&id=377.
|
Click on the button labeled Finish to create the adapter trim list. You must now save the generated adapter trim list in the Navigation Area. You can do this by clicking on the tab and dragging and dropping the adapter trim list to the desired destination, or you can go to File in the menu bar and the choose Save as.
Figure 22.2: Create a new Adapter Trim List by clicking on the button labeled "Add Row" found at the bottom of the New Trim Adapter view.
Note: To create an Adapter List file with the adapter sequences that have traditionally been provided with the Genomics Workbench, please go to Edit | Preferences. In the Data section of this panel you will find the adapter sequences. Please select the rows of your desired adapter sequences, then click the Convert Trim Adapters button. This will create an Adapter List for use in the Adapter trimming tool.
You can also create an adapter list by importing a comma separated value (.csv) file of your Adapters. This import can be performed with the standard import using either the Automatic Import option or Force Import as Type: Trim Adapter List. To import a csv file, the names of all adapters must be unique - the Workbench is unable to accept files with multiple rows containing the same adapter name. Additionally, the text between each comma that designates a new column should be quoted. The expected import format for Adapter Lists appears as shown in figure 22.3:
Figure 22.3: The expected import format for Adapter Lists.
You can also create an Excel file (.xlsx or .xls) format. In this case, you include the same information per column as indicated above, but do not include the quotes within Excel.
At the bottom of the view, you have the following options:
- Add Rows. Add a new adapter. This will bring up a dialog as shown in figure 22.4.
- Delete Row. Delete the selected adapter.
- Edit Row. Edit the selected adapter. This can also be achieved by double-clicking the row in the table.
Figure 22.4: Adding a new adapter for adapter trimming.
The information to be added for each adapter is explained in the following sections, going into detail with the adapter trim. Once the adapters have been added to the list, it should be saved (), and you can select it as shown in figure 22.10.
Action to perform when a match is found
When the first match for a given adapter is found on a read sequence, the Trim tool acts based on the action specified for that adapter. There are three actions to choose from:
- Remove adapter The adapter and all the nucleotides 5' of the match are removed when a match is found. All the nucleotides 3' of the adapter match will be preserved in the read, which is retained in the trimmed reads list. If there are no nucleotides 3' of the adapter match, the read is placed in the list of discarded sequences (see Trim output).
- Discard when not found The read sequence is trimmed, as described above for the Remove adapter option, if a match is found. However, if no match is found, the read sequence is placed in the list of discarded sequences. This kind of adapter trimming is useful for small RNA sequencing where the remnants of the adapter is an indication that this is indeed a small RNA. Note that if an adapter list contains multiple adapters marked "Discard when not found", then only sequences containing all the adapters will be retained in the list of trimmed reads.
- Discard when found The read is placed in the list of discarded sequences. If no match is found, the read is retained in the list of trimmed reads. This can be used for quality checking the data for linker contamination, etc.
How are adaptors identified/detected?For each read sequence in the input, a Smith-Waterman alignment [Smith and Waterman, 1981] is carried out with each adapter sequence. Alignment scores are computed and compared to the minimum scores provided for each adapter when setting up the Adapter List (figure 22.4). A lower score is usual when considering matches to adapters found at the end of sequences, where the adapter sequence may be incomplete. Further details on the definition of internal and end matches are provided in the Alignment Scoring paragraph below. When the alignment score surpasses the relevant minimum score, the action specified for that adapter will be taken.
Alignment Scoring
- Matches An A,C,G or T in the adapter that matches an A,C,G or T respectively - or a corresponding ambiguity code letter - in a sequence is considered a match and will be awarded 1 point.
- Mismatches The penalty for mismatches between bases is specified in the Adapter List.
- Gap The penalty for gaps introduced into the alignment is specified in the Adapter List.
By default a mismatch costs 2 and a gap (insertion or deletion) costs 3. A few examples of adapter matches and corresponding scores are shown in figure 22.5.
Figure 22.5: Three examples showing a sequencing read (top) and an adapter (bottom). The examples are artificial, using default setting with mismatch costs = 2 and gap cost = 3.
Note that there is a difference between an internal match and an end match. The examples above are all internal matches where the alignment of the adapter falls within the read. Figure 22.6 shows a few examples with an adapter match at the end.
Figure 22.6: Four examples showing a sequencing read (top) and an adapter (bottom). The examples are artificial.
In the first two examples (d and e), the adapter sequence extends beyond the end of the read. This is what typically happens when sequencing small RNAs where you sequence part of the adapter. The third example (f) shows a case that could be interpreted both as an end match and an internal match. However, the workbench will interpret this as an end match, because it starts at beginning (5' end) of the read. Thus, the definition of an end match is that the alignment of the adapter starts at the read's 5' end. The last example (g) could also be interpreted as an end match, but because it is a the 3' end of the read, it counts as an internal match (this is because you would not typically expect partial adapters at the 3' end of a read). Also note, that if Remove adapter is chosen for the last example, the full read will be discarded because everything 5' of the adapter is removed.
Below (figure 22.7), the same examples are re-iterated showing the results when applying different scoring schemes. In the first round, the settings are:
- Allowing internal matches with a minimum score of 6
- Not allowing end matches
- Action: Remove adapter
Figure 22.7: The results of trimming with internal matches only. Red is the part that is removed and green is the retained part. Note that the read at the bottom is completely discarded.
A different set of adapter settings could be:
- Allowing internal matches with a minimum score of 11
- Allowing end match with a minimum score of 4
- Action: Remove adapter
Figure 22.8: The results of trimming with both internal and end matches. Red is the part that is removed and green is the retained part.
Strand settings
Each adapter is defined as either Plus or Minus. Note that all the definitions above regarding 3' end and 5' end also apply to the minus strand (i.e. selecting the Minus strand is equivalent to reverse complementing all the reads). The adapter in this case should be defined as you would see it on the plus strand of the reverse complemented read. The example below (figure 22.9) shows a few examples of an adapter defined on the minus strand.
It shows hits for an adapter sequence defined as CTGCTGTACGGCCAAGGCG
, searching on the minus strand.
Figure 22.9: An adapter defined as CTGCTGTACGGCCAAGGCG searching on the minus strand. Red is the part that is removed and green is the retained part. The retained part is 3' of the match on the minus strand, just like matches on the plus strand.
You can see that if you reverse complemented the adapter you would find the hit on the plus strand, but then you would have trimmed the wrong end of the read. So it is important to define the adapter as it is, without reverse complementing.
Trimming of 3' ends of the reads
To trim an adapter and everything to the 3' end of the adapter you will need to search for the reverse complement of the adapter on the negative strand. This is achieved by creating a new Trim Adapter List from the reverse complement of your adapter sequence, choosing the minus strand of your reads and run adapter trimming with the new Trim Adapter List as input.
Other adapter trimming options
When you run the trim, you specify the adapter settings as shown in figure 22.10.
Figure 22.10: Trimming your sequencing data for adapter sequences.
Select an trim adapter list (see Adapter trimming on how to create an adapter list) that defines the adapters to use.
You can specify if the adapter trimming should be performed in Color space. Note that this option is only available for sequencing data imported using the SOLiD import. When doing the trimming in color space, the Smith-Waterman alignment is simply done using colors rather than bases. The adapter sequence is still input in base space, and the Workbench then infers the color codes. Note that the scoring thresholds apply to the color space alignment (this means that a perfect match of 10 bases would get a score of 9 because 10 bases are represented by 9 color residues). Learn more about color space.
Checking the Search on both strands checkbox will search both the minus and plus strand for the adapter sequence. Note! If a match is found on the reverse strand the Trim action will reverse complement the read before trimming and output the trimmed reverse complement. Its intended use is for removal of multiplexing barcodes and primers.
Below you find a preview listing the results of trimming with the current settings on 1000 reads in the input file (reads 1001-2000 when the read file is long enough). This is useful for a quick feedback on how changes in the parameters affect the trimming (rather than having to run the full analysis several times to identify a good parameter set). The following information is shown:
- Name. The name of the adapter.
- Matches found. Number of matches found based on the strand and alignment score settings.
- Reads discarded. This is the number of reads that will be completely discarded. This can either be because they are completely trimmed (when the Action is set to Remove adapter and the match is found at the 3' end of the read), or when the Action is set to Discard when found or Discard when not found.
- Nucleotides removed. The number of nucleotides that are trimmed include both the ones coming from the reads that are discarded and the ones coming from the parts of the reads that are trimmed off.
- Avg. length This is the average length of the reads that are retained (excluding the ones that are discarded).