Extract and count tags

First step in the analysis is to import the data (see Import Sequencing Data).

The next step is to extract the tags and count them:

        Toolbox | Transcriptomics Analysis (Image expressionfolder) | Expression Profiling by Tags (Image tag_folder) | Extract and Count Tags (Image count_tags)

This will open a dialog where you select the reads that you have imported. Click Next when the sequencing data is listed in the right-hand side of the dialog.

This dialog is where you define the elements in your reads. An example is shown in figure  28.34.

Image extract_and_count_tags_step2
Figure 28.34: Defining the elements that make up your reads.

By defining the order and size of each element, the Workbench is now able to both separate samples based on bar codes and extract the tag sequence (i.e. removing linkers, bar codes etc). The elements available are:

Sequence
This is the part of the read that you want to use as your final tag for counting and annotating. If you have tags of varying lengths, add a spacer afterwards (see below).
Sample keys
Here you input a comma-separated list of the sample keys used for identifying the samples (also referred to as "bar codes"). If you have not pooled and bar coded your data, simply omit this element.
Linker
This is a known sequence that you know should be present and do not want to be included in your final tag.
Spacer
This is also a sequence that you do not want to include in your final tag, but whereas the linker is defined by its sequence, the spacer is defined by its length. Note that the length defines the maximum length of the spacer. Often not all tags will be exactly the same length, and you can use this spacer as a buffer for those tags that are longer than what you have defined as your sequence. In the example in figure 28.34, the tag length is 17 bp, but a spacer is added to allow tags up to 19 bp. Note that the part of the read that is extracted and used as the final tag does not include the spacer sequence. In this way you homogenize the tag lengths which is usually desirable because you want to count short and long tags together.

When you have set up the right order of your elements, click Next to set parameters for counting tags as shown in figure  28.35.

Image extract_and_count_tags_step3
Figure 28.35: Setting parameters for counting tags.

At the top, you can specify how to tabulate (i.e. count) the tags:

Raw counts
This will produce the count for each tag in the data.
Sage Screen trimmed counts
This will produce trimmed tag counts. The trimmed tag counts are obtained by applying an implementation of the SAGEscreen method ([Akmaev and Wang, 2004]) to the raw tag counts. In this procedure, raw counts are trimmed using probabilistic reasoning. In this procedure, if a tag with low count has a neighboring tag with high count, and it is likely, based on the estimated mutation rate, that the low count tags have arisen through sequencing errors of the tags with higher count, the count of the less abundant tag will be attributed to the higher abundant neighboring tag. The implementation of the SAGEscreen method is highly efficient and provides considerable speed and memory improvements.
Next, you can specify additional parameters for the alignment that takes place when the tags are tabulated:
Allowing indels
Ticking this box means that, when SAGEscreen is applied, neighboring tags will, in addition to tags which differ by nucleotide substitutions, also include tags with insertion or deletion differences.
Color space
This option is only available if you use data generated on the SOLiD platform. Checking this option will perform the alignment in color space which is desirable because sequencing errors can be corrected. Learn more about color space in Color space.
At the bottom you can set a minimum threshold for tags to be reported. Although the SAGEscreen trimming procedure will reduce the number of erroneous tags reported, the procedure only handles tags that are neighbors of more abundant tags. Because of sequencing errors, there will be some tags that show extensive variation. There will by chance only be a few copies of these tags, and you can use the minimum threshold option to simply discard tags. The default value is two which means that tags only occurring once are discarded. This setting is a trade-off between removing bad-quality tags and still keeping tags with very low expression (the ability to measure low levels of mRNA is one of the advantages of tag profiling over for example micro arrays ['t Hoen et al., 2008]).

Note! If more samples are created, SAGEscreen and the minimum threshold cut-offs will be applied to the cumulated counts (i.e. all tags for all samples).

Clicking Next allows you to specify the output of the analysis as shown in figure 28.36.

Image extract_and_count_tags_step4
Figure 28.36: Output options.

The options are:

Create expression samples with tag counts
This is the primary result showing all the tags and respective counts (an example is shown in figure 28.37). For each sample defined via the bar codes, there will be an expression sample like this. Note that all samples have the same list of tags, even if the tag is not present in the given sample (i.e. there will be tags with count 0 as shown in figure 28.37). The expression samples can be used in further analysis by the expression analysis tools for statistical analyses etc.
Create sequence lists of extracted tags
This is a simple sequence list of all the tags that were extracted. The list is simple with no counts or additional information.
Create list of reads which have no tags
This list contains the reads from which a tag could not be extracted. This is most likely bad quality reads with sequencing errors that make them impossible to group by their bar codes. It can be useful for troubleshooting if the amount of real tags is smaller than expected.

Image counttable
Figure 28.37: The tags have been extracted and counted.

Finally, a log can be shown of the extraction and count process. The log gives useful information such as the number of tags in each sample and the number of reads without tags.