Sort sequences by name

With this functionality you will be able to group sequencing reads based on their file name. A typical example would be that you have a list of files named like this:
...
A02__Asp_F_016_2007-01-10
A02__Asp_R_016_2007-01-10
A02__Gln_F_016_2007-01-11
A02__Gln_R_016_2007-01-11
A03__Asp_F_031_2007-01-10
A03__Asp_R_031_2007-01-10
A03__Gln_F_031_2007-01-11
A03__Gln_R_031_2007-01-11
...
In this example, the names have five distinct parts. Taking the first name as an example, Asp is the gene name and 016 is the sample ID.

Before mapping these data, you will probably want to divide them into groups instead of keeping all reads in one folder. If, for example, you wish to map each sample or each gene separately, you cannot simply run the mapping on all the sequences in one step.

That is where Sort Sequences by Name comes into play. It will allow you to specify which part of the name should be used to divide the sequences into groups. We will use the example described above to show how it works:

        Toolbox | NGS Core Tools | Multiplexing | Sort Sequences by Name

This opens a dialog where you can add the sequences you wish to sort. You can also add sequence lists, or the contents of an entire folder by right-clicking the folder and choosing Add folder contents.

When you click Next, you will be able to specify the details of how the grouping should be performed. First, you have to choose how each part of the name should be identified. There are three options:

In the example above, it would be sufficient to use a simple split with the underscore _ character, since this is how the different parts of the name are divided.
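
The simple split can be illustrated outside the Workbench with a short Python sketch. The function name split_name is hypothetical and not part of the software. The sketch also assumes that consecutive underscores (such as the "__" after the first part) count as a single separator, which is what yields five parts here; how the dialog itself treats them is not stated in this section.

```python
def split_name(name: str, separator: str = "_") -> list[str]:
    """Split a read name into its parts at every separator character."""
    # Assumption: empty parts produced by consecutive separators ("__")
    # are dropped, so the example names yield five parts each.
    return [part for part in name.split(separator) if part]


example_names = [
    "A02__Asp_F_016_2007-01-10",
    "A02__Asp_R_016_2007-01-10",
    "A02__Gln_F_016_2007-01-11",
]

for name in example_names:
    # Print each name followed by the list of parts it splits into.
    print(name, "->", split_name(name))
```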

When you have chosen a way to divide the name, the parts of the name will be listed in the table at the bottom of the dialog. There is a checkbox next to each part of the name. This checkbox is used to specify which of the name parts should be used for grouping. In the example above, if we want to group the reads according to sample ID and gene name, these two parts should be checked as shown in figure 23.12.

Figure 23.12: Splitting up the name at every underscore (_) and using the sample ID and gene name for grouping.

In the middle of the dialog there is a preview panel. This preview cannot be changed; it is shown to guide you in finding the appropriate settings.

Click Next if you wish to adjust how to handle the results. If not, click Finish. A new sequence list will be generated for each group. It will be named according to the group, e.g. Asp016 will be the name of one of the groups in the example shown in figure 23.12.
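
To make the grouping concrete, the following Python sketch mimics the result for the example names. It is an illustration only, not the Workbench's implementation. The function group_by_parts and its parameters are hypothetical. The sketch assumes that the group name is formed by joining the checked parts in the order they occur in the name, which is inferred from the group name Asp016 above, and that empty parts from the double underscore are ignored.

```python
from collections import defaultdict


def group_by_parts(names, selected, separator="_"):
    """Group names by the parts found at the selected (0-based) positions."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: empty parts from consecutive separators are ignored.
        parts = [part for part in name.split(separator) if part]
        # Assumption: the group name joins the selected parts in name order.
        group_name = "".join(parts[i] for i in selected)
        groups[group_name].append(name)
    return dict(groups)


names = [
    "A02__Asp_F_016_2007-01-10",
    "A02__Asp_R_016_2007-01-10",
    "A02__Gln_F_016_2007-01-11",
    "A02__Gln_R_016_2007-01-11",
    "A03__Asp_F_031_2007-01-10",
    "A03__Asp_R_031_2007-01-10",
    "A03__Gln_F_031_2007-01-11",
    "A03__Gln_R_031_2007-01-11",
]

# Positions 1 and 3 are the gene name and the sample ID in these names.
groups = group_by_parts(names, selected=(1, 3))
for group_name, members in groups.items():
    print(group_name, members)
```

With these settings each group holds the forward and the reverse read of one gene in one sample, which is the unit you would typically map separately.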

