Definition of sample grouping
When new sample files are detected, they are grouped into samples based on their filenames and, optionally, on information in a sample sheet. For analysis workflows where a sample sheet is not mandatory, defining one is still recommended if the sequencer supports it. A sample sheet allows greater flexibility in how samples are named when the workflow contains multiple importers (multimodal workflows) and multiple samples should be submitted to the same workflow.
Determining sample ID
In order to determine the sample ID for a given sample file, it is assumed that the filename is formatted as follows:
<sample ID><suffix>
The sample ID is what remains after removing the suffix that the sequencer adds to it. As an example, consider files produced by an Illumina® sequencer, whose names take the form:
<sample ID>_S<#>_L<###>_R<1 or 2>_001.fastq.gz
For example:
ID1234_S1_L001_R1_001.fastq.gz
ID1234_S1_L001_R2_001.fastq.gz
ID5678_S2_L001_R1_001.fastq.gz
ID5678_S2_L001_R2_001.fastq.gz
Removing the Illumina suffix from these four files leaves us with two samples with IDs ID1234 and ID5678, respectively, each consisting of two files.
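The suffix removal described above can be sketched in a few lines of Python. This is only an illustration of the naming rule, not the plugin's actual implementation. The regular expression and the function names sample_id and group_by_sample_id are assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed pattern for the Illumina suffix: _S<#>_L<###>_R<1 or 2>_001.fastq.gz
ILLUMINA_SUFFIX = re.compile(r"_S\d+_L\d{3}_R[12]_001\.fastq\.gz$")


def sample_id(filename):
    """Return what is left of the filename after removing the Illumina suffix."""
    match = ILLUMINA_SUFFIX.search(filename)
    if match is None:
        raise ValueError(f"no Illumina suffix found in {filename!r}")
    return filename[:match.start()]


def group_by_sample_id(filenames):
    """Group filenames into samples keyed by their sample ID."""
    groups = defaultdict(list)
    for name in filenames:
        groups[sample_id(name)].append(name)
    return dict(groups)


files = [
    "ID1234_S1_L001_R1_001.fastq.gz",
    "ID1234_S1_L001_R2_001.fastq.gz",
    "ID5678_S2_L001_R1_001.fastq.gz",
    "ID5678_S2_L001_R2_001.fastq.gz",
]
samples = group_by_sample_id(files)
```

Here samples holds two entries, keyed ID1234 and ID5678, each listing two files.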
Pairing multimodal samples
If a sample sheet is not defined, multimodal samples can only be paired if the filenames of the sample files match one of the formats below (the underscore can be replaced by a hyphen) and the files share the same sample ID:
<sample type>_<sample ID><suffix>
<sample ID>_<sample type><suffix>
For example, the following four files constitute two samples that should be submitted together to the same workflow, one to the DNA importer and one to the RNA importer:
DNA_ID1234_S1_L001_R1_001.fastq.gz
DNA_ID1234_S1_L001_R2_001.fastq.gz
RNA_ID1234_S2_L001_R1_001.fastq.gz
RNA_ID1234_S2_L001_R2_001.fastq.gz
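As a rough sketch of filename-based pairing, the Python below splits each filename into a sample type and a sample ID, then collects files that share an ID. The set of recognized sample types (DNA and RNA), the suffix pattern, and the function names are assumptions made for this example. The plugin's own matching rules may differ.

```python
import re

# Assumed pattern for the Illumina suffix: _S<#>_L<###>_R<1 or 2>_001.fastq.gz
ILLUMINA_SUFFIX = re.compile(r"_S\d+_L\d{3}_R[12]_001\.fastq\.gz$")
# Assumption: the sample types to look for; the real list depends on the workflow.
SAMPLE_TYPES = ("DNA", "RNA")


def split_type_and_id(filename):
    """Return (sample type, sample ID), or None if the name fits neither format."""
    match = ILLUMINA_SUFFIX.search(filename)
    if match is None:
        return None
    stem = filename[:match.start()]
    for sample_type in SAMPLE_TYPES:
        for sep in "_-":  # underscore can be replaced by hyphen
            prefix = sample_type + sep
            suffix = sep + sample_type
            # Format: <sample type>_<sample ID><suffix>
            if stem.startswith(prefix) and len(stem) > len(prefix):
                return sample_type, stem[len(prefix):]
            # Format: <sample ID>_<sample type><suffix>
            if stem.endswith(suffix) and len(stem) > len(suffix):
                return sample_type, stem[:-len(suffix)]
    return None


def pair_samples(filenames):
    """Map sample ID -> sample type -> list of files."""
    pairs = {}
    for name in filenames:
        parsed = split_type_and_id(name)
        if parsed is None:
            continue  # not a recognizable multimodal filename
        sample_type, sid = parsed
        pairs.setdefault(sid, {}).setdefault(sample_type, []).append(name)
    return pairs


files = [
    "DNA_ID1234_S1_L001_R1_001.fastq.gz",
    "DNA_ID1234_S1_L001_R2_001.fastq.gz",
    "RNA_ID1234_S2_L001_R1_001.fastq.gz",
    "RNA_ID1234_S2_L001_R2_001.fastq.gz",
]
pairs = pair_samples(files)
```

For the four example files, pairs contains a single sample ID, ID1234, with two DNA files and two RNA files under it.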
Sample sheet grouping
The connector only supports sample sheets produced by Illumina sequencers. For each sample in the sample sheet, it is possible to define a pair ID and a sample type. The former enables the plugin to determine which samples should be submitted together to the same workflow, and the latter enables the plugin to determine to which importer a given sample should be submitted.
For example, the following four files constituting two samples that should be submitted together can be grouped based in part on the filename (obtaining the sample ID) and in part on the information in the sample sheet (mapping the sample ID to its pair ID and sample type):
ID1234_S1_L001_R1_001.fastq.gz (Pair ID: XYZ513, Sample type: DNA)
ID1234_S1_L001_R2_001.fastq.gz (Pair ID: XYZ513, Sample type: DNA)
ID1234_S2_L001_R1_001.fastq.gz (Pair ID: XYZ513, Sample type: RNA)
ID1234_S2_L001_R2_001.fastq.gz (Pair ID: XYZ513, Sample type: RNA)
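The two-step lookup (filename to sample ID, then sample ID to pair ID and sample type) might be sketched as follows. The sample sheet is represented here by a hypothetical, already-parsed dictionary. The sample IDs ID1234 and ID5678, the dictionary layout, and the function name are illustrative assumptions and do not reflect the actual sample sheet format or the plugin's implementation.

```python
import re
from collections import defaultdict

# Assumed pattern for the Illumina suffix: _S<#>_L<###>_R<1 or 2>_001.fastq.gz
ILLUMINA_SUFFIX = re.compile(r"_S\d+_L\d{3}_R[12]_001\.fastq\.gz$")

# Hypothetical parsed sample sheet: sample ID -> (pair ID, sample type).
SHEET = {
    "ID1234": ("XYZ513", "DNA"),
    "ID5678": ("XYZ513", "RNA"),
}


def group_by_pair_id(filenames, sheet):
    """Map pair ID -> sample type -> list of files, using the sheet for lookup."""
    groups = defaultdict(lambda: defaultdict(list))
    for name in filenames:
        match = ILLUMINA_SUFFIX.search(name)
        if match is None:
            continue  # filename does not carry the expected suffix
        sid = name[:match.start()]
        if sid not in sheet:
            continue  # sample not listed in the sample sheet
        pair_id, sample_type = sheet[sid]
        groups[pair_id][sample_type].append(name)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {pair_id: dict(by_type) for pair_id, by_type in groups.items()}


files = [
    "ID1234_S1_L001_R1_001.fastq.gz",
    "ID1234_S1_L001_R2_001.fastq.gz",
    "ID5678_S2_L001_R1_001.fastq.gz",
    "ID5678_S2_L001_R2_001.fastq.gz",
]
grouped = group_by_pair_id(files, SHEET)
```

In this sketch the pair ID decides which samples are submitted together, and the sample type decides which importer receives each sample.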
The sample sheet file must be located in a folder belonging to the same folder tree as the sample files (see example in figure 4.1).