Definition of sample grouping

When new sample files are detected, they are grouped into samples based on their filenames and, optionally, on information in a sample sheet. For analysis workflows where a sample sheet is not mandatory, defining one is still recommended if the sequencing machine supports it. A sample sheet allows greater flexibility in how the samples are named when the workflow contains multiple importers (multimodal workflows) and multiple samples should be submitted to the same workflow.


Determining sample ID

In order to determine the sample ID for a given sample file, it is assumed that the filename is formatted as follows:

<optional prefix><sample ID><suffix>

The sample ID is what is left after removing the optional prefix and the suffix added to the sample ID by the sequencing machine. As an example, consider files produced by an Illumina® sequencer, which take the form:

<sample ID>_S<#>_L<###>_R<1 or 2>_<###>.fastq.gz

For example:

ID1234_S1_L001_R1_001.fastq.gz
ID1234_S1_L001_R2_001.fastq.gz
ID5678_S2_L001_R1_001.fastq.gz
ID5678_S2_L001_R2_001.fastq.gz

Removing the Illumina suffix from these four files leaves us with two samples with IDs ID1234 and ID5678, respectively, each consisting of two files.
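
The suffix removal described above can be sketched in a few lines of Python. This is only an illustration, not the connector's actual implementation: the regular expression and the function name group_by_sample_id are assumptions derived from the filename form shown above.

```python
import re
from collections import defaultdict

# Assumed pattern for the Illumina suffix: _S<#>_L<###>_R<1 or 2>_<###>.fastq.gz
ILLUMINA_SUFFIX = re.compile(r"_S\d+_L\d{3}_R[12]_\d{3}\.fastq\.gz$")


def group_by_sample_id(filenames):
    """Group filenames by the sample ID left after removing the suffix."""
    groups = defaultdict(list)
    for name in filenames:
        match = ILLUMINA_SUFFIX.search(name)
        if match is None:
            continue  # not an Illumina-style filename; skipped in this sketch
        groups[name[:match.start()]].append(name)
    return dict(groups)


files = [
    "ID1234_S1_L001_R1_001.fastq.gz",
    "ID1234_S1_L001_R2_001.fastq.gz",
    "ID5678_S2_L001_R1_001.fastq.gz",
    "ID5678_S2_L001_R2_001.fastq.gz",
]
samples = group_by_sample_id(files)
for sample_id, members in samples.items():
    print(sample_id, members)
```

Each key of the resulting dictionary is a sample ID and each value is the list of files belonging to that sample.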

Other types of sequencing machines might produce sample files taking a different form, e.g.:

<sequencing run ID>_<L##>_<sample ID>_<1 or 2>.fq.gz

For example:

SEQRUN1_L01_ID1234_1.fq.gz
SEQRUN1_L01_ID1234_2.fq.gz
SEQRUN1_L01_ID5678_1.fq.gz
SEQRUN1_L01_ID5678_2.fq.gz

Removing the prefix (consisting of the sequencing run ID and lane number) and suffix from these four files leaves us with two samples with IDs ID1234 and ID5678, respectively, each consisting of two files.
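
A filename of this second form could be split into its parts as sketched below. The pattern is a hypothetical one written for this example only; in particular, it assumes that the sequencing run ID contains no underscores, which the format above does not guarantee.

```python
import re

# Hypothetical pattern for: <sequencing run ID>_<L##>_<sample ID>_<1 or 2>.fq.gz
# Assumption: the sequencing run ID itself contains no underscores.
RUN_LANE_FORMAT = re.compile(
    r"^(?P<run>[^_]+)_(?P<lane>L\d{2})_(?P<sample_id>.+)_(?P<read>[12])\.fq\.gz$"
)


def parse_filename(name):
    """Return the parts of the filename as a dict, or None if it does not match."""
    match = RUN_LANE_FORMAT.match(name)
    return match.groupdict() if match else None


parts = parse_filename("SEQRUN1_L01_ID5678_2.fq.gz")
print(parts["sample_id"])
```

Here the prefix corresponds to the run and lane groups, the suffix to the read number and file extension, and the sample_id group holds what remains.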


Pairing multimodal samples

If a sample sheet is not defined, multimodal samples can only be paired if the filenames of the sample files follow one of the following formats (the dash can be replaced with an underscore) and they have the same sample ID:

For example, the following four files constitute two samples that should be submitted together to the same workflow, one to the DNA importer and one to the RNA importer:

DNA-ID1234_S1_L001_R1_001.fastq.gz
DNA-ID1234_S1_L001_R2_001.fastq.gz
RNA-ID1234_S2_L001_R1_001.fastq.gz
RNA-ID1234_S2_L001_R2_001.fastq.gz

Likewise, the following four files also constitute two samples that should be submitted together to the same workflow, one to the DNA importer and one to the RNA importer:

SEQRUN1_L01_DNA-ID1234_1.fq.gz
SEQRUN1_L01_DNA-ID1234_2.fq.gz
SEQRUN1_L01_RNA-ID1234_1.fq.gz
SEQRUN1_L01_RNA-ID1234_2.fq.gz
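
The pairing shown in these examples can be sketched as follows. The sketch assumes, based only on the examples above, that a DNA or RNA tag precedes the sample ID and is joined to it by a dash or an underscore; the names TYPE_TAG and pair_by_sample_id are hypothetical. The input is the part of each filename left after removing the machine prefix and suffix.

```python
import re
from collections import defaultdict

# Assumption from the examples: <DNA or RNA><dash or underscore><sample ID>
TYPE_TAG = re.compile(r"^(?P<sample_type>DNA|RNA)[-_](?P<sample_id>.+)$")


def pair_by_sample_id(tagged_ids):
    """Map each sample ID to the tagged samples that share it, keyed by type."""
    pairs = defaultdict(dict)
    for tagged in tagged_ids:
        match = TYPE_TAG.match(tagged)
        if match is None:
            continue  # no recognizable type tag; cannot be paired in this sketch
        pairs[match["sample_id"]][match["sample_type"]] = tagged
    return dict(pairs)


pairs = pair_by_sample_id(["DNA-ID1234", "RNA-ID1234", "DNA_ID5678"])
for sample_id, by_type in pairs.items():
    print(sample_id, by_type)
```

In this sketch, ID1234 ends up with both a DNA and an RNA entry and would be submitted as a pair, while ID5678 has only a DNA entry.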


Sample sheet grouping

The connector only supports sample sheets produced by Illumina sequencing machines. For each sample in the [Data] section of the sample sheet, it is possible to define a Pair_ID and a Sample_Type. The former enables the connector to determine which samples should be submitted together to the same workflow, and the latter enables the connector to determine to which importer a given sample should be submitted.

For example, the following four files, which constitute two samples that should be submitted together, can be grouped based in part on the filename (which yields the sample ID) and in part on the information in the sample sheet (which maps the sample ID to its pair ID and sample type):

ID1234_S1_L001_R1_001.fastq.gz (Pair ID: XYZ513, Sample type: DNA)
ID1234_S1_L001_R2_001.fastq.gz (Pair ID: XYZ513, Sample type: DNA)
ID1234_S2_L001_R1_001.fastq.gz (Pair ID: XYZ513, Sample type: RNA)
ID1234_S2_L001_R2_001.fastq.gz (Pair ID: XYZ513, Sample type: RNA)

If a pair ID has not been specified, the sample name or ID is used instead. If a sample type has not been specified, the value UNSPECIFIED_SAMPLE_TYPE is used instead.
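
The fallback rules above can be illustrated with a small sketch that reads the [Data] section of a sample sheet and fills in the defaults. It is not the connector's implementation: the sheet contents are made up for the example, the helper names are hypothetical, and the sketch assumes that the fallback for a missing pair ID is the Sample_ID column.

```python
import csv

UNSPECIFIED = "UNSPECIFIED_SAMPLE_TYPE"

# Made-up sample sheet contents, for illustration only.
SHEET = """[Header]
Workflow,GenerateFASTQ
[Data]
Sample_ID,Pair_ID,Sample_Type
ID1234,XYZ513,DNA
ID5678,,
"""


def read_data_section(sheet_text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = sheet_text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip().startswith("[Data]"))
    rows = []
    for line in lines[start + 1:]:
        if line.strip().startswith("["):
            break  # the next section begins
        if line.strip():
            rows.append(line)
    return list(csv.DictReader(rows))


def resolve(row):
    """Apply the defaults for a missing pair ID or sample type."""
    sample_id = row["Sample_ID"]
    # Assumption: fall back to the sample ID when no pair ID is given.
    pair_id = (row.get("Pair_ID") or "").strip() or sample_id
    sample_type = (row.get("Sample_Type") or "").strip() or UNSPECIFIED
    return sample_id, pair_id, sample_type


resolved = [resolve(row) for row in read_data_section(SHEET)]
for entry in resolved:
    print(entry)
```

The second row has neither a pair ID nor a sample type, so in this sketch it resolves to its own sample ID and to UNSPECIFIED_SAMPLE_TYPE.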

The sample sheet file must be located in the same folder tree as the sample files to submit for analysis.