QIAseq Targeted RNAscan Panels use molecular barcode technology to quantify a large number of fusion genes and identify new fusion gene partners.
The concept of molecular barcoding is that during library preparation of the samples with a QIAseq Targeted RNAscan Panel, a Unique Molecular Index (UMI) is attached to each original molecule before amplification. The barcoded molecules are then amplified by PCR. Due to intrinsic noise and sequence-dependent bias, barcoded sequences may be amplified unevenly. Target quantification is therefore more accurate when it counts the number of distinct UMIs in the reads rather than the total number of reads for each gene. Sequence reads with different UMIs represent different original molecules, while sequence reads with the same UMI are PCR duplicates of one original molecule.
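The difference between counting reads and counting UMIs can be illustrated with a minimal Python sketch. It is not the workflow's implementation: the gene names, UMI strings, and the function name count_molecules are hypothetical, and the sketch assumes exact UMI matches per gene, whereas a production tool would typically also consider mapping position and tolerate sequencing errors in the UMI.

```python
from collections import defaultdict

def count_molecules(reads):
    """Return {gene: (total_reads, distinct_umis)} for (gene, umi) pairs.

    Simplifying assumption: two reads come from the same original
    molecule if and only if they share gene and exact UMI sequence.
    """
    total_reads = defaultdict(int)
    umis_seen = defaultdict(set)
    for gene, umi in reads:
        total_reads[gene] += 1
        umis_seen[gene].add(umi)
    return {g: (total_reads[g], len(umis_seen[g])) for g in total_reads}

# Hypothetical data: GENE_A was amplified unevenly (one UMI seen three times).
reads = [
    ("GENE_A", "ACGTACGTACGT"),
    ("GENE_A", "ACGTACGTACGT"),
    ("GENE_A", "ACGTACGTACGT"),
    ("GENE_A", "TTGCATTGCATT"),
    ("GENE_B", "GGATCCGGATCC"),
    ("GENE_B", "CCTAGGCCTAGG"),
]
counts = count_molecules(reads)
for gene, (n_reads, n_umis) in counts.items():
    print(gene, "reads:", n_reads, "molecules (UMIs):", n_umis)
```

In this toy data set GENE_A has twice as many reads as GENE_B, but both genes are represented by two original molecules, which is the quantity the UMI count recovers.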
However, during secondary analyses of the sequenced reads, UMIs (and the attached common sequence used as an identifier) can hinder the mapping of the reads to a reference sequence. The first steps in the Detect QIAseq RNAscan Fusions ready-to-use workflow therefore consist of trimming the UMI while retaining the UMI barcoding information as an annotation on the read. Remaining PCR adapters are also trimmed before mapping the sequencing reads to the human transcriptome to perform an RNA-seq analysis of the reads.
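The idea of trimming the UMI while keeping it as an annotation can be sketched as follows. This is an illustration only: the Read class, the trim_umi function, the UMI length, and the common sequence are placeholder assumptions chosen for the example, not the kit's specification or the workflow's actual code.

```python
from dataclasses import dataclass, field

# Placeholder values for illustration only (assumptions, not kit specifications).
UMI_LENGTH = 12
COMMON_SEQUENCE = "ACGTACGTACG"

@dataclass
class Read:
    name: str
    sequence: str
    annotations: dict = field(default_factory=dict)

def trim_umi(read, umi_length=UMI_LENGTH, common=COMMON_SEQUENCE):
    """Remove a leading UMI plus common sequence and store the UMI as an annotation.

    Reads that do not match the expected layout are returned unchanged.
    """
    umi = read.sequence[:umi_length]
    rest = read.sequence[umi_length:]
    if len(umi) < umi_length or not rest.startswith(common):
        return read
    trimmed = rest[len(common):]
    return Read(read.name, trimmed, {**read.annotations, "umi": umi})

raw = Read("read_1", "AAAACCCCGGGG" + COMMON_SEQUENCE + "TTTTGGGGAAAA")
trimmed = trim_umi(raw)
print(trimmed.sequence, trimmed.annotations)
```

After trimming, only the biological part of the read is presented to the mapper, while the UMI remains available for grouping reads later on.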
In the later stage of the workflow, the Detect QIAseq RNAscan Fusions workflow works in two major steps: first it detects all potential fusion genes, and then it evaluates (refines) the identified fusion events to increase the sensitivity and specificity of the calls.
Detection. The workflow first trims all remaining adapters from the reads. The trimmed reads are then mapped to the reference transcriptome sequence, and reads are grouped according to their Unique Molecular Index. The Detect Fusion Genes tool identifies fusion events based primarily on the number of fusion crossing reads, and secondarily on the number of fusion spanning reads. When determining whether a read actually crosses the fusion, the tool takes into account the length of the unaligned end as well as exon boundaries (at the RNA level, fusions usually happen at exon boundaries). Finally, other evidence, such as whether the unaligned end maps to many places in the genome, is considered. Note that the parameters of the Detect Fusion Genes tool, when included in the workflow, are configured differently from the default values of the tool used on its own. In particular, filters have been relaxed so that no fusion is overlooked.
The Detect Fusion Genes tool uses a binomial model to evaluate the fusions. The null hypothesis is that there is no fusion, i.e., the reads originate from the wild-type transcript. Hence, a small p-value suggests a fusion transcript. Each read is assigned to either the fusion or the wild-type transcript based on how well it maps to each. Because this assignment is based on mapping, it has an error rate (e) that we estimate from test data. In addition, we require a minimum number of reads to support any fusion breakpoint before considering it as a fusion. This guards against false positives due to low coverage.
The Z-score and p-value are then calculated using a standard one-tailed binomial test and an "Assumed error rate". This Assumed error rate is a mapping error rate, i.e., the probability that an unaligned end maps to another gene by chance. The p-value is the probability of observing at least as many spanning/crossing reads (indicating a fusion) as were actually seen, under the null hypothesis that a fraction of reads (the "Assumed error rate") map there by chance.
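A one-tailed binomial test of this kind can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not confirm: it takes the number of trials n to be the fusion supporting reads plus the wild-type supporting reads at the breakpoint, and it computes the Z-score with the usual normal approximation to the binomial. The function names and the example numbers are hypothetical, and the tool's exact formulas may differ.

```python
import math

def binomial_upper_tail(k, n, e):
    """P(X >= k) for X ~ Binomial(n, e): the one-tailed p-value."""
    return sum(math.comb(n, i) * e**i * (1 - e) ** (n - i) for i in range(k, n + 1))

def z_score(k, n, e):
    """Normal-approximation Z-score of k successes in n trials (an assumption)."""
    mean = n * e
    sd = math.sqrt(n * e * (1 - e))
    return (k - mean) / sd

# Hypothetical breakpoint: 5 fusion supporting reads out of 50 reads in total,
# with an assumed error rate of 1%.
fusion_reads, total_reads, assumed_error_rate = 5, 50, 0.01
p_value = binomial_upper_tail(fusion_reads, total_reads, assumed_error_rate)
z = z_score(fusion_reads, total_reads, assumed_error_rate)
print("p-value:", p_value, "Z-score:", z)
```

With an error rate of 1%, only about half a read out of 50 would be expected to support the fusion by chance, so five supporting reads yield a small p-value and a large Z-score.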
The Detect Fusion Genes tool outputs a maximum of 200 identified fusions (the ones with the lowest p-value, i.e., the highest Z-score) in a fusion track. In addition to that track, the tool also generates a set of "fusion references", i.e., a version of the input sequence track, gene track, mRNA track, CDS track and primer track, as well as the fusion breakpoints track, all of which are mapped on an artificial genome that includes both the wild-type and the fusion chromosomes. Note that the number of fusions output by the tool when used in the workflow has been limited to 200 to avoid having to compute too many subsequent fusion reference chromosomes.
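Keeping only the best-supported candidates amounts to sorting by p-value and truncating the list. The following Python sketch shows this under the assumption that each candidate is a simple dictionary; the field names, the generated p-values, and the function name select_top_fusions are hypothetical and only serve the illustration.

```python
def select_top_fusions(candidates, max_fusions=200):
    """Return at most max_fusions candidates, lowest p-value first."""
    return sorted(candidates, key=lambda c: c["p_value"])[:max_fusions]

# Hypothetical candidate list with 250 entries and made-up p-values.
candidates = [
    {"name": f"fusion_{i}", "p_value": (250 - i) / 1000}
    for i in range(250)
]
top = select_top_fusions(candidates)
print(len(top), top[0]["name"], top[0]["p_value"])
```

Capping the list this way bounds the number of fusion reference chromosomes that have to be built for the subsequent steps.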
Refinement. The workflow then re-maps the original trimmed reads against the fusion references (i.e., the transcriptome including putative fusion transcripts), and regroups all reads in new UMI groups. We expect that some previously unmapped or poorly mapped reads will now map directly to the fusion transcripts, resulting in a more accurate detection of fusion supporting reads.
The Refine Fusion Genes tool takes the fusions previously identified by the Detect Fusion Genes tool and re-counts the number of fusion crossing reads as well as wild-type supporting reads. It then calculates the "refined" Z-score and p-value using the same binomial model as the Detect Fusion Genes tool.