The Single Cell RNA-Seq Analysis algorithm
Single Cell RNA-Seq Analysis uses the same algorithm as the RNA-Seq Analysis tool of the CLC Genomics Workbench. Briefly, the tool extracts the sequence of all transcripts from the provided mRNA track. Reads are then simultaneously aligned to both this transcriptome and the full genome (and spike-in sequences if these have been provided).
Each read may have multiple equally high-scoring alignments, some to transcripts and others to the genome. These alignments are translated back into genomic coordinates. In many cases, all the alignments refer to the same genomic coordinates and the read is considered 'uniquely mapped'. If a read has more than 10 distinct alignments in genomic coordinates, it is discarded.
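The classification rule above can be sketched as follows. This is only an illustration, not the Workbench's implementation: the function name, the tuple layout for a genomic alignment, and the labels 'multi-mapped' and 'unmapped' are assumptions introduced here. Only the threshold of 10 and the term 'uniquely mapped' come from the text.

```python
MAX_DISTINCT_ALIGNMENTS = 10  # threshold stated in the text


def classify_read(genomic_alignments):
    """Classify a read from its best-scoring alignments.

    genomic_alignments: iterable of hashable tuples describing each
    alignment after translation to genomic coordinates, for example
    (chromosome, start, end, strand). This layout is hypothetical.
    """
    # Alignments to a transcript and to the genome collapse to one
    # entry when they describe the same genomic coordinates.
    distinct = set(genomic_alignments)
    if not distinct:
        return "unmapped"
    if len(distinct) > MAX_DISTINCT_ALIGNMENTS:
        return "discarded"
    if len(distinct) == 1:
        return "uniquely mapped"
    return "multi-mapped"
```

For example, a read with one transcript alignment and one genome alignment that both translate to `("chr1", 100, 150, "+")` would be classified as uniquely mapped in this sketch.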
When a read can be aligned equally well to multiple transcripts or multiple genes, it is counted towards only one of them. The 'lucky' transcript is chosen by an Expectation-Maximization (EM) method similar to those used by RSEM and eXpress. This works as follows:
- An 'ambiguity graph' is built that links transcripts that could have given rise to the same reads. At this stage, all reads are considered together without reference to their barcodes or UMIs.
- The abundance of each transcript is estimated from this graph.
- The reads are distributed to the different transcripts according to their estimated abundances. Reads that map to genes but are incompatible with known transcripts are ignored unless the option Count intronic reads is enabled. When the option is enabled, these reads are assigned to a gene based on the estimated abundances of the transcripts for each gene.
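The steps above can be illustrated with a minimal EM sketch. It is a simplification under stated assumptions, not the tool's actual algorithm: transcript lengths are ignored, the ambiguity graph is represented implicitly as one set of compatible transcripts per read, and each read is assigned by random sampling in proportion to the estimated abundances. The function names and the `seed` parameter are hypothetical.

```python
import random


def estimate_abundances(read_compat, n_iter=50):
    """Estimate relative transcript abundances with a basic EM loop.

    read_compat: list with one set of compatible transcript ids per
    read (barcodes and UMIs are not used at this stage).
    """
    transcripts = sorted(set().union(*read_compat))
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible
        # transcripts in proportion to the current abundances.
        counts = dict.fromkeys(transcripts, 0.0)
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: renormalise the fractional counts.
        n = sum(counts.values())
        theta = {t: c / n for t, c in counts.items()}
    return theta


def assign_reads(read_compat, theta, seed=0):
    """Assign each read to exactly one compatible transcript.

    Sampling in proportion to abundance is an assumption of this
    sketch; the text only says reads are distributed according to
    the estimated abundances.
    """
    rng = random.Random(seed)
    assignment = []
    for compat in read_compat:
        options = sorted(compat)
        weights = [theta[t] for t in options]
        assignment.append(rng.choices(options, weights=weights, k=1)[0])
    return assignment
```

With eight reads compatible only with transcript T1, two only with T2, and five compatible with both, the ambiguous reads in this sketch are mostly assigned to T1 because the unambiguous reads indicate it is the more abundant transcript.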
At this stage, if the option Group by UMIs is enabled, then reads with the same barcode and UMI are only counted once. After the first read has been assigned, subsequent reads with the same barcode and UMI are ignored.
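A minimal sketch of this UMI grouping is shown below. The input layout, one `(barcode, umi, transcript)` tuple per assigned read in assignment order, and the function name are assumptions made for illustration.

```python
def count_with_umis(assigned_reads):
    """Count reads per (barcode, transcript), once per barcode and UMI.

    assigned_reads: iterable of (barcode, umi, transcript) tuples in
    the order in which the reads were assigned.
    """
    seen = set()
    counts = {}
    for barcode, umi, transcript in assigned_reads:
        if (barcode, umi) in seen:
            # A read with this barcode and UMI was already counted,
            # so this one is ignored, whatever its assignment.
            continue
        seen.add((barcode, umi))
        key = (barcode, transcript)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Note that in this sketch only the first read for a given barcode and UMI determines which transcript receives the count.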
The final gene expression is the sum of the expressions of the transcripts for that gene. When the option Count intronic reads is enabled, expression from introns and UTRs is also included.
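The aggregation to gene level can be sketched as below. The dictionary-based inputs and the optional `intronic_counts` argument, which stands in for reads counted only when Count intronic reads is enabled, are hypothetical and chosen for illustration.

```python
def gene_expression(transcript_counts, transcript_to_gene, intronic_counts=None):
    """Sum transcript-level counts into gene-level counts.

    transcript_counts: dict mapping transcript id to its count.
    transcript_to_gene: dict mapping transcript id to gene id.
    intronic_counts: optional dict mapping gene id to the count of
    reads assigned to the gene but not to any known transcript.
    """
    totals = {}
    for transcript, count in transcript_counts.items():
        gene = transcript_to_gene[transcript]
        totals[gene] = totals.get(gene, 0) + count
    # Only supplied when intronic reads are being counted.
    for gene, count in (intronic_counts or {}).items():
        totals[gene] = totals.get(gene, 0) + count
    return totals
```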