Single Cell RNA-Seq Analysis
Single Cell RNA-Seq Analysis can be found in the Toolbox here:
Gene Expression | Cell Preparation | Single Cell RNA-Seq Analysis
The tool takes as input one or more sequence lists of reads that have been annotated using Annotate Reads with Cell and UMI. It outputs an Expression Matrix with spliced and unspliced counts for gene expression and, optionally, an Expression Matrix for transcript expression, a report, and unmapped reads.
Sample: All input sequence lists must originate from the same sample, which is set when executing the Annotate Reads with Cell and UMI tool (see Annotate Reads with Cell and UMI). This is because Single Cell RNA-Seq Analysis assumes that reads with the same cell barcode that are present in different inputs represent the same cell. The wizard does not allow executing the tool with inputs that are annotated with different samples.
It is important to provide all the data for a sample to Single Cell RNA-Seq Analysis at the same time. For example, if one sample was sequenced on 4 lanes of an Illumina sequencer, then all 4 lanes should be supplied together. This allows reads originating from the same cell, but coming from different lanes, to be analyzed jointly such that amplification duplicates are detected using UMIs and only give one count in the output Expression Matrix.
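The effect of supplying all lanes together can be illustrated with a minimal sketch. It is not the tool's implementation: the record layout `(cell_barcode, umi, gene)`, the function name `count_expression`, and the choice to treat reads as duplicates only when barcode, UMI and gene all match are assumptions made for illustration.

```python
from collections import defaultdict


def count_expression(lanes, group_by_umis=True):
    """Count reads per (cell barcode, gene) across all lanes of one sample.

    `lanes` is an iterable of lanes; each lane is an iterable of
    (cell_barcode, umi, gene) tuples. This layout is hypothetical.
    """
    seen = set()               # (barcode, umi, gene) keys already counted
    counts = defaultdict(int)  # (barcode, gene) -> count
    for lane in lanes:
        for barcode, umi, gene in lane:
            key = (barcode, umi, gene)
            if group_by_umis:
                if key in seen:
                    continue   # amplification duplicate: counted only once
                seen.add(key)
            counts[(barcode, gene)] += 1
    return dict(counts)


# The same molecule (barcode AAAC, UMI1) was sequenced on both lanes.
lane1 = [("AAAC", "UMI1", "GeneA"), ("AAAC", "UMI2", "GeneA")]
lane2 = [("AAAC", "UMI1", "GeneA"), ("TTTG", "UMI1", "GeneA")]

pooled = count_expression([lane1, lane2])
per_lane = [count_expression([lane]) for lane in (lane1, lane2)]
```

In this toy input, pooling gives cell AAAC a count of 2 for GeneA. Analyzing the lanes separately and summing afterwards would give 3, because the duplicate in the second lane cannot be recognized without the first.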
Matrix with spliced and unspliced counts: The Expression Matrix with spliced and unspliced counts is an extension of the Expression Matrix containing separate information about the spliced and unspliced reads for each cell and gene. Reads mapping to transcripts are counted towards the spliced expression of a gene, while reads mapping to a gene but not a transcript, such as introns of known transcripts, or upstream/downstream of known transcripts, are counted towards the unspliced expression. The Expression Matrix with spliced and unspliced counts can be used as input to any tool that accepts an Expression Matrix.
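The counting rule can be sketched as follows. This is only an illustration: the boolean flags `hits_transcript` and `hits_gene` and both function names are hypothetical, and the tool's real mapping logic is not shown. The `include_intronic` flag mirrors the "Include intronic reads in total expression" option of the tool.

```python
from collections import Counter


def classify_read(hits_transcript, hits_gene):
    """Return which expression a mapped read contributes to, or None."""
    if hits_transcript:
        return "spliced"    # read maps to a known transcript
    if hits_gene:
        return "unspliced"  # read maps to the gene but not to a transcript
    return None             # read does not map to any gene


def total_expression(spliced, unspliced, include_intronic=False):
    """Total expression of a gene in a cell from its two counts."""
    return spliced + unspliced if include_intronic else spliced


# Four reads from one cell and gene: (hits_transcript, hits_gene)
reads = [(True, True), (False, True), (False, True), (False, False)]
labels = (classify_read(t, g) for t, g in reads)
counts = Counter(label for label in labels if label is not None)

default_total = total_expression(counts["spliced"], counts["unspliced"])
snrna_total = total_expression(counts["spliced"], counts["unspliced"],
                               include_intronic=True)
```

Here one read is spliced and two are unspliced, so the total expression is 1 by default and 3 when intronic reads are included.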
Filtering: The output matrix should be filtered by QC for Single Cell before being used in any other tool in the CLC Single Cell Analysis Module. This is because sequencing errors often lead to many barcodes that have few counts, and which do not represent real cells. If no filtering is performed, the large number of barcodes can cause downstream tools to run extremely slowly and results can be negatively affected by the added noise.
Barcode whitelists: In some protocols, the set of valid barcodes is known in advance, and available as a barcode whitelist. In CLC Single Cell Analysis Module, it is not possible to directly use such a list. Instead, QC for Single Cell is usually able to detect the barcodes that correspond to cells using the Empty droplets filter (see Empty droplets filter), and to prevent specific barcodes from being filtered away (see Choosing barcodes to retain).
The tool requires a genome, supplied as References, together with a Gene track and a corresponding mRNA track. These data can be obtained in two ways:
- Directly downloaded as tracks using the Reference Data Manager (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Download_Genomes.html).
- Imported as tracks from fasta and gff/gff3/gtf files (see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_tracks.html).
The following additional options are available:
- Use spike-in controls. Includes spike-in controls in the output, which can be used downstream in the QC for Single Cell tool. A spike-in section is also added to the report produced by this tool.
- Spike-in controls. The spike-in controls to use. To learn how to import spike-in control files, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_RNA_spike_in_controls.html.
- Strand setting. This option controls whether the reads should be mapped in the same orientation as the transcript from which they originate (forward), in the reverse direction (reverse), or in both directions (both). The 'forward' and 'reverse' options allow assignment of reads to the correct gene in cases where overlapping genes are located on different strands. Without a strand-specific protocol, this would not be possible (see [Parkhomchuk et al., 2009]). For many single cell library preparations, one read of a pair, which is usually discarded, binds to the polyA tail of transcripts. This means that the remaining read should usually be mapped with the strand-specific setting 'forward'.
- Coverage bias. The expected coverage bias determines whether it is possible to produce an Expression Matrix for transcript expressions, and also affects the quality control applied to the 'Gene/transcript length coverage' section of the report.
  - Unbiased. An Expression Matrix for transcript expressions can be produced. The expected coverage is uniform across the bodies of transcripts.
  - Targeted. An Expression Matrix for transcript expressions cannot be produced. This is because several transcripts may be amplified by the same primers, meaning it is often not possible to determine the transcript of origin for a read. The expected coverage has no particular bias.
  - 3' bias. An Expression Matrix for transcript expressions cannot be produced. This is because several transcripts may end at the same genomic position, meaning it is often not possible to determine the transcript of origin for a read. The expected coverage is 3' biased.
- Include intronic reads in total expression. By default, the total expression of a gene is given by the spliced expression. When this option is enabled, the total expression is set instead to the sum of spliced and unspliced counts. This option is recommended for single nucleus RNA sequencing (snRNA-Seq), where data is usually analyzed by counting expression from both exons and introns [Bakken et al., 2018].
- Group by UMIs. When enabled, reads with the same cell barcode and UMI are counted as 1 such that the output expressions have no amplification bias. When disabled, reads with the same cell barcode and UMI are counted separately.
- Output report. When enabled, a detailed report is produced (see The Single Cell RNA-Seq Analysis report).
- Output transcript matrix. When enabled, an Expression Matrix for transcript expressions is produced.
- Output unmapped reads. When enabled, up to two lists with reads that did not map are produced, one for paired reads and one for single reads. The unmapped reads consist of those reads that did not map to the reference at all, or that mapped equally well to more than 10 distinct places in the reference sequence.
For paired reads, pairs that mapped to different genes are output in the unmapped paired reads list, while members of broken pairs are output in the unmapped single reads list.
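Two of the rules in the option list above are simple enough to express as a short sketch: which coverage bias settings permit a transcript-level Expression Matrix, and when a read is treated as unmapped. The enum, the function names and the `equal_best_hits` argument are hypothetical and are not part of the tool's interface. Only the three setting names and the threshold of 10 come from the text.

```python
from enum import Enum

# From the text: a read that maps equally well to more than 10 distinct
# places in the reference is treated as unmapped.
MAX_EQUAL_HITS = 10


class CoverageBias(Enum):
    UNBIASED = "Unbiased"
    TARGETED = "Targeted"
    THREE_PRIME = "3' bias"


def transcript_matrix_available(bias):
    """Only the Unbiased setting permits a transcript Expression Matrix."""
    return bias is CoverageBias.UNBIASED


def is_unmapped(equal_best_hits):
    """`equal_best_hits` is the number of distinct, equally good mapping
    locations found for a read (0 means the read did not map at all)."""
    return equal_best_hits == 0 or equal_best_hits > MAX_EQUAL_HITS


availability = {b.value: transcript_matrix_available(b) for b in CoverageBias}
unmapped_flags = [is_unmapped(n) for n in (0, 1, 10, 11)]
```

With these definitions, only "Unbiased" maps to `True` in `availability`, and a read with exactly 10 equally good locations is still considered mapped while one with 11 is not.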