Single Cell RNA-Seq Analysis
Single Cell RNA-Seq Analysis can be found in the Toolbox here:
Cell Preparation | Single Cell RNA-Seq Analysis
The tool takes as input one or more lists of reads that have been annotated using Annotate Reads with Cell and UMI. It outputs an Expression Matrix for gene expressions, and optionally an Expression Matrix for transcript expressions and a report.
Note: The output Expression Matrix should be filtered by QC for Single Cell before being used in any other tool in the CLC Single Cell Analysis Module. This is because sequencing errors often produce many barcodes that have few counts and do not represent real cells. If no filtering is performed, the large number of barcodes can cause downstream tools to run extremely slowly, and results can be negatively affected by the added noise.
Barcode whitelists: In some protocols, the set of valid barcodes is known in advance and is available as a barcode whitelist. In the CLC Single Cell Analysis Module, it is not possible to use such a list directly. Instead, QC for Single Cell is usually able to detect the barcodes that correspond to cells using the Empty droplets filter, and it can prevent specific barcodes from being filtered away (see Choosing barcodes to retain).
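The motivation for filtering given in the notes above can be illustrated with a minimal Python sketch. It is not how QC for Single Cell works: the helper `barcodes_above`, the barcode sequences, the totals and the threshold of 100 are all hypothetical, and a plain count threshold stands in for the tool's own filters purely to show how low-count barcodes inflate the matrix.

```python
def barcodes_above(total_counts, min_count):
    """Return the barcodes whose total count reaches min_count (hypothetical helper)."""
    return {barcode for barcode, count in total_counts.items() if count >= min_count}

# Hypothetical totals per barcode: two real cells, and two barcodes that
# differ from them by a single base and carry only a handful of counts.
totals = {
    "AAACCTGA": 5200,
    "AAACCTGT": 3,
    "TTGCATGC": 4100,
    "TTGCATGA": 1,
}

# Arbitrary threshold chosen for illustration only.
kept = barcodes_above(totals, min_count=100)
print(sorted(kept))  # ['AAACCTGA', 'TTGCATGC']
```

In this toy example, half of the barcodes are removed while almost all counts are retained.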
It is important to provide all the data for a sample to the tool at the same time. For example, if one sample was sequenced on 4 lanes of an Illumina sequencer, then all 4 lanes should be supplied together. This allows reads originating from the same cell with the same UMI, but coming from different lanes, to be detected as amplification duplicates, such that they only give one count in the output Expression Matrix.
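The effect of supplying all lanes together can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the tool's implementation: each read is reduced to a (cell barcode, UMI, gene) triple, and the helper `count_umis` and the example reads are hypothetical.

```python
from collections import defaultdict

def count_umis(reads):
    """Count distinct UMIs per (cell barcode, gene) pair."""
    umis = defaultdict(set)
    for barcode, umi, gene in reads:
        umis[(barcode, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}

# Hypothetical reads: the molecule with cell barcode AAAC and UMI GGTT
# was sequenced on both lanes.
lane1 = [("AAAC", "GGTT", "GeneA"), ("AAAC", "CCAA", "GeneA")]
lane2 = [("AAAC", "GGTT", "GeneA")]

# All lanes supplied together: the duplicate across lanes is detected.
together = count_umis(lane1 + lane2)

# Lanes processed separately and summed afterwards: the duplicate is counted twice.
separate = defaultdict(int)
for lane in (lane1, lane2):
    for key, n in count_umis(lane).items():
        separate[key] += n

print(together[("AAAC", "GeneA")])  # 2
print(separate[("AAAC", "GeneA")])  # 3
```

Processing the lanes separately gives GeneA a count of 3 in this cell, although only 2 distinct molecules were observed.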
The tool requires a genome, supplied as References, as well as a Gene track and a corresponding mRNA track. These data can be obtained in two ways:
- Directly downloaded as tracks using the Reference Data Manager (see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Download_Genomes.html).
- Imported as tracks from fasta and gff/gff3/gtf files (see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_tracks.html).
The following additional options are available:
- Use spike-in controls. Includes spike-in controls in the output, which can be used downstream in the QC for Single Cell tool. A spike-in section is also added to the report produced by this tool.
- Spike-in controls. The spike-in controls to use. To learn how to import spike-in control files, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_RNA_spike_in_controls.html.
- Strand setting. This option controls whether the reads should be mapped in the same orientation as the transcript from which they originate (forward), in the reverse orientation (reverse), or in both orientations (both). The 'forward' and 'reverse' options allow assignment of reads to the correct gene in cases where overlapping genes are located on different strands. Without a strand-specific protocol, this would not be possible (see [Parkhomchuk et al., 2009]). For many single cell library preparations, one read of a pair, which is usually discarded, binds to the polyA tail of transcripts. This means that the remaining read should usually be mapped with the strand-specific 'forward' setting.
- Coverage bias. The expected coverage bias determines whether it is possible to produce an Expression Matrix for transcript expressions, and also affects the quality control applied to the 'Gene/transcript length coverage' section of the report.
  - Unbiased. An Expression Matrix for transcript expressions can be produced. The expected coverage is uniform across the bodies of transcripts.
  - Targeted. An Expression Matrix for transcript expressions cannot be produced. This is because several transcripts may be amplified by the same primers, meaning it is often not possible to determine the transcript of origin for a read. The expected coverage has no particular bias.
  - 3' bias. An Expression Matrix for transcript expressions cannot be produced. This is because several transcripts may end at the same genomic position, meaning it is often not possible to determine the transcript of origin for a read. The expected coverage is 3' biased.
- Count intronic reads. By default, reads are only counted towards the expression of a gene if they map to transcripts. When this option is enabled, reads are additionally counted if they map to a gene but not to a transcript. Such reads may map to introns of known transcripts, or be upstream/downstream of known transcripts. This option is recommended for single nucleus RNA sequencing (snRNA-seq), where data is usually analyzed by counting expression from both exons and introns [Bakken et al., 2018].
- Group by UMIs. When enabled, reads with the same cell barcode and UMI are counted as 1 such that the output expressions have no amplification bias. When disabled, reads with the same cell barcode and UMI are counted separately.
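How the Count intronic reads and Group by UMIs options interact can be shown with a minimal Python sketch. It is an illustration only, not the tool's implementation: the function `gene_counts`, its parameter names, the `in_transcript` field and the example reads are hypothetical, and each read is assumed to have already been assigned to a gene.

```python
from collections import Counter

def gene_counts(reads, group_by_umis=True, count_intronic=False):
    """Return counts per (cell barcode, gene) under the two counting options."""
    # Without count_intronic, only reads that map to a transcript are kept.
    kept = [r for r in reads if count_intronic or r["in_transcript"]]
    if group_by_umis:
        # Reads sharing a cell barcode, UMI and gene collapse to one count.
        unique = {(r["barcode"], r["umi"], r["gene"]) for r in kept}
        return Counter((barcode, gene) for barcode, _umi, gene in unique)
    # Every kept read is counted separately.
    return Counter((r["barcode"], r["gene"]) for r in kept)

# Hypothetical reads from one cell: two amplification duplicates (UMI GGTT),
# one read mapping to the gene but not to a transcript, and one further read.
reads = [
    {"barcode": "AAAC", "umi": "GGTT", "gene": "GeneA", "in_transcript": True},
    {"barcode": "AAAC", "umi": "GGTT", "gene": "GeneA", "in_transcript": True},
    {"barcode": "AAAC", "umi": "CCAA", "gene": "GeneA", "in_transcript": False},
    {"barcode": "AAAC", "umi": "TTGG", "gene": "GeneA", "in_transcript": True},
]

key = ("AAAC", "GeneA")
print(gene_counts(reads)[key])                                           # 2
print(gene_counts(reads, group_by_umis=False)[key])                      # 3
print(gene_counts(reads, count_intronic=True)[key])                      # 3
print(gene_counts(reads, group_by_umis=False, count_intronic=True)[key]) # 4
```

With the defaults assumed here, the duplicate pair contributes a single count and the read outside known transcripts is ignored.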