Import Expression Matrix
Several formats can be imported into an Expression Matrix using the following importers:
- AnnData: Import Expression Matrix in AnnData format;
- Cell Ranger HDF5: Import Expression Matrix in Cell Ranger HDF5 format;
- CSV: Import Expression Matrix in CSV/TXT format;
- h5Seurat: Import Expression Matrix in h5Seurat format;
- Loom: Import Expression Matrix in Loom format;
- MEX: Import Expression Matrix in MEX format;
- MEX archive: Import Expression Matrix in MEX format (archive);
- Parse Biosciences MTX: Import Expression Matrix in ParseBio MTX format.
The importers can be found here:
Import | Single Cell Data | Import Expression Matrix
Some other commonly encountered formats are specific to a programming language or software package. These can usually be exported from that software package as Loom files. For example, .rds/.Robj formats are from the R programming language and can often be written to Loom using the LoomR package, or methods in the same R package that was used to generate the files.
General options
The following options are common to all expression matrix importers:
- Gene or Transcript track. Genes or transcripts in the imported data are matched with features in the provided track to the extent possible. When a match is found, the genomic coordinates of the gene/transcript will be recovered. Matches are only found when the identification of the gene/transcript in the imported data with the feature in the track is unambiguous: one-to-many and many-to-one matches between the imported data and the provided track are not supported. This means, for example, that if a gene is present on two chromosomes of the track, then neither set of genomic coordinates will be recovered.
Matching is used to:
- View the Expression Matrix as a Track. For more information on tracks, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Tracks.html.
- Define the mitochondrial chromosome when calculating the proportion of reads mapped to mitochondria in the QC for Single Cell tool (see Count-based and extra-chromosomal filters).
- Recover identifiers (e.g. ENSG00000243485 for ENSEMBL genes) when these are not present in the input data. As identifiers are often more specific than, for example, gene names, this can help when training Cell Type Classifiers using the Train Cell Type Classifier tool (see Train Cell Type Classifier), and when predicting cell types using a Cell Type Classifier.
The matching algorithm works by choosing an approach from the following list that maximizes the number of one-to-one matches between features in the provided track and features in the imported data:
- Matching names from the track with identifiers from the imported data
- Matching identifiers from the track with identifiers from the imported data
- Matching names from the track with unversioned identifiers from the imported data. An unversioned identifier is obtained by removing the first '.' in the identifier and everything after it. For example, ENSG00000243485 is the unversioned identifier for ENSG00000243485.5.
- Matching identifiers from the track with unversioned identifiers from the imported data
- Matching names from the track with names from the imported data
- Matching identifiers from the track with names from the imported data
In the case of a tie, the first equally good approach from the above list is used. If no matches are found, check that the correct Gene or Transcript track has been supplied.
- Spike-in controls (Optional). Genes or transcripts in the imported data are also matched against the spike-in controls provided here. This is used when calculating the proportion of reads mapped to spike-in controls in the QC for Single Cell tool (see Count-based and extra-chromosomal filters). It is also used to remove the spike-in controls from downstream analysis. For details on how to import spike-in controls, see http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_RNA_spike_in_controls.html.
- Cell format and Sample. These options specify how cells are identified. See Cell format in importers for more details. When a file contains multiple samples, it is recommended to extract the sample name from the cell name. This allows the QC for Single Cell tool to process each sample separately, enables coloring of cells by sample in the Dimensionality Reduction Plot, and may simplify configuration of batch correction.
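The matching procedure described under the Gene or Transcript track option can be illustrated with a minimal Python sketch. This is not the importer's actual implementation: the function names, the dictionary layout with 'name' and 'id' keys, and the approach labels are all hypothetical and chosen only for illustration. The sketch counts one-to-one matches for each of the six approaches, in the order listed above, and keeps the first approach with the highest count.

```python
from collections import Counter


def unversioned(identifier):
    """Remove the first '.' and everything after it."""
    return identifier.split(".", 1)[0]


def one_to_one_matches(track_keys, data_keys):
    """Keys occurring exactly once on both sides; missing keys are ignored."""
    track_counts = Counter(k for k in track_keys if k)
    data_counts = Counter(k for k in data_keys if k)
    return {k for k, n in track_counts.items()
            if n == 1 and data_counts.get(k) == 1}


def _unversioned_id(feature):
    identifier = feature.get("id")
    return unversioned(identifier) if identifier else None


# The six approaches, in the order given in the text (labels are hypothetical).
APPROACHES = [
    ("track names vs data identifiers", "name", lambda f: f.get("id")),
    ("track identifiers vs data identifiers", "id", lambda f: f.get("id")),
    ("track names vs unversioned data identifiers", "name", _unversioned_id),
    ("track identifiers vs unversioned data identifiers", "id", _unversioned_id),
    ("track names vs data names", "name", lambda f: f.get("name")),
    ("track identifiers vs data names", "id", lambda f: f.get("name")),
]


def choose_approach(track, data):
    """Return (label, matched keys) for the approach with most one-to-one matches.

    A strict '>' comparison means the earliest approach wins a tie.
    Returns (None, set()) when no approach finds any match.
    """
    best_label, best_matches = None, set()
    for label, track_field, data_key in APPROACHES:
        matches = one_to_one_matches(
            [feature.get(track_field) for feature in track],
            [data_key(feature) for feature in data],
        )
        if len(matches) > len(best_matches):
            best_label, best_matches = label, matches
    return best_label, best_matches
```

In this sketch, a key that appears twice in the track (for example, a gene present on two chromosomes) is excluded from the matches, mirroring the rule that one-to-many and many-to-one matches are not supported.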
Options for importing cell annotations and clusters
AnnData, h5Seurat, Loom, and ParseBio MTX can contain metadata about cells, and this can be imported as Cell Annotations or Cell Clusters. These importers share the following options:
- Create clusters for. A comma-separated list of attributes to be imported as Cell Clusters. Any other cell metadata will be imported as Cell Annotations.
- Map clusters to QIAGEN Cell Ontology. When this is enabled, clusters will be translated, if possible, to the QIAGEN Cell Ontology (see The QIAGEN Cell Ontology). The translation attempts to match each cluster with a QIAGEN cell type based on the name and known synonyms. For example, 'alveolar epithelial cells' are also called 'pneumocytes'. If this option is selected, the 'alveolar epithelial cells' cluster, if present, will be named 'pneumocytes'. This option can be useful when standardizing clusters from different sources. It is especially recommended if clusters will be used to extend a QIAGEN Cell Type Classifier using the Train Cell Type Classifier tool (see Train Cell Type Classifier).
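The effect of these two options can be illustrated with a small Python sketch. It is an illustration only: the function names, the metadata layout, and the one-entry synonym table are hypothetical, and the case-insensitive lookup is an assumption rather than a description of how the importer matches names against the QIAGEN Cell Ontology.

```python
def split_metadata(metadata, create_clusters_for):
    """Split per-cell metadata into clusters and annotations.

    metadata: dict mapping attribute name -> list of per-cell values.
    create_clusters_for: comma-separated attribute names, as in the option.
    """
    wanted = {name.strip() for name in create_clusters_for.split(",") if name.strip()}
    clusters = {k: v for k, v in metadata.items() if k in wanted}
    annotations = {k: v for k, v in metadata.items() if k not in wanted}
    return clusters, annotations


# Hypothetical synonym table containing only the example from the text.
SYNONYMS = {"alveolar epithelial cells": "pneumocytes"}


def map_to_ontology(labels, synonyms=SYNONYMS):
    """Rename labels with a known synonym; leave all other labels unchanged."""
    return [synonyms.get(label.strip().lower(), label) for label in labels]
```

For example, with "cell_type, batch" as the list of attributes, any other attribute in the metadata ends up among the annotations, and a cluster label without a known synonym keeps its original name.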
Options for importing spliced and unspliced counts
Loom and MEX formats can contain total expression as well as spliced and unspliced counts. The options below determine which data is imported and whether the importer produces an Expression Matrix or an Expression Matrix with spliced and unspliced counts.
- Import expressions. Enables import of total expression from the relevant file. This is needed when:
- spliced/unspliced counts are not available;
- the total expression of a gene cannot be obtained purely from the spliced and, if selected, unspliced counts, for example because the expression has been normalized.
- Import spliced/unspliced. Enables import of spliced and unspliced counts from the relevant file(s). If the file(s) do not contain spliced/unspliced counts, the import will fail with a relevant message.
- Include unspliced counts in total expression. By default, the total expression of a gene is obtained from the spliced counts. When this option is enabled, the unspliced counts are also added to the total expression. This option is recommended for single nucleus RNA sequencing (snRNA-Seq), where data is usually analyzed by counting expression from both exons and introns [Bakken et al., 2018]. This option has no effect when both 'Import expressions' and 'Import spliced/unspliced' are enabled, where the total expression is read directly from the file.
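How the three options interact when determining the total expression of a gene can be summarized in a short Python sketch. The function and parameter names are hypothetical, and counts are represented as plain lists with one value per cell; the sketch only restates the rules described above.

```python
def total_expression(spliced, unspliced, include_unspliced=False, file_total=None):
    """Total expression per cell for one gene.

    file_total models 'Import expressions' being enabled together with
    'Import spliced/unspliced': the total is then read from the file and
    include_unspliced has no effect.
    """
    if file_total is not None:
        return list(file_total)
    if include_unspliced:
        # Add unspliced counts to spliced counts, cell by cell.
        return [s + u for s, u in zip(spliced, unspliced)]
    # Default: total expression is taken from the spliced counts only.
    return list(spliced)
```

With spliced counts [3, 0, 2] and unspliced counts [1, 4, 0], the default gives [3, 0, 2], while enabling the inclusion of unspliced counts gives [4, 4, 2].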