Import Expression Matrix
The following expression matrix formats can be imported into an Expression Matrix:
- Cell Ranger HDF5
- CSV
- Loom
- MEX
- MEX archive
Some other commonly encountered formats are specific to a programming language or software package. These can usually be exported from that software package as Loom files. For example:
- AnnData (h5ad) This format is defined by the AnnData package, and is used by Scanpy. It can be written to Loom using the 'write_loom' method of the same package.
- .rds/.Robj Data formats from the R programming language. These can often be written to Loom using the LoomR package, or methods in the same R package that was used to generate the files.
Options common to all importers
Several options are common to all expression matrix importers. Figure 2.1 shows the Cell Ranger HDF5 importer, which only contains these general options.
Figure 2.1: The Cell Ranger HDF5 importer. The General options are common to all the expression matrix importers.
- Gene or Transcript track Genes or transcripts in the imported data are matched with features in the provided track to the extent possible. When a match is found, the genomic coordinates of the gene/transcript will be recovered. Matches are only found when the identification of the gene/transcript in the imported data with the feature in the track is unambiguous: one-to-many and many-to-one matches between the imported data and the provided track are not supported. This means, for example, that if a gene is present on two chromosomes of the track, then neither set of genomic coordinates will be recovered.
Matching is used to:
- View the Expression Matrix as a Track. For more information on tracks, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Tracks.html.
- Define the mitochondrial chromosome when calculating the proportion of reads mapped to mitochondria in the QC for Single Cell tool (see Count-based and extra-chromosomal filters).
- Recover identifiers (e.g. ENSG00000243485 for ENSEMBL genes) when these are not present in the input data. As identifiers are often more specific than e.g. gene names, this can help when training Cell Type Classifiers using the Train Cell Type Classifier tool, and when predicting cell types using a Cell Type Classifier.
The matching algorithm works by choosing an approach from the following list that maximizes the number of one-to-one matches between features in the provided track and features in the imported data:
- Matching names from the track with identifiers from the imported data
- Matching identifiers from the track with identifiers from the imported data
- Matching names from the track with unversioned identifiers from the imported data. An unversioned identifier is obtained by removing the first '.' in the identifier and everything after it. For example, ENSG00000243485 is the unversioned identifier for ENSG00000243485.5.
- Matching identifiers from the track with unversioned identifiers from the imported data
- Matching names from the track with names from the imported data
- Matching identifiers from the track with names from the imported data
In the case of a tie, the first equally good approach from the above list is used. If no matches are found, check that the correct Gene or Transcript track has been supplied.
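The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the importer's actual implementation: the function names, the dict layout with 'name' and 'id' keys, and the toy records used in the usage note are all assumptions made for this example.

```python
from collections import Counter


def unversioned(identifier):
    """Drop the first '.' and everything after it."""
    return identifier.split(".", 1)[0]


def one_to_one(track_keys, data_keys):
    """Keys that occur exactly once in the track and exactly once in the data."""
    track_counts = Counter(k for k in track_keys if k)
    data_counts = Counter(k for k in data_keys if k)
    return {k for k, n in track_counts.items() if n == 1 and data_counts.get(k) == 1}


def choose_approach(track, data):
    """Return (label, matches) for the approach with the most one-to-one matches.

    `track` and `data` are lists of dicts with 'name' and 'id' keys
    (a hypothetical layout chosen for this sketch).
    """
    # The loop order reproduces the six approaches in the order listed above.
    data_views = [
        ("identifiers", lambda f: f.get("id") or ""),
        ("unversioned identifiers", lambda f: unversioned(f.get("id") or "")),
        ("names", lambda f: f.get("name") or ""),
    ]
    track_views = [
        ("names", lambda f: f.get("name") or ""),
        ("identifiers", lambda f: f.get("id") or ""),
    ]
    best = None
    for data_label, data_key in data_views:
        for track_label, track_key in track_views:
            matches = one_to_one(
                [track_key(f) for f in track], [data_key(f) for f in data]
            )
            # Strict '>' keeps the earlier approach when two approaches tie.
            if best is None or len(matches) > len(best[1]):
                best = (f"track {track_label} vs data {data_label}", matches)
    return best
```

For example, with a toy track holding `{"name": "geneA", "id": "ENSG00000243485"}` and imported data holding `{"name": "geneA", "id": "ENSG00000243485.5"}`, matching track identifiers against unversioned data identifiers and matching names against names both give one match, and the sketch returns the former because it comes first in the list.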
- Spike-in controls (optional) Genes or transcripts in the imported data are also matched against the spike-in controls provided here. This is used when calculating the proportion of reads mapped to spike-in controls in the QC for Single Cell tool (see Count-based and extra-chromosomal filters). It is also used to remove the spike-in controls from downstream analysis. For details on how to import spike-in controls, see https://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Import_RNA_spike_in_controls.html
- Cell format Cells are identified by a combination of their barcode (e.g. "AAGCT") and their sample name. This option allows the barcode and the sample name to be extracted separately from the name of the cell. By default, the name of the cell is used as the barcode, and the sample name is the name of the imported file.
When a file contains multiple samples, it is recommended to extract the sample name from the cell name, as this allows the QC for Single Cell to process each sample separately, enables coloring of cells by sample in the Dimensionality Reduction Plot, and may simplify configuration of batch correction.
The Cell format is specified by using a mixture of keywords and text. The keywords are shown in figure 2.2. An example of their use is shown in figure 2.3.
Figure 2.2: Keywords that can be used to specify how to extract the barcode and sample name for a cell.
Figure 2.3: The top panel shows the results of importing a file with Cell format = {barcode}. After import the sample name is the name of the file that was imported, and the barcode is the entire name of a cell. In the bottom panel, Cell format = SRX41800{sample}_filter.{barcode}. Here the sample name and the barcode are extracted from the name of the cell, and other parts of the name are discarded.
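To make the Cell format option more concrete, the following Python sketch shows one way a format string could be turned into a regular expression. It is an assumption-laden illustration, not the importer's code: it handles only the {sample} and {barcode} keywords that appear in the text (any further keywords from figure 2.2 are not covered), and the function names and the cell name in the usage note are made up for the example.

```python
import re

# Only the two keywords mentioned in the text are supported in this sketch.
KEYWORDS = {"sample", "barcode"}


def cell_format_to_regex(cell_format):
    """Translate a Cell format string into a compiled regular expression."""
    # With one capture group, re.split alternates literal text (even indices)
    # and keyword names (odd indices).
    parts = re.split(r"\{(\w+)\}", cell_format)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)  # literal text must match exactly
        elif part in KEYWORDS:
            pattern += f"(?P<{part}>.+?)"  # keyword captures the variable part
        else:
            raise ValueError(f"unsupported keyword: {{{part}}}")
    return re.compile(pattern)


def parse_cell_name(cell_name, cell_format, default_sample):
    """Return (sample, barcode), or None if the name does not fit the format."""
    match = cell_format_to_regex(cell_format).fullmatch(cell_name)
    if match is None:
        return None
    groups = match.groupdict()
    # Without {sample}, fall back to the supplied default (e.g. the file name).
    return groups.get("sample", default_sample), groups.get("barcode", cell_name)
```

With the format from figure 2.3, `parse_cell_name("SRX4180012_filter.AAGCT", "SRX41800{sample}_filter.{barcode}", "my_file")` yields the sample "12" and the barcode "AAGCT", while the format `{barcode}` keeps the whole cell name as the barcode and uses the default sample name.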
Details specific to the CSV importer
The CSV/TXT importer supports import of text data in a full table format.
- Table layout Choose whether the table has cells in columns and features in rows, or is transposed such that features are in columns and cells are in rows.
- Separator Choose the column separator.
Working with spreadsheets Be careful to check that all the data is present before import if the file originates from a spreadsheet program. Such programs often impose limits on the number of rows and columns.
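A quick check of the table dimensions before import can help spot truncated files. The sketch below is one possible approach using only the Python standard library. The function names are hypothetical, and the limits listed are those of Microsoft Excel worksheets (1,048,576 rows by 16,384 columns for .xlsx, 65,536 rows by 256 columns for .xls); other spreadsheet programs may differ.

```python
import csv

# Worksheet size limits of Microsoft Excel (.xlsx and legacy .xls).
ROW_LIMITS = {1_048_576, 65_536}
COLUMN_LIMITS = {16_384, 256}


def table_shape(path, separator=","):
    """Return (number of rows, set of distinct row lengths) for a text table."""
    n_rows = 0
    widths = set()
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter=separator):
            n_rows += 1
            widths.add(len(row))
    return n_rows, widths


def truncation_warnings(n_rows, widths):
    """List reasons to suspect that a spreadsheet program cut the table short."""
    warnings = []
    if len(widths) > 1:
        warnings.append("rows have different numbers of columns")
    if n_rows in ROW_LIMITS:
        warnings.append(f"row count {n_rows} equals a spreadsheet row limit")
    for width in sorted(widths & COLUMN_LIMITS):
        warnings.append(f"column count {width} equals a spreadsheet column limit")
    return warnings


# Usage (the file name is a placeholder):
# n_rows, widths = table_shape("matrix.csv")
# print(n_rows, widths, truncation_warnings(n_rows, widths))
```

A table whose size exactly equals one of these limits is not necessarily truncated, so the warnings are prompts to compare against the original data rather than proof of a problem.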
Details specific to the Loom importer
Loom allows the exchange of data between different software packages.
A Loom file has an internal structure consisting of a main matrix, optional 'layers' of the same size as the main matrix, row and column attributes (describing features and cells, respectively), and sparse graphs describing links between features or between cells. See https://linnarssonlab.org/loompy/format/index.html for details of the format.
The Loom importer expects the Loom format version 3.0.0 and imports only the main matrix, row attributes describing feature names and feature identifiers, and column attributes. All other information in the Loom file is ignored.
- Cell ID attribute A column attribute identifying the cell by its barcode and sample. The interpretation of this value is specified by the Cell format.
- Gene or transcript ID attribute A row attribute describing an identifier for a gene or transcript (e.g., ENSG00000243485 for ENSEMBL). If no identifiers are present, then it is also possible to set this to the same value as the Gene or transcript name attribute.
- Gene or transcript name attribute A row attribute describing the name for a gene or transcript. If no names are present, then it is also possible to set this to the same value as the Gene or transcript ID attribute.
- Create clusters for A comma-separated list of column attributes to be imported as Cell Clusters. These attributes must be string arrays. Any other column attributes will be imported as Cell Annotations.
Details specific to the MEX importer
The MEX importer requires three files to be supplied:
- Barcodes file A file with the extension .tsv and one row per barcode. Use the Cell format option to control how this barcode should be interpreted, for example if it also includes information about the sample.
- Feature file A file with the extension .tsv and 1, 2, or 3 tab-separated columns, with one line per feature. If only one column is supplied, it is interpreted as the feature name. If two columns are supplied, then the first column is the feature identifier and the second is the feature name. If three columns are provided, then the third column is interpreted as the type of the feature. Of the commonly used feature types, "Gene Expression", "Transcript Expression", and "Spike-in" are the most important. Other features, such as "Antibody Capture", will be silently ignored by most tools.
- Matrix file A file with the extension .mtx in the Matrix Market Exchange Coordinate Format, see https://math.nist.gov/MatrixMarket/formats.html for details of the format.
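The column rules for the Feature file can be illustrated with a short Python sketch. It mirrors only what is stated above, namely how 1, 2, or 3 tab-separated columns are interpreted. The function names and the returned dict layout are assumptions made for this example, and `None` is used here as a placeholder for values that the file does not provide.

```python
def parse_feature_line(line):
    """Interpret one line of a MEX feature file with 1, 2, or 3 columns."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) == 1:
        # A single column is the feature name.
        return {"id": None, "name": columns[0], "type": None}
    if len(columns) == 2:
        # Two columns are identifier, then name.
        return {"id": columns[0], "name": columns[1], "type": None}
    if len(columns) == 3:
        # A third column gives the feature type, e.g. "Gene Expression".
        return {"id": columns[0], "name": columns[1], "type": columns[2]}
    raise ValueError(f"expected 1, 2, or 3 columns, found {len(columns)}")


def read_features(path):
    """Read all non-empty lines of a feature file into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [parse_feature_line(line) for line in handle if line.strip()]
```

Counting the distinct values of `type` in the result is a simple way to see whether a file contains feature types, such as "Antibody Capture", that most tools will ignore.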
Additional options are:
- Name The name of the imported matrix. If Cell format is not configured to parse a sample name from each barcode in the barcodes file, then this will also be the sample name for all the imported barcodes.
- Files are in same directory This option is provided for convenience. When enabled, updating any one of the three files to a file in a new directory will lead to automatic updates of the other two files, if suitable candidates can be found in the same directory. This option only works for local files.
Details specific to the MEX archive importer
The MEX archive importer is provided for convenience. It accepts a .zip, .tar or .tar.gz file containing the three files required by the MEX importer. In order to uniquely identify each file, these must have a specific name:
- Barcodes file must be named barcodes.tsv
- Feature file must be named either features.tsv or genes.tsv
- Matrix file must be named matrix.mtx
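Before importing, an archive can be checked for the required file names with a small Python sketch such as the one below. The helper names are hypothetical. The sketch compares base names only, so files inside a subdirectory of the archive also count; whether the importer itself accepts nested files is not stated here and is an assumption of this example.

```python
import posixpath
import tarfile
import zipfile

# Accepted file names for each of the three required files.
REQUIRED = {
    "barcodes": {"barcodes.tsv"},
    "features": {"features.tsv", "genes.tsv"},
    "matrix": {"matrix.mtx"},
}


def archive_member_names(path):
    """List member names of a .zip, .tar, or .tar.gz archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    # The default tarfile mode detects gzip compression automatically.
    with tarfile.open(path) as archive:
        return archive.getnames()


def missing_mex_files(member_names):
    """Return the roles (barcodes, features, matrix) with no matching file."""
    base_names = {posixpath.basename(name) for name in member_names}
    return sorted(
        role for role, accepted in REQUIRED.items() if not (accepted & base_names)
    )


# Usage (the archive name is a placeholder):
# print(missing_mex_files(archive_member_names("sample.tar.gz")))
```

An empty result means all three required names were found; any role that is returned indicates a file that needs to be renamed or added before import.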