HDF5 formats
AnnData, Cell Ranger HDF5, h5Seurat and Loom are HDF5-based formats, each with specific requirements for how the data is structured. An HDF5 file is organized hierarchically into:
- groups, containing zero or more groups or datasets;
- datasets: multidimensional arrays of data elements.
Metadata for groups and datasets is stored in associated attribute lists. In practice, groups and datasets are often themselves interpreted semantically as attributes.
All HDF5 importers have an Expression matrix option, which specifies the HDF5 file to import.
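The hierarchy described above can be illustrated with a small in-memory model. This is only a sketch: the `Group` and `Dataset` classes and the `walk` function are hypothetical stand-ins written for illustration, not part of any HDF5 library, and the toy layout is only loosely modelled on an AnnData file.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Stand-in for an HDF5 dataset: a multidimensional array plus attributes."""
    shape: tuple
    attrs: dict = field(default_factory=dict)


@dataclass
class Group:
    """Stand-in for an HDF5 group: named children plus attributes."""
    children: dict = field(default_factory=dict)  # name -> Group or Dataset
    attrs: dict = field(default_factory=dict)


def walk(node, path="/"):
    """Yield (path, kind) for a node and for everything below it."""
    if isinstance(node, Dataset):
        yield path, "dataset"
        return
    yield path, "group"
    for name, child in node.children.items():
        yield from walk(child, path.rstrip("/") + "/" + name)


# Toy layout (assumed for illustration): 3 cells by 2 features.
root = Group(children={
    "X": Dataset(shape=(3, 2)),
    "obs": Group(children={"_index": Dataset(shape=(3,))}),
    "var": Group(children={"_index": Dataset(shape=(2,))}),
})

listing = list(walk(root))
for node_path, kind in listing:
    print(kind, node_path)
```

With a real file, a library such as h5py exposes groups as mapping-like objects, so the same kind of recursive traversal applies.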
AnnData importer
The expression matrix in an AnnData (h5ad) file is stored in a sparse dataset `X', while features and cells are described by the `var' and `obs' groups, respectively. See https://anndata.readthedocs.io/ for more details.
The `_index' attribute on group `obs' identifies each cell. How this value is interpreted is specified by the Cell format.
- Gene or transcript ID attribute (Optional). A `var' attribute describing an identifier for a gene or transcript (e.g., ENSG00000243485 for ENSEMBL).
- Gene or transcript name attribute (Optional). A `var' attribute describing the name for a gene or transcript. If left empty, the `_index' attribute on group `var' is used.
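The fallback rule for feature names can be sketched as follows. This is a minimal illustration, not the importer's actual implementation. The function name `feature_names` and the `gene_symbols` key are hypothetical, and the code assumes that `var` behaves like a mapping from attribute name to a one-dimensional sequence of strings or byte strings (a plain dict is used here; an h5py group offers a similar interface).

```python
def _as_text(values):
    """Decode byte strings (common in HDF5 files) into regular strings."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def feature_names(var, name_attribute=None):
    """Return feature names from `var`.

    If no name attribute is given, fall back to the `_index` attribute,
    mirroring the behaviour described for the optional name setting.
    """
    key = name_attribute if name_attribute else "_index"
    if key not in var:
        raise KeyError(f"attribute {key!r} not found in 'var'")
    return _as_text(var[key])


# Toy data (assumed for illustration): two features.
toy_var = {
    "_index": [b"ENSG00000243485", b"ENSG00000237613"],
    "gene_symbols": ["gene_a", "gene_b"],  # hypothetical names
}

default_names = feature_names(toy_var)
explicit_names = feature_names(toy_var, "gene_symbols")
print(default_names)
print(explicit_names)
```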
h5Seurat importer
An h5Seurat file may contain multiple assays, and each assay may contain multiple expression matrices, e.g., counts and normalized expressions. The matrices can be sparse or dense. See https://mojaveazure.github.io/seurat-disk/articles/h5Seurat-spec.html for more details.
Only one assay and matrix can be imported at a time. The h5Seurat importer expects the format version 4.0.0.
The `cell.names' attribute identifies each cell. How this value is interpreted is specified by the Cell format. If the sample is not set through Cell format or Sample, the sample for each cell is read from the `orig.ident' attribute on group `meta.data'.
The gene or transcript names are read from the `features' attribute of the selected assay.
- Assay (Optional). The name of the assay to import. If left empty, the assay in the `active.assay' attribute will be used.
- Import expressions from (Optional). The matrix for the selected assay to import. The matrix may be sparse (e.g., `counts' or `data') or dense (e.g., `scale.data'). If left empty, the importer will use `counts'.
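The defaults for the two options above can be sketched as a small selection function. This is an illustration under stated assumptions, not the importer's code: `select_matrix` is a hypothetical name, the assays are modelled as a nested dict (assay name to matrix names), and the value of the `active.assay' attribute is passed in as a plain string.

```python
def select_matrix(assays, active_assay, assay=None, matrix=None):
    """Resolve which assay and matrix to import.

    An empty assay falls back to the active assay, and an empty
    matrix falls back to 'counts', as described for the options above.
    """
    assay_name = assay or active_assay
    if assay_name not in assays:
        raise KeyError(f"assay {assay_name!r} not found")
    matrix_name = matrix or "counts"
    if matrix_name not in assays[assay_name]:
        raise KeyError(f"matrix {matrix_name!r} not found in assay {assay_name!r}")
    return assay_name, matrix_name


# Toy layout (assumed for illustration); values are placeholders for the data.
toy_assays = {
    "RNA": {"counts": None, "data": None, "features": None},
    "SCT": {"counts": None, "data": None, "scale.data": None, "features": None},
}

selection_default = select_matrix(toy_assays, active_assay="RNA")
selection_explicit = select_matrix(
    toy_assays, active_assay="RNA", assay="SCT", matrix="scale.data"
)
print(selection_default)
print(selection_explicit)
```

Since only one assay and matrix can be imported at a time, the function returns a single (assay, matrix) pair.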
Loom importer
A Loom file has an internal structure consisting of a main matrix, optional `layers' of the same size as the main matrix, and row and column attributes (describing features and cells, respectively). See https://linnarssonlab.org/loompy/format/index.html for details on the format.
The Loom importer expects the Loom format version 3.0.0.
- Spliced layer. The layer where the spliced counts are stored.
- Unspliced layer. The layer where the unspliced counts are stored.
- Cell ID attribute (Optional). A column attribute identifying the cell by its barcode and sample. The interpretation of this value is specified by the Cell format.
- Gene or transcript ID attribute (Optional). A row attribute describing an identifier for a gene or transcript (e.g., ENSG00000243485 for ENSEMBL).
- Gene or transcript name attribute. A row attribute describing the name for a gene or transcript. If the file contains no names, this can be set to the same value as the Gene or transcript ID attribute.
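To show how these options relate to the file layout, the sketch below maps option values to HDF5 paths, using the `/matrix', `/layers', `/row_attrs' and `/col_attrs' locations from the Loom format specification. The function name `loom_paths' is hypothetical, and the example values (`spliced', `unspliced', `Gene', `Accession', `CellID') are assumptions about a typical file, not requirements of the importer.

```python
def loom_paths(spliced_layer, unspliced_layer, name_attr,
               id_attr=None, cell_id_attr=None):
    """Map importer option values to the HDF5 paths they would refer to."""
    paths = {
        "main": "/matrix",
        "spliced": f"/layers/{spliced_layer}",
        "unspliced": f"/layers/{unspliced_layer}",
        "gene_name": f"/row_attrs/{name_attr}",  # rows describe features
    }
    if id_attr:
        paths["gene_id"] = f"/row_attrs/{id_attr}"
    if cell_id_attr:
        paths["cell_id"] = f"/col_attrs/{cell_id_attr}"  # columns describe cells
    return paths


# Example values (assumed for illustration).
paths = loom_paths("spliced", "unspliced", "Gene",
                   id_attr="Accession", cell_id_attr="CellID")
for option, hdf5_path in paths.items():
    print(option, hdf5_path)

# With no separate names in the file, the name option can reuse the ID attribute.
paths_without_names = loom_paths("spliced", "unspliced", "Accession",
                                 id_attr="Accession")
```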