HDF5 formats

AnnData, Cell Ranger HDF5, h5Seurat and Loom are HDF5-based formats, each with specific requirements on how the data is structured. An HDF5 file is organized as a hierarchy of groups, which can contain other groups or datasets, and datasets, which hold the data.

Metadata for groups and datasets is stored in associated attribute lists. Groups and datasets can often themselves be interpreted semantically as attributes.
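The hierarchy described above can be inspected outside the importers. The following is a minimal sketch, assuming the third-party h5py package is installed; the helper name `describe_hdf5` and the file name used below are hypothetical and are not part of the importers.

```python
import h5py


def describe_hdf5(path):
    """Return one line per object in the file: kind, full name, attribute names."""
    lines = []

    def visit(name, obj):
        # Every object below the root is either a group or a dataset.
        kind = "group" if isinstance(obj, h5py.Group) else "dataset"
        # Metadata lives in the attribute list attached to the object.
        attrs = sorted(obj.attrs.keys())
        lines.append(f"{kind}\t/{name}\tattrs={attrs}")

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return lines
```

Calling `print("\n".join(describe_hdf5("example.h5")))` on a hypothetical file lists each group and dataset together with the names of its attributes, which gives roughly the same overview as the preview described below.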

The HDF5 file

All HDF5 importers contain an Expression matrix option, used for specifying the HDF5 file to be imported.

The AnnData, h5Seurat and Loom importers can be customized to import different attributes from the file. These attributes can be previewed by clicking the `Preview' button.

The preview (figure 4.4) shows the available attributes in a table. One column corresponds to one attribute, for either features or cells, as selected in the menu to the left.

Figure 4.4: Previewing cell attributes found under the `obs' group for the GSE201257 AnnData expression matrix from the Gene Expression Omnibus repository. The `_index' attribute defines the barcode, as shown in the tooltip.

Hovering the cursor over a column name, either at the top of the table or on the menu to the right, displays a tooltip showing the type of data stored in the attribute (for example, boolean or integer) and whether the attribute is always used by the importer (for example, for the barcode or the sample).

Right-clicking on the column name at the top of the table, or clicking on the edit icon, displays a menu from which the attribute can be added to or removed from relevant wizard options (figure 4.5).

Figure 4.5: Previewing feature attributes found under the `var' group for the GSE201257 AnnData expression matrix from the Gene Expression Omnibus repository. Right-clicking on the `_index' column name displays a menu.

AnnData importer

The expression matrix in an AnnData (h5ad) file is stored in the sparse dataset `X', while features and cells are described by the `var' and `obs' groups, respectively. See https://anndata.readthedocs.io/ for more details.

The `_index' attribute on the `obs' group defines the cell identification; how it is interpreted is specified by the Cell format.
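To check which identifiers an AnnData file provides before importing, the index can be read directly. This is a minimal sketch, assuming the h5py package and assuming the usual AnnData on-disk layout, in which the `obs' and `var' groups carry an `_index' attribute naming the dataset that holds the identifiers. The function name `read_anndata_index` is hypothetical.

```python
import h5py


def read_anndata_index(path, group="obs"):
    """Return the identifiers of `group` ('obs' for cells, 'var' for features)."""
    with h5py.File(path, "r") as f:
        g = f[group]
        # Assumption: the group attribute `_index` names the identifier dataset;
        # fall back to the literal name "_index" if the attribute is absent.
        index_name = g.attrs.get("_index", "_index")
        if isinstance(index_name, bytes):
            index_name = index_name.decode("utf-8")
        values = g[index_name][()]
    # HDF5 strings are typically returned as bytes; decode them to str.
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]
```

For example, `read_anndata_index("matrix.h5ad")` on a hypothetical file returns the raw cell identifiers, and `read_anndata_index("matrix.h5ad", group="var")` returns the feature identifiers.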

h5Seurat importer

An h5Seurat file may contain multiple assays, and each assay may contain multiple expression matrices, e.g., counts and normalized expressions. The matrices can be sparse or dense. See https://mojaveazure.github.io/seurat-disk/articles/h5Seurat-spec.html for more details.

Only one assay and matrix can be imported at a time. The h5Seurat importer expects the format version 4.0.0.

The `cell.names' attribute contains the cell identification, and the interpretation of this is specified by the Cell format. If the sample is not set through Cell format or Sample, the sample for each cell is read from the `orig.ident' attribute on group `meta.data'.

The gene or transcript names are read from the `features' attribute of the selected assay.
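The fallback for the sample and the lookup of feature names can be sketched as follows. This is an illustration only, assuming the h5py package and assuming the layout described in the h5Seurat specification: a top-level `cell.names' dataset, feature names under `assays/<assay>/features', and factors such as `orig.ident' stored as a group with `levels' and 1-based `values'. The function names and the `sample` parameter are hypothetical stand-ins and do not reflect how the importer is implemented.

```python
import h5py


def _as_str(values):
    """Decode HDF5 byte strings to str."""
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v) for v in values]


def read_h5seurat_cells(path, sample=None):
    """Return (cell_names, samples) with one sample label per cell.

    `sample` stands in for a sample set by the user; when it is None,
    the labels are read from meta.data/orig.ident instead.
    """
    with h5py.File(path, "r") as f:
        cells = _as_str(f["cell.names"][()])
        if sample is not None:
            return cells, [sample] * len(cells)
        ident = f["meta.data/orig.ident"]
        if isinstance(ident, h5py.Group):
            # Assumed factor encoding: 1-based integer codes into `levels`.
            levels = _as_str(ident["levels"][()])
            samples = [levels[int(i) - 1] for i in ident["values"][()]]
        else:
            samples = _as_str(ident[()])
    return cells, samples


def read_h5seurat_features(path, assay):
    """Return the gene or transcript names of one assay."""
    with h5py.File(path, "r") as f:
        return _as_str(f[f"assays/{assay}/features"][()])
```

Passing a value for `sample` mirrors the case where the sample is set explicitly; leaving it as `None` mirrors the fallback to `orig.ident'.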

Loom importer

A Loom file has an internal structure consisting of a main matrix, optional `layers' of the same size as the main matrix, and row and column attributes (describing features and cells, respectively). See https://linnarssonlab.org/loompy/format/index.html for details on the format.

The Loom importer expects the Loom format version 3.0.0.
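Before importing, the layout and version of a Loom file can be checked with a short script. This is a minimal sketch, assuming the h5py package and assuming the group names used by the Loom specification (`matrix', `layers', `row_attrs', `col_attrs'), with the version stored in `attrs/LOOM_SPEC_VERSION' for version 3 files. The function name `summarize_loom` is hypothetical.

```python
import h5py


def summarize_loom(path):
    """Summarize a Loom file and check that every layer matches the main matrix."""
    with h5py.File(path, "r") as f:
        shape = f["matrix"].shape  # (features, cells)

        # Assumption: version 3 files store the version in /attrs/LOOM_SPEC_VERSION.
        version = None
        if "attrs" in f and "LOOM_SPEC_VERSION" in f["attrs"]:
            raw = f["attrs"]["LOOM_SPEC_VERSION"][()]
            version = raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)

        # Layers are optional but must have the same size as the main matrix.
        layers = sorted(f["layers"].keys()) if "layers" in f else []
        for name in layers:
            if f["layers"][name].shape != shape:
                raise ValueError(f"layer {name!r} does not match the main matrix")

        return {
            "version": version,
            "shape": shape,
            "layers": layers,
            "row_attrs": sorted(f["row_attrs"].keys()) if "row_attrs" in f else [],
            "col_attrs": sorted(f["col_attrs"].keys()) if "col_attrs" in f else [],
        }
```

A `version` other than `3.0.0`, or `None`, in the returned summary suggests the file was written with a different version of the format than the importer expects.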