Pre-filtering data for Differential Expression

Pre-filtering data before differential expression analysis is generally not recommended. Instead, by default, a post-processing step filters out low-expression features to increase statistical power. This is described in more detail in Filtering on average expression and is also a feature of the commonly used DESeq2 package: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering
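To illustrate why filtering on average expression can increase power, here is a minimal sketch, not the actual implementation used here or in DESeq2. It assumes a fixed mean-expression cutoff (`min_mean`, a hypothetical parameter) and Benjamini-Hochberg adjustment; the p-values and means are made-up toy numbers. Dropping low-expression features, which rarely reach significance, reduces the number of tests and so lowers the adjusted p-values of the remaining features.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(pvalues, alpha=0.05):
    """Number of features with adjusted p-value at or below alpha."""
    return sum(1 for p in bh_adjust(pvalues) if p <= alpha)


# Toy data (illustrative only): (mean expression, raw p-value) per feature.
features = [
    (120.0, 0.004), (85.0, 0.012), (300.0, 0.02), (60.0, 0.03),
    (0.5, 0.5), (0.8, 0.6), (0.2, 0.7), (1.0, 0.8), (0.3, 0.9), (0.1, 0.95),
]

min_mean = 5.0  # hypothetical cutoff on average expression

n_sig_all = count_significant([p for _, p in features])
n_sig_filtered = count_significant([p for m, p in features if m >= min_mean])

print(n_sig_all, n_sig_filtered)
```

In this toy example, more features pass the 0.05 threshold after filtering than before, even though the raw p-values are unchanged.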

edgeR, another commonly used package for differential expression analysis, recommends filtering out features with low expression before performing the analysis. Its user guide gives two reasons: 1) features with very low expression are unlikely to be biologically important, and 2) the discreteness of low counts can interfere with edgeR's statistical approximations (https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). To remove features unlikely to be of biological importance, we instead recommend post-filtering results on their "max group means" column.
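As a rough sketch of post-filtering on the maximum group mean, assume the results have been exported as a list of records. The field names (`feature`, `padj`, `group_means`) and the threshold value are hypothetical and chosen only for illustration; the real results table may be laid out differently.

```python
def max_group_mean(group_means):
    """Largest per-group mean expression for one feature."""
    return max(group_means.values())


def postfilter(results, min_max_group_mean):
    """Keep rows whose highest group mean reaches the threshold."""
    return [
        row for row in results
        if max_group_mean(row["group_means"]) >= min_max_group_mean
    ]


# Toy results (illustrative only).
results = [
    {"feature": "geneA", "padj": 0.001, "group_means": {"control": 0.4, "treated": 1.1}},
    {"feature": "geneB", "padj": 0.003, "group_means": {"control": 35.0, "treated": 140.0}},
    {"feature": "geneC", "padj": 0.20, "group_means": {"control": 500.0, "treated": 480.0}},
]

kept = postfilter(results, min_max_group_mean=10.0)
kept_features = [row["feature"] for row in kept]
print(kept_features)
```

Because the filter is applied after testing, the statistics of the remaining rows are untouched; only rows with low expression in every group (here `geneA`) are dropped from the reported results.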

In rare cases, pre-filtering can be desirable to speed up calculations and reduce memory consumption. It is possible to pre-filter Gene Expression data as follows (similar steps can be used for other types of expression data):
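As a generic illustration of pre-filtering, independent of any particular tool, here is a minimal sketch. It assumes raw counts are held as a mapping from feature name to per-sample counts, and keeps a feature only if it reaches `min_count` in at least `min_samples` samples. Both thresholds and the function name are assumptions for this example, not recommended defaults.

```python
def prefilter_counts(counts, min_count=10, min_samples=3):
    """Keep features with >= min_count counts in >= min_samples samples."""
    kept = {}
    for feature, values in counts.items():
        n_expressed = sum(1 for v in values if v >= min_count)
        if n_expressed >= min_samples:
            kept[feature] = values
    return kept


# Toy count matrix (illustrative only): four samples per feature.
counts = {
    "geneA": [0, 1, 0, 2],
    "geneB": [15, 12, 30, 9],
    "geneC": [200, 180, 220, 210],
    "geneD": [1, 0, 2, 0],
}

filtered = prefilter_counts(counts)
print(sorted(filtered))
```

The smaller matrix is then passed to the differential expression step, which reduces the number of features to model and therefore the run time and memory use.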

Pre-filtering may also be desirable to remove extreme outliers. However, in most cases, the "Downweight outliers" option described in Downweighting outliers is preferable, because a gene can be differentially expressed and also have an outlier measurement.