Pre-filtering data for Differential Expression
Pre-filtering data before differential expression analysis is not generally recommended. Instead, by default, a post-processing step filters out features with low expression to increase power. This is described in more detail in Filtering on average expression and is also a feature of the commonly used DESeq2 package (https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering).
Another commonly used algorithm for differential expression analysis, edgeR, recommends filtering features with low expression before performing the analysis. This is because 1) features with very low expression are unlikely to be biologically important, and 2) the discreteness of counts can interfere with edgeR's approximations (https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). To remove features unlikely to be of biological importance, we instead recommend post-filtering results on their "max group means" column.
In rare cases, pre-filtering can be desirable to speed up calculations and reduce memory consumption. It is possible to pre-filter Gene Expression data as follows (similar steps can be used for other types of expression data):
- Run Create Expression Browser on all the samples, see Create Expression Browser.
- Export the Expression Browser to a format you can work with, for example Excel 2010 (.xlsx). It is easiest to deselect "Export all columns" and then in the next wizard step choose to "Export table as currently shown". For more details see Export of tables.
- Perform the filtering outside the workbench. For example, in Excel one might calculate the sum of the CPM values of all the samples for each feature, then filter the rows to show only those with a total CPM of at least 10. Copy the names of the retained features.
- Switch back to the workbench and open all the samples. For the first sample, set the filter to "Name" "is in list", paste in the names of the retained features, and click Filter. This may take a couple of minutes.
- Select all the rows, and choose to "Select Genes in Other Views". This might take a little time, but typically less than 1 minute. Now these rows are selected in all the open samples.
- Go through each sample in turn, choose "Create Track from Selection", and save the new element.
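The filtering step performed outside the workbench can also be scripted instead of done in Excel. The following is a minimal sketch in Python, not part of the workbench itself. It assumes the Expression Browser was exported as comma-separated text rather than .xlsx, that the feature names are in a column called "Name", and that each sample's CPM column has a header ending in "CPM". The column headers, gene names, and values in the example data are hypothetical, and the threshold of 10 is simply the one used in the Excel example above.

```python
import csv
import io


def filter_features_by_total_cpm(rows, sample_columns, min_total_cpm=10.0,
                                 name_column="Name"):
    """Return the names of features whose CPM, summed over all samples,
    is at least min_total_cpm."""
    kept = []
    for row in rows:
        total_cpm = sum(float(row[col]) for col in sample_columns)
        if total_cpm >= min_total_cpm:
            kept.append(row[name_column])
    return kept


# Hypothetical export: one row per feature, one CPM column per sample.
# Real column headers depend on the sample names and export settings.
EXAMPLE_EXPORT = """Name,Sample A - CPM,Sample B - CPM,Sample C - CPM
GENE1,0.5,1.0,0.2
GENE2,4.0,3.5,6.1
GENE3,12.0,0.0,0.0
GENE4,3.0,3.0,3.9
"""

reader = csv.DictReader(io.StringIO(EXAMPLE_EXPORT))
# Assumption: every CPM column header ends with "CPM".
sample_columns = [c for c in reader.fieldnames if c.endswith("CPM")]
retained_names = filter_features_by_total_cpm(reader, sample_columns)

# One name per line, ready to paste into the "is in list" filter.
print("\n".join(retained_names))
```

With the example data, only GENE2 and GENE3 reach a total CPM of 10. To use the sketch on a real export, replace `io.StringIO(EXAMPLE_EXPORT)` with an opened file and adjust the column selection to match the exported headers.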
Pre-filtering may also be desirable to remove extreme outliers. However, in most cases, the "Downweight outliers" option described in Downweighting outliers is preferable, because a gene can be differentially expressed and also have an outlier measurement.