Fitting a GLM to expression data
It is easiest to understand how a generalized linear model (GLM) works through an example. Imagine an experiment looking at the effect of two drug treatments while controlling for the gender of the patient:
- Test differential expression due to Treatment with three groups: drugA, drugB, placebo
- While controlling for Gender with groups: Male, Female
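In an abuse of mathematical notation, the underlying GLM for each gene looks like
$$\log y_i = (\text{placebo and Male}) + \text{drugA} + \text{drugB} + \text{Female} + \text{constant}_i$$
where $y_i$ is the expression level for the gene in sample $i$; the combined term $(\text{placebo and Male})$ describes an arbitrarily chosen baseline expression level (of males being given a placebo); and the other terms $\text{drugA}$, $\text{drugB}$ and $\text{Female}$ are numbers describing the effect of each group with respect to this baseline. Each of these terms contributes only for samples that belong to the corresponding group. The $\text{constant}_i$ term accounts for differences in the library size between samples. For example, if a patient is male and given a placebo we predict the expression level to be
$$\log y_i = (\text{placebo and Male}) + \text{constant}_i$$
If instead he had been given drug B, we would predict the expression level to be augmented with the $\text{drugB}$ coefficient, resulting in
$$\log y_i = (\text{placebo and Male}) + \text{drugB} + \text{constant}_i$$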
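The prediction rule described above can be sketched in a few lines of Python. The coefficient values, the function name `predicted_expression` and the use of the natural logarithm are assumptions made purely for illustration; they are not taken from any fitted model.

```python
import math

# Hypothetical coefficients on the natural-log scale (illustrative values only).
coefficients = {
    "baseline": 2.0,  # placebo and Male
    "drugA": 0.5,
    "drugB": 1.0,
    "Female": -0.3,
}

def predicted_expression(treatment, gender, log_library_offset):
    """Return the predicted expression level y_i for one sample."""
    # Every sample starts from the baseline plus its library-size constant.
    log_y = coefficients["baseline"] + log_library_offset
    # A group coefficient is added only if the sample belongs to that group.
    if treatment != "placebo":
        log_y += coefficients[treatment]
    if gender == "Female":
        log_y += coefficients["Female"]
    return math.exp(log_y)

male_placebo = predicted_expression("placebo", "Male", 0.0)
male_drug_b = predicted_expression("drugB", "Male", 0.0)
# Because the model is additive on the log scale, the ratio is exp(drugB).
ratio = male_drug_b / male_placebo
print(male_placebo, male_drug_b, ratio)
```

Because the coefficients add on the log scale, each one acts as a multiplicative fold change on the expression level itself.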
We assume that the expression levels follow a Negative Binomial distribution. This distribution has a free parameter, the dispersion. The greater the dispersion, the greater the variation in expression levels for a gene.
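To make the role of the dispersion concrete, the sketch below assumes the mean-dispersion parameterization that is common for count data, in which the variance equals the mean plus the dispersion times the squared mean. The text above does not state which parameterization is used, so treat this formula and the helper name `nb_variance` as assumptions.

```python
def nb_variance(mean, dispersion):
    """Variance of a Negative Binomial under the assumed parameterization
    variance = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2

# Same mean, two different dispersions (illustrative values only).
low = nb_variance(10.0, 0.01)
high = nb_variance(10.0, 0.5)
print(low, high)
```

Under this assumption, a dispersion close to zero gives a variance close to the mean, while larger dispersions inflate the variance well beyond it.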
The most likely values of the dispersion and the coefficients $\text{drugA}$, $\text{drugB}$ and $\text{Female}$ are determined simultaneously by fitting the GLM to the data. To see why this simultaneous fitting is necessary, imagine an experiment where we observe counts {3,10,4} for Males and {30,20,8} for Females. The most natural fit is for the $\text{Female}$ coefficient to correspond to roughly a three-fold change (the Female mean of about 19.3 is about 3.4 times the Male mean of about 5.7) and for the dispersion to be small, but an alternative fit has no fold change and a larger dispersion. Under this second fit the variation in the counts is greater, and it is just by chance that all three Female values are larger than all three Male values.
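The trade-off between the two fits can be explored with a small likelihood comparison. This is a sketch under stated assumptions, not a real GLM fit: it ignores library-size constants, assumes the parameterization variance = mean + dispersion * mean**2, and replaces simultaneous fitting with a grid search over the dispersion. All function names are hypothetical.

```python
import math

def nb_log_pmf(count, mean, dispersion):
    """Log-probability of a count under a Negative Binomial with the given
    mean and dispersion (assumed: variance = mean + dispersion * mean**2)."""
    size = 1.0 / dispersion
    return (
        math.lgamma(count + size) - math.lgamma(size) - math.lgamma(count + 1)
        + size * math.log(size / (size + mean))
        + count * math.log(mean / (size + mean))
    )

def log_likelihood(counts, means, dispersion):
    return sum(nb_log_pmf(c, m, dispersion) for c, m in zip(counts, means))

def best_dispersion(counts, means, grid):
    """Grid search for the dispersion that maximizes the likelihood."""
    return max(grid, key=lambda d: log_likelihood(counts, means, d))

males, females = [3, 10, 4], [30, 20, 8]
counts = males + females
grid = [k / 100 for k in range(1, 301)]  # dispersions 0.01 .. 3.00

# Fit 1: each gender has its own mean (a non-zero Female coefficient).
male_mean = sum(males) / len(males)
female_mean = sum(females) / len(females)
group_means = [male_mean] * 3 + [female_mean] * 3
# Fit 2: one shared mean (Female coefficient fixed at zero).
shared_means = [sum(counts) / len(counts)] * 6

disp_with_effect = best_dispersion(counts, group_means, grid)
disp_no_effect = best_dispersion(counts, shared_means, grid)
ll_with_effect = log_likelihood(counts, group_means, disp_with_effect)
ll_no_effect = log_likelihood(counts, shared_means, disp_no_effect)
print(disp_with_effect, disp_no_effect, ll_with_effect, ll_no_effect)
```

The fit without a Female effect can only explain the data by choosing a larger dispersion, which is why the coefficients and the dispersion cannot be estimated independently of each other.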