Fitting a GLM to expression data

The GLM is easiest to understand through an example. Imagine an experiment that examines the effect of two drug treatments while controlling for the gender of each patient.

In an abuse of mathematical notation, the underlying GLM for each gene looks like

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{drugA} + \mathrm{drugB} + \mathrm{Female} + \mathrm{constant_i}$ (29.1)

where $ y_i$ is the expression level for the gene in sample $ i$; the combined term $ \mathrm{(placebo\ and\ Male)}$ describes an arbitrarily chosen baseline expression level (that of males given a placebo); and the other terms $ \mathrm{drugA}$, $ \mathrm{drugB}$ and $ \mathrm{Female}$ are numbers describing the effect of each group relative to this baseline. Each of these terms contributes only for samples that belong to the corresponding group. The $ \mathrm{constant_i}$ accounts for differences in library size between samples. For example, if a patient is male and given a placebo, we predict the expression level to be

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{constant_i}.$

If instead he had been given drug B, we would add the $ \mathrm{drugB}$ coefficient to the predicted log expression level, resulting in

$\displaystyle \log{y_i} = \mathrm{(placebo\ and\ Male)} + \mathrm{drugB} + \mathrm{constant_i}.$
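
The way the terms combine for a given patient can be sketched in a few lines of Python. This is only an illustration: the coefficient values, the function name `predicted_expression` and the use of the natural logarithm are assumptions made for the example, not values or names taken from the software.

```python
import math

# Hypothetical coefficients on the natural-log scale (illustrative values only).
coef = {
    "baseline": math.log(100.0),  # the (placebo and Male) term
    "drugA": math.log(1.5),
    "drugB": math.log(2.0),
    "Female": math.log(0.8),
}

def predicted_expression(treatment, gender, log_library_constant=0.0):
    """Return the predicted expression level y_i for one sample.

    A term is added only when the sample belongs to the matching group;
    the baseline and the library-size constant are always included.
    """
    log_y = coef["baseline"] + log_library_constant
    if treatment in ("drugA", "drugB"):
        log_y += coef[treatment]
    elif treatment != "placebo":
        raise ValueError(f"unknown treatment: {treatment}")
    if gender == "Female":
        log_y += coef["Female"]
    elif gender != "Male":
        raise ValueError(f"unknown gender: {gender}")
    return math.exp(log_y)

# Male on placebo: baseline only. Male on drug B: baseline plus drugB.
print(predicted_expression("placebo", "Male"))
print(predicted_expression("drugB", "Male"))
```

Because the coefficients are added on the log scale, each one acts as a multiplicative fold change on the expression level itself.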

We assume that the expression levels $ y_i$ follow a Negative Binomial distribution. This distribution has a free parameter, the dispersion. The greater the dispersion, the greater the variation in expression levels for a gene.
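
To make the role of the dispersion concrete, the sketch below assumes the parameterization commonly used for count data, in which the variance equals the mean plus the dispersion times the squared mean. The text does not state which parameterization is used, so treat the formula and the helper name `nb_variance` as assumptions.

```python
def nb_variance(mean, dispersion):
    """Variance of a Negative Binomial count under the assumed
    parameterization: variance = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2

# Same mean expression level, increasing dispersion (illustrative values).
for dispersion in (0.0, 0.1, 0.5):
    print(dispersion, nb_variance(100.0, dispersion))
```

With a dispersion of zero the variance equals the mean, as for a Poisson distribution; any positive dispersion adds variation on top of that.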

The most likely values of the dispersion and of the coefficients $ \mathrm{drugA}$, $ \mathrm{drugB}$ and $ \mathrm{Female}$ are determined simultaneously by fitting the GLM to the data. To see why this simultaneous fitting is necessary, imagine an experiment where we observe counts {15,10,4} for Males and {30,20,8} for Females. The most natural fit is for the coefficient $ \mathrm{Female}$ to correspond to a two-fold change and for the dispersion to be small, but an alternative fit has no fold change and a larger dispersion. Under this second fit the variation in the counts is greater, and it is just by chance that the Female counts are, on average, twice the Male counts.
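
The trade-off between a fold change and the dispersion can be explored numerically. The sketch below is a simplified stand-in for a real GLM fit, not the algorithm used by the software: it uses illustrative counts with an exact two-fold difference in group means, assumes equal library sizes, assumes the parameterization variance = mean + dispersion × mean², and finds the dispersion by a coarse grid search. The function names are hypothetical.

```python
import math

def nb_log_likelihood(counts, mean, dispersion):
    """Log-likelihood of counts under a Negative Binomial with the given
    mean and dispersion (assumed: variance = mean + dispersion * mean**2)."""
    r = 1.0 / dispersion  # Negative Binomial "size" parameter
    total = 0.0
    for k in counts:
        total += (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                  + r * math.log(r / (r + mean))
                  + k * math.log(mean / (r + mean)))
    return total

def best_dispersion(groups):
    """Grid-search one shared dispersion; each group gets its own mean.

    With equal library sizes and a fixed dispersion, the maximum-likelihood
    mean of a group is its sample mean, so only the dispersion is searched.
    """
    grid = [i / 100 for i in range(1, 101)]  # 0.01, 0.02, ..., 1.00

    def total_ll(dispersion):
        return sum(nb_log_likelihood(g, sum(g) / len(g), dispersion)
                   for g in groups)

    best = max(grid, key=total_ll)
    return best, total_ll(best)

males = [15, 10, 4]     # illustrative counts
females = [30, 20, 8]   # illustrative counts

# Fit 1: the Female coefficient is free, so each group has its own mean.
disp_effect, ll_effect = best_dispersion([males, females])
# Fit 2: no fold change, so all six samples share a single mean.
disp_null, ll_null = best_dispersion([males + females])

print(f"with Female effect: dispersion={disp_effect:.2f}, logL={ll_effect:.2f}")
print(f"no fold change:     dispersion={disp_null:.2f}, logL={ll_null:.2f}")
```

The fit without a fold change has to absorb the difference between the groups into a larger dispersion, which is why the coefficients and the dispersion cannot be estimated independently of each other.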