REGRESSION TO THE MEAN
COMPARATIVE GENOMICS CENTRE

Mail Address: Comparative Genomics Centre,
Molecular Sciences Bldg 21, James Cook University,
Townsville, 4811, Queensland, Australia
Telephone: 61-7-4781 6265 Fax:  61-7-4781 6078


 
 
Regression to the mean was first identified by Sir Francis Galton F.R.S. (1822-1911): half-cousin of Charles Darwin, geographer, meteorologist, tropical explorer, founder of differential psychology, inventor of fingerprint identification, pioneer of statistical correlation and regression, convinced hereditarian, eugenicist, proto-geneticist, and best-selling author.

He correlated the heights of 928 adult children with those of their parents (205 couples), "correcting" for sex by multiplying female heights by a factor of 1.08.
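
The arithmetic of the correction, and of the parental averaging described in the next paragraph, can be sketched as follows. The function names and the example heights (in inches) are hypothetical and are not taken from Galton's data; only the factor of 1.08 comes from the text.

```python
FEMALE_CORRECTION = 1.08  # factor stated in the text


def corrected_height(height, is_female):
    """Scale a female height by 1.08; leave a male height unchanged."""
    return height * FEMALE_CORRECTION if is_female else height


def mid_parent_height(father, mother):
    """Mean of the father's height and the mother's corrected height."""
    return (father + corrected_height(mother, is_female=True)) / 2


# Hypothetical example values in inches.
father, mother = 70.0, 64.0
print(round(corrected_height(mother, True), 2))     # 69.12
print(round(mid_parent_height(father, mother), 2))  # 69.56
```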

He accounted for the genetic contribution of both parents by taking the mean of their (corrected) heights. He plotted the data (see below) and fitted a least squares straight line (red line), but found that its slope was less than would be expected if children were, on average, the same height as their parents (yellow line). He observed:

"It appeared from these experiments that the offspring did not tend to resemble their parents in size, but always to be more mediocre than they - to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were small."

He concluded:
"The explanation of it is as follows. The child inherits partly from his parents, partly from his ancestry."

 

Galton was wrong and he knew it, writing: 
"Notwithstanding this explanation, some suspicion may remain of a paradox lurking in my strongly contrasted results. How is it, I ask, that in each successive generation there proves to be the same number of men per thousand, who range between any limits of stature we please to specify, although tall men are rarely descended from equally tall parents, or the short from equally short?"

The phenomenon he reported is called regression to the mean, and it occurs whenever extreme measurements are reassessed. There are two explanations for it.


EXPLANATION ONE: REMEASURING EXTREMES
Any quantitative measurement is the sum of two values: the "real" value and the error of measurement. Imagine a population of objects (for example, lengths of steel) for which a metric characteristic is measured and found to conform to a normal distribution with a certain mean and variance. Provided there is no change in systematic errors, if the population is remeasured, the mean and variance will remain more or less the same. But this apparent similarity obscures the fact that the error component of the measurement of each object will have changed. For objects near the mean of the population, some errors will have become smaller and some larger, resulting in a symmetrical scatter of values around the original mean of that subpopulation. The actual ranking of the individual objects in that subpopulation may have changed, however.
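
The remeasurement argument can be sketched with a small simulation. The population mean of 100, the spread of the "real" values (standard deviation 10) and the size of the measurement error (standard deviation 5) are assumptions chosen purely for illustration; none of them comes from the text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed model: measurement = "real" value + independent random error.
N = 20000
true_values = [rng.gauss(100.0, 10.0) for _ in range(N)]
first = [t + rng.gauss(0.0, 5.0) for t in true_values]
second = [t + rng.gauss(0.0, 5.0) for t in true_values]

# The population as a whole looks much the same on remeasurement.
print("means: ", round(statistics.fmean(first), 2), round(statistics.fmean(second), 2))
print("stdevs:", round(statistics.stdev(first), 2), round(statistics.stdev(second), 2))

# Objects whose first measurement lay within 1 unit of the population mean.
pop_mean = statistics.fmean(first)
changes = [s - f for f, s in zip(first, second) if abs(f - pop_mean) < 1.0]

# Roughly half of them should measure higher the second time, half lower.
went_up = sum(c > 0 for c in changes) / len(changes)
print("fraction of near-mean objects measuring higher:", round(went_up, 2))
```

Under these assumptions the two means and the two standard deviations should agree closely, and about half of the near-mean objects should move up and half down, even though almost every individual value has changed.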

The situation is different for the object with the very biggest measurement in the whole population. It is likely to have reached that position not only by a) having a large "real" value, but also by b) having an error that was highly positive, pushing the perceived result to an extreme. If such an object is remeasured, it is unlikely to have such a large positive error of measurement a second time, so the new value may well be a little lower. An object selected from a population on the basis of a previous extreme measurement will therefore tend to give a repeat measurement a little less extreme than the original, rather than one that is more extreme.
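
A minimal sketch of this selection effect follows, again assuming a "real" value with mean 100 and standard deviation 10 plus an independent error with standard deviation 5 (illustrative numbers only). It picks out the 1% of objects with the largest first measurement and looks at what happens when they are measured again.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

N = 20000
true_values = [rng.gauss(100.0, 10.0) for _ in range(N)]
first = [t + rng.gauss(0.0, 5.0) for t in true_values]
second = [t + rng.gauss(0.0, 5.0) for t in true_values]

# Indices of the 1% of objects with the largest FIRST measurement.
order = sorted(range(N), key=lambda i: first[i], reverse=True)
top = order[: N // 100]

mean_first = statistics.fmean(first[i] for i in top)
mean_second = statistics.fmean(second[i] for i in top)
# Average error that the selected objects carried in their first measurement.
mean_error = statistics.fmean(first[i] - true_values[i] for i in top)
# Share of selected objects whose repeat measurement is lower than the first.
fraction_lower = sum(second[i] < first[i] for i in top) / len(top)

print("mean of first measurement: ", round(mean_first, 1))
print("mean of repeat measurement:", round(mean_second, 1))
print("mean first-measurement error:", round(mean_error, 1))
print("fraction measuring lower on repeat:", round(fraction_lower, 2))
```

The selected objects should show a clearly positive average error in their first measurement, and a repeat mean that falls back towards 100 while still remaining well above it, since part of their extremeness is "real".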


EXPLANATION TWO: A STATISTICAL ARTIFACT
It follows from the above that the distribution of repeat measurements of objects selected on the basis of a previous extreme value will not lie symmetrically around the most common value (which is the initial measurement), but will be skewed towards the mean of the total population. The repeated measurements of objects selected on the basis of extremely low values will be skewed upwards slightly, while the repeated measurements of objects selected on the basis of extremely high values will be skewed downwards slightly. As observed above, there would be no skewing of repeat measurements of objects selected on the basis of having mediocre values.

There is another reason for the skewing of the distribution of anatomical and physiological measurements: there are absolute limits on the range of values achievable by biological systems. In the example of height, one cannot be shorter than zero, and there must be some realistic maximum height compatible with structural integrity. A consequence of this is that variation around an extremely low value is limited by its proximity to the absolute minimum (called a "left wall" effect) and variation around an extremely high value is limited by its proximity to the absolute maximum (a "right wall" effect).

These two factors (skewing from remeasuring extreme values and skewing from biological absolutes) both operate in the same direction (i.e. towards mediocrity) at each end of the spectrum of values.

Regression to the mean becomes apparent when a least squares line fit is performed. As this procedure positions the line of fit by minimising the sum of the squared differences between each point and the line, it will pass through the mean of any subpopulation, rather than the mode. For example, when the line of fit passes through the population of children of extremely tall parents, it will not pass through the most common value (the mode - equal to the parental value) but instead will pass a little lower, through the mean, to minimise the displacement of the skewed readings from the line of fit.
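
The preference of least squares for the mean over the mode can be checked on a single skewed subpopulation. The eight heights below (in inches) are hypothetical values invented for illustration, standing in for the children of one group of very tall parents; they are not Galton's data.

```python
import statistics


def sum_of_squares(values, centre):
    """Sum of squared differences between each value and a candidate centre."""
    return sum((v - centre) ** 2 for v in values)


# Hypothetical child heights, skewed downwards from the most common value.
heights = [62, 66, 68, 69, 69, 70, 70, 70]

mode = statistics.mode(heights)   # most common value
mean = statistics.fmean(heights)  # arithmetic mean

print(mode, sum_of_squares(heights, mode))  # 70 86
print(mean, sum_of_squares(heights, mean))  # 68.0 54.0
```

The sum of squares is smaller around the mean (68) than around the mode (70), so a line placed by least squares is drawn towards the mean of a skewed group.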

This happens because least squares fitting assumes that the independent variable (the "x" value, in this case parental height) is without error, and assumes that the dependent variable (the "y" value, in this case children's heights) contains all the error observed. The problem becomes obvious when one plots the heights of parents against their children's heights, i.e. swaps the axes over and makes parental height the dependent variable. Now the heights of parents of extreme children also tend towards mediocrity - even if the same data set of the same parents and children is used for both analyses!
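
The axis swap can be tried on simulated data. The sketch below assumes a deliberately symmetrical model in which parent and child heights each consist of a shared component plus an independent deviation; the model, the function name `ls_slope` and all numbers are assumptions made for illustration and are not a reconstruction of Galton's data.

```python
import random
import statistics


def ls_slope(x, y):
    """Slope of the ordinary least squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2)  # fixed seed so the run is repeatable

# Assumed model: a shared family component plus independent deviations.
N = 5000
shared = [rng.gauss(68.0, 2.0) for _ in range(N)]
parent = [s + rng.gauss(0.0, 1.5) for s in shared]
child = [s + rng.gauss(0.0, 1.5) for s in shared]

# Same data, fitted both ways round.
slope_child_on_parent = ls_slope(parent, child)
slope_parent_on_child = ls_slope(child, parent)

print("child regressed on parent:", round(slope_child_on_parent, 2))
print("parent regressed on child:", round(slope_parent_on_child, 2))
```

Under these assumptions both slopes should come out well below 1: children of extreme parents appear more mediocre than their parents, and parents of extreme children appear more mediocre than their children, in one and the same data set.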

