Mail Address: Comparative
Molecular Sciences Bldg 21, James Cook University,
Townsville, 4811, Queensland, Australia
Telephone: 61-7-4781 6265 Fax: 61-7-4781 6078
REGRESSION TO THE MEAN
Regression to the mean was first identified by Sir Francis Galton F.R.S. (1822-1911): half-cousin of Charles Darwin, geographer, meteorologist, tropical explorer, founder of differential psychology, inventor of fingerprint identification, pioneer of statistical correlation and regression, convinced hereditarian, eugenicist, proto-geneticist, and best-selling author.
He correlated the heights of 930 adult children and their respective 250 parents, "correcting" for sex by increasing female heights by a factor of 1.08.
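As a minimal sketch of this correction, and of the averaging of the two parents' corrected heights that Galton went on to use, the arithmetic can be written out as follows. The function names and the example couple are hypothetical and are not taken from Galton's data; only the factor of 1.08 comes from the text.

```python
FEMALE_FACTOR = 1.08  # multiplier the text says Galton applied to female heights


def corrected_height(height, is_female):
    """Return a height on the male-equivalent scale."""
    return height * FEMALE_FACTOR if is_female else height


def midparent_height(father, mother):
    """Mean of the father's height and the mother's corrected height."""
    return (corrected_height(father, False) + corrected_height(mother, True)) / 2


# Hypothetical couple, heights in inches (illustrative values only).
mp = midparent_height(father=70.0, mother=64.0)
print(f"mid-parent height: {mp:.2f} inches")
```

Each child's (corrected) height would then be plotted against a value computed like `mp`.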
He accounted for the genetic contribution of both parents by taking the mean of their (corrected) heights. He plotted the data (see below) and performed a least squares straight line fit (red line), but found that its slope was less than that expected if the height of children was on average the same as that of their parents (yellow line). He observed:
"It appeared from these experiments that the offspring did not tend to resemble their parents in size, but always to be more mediocre than they - to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were small."
Galton was wrong and he knew it, writing:
"Notwithstanding this explanation, some suspicion may remain of a paradox lurking in my strongly contrasted results. How is it, I ask, that in each successive generation there proves to be the same number of men per thousand, who range between any limits of stature we please to specify, although tall men are rarely descended from equally tall parents, or the short from equally short?"
The phenomenon he reported is called regression to the mean, and it occurs whenever extreme measurements are reassessed. There are two explanations for it.
The situation is different for an object with the very biggest measurement in the whole population. It is likely to have reached that position not only by a) having a large "real" value, but also by b) having a highly positive measurement error, pushing the observed result to an extreme. If such an object is remeasured, it is unlikely to have such a large positive error a second time, so the new value will probably be a little lower. An object selected because its previous measurement was extreme will therefore tend to give a repeat measurement a little less extreme than the original, rather than a more extreme one.
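The remeasurement argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "real" values and the measurement errors are both drawn from normal distributions with unit standard deviation, and the function name, sample size and 5% cut-off are hypothetical choices made for illustration.

```python
import random


def remeasure_extremes(n=20000, top_fraction=0.05, seed=1):
    """Measure n objects twice and compare the two readings of the top group."""
    rng = random.Random(seed)
    # Assumed model: observed value = "real" value + independent error.
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]
    first = [r + rng.gauss(0.0, 1.0) for r in real]
    second = [r + rng.gauss(0.0, 1.0) for r in real]

    # Select the objects whose FIRST measurement was most extreme.
    k = int(n * top_fraction)
    top = sorted(range(n), key=lambda i: first[i], reverse=True)[:k]

    mean_first = sum(first[i] for i in top) / k
    mean_second = sum(second[i] for i in top) / k
    return mean_first, mean_second


mean_first, mean_second = remeasure_extremes()
print(f"selected group, first measurement:  {mean_first:.2f}")
print(f"selected group, second measurement: {mean_second:.2f}")
```

In this sketch the selected group still measures above the population mean of zero the second time, but less extremely than the first time, because the large positive errors that helped select it are not repeated.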
There is another reason for the skewing of the distribution of anatomic and physiological measurements; there are absolute limits on the range of values achievable by biological systems. For the example of height, one cannot be smaller than zero and there must be some realistic maximum height compatible with structural integrity. A consequence of this is that variation around an extremely low value is limited by its proximity to the absolute minimum (called a "left wall" effect) and variation around an extremely high value is limited by its proximity to the absolute maximum (a "right wall" effect).
These two factors (skewing from remeasuring extreme values and skewing from biological absolutes) both operate in the same direction (i.e. towards mediocrity) at each end of the spectrum of values.
Regression to the mean becomes apparent when a least squares line fit is performed. Because this procedure positions the line of fit by minimising the sum of the squared vertical distances between each point and the line, it tends to pass through the mean of any subpopulation, rather than the mode. For example, where the line of fit passes through the population of children of extremely tall parents, it will not pass through the most common value (the mode - equal to the parental value) but instead will pass a little lower, through the mean, to minimise the displacement of the skewed readings from the line of fit.
This happens because least squares fitting assumes that the independent variable (the "x" value, in this case parental height) is without error, and that the dependent variable (the "y" value, in this case children's height) contains all the error observed. The problem becomes obvious when one plots the heights of parents against the heights of their children - i.e. swaps the axes over and makes parental height the dependent variable. Now the heights of the parents of extreme children also tend towards mediocrity - even though the same data set of the same parents and children is used for both analyses!
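The effect of swapping the axes can be sketched with simulated data. The model below is an assumption made for illustration, not Galton's data: each parent and child height is a common mean plus a shared component plus an independent deviation, with hypothetical values for the mean (68 inches) and the standard deviations. The helper names are also hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of the least squares line of y on x (all error assigned to y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_heights(n=5000, seed=2):
    """Simulate parent/child pairs that are equally variable and equally noisy."""
    rng = random.Random(seed)
    parents, children = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 2.0)  # component common to parent and child
        parents.append(68.0 + shared + rng.gauss(0.0, 1.5))
        children.append(68.0 + shared + rng.gauss(0.0, 1.5))
    return parents, children


parents, children = simulate_heights()
slope_child_on_parent = ols_slope(parents, children)  # parents on the x axis
slope_parent_on_child = ols_slope(children, parents)  # axes swapped
print(f"children regressed on parents: slope = {slope_child_on_parent:.2f}")
print(f"parents regressed on children: slope = {slope_parent_on_child:.2f}")
```

Under this symmetric model both fitted slopes come out below 1, so whichever variable is placed on the y axis appears to regress towards mediocrity, even though neither generation is less variable than the other.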