Data dredge

When a scientist wishes to test whether substance A causes disease K, he can conduct a study of people who are or are not exposed to A and compare the incidence of K in the two populations, thereby calculating the relative risk and the probability P that a difference at least as large arose by accident. According to his lights he might deem the result to be a significant statistical association, but note that he has not “proved” that A causes K. It is a gamble, a throw of the die, that might pay off or might not. Not a very good bet.
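As an illustration of the arithmetic, here is a minimal sketch of that single comparison (in Python, used here purely for demonstration; the counts in the table are invented for the example, not taken from any real study):

    import math

    # Hypothetical 2x2 table: cases of disease K among people exposed
    # and not exposed to substance A (all numbers made up)
    exposed_cases, exposed_total = 35, 1000
    unexposed_cases, unexposed_total = 18, 1000

    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    relative_risk = risk_exposed / risk_unexposed

    # Two-proportion z-test: P is the chance of a difference at least
    # this large arising by accident if A in fact has no effect on K
    pooled = (exposed_cases + unexposed_cases) / (exposed_total + unexposed_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / exposed_total + 1 / unexposed_total))
    z = (risk_exposed - risk_unexposed) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided

    print(f"relative risk = {relative_risk:.2f}, P = {p_value:.3f}")

With these made-up counts the relative risk comes out at about 1.9 and P at about 0.02: “significant” by the conventional criterion, yet still no proof of causation.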

Suppose, therefore, that he goes on to examine ten substances A to J and ten diseases K to T. He now has one hundred combinations of substance and disease, and therefore a hundred times as many chances of a result undercutting his chosen criterion for P, which in epidemiology is more often than not set at 0.05.
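To see how bad the gamble has become, note that if the hundred comparisons are independent (a simplifying assumption made here for illustration), the chance that at least one of them slips under the criterion by luck alone is 1 − 0.95^100, which a couple of lines of Python confirm:

    alpha, n_tests = 0.05, 100
    print(1 - (1 - alpha) ** n_tests)  # ~0.994: at least one spurious "success" is near-certain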

Let us assume that there are, in fact, no real relationships between the substances and the diseases and that it is all down to random numbers. If he sticks to his criterion, the Poisson approximation applies and he can expect about 100P “successes”, or five using the conventional value. What he should do, of course, is adjust his criterion to account for the greater likelihood of random successes, but does he? More often than not, the five successes are published as “scientific facts” and the other ninety-five results are ignored, which is a form of publication bias.
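The expected harvest of five is easy to check by simulation, again assuming independent comparisons and nothing else. The sketch below draws uniform random P values (which is what the null hypothesis delivers for a continuous test statistic) and also shows the effect of one standard way of adjusting the criterion, the Bonferroni correction, which divides it by the number of comparisons:

    import random

    random.seed(1)
    n_pairs, alpha, trials = 100, 0.05, 10_000

    naive = bonferroni = 0
    for _ in range(trials):
        p_values = [random.random() for _ in range(n_pairs)]  # null: P ~ Uniform(0, 1)
        naive += sum(p < alpha for p in p_values)
        bonferroni += sum(p < alpha / n_pairs for p in p_values)  # adjusted criterion

    print(f"mean 'successes' per dredge: {naive / trials:.2f}")    # ~5, i.e. 100P
    print(f"with Bonferroni adjustment: {bonferroni / trials:.3f}")  # ~0.05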

Data dredging, however, will usually go even further than this. Large databases are often set up with hundreds of putative causes and effects. These lend themselves to retrospective “mining” for apparent associations. In addition to the fundamental statistical flaw outlined above, there are other problems:

1        The data gathered are often of an anecdotal nature. They are often in areas where people lie to themselves, let alone to others, especially with the modern pressures of political correctness (e.g. alcohol, tobacco or fat intake). People like to be helpful and give the answers they think questioners want to hear. There is little or no checking as to whether the recorded “facts” are true.

2        People are often equally vague about their illnesses. They might be either hypochondriacs or ostriches.

3        Many of the questions put to subjects are vague to the point of absurdity, such as asking what is in their diet. Can you remember what, and in what quantities, you ate last week?

4        Questioners are liable to add biases of their own in recording responses.

5        Some of the variables are then calculated indirectly from these vague data, such as the intake of a particular vitamin being estimated from the reported diet, or factors such as deprivation and pollution being estimated from the postal code.

6        There are often indications of reverse causality (e.g. people might have abandoned a therapy through illness, rather than become ill because of abandoning the therapy).

7        Often researchers disguise the fact that a result is part of a data dredge and present it as a single experiment, but they sometimes inadvertently let the cat out of the bag by mentioning other variables.

8        The subjects of the database are often unrepresentative of the population at large, e.g. they might be restricted to the medical professions or co-religionists.

 

John Brignell

January 2003

 

Further reading: Junk Science Judo.
