Thinking in numbers


I read an intriguing and slightly baffling quote by Ernest Rutherford the other day; apparently, he once said: “If your experiment needs statistics, you ought to have done a better experiment.” I guess in his day the fields of psychology/psychiatry weren’t considered scientific. Statistics are so central to what I do that they’re really the sharpest tools in my PhD toolbox. However, statistics without knowledge, conceptual understanding and a thoughtful study design are worse than useless, potentially leading to all sorts of trouble like the dreaded Type 1 error (i.e. a false positive finding), which can easily occur if you simply run enough tests – see here for a good illustration of this. So in effect, statistics are a vital set of tools (at least in psychology/psychiatry – perhaps physicists will have some insight into what Ernest Rutherford was talking about) which need to be used in the right environment and in an appropriate way.
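To put a rough number on the "run enough tests" problem, here is a minimal sketch. It assumes the tests are independent and each uses the conventional 0.05 threshold; the function names are my own, purely for illustration.

```python
import random

def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def simulate_false_positives(n_tests, alpha=0.05, seed=1):
    """Count 'significant' results when every null hypothesis is true.

    Under a true null, a p-value is uniformly distributed on [0, 1],
    so we draw p-values directly instead of simulating whole datasets.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))

for k in (1, 5, 20, 100):
    print(k, "tests ->", round(familywise_error_rate(k), 3))

print("false positives in 100 null tests:", simulate_false_positives(100))
```

With 20 independent tests, the chance of at least one false positive is already well over half, even though nothing real is going on in any of them.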

Having come a long way since learning about t-tests in my first year as an undergrad, it seems that hardly a day passes without me finding out something new about statistics. The funny thing is that I don’t especially go out of my way to do so. It’s just that the more I look at real data sets full of real numbers describing real people, the clearer it becomes that those “textbook” examples you first see when you come across a new stats technique can be fairly unhelpful. The problem is that so many methods rely on a number of important assumptions, like the residuals of the model you run being normally distributed (i.e. spread symmetrically around zero, like a bell curve) or the variables in a given test not being too strongly related to one another. If these assumptions are violated, you need to backtrack and take another route. In the last couple of months, I’ve seen massively zero-inflated distributions, bi-modal distributions, heavily skewed distributions and not-positive-definite matrices. It gets to the point where seeing a normal distribution gives me a warm and fuzzy feeling inside.

In trying to learn how to correctly tackle data with these quirks, I occasionally get the impression that some other researchers either don’t notice these problems or perhaps ignore them, a bit like the impression I had when I discussed the problem of missing data previously. Though, to be fair, some of the techniques for addressing such problems, and the discussions around how best to use them, are relatively new. Another problem is when there just aren’t enough details in a method section to provide a comprehensive “recipe” that you can follow, which is somewhat frustrating when you’re trying to learn how others use a given stats method. It would seem that my PhD examiners may be in for some pretty tedious, if thorough, methodological sections! But at least I feel like I’m putting in the effort to use the statistical technique the data call for and to understand what I’m doing.


On “making up” data (in a good way?)

I’ve been learning how to deal with missing data recently. As far as I can tell from reading papers, a large majority of researchers in psychology and psychiatry seem to ignore this issue, particularly in cross-sectional studies (those assessing participants at a single time point). Collecting data from human participants, particularly patient groups with psychiatric problems, can be tricky; no matter how thorough and careful the researchers collecting the data are, sometimes a participant just doesn’t know the answer, can’t understand the question or is unwilling to provide an answer for personal/emotional reasons, which is perfectly acceptable and not unusual. However, it seems logical to me that this problem might affect the quality of a dataset and the reliability of the conclusions drawn from it.

It appears that the default method for “dealing” with this issue is to exclude individuals with a certain amount of incomplete data from the entire analysis (known as case-wise deletion) or, if performing a number of tests, to exclude individuals only from the tests where they have that particular missing variable but include them in the tests where they have complete data (known as pair-wise deletion). For example, if you are looking at whether someone’s income or education is associated with whether they have experienced any depressive symptoms and a participant has not provided their income but has given their education, they would be included in the test of education & depression but not in the one for income & depression. Pair-wise deletion therefore generally results in a slightly different sample size for each variable tested.
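The difference between the two deletion methods is easy to see on a toy example. This is a minimal sketch with made-up participants and my own helper names (`casewise`, `pairwise`), not the output of any particular stats package; for simplicity, case-wise deletion here drops anyone with any missing value at all.

```python
# Hypothetical toy records; None marks a missing answer.
participants = [
    {"id": 1, "income": 21000, "education": 12, "depressed": 1},
    {"id": 2, "income": None, "education": 16, "depressed": 0},
    {"id": 3, "income": 35000, "education": None, "depressed": 0},
    {"id": 4, "income": 18000, "education": 10, "depressed": 1},
    {"id": 5, "income": None, "education": 14, "depressed": 1},
]

def casewise(rows, variables):
    """Keep only participants with complete data on every variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise(rows, predictor, outcome):
    """Keep participants with complete data on just this pair of variables."""
    return [r for r in rows
            if r[predictor] is not None and r[outcome] is not None]

all_vars = ["income", "education", "depressed"]
n_casewise = len(casewise(participants, all_vars))
n_income = len(pairwise(participants, "income", "depressed"))
n_education = len(pairwise(participants, "education", "depressed"))

print("case-wise n:", n_casewise)
print("pair-wise n, income & depression:", n_income)
print("pair-wise n, education & depression:", n_education)
```

Case-wise deletion throws away the most data, while pair-wise deletion leaves each test running on a different subset of people.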

However, these default deletion methods can be problematic if your data are not missing “completely at random”, that is, if your study participants who tend to have missing data differ in some way from those that do not. So for instance, in my example, if participants who have a lower income are less willing to report it, then your test of association between depression and income will be biased if you use only the case-wise or pair-wise deletion method. An even worse situation arises if those participants who are most likely to have depression are the ones who are most likely to leave data missing; unfortunately, it can be tricky to identify this situation because of the missing data points… Indeed, you might expect that in a group of individuals affected by psychiatric problems, those with more severe problems or lower cognitive abilities would be more likely to be the ones who are unable or unwilling to respond to all the items. It seems to me that these issues and potential biases shouldn’t just be swept under the carpet and ignored. They need to be addressed statistically and discussed.
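A small simulation makes the bias concrete. This is only a sketch under invented assumptions: the income distribution, the cutoff and the probabilities of skipping the question are all made up for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical incomes drawn from a made-up distribution.
incomes = [rng.gauss(30000, 8000) for _ in range(10000)]

def observed_completely_at_random(values, p_missing=0.3):
    """Every participant has the same chance of leaving the item blank."""
    return [v for v in values if rng.random() >= p_missing]

def observed_income_dependent(values, cutoff=25000):
    """Assumed mechanism: below the cutoff, people skip the item more often."""
    kept = []
    for v in values:
        p_missing = 0.6 if v < cutoff else 0.1
        if rng.random() >= p_missing:
            kept.append(v)
    return kept

true_mean = statistics.mean(incomes)
random_mean = statistics.mean(observed_completely_at_random(incomes))
biased_mean = statistics.mean(observed_income_dependent(incomes))

print("true mean income:           ", round(true_mean))
print("missing completely at random:", round(random_mean))
print("low earners under-report:    ", round(biased_mean))
```

When answers go missing completely at random, deleting the gaps costs sample size but the average stays close to the truth. When low earners are the ones who skip the question, the remaining data overstate the average income, and no amount of deletion fixes that.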

Despite being vaguely aware of these problems, until recently, whenever I heard the word “imputation”, all I could think was, “so is making up data really any better?” But that was before I knew what imputation actually is. There are a number of statistical ways of dealing with missing data (other than the deletion methods) and they generally fall under the umbrella term of imputation. The methods have varying assumptions (like normal distribution of the data) and some appear to be better than others, though it seems like the better ones are the most complicated and computationally intensive. I’m only beginning to understand what these are and how they work. I still feel that some of the methods I’ve been reading about are a bit too close to “making up data” for comfort; this includes methods such as substituting the sample mean for each individual missing data point or matching participants on some variables and copying data points from “similar” participants. On the other hand, a probabilistic method that creates a number of possible datasets, each with different plausible values in place of the missing ones, analyses every dataset and takes the mean of those analyses as the result, sounds a bit more clever and reliable. To be honest, I think all of these methods probably have their individual disadvantages and I think there’s a lot more reading I need to do before I fully understand how best to deal with this issue.
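To get a feel for why mean substitution makes me uneasy, here is a rough sketch comparing it with a very simplified version of the "many possible datasets" idea. Everything in it is an assumption for illustration: the scores are invented, they are missing completely at random, and the random fill simply draws from a normal distribution fitted to the observed values. Proper multiple imputation is more involved than this (it also accounts for uncertainty in the fitted values and uses information from other variables).

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical questionnaire scores; None marks a missing answer.
complete = [rng.gauss(50, 10) for _ in range(2000)]
scores = [x if rng.random() >= 0.3 else None for x in complete]

observed = [x for x in scores if x is not None]
obs_mean = statistics.mean(observed)
obs_sd = statistics.stdev(observed)

# Mean substitution: every gap gets the same value, so the spread shrinks.
mean_filled = [obs_mean if x is None else x for x in scores]
sd_mean_filled = statistics.stdev(mean_filled)

def one_random_fill(values):
    """Fill each gap with a random draw, assuming normally distributed scores."""
    return [rng.gauss(obs_mean, obs_sd) if x is None else x for x in values]

# Create several plausible datasets, analyse each one, then average the results.
n_datasets = 20
sd_estimates = [statistics.stdev(one_random_fill(scores))
                for _ in range(n_datasets)]
sd_pooled = statistics.mean(sd_estimates)

print("SD of observed scores:       ", round(obs_sd, 2))
print("SD after mean substitution:  ", round(sd_mean_filled, 2))
print("SD pooled over random fills: ", round(sd_pooled, 2))
```

Mean substitution makes the sample look less variable than it really is, which in turn makes later tests look more precise than they should. The averaged random fills keep the spread roughly where the observed data put it.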

The thing I’m surprised by though is how infrequently I see papers mentioning attempts to use any of these methods or indeed how the (very pervasive) issue of missing data was addressed. I’m also quite surprised that I didn’t have any lectures on this subject during my undergrad studies or the more recent postgrad statistics course I did. It seems to me like a lot of people just don’t take this issue very seriously. Perhaps it isn’t that important after all? Or perhaps it’s just easier to ignore it as dealing with it is quite tricky and time-consuming?