I’ve been learning how to deal with missing data recently. As far as I can tell from reading papers, the majority of researchers in psychology and psychiatry seem to ignore this issue, particularly in cross-sectional studies (those assessing participants at a single time point). Collecting data from human participants, particularly patient groups with psychiatric problems, can be tricky: no matter how thorough and careful the researchers collecting the data are, sometimes a participant just doesn’t know the answer, can’t understand the question or is unwilling to provide an answer for personal/emotional reasons, which is perfectly acceptable and not unusual. However, it seems logical to me that this problem might affect the quality of a dataset and the reliability of the conclusions drawn from it.
It appears that the default method for “dealing” with this issue is either to exclude individuals with a certain amount of incomplete data from the entire analysis (known as case-wise, or listwise, deletion) or, if performing a number of tests, to exclude individuals only from the tests where the relevant variable is missing but include them in the tests where they have complete data (known as pair-wise deletion). For example, if you are looking at whether someone’s income or education is associated with whether they have experienced any depressive symptoms, and a participant has not provided their income but has given their education, they would be included in the test of education & depression but not in the one of income & depression. Pair-wise deletion therefore generally results in a slightly different sample size for each variable tested.
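The difference between the two deletion approaches is easy to see on a toy example. The sketch below uses made-up numbers and pandas (all names and values are hypothetical, just to illustrate the mechanics):

```python
import numpy as np
import pandas as pd

# Hypothetical survey: income and education are each missing for some people
df = pd.DataFrame({
    "income":     [30, np.nan, 55, 42, np.nan, 61],
    "education":  [12, 16, np.nan, 14, 11, 18],
    "depression": [ 7,  3,  2,  5,  9,  1],
})

# Case-wise (listwise) deletion: drop every row with ANY missing value,
# so the same reduced sample is used for all tests
complete = df.dropna()

# Pair-wise deletion: each test keeps whichever rows are complete
# for *that* pair of variables, so sample sizes differ per test
n_income    = df[["income", "depression"]].dropna().shape[0]
n_education = df[["education", "depression"]].dropna().shape[0]

print(len(complete), n_income, n_education)  # → 3 4 5
```

With only three fully complete rows, case-wise deletion throws away half the sample, while pair-wise deletion keeps four rows for the income test and five for the education test — exactly the “slightly different sample size for each variable” pattern described above.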
However, these default deletion methods can be problematic if your data are not missing “completely at random”, that is, if the participants who tend to have missing data differ systematically from those who do not. So, in my example, if participants with a lower income are less willing to report it, then your test of association between depression and income will be biased if you use only case-wise or pair-wise deletion. An even worse situation arises if the participants who are most likely to have depression are also the ones most likely to leave data missing; unfortunately, this situation can be tricky to identify precisely because the relevant data points are missing… Indeed, you might expect that in a group of individuals affected by psychiatric problems, those with more severe problems or lower cognitive abilities would be the ones most likely to be unable or unwilling to respond to all the items. It seems to me that these issues and potential biases shouldn’t just be swept under the carpet and ignored. They need to be addressed statistically and discussed.
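The income example can be simulated to show how large this bias can get. This is a minimal sketch with made-up parameters (a true mean income of 40 in arbitrary units, and lower earners withholding their answer more often); the exact numbers are assumptions, only the direction of the effect is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 10,000 "true" incomes with mean 40 (arbitrary units)
income = rng.normal(40, 10, size=10_000)

# Missing NOT at random: incomes below 35 are withheld 60% of the time,
# higher incomes only 10% of the time
p_missing = np.where(income < 35, 0.6, 0.1)
observed = income[rng.random(10_000) > p_missing]

# The complete-case mean overestimates the true mean, because the
# deleted cases are disproportionately the low earners
print(round(income.mean(), 1), round(observed.mean(), 1))
```

Analysing only the observed values (which is exactly what case-wise or pair-wise deletion does) shifts the estimated mean income upwards by a couple of units, and nothing in the observed data alone reveals that this has happened.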
Despite being vaguely aware of these problems, until recently, whenever I heard the word “imputation”, all I could think was, “so is making up data really any better?”. But that was before I knew what imputation actually is. There are a number of statistical ways of dealing with missing data (other than the deletion methods) and they generally fall under the umbrella term of imputation. The methods have varying assumptions (like normal distribution of the data) and some appear to be better than others, though it seems the better ones are also the most complicated and computationally intensive. I’m only beginning to understand what these are and how they work. I still feel that some of the methods I’ve been reading about are a bit too close to “making up data” for comfort; this includes substituting the sample mean for each missing data point, or matching participants on some variables and copying data points from “similar” participants. On the other hand, a method that imputes probabilistically, creating a number of possible datasets with a range of plausible values, analysing each of them and taking the mean of the analyses as the result (known as multiple imputation), sounds a bit more clever and reliable. To be honest, I think all of these methods probably have their individual disadvantages, and there’s a lot more reading I need to do before I fully understand how best to deal with this issue.
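The “several plausible datasets, analyse each, pool the results” idea can be sketched in a few lines. This is a deliberately simplified version (stochastic regression draws with fixed regression parameters; proper multiple imputation also propagates the uncertainty in those parameters, e.g. via chained equations). All the data and parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: education predicts income; 30% of incomes unobserved
n = 500
education = rng.normal(14, 2, n)
income = 2.5 * education + rng.normal(0, 5, n)
missing = rng.random(n) < 0.3
obs = ~missing

# Fit a simple regression of income on education using the observed cases
slope, intercept = np.polyfit(education[obs], income[obs], 1)
resid_sd = np.std(income[obs] - (slope * education[obs] + intercept))

# Multiple imputation (simplified): create m completed datasets, where each
# missing value is a *draw* (prediction plus noise), not just the mean;
# analyse each dataset, then pool the analyses by averaging
m = 20
estimates = []
for _ in range(m):
    imputed = income.copy()
    imputed[missing] = (slope * education[missing] + intercept
                        + rng.normal(0, resid_sd, missing.sum()))
    estimates.append(imputed.mean())

pooled = np.mean(estimates)
print(round(income.mean(), 1), round(pooled, 1))
```

The random noise added to each draw is what separates this from the simple “substitute the predicted value” approach: across the m datasets the imputed values vary, so the spread of the m analyses also carries information about how uncertain the missing values make the final estimate.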
The thing I’m surprised by though is how infrequently I see papers mentioning attempts to use any of these methods or indeed how the (very pervasive) issue of missing data was addressed. I’m also quite surprised that I didn’t have any lectures on this subject during my undergrad studies or the more recent postgrad statistics course I did. It seems to me like a lot of people just don’t take this issue very seriously. Perhaps it isn’t that important after all? Or perhaps it’s just easier to ignore it as dealing with it is quite tricky and time-consuming?