Well, it seems I've made a hash of this. Let me try to explain this again, in a different way and we'll see if it might help clear things up.
The traditional way to explain McNemar's test vs. the chi-squared test is to ask if the data are "paired" and to recommend McNemar's test if the data are paired and the chi-squared test if the data are "unpaired". I have found that this leads to a lot of confusion (this thread being an example!). In place of this, I have found that it is most helpful to focus on the question you are trying to ask, and to use the test that matches your question. To make this more concrete, let's look at a made-up scenario:
You walk around a statistics conference and for each statistician you meet, you record whether they are from the US or the UK. You also record whether they have high blood pressure or normal blood pressure.
Here are the data:
mat = as.table(rbind(c(195, 5),
c( 5, 195) ))
colnames(mat) = c("US", "UK")
rownames(mat) = c("Hi", "Normal")
names(dimnames(mat)) = c("BP", "Nationality")
mat
# Nationality
# BP US UK
# Hi 195 5
# Normal 5 195
At this point, it is important to figure out what question we want to ask of our data. There are three different questions we could ask here:
- We might want to know if the categorical variables BP and Nationality are associated or independent;
- We might wonder if high blood pressure is more common amongst US statisticians than it is amongst UK statisticians;
- Finally, we might wonder if the proportion of statisticians with high blood pressure is equal to the proportion of US statisticians that we talked to. This refers to the marginal proportions of the table. These are not printed by default in R, but we can get them thusly (notice that, in this case, they are exactly the same):
margin.table(mat, 1)/sum(mat)
# BP
# Hi Normal
# 0.5 0.5
margin.table(mat, 2)/sum(mat)
# Nationality
# US UK
# 0.5 0.5
As I said, the traditional approach, discussed in many textbooks, is to determine which test to use based on whether the data are "paired" or not. But this is very confusing: is this contingency table "paired"? If you compare the proportion with high blood pressure between US and UK statisticians, you are comparing two proportions (albeit of the same variable) measured on different sets of people. On the other hand, if you want to compare the proportion with high blood pressure to the proportion US, you are comparing two proportions (albeit of different variables) measured on the same set of people. These data are both "paired" and "unpaired" at the same time (albeit with respect to different aspects of the data). This leads to confusion. To try to avoid this confusion, I argue that you should think in terms of which question you are asking. Specifically, if you want to know:
- If the variables are independent: use the chi-squared test.
- If the proportion with high blood pressure differs by nationality: use the z-test for difference of proportions.
- If the marginal proportions are the same: use McNemar's test.
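To make that mapping concrete, here is how the three questions translate into three different R calls for the table above (in the prop.test() line I condition on nationality, comparing the 195/200 US statisticians with high blood pressure to the 5/200 UK ones):
chisq.test(mat)                            # 1. independence of BP and Nationality
prop.test(x=c(195, 5), n=c(200, 200))      # 2. P(high BP) for US vs. UK statisticians
mcnemar.test(mat)                          # 3. equality of the marginal proportions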
Someone might disagree with me here, arguing that because the contingency table is not "paired", McNemar's test cannot be used to test the equality of the marginal proportions and that the chi-squared test should be used instead. Since this is the point of contention, let's try both to see if the results make sense:
chisq.test(mat)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: mat
# X-squared = 357.21, df = 1, p-value < 2.2e-16
mcnemar.test(mat)
# McNemar's Chi-squared test
#
# data: mat
# McNemar's chi-squared = 0, df = 1, p-value = 1
The chi-squared test yields a p-value of approximately 0. That is, it says that the probability of getting data as far from equal marginal proportions as these, or further, if the marginal proportions actually were equal, is essentially 0. But the marginal proportions are exactly the same, 50% = 50%, as we saw above! The results of the chi-squared test just don't make any sense in light of the data. On the other hand, McNemar's test yields a p-value of 1. That is, it says you would have a 100% chance of finding marginal proportions at least this far from equality, if the true marginal proportions were equal. Since the observed marginal proportions cannot be closer to equal than they are, this result makes sense.
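To see why the chi-squared test is answering a different question, recall that it tests independence: it compares the observed counts to the counts expected if BP and Nationality were unrelated. We can extract those from the fitted test object (with all four margins equal to 200 out of 400, every expected count is 200*200/400 = 100):
chisq.test(mat)$expected
#         Nationality
# BP        US  UK
#   Hi     100 100
#   Normal 100 100
The observed counts (195 and 5) are nowhere near 100, so the near-zero p-value is the right answer to the independence question; it just was never an answer to the question about the marginal proportions.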
Let's try another example:
mat2 = as.table(rbind(c(195, 195),
c( 5, 5) ))
colnames(mat2) = c("US", "UK")
rownames(mat2) = c("Hi", "Normal")
names(dimnames(mat2)) = c("BP", "Nationality")
mat2
# Nationality
# BP US UK
# Hi 195 195
# Normal 5 5
margin.table(mat2, 1)/sum(mat2)
# BP
# Hi Normal
# 0.975 0.025
margin.table(mat2, 2)/sum(mat2)
# Nationality
# US UK
# 0.5 0.5
In this case, the marginal proportions are very different, 97.5%≫50%. Let's try the two tests again to see how their results compare to the observed large difference in marginal proportions:
chisq.test(mat2)
# Pearson's Chi-squared test
#
# data: mat2
# X-squared = 0, df = 1, p-value = 1
mcnemar.test(mat2)
# McNemar's Chi-squared test with continuity correction
#
# data: mat2
# McNemar's chi-squared = 178.605, df = 1, p-value < 2.2e-16
This time, the chi-squared test gives a p-value of 1, which, read as a statement about the margins, would mean that the marginal proportions are as equal as they can be. But we saw that the marginal proportions are very obviously not equal, so this result doesn't make any sense in light of our data. On the other hand, McNemar's test yields a p-value of approximately 0. In other words, it is extremely unlikely to get data with marginal proportions as far from equality as these, if they truly are equal in the population. Since our observed marginal proportions are far from equal, this result makes sense.
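Again it helps to look at what the chi-squared test is actually checking. For mat2, the expected counts under independence (row total times column total over grand total, e.g., 390*200/400 = 195) are identical to the observed counts, so as a test of independence p = 1 is exactly right; the test is simply blind to the margins:
chisq.test(mat2)$expected
#         Nationality
# BP        US  UK
#   Hi     195 195
#   Normal   5   5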
The fact that the chi-squared test yields results that make no sense given our data suggests there is something wrong with using the chi-squared test here. Of course, the fact that McNemar's test provided sensible results doesn't prove that it is valid; it may just have been a coincidence. But the chi-squared test is clearly wrong.
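One quick way to build some confidence, a simulation sketch of my own rather than a proof: generate paired data in which BP depends on nationality but the true marginal proportions are equal, and check how often McNemar's test falsely rejects at the 5% level.
set.seed(1)
reject = replicate(2000, {
  n  = 400
  us = rbinom(n, 1, .5)                      # nationality
  hi = rbinom(n, 1, ifelse(us==1, .8, .2))   # BP depends on nationality,
                                             # but P(Hi) = .5*.8 + .5*.2 = .5 = P(US)
  mcnemar.test(table(hi, us))$p.value < .05
})
mean(reject)  # should be near .05 (a bit below, given the continuity correction)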
Let's see if we can work through the argument for why McNemar's test might be the right one. I will use a third dataset:
mat3 = as.table(rbind(c(190, 15),
c( 60, 135) ))
colnames(mat3) = c("US", "UK")
rownames(mat3) = c("Hi", "Normal")
names(dimnames(mat3)) = c("BP", "Nationality")
mat3
# Nationality
# BP US UK
# Hi 190 15
# Normal 60 135
margin.table(mat3, 1)/sum(mat3)
# BP
# Hi Normal
# 0.5125 0.4875
margin.table(mat3, 2)/sum(mat3)
# Nationality
# US UK
# 0.625 0.375
This time we want to compare 51.25% to 62.5% and wonder if in the population the true marginal proportions might have been the same. Because we are comparing two proportions, the most intuitive option would be to use a z-test for the equality of two proportions. We can try that here:
prop.test(x=c(205, 250), n=c(400, 400))
# 2-sample test for equality of proportions with continuity correction
#
# data: c(205, 250) out of c(400, 400)
# X-squared = 9.8665, df = 1, p-value = 0.001683
# alternative hypothesis: two.sided
# 95 percent confidence interval:
# -0.18319286 -0.04180714
# sample estimates:
# prop 1 prop 2
# 0.5125 0.6250
(To use prop.test() to test the marginal proportions, I had to enter the numbers of 'successes' and the total number of 'trials' manually, but you can see from the last line of the output that the proportions are correct.) This suggests that it is unlikely to get marginal proportions this far from equality if they were actually equal, given the amount of data we have.
Is this test valid? There are two problems here: The test believes we have 800 observations, when we actually have only 400. And it does not take into account that these two proportions are not independent, in the sense that they were measured on the same people.
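The dependence is easy to see directly from the table: the 190 statisticians in the Hi/US cell are part of both marginal counts, so the two 'samples' that prop.test() treats as independent overlap substantially:
mat3["Hi", "US"]   # 190 -- these people are counted in both margins
sum(mat3["Hi", ])  # 205 = 190 + 15, the high-BP margin
sum(mat3[, "US"])  # 250 = 190 + 60, the US margin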
Let's see if we can take this apart and find another way. From the contingency table, we can see that the marginal proportions are:
% high BP: (190 + 15)/400        % US: (190 + 60)/400
What we see here is that the 190 US statisticians with high blood pressure show up in both marginal proportions. They are counted twice and contribute no information about the difference between the marginal proportions. Moreover, the 400 total shows up in both denominators as well. All of the unique and distinctive information is in the two off-diagonal cell counts (15 and 60). Whether the marginal proportions are the same or different depends only on them. Under the null, an observation that falls into one of those two cells is equally likely to land in either, so the count in one cell is distributed as a binomial with probability π = .5. That was McNemar's insight. In fact, McNemar's test is essentially just a binomial test of whether observations are equally likely to fall into those two cells:
binom.test(x=15, n=(15+60))
# Exact binomial test
#
# data: 15 and (15 + 60)
# number of successes = 15, number of trials = 75, p-value = 1.588e-07
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval:
# 0.1164821 0.3083261
# sample estimates:
# probability of success
# 0.2
In this version, only the informative observations are used and they are not counted twice. The p-value here is much smaller, 0.0000001588, which is often the case when the dependency in the data is taken into account. That is, this test is more powerful than the z-test of difference of proportions. We can further see that the above version is essentially the same as McNemar's test:
mcnemar.test(mat3, correct=FALSE)
# McNemar's Chi-squared test
#
# data: mat3
# McNemar's chi-squared = 27, df = 1, p-value = 2.035e-07
If it is confusing that the two p-values are not identical: McNemar's test, as typically presented (and as implemented in R), takes the squared difference of the two off-diagonal counts over their sum and compares it to the chi-squared distribution, which is an approximation rather than an exact test like the binomial above:
(15-60)^2/(15+60)
# [1] 27
1-pchisq(27, df=1)
# [1] 2.034555e-07
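As a cross-check (my own, using the standard normal approximation to the binomial): McNemar's statistic is just the square of the z statistic for testing whether 15 successes out of 75 trials is consistent with probability .5.
z = (15 - 75/2) / (sqrt(75)/2)
z^2         # [1] 27 -- McNemar's chi-squared statistic
2*pnorm(z)  # [1] 2.034555e-07 -- matches the chi-squared p-value above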
Thus, when you want to check whether the marginal proportions of a contingency table are equal, McNemar's test (or the exact binomial test computed manually) is correct. It uses only the relevant information, without counting any data twice. It does not just 'happen' to yield results that make sense in light of the data.
I continue to believe that trying to figure out whether a contingency table is "paired" is unhelpful. I suggest using the test that matches the question you are asking of the data.