What does pooled variance actually mean?

15

I am a newbie in statistics, so could you please help me out here.

My question is the following: what does pooled variance actually mean?

When I look for a formula for pooled variance on the internet, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathologists_Statistic/Int sinhtion_to_Statologists_Tests / thepage / newnode19 ):

$$S^2_p = \frac{S_1^2 (n_1-1) + S_2^2 (n_2-1)}{n_1 + n_2 - 2}$$

But what does it actually calculate? When I use this formula to calculate my pooled variance, it gives me the wrong answer.

For example, consider this "parent sample":

$$2, 2, 2, 2, 2, 8, 8, 8, 8, 8$$

The variance of this parent sample is $S^2_p=10$, and its mean is $\bar{x}_p=5$.

Now, suppose I split this parent sample into two sub-samples:

The first sub-sample is 2,2,2,2,2 with mean $\bar{x}_1=2$ and variance $S^2_1=0$.
The second sub-sample is 8,8,8,8,8 with mean $\bar{x}_2=8$ and variance $S^2_2=0$.

Now, clearly, using the above formula to calculate the pooled/parent variance of these two sub-samples will produce zero, because $S_1=0$ and $S_2=0$. So what does this formula actually calculate?
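The mismatch described above can be reproduced in a few lines. This is only a sketch using Python's standard `statistics` module; the helper name `pooled_variance` is made up for illustration and simply encodes the two-sample formula quoted earlier.

```python
from statistics import variance

def pooled_variance(a, b):
    """Two-sample pooled variance: sample variances weighted by n - 1."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

first = [2, 2, 2, 2, 2]
second = [8, 8, 8, 8, 8]

# Spread inside each sub-sample only (the gap between 2 and 8 is ignored).
print(pooled_variance(first, second))

# Spread of the concatenated "parent sample".
print(variance(first + second))
```

The first value is zero and the second is 10, matching the numbers in the question: the two quantities measure different things.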

On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:

$$S^2_p = \frac{S_1^2 (n_1-1) + n_1 d_1^2 + S_2^2 (n_2-1) + n_2 d_2^2}{n_1 + n_2 - 1}$$

In the above formula, $d_1=\bar{x}_1-\bar{x}_p$ and $d_2=\bar{x}_2-\bar{x}_p$.
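For readers wondering where a formula of this shape can come from, here is one possible sketch of the derivation. The symbol $x_{ij}$ (the $i$-th value in sub-sample $j$) is introduced here for illustration and is not part of the post. Splitting each deviation from the parent mean as $(x_{ij}-\bar{x}_j) + (\bar{x}_j-\bar{x}_p)$ and noting that the cross term vanishes because $\sum_i (x_{ij}-\bar{x}_j)=0$:

$$\sum_{j=1}^{2}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_p)^2 = \sum_{j=1}^{2}\left[\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_j)^2 + n_j(\bar{x}_j-\bar{x}_p)^2\right] = \sum_{j=1}^{2}\left[(n_j-1)S_j^2 + n_j d_j^2\right]$$

Dividing the left-hand side by $n_1+n_2-1$ gives the sample variance of the concatenated data, which is the quantity the formula above computes.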

I found a formula similar to mine, for example here: http://www.emathzone.com/tutorials/basic-statistic/combined-variance.html and also in Wikipedia. Although I have to admit that they don't look exactly the same as mine.

So again, what does pooled variance actually mean? Shouldn't it mean the variance of the parent sample obtained from the two sub-samples? Or am I completely wrong here?

Thank you in advance.

EDIT 1: Someone said that my two sub-samples above are pathological since they have zero variance. Well, I can give you a different example. Consider this parent sample:

$$1, 2, 3, 4, 5, 46, 47, 48, 49, 50$$

The variance of this parent sample is $S^2_p=564.7$, and its mean is $\bar{x}_p=25.5$.

Now, suppose I split this parent sample into two sub-samples:

The first sub-sample is 1,2,3,4,5 with mean $\bar{x}_1=3$ and variance $S^2_1=2.5$.
The second sub-sample is 46,47,48,49,50 with mean $\bar{x}_2=48$ and variance $S^2_2=2.5$.

Now, if you use the "literature's formula" to compute the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. Instead, if you use "my formula", you will get the correct answer.
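Both numbers can be checked directly. The sketch below assumes Python's standard `statistics` module; the function names `pooled_variance` and `combined_variance` are invented labels for the "literature's formula" and "my formula" respectively.

```python
from statistics import mean, variance

def pooled_variance(a, b):
    """The "literature's formula": within-sample variances weighted by n - 1."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

def combined_variance(a, b):
    """"My formula": adds each sub-sample mean's offset from the parent mean."""
    n1, n2 = len(a), len(b)
    parent_mean = (n1 * mean(a) + n2 * mean(b)) / (n1 + n2)
    d1 = mean(a) - parent_mean
    d2 = mean(b) - parent_mean
    numerator = ((n1 - 1) * variance(a) + n1 * d1 ** 2
                 + (n2 - 1) * variance(b) + n2 * d2 ** 2)
    return numerator / (n1 + n2 - 1)

first = [1, 2, 3, 4, 5]
second = [46, 47, 48, 49, 50]

print(pooled_variance(first, second))     # within-group spread
print(combined_variance(first, second))   # variance of the parent sample
print(variance(first + second))           # direct check on the concatenated data
```

The last two values agree (about 564.7), while the first is 2.5, so the two formulas answer different questions rather than one being a broken version of the other.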

Please understand, I use extreme examples here to show people that the formula is indeed wrong. If I use "normal data" with little variation (no extreme cases), then the results from those two formulas will be very similar, and people could dismiss the difference as rounding error rather than seeing that the formula itself is wrong.

variance mean pooling

— Hanciong

Some related links that may help: stats.stackexchange.com/q/214834/3277 , stats.stackexchange.com/q/12330/3277 , stats.stackexchange.com/q/43159/3277 .

— ttnphns

13

Put simply, the pooled variance is an (unbiased) estimate of the variance within each sample, under the assumption/constraint that those variances are equal.

This is explained, motivated, and analyzed in some detail in the Wikipedia entry for pooled variance.

It does not estimate the variance of a new "meta-sample" formed by concatenating the two individual samples, as you supposed. As you have already discovered, estimating that requires a completely different formula.
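A small simulation can make this concrete. It is only a sketch under assumed settings: two normal populations with a common variance of 4, means of 0 and 10, and 20 draws each. All of these numbers are arbitrary choices for illustration, not taken from the question.

```python
import random
from statistics import variance

rng = random.Random(0)  # fixed seed so the run is repeatable

SIGMA = 2.0             # common standard deviation, so the true variance is 4.0
MU_A, MU_B = 0.0, 10.0  # the two populations differ only in their means
N_A, N_B = 20, 20
TRIALS = 2000

pooled_sum = 0.0
concat_sum = 0.0
for _ in range(TRIALS):
    a = [rng.gauss(MU_A, SIGMA) for _ in range(N_A)]
    b = [rng.gauss(MU_B, SIGMA) for _ in range(N_B)]
    pooled_sum += ((N_A - 1) * variance(a) + (N_B - 1) * variance(b)) / (N_A + N_B - 2)
    concat_sum += variance(a + b)

avg_pooled = pooled_sum / TRIALS
avg_concat = concat_sum / TRIALS
print(f"average pooled variance:       {avg_pooled:.2f}")
print(f"average concatenated variance: {avg_concat:.2f}")
```

Averaged over many repetitions, the pooled variance should land close to the common within-population variance of 4, while the variance of the concatenated sample comes out much larger because it also absorbs the gap between the two means.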

— Jake Westfall

The assumption of "equality" (that is, that the same population produced those samples) is not necessary in general to define what "pooled" is. Pooled simply means averaged, omnibus (see my comment to Tim).

— ttnphns

@ttnphns I think the equality assumption is necessary to give the pooled variance a conceptual meaning (which the OP asked for) that goes beyond just verbally describing the mathematical operation it performs on the sample variances. If the population variances are not assumed equal, then it's unclear what we could consider the pooled variance to be an estimate of. Of course, we could just think of it as an amalgamation of the two variances and leave it at that, but that's hardly enlightening in the absence of any motivation for wanting to combine the variances in the first place.

— Jake Westfall

Jake, I don't disagree with that, given the specific question of the OP, but I wanted to speak about the definition of the word "pooled", which is why I said "in general".

— ttnphns

@JakeWestfall Your answer is the best answer so far. Thank you. Though I am still not clear about one thing. According to Wikipedia, pooled variance is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same.

— Hanciong

@JakeWestfall: So if we calculate the pooled variance from two different populations with different means, what does it actually calculate? The first variance measures the variation with respect to the first mean, and the second variance is with respect to the second mean. I don't know what additional information can be gained from calculating it.

— Hanciong

10

Pooled variance is used to combine variances from different samples by taking their weighted average, to get the "overall" variance. The problem with your example is that it is a pathological case, since each of the sub-samples has a variance of zero. Such a pathological case has very little in common with the data we usually encounter, since there is always some variability, and if there is no variability, we don't care about such variables because they carry no information. Note that this is a very simple method, and there are more complicated ways of estimating variance in hierarchical data structures that are not prone to such problems.

Suppose you have $n$ data points in $k$ groups, denoted $x_{1,1},x_{2,1},\dots,x_{n-1,k},x_{n,k}$, where the $i$-th index in $x_{i,j}$ stands for cases and the $j$-th index stands for groups. Several scenarios are possible. You can assume that all the points come from the same distribution (for simplicity, let's assume a normal distribution),

$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2) \tag{1}$$

you can assume that each of the sub-samples has its own mean

$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2) \tag{2}$$

or, its own variance

$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2_j) \tag{3}$$

or, each of them have their own, distinct parameters

$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2_j) \tag{4}$$

Depending on your assumptions, particular method may, or may not be adequate for analyzing the data.

In the first case, you wouldn't be interested in estimating the within-group variances, since you would assume that they all are the same. Nonetheless, if you aggregated the global variance from the group variances, you would get the same result as by using pooled variance since the definition of variance is

$$\mathrm{Var}(X) = \frac{1}{n-1} \sum_i (x_i - \mu)^2$$

and in the pooled estimator you first multiply each group variance by its $n_j-1$, then add the results together, and finally divide by $n_1 + n_2 - 2$.

In the second case, the means differ, but you have a common variance. This is closest to your example in the edit. In this scenario, the pooled variance would correctly estimate the common variance, while if you estimated the variance on the whole dataset, you would obtain incorrect results, since you would not be accounting for the fact that the groups have different means.

In the third case it doesn't make sense to estimate the "global" variance, since you assume that each of the groups has its own variance. You may still be interested in obtaining an estimate for the whole population, but in that case both (a) calculating the individual variances per group and (b) calculating the global variance from the whole dataset can give you misleading results. If you are dealing with this kind of data, you should think of using a more complicated model that accounts for the hierarchical nature of the data.

The fourth case is the most extreme and quite similar to the previous one. In this scenario, if you wanted to estimate the global mean and variance, you would need a different model and different set of assumptions. In such case, you would assume that your data is of hierarchical structure, and besides the within-group means and variances, there is a higher-level common variance, for example assuming the following model

$$\begin{aligned} x_{i,j} &\sim \mathcal{N}(\mu_j, \sigma^2_j) \\ \mu_j &\sim \mathcal{N}(\mu_0, \sigma^2_0) \\ \sigma^2_j &\sim \mathcal{IG}(\alpha, \beta) \end{aligned} \tag{5}$$

where each sample has its own mean and variance $\mu_j,\sigma^2_j$ that are themselves draws from common distributions. In such a case, you would use a hierarchical model that takes into consideration both the lower-level and upper-level variability. To read more about this kind of model, you can check the Bayesian Data Analysis book by Gelman et al. and their eight schools example. This is, however, a much more complicated model than the simple pooled variance estimator.
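To get a feel for what data from model (5) look like, here is a minimal simulation sketch. It only generates data and does not fit the hierarchical model. The hyperparameter values, the number of groups, and the group size are all hypothetical, picked purely for illustration.

```python
import random
from statistics import mean, variance

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical hyperparameters, chosen only for illustration.
MU_0, SIGMA_0 = 50.0, 10.0   # upper-level distribution of the group means
ALPHA, BETA = 3.0, 8.0       # inverse-gamma shape and scale for group variances
K, N_PER_GROUP = 8, 30

groups = []
for _ in range(K):
    mu_j = rng.gauss(MU_0, SIGMA_0)
    # If G ~ Gamma(shape=ALPHA, scale=1/BETA), then 1/G ~ InverseGamma(ALPHA, BETA).
    sigma2_j = 1.0 / rng.gammavariate(ALPHA, 1.0 / BETA)
    groups.append([rng.gauss(mu_j, sigma2_j ** 0.5) for _ in range(N_PER_GROUP)])

for j, g in enumerate(groups, start=1):
    print(f"group {j}: mean={mean(g):.2f}, variance={variance(g):.2f}")
```

Each group gets its own mean and its own variance, so neither a single pooled variance nor the variance of the concatenated data summarises this structure well on its own.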

— Tim

I have updated my question with a different example. In this case, the answer from the "literature's formula" is still wrong. I understand that we are usually dealing with "normal data" where there is no extreme case like my example above. However, as mathematicians, shouldn't you care about which formula is actually correct, instead of which formula applies to "everyday/common problems"? If some formula is fundamentally wrong, it should be discarded, especially if there is another formula which holds in all cases, pathological or not.

— Hanciong

Btw you said there are more complicated ways of estimating variance. Could you show me these ways? Thank you

— Hanciong

2

Tim, pooled variance is not the total variance of the "combined sample". In statistics, "pooled" means weighted averaged (when we speak of averaged quantities such as variances, weights being the n's) or just summed (when we speak of sums such as scatters, sums-of-squares). Please, reconsider your terminology (choice of words) in the answer.

— ttnphns

1

Albeit off the current topic, here is an interesting question about "common" variance concept. stats.stackexchange.com/q/208175/3277

— ttnphns

1

Hanciong. I insist that "pooled" in general and even specifically "pooled variance" concept does not need, in general, any assumption such as: groups came from populations with equal variances. Pooling is simply blending (weighted averaging or summing). It is in ANOVA and similar circumstances that we do add that statistical assumption.

— ttnphns

1

The problem is that if you just concatenate the samples and estimate the variance of the result, you're assuming they come from the same distribution and therefore have the same mean. But in general we are interested in several samples with different means. Does this make sense?

— ZHU

0

The use-case of pooled variance is when you have two samples from distributions that:

may have different means, but
which you expect to have an equal true variance.

An example of this is a situation where you measure the length of Alice's nose $n$ times for one sample, and measure the length of Bob's nose $m$ times for the second. These are likely to produce a bunch of different measurements on the scale of millimeters, because of measurement error. But you expect the variance in measurement error to be the same no matter which nose you measure.

In this case, taking the pooled variance would give you a better estimate of the variance in measurement error than taking the variance of one sample alone.
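A quick simulation can illustrate why pooling helps here. Everything numeric below is a made-up assumption for the sketch: true nose lengths of 50 mm and 60 mm, a measurement-error standard deviation of 0.5 mm, and sample sizes of 10 and 15.

```python
import random
from statistics import variance

rng = random.Random(1)  # fixed seed so the run is repeatable

ERROR_SD = 0.5                  # hypothetical measurement-error SD in mm
TRUE_VAR = ERROR_SD ** 2        # the quantity both estimators target
ALICE_MM, BOB_MM = 50.0, 60.0   # hypothetical true nose lengths
N, M = 10, 15                   # number of measurements of each nose
TRIALS = 3000

sq_err_single = 0.0
sq_err_pooled = 0.0
for _ in range(TRIALS):
    alice = [rng.gauss(ALICE_MM, ERROR_SD) for _ in range(N)]
    bob = [rng.gauss(BOB_MM, ERROR_SD) for _ in range(M)]
    single = variance(alice)
    pooled = ((N - 1) * single + (M - 1) * variance(bob)) / (N + M - 2)
    sq_err_single += (single - TRUE_VAR) ** 2
    sq_err_pooled += (pooled - TRUE_VAR) ** 2

mse_single = sq_err_single / TRIALS
mse_pooled = sq_err_pooled / TRIALS
print(f"mean squared error, Alice's sample alone: {mse_single:.5f}")
print(f"mean squared error, pooled estimate:      {mse_pooled:.5f}")
```

The pooled estimate should show the smaller mean squared error, since it uses the information in both samples, and the difference between the two noses never enters the calculation.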

— Misha

Thank you for your answer, but I still don't understand one thing. The first data set gives you the variance with respect to Alice's nose length, and the second gives you the variance with respect to Bob's nose length. If you calculate a pooled variance from those data, what does it actually mean? The first variance measures the variation with respect to Alice's, and the second with respect to Bob's, so what additional information can we gain by calculating their pooled variance? They are completely different numbers.

— Hanciong

0

With pooled variance we are not trying to estimate the variance of a bigger sample from smaller samples. Hence, the two examples you gave don't quite address the question.

Pooled variance is used to get a better estimate of the population variance from two samples that have been randomly drawn from that population and have produced different variance estimates.

For example, suppose you are trying to gauge the variance in the smoking habits of males in London. You sample twice, 300 males each time. You end up with two variances (probably a bit different!). Since you did fair random sampling (to the best of your ability, as true random sampling is almost impossible), you have every right to say that both variances are valid point estimates of the population variance (London males in this case).

But how can there be two different point estimates? To resolve this, we find a common point estimate, which is the pooled variance. It is nothing but a weighted average of the two point estimates, where the weights are the degrees of freedom associated with each sample.
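Written out for two samples, the weighted average looks like this (the weights $w_1, w_2$ are notation added here for illustration):

$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1)+(n_2-1)} = w_1 S_1^2 + w_2 S_2^2, \qquad w_j = \frac{n_j-1}{n_1+n_2-2}$$

With $n_1 = n_2 = 300$ as in the London example, both weights are $299/598 = 1/2$, so the pooled variance is simply the plain average of the two sample variances.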

Hope this clarifies.

— Sameer Saurabh