Relationship between the range and the standard deviation

In an article I found a formula for the standard deviation of a sample of size $N$:

$\sigma=\frac{\overline{R}}{2.534}$

where $\overline{R}$ is the mean range of subsamples (of size $6$) drawn from the main sample. How is the number $2.534$ calculated? Is it the correct number?
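As a rough illustration of how such a formula would be applied, here is a minimal R sketch. It assumes the subsamples are consecutive, non-overlapping blocks of size 6 (the article's actual scheme is not stated here), and the function name `range_based_sd` and the simulated Normal data are hypothetical.

```r
# Estimate sigma as (mean range of subsamples of size k) / divisor.
# Assumption: subsamples are consecutive, non-overlapping blocks of x.
range_based_sd <- function(x, k = 6, divisor = 2.534) {
  m <- length(x) %/% k                           # number of complete subsamples
  groups <- matrix(x[seq_len(m * k)], nrow = k)  # one subsample per column
  ranges <- apply(groups, 2, function(g) max(g) - min(g))
  mean(ranges) / divisor
}

set.seed(1)
x <- rnorm(6000, mean = 10, sd = 2)  # simulated data, assumed Normal
print(c(range_based = range_based_sd(x), usual_sd = sd(x)))
```

With Normal data the two estimates should land close to each other; for other distributional shapes the divisor would differ somewhat.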

standard-deviation descriptive-statistics range

— Andy

Please give the reference. More importantly: 1. There cannot be a "correct number" here that is independent of the kind of distribution you are drawing from. 2. Such rules usually stem from an interest in short-cut methods for estimating the SD from the range. Now we have computers.... Do you want to do that, and why? Why not just use the data?

— Nick Cox

@Nick Sorry: you are right. A value around $4$ works for the standard deviation when the sample size is about $15$ to $50$; $3$ works for sample sizes around $10$, etc. I will delete the previous comment so it doesn't confuse anyone other than me!

— whuber

@NickCox It is an old Russian source, and I have not seen the formula before.

— Andy

Giving the reference is rarely a bad idea. Let readers decide for themselves whether it is interesting or accessible. (There are plenty of people here who can read Russian, for example.)

— Nick Cox

Answers:

In a sample $x$ of $n$ independent values from a distribution $F$ with pdf $f$, the pdf of the joint distribution of the extremes $\min(x)=x_{[1]}$ and $\max(x)=x_{[n]}$ is proportional to

$f(x_{[1]})\left(F(x_{[n]})-F(x_{[1]})\right)^{n-2}f(x_{[n]})dx_{[1]}dx_{[n]} = H_F(x_{[1]}, x_{[n]})dx_{[1]}dx_{[n]}.$

(The constant of proportionality is the multinomial coefficient $\binom{n}{1,n-2,1} = n(n-1)$. Intuitively, this joint PDF expresses the chance of finding the smallest value in the range $[x_{[1]},x_{[1]}+dx_{[1]})$, the largest value in the range $[x_{[n]},x_{[n]}+dx_{[n]})$, and the middle $n-2$ values between them within the range $[x_{[1]}+dx_{[1]}, x_{[n]})$. When $F$ is continuous, we may replace that middle range by $(x_{[1]}, x_{[n]}]$, thereby neglecting only an "infinitesimal" amount of probability. The associated probabilities, to first order in the differentials, are $f(x_{[1]})dx_{[1]},$ $f(x_{[n]})dx_{[n]},$ and $F(x_{[n]})-F(x_{[1]}),$ respectively, now making it obvious where the formula comes from.)

Taking the expectation of the range $x_{[n]} - x_{[1]}$ gives $2.53441\ \sigma$ for any Normal distribution with standard deviation $\sigma$ and $n=6$ . The expected range as a multiple of $\sigma$ depends on the sample size $n$ :

[Figure: expected range as a multiple of $\sigma$ versus sample size $n$, Normal distributions]

These values were computed by numerically integrating $\binom{n}{1,n-2,1}\left(y-x\right)H_F(x,y)dxdy$ over $\{(x,y)\in\mathbb{R}^2|x\le y\}$ , with $F$ set to the standard Normal CDF, and dividing by the standard deviation of $F$ (which is just $1$ ).
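The Normal multipliers can be reproduced with a short R sketch. Note that this is not the double integral described above: it swaps in the equivalent single-integral identity $E[x_{[n]}-x_{[1]}]=\int_{-\infty}^{\infty}\left(1-F(x)^n-(1-F(x))^n\right)dx$, which follows from writing the expectations of the maximum and minimum in terms of their CDFs $F^n$ and $1-(1-F)^n$. The function name `expected_range_normal` is hypothetical.

```r
# Expected range of n i.i.d. standard Normal values, in units of sigma (= 1).
# Uses E[range] = integral of 1 - F(x)^n - (1 - F(x))^n over the real line,
# a single-integral equivalent of the double integral in the text.
expected_range_normal <- function(n) {
  integrand <- function(x) {
    1 - pnorm(x)^n - pnorm(x, lower.tail = FALSE)^n
  }
  integrate(integrand, lower = -Inf, upper = Inf, rel.tol = 1e-8)$value
}

# The text reports 2.53441 for n = 6
print(round(expected_range_normal(6), 5))

# Multipliers for a few other sample sizes
sizes <- c(2, 3, 6, 10, 15, 50)
print(data.frame(n = sizes, multiplier = sapply(sizes, expected_range_normal)))
```

For $n=2$ the result can be checked by hand, since the expected absolute difference of two standard Normal values is $2/\sqrt{\pi}$.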

A similar multiplicative relationship between the expected range and the standard deviation will hold for any location-scale family of distributions, because it is a property of the shape of the distribution alone. For instance, here is a comparable plot for uniform distributions:

[Figure: Uniform]

and for exponential distributions:

[Figure: Exponential]

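Because $f$ and $F$ have simple algebraic forms in these two cases, the multipliers can be written in closed form: for uniform distributions the multiplier is $\frac{n-1}{n+1}\sqrt{12}$, and for exponential distributions it is $\gamma + \psi(n) = \gamma + \frac{\Gamma'(n)}{\Gamma(n)}$, where $\gamma$ is Euler's constant and $\psi$ is the digamma function (the order-zero "polygamma" function), the logarithmic derivative of Euler's Gamma function.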
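The two closed forms can be checked against numerical integration with a small R sketch. The function names are hypothetical, and the numerical check uses the single-integral identity $E[\text{range}]=\int\left(1-F^n-(1-F)^n\right)dx$ over the support of the standard member of each family, an assumption of convenience rather than the method used for the plots.

```r
# Closed-form multipliers E[range] / sd quoted in the text.
uniform_multiplier <- function(n) (n - 1) / (n + 1) * sqrt(12)
exponential_multiplier <- function(n) -digamma(1) + digamma(n)  # gamma + psi(n)

# Independent check: integrate 1 - F^n - (1 - F)^n over the support,
# then divide by the standard deviation of that distribution.
numeric_multiplier <- function(cdf, sd, lower, upper, n) {
  integrand <- function(x) 1 - cdf(x)^n - (1 - cdf(x))^n
  integrate(integrand, lower, upper, rel.tol = 1e-8)$value / sd
}

n <- 6
print(c(uniform_closed      = uniform_multiplier(n),
        uniform_numeric     = numeric_multiplier(punif, sqrt(1 / 12), 0, 1, n),
        exponential_closed  = exponential_multiplier(n),
        exponential_numeric = numeric_multiplier(pexp, 1, 0, Inf, n)))
```

Here `-digamma(1)` supplies Euler's constant, so the exponential multiplier reduces to the harmonic sum $1 + \frac12 + \dots + \frac{1}{n-1}$.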

Although they differ (because these distributions display a wide range of shapes), the three roughly agree around $n=6$ , showing that the multiplier $2.5$ does not depend heavily on the shape and therefore can serve as an omnibus, robust assessment of the standard deviation when ranges of small subsamples are known. (Indeed, the very heavy-tailed Student $t$ distribution with three degrees of freedom still has a multiplier around $2.3$ for $n=6$ , not far at all from $2.5$ .)
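The heavy-tailed case can be explored the same way. The sketch below is only an illustration under the same single-integral identity for the expected range; `t_multiplier` is a hypothetical name, and it requires more than two degrees of freedom so that the standard deviation exists.

```r
# Multiplier E[range] / sd for subsamples of size n from Student's t
# with df degrees of freedom (df > 2, so the standard deviation is finite).
t_multiplier <- function(n, df) {
  integrand <- function(x) {
    1 - pt(x, df)^n - pt(x, df, lower.tail = FALSE)^n
  }
  expected_range <- integrate(integrand, lower = -Inf, upper = Inf)$value
  expected_range / sqrt(df / (df - 2))  # sd of t_df is sqrt(df / (df - 2))
}

# The text reports a multiplier around 2.3 for three degrees of freedom, n = 6
print(t_multiplier(6, 3))
```

As the degrees of freedom grow, the result should approach the Normal multiplier of about $2.534$.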

— whuber

Wonderful exposition! You may be interested to know that this appears to have been investigated back in the 1920s. See Tippett 1925. In Tippett's tables (Table X) the expected value for the range given a sample of size 6 is $2.53441\sigma$. He shows the derivation of the complete distribution of the range for the normal distribution. This was used by David et al. (1954) to calculate probability points of the range distribution for a test for normality (see D'Agostino & Stephens 9.3.3.4.2).

— Avraham

@Avraham Thank you for the illuminating comments. What struck me when I added the graphics is that the really clever part of this whole approach is the use of subsamples of size six because that's where the multipliers all tend to be about the same regardless of distributional shape.

— whuber

Thanks! Tippett's tables actually give the appropriate multiplier for all numbers between 2 and 1000. He does mention running into calculation issues; of course, this was back in 1925, a good 20 years before ENIAC.

— Avraham

@whuber can you show how the number (2.534) was calculated?

— Andy

I edited the answer to include explanations of the calculations.

— whuber

That approximation is very close to the true sample standard deviation. I wrote a quick R script to illustrate it:

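# Main sample: 6000 draws from the integers 1..10000 (approximately uniform)
x = sample(1:10000, 6000, replace=TRUE)

# Range of each of B random subsamples of size 6 (drawn without replacement)
B = 100000
R = rep(NA, B)
for(i in 1:B){
    samp = sample(x, 6)
    R[i] = max(samp) - min(samp)
}

# Range-based estimate of the standard deviation
mean(R)/2.534

# Usual sample standard deviation, for comparison
sd(x)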

which yields:

> mean(R)/2.534
[1] 2819.238
> 
> sd(x)
[1] 2880.924

I am not (yet) sure why this works, but at face value the approximation looks like a decent one.

Edit: See @Whuber's exceptional comment (above) on why this works

You are drawing subsamples of size $6$ from an approximately uniform distribution. For a truly uniform distribution the ratio is $10\sqrt{3}/7\approx 2.474$. Indeed, if you were to use that factor in your simulation you would obtain mean(R)/2.474 equal to $2887.6$, very close to sd(x).

— whuber

Very true!

> mean(R)/2.474
[1] 2887.611