How to perform a two-sample t-test in R by inputting sample statistics rather than the raw data?

32

Let's say we have the statistics given below:

gender     mean        sd  n
f      1.666667 0.5773503  3
m      4.500000 0.5773503  4

How do you perform a two-sample t-test (to see whether there is a significant difference between the means of males and females in some variable) using statistics like these rather than the actual data?

I couldn't find anywhere on the internet how to do this. Most tutorials, and even the manual, deal with the test using the actual data set only.

r t-test

— Alby

2

This Wikipedia article plus the help page for R's t-distribution functions (got by ?pt) -- see especially pt() -- do have all the info you'd need to do this yourself. And you'll learn a lot about stats and R if you do that.

— Josh O'Brien

2

There are already good answers here, and it is indeed both very easy (and good practice) to write a function for this yourself; however, I'll just add that you might look at the tsum.test function in the BSDA package, which implements a t-test (two-sample, Welch or equal-variance, and also one-sample) from summary data you supply. It basically works like t.test in vanilla R, but on the summary information.
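As a minimal sketch of that suggestion, the call below passes the summary statistics from the question to tsum.test. It assumes the BSDA package is installed (it is not part of base R), so the call is guarded and is skipped when the package is missing.

```r
# Assumption: BSDA is installed, e.g. via install.packages("BSDA")
if (requireNamespace("BSDA", quietly = TRUE)) {
  # Welch two-sample t-test from summary statistics (var.equal = FALSE)
  res <- BSDA::tsum.test(mean.x = 1.666667, s.x = 0.5773503, n.x = 3,
                         mean.y = 4.500000, s.y = 0.5773503, n.y = 4,
                         var.equal = FALSE)
  print(res)
}
```

Setting var.equal = TRUE would request the pooled-variance version instead.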

— Glen_b -Reinstate Monica

1

To be honest, when I was learning to program my teacher always said, "don't re-invent the wheel". Therefore, the most logical function would be tsum.test() from the BSDA library, as stated by @Nick Cox. It does exactly the same thing as what @Macro wrote in lines of code. If the question had asked about the background calculation behind the t-test statistic in R, then Macro's answer would be more appropriate. Please note, I am not trying to offend anyone, just stating my personal opinion related to my professional background. And @Macro, that is some neat coding :)

— tcratius

37

You can write your own function based on what we know about the mechanics of the two-sample $t$-test. For example, this will do the job:

# m1, m2: the sample means
# s1, s2: the sample standard deviations
# n1, n2: the sample sizes
# m0: the null value for the difference in means to be tested for. Default is 0. 
# equal.variance: whether or not to assume equal variance. Default is FALSE. 
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
    if( equal.variance==FALSE ) 
    {
        se <- sqrt( (s1^2/n1) + (s2^2/n2) )
        # welch-satterthwaite df
        df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
    } else
    {
        # pooled standard deviation, scaled by the sample sizes
        se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) ) 
        df <- n1+n2-2
    }      
    t <- (m1-m2-m0)/se 
    dat <- c(m1-m2, se, t, 2*pt(-abs(t),df))    
    names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
    return(dat) 
}
x1 = rnorm(100)
x2 = rnorm(200) 
# you'll find this output agrees with that of t.test when you input x1,x2
t.test2( mean(x1), mean(x2), sd(x1), sd(x2), 100, 200)
Difference of means       Std Error               t         p-value 
        -0.05692268      0.12192273     -0.46687500      0.64113442

— Macro

1

My edit comparing to t.test got rejected, so here's some code to confirm:

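(tt2 <- t.test2(mean(x1), mean(x2), sd(x1), sd(x2), length(x1), length(x2)))
(tt <- t.test(x1, x2))
# compare with a tolerance rather than ==, which can fail on floating-point values
all.equal(unname(tt$statistic), tt2[["t"]])
all.equal(tt$p.value, tt2[["p-value"]])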

— Max Ghenis

20

You just calculate it by hand:

$$t = \frac{(\text{mean}_f - \text{mean}_m) - \text{expected difference}}{SE}$$

$$SE = \sqrt{\frac{sd_f^2}{n_f} + \frac{sd_m^2}{n_m}}$$

where $df = n_m + n_f - 2$.

The expected difference is probably zero.

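If you want the p-value, use the pt() function. For a two-sided test, double the lower-tail probability at -|t|:

2 * pt(-abs(t), df)

Thus, putting the code together (note that the standard deviations are squared inside the square root, as in the formula for SE):

> t = ((1.666667 - 4.500000) - 0) / sqrt(0.5773503^2/3 + 0.5773503^2/4)
> p = 2 * pt(-abs(t), 3 + 4 - 2)
> p

This gives t of about -6.43 and a two-sided p-value of roughly 0.0014.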

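This assumes equal variances, which is reasonable here because both groups have the same standard deviation; with identical standard deviations the pooled and unpooled standard errors coincide.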

— gung - Reinstate Monica

A couple things: How is this "in R"? What is the distribution of the test statistic (i.e. how do you go from this to $p$-values)?

— Macro

The degrees of freedom given in this case are incorrect! You use the unpooled variance, which assumes unequal variances. Thus, the degrees of freedom are more accurately given by the Satterthwaite approximation.
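To make the degrees-of-freedom point concrete, here is a minimal sketch that computes the Welch-Satterthwaite degrees of freedom for the summary statistics in the question. The variable names are illustrative and do not come from the answer.

```r
# Summary statistics from the question
m_f <- 1.666667; s_f <- 0.5773503; n_f <- 3
m_m <- 4.500000; s_m <- 0.5773503; n_m <- 4

v_f <- s_f^2 / n_f   # variance of the mean, group f
v_m <- s_m^2 / n_m   # variance of the mean, group m

se     <- sqrt(v_f + v_m)       # unpooled standard error
t_stat <- (m_f - m_m) / se

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))
p_welch  <- 2 * pt(-abs(t_stat), df_welch)   # two-sided p-value

c(t = t_stat, df = df_welch, p.value = p_welch)
```

For these numbers the approximation gives about 4.45 degrees of freedom (49/11) rather than 5, so the Welch p-value is somewhat larger than the one based on n_m + n_f - 2.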

— lzstat

7

You can do the calculations based on the formula in the book (on the web page), or you can generate random data that has the properties stated (see the mvrnorm function in the MASS package) and use the regular t.test function on the simulated data.

— Greg Snow

When you say "you can generate random data that has the properties stated", do you mean simulating data with population mean and standard deviation equal to the sample values or simulating under the constraint that the sample mean and standard deviation are equal to a pre-specified value?

— Macro

2

You want the simulated data to have the exact same mean(s) and var(s) as stated in the problem. One way to do this (there are many others) is to use the mvrnorm function in the MASS package (you need to set the empirical argument to TRUE).
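A minimal sketch of this approach, using the summary statistics from the question. The helper sim_exact is a hypothetical name introduced here for illustration, and the MASS package is assumed to be available (it normally ships with R).

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # the raw values depend on the seed; the summary statistics do not

# Hypothetical helper: draw n values whose sample mean and sd equal m and s exactly
sim_exact <- function(n, m, s) {
  as.numeric(mvrnorm(n, mu = m, Sigma = matrix(s^2), empirical = TRUE))
}

f <- sim_exact(3, 1.666667, 0.5773503)
m <- sim_exact(4, 4.500000, 0.5773503)

c(mean(f), sd(f), mean(m), sd(m))  # reproduces the stated summary statistics

res <- t.test(f, m)  # Welch test by default; use var.equal = TRUE for pooled
res
```

Because the t-test depends on the data only through the means, standard deviations and sample sizes, the result is the same whichever simulated values are drawn.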

— Greg Snow

2

The question asks about R, but the issue can arise with any other statistical software. Stata for example has various so-called immediate commands, which allow calculations from summary statistics alone. See http://www.stata.com/manuals13/rttest.pdf for the particular case of the ttesti command, which applies here.

— Nick Cox