What is the cost function in cv.glm in R's boot package?


I am doing cross-validation using the leave-one-out method. I have a binary response and am using the boot package for R and its cv.glm function. My problem is that I don't fully understand the "cost" argument of this function. From what I can understand, this is the function that decides whether an estimated value should be classified as 1 or 0, i.e. the threshold value for classification. Is this correct?

Also, in the R help they use this function for a binomial model: cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5). How do I interpret this function, so that I can modify it correctly for my analysis?

Any help is appreciated; I don't want to use a function that I don't understand.

r cross-validation

— mael


r is a vector that contains the actual outcomes, and pi is a vector that contains the fitted values (the predicted probabilities).

cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

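This computes $cost = \frac{1}{n}\sum_i \mathbf{1}\left(|r_i - pi_i| > 0.5\right)$, i.e. the proportion of observations whose fitted value is more than 0.5 away from the actual outcome (the misclassification rate at a 0.5 threshold). You can define your own cost functions. In your case, for binary classification, you can do something like this: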

mycost <- function(r, pi) {
    weight1 <- 1  # cost for getting a 1 wrong
    weight0 <- 1  # cost for getting a 0 wrong
    c1 <- (r == 1) & (pi < 0.5)   # logical vector: TRUE if actual 1 but predicted 0
    c0 <- (r == 0) & (pi >= 0.5)  # logical vector: TRUE if actual 0 but predicted 1
    return(mean(weight1 * c1 + weight0 * c0))
}

and pass mycost as the cost argument of the cv.glm function.
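As a minimal sketch of how such a cost function might be passed to cv.glm, assuming the boot package is installed: the simulated data frame `d`, the function name `weighted_cost`, and the weights 5 and 1 are all hypothetical and only there for illustration.

```r
library(boot)  # provides cv.glm

set.seed(1)
# Simulated data (hypothetical): binary response y, one predictor x
n <- 60
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * d$x))

fit <- glm(y ~ x, family = binomial, data = d)

# Cost with unequal penalties (hypothetical weights):
# missing a true 1 costs 5, a false alarm on a true 0 costs 1
weighted_cost <- function(r, pi) {
  c1 <- (r == 1) & (pi < 0.5)   # actual 1, predicted 0
  c0 <- (r == 0) & (pi >= 0.5)  # actual 0, predicted 1
  mean(5 * c1 + 1 * c0)
}

# Leave-one-out cross-validation: K defaults to the number of rows
cv_weighted <- cv.glm(d, fit, cost = weighted_cost)
cv_weighted$delta[1]  # raw cross-validated average cost
```

With equal weights of 1 this reduces to the ordinary misclassification rate; unequal weights only change how much each kind of mistake contributes, not the 0.5 decision threshold itself.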

— Feng Mai

Shouldn't $cost$ be something like $\sum \Bigl\lfloor \frac{|r_i-p_i|}{0.5}\Bigr\rfloor$ (with the exception that when $|r_i-p_i|=1$, the term is $1$, not $2$)?

— Mooncrater

@feng-mai Should it be pi==0 or pi < 0.5 (and pi==1 or pi > 0.5) if using 0.5 as the decision boundary? Aren't the pi the predicted probabilities?

— PM.


@PM Yes, you are right. $pi$ are the responses from the glm model. Thanks for the correction.

— Feng Mai


cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

First, the cut-off is fixed at 0.5. Your r is 0/1, but pi is a probability, so the cost for an individual observation is 1 if the absolute error is greater than 0.5 and 0 otherwise. The function then averages these individual costs, which gives the error rate. But remember, the cut-off has been set before you define your cost function.

Actually, I think it makes more sense if the choice of cut-off is determined by the cost function.
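To make the role of the cut-off explicit, here is a small sketch in which the threshold is an argument of a hypothetical factory function `make_cost`. With `cutoff = 0.5` it should agree with the help-page function whenever no fitted probability is exactly 0.5. The vectors `r` and `pi` are made up for illustration.

```r
# Help-page cost function: share of observations with |r - pi| > 0.5
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

# Hypothetical factory: returns a misclassification-rate cost for a given cutoff
make_cost <- function(cutoff = 0.5) {
  function(r, pi) {
    predicted <- as.numeric(pi > cutoff)  # classify as 1 above the cutoff
    mean(predicted != r)                  # fraction misclassified
  }
}

r  <- c(1, 1, 0, 0, 1)            # made-up observed outcomes
pi <- c(0.9, 0.4, 0.2, 0.7, 0.6)  # made-up fitted probabilities

cost(r, pi)            # 0.4: observations 2 and 4 are misclassified
make_cost(0.5)(r, pi)  # 0.4: same rate with the cutoff written out
make_cost(0.3)(r, pi)  # 0.2: at a lower cutoff only observation 4 is an error
```

Any function returned by `make_cost` takes the observed and predicted values as its two arguments, so it can be supplied as the cost argument of cv.glm in the same way.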

— SLi


The answer by @SLi already explains very well what the cost function you have defined does. I would add that the cost function is used to calculate the delta value returned by cv.glm, which measures the cross-validation error. Critically, delta is built from the weighted average of the per-fold errors given by the cost function, with each fold weighted by its share of the observations. We can see this by inspecting the relevant part of the code:

for (i in seq_len(ms)) {
    j.out <- seq_len(n)[(s == i)]
    j.in <- seq_len(n)[(s != i)]
    Call$data <- data[j.in, , drop = FALSE]
    d.glm <- eval.parent(Call)
    p.alpha <- n.s[i]/n # weight for this fold: its share of the n observations
    cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out, 
        , drop = FALSE], type = "response"))
    CV <- CV + p.alpha * cost.i # add this fold's weighted cost to the running total
    cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm, 
        data, type = "response"))
}

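and the value returned by the function is the following list, in which the first element of delta is the raw weighted average CV and the second is the adjusted estimate CV + cost.0: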

  list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)), 
    seed = seed)
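
For leave-one-out, each fold holds a single observation, so every weight is 1/n and the first element of delta is simply the mean of the per-observation costs. A minimal sketch that checks this by hand, assuming the boot package is installed; the simulated data frame `d` and the manual loop are illustrative and not taken from the package source.

```r
library(boot)  # provides cv.glm

cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

set.seed(42)
# Simulated data (hypothetical): binary response y, one predictor x
n <- 40
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(d$x))

fit <- glm(y ~ x, family = binomial, data = d)

# Leave-one-out via cv.glm (K defaults to n)
delta_cv <- cv.glm(d, fit, cost = cost)$delta[1]

# The same quantity by hand: refit without row i, then score row i
loo_cost <- vapply(seq_len(n), function(i) {
  fit_i <- glm(y ~ x, family = binomial, data = d[-i, , drop = FALSE])
  p_i <- predict(fit_i, newdata = d[i, , drop = FALSE], type = "response")
  cost(d$y[i], p_i)
}, numeric(1))

all.equal(delta_cv, mean(loo_cost))  # expected to be TRUE
```

With K smaller than n the folds can differ in size, which is where the p.alpha weights in the loop start to matter.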

— Alex