How is gradient boosting like gradient descent?

I am reading the useful Wikipedia entry on gradient boosting ( https://en.wikipedia.org/wiki/Gradient_boosting ) and trying to understand how and why we can approximate the residuals by the steepest descent step (also called the pseudo-gradient). Can anyone give me some intuition for how steepest descent is linked to, or similar to, the residuals? Help much appreciated!

self-study gradient-descent

— Đồ trang sức

Suppose we are in the following situation. We have some data $\{ x_i, y_i \}$, where each $x_i$ can be a number or a vector, and we would like to determine a function $f$ that approximates the relationship $f(x_i) \approx y_i$, in the sense that the least squares error

$\frac{1}{2} \sum_i (y_i - f(x_i))^2$

is small.

Now the question arises of what we would like the domain of $f$ to be. A degenerate choice for the domain is just the points in our training data. In this case, we may simply define $f(x_i) = y_i$, which covers the entire desired domain, and be done with it. A roundabout way to arrive at this answer is to do gradient descent with this discrete space as the domain. This takes a bit of a change in point of view. Let's view the loss as a function of the true value $y$ and the prediction $f$ (for the moment, $f$ is not a function, just the value of the prediction)

$L(f; y) = \frac{1}{2} (y - f)^2$

and then take the gradient with respect to the prediction

$\nabla_f L(f; y) = f - y$

Then the gradient update (with step size 1), starting from an initial value of $y_0$, is

$y_1 = y_0 - \nabla_f L(y_0; y) = y_0 - (y_0 - y) = y$

So we recover our perfect prediction in a gradient step with this setup, which is nice!
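As a quick numerical check of that single step, here is a minimal sketch. The target values and the zero starting point are made up for illustration; only the loss and its gradient come from the derivation above.

```python
import numpy as np

def squared_loss_gradient(pred, y):
    """Gradient of L(f; y) = 0.5 * (y - f)**2 with respect to the prediction f."""
    return pred - y

# Made-up training targets and an arbitrary starting prediction (assumptions).
y = np.array([3.0, -1.0, 0.5])
y0 = np.zeros_like(y)

# One gradient step with step size 1, taken independently at each training point.
y1 = y0 - squared_loss_gradient(y0, y)
```

After the step, `y1` coincides with `y` at every training point, whatever `y0` was.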

The flaw here is, of course, that we want $f$ to be defined at much more than just our training data points. To do this, we must make a few concessions, for we are not able to evaluate the loss function, or its gradient, at any points other than our training data set.

The big idea is to weakly approximate $\nabla L$.

Start with an initial guess at $f$, almost always a simple constant function $f(x) = f_0$, which is defined everywhere. Now generate a new working dataset by evaluating the gradient of the loss function at the training data, using the initial guess for $f$:

$W = \{ x_i, f_0 - y_i \}$

Now approximate $\nabla L$ by fitting a weak learner to $W$. Say we get the approximation $F \approx \nabla L$. We have gained an extension of the data $W$ across the entire domain in the form of $F(x)$, though we have lost precision at the training points, since we fit a small learner.

Finally, use $F$ in place of $\nabla L$ in the gradient update of $f_0$ over the entire domain:

$f_1(x) = f_0(x) - F(x)$

We get out $f_1$, a new approximation of $f$, a bit better than $f_0$. Start over with $f_1$ (so the next working dataset is $\{ x_i, f_1(x_i) - y_i \}$), and iterate until satisfied.
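The loop just described can be sketched in a few lines. This is only an illustration under stated assumptions: the weak learner is a hand-written one-split regression stump (`fit_stump`, a hypothetical helper), the constant $f_0$ is taken to be the mean of the targets, and the sine-plus-noise data is synthetic. None of those choices come from the text above.

```python
import numpy as np

def fit_stump(x, target):
    """Fit a one-split regression stump to (x, target) by least squares.

    Returns a function defined for any x: the weak approximation F."""
    order = np.argsort(x)
    xs, ts = x[order], target[order]
    # Fallback: no usable split, predict the overall mean everywhere.
    best = (np.inf, xs[0], ts.mean(), ts.mean())
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical x values
        left, right = ts[:i], ts[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, 0.5 * (xs[i] + xs[i - 1]), left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(np.asarray(q) <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50):
    """Least-squares boosting: repeatedly fit a stump to the loss gradient."""
    f0 = y.mean()                      # assumed choice of the constant initial guess
    learners = []
    pred = np.full_like(y, f0, dtype=float)
    for _ in range(n_rounds):
        gradient = pred - y            # working dataset W = {x_i, f(x_i) - y_i}
        F = fit_stump(x, gradient)     # F approximates the gradient everywhere
        learners.append(F)
        pred = pred - F(x)             # f_{m+1}(x) = f_m(x) - F(x)

    def f(q):
        q = np.asarray(q, dtype=float)
        out = np.full(q.shape, f0)
        for F in learners:
            out = out - F(q)
        return out

    return f

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

f = boost(x, y)
mse_constant = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - f(x)) ** 2)
```

The returned `f` can be evaluated at points that are not in the training set, which is exactly what the degenerate pointwise solution could not do.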

Hopefully, you see that what is really important is approximating the gradient of the loss. In the case of least squares minimization this takes the form of raw residuals, but in more sophisticated cases it does not. The machinery still applies though. As long as one can construct an algorithm for computing the loss and gradient of loss at the training data, we can use this algorithm to approximate a function minimizing that loss.
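To make the contrast concrete, here is a small sketch that evaluates the loss gradient at the training points for two losses. The absolute-error loss is my own choice of a second example, not something discussed above, and the numbers are made up.

```python
import numpy as np

def grad_squared(pred, y):
    # L = 0.5 * (y - pred)**2  ->  dL/dpred = pred - y, the raw residual up to sign
    return pred - y

def grad_absolute(pred, y):
    # L = |y - pred|  ->  dL/dpred = sign(pred - y) (a subgradient where pred == y)
    return np.sign(pred - y)

# Made-up targets and a constant current prediction (assumptions).
y = np.array([1.0, 2.0, 10.0])
pred = np.full_like(y, 3.0)

g_squared = grad_squared(pred, y)     # values 2, 1, -7
g_absolute = grad_absolute(pred, y)   # values 1, 1, -1
```

For the squared loss the weak learner is fit to the residuals themselves; for the absolute loss it is fit only to their signs, so the outlier at 10 has no more pull than any other point.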

— Matthew Drury

Yah, I think that's good. The only thing to note is that if you, for example, want to boost to minimize the binomial loss

$-\sum_i \left( y_i \log (p_i) + (1 - y_i) \log(1 - p_i) \right)$

then the gradient we expand is no longer related to the residuals in a natural way.

— Matthew Drury

Thanks Matthew. One thing that I am trying to get my head around: in the literature it is often stated that the model update is F(m+1) = F(m) + $\alpha_m*h(m)$, where h(m) is the weak learner. If I am thinking of a tree-based model, does it mean that for both regression and classification we actually update our prediction for a given data point by simple addition of the outcomes of the two models? Does that also work if we are trying to do binary classification? Or should the + sign not be interpreted so literally?

— Wouter

The plus sign is quite literal. But for tree-based weak learners, the model predictions should be interpreted as the weighted average in the leaf, even in the case where the tree is fit to binomial data. Note, though, that in boosting we are usually not fitting to binomial data; we are fitting to the gradient of the likelihood evaluated at the prior stage's predictions, which will not be $0,1$ valued.

— Matthew Drury

@MatthewDrury I think in much of the literature we do not directly update $f_1$ with $f_0-F(x)$, but with $f_0-\alpha*F(x)$, where $\alpha$, between 0 and 1, is a learning rate.

— Haitao Du

@hxd1011 Yes, that's absolutely correct, and crucial for using gradient boosting successfully.

— Matthew Drury
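The effect of the learning rate from the last two comments is easiest to see in the degenerate setting of the answer, where the gradient at each training point is known exactly. A minimal sketch, with made-up targets and an arbitrary choice of `alpha` and `n_steps`:

```python
import numpy as np

# Made-up targets and a zero starting prediction (assumptions).
y = np.array([3.0, -1.0, 0.5])
pred = np.zeros_like(y)

alpha = 0.1      # learning rate, chosen arbitrarily for illustration
n_steps = 20

for _ in range(n_steps):
    # Shrunken gradient step for the squared loss: f <- f - alpha * (f - y)
    pred = pred - alpha * (pred - y)

# Each step scales the remaining error by (1 - alpha).
remaining_error = pred - y
expected_error = (1 - alpha) ** n_steps * (0.0 - y)
```

With `alpha = 1` this reduces to the single exact step from the answer; with a smaller `alpha` the error decays geometrically, so more rounds are needed to reach the same fit.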