Matrix notation for logistic regression

In linear regression (squared loss), using matrices gives us a very concise notation for the objective:

$\text{minimize}~~ \|Ax-b\|^2$

where $A$ is the data matrix, $x$ is the vector of coefficients, and $b$ is the response.

Is there a similar matrix notation for the logistic regression objective? All the notations I have seen cannot get rid of the sum over all data points (something like $\sum_{\text{data}} \text{L}_\text{logistic}(y,\beta^Tx)$).

EDIT: thanks to joceratops and AdamO for their great answers. Their answers helped me realize that another reason linear regression has a more concise notation is the definition of the norm, which encapsulates the square and the sum, i.e. $e^\top e$. For the logistic loss there is no such definition, which makes the notation a little more complicated.

— Haitao Du

Answers:

In linear regression, the maximum likelihood estimation (MLE) solution for estimating $x$ has the following closed form (assuming that $A$ is a matrix with full column rank):

$\hat{x}_\text{lin}=\underset{x}{\text{argmin}} \|Ax-b\|_2^2 = (A^TA)^{-1}A^Tb$

This is read as "find the $x$ that minimizes the objective function, $\|Ax-b\|_2^2$". The nice thing about representing the linear regression objective function this way is that we can keep everything in matrix notation and solve for $\hat{x}_\text{lin}$ by hand. As Alex R. mentions, in practice we often don't compute $(A^TA)^{-1}$ directly, because it is computationally inefficient and $A$ often does not meet the full-rank criterion. Instead, we turn to the Moore-Penrose pseudoinverse. The details of computing the pseudoinverse can involve the Cholesky decomposition or the singular value decomposition.
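As a rough illustration of those options, here is a minimal NumPy sketch. The data are synthetic and the names (`x_true`, `x_normal`, `x_chol`, `x_pinv`, `x_lstsq`) are made up for this example; it only shows that the routes agree on a well-conditioned, full-column-rank problem.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))            # data matrix, one sample per row
x_true = np.array([1.0, -2.0, 0.5])     # made-up coefficients for the demo
b = A @ x_true + 0.1 * rng.normal(size=50)

# Normal equations: solve (A^T A) x = A^T b without forming the inverse.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Same system via a Cholesky factorization A^T A = L L^T.
# (A dedicated triangular solver would normally be used for these two steps;
# np.linalg.solve keeps the sketch dependency-free.)
L = np.linalg.cholesky(A.T @ A)
z = np.linalg.solve(L, A.T @ b)
x_chol = np.linalg.solve(L.T, z)

# Moore-Penrose pseudoinverse (NumPy computes it from the SVD).
x_pinv = np.linalg.pinv(A) @ b

# Least-squares solver; also returns a solution when A is rank deficient.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

On a rank-deficient $A$ the first two routes would fail or become unreliable, while the pseudoinverse and `lstsq` still return the minimum-norm least-squares solution.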

Alternatively, the MLE solution for estimating the coefficients in logistic regression is:

$\hat{x}_\text{log} = \underset{x}{\text{argmin}} \sum_{i=1}^{N} y^{(i)}\log(1+e^{-x^Ta^{(i)}}) + (1-y^{(i)})\log(1+e^{x^T a^{(i)}})$

where (assuming each data sample is stored as a row):

$x$ is a vector of regression coefficients

$a^{(i)}$ is a vector representing the $i^{th}$ sample / row of the data matrix $A$

$y^{(i)}$ is a scalar in $\{0, 1\}$: the $i^{th}$ label, corresponding to the $i^{th}$ sample

$N$ is the number of data samples / number of rows of the data matrix $A$.

Again, this is read as "find the $x$ that minimizes the objective function".
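To make the objective concrete, here is a small NumPy sketch that evaluates it twice: once as the literal sum over samples, and once with the sum absorbed into inner products with $y$ and $1-y$. The function names and the random data are assumptions made for illustration only.

```python
import numpy as np

def logistic_objective_loop(x, A, y):
    """Objective written as an explicit sum over the rows a_i of A."""
    total = 0.0
    for a_i, y_i in zip(A, y):
        z = x @ a_i
        total += y_i * np.log1p(np.exp(-z)) + (1 - y_i) * np.log1p(np.exp(z))
    return total

def logistic_objective_vec(x, A, y):
    """Same objective; the sum becomes two inner products."""
    z = A @ x
    # np.logaddexp(0, t) evaluates log(1 + e^t) in a numerically stable way.
    return y @ np.logaddexp(0, -z) + (1 - y) @ np.logaddexp(0, z)

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))                    # one sample per row
y = rng.integers(0, 2, size=20).astype(float)   # labels in {0, 1}
x = rng.normal(size=3)

value_loop = logistic_objective_loop(x, A, y)
value_vec = logistic_objective_vec(x, A, y)
```

The vectorized version corresponds to writing the objective as $y^T \log(1+e^{-Ax}) + (1-y)^T \log(1+e^{Ax})$, with $\log$ and $\exp$ applied elementwise.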

If you wanted to, you could take it a step further and represent $\hat{x}_\text{log}$ in matrix notation as follows:

$\hat{x}_\text{log} = \underset{x}{\text{argmin}} ~\text{tr}\left( \begin{bmatrix} y^{(1)} & (1-y^{(1)}) \\ \vdots & \vdots \\ y^{(N)} & (1-y^{(N)})\\\end{bmatrix} \begin{bmatrix} \log(1+e^{-x^Ta^{(1)}}) & ... & \log(1+e^{-x^Ta^{(N)}}) \\\log(1+e^{x^Ta^{(1)}}) & ... & \log(1+e^{x^Ta^{(N)}}) \end{bmatrix} \right)$

The trace is needed because the product is an $N \times N$ matrix; its diagonal entries are exactly the terms of the sum above.

Here $\hat{x}_\text{log}$ is approximated, and the approximation is expressed in matrix notation (see the link provided by Alex R.).
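As a quick numerical check of a matrix form like this, the sketch below (NumPy, synthetic data, made-up variable names) stacks $y^{(i)}$ and $1-y^{(i)}$ as the two columns of the left matrix, which is an assumption of the sketch, and compares the trace of the product with the plain sum over samples.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 6, 3
A = rng.normal(size=(N, d))                     # one sample per row
y = rng.integers(0, 2, size=N).astype(float)    # labels in {0, 1}
x = rng.normal(size=d)
z = A @ x

# Left matrix: N x 2, columns y and 1 - y.
M = np.column_stack([y, 1 - y])
# Right matrix: 2 x N, rows log(1 + e^{-z}) and log(1 + e^{z}).
Lmat = np.vstack([np.logaddexp(0, -z), np.logaddexp(0, z)])

# The N x N product carries the per-sample terms on its diagonal.
objective_trace = np.trace(M @ Lmat)

# Reference value: the plain sum over samples.
objective_sum = np.sum(y * np.logaddexp(0, -z) + (1 - y) * np.logaddexp(0, z))

# Same number without building the N x N product.
objective_elementwise = np.sum(M * Lmat.T)
```

Forming the full $N \times N$ product costs far more than the sum it encodes, so the elementwise version is the one you would actually compute.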

— joceratops

Great. Thanks. I think the fact that we have nothing like solving $A^\top A x=A^\top b$ is the reason we do not take that extra step to put it in matrix notation and avoid the sum symbol.

— Haitao Du

There is some advantage to taking that one step further: turning it into matrix multiplication makes the code simpler, and on many platforms, such as MATLAB, a for loop that sums over all the data is much slower than matrix operations.

— Haitao Du

@hxd1011: Just a small comment: reducing to matrix equations is not always wise. In the case of $A^TAx=A^Tb$, you shouldn't actually try to compute the matrix inverse of $A^TA$, but rather do something like a Cholesky decomposition, which will be much faster and more numerically stable. For logistic regression, there are a bunch of different iteration schemes which do indeed use matrix computations. For a great review see here: research.microsoft.com/en-us/um/people/minka/papers/logreg/…

— Alex R.

@AlexR. thank you very much. I learned that using the normal equations squares the condition number of the matrix, and that QR or Cholesky would be much better. Your link is great; a review like this, with numerical methods, is what I have always wanted.

— Haitao Du

@joceratops's answer focuses on the optimization problem of maximum likelihood estimation. This is indeed a flexible approach that is amenable to many types of problems. For estimating most models, including linear and logistic regression models, there is another general approach, based on method of moments estimation.

The linear regression estimator can also be formulated as the root of the estimating equation:

$0 = \mathbf{X}^T(Y - \mathbf{X}\beta)$

In this regard $\beta$ is seen as the value which yields an average residual of 0. It needn't rely on any underlying probability model to have this interpretation. It is, however, interesting to derive the score equations for a normal likelihood: you will see that they take exactly the form displayed above. Maximizing the likelihood of a regular exponential family for a linear model (e.g. linear or logistic regression) is equivalent to obtaining solutions to its score equations.
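As a sketch of that derivation for the normal case, assume independent errors with a common variance $\sigma^2$ (the symbol $\sigma^2$ is introduced here only for illustration). The log-likelihood is

$\log \mathcal{L}(\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - \mathbf{X}\beta)^T(Y - \mathbf{X}\beta)$

and differentiating with respect to $\beta$ gives

$\frac{\partial}{\partial \beta} \log \mathcal{L}(\beta, \sigma^2) = \frac{1}{\sigma^2}\mathbf{X}^T(Y - \mathbf{X}\beta)$

Setting this to zero and multiplying through by $\sigma^2$ leaves $0 = \mathbf{X}^T(Y - \mathbf{X}\beta)$, so the root does not depend on $\sigma^2$.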

$0 = \sum_{i=1}^n S_i(\alpha, \beta) = \frac{\partial}{\partial \beta} \log \mathcal{L}( \beta, \alpha, X, Y) = \mathbf{X}^T (Y - g(\mathbf{X}\beta))$

where $Y_i$ has expected value $g(\mathbf{X}_i \beta)$. In GLM estimation, $g$ is said to be the inverse of a link function. In normal likelihood equations, $g^{-1}$ is the identity function, and in logistic regression $g^{-1}$ is the logit function. A more general approach would be to require $0 = \sum_{i=1}^n \left(Y_i - g(\mathbf{X}_i\beta)\right)$, which allows for model misspecification.
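For logistic regression this can be checked numerically. The sketch below (NumPy, synthetic data; `inv_logit`, `log_lik` and `score` are names made up for this example) compares $\mathbf{X}^T (Y - g(\mathbf{X}\beta))$, with $g$ the inverse logit, against a finite-difference gradient of the Bernoulli log-likelihood.

```python
import numpy as np

def inv_logit(t):
    """g: maps the linear predictor to the mean."""
    return 1.0 / (1.0 + np.exp(-t))

def log_lik(beta, X, Y):
    """Bernoulli log-likelihood: sum_i [Y_i * eta_i - log(1 + e^{eta_i})]."""
    eta = X @ beta
    return Y @ eta - np.sum(np.logaddexp(0, eta))

def score(beta, X, Y):
    """Score in matrix form: X^T (Y - g(X beta))."""
    return X.T @ (Y - inv_logit(X @ beta))

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
Y = rng.integers(0, 2, size=30).astype(float)
beta = rng.normal(size=3)

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric_gradient = np.array([
    (log_lik(beta + eps * e, X, Y) - log_lik(beta - eps * e, X, Y)) / (2 * eps)
    for e in np.eye(3)
])
analytic_gradient = score(beta, X, Y)
```

The two gradients agree up to finite-difference error at any $\beta$, not just at the root.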

Additionally, it is interesting to note that for regular exponential families, $\frac{\partial g(\mathbf{X}\beta)}{\partial \beta} = \mathbf{V}(g(\mathbf{X}\beta))$ which is called a mean-variance relationship. Indeed for logistic regression, the mean variance relationship is such that the mean $p = g(\mathbf{X}\beta)$ is related to the variance by $\mbox{var}(Y_i) = p_i(1-p_i)$ . This suggests an interpretation of a model misspecified GLM as being one which gives a 0 average Pearson residual. This further suggests a generalization to allow non-proportional functional mean derivatives and mean-variance relationships.

A generalized estimating equation approach would specify linear models in the following way:

$0 = \frac{\partial g(\mathbf{X}\beta)}{\partial \beta} \mathbf{V}^{-1}\left(Y - g(\mathbf{X}\beta)\right)$

With $\mathbf{V}$ a matrix of variances based on the fitted value (mean) given by $g(\mathbf{X}\beta)$ . This approach to estimation allows one to pick a link function and mean variance relationship as with GLMs.

In logistic regression $g$ would be the inverse logit, and $V_{ii}$ would be given by $g(\mathbf{X}_i \beta)(1-g(\mathbf{X}_i\beta))$. The solutions to this estimating equation, obtained by Newton-Raphson, will yield the $\beta$ obtained from logistic regression. However, a somewhat broader class of models is estimable under a similar framework. For instance, the link function can be taken to be the log (so that the log of the mean equals the linear predictor), in which case the regression coefficients are log relative risks rather than log odds ratios. Which, given the well documented pitfalls of interpreting ORs as RRs, behooves me to ask why anyone fits logistic regression models at all anymore.
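A minimal Newton-Raphson sketch for the logistic case follows, assuming NumPy and synthetic data. The function `fit_logistic_newton`, its defaults, and `beta_true` are made up for illustration, and there are no safeguards such as step halving or a check for separated data. With the inverse logit, $\frac{\partial g(\mathbf{X}\beta)}{\partial \beta} = \mathbf{X}^T\mathbf{V}$, so the estimating equation reduces to $0 = \mathbf{X}^T(Y - g(\mathbf{X}\beta))$.

```python
import numpy as np

def inv_logit(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_newton(X, Y, n_iter=25, tol=1e-10):
    """Solve X^T (Y - g(X beta)) = 0 by Newton-Raphson, starting from zero."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = inv_logit(X @ beta)             # fitted means g(X beta)
        estimating_eq = X.T @ (Y - p)       # value we want to drive to zero
        v = p * (1 - p)                     # diagonal of V
        information = X.T @ (v[:, None] * X)   # X^T V X
        step = np.linalg.solve(information, estimating_eq)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])  # intercept + 2 covariates
beta_true = np.array([-0.5, 1.0, -1.0])                         # made-up coefficients
Y = (rng.uniform(size=200) < inv_logit(X @ beta_true)).astype(float)

beta_hat = fit_logistic_newton(X, Y)
residual_eq = X.T @ (Y - inv_logit(X @ beta_hat))   # should be numerically zero
```

Swapping `inv_logit` for another inverse link and `v` for another mean-variance function changes the update, since $\frac{\partial g}{\partial \beta}\mathbf{V}^{-1}$ no longer cancels to $\mathbf{X}^T$; the simplification used here is specific to the logit link.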

— AdamO

+1 great answer. Formulating it as root finding on the derivative is really new for me, and the second equation is really concise.

— Haitao Du