Understanding logistic regression and likelihood

How does parameter estimation / training of a logistic regression really work? I'll try to put down what I have so far.

The output y is the output of the logistic function, in the form of a probability that depends on the value of x:

$P(y=1|x)={1\over1+e^{-\omega^Tx}}\equiv\sigma(\omega^Tx)$

$P(y=0|x)=1-P(y=1|x)=1-{1\over1+e^{-\omega^Tx}}$

For one dimension, the so-called odds are defined as follows:

${{p(y=1|x)}\over{1-p(y=1|x)}}={{p(y=1|x)}\over{p(y=0|x)}}=e^{\omega_0+\omega_1x}$

Now applying the log function to get $\omega_0$ and $\omega_1$ in linear form:

$Logit(y)=\log\left({{p(y=1|x)}\over{1-p(y=1|x)}}\right)=\omega_0+\omega_1x$

Now to the problem part. Using the likelihood (capital X is y):

$L(X|P)=\prod^N_{i=1,y_i=1}P(x_i)\prod^N_{i=1,y_i=0}(1-P(x_i))$

Can anyone tell me why we consider the probability of y = 1 twice, given that

$P(y=0|x)=1-P(y=1|x)$

and how do I get the values of $\omega$ from it?

regression logistic likelihood

— Engine

Answers:

Assume in general that you have decided to take a model of the form

$P(y=1|X=x) = h(x;\Theta)$

for some parameter $\Theta$. Then you simply write down the likelihood for it, i.e.

$L(\Theta) = \prod_{i \in \{1, ..., N\}, y_i = 1} P(y=1|x=x_i;\Theta) \cdot \prod_{i \in \{1, ..., N\}, y_i = 0} P(y=0|x=x_i;\Theta)$

which is the same as

$L(\Theta) = \prod_{i \in \{1, ..., N\}, y_i = 1} P(y=1|x=x_i;\Theta) \cdot \prod_{i \in \{1, ..., N\}, y_i = 0} (1-P(y=1|x=x_i;\Theta))$

Now you have decided to 'assume' (model)

$P(y=1|X=x) = \sigma(\Theta_0 + \Theta_1 x)$

where

$\sigma(z) = 1/(1+e^{-z})$

so you just compute the formula for the likelihood and run some kind of optimization algorithm in order to find $\text{argmax}_\Theta L(\Theta)$, for example Newton's method or any other gradient-based method.
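As a rough illustration of that last step, the sketch below maximizes the log-likelihood (which has the same argmax as $L(\Theta)$) by plain gradient ascent. The toy data, the step size `lr`, the number of steps and all function names are assumptions made up for this example; they do not come from the answer.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(theta0, theta1, xs, ys):
    """log L(Theta) for the model P(y=1|x) = sigmoid(theta0 + theta1 * x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(theta0 + theta1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def gradient(theta0, theta1, xs, ys):
    """Gradient of log L: the sum over i of (y_i - p_i) * (1, x_i)."""
    g0 = sum(y - sigmoid(theta0 + theta1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(theta0 + theta1 * x)) * x for x, y in zip(xs, ys))
    return g0, g1

def fit_gradient_ascent(xs, ys, lr=0.05, steps=5000):
    """Gradient ascent on log L; lr and steps are illustrative choices."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = gradient(theta0, theta1, xs, ys)
        theta0 += lr * g0
        theta1 += lr * g1
    return theta0, theta1

# Hypothetical toy data; the two classes overlap, so a finite maximum exists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

theta0_hat, theta1_hat = fit_gradient_ascent(xs, ys)
```

Working with the logarithm avoids multiplying many numbers smaller than one, which is why the sketch climbs $\log L(\Theta)$ rather than $L(\Theta)$ itself.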

Sometimes people say that when they do logistic regression they do not maximize a likelihood (as we/you did above) but rather minimize a loss function

$l(\Theta) = -\sum_{i=1}^N \left[ y_i\log(P(Y_i=1|X=x_i;\Theta)) + (1-y_i)\log(P(Y_i=0|X=x_i;\Theta)) \right]$

but notice that $-\log(L(\Theta)) = l(\Theta)$.
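The identity can be checked numerically. The following is a minimal sketch with made-up data and arbitrary parameter values (both are assumptions for illustration only): it computes the likelihood as two products and the loss as one sum, then compares them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data and parameter values, chosen only for illustration.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]
theta0, theta1 = -2.0, 1.0

probs = [sigmoid(theta0 + theta1 * x) for x in xs]

# Likelihood as two products: one over y_i = 1 and one over y_i = 0.
prod_ones = math.prod(p for p, y in zip(probs, ys) if y == 1)
prod_zeros = math.prod(1.0 - p for p, y in zip(probs, ys) if y == 0)
likelihood = prod_ones * prod_zeros

# Loss as a single sum over all observations; for each i only one of the
# two terms is non-zero because y_i is either 0 or 1.
loss = -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
            for p, y in zip(probs, ys))

neg_log_likelihood = -math.log(likelihood)
print(loss, neg_log_likelihood)  # the two values agree up to rounding
```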

This is a general pattern in machine learning: the practical side (minimizing loss functions that measure how 'wrong' a heuristic model is) is in fact equal to the 'theoretical side' (modelling explicitly with the $P$-symbol, maximizing statistical quantities like likelihoods). In fact, many models that do not look probabilistic (SVMs, for example) can be reinterpreted in a probabilistic context and turn out to be maximizations of likelihoods.

— Fabian

[…] $\prod$ […] $L(\theta)$ […] $y_i = 1$ […] $\omega_1$ […] $\omega_0$ […] $\Sigma$ […] $f(x) = x^2$ […] $x = 3$ […] $f$ as it is too complicated. Now the derivative of $f$ is $f' = 2x$. Interestingly, if we are to the right of the minimum $x=0$ it points to the right, and if we are to the left of it it points left. Mathematically, the derivative points in the direction of the 'steepest ascent'.

— Fabian Werner
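A small sketch of the idea in the comment above, for $f(x) = x^2$: since the derivative points in the direction of ascent, stepping against it moves towards the minimum at $x = 0$. Treating $x = 3$ as the starting point is an assumption based on the surviving fragment, and the step size and iteration count are likewise made up for illustration.

```python
def f(x):
    return x ** 2

def f_prime(x):
    return 2 * x

x = 3.0          # assumed starting point
step_size = 0.1  # assumed value; the comment does not specify one
for _ in range(100):
    # f_prime(x) > 0 right of the minimum and < 0 left of it,
    # so subtracting it always moves x towards 0.
    x = x - step_size * f_prime(x)
```

Each update multiplies $x$ by $1 - 2 \cdot 0.1 = 0.8$, so the iterate shrinks geometrically towards the minimum.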

@Engine: In more dimensions you replace the derivative by the gradient, i.e. you start off at a random point $x_0$ and compute the gradient $\partial f$ at $x_0$, and if you want to maximize then your next point $x_1$ is $x_1 = x_0 + \partial f(x_0)$. Then you compute $\partial f(x_1)$ and your next $x$ is $x_2 = x_1 + \partial f(x_1)$ and so forth. This is called gradient ascent/descent and is the most common technique for maximizing a function. Now you do that with $L(\Theta)$, or in your notation $L(\omega)$, in order to find the $\omega$ that maximizes $L$.

— Fabian Werner

@Engine: You are not at all interested in the case $y=1$! You are interested in 'the' $\omega$ that 'best explains your data'. From that $\omega$ you let the model 'speak for itself' and get back to the case of $y=1$, but first of all you need to set up a model! Here, 'best explains' means 'having the highest likelihood' because that is what people came up with (and I think it is very natural)... however, there are other metrics (different loss functions and so on) that one could use! There are two products because we want the model to explain the $y=1$ cases as well as the $y=0$ cases 'well'!

— Fabian Werner

Your likelihood function (4) consists of two parts: the product of the probability of success for only those people in your sample who experienced a success, and the product of the probability of failure for only those people in your sample who experienced a failure. Given that each individual experiences either a success or a failure, but not both, the probability will appear for each individual only once. That is what the $, y_i=1$ and $,y_i=0$ mean at the bottom of the product signs.

The coefficients are included in the likelihood function by substituting (1) into (4). That way the likelihood function becomes a function of $\omega$ . The point of maximum likelihood is to find the $\omega$ that will maximize the likelihood.
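To make the two points above concrete, here is a minimal sketch that assumes a one-dimensional logistic model and a made-up sample. It builds exactly one factor per individual and then treats the likelihood as a function of the coefficients. The data, the two candidate coefficient pairs and the function names are hypothetical.

```python
import math

def p_success(x, omega0, omega1):
    """P(y=1|x) under the logistic model with coefficients omega0, omega1."""
    return 1.0 / (1.0 + math.exp(-(omega0 + omega1 * x)))

def likelihood_factors(xs, ys, omega0, omega1):
    """One factor per individual: P(success) if y=1, P(failure) if y=0."""
    factors = []
    for x, y in zip(xs, ys):
        p = p_success(x, omega0, omega1)
        factors.append(p if y == 1 else 1.0 - p)
    return factors

def likelihood(xs, ys, omega0, omega1):
    """The likelihood as a function of the coefficients."""
    return math.prod(likelihood_factors(xs, ys, omega0, omega1))

# Hypothetical sample: larger x tends to go with success.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

factors = likelihood_factors(xs, ys, -2.0, 1.0)
like_a = likelihood(xs, ys, -2.0, 1.0)  # slope follows the trend in the data
like_b = likelihood(xs, ys, 2.0, -1.0)  # slope goes against the trend
```

In this toy sample `like_a` is larger than `like_b`, so the first pair of coefficients explains the data better. Maximum likelihood searches over all coefficient values for the pair with the largest such value.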

— Maarten Buis

Thanks so much for your answer; sorry, but I still don't get it. Doesn't $y_i = 0$ mean the probability that y = 0 [doesn't occur] for all y's of the product, and vice versa for $y_i = 1$? And still, after the substitution, how can I find the $\omega$ values: by calculating the second derivative, or the gradient? Thanks a lot for your help!

— Engine

$\prod_{i=1, y=1}^N$ should be read as "product for persons $i=1$ till $N$, but only if $y=1$". So the first part only applies to those persons in your data who experienced the event. Similarly, the second part only refers to persons who did not experience the event.

— Maarten Buis

There are many possible algorithms for maximizing the likelihood function. The most common one, the Newton-Raphson method, indeed involves computing the first and second derivatives.

— Maarten Buis
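As a hedged sketch of what the Newton-Raphson method mentioned in the comment above could look like for a one-dimensional logistic model, the code below uses the gradient (first derivatives) and the Hessian (second derivatives) of the log-likelihood. The toy data, the fixed iteration count and the function names are assumptions for illustration, not taken from the thread.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton_raphson_logistic(xs, ys, iterations=25):
    """Newton-Raphson for the model P(y=1|x) = sigmoid(w0 + w1 * x)."""
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [sigmoid(w0 + w1 * x) for x in xs]
        # First derivatives (gradient) of the log-likelihood.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Second derivatives (Hessian) of the log-likelihood.
        h00 = -sum(p * (1 - p) for p in ps)
        h01 = -sum(p * (1 - p) * x for x, p in zip(xs, ps))
        h11 = -sum(p * (1 - p) * x * x for x, p in zip(xs, ps))
        det = h00 * h11 - h01 * h01
        # Newton step w <- w - H^{-1} g, with the 2x2 inverse written out.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (-h01 * g0 + h00 * g1) / det
        w0 -= step0
        w1 -= step1
    return w0, w1

# Hypothetical toy data; the classes overlap, so the maximum is finite.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

w0_hat, w1_hat = newton_raphson_logistic(xs, ys)
```

At the returned point the gradient is numerically zero, which is the first-order condition for a maximum of the likelihood. A production implementation would normally add a convergence check and guard against perfectly separable data, where no finite maximum exists.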