This is a very interesting question. I do not know the exact historical reason, but I think the following argument, inspired by statistical mechanics and the principle of maximum entropy, can explain the use of the exponential function.
I will explain this with an example of $N$ images, made up of $n_1$ images from class $C_1$, $n_2$ images from class $C_2$, ..., and $n_K$ images from class $C_K$. We then assume that our neural network applies a nonlinear transform to the images such that we can assign an 'energy level' $E_k$ to each class. We assume that this energy lives on a nonlinear scale on which the images become linearly separable.
The mean energy $\bar{E}$ is related to the class energies $E_k$ by the relationship

$$N\bar{E} = \sum_{k=1}^{K} n_k E_k. \tag{$*$}$$

At the same time, we see that the total number of images can be written as the sum

$$N = \sum_{k=1}^{K} n_k. \tag{$**$}$$
The main idea of the maximum entropy principle is that the images are distributed over the classes in such a way that the number of possible arrangements for the given energy distribution is maximized. To put it more simply: the system is very unlikely to end up in a state in which all $N$ images sit in class $C_1$, and it is also unlikely to end up in a state with exactly the same number of images in every class. But why is this so? If all the images were in one class, the system would have very low entropy. The perfectly even split would also be an unnatural situation once the mean energy is fixed. It is more likely that we have many images with moderate energy and fewer images with very high or very low energy.
The entropy increases with the number of ways in which we can split the $N$ images into the classes of sizes $n_1, n_2, \ldots, n_K$ with the corresponding energies. This number of combinations is given by the multinomial coefficient

$$\binom{N}{n_1, n_2, \ldots, n_K} = \frac{N!}{\prod_{k=1}^{K} n_k!}.$$
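To make the counting argument concrete, here is a small numerical sketch (not part of the derivation; the numbers are arbitrary). Note that without the energy constraint $(*)$ the uniform split actually has the most arrangements; it is the additional constraint on the mean energy that pushes the maximizer towards the exponential form derived below.

```python
from math import lgamma

def log_multinomial(counts):
    """ln( N! / (n_1! * ... * n_K!) ), computed with log-gamma for stability."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)

# Three arbitrary ways of splitting N = 100 images into K = 4 classes.
splits = {
    "all in one class": [100, 0, 0, 0],
    "uniform split":    [25, 25, 25, 25],
    "uneven split":     [40, 30, 20, 10],
}
for name, counts in splits.items():
    print(f"{name:17s} log(#arrangements) = {log_multinomial(counts):7.2f}")
```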
We will try to maximize this number, assuming that we have infinitely many images, $N \to \infty$. Since the logarithm is monotonically increasing, maximizing the multinomial coefficient is equivalent to maximizing its logarithm, which is easier to handle. This maximization is subject to the equality constraints $(*)$ and $(**)$; such a problem is called constrained optimization, and we can solve it analytically with the method of Lagrange multipliers. We introduce the Lagrange multipliers $\beta$ and $\alpha$ for the two equality constraints and form the Lagrange function $\mathcal{L}(n_1, n_2, \ldots, n_K; \alpha, \beta)$:

$$\mathcal{L}(n_1, n_2, \ldots, n_K; \alpha, \beta) = \ln\frac{N!}{\prod_{k=1}^{K} n_k!} + \beta\left[\sum_{k=1}^{K} n_k E_k - N\bar{E}\right] + \alpha\left[N - \sum_{k=1}^{K} n_k\right].$$
As we assumed $N \to \infty$, we can also assume $n_k \to \infty$ and use the Stirling approximation for the factorial,

$$\ln n! = n \ln n - n + \mathcal{O}(\ln n).$$
Note that this two-term approximation is only asymptotic: the absolute error does not vanish as $n \to \infty$ (it grows like $\tfrac{1}{2}\ln(2\pi n)$), but the relative error does.
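A quick numerical check of this remark (a sketch; `math.lgamma(n + 1)` is used here for the exact value of $\ln n!$):

```python
from math import lgamma, log

for n in (10, 100, 1_000, 10_000):
    exact = lgamma(n + 1)          # ln n!
    stirling = n * log(n) - n      # two-term Stirling approximation
    abs_err = exact - stirling     # grows roughly like 0.5 * ln(2*pi*n) ...
    rel_err = abs_err / exact      # ... while the relative error shrinks
    print(f"n={n:6d}  abs err={abs_err:5.2f}  rel err={rel_err:.1e}")
```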
Taking the partial derivative of the Lagrange function with respect to $n_{\tilde{k}}$ results in

$$\frac{\partial \mathcal{L}}{\partial n_{\tilde{k}}} = -\ln n_{\tilde{k}} - \alpha + \beta E_{\tilde{k}}.$$

If we set this partial derivative to zero, we find

$$n_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\exp(\alpha)}. \tag{$***$}$$
If we put this back into $(**)$, we obtain

$$\exp(\alpha) = \frac{1}{N}\sum_{k=1}^{K} \exp(\beta E_k).$$
If we put this back into $(***)$, we get something that should remind us of the softmax function:

$$n_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\frac{1}{N}\sum_{k=1}^{K} \exp(\beta E_k)}.$$
If we define $p_{\tilde{k}} = n_{\tilde{k}}/N$ as the probability of class $C_{\tilde{k}}$, we obtain the familiar softmax form:

$$p_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\sum_{k=1}^{K} \exp(\beta E_k)}.$$
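As a small sketch in code, this is the formula just derived, with $\beta$ kept as a free scale parameter (subtracting the maximum inside is only a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(E, beta=1.0):
    """p_k = exp(beta * E_k) / sum_j exp(beta * E_j)."""
    z = beta * np.asarray(E, dtype=float)
    z -= z.max()                 # shift by the maximum; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))            # moderate peaking at the largest E_k
print(softmax([1.0, 2.0, 3.0], beta=5.0))  # larger beta -> sharper distribution
```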
Hence, this shows us that the softmax function is the distribution that maximizes the entropy of the class assignments under the constraints $(*)$ and $(**)$. From this point of view, it makes sense to use it as the output distribution over classes. If we set $\beta E_{\tilde{k}} = \mathbf{w}_{\tilde{k}}^{T}\mathbf{x}$, we get exactly the definition of the softmax function for the $\tilde{k}$-th output.
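Finally, here is a numerical sanity check of the whole argument (a sketch using `scipy.optimize.minimize` instead of the analytic Lagrange-multiplier solution; the energies, $\beta = 1.7$, and the solver settings are arbitrary choices for illustration): we maximize the entropy $-\sum_k p_k \ln p_k$ subject to normalization and a fixed mean energy, and compare the result with the softmax distribution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
E = rng.normal(size=5)                    # arbitrary class energies E_k
beta = 1.7                                # arbitrary value of the multiplier

p_softmax = np.exp(beta * E) / np.sum(np.exp(beta * E))
E_mean = p_softmax @ E                    # the mean energy this softmax implies

def neg_entropy(p):
    # maximizing the entropy is the same as minimizing its negative
    return np.sum(p * np.log(p))

constraints = (
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # normalization, cf. (**)
    {"type": "eq", "fun": lambda p: p @ E - E_mean},    # fixed mean energy, cf. (*)
)
p0 = np.full(E.size, 1.0 / E.size)        # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP",
               constraints=constraints, bounds=[(1e-9, 1.0)] * E.size)

print("max-entropy solution:", np.round(res.x, 4))
print("softmax(beta * E)   :", np.round(p_softmax, 4))
```

The two printed distributions agree (up to solver tolerance), which is exactly what the derivation above predicts.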