This is a very interesting question. I do not know the exact historical reason, but I think the following argument, inspired by statistical mechanics and the principle of maximum entropy, can explain the use of the exponential function.
I will explain this with an example of $N$ images, made up of $n_1$ images from class $C_1$, $n_2$ images from class $C_2$, ..., and $n_K$ images from class $C_K$. We then assume that our neural network applies a nonlinear transform to the images such that we can assign an 'energy level' $E_k$ to each class. We assume that this energy lives on a nonlinear scale on which the images become linearly separable.
The mean energy $\bar{E}$ is related to the class energies $E_k$ by the relationship

$$N\bar{E} = \sum_{k=1}^{K} n_k E_k. \tag{$*$}$$

At the same time, we see that the total number of images can be written as the sum

$$N = \sum_{k=1}^{K} n_k. \tag{$**$}$$
The main idea of the maximum entropy principle is that the images are distributed over the classes in such a way that the number of possible arrangements for the given energy distribution is maximized. To put it more simply: the system is very unlikely to end up in a state in which all $N$ images sit in class $C_1$, and it is also unlikely to end up in a state with exactly the same number of images in every class. But why is this so? If all the images were in one class, the system would have very low entropy. The perfectly even split would also be an unnatural situation once the mean energy is fixed. It is more likely that we have many images with moderate energy and fewer images with very high or very low energy.
The entropy increases with the number of ways in which we can split the $N$ images into the classes of sizes $n_1, n_2, \ldots, n_K$ with the corresponding energies. This number of combinations is given by the multinomial coefficient

$$\binom{N}{n_1, n_2, \ldots, n_K} = \frac{N!}{\prod_{k=1}^{K} n_k!}.$$
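To make the counting argument concrete, here is a small numerical sketch (not part of the derivation; the numbers are arbitrary). Note that without the energy constraint $(*)$ the uniform split actually has the most arrangements; it is the additional constraint on the mean energy that pushes the maximizer towards the exponential form derived below.

```python
from math import lgamma

def log_multinomial(counts):
    """ln( N! / (n_1! * ... * n_K!) ), computed with log-gamma for stability."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)

# Three arbitrary ways of splitting N = 100 images into K = 4 classes.
splits = {
    "all in one class": [100, 0, 0, 0],
    "uniform split":    [25, 25, 25, 25],
    "uneven split":     [40, 30, 20, 10],
}
for name, counts in splits.items():
    print(f"{name:17s} log(#arrangements) = {log_multinomial(counts):7.2f}")
```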
We will try to maximize this number, assuming that we have infinitely many images, $N \to \infty$. Since the logarithm is monotonically increasing, maximizing the multinomial coefficient is equivalent to maximizing its logarithm, which is easier to handle. This maximization is subject to the equality constraints $(*)$ and $(**)$; such a problem is called constrained optimization, and we can solve it analytically with the method of Lagrange multipliers. We introduce the Lagrange multipliers $\beta$ and $\alpha$ for the two equality constraints and form the Lagrange function $\mathcal{L}(n_1, n_2, \ldots, n_K; \alpha, \beta)$:

$$\mathcal{L}(n_1, n_2, \ldots, n_K; \alpha, \beta) = \ln\frac{N!}{\prod_{k=1}^{K} n_k!} + \beta\left[\sum_{k=1}^{K} n_k E_k - N\bar{E}\right] + \alpha\left[N - \sum_{k=1}^{K} n_k\right].$$
As we assumed $N \to \infty$, we can also assume $n_k \to \infty$ and use the Stirling approximation for the factorial,

$$\ln n! = n \ln n - n + \mathcal{O}(\ln n).$$
Note that this two-term approximation is only asymptotic: the absolute error does not vanish as $n \to \infty$ (it grows like $\tfrac{1}{2}\ln(2\pi n)$), but the relative error does.
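A quick numerical check of this remark (a sketch; `math.lgamma(n + 1)` is used here for the exact value of $\ln n!$):

```python
from math import lgamma, log

for n in (10, 100, 1_000, 10_000):
    exact = lgamma(n + 1)          # ln n!
    stirling = n * log(n) - n      # two-term Stirling approximation
    abs_err = exact - stirling     # grows roughly like 0.5 * ln(2*pi*n) ...
    rel_err = abs_err / exact      # ... while the relative error shrinks
    print(f"n={n:6d}  abs err={abs_err:5.2f}  rel err={rel_err:.1e}")
```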
Taking the partial derivative of the Lagrange function with respect to $n_{\tilde{k}}$ results in

$$\frac{\partial \mathcal{L}}{\partial n_{\tilde{k}}} = -\ln n_{\tilde{k}} - \alpha + \beta E_{\tilde{k}}.$$

If we set this partial derivative to zero, we find

$$n_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\exp(\alpha)}. \tag{$***$}$$
If we put this back into $(**)$, we obtain

$$\exp(\alpha) = \frac{1}{N}\sum_{k=1}^{K} \exp(\beta E_k).$$
If we put this back into $(***)$, we get something that should remind us of the softmax function:

$$n_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\frac{1}{N}\sum_{k=1}^{K} \exp(\beta E_k)}.$$
If we define $p_{\tilde{k}} = n_{\tilde{k}}/N$ as the probability of class $C_{\tilde{k}}$, we obtain the familiar softmax form:

$$p_{\tilde{k}} = \frac{\exp(\beta E_{\tilde{k}})}{\sum_{k=1}^{K} \exp(\beta E_k)}.$$
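As a small sketch in code, this is the formula just derived, with $\beta$ kept as a free scale parameter (subtracting the maximum inside is only a standard numerical-stability trick and does not change the result):

```python
import numpy as np

def softmax(E, beta=1.0):
    """p_k = exp(beta * E_k) / sum_j exp(beta * E_j)."""
    z = beta * np.asarray(E, dtype=float)
    z -= z.max()                 # shift by the maximum; cancels in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))            # moderate peaking at the largest E_k
print(softmax([1.0, 2.0, 3.0], beta=5.0))  # larger beta -> sharper distribution
```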
Hence, this shows us that the softmax function is the distribution that maximizes the entropy of the class assignments under the constraints $(*)$ and $(**)$. From this point of view, it makes sense to use it as the output distribution over classes. If we set $\beta E_{\tilde{k}} = \mathbf{w}_{\tilde{k}}^{T}\mathbf{x}$, we get exactly the definition of the softmax function for the $\tilde{k}$-th output.
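Finally, here is a numerical sanity check of the whole argument (a sketch using `scipy.optimize.minimize` instead of the analytic Lagrange-multiplier solution; the energies, $\beta = 1.7$, and the solver settings are arbitrary choices for illustration): we maximize the entropy $-\sum_k p_k \ln p_k$ subject to normalization and a fixed mean energy, and compare the result with the softmax distribution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
E = rng.normal(size=5)                    # arbitrary class energies E_k
beta = 1.7                                # arbitrary value of the multiplier

p_softmax = np.exp(beta * E) / np.sum(np.exp(beta * E))
E_mean = p_softmax @ E                    # the mean energy this softmax implies

def neg_entropy(p):
    # maximizing the entropy is the same as minimizing its negative
    return np.sum(p * np.log(p))

constraints = (
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # normalization, cf. (**)
    {"type": "eq", "fun": lambda p: p @ E - E_mean},    # fixed mean energy, cf. (*)
)
p0 = np.full(E.size, 1.0 / E.size)        # start from the uniform distribution
res = minimize(neg_entropy, p0, method="SLSQP",
               constraints=constraints, bounds=[(1e-9, 1.0)] * E.size)

print("max-entropy solution:", np.round(res.x, 4))
print("softmax(beta * E)   :", np.round(p_softmax, 4))
```

The two printed distributions agree (up to solver tolerance), which is exactly what the derivation above predicts.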