# Softmax & Cross entropy cost function


Softmax is often used as the output layer of a neural network for classification tasks. The key step in backpropagating through it is taking its derivative. Working through this derivation also gives a deeper understanding of backpropagation and of how gradients flow through a network.

## 1. softmax function

The softmax function (sometimes translated as "flexible maximum") is commonly used as the output layer of a neural network for classification tasks. Its outputs can be interpreted as the probabilities of selecting each class. For example, in a classification task with three classes, softmax outputs a probability for each of the three classes based on the relative sizes of their inputs, and the probabilities sum to 1.

The form of softmax function is:

$$
S_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
$$

* $S_i$ is the probability output for class $i$ after passing through softmax
* $z_k$ is the output of the $k$th neuron

The following figure illustrates this more intuitively:

![softmax_demo](images/softmax_demo.png)

Put simply, softmax takes the original outputs, here $[3, 1, 3]$, and maps them to values in $(0, 1)$ that sum to 1, which satisfies the properties of a probability distribution. We can therefore interpret them as probabilities, and when selecting the final output node we choose the one with the highest probability (that is, the largest value) as our prediction.
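The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` is chosen here, and subtracting the maximum before exponentiating is an added numerical-stability trick that is not part of the formula in the text (it leaves the result unchanged because the common factor cancels in the fraction).

```python
import math

def softmax(z):
    """Map a list of raw scores to values in (0, 1) that sum to 1."""
    # Assumption (not in the original text): shift by the maximum so that
    # exp() never overflows; the shift cancels in the ratio.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# The example input used in the text.
probs = softmax([3.0, 1.0, 3.0])
print(probs)       # three values, each strictly between 0 and 1
print(sum(probs))  # approximately 1.0
```

Larger inputs receive larger probabilities, and equal inputs receive equal probabilities.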

We start with the output of a neuron. The following figure shows a neuron:

![softmax_neuron](images/softmax_neuron.png)

We assume that the output of the neuron is:

$$
z_i = \sum_{j} w_{ij} x_{j} + b
$$

Here $w_{ij}$ is the $j$th weight of the $i$th neuron and $b$ is the bias. $z_i$ is the $i$th output of the network.

Applying the softmax function to this output gives:

$$
a_i = \frac{e^{z_i}}{\sum_k e^{z_k}}
$$

$a_i$ is the $i$th output value of softmax; the right-hand side is the softmax function applied to the neuron outputs.
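As a rough sketch of these two steps together, the snippet below computes $z_i = \sum_j w_{ij} x_j + b$ for each neuron and then applies softmax. The helper names `layer_output` and `softmax`, as well as the weights, inputs, and bias, are hypothetical values picked only for illustration.

```python
import math

def layer_output(W, x, b):
    """Compute z_i = sum_j W[i][j] * x[j] + b for every neuron i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b for row in W]

def softmax(z):
    """a_i = exp(z_i) / sum_k exp(z_k), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical numbers: 3 neurons, each with 2 weights, and a shared bias b.
W = [[0.2, -0.5],
     [1.0, 0.3],
     [-0.7, 0.8]]
x = [1.0, 2.0]
b = 0.1

z = layer_output(W, x, b)  # raw neuron outputs z_i
a = softmax(z)             # softmax outputs a_i
print(z)
print(a)
```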


### 1.1 Loss function

To train a neural network, we need to compute a loss function, which measures the error between the true value and the network's estimate. Only once we have this error can we know how to adjust the weights in the network.

There are many forms of loss function. Here we use the cross-entropy function, mainly because its derivative is simple and convenient to compute, and because cross entropy can alleviate some slow-learning problems. The **[cross-entropy function](https://blog.csdn.net/u014313009/article/details/51043064)** is:

$$
C = - \sum_i y_i \ln a_i
$$

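Here $y_i$ is the true classification result for class $i$. It is typically one-hot encoded (1 for the correct class and 0 otherwise), so that $\sum_i y_i = 1$.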
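A small sketch of this loss in Python may help make the formula concrete. The function name `cross_entropy` and the example vectors are assumptions made for illustration, and the label is assumed to be one-hot; terms with $y_i = 0$ are skipped since they contribute nothing to the sum.

```python
import math

def cross_entropy(y, a):
    """C = -sum_i y_i * ln(a_i), skipping terms where y_i is zero."""
    return -sum(y_i * math.log(a_i) for y_i, a_i in zip(y, a) if y_i != 0)

# Hypothetical example: the true class is index 1.
y = [0.0, 1.0, 0.0]
good = [0.1, 0.8, 0.1]  # most probability on the correct class
bad = [0.7, 0.2, 0.1]   # most probability on a wrong class

print(cross_entropy(y, good))
print(cross_entropy(y, bad))
```

With a one-hot label the loss reduces to $-\ln a_c$ for the correct class $c$, so it shrinks as the predicted probability of the correct class approaches 1.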



## 2. Derivation

First, we need to be clear about the goal: we want the gradient of the loss with respect to the neuron output $z_i$, which is:

$$
\frac{\partial C}{\partial z_i}
$$

According to the chain rule for composite functions, summing over every softmax output $a_j$:

$$
\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial a_j} \frac{\partial a_j}{\partial z_i}
$$

One may ask why $a_j$ appears here instead of $a_i$. The reason lies in the form of softmax: its denominator contains the outputs of all neurons. Every output $a_j$ with $j \ne i$ therefore also depends on $z_i$, so all $a_j$ must be included in the calculation, which means summing over $j$. The derivative of softmax then has to be split into two cases: $i = j$ and $i \ne j$.

### 2.1 The partial derivative with respect to $a_j$

$$
\frac{\partial C}{\partial a_j} = \frac{\partial (-\sum_k y_k \ln a_k)}{\partial a_j} = - \frac{y_j}{a_j}
$$

### 2.2 The partial derivative with respect to $z_i$

If $i=j$ :

\begin{eqnarray}
\frac{\partial a_i}{\partial z_i} & = & \frac{\partial (\frac{e^{z_i}}{\sum_k e^{z_k}})}{\partial z_i} \\
  & = & \frac{\sum_k e^{z_k} e^{z_i} - (e^{z_i})^2}{(\sum_k e^{z_k})^2} \\
  & = & (\frac{e^{z_i}}{\sum_k e^{z_k}} ) (1 - \frac{e^{z_i}}{\sum_k e^{z_k}} ) \\
  & = & a_i (1 - a_i)
\end{eqnarray}

If $i \ne j$:
\begin{eqnarray}
\frac{\partial a_j}{\partial z_i} & = & \frac{\partial (\frac{e^{z_j}}{\sum_k e^{z_k}})}{\partial z_i} \\
  & = &  \frac{0 \cdot \sum_k e^{z_k} - e^{z_j} \cdot e^{z_i} }{(\sum_k e^{z_k})^2} \\
  & = & - \frac{e^{z_j}}{\sum_k e^{z_k}} \cdot \frac{e^{z_i}}{\sum_k e^{z_k}} \\
  & = & -a_j a_i
\end{eqnarray}

Both cases use the quotient rule. For differentiable functions $u$ and $v$:
$$
(\frac{u}{v})' = \frac{u'v - uv'}{v^2} 
$$
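One way to sanity-check the two cases $\partial a_i / \partial z_i = a_i (1 - a_i)$ and $\partial a_j / \partial z_i = -a_j a_i$ is to compare them with a finite-difference approximation. The sketch below does this in plain Python; the function names, the test input, and the step size `eps` are all choices made here for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """J[j][i] = d a_j / d z_i: a_i (1 - a_i) if i == j, else -a_j a_i."""
    a = softmax(z)
    n = len(a)
    return [[a[j] * ((1.0 if i == j else 0.0) - a[i]) for i in range(n)]
            for j in range(n)]

def numeric_jacobian(z, eps=1e-6):
    """Central-difference estimate of the same matrix."""
    n = len(z)
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        a_plus, a_minus = softmax(z_plus), softmax(z_minus)
        for j in range(n):
            J[j][i] = (a_plus[j] - a_minus[j]) / (2 * eps)
    return J

z = [3.0, 1.0, 3.0]
J_exact = softmax_jacobian(z)
J_num = numeric_jacobian(z)
max_err = max(abs(J_exact[j][i] - J_num[j][i])
              for j in range(len(z)) for i in range(len(z)))
print(max_err)  # should be tiny if the formulas are right
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to 1.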

### 2.3 Derivation of the whole

\begin{eqnarray}
\frac{\partial C}{\partial z_i} & = & -\sum_j \frac{y_j}{a_j} \frac{\partial a_j}{\partial z_i} \\
  & = & - \frac{y_i}{a_i} a_i ( 1 - a_i) + \sum_{j \ne i} \frac{y_j}{a_j} a_i a_j \\
  & = & -y_i + y_i a_i + \sum_{j \ne i} y_j a_i \\
  & = & -y_i + a_i \sum_{j} y_j \\
  & = & -y_i + a_i
\end{eqnarray}
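The final result $\partial C / \partial z_i = a_i - y_i$ can be checked numerically. The following sketch compares it against a central-difference gradient of the loss; it assumes a one-hot label (so that $\sum_j y_j = 1$, which the last step of the derivation relies on), and the function names and test values are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross entropy of softmax(z) against the label y."""
    a = softmax(z)
    return -sum(y_i * math.log(a_i) for y_i, a_i in zip(y, a) if y_i != 0)

def analytic_grad(z, y):
    """The derived result: dC/dz_i = a_i - y_i."""
    return [a_i - y_i for a_i, y_i in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    """Central-difference estimate of dC/dz_i."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))
    return grad

z = [3.0, 1.0, 3.0]
y = [0.0, 1.0, 0.0]  # one-hot label, so sum(y) == 1

g_exact = analytic_grad(z, y)
g_num = numeric_grad(z, y)
grad_err = max(abs(e - n) for e, n in zip(g_exact, g_num))
print(g_exact)
print(grad_err)  # should be tiny if the derivation is right
```

The gradient is negative only for the true class, so a gradient-descent step raises $z_i$ for the correct class and lowers it for the others.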

## 3. Question
How can the softmax and cross-entropy cost function from this section be applied to the backpropagation (BP) method from the previous section?

## References

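* Softmax & cross entropy
  * [Cross-entropy cost function (purpose and formula derivation)](https://blog.csdn.net/u014313009/article/details/51043064)
  * [A hand-worked example that walks step by step through the softmax function and its derivatives](https://www.jianshu.com/p/ffa51250ba2e)
  * [An easy-to-understand derivation of the softmax cross-entropy loss gradient](https://www.jianshu.com/p/c02a1fbffad6)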