- Input
    - data: $\mathbf{x}$ (shape 2,)
    - label: $\mathbf{y}$ (shape 3,)
- Weight
    - Input -> Hidden layer: $\mathbf{w} = \begin{bmatrix}w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23}\end{bmatrix}$ (shape 2x3)
- Bias
    - Hidden layer: $\mathbf{b} = [b_1, b_2, b_3]^T$ (shape 3,)
- Layer nodes
    - Input layer: $\mathbf{x} = [x_1, x_2]^T$
    - Hidden layer: $\boldsymbol{\theta} = [\theta_1, \theta_2, \theta_3]^T = \mathbf{w}^T\mathbf{x} + \mathbf{b}$, i.e. $\displaystyle \theta_k = \sum_{i=1}^{2} w_{ik}x_i + b_k$
    - Output layer (Softmax): $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \hat{y}_3]^T$, where $\displaystyle \hat{y}_i = \operatorname{softmax}(\boldsymbol{\theta})_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{3} \exp(\theta_j)}, \ \forall i = 1, 2, 3$
- Loss Function: Cross Entropy
    - $\displaystyle J(\boldsymbol{\theta}) = \operatorname{CE}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i=1}^{3} y_i \log(\hat{y}_i)$, with $\boldsymbol{\theta} = \mathbf{w}^T\mathbf{x} + \mathbf{b}$
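Before the derivation, here is a minimal NumPy sketch of the forward pass and loss defined above (the function and variable names are illustrative, not from the original note):

```python
import numpy as np

def softmax(theta):
    """softmax(theta)_i = exp(theta_i) / sum_j exp(theta_j)"""
    exp_t = np.exp(theta)
    return exp_t / exp_t.sum()

def cross_entropy(y, y_hat):
    """J = -sum_i y_i * log(y_hat_i)"""
    return -np.sum(y * np.log(y_hat))

def forward(x, w, b):
    """theta = w^T x + b, then y_hat = softmax(theta)"""
    theta = w.T @ x + b          # (3,) = (3, 2) @ (2,) + (3,)
    return theta, softmax(theta)

# Shapes match the table above: x (2,), y (3,), w (2, 3), b (3,)
x = np.array([8.0, 7.0])
y = np.array([0.0, 0.0, 1.0])
w = np.full((2, 3), 0.5)
b = np.ones(3)

theta, y_hat = forward(x, w, b)
print("y_hat =", y_hat, " loss =", cross_entropy(y, y_hat))
```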
What we want to calculate is the gradient of the loss with respect to the logits, $\dfrac{\partial J}{\partial \theta_j}$.

According to the chain rule we get

$\displaystyle \frac{\partial J}{\partial \theta_j} = \sum_{i=1}^{3} \frac{\partial J}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial \theta_j}$

So we first calculate the derivative of the softmax function (define $\displaystyle \hat{y}_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{3}\exp(\theta_j)}$).

Consider two different cases:

- If $i = j$: $\dfrac{\partial \hat{y}_i}{\partial \theta_j} = \hat{y}_i(1 - \hat{y}_j)$
- If $i \neq j$: $\dfrac{\partial \hat{y}_i}{\partial \theta_j} = -\hat{y}_i\hat{y}_j$

Note that if $i = j$ this derivative is similar to the derivative of the logistic function, $\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$.

Then we calculate the derivative of the cross-entropy loss function for the softmax function (from $J = -\sum_{i=1}^{3} y_i \log(\hat{y}_i)$ we have $\dfrac{\partial J}{\partial \hat{y}_i} = -\dfrac{y_i}{\hat{y}_i}$).

Let's find out:

$\displaystyle \frac{\partial J}{\partial \theta_j} = -\sum_{i=1}^{3} \frac{y_i}{\hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial \theta_j} = -\frac{y_j}{\hat{y}_j}\,\hat{y}_j(1 - \hat{y}_j) + \sum_{i \neq j} \frac{y_i}{\hat{y}_i}\,\hat{y}_i\hat{y}_j = -y_j + y_j\hat{y}_j + \sum_{i \neq j} y_i\hat{y}_j$

Because the label $\mathbf{y}$ is one-hot, $\displaystyle \sum_{i=1}^{3} y_i = 1$, and $\displaystyle y_j\hat{y}_j + \sum_{i \neq j} y_i\hat{y}_j = \hat{y}_j \sum_{i=1}^{3} y_i = \hat{y}_j$.

So we found that

$\displaystyle \frac{\partial J}{\partial \theta_j} = \hat{y}_j - y_j$
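From $\partial J/\partial \theta_j = \hat{y}_j - y_j$, one more chain-rule step (not spelled out above, but implied by the weight updates in the experiments below) gives $\dfrac{\partial J}{\partial w_{ij}} = (\hat{y}_j - y_j)\,x_i$ and $\dfrac{\partial J}{\partial b_j} = \hat{y}_j - y_j$, since $\theta_j = \sum_i w_{ij}x_i + b_j$. As a sanity check, the sketch below compares these analytic gradients against a finite-difference estimate (names are illustrative):

```python
import numpy as np

def softmax(theta):
    exp_t = np.exp(theta)
    return exp_t / exp_t.sum()

def loss(x, y, w, b):
    y_hat = softmax(w.T @ x + b)
    return -np.sum(y * np.log(y_hat))

def grads(x, y, w, b):
    """Analytic gradients built from dJ/dtheta = y_hat - y."""
    y_hat = softmax(w.T @ x + b)
    d_theta = y_hat - y            # (3,)
    d_w = np.outer(x, d_theta)     # dJ/dw_{ij} = (y_hat_j - y_j) * x_i, shape (2, 3)
    d_b = d_theta                  # dJ/db_j   = y_hat_j - y_j,         shape (3,)
    return d_w, d_b

x, y = np.array([8.0, 7.0]), np.array([0.0, 0.0, 1.0])
w, b = np.full((2, 3), 0.5), np.ones(3)
d_w, d_b = grads(x, y, w, b)

# Finite-difference check on dJ/db (the same idea works for dJ/dw)
eps = 1e-6
for k in range(3):
    b_plus, b_minus = b.copy(), b.copy()
    b_plus[k] += eps
    b_minus[k] -= eps
    numeric = (loss(x, y, w, b_plus) - loss(x, y, w, b_minus)) / (2 * eps)
    print(k, d_b[k], numeric)  # analytic vs. numeric; they should agree closely
```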
- Weight & Bias
    - $\mathbf{w}$ = 0.5 for all $w$
    - $\mathbf{b}$ = 1 for all $b$
- Input
    - $\mathbf{x} = [8, 7]$, $\mathbf{y} = [0, 0, 1]$
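The script that produced the logs below is not included in the note. As a rough sketch, a full-batch gradient-descent loop using the gradient derived above would look something like this (the learning rate of 1 is an assumption, and the printed loss values may not match the log exactly):

```python
import numpy as np

def softmax(theta):
    exp_t = np.exp(theta)
    return exp_t / exp_t.sum()

def train(x, y, w, b, rounds=2, lr=1.0):
    """Gradient descent on a single (x, y) pair using dJ/dtheta = y_hat - y."""
    for r in range(rounds):
        y_hat = softmax(w.T @ x + b)
        print(f"Round: {r}  loss: {-np.sum(y * np.log(y_hat)):.6g}")
        d_theta = y_hat - y
        w = w - lr * np.outer(x, d_theta)   # dJ/dw = x (y_hat - y)^T
        b = b - lr * d_theta                # dJ/db = y_hat - y
    return w, b

x, y = np.array([8.0, 7.0]), np.array([0.0, 0.0, 1.0])
w, b = np.full((2, 3), 0.5), np.ones(3)   # w = 0.5 everywhere, b = 1 everywhere
w, b = train(x, y, w, b)
print("Final weight:\n", w)
print("Final bias:\n", b)
```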
Result:
```
Round: 0
Current loss: [[1.9095425]]
Current weight:
[[0.5 0.5 0.5]
[0.5 0.5 0.5]]
Current bias:
[[1]
[1]
[1]]
Round: 1
Current loss: [[2.99760217e-15]]
Current weight:
[[-2.16666667 -2.16666667 5.83333333]
[-1.83333333 -1.83333333 5.16666667]]
Current bias:
[[0.66666667]
[0.66666667]
[1.66666667]]
======= Finish Training ======
After 1 round training
Final weight:
[[-2.16666667 -2.16666667 5.83333333]
[-1.83333333 -1.83333333 5.16666667]]
Final bias:
[[0.66666667]
[0.66666667]
[1.66666667]]
Final loss: [[2.99760217e-15]]
y_hat =
[[3.09335001e-50]
[3.09335001e-50]
[1.00000000e+00]]
```
- Weight & Bias
    - $\mathbf{w}$ = random uniform distribution 0~1
    - $\mathbf{b}$ = 0 for all $b$
- Input
    - $\mathbf{x} = [8, 7]$, $\mathbf{y} = [0, 0, 1]$
Result:
```
Round: 0
Current loss: [[0.46077411]]
Current weight:
[[-0.03010464 -0.48454925 -0.19359648]
[-0.48489008 -0.29742846 -0.09327472]]
Current bias:
[[0.]
[0.]
[0.]]
Round: 1
Current loss: [[3.02091685e-13]]
Current weight:
[[-1.54217344 -0.63265234 1.46657541]
[-1.80795027 -0.42701866 1.35937568]]
Current bias:
[[-0.1890086 ]
[-0.01851289]
[ 0.20752149]]
======= Finish Training ======
After 1 round training
Final weight:
[[-1.54217344 -0.63265234 1.46657541]
[-1.80795027 -0.42701866 1.35937568]]
Final bias:
[[-0.1890086 ]
[-0.01851289]
[ 0.20752149]]
Final loss: [[3.02091685e-13]]
y_hat =
[[5.56493009e-21]
[1.50529585e-13]
[1.00000000e+00]]
```
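Both runs end with probabilities as small as ~1e-50, which shows how large the logits get with an input like [8, 7]; a naive `np.exp(theta)` can overflow once the logits grow further. A common safeguard (an addition here, not something the note itself uses) is to subtract the maximum logit before exponentiating, which leaves the softmax output unchanged:

```python
import numpy as np

def softmax_stable(theta):
    """Shift-invariant softmax: subtracting max(theta) avoids exp() overflow."""
    shifted = theta - np.max(theta)
    exp_t = np.exp(shifted)
    return exp_t / exp_t.sum()

print(softmax_stable(np.array([1000.0, 1001.0, 1002.0])))  # naive exp() would overflow here
```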
Softmax & Cross Entropy
- Softmax classification with cross-entropy
- Classification and Loss Evaluation - Softmax and Cross Entropy Loss
- Backpropagation with Softmax / Cross Entropy
- Derivative of Softmax loss function
- The Softmax function and its derivative
- CSDN - [Deep Learning]: A very detailed derivation of the Softmax derivative
Softmax in Python
