Study Notes: RNN and LSTM in NLP [Explained Simply]


1. Recurrent neural network

[Figure: Recurrent_neural_network_unfold.svg — an RNN unfolded through time]

1.1 Elman network

$$h_{t} = \sigma_{h}\left(W_{h} x_{t} + U_{h} h_{t-1} + b_{h}\right)$$

$$y_{t} = \sigma_{y}\left(W_{y} h_{t} + b_{y}\right)$$

1.2 Jordan network

$$h_{t} = \sigma_{h}\left(W_{h} x_{t} + U_{h} y_{t-1} + b_{h}\right)$$

$$y_{t} = \sigma_{y}\left(W_{y} h_{t} + b_{y}\right)$$

Variables and functions

  • $x_{t}$: input vector
  • $h_{t}$: hidden layer vector
  • $y_{t}$: output vector
  • $W$, $U$ and $b$: parameter matrices and vector
  • $\sigma_{h}$ and $\sigma_{y}$: activation functions
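
As a concrete illustration, here is a minimal NumPy sketch of one forward step of the Elman network above (the Jordan variant differs only in feeding $y_{t-1}$ instead of $h_{t-1}$ back into the hidden layer). The choice of tanh for $\sigma_{h}$ and softmax for $\sigma_{y}$, as well as all names and sizes, are illustrative assumptions, not fixed by the notes.

```python
import numpy as np

def elman_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One forward step of an Elman RNN (illustrative helper)."""
    # h_t = sigma_h(W_h x_t + U_h h_{t-1} + b_h), with tanh as sigma_h
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)
    # y_t = sigma_y(W_y h_t + b_y), with softmax as an example sigma_y
    z = W_y @ h_t + b_y
    e = np.exp(z - z.max())
    y_t = e / e.sum()
    return h_t, y_t

# Toy usage: unroll over a random sequence of length 5
d, n_h, n_y = 4, 3, 2                                   # arbitrary sizes
rng = np.random.default_rng(0)
W_h, U_h, b_h = rng.normal(size=(n_h, d)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
W_y, b_y = rng.normal(size=(n_y, n_h)), np.zeros(n_y)
h = np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, y = elman_step(x, h, W_h, U_h, b_h, W_y, b_y)    # a Jordan network would feed y back instead of h
```
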
1.3 Bidirectional RNN


[Figure: bidirectional RNN]

2. Long short-term memory

2.1 LSTM with a forget gate

The compact forms of the equations for the forward pass of an LSTM cell with a forget gate are:
$$\begin{aligned}
f_{t} &= \sigma_{g}\left(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}\right) \\
i_{t} &= \sigma_{g}\left(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}\right) \\
o_{t} &= \sigma_{g}\left(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}\right) \\
\tilde{c}_{t} &= \sigma_{c}\left(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}\right) \\
c_{t} &= f_{t} \circ c_{t-1} + i_{t} \circ \tilde{c}_{t} \\
h_{t} &= o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}$$
where the initial values are $c_{0} = 0$ and $h_{0} = 0$, and the operator $\circ$ denotes the Hadamard product (element-wise product). The subscript $t$ indexes the time step.
Variables

  • $x_{t} \in \mathbb{R}^{d}$: input vector to the LSTM unit
  • $f_{t} \in (0,1)^{h}$: forget gate's activation vector
  • $i_{t} \in (0,1)^{h}$: input/update gate's activation vector
  • $o_{t} \in (0,1)^{h}$: output gate's activation vector
  • $h_{t} \in (-1,1)^{h}$: hidden state vector, also known as the output vector of the LSTM unit
  • $\tilde{c}_{t} \in (-1,1)^{h}$: cell input activation vector
  • $c_{t} \in \mathbb{R}^{h}$: cell state vector
  • $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and $b \in \mathbb{R}^{h}$: weight matrices and bias vector parameters, learned during training, where the superscripts $d$ and $h$ refer to the number of input features and the number of hidden units, respectively.
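
A minimal NumPy sketch of one forward step of this cell, assuming the usual choices $\sigma_{g} =$ logistic sigmoid and $\sigma_{c} = \sigma_{h} = \tanh$. The dictionary layout of the parameters and all names are illustrative conventions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell with a forget gate.

    W, U, b are dicts keyed by 'f', 'i', 'o', 'c' (an illustrative layout);
    sigma_g = sigmoid, sigma_c = sigma_h = tanh.
    """
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input/update gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell input activation
    c_t = f_t * c_prev + i_t * c_tilde                          # '*' is the Hadamard product
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Unrolling starts from c_0 = h_0 = 0, as stated above:
d, n_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_h, d)) for k in 'fioc'}
U = {k: rng.normal(size=(n_h, n_h)) for k in 'fioc'}
b = {k: np.zeros(n_h) for k in 'fioc'}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, W, U, b)
```
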
2.2 Peephole LSTM

$$\begin{aligned}
f_{t} &= \sigma_{g}\left(W_{f} x_{t} + U_{f} c_{t-1} + b_{f}\right) \\
i_{t} &= \sigma_{g}\left(W_{i} x_{t} + U_{i} c_{t-1} + b_{i}\right) \\
o_{t} &= \sigma_{g}\left(W_{o} x_{t} + U_{o} c_{t-1} + b_{o}\right) \\
c_{t} &= f_{t} \circ c_{t-1} + i_{t} \circ \sigma_{c}\left(W_{c} x_{t} + b_{c}\right) \\
h_{t} &= o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}$$
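
Relative to the forget-gate LSTM above, the gates here read the previous cell state $c_{t-1}$ instead of $h_{t-1}$, and the cell candidate has no recurrent term. A self-contained sketch under the same assumptions as the previous snippet (sigmoid gates, tanh activations; names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, c_prev, W, U, b):
    """One forward step of a peephole LSTM: the gates read c_{t-1} instead of h_{t-1}."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ c_prev + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ c_prev + b['i'])        # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ c_prev + b['o'])        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + b['c'])     # note: no U_c h_{t-1} term
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```
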


[Figure: LSTM cell]

3. Training RNNs

3.1 Problem

RNN: the error surface is either very flat or very steep → vanishing / exploding gradients.

[Figure 1: the rugged error surface of an RNN]

3.2 Techniques
  • Clipping the gradients (see the sketch after this list)
  • Advanced optimization techniques
    • NAG (Nesterov accelerated gradient)
    • RMSProp
  • Try LSTM (or other simpler variants)
    • Can deal with gradient vanishing (but not gradient exploding)
    • Memory and input are added (in a plain RNN, the memory is reset at every input)
    • The influence never disappears unless the forget gate is closed (no gradient vanishing as long as the forget gate stays open)
  • Better initialization
    • Vanilla RNN initialized with the identity matrix + ReLU activation function [Quoc V. Le, arXiv'15]
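
For the first technique, here is a minimal sketch of clipping by global gradient norm (pure NumPy; the threshold of 5.0 is an arbitrary illustrative value, and deep-learning frameworks provide built-in equivalents):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Example: a spuriously large gradient gets scaled down to the threshold
grads = [np.array([3.0, 4.0]), np.array([120.0])]
clipped = clip_gradients(grads)
# the global norm of `clipped` is now 5.0; the direction of each gradient is preserved
```
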

References

[1] Recurrent neural network – Wikipedia

[2] Long short-term memory – Wikipedia

[3] Bidirectional Recurrent Neural Networks – Dive into Deep …

[4] Machine Learning course, Hung-yi Lee (李宏毅)
