Study Notes: RNN and LSTM in NLP [Explained Simply]


1. Recurrent neural network

[Figure: Recurrent_neural_network_unfold.svg — an RNN unfolded through time]

1.1 Elman network

$$h_{t} = \sigma_{h}\left(W_{h} x_{t} + U_{h} h_{t-1} + b_{h}\right)$$

$$y_{t} = \sigma_{y}\left(W_{y} h_{t} + b_{y}\right)$$

1.2 Jordan network

$$h_{t} = \sigma_{h}\left(W_{h} x_{t} + U_{h} y_{t-1} + b_{h}\right)$$

$$y_{t} = \sigma_{y}\left(W_{y} h_{t} + b_{y}\right)$$

Variables and functions

  • $x_{t}$: input vector
  • $h_{t}$: hidden layer vector
  • $y_{t}$: output vector
  • $W$, $U$ and $b$: parameter matrices and vector
  • $\sigma_{h}$ and $\sigma_{y}$: activation functions
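
As a concrete illustration, here is a minimal NumPy sketch of one forward step of the Elman network above (the Jordan variant differs only in feeding $y_{t-1}$ instead of $h_{t-1}$ back into the hidden layer). The choice of tanh for $\sigma_{h}$ and softmax for $\sigma_{y}$, as well as all names and sizes, are illustrative assumptions, not fixed by the notes.

```python
import numpy as np

def elman_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One forward step of an Elman RNN (illustrative helper)."""
    # h_t = sigma_h(W_h x_t + U_h h_{t-1} + b_h), with tanh as sigma_h
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)
    # y_t = sigma_y(W_y h_t + b_y), with softmax as an example sigma_y
    z = W_y @ h_t + b_y
    e = np.exp(z - z.max())
    y_t = e / e.sum()
    return h_t, y_t

# Toy usage: unroll over a random sequence of length 5
d, n_h, n_y = 4, 3, 2                                   # arbitrary sizes
rng = np.random.default_rng(0)
W_h, U_h, b_h = rng.normal(size=(n_h, d)), rng.normal(size=(n_h, n_h)), np.zeros(n_h)
W_y, b_y = rng.normal(size=(n_y, n_h)), np.zeros(n_y)
h = np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, y = elman_step(x, h, W_h, U_h, b_h, W_y, b_y)    # a Jordan network would feed y back instead of h
```
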
1.3 Bidirectional RNN


[Figure: bidirectional RNN]

2. Long short-term memory

2.1 LSTM with a forget gate

The compact forms of the equations for the forward pass of an LSTM cell with a forget gate are:
$$\begin{aligned}
f_{t} &= \sigma_{g}\left(W_{f} x_{t} + U_{f} h_{t-1} + b_{f}\right) \\
i_{t} &= \sigma_{g}\left(W_{i} x_{t} + U_{i} h_{t-1} + b_{i}\right) \\
o_{t} &= \sigma_{g}\left(W_{o} x_{t} + U_{o} h_{t-1} + b_{o}\right) \\
\tilde{c}_{t} &= \sigma_{c}\left(W_{c} x_{t} + U_{c} h_{t-1} + b_{c}\right) \\
c_{t} &= f_{t} \circ c_{t-1} + i_{t} \circ \tilde{c}_{t} \\
h_{t} &= o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}$$
where the initial values are $c_{0} = 0$ and $h_{0} = 0$, and the operator $\circ$ denotes the Hadamard product (element-wise product). The subscript $t$ indexes the time step.
Variables

  • $x_{t} \in \mathbb{R}^{d}$: input vector to the LSTM unit
  • $f_{t} \in (0,1)^{h}$: forget gate's activation vector
  • $i_{t} \in (0,1)^{h}$: input/update gate's activation vector
  • $o_{t} \in (0,1)^{h}$: output gate's activation vector
  • $h_{t} \in (-1,1)^{h}$: hidden state vector, also known as the output vector of the LSTM unit
  • $\tilde{c}_{t} \in (-1,1)^{h}$: cell input activation vector
  • $c_{t} \in \mathbb{R}^{h}$: cell state vector
  • $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and $b \in \mathbb{R}^{h}$: weight matrices and bias vector parameters, learned during training, where the superscripts $d$ and $h$ refer to the number of input features and the number of hidden units, respectively.
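
A minimal NumPy sketch of one forward step of this cell, assuming the usual choices $\sigma_{g} =$ logistic sigmoid and $\sigma_{c} = \sigma_{h} = \tanh$. The dictionary layout of the parameters and all names are illustrative conventions, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One forward step of an LSTM cell with a forget gate.

    W, U, b are dicts keyed by 'f', 'i', 'o', 'c' (an illustrative layout);
    sigma_g = sigmoid, sigma_c = sigma_h = tanh.
    """
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input/update gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell input activation
    c_t = f_t * c_prev + i_t * c_tilde                          # '*' is the Hadamard product
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Unrolling starts from c_0 = h_0 = 0, as stated above:
d, n_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_h, d)) for k in 'fioc'}
U = {k: rng.normal(size=(n_h, n_h)) for k in 'fioc'}
b = {k: np.zeros(n_h) for k in 'fioc'}
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, W, U, b)
```
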
2.2 Peephole LSTM

$$\begin{aligned}
f_{t} &= \sigma_{g}\left(W_{f} x_{t} + U_{f} c_{t-1} + b_{f}\right) \\
i_{t} &= \sigma_{g}\left(W_{i} x_{t} + U_{i} c_{t-1} + b_{i}\right) \\
o_{t} &= \sigma_{g}\left(W_{o} x_{t} + U_{o} c_{t-1} + b_{o}\right) \\
c_{t} &= f_{t} \circ c_{t-1} + i_{t} \circ \sigma_{c}\left(W_{c} x_{t} + b_{c}\right) \\
h_{t} &= o_{t} \circ \sigma_{h}\left(c_{t}\right)
\end{aligned}$$
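
Relative to the forget-gate LSTM above, the gates here read the previous cell state $c_{t-1}$ instead of $h_{t-1}$, and the cell candidate has no recurrent term. A self-contained sketch under the same assumptions as the previous snippet (sigmoid gates, tanh activations; names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, c_prev, W, U, b):
    """One forward step of a peephole LSTM: the gates read c_{t-1} instead of h_{t-1}."""
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ c_prev + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ c_prev + b['i'])        # input gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ c_prev + b['o'])        # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ x_t + b['c'])     # note: no U_c h_{t-1} term
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```
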


[Figure: LSTM cell]

3. Training RNNs

3.1 Problem

RNN: the error surface is either very flat or very steep → vanishing / exploding gradients.

[Figure 1: the rugged error surface of an RNN]

3.2 Techniques
  • Clipping the gradients (see the sketch after this list)
  • Advanced optimization techniques
    • NAG (Nesterov accelerated gradient)
    • RMSProp
  • Try LSTM (or other simpler variants)
    • Can deal with gradient vanishing (but not gradient exploding)
    • Memory and input are added (in a plain RNN, the memory is reset at every input)
    • The influence never disappears unless the forget gate is closed (no gradient vanishing as long as the forget gate stays open)
  • Better initialization
    • Vanilla RNN initialized with the identity matrix + ReLU activation function [Quoc V. Le, arXiv'15]
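
For the first technique, here is a minimal sketch of clipping by global gradient norm (pure NumPy; the threshold of 5.0 is an arbitrary illustrative value, and deep-learning frameworks provide built-in equivalents):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Example: a spuriously large gradient gets scaled down to the threshold
grads = [np.array([3.0, 4.0]), np.array([120.0])]
clipped = clip_gradients(grads)
# the global norm of `clipped` is now 5.0; the direction of each gradient is preserved
```
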

References

[1] Recurrent neural network – Wikipedia

[2] Long short-term memory – Wikipedia

[3] Bidirectional Recurrent Neural Networks – Dive into Deep …

[4] Machine Learning course, Hung-yi Lee (李宏毅)
