Classic NLP Papers: Self-Review Notes

(Continuously updated; currently job hunting)

1. Sequence to Sequence Learning with Neural Networks (2014, Google Research)

However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.

This paper is essentially a foundational work, and the lesson in the passage above is generally considered correct: with the source sentence reversed, backpropagation (via SGD, Adam, etc.) has an easier time "establishing communication" between the source and target sentences, which improves overall performance.
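A minimal sketch of the source-reversal trick behind this effect, in plain Python (the function name and toy tokens are illustrative, not from the paper's code):

```python
# Sketch of the source-sentence reversal trick (illustrative names).
# The source tokens are fed to the encoder in reverse order while the
# target is left as-is, so the first source words end up right next to
# the first target words and the minimal time lag shrinks.
def prepare_seq2seq_pair(src_tokens, tgt_tokens):
    return list(reversed(src_tokens)), tgt_tokens

src, tgt = prepare_seq2seq_pair(["a", "b", "c"], ["x", "y", "z"])
print(src)  # ['c', 'b', 'a'] -> encoder input; decoder target stays ['x', 'y', 'z']
```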

 

A much larger hidden state makes the model better, which is why the authors chose a deep LSTM rather than a shallow one...

We used deep LSTMs with 4 layers, with 1000 cells at each layer and 1000 dimensional word embeddings, with an input vocabulary of 160,000 and an output vocabulary of 80,000. Thus the deep LSTM uses 8000 real numbers to represent a sentence.

Experimental detail from the paper: 4 layers × 1000 cells × 2 (hidden state h and cell state c per layer) = 8000 numbers.
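A minimal sketch of this state-size arithmetic, assuming PyTorch's nn.LSTM (the paper used its own implementation; this is only to show where the 8000 numbers come from):

```python
# PyTorch-assumed sketch: a 4-layer LSTM with 1000 cells per layer keeps
# a hidden state h and a cell state c of 1000 numbers per layer,
# i.e. 4 * 1000 * 2 = 8000 real numbers summarizing the sentence.
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=1000, hidden_size=1000, num_layers=4)

src = torch.randn(20, 1, 1000)        # toy sentence: 20 steps, batch 1, 1000-dim embeddings
_, (h_n, c_n) = encoder(src)

print(h_n.shape)                      # torch.Size([4, 1, 1000])
print(c_n.shape)                      # torch.Size([4, 1, 1000])
print(h_n.numel() + c_n.numel())      # 8000
```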

 

  • All of the LSTM's parameters were initialized from a uniform distribution between -0.08 and 0.08.
  • Plain SGD with a fixed learning rate of 0.7 was used for 7.5 epochs in total, halving the learning rate every half epoch after the first 5 epochs; of course, this schedule could be tuned differently today.
  • Although LSTMs tend not to suffer from the vanishing gradient problem, they can have exploding gradients. The authors enforced a hard constraint on the norm of the gradient [10, 25] by scaling it when its norm exceeded a threshold: for each training batch, compute s = ‖g‖, where g is the gradient divided by 128; if s > 5, set g = 5g/s. (A clipping sketch follows this list.)

  • A minibatch of 128 randomly chosen training sentences will contain many short sentences and few long sentences, so much of the computation in the minibatch is wasted.
  • To address this, sentences within each minibatch were kept roughly the same length (length bucketing), which the paper reports gave a ~2x speedup. (A bucketing sketch follows this list.)
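A minimal sketch of the hard gradient-norm constraint described above, assuming PyTorch; the threshold of 5 and the division by the batch size of 128 follow the paper, but the helper itself is illustrative:

```python
# Sketch of the paper's gradient-norm constraint (PyTorch assumed; helper is illustrative).
# g is the gradient divided by the batch size (128); if s = ||g|| > 5, rescale g to 5g/s.
import torch

def clip_gradient_norm_(parameters, batch_size=128, threshold=5.0):
    grads = [p.grad for p in parameters if p.grad is not None]
    for g in grads:
        g.div_(batch_size)                                   # g = gradient / 128
    s = torch.sqrt(sum(g.pow(2).sum() for g in grads))       # s = ||g||
    if s > threshold:
        for g in grads:
            g.mul_(threshold / s)                            # g = 5g/s
    return s
```

In modern PyTorch, torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) performs the same rescaling step after backward().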
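And a rough sketch of the length-bucketing idea (plain Python; the function name is illustrative, not the paper's implementation): sort sentences by length, cut the sorted list into minibatches of 128, then shuffle the batches so each batch holds similar-length sentences while training order stays random.

```python
# Rough sketch of length bucketing (illustrative, not the paper's implementation).
import random

def make_bucketed_batches(sentences, batch_size=128):
    ordered = sorted(sentences, key=len)                     # group similar lengths together
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)                                  # keep training order randomized
    return batches
```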
posted on 2023-09-26 19:11 by 鱼市口