CS224n, lec 10, NMT & Seq2Seq Attn

1990s-2010s: Statistical Machine Translation

  • Core idea: Learn a probabilistic model from data

  • Find the best English sentence y, given the French sentence x, via Bayes' rule (see the decomposition after this list).

  • P(y) is the language model; P(x|y) is the translation model.

  • Learn the translation model by doing statistics on an aligned parallel corpus.

  • Lots of hand-crafted rules!
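As a quick sketch of the Bayes-rule decomposition mentioned above (the standard SMT formulation, with x the French sentence and y an English candidate):

$$
\hat{y} \;=\; \arg\max_{y} P(y \mid x) \;=\; \arg\max_{y} P(x \mid y)\, P(y)
$$

P(y), the language model, captures how fluent the English is; P(x|y), the translation model, captures how well it matches the French, and is what the statistics on the aligned parallel corpus estimate.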

NMT

The first NMT model is end-to-end: an encoder-decoder architecture in which the decoder is a conditional language model, with beam search used to find the most suitable translation. It is more fluent and needs less handcrafted engineering, but it is less interpretable and harder to control (e.g., by adding rules).
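As a rough illustration of the beam-search part, here is a minimal sketch. It assumes a hypothetical step(prefix) function that wraps the decoder and returns the log-probabilities of candidate next tokens; the names and parameters are mine, not from the lecture.

```python
def beam_search(step, bos, eos, beam_size=5, max_len=50):
    """Keep the beam_size best partial translations at every step.

    step(prefix) is assumed to return a dict {token: log_prob} for the next
    token given the tokens decoded so far (i.e. it wraps the decoder).
    """
    beams = [(0.0, [bos])]          # each hypothesis: (total log-prob, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            if prefix[-1] == eos:   # hypothesis already complete
                finished.append((score, prefix))
                continue
            for token, logp in step(prefix).items():
                candidates.append((score + logp, prefix + [token]))
        if not candidates:
            break
        # keep only the beam_size highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    finished.extend(beams)
    # length-normalise so that short hypotheses are not unfairly favoured
    best = max(finished, key=lambda c: c[0] / len(c[1]))
    return best[1]
```

In a real NMT system, step would run one step of the decoder (with attention) and return the log-softmax over the target vocabulary; greedy decoding is just the beam_size = 1 special case.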

Evaluation

BLEU compares the machine-written translation to one or several human-written translations and computes a similarity score based on n-gram precision. Good but imperfect.
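To make the n-gram precision idea concrete, here is a toy sentence-level sketch against a single reference (modified precision clipped by reference counts, times a brevity penalty). Real BLEU is computed at the corpus level, with multiple references and proper smoothing; the helper names here are my own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU against a single reference translation."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # "modified" precision: clip candidate counts by reference counts
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # floor at a tiny value so a zero count does not give log(0);
        # real BLEU handles this with corpus-level counts / smoothing
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty: punish translations shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(toy_bleu("the cat sat on the mat".split(),
               "the cat is on the mat".split()))
```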

ATTENTION!

Seq2Seq models are good, but they squeeze all of the source information into a single state, which becomes an information bottleneck. We need some structural improvements to tackle this problem. Here we have attention.

Roughly speaking, attention is an extra layer: at each decoder step, we assign a weight to every encoder hidden state by taking its dot product with the current decoder hidden state, sum the encoder states with those weights, concatenate the result with the decoder state, and use that concatenation to predict the next output word.

Formally we have

 
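A sketch of the standard dot-product attention from the lecture, in my own notation: h_1, ..., h_N are the encoder hidden states and s_t is the decoder hidden state at step t.

$$
e^t_i = s_t^{\top} h_i, \qquad \alpha^t = \mathrm{softmax}(e^t), \qquad a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i
$$

The concatenation [a_t; s_t] then takes the place of the plain decoder state when predicting the next word, exactly as described above.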

There are a variety of advantages to the attention mechanism:

  1. solves the information bottleneck problem

  2. helps with the vanishing gradient problem: it adds a highway-like shortcut for gradients to flow directly back to distant encoder states

  3. provides some interpretability (remember image captioning?): the attention weights give us a glimpse of what the network is looking at. Another example: we get soft alignment of the parallel sentences for free (see the sketch below)! Very cool!
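As a small sketch of the "free alignment" point: compute the dot-product attention weights over the encoder states, then read a hard alignment off them by taking the argmax source position for every target step (numpy only; the shapes and names are mine).

```python
import numpy as np

def dot_product_attention(decoder_states, encoder_states):
    """decoder_states: (T, d), encoder_states: (N, d).

    Returns the (T, N) matrix of attention weights: one distribution
    over source positions for every target step.
    """
    scores = decoder_states @ encoder_states.T           # (T, N) dot products
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # softmax over sources

# toy example: 3 target steps attending over 4 source positions
rng = np.random.default_rng(0)
dec = rng.normal(size=(3, 8))
enc = rng.normal(size=(4, 8))
alpha = dot_product_attention(dec, enc)

# "free" alignment: the source word each target word attends to most
alignment = alpha.argmax(axis=1)
print(alpha.round(2))
print(alignment)   # target step t is (softly) aligned to source position alignment[t]
```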


Next time: more attention! I am super excited!

posted @ 2018-04-22 20:32  ichneumon