# Various Sequence To Sequence Architectures

## Basic Models

### Sequence to sequence model

An encoder network reads the input sequence into a vector; a decoder network then generates the output sequence from it.

### Image captioning

Use a CNN (e.g. AlexNet) as the encoder to get a 4096-dimensional feature vector, then feed it to an RNN that generates the caption.

### Picking the Most Likely Sentence

Translate a French sentence $x$ into the most likely English sentence $y$, i.e. find

$\argmax_{y^{<1>}, \dots, y^{<T_y>}} P(y^{<1>}, \dots, y^{<T_y>} | x)$

• Why not a greedy search (picking the most likely word one at a time)? Because the most likely first few words need not begin the most likely whole sentence, and greedy choices also tend to drift toward verbose, long outputs.

### Beam Search

• Set the beam width $B = 3$ and find the $3$ most likely first English words.
• For each of them, consider every candidate second word and keep the $B$ most likely (first word, second word) pairs.
• Repeat until $<EOS>$ is generated.

If $B = 1$, beam search reduces to greedy search.
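The steps above can be sketched in Python. Here `next_word_log_probs` is a hypothetical stand-in for the decoder: given the input and a partial sentence, it returns the log probability of each candidate next word.

```python
import math

def beam_search(x, next_word_log_probs, B=3, max_len=20, eos="<EOS>"):
    """Sketch of beam search: keep the B highest-scoring partial
    sentences, extend each by one word per step, stop at <EOS>."""
    beams = [([], 0.0)]            # (prefix, cumulative log P)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in next_word_log_probs(x, prefix).items():
                candidates.append((prefix + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:   # keep only the top B
            if prefix[-1] == eos:
                completed.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    completed.extend(beams)        # fall back to unfinished beams
    return max(completed, key=lambda c: c[1])
```

With `B = 1` the top-`B` cut keeps a single hypothesis per step, which is exactly greedy search.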

### Length normalization

$\argmax_{y} \prod_{t = 1}^{T_y} P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>})$

Each $P$ is much less than $1$ (close to $0$), so the product underflows numerically; take the $\log$ instead:

$\argmax_{y} \sum_{t = 1}^{T_y} \log P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>})$

This objective tends to prefer short sentences: every extra word adds another negative log term.

So normalize by the length ($\alpha$ is a hyperparameter: $\alpha = 1$ is full normalization, $\alpha = 0$ none, and something in between such as $\alpha = 0.7$ is typical):

$\argmax_{y} \frac 1 {T_y^{\alpha}} \sum_{t = 1}^{T_y} \log P(y^{<t>}|x, y^{<1>}, \dots, y^{<t - 1>})$
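As a sketch, the normalized objective can be computed from the per-token probabilities $P(y^{<t>} | x, y^{<1>}, \dots, y^{<t-1>})$ of one candidate (the function name and the $\alpha = 0.7$ default are illustrative):

```python
import math

def normalized_score(token_probs, alpha=0.7):
    """Length-normalized log-likelihood: sum the log probabilities
    of the T_y tokens and divide by T_y ** alpha."""
    T_y = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return log_sum / (T_y ** alpha)
```

Dividing by $T_y^{\alpha}$ shrinks the penalty that each additional word adds, so longer candidates are no longer ruled out automatically.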

### Beam search discussion

• large $B$ : better result, slower
• small $B$ : worse result, faster

Let $y^*$ be a high-quality human translation and $\hat y$ the algorithm's output. On a mistranslated example, compare their probabilities under the model:

• $P(y^* | x) > P(\hat y | x)$ : beam search is at fault (the model prefers $y^*$, but the search failed to find it)
• $P(y^* | x) \le P(\hat y | x)$ : the RNN model is at fault (it scores a worse translation at least as high)
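The attribution rule above, written as a small helper (a hypothetical sketch, working in log space as in the previous section):

```python
def blame(log_p_star, log_p_hat):
    """Attribute one dev-set error: compare log P(y*|x) with
    log P(y_hat|x), both computed by the model."""
    if log_p_star > log_p_hat:
        return "beam search"   # model prefers y*, search missed it
    return "RNN model"         # model scores its own output >= y*
```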

## BLEU (Bilingual Evaluation Understudy) Score

Given one or more good reference translations, score the machine output $\hat y$ by its modified n-gram precision:

$p_n = \frac{\sum_{\text{n-gram} \in \hat y} \text{Count}_{\text{clip}}(\text{n-gram})} {\sum_{\text{n-gram} \in \hat y} \text{Count}(\text{n-gram})}$

where $\text{Count}_{\text{clip}}$ caps an n-gram's count at the maximum number of times it appears in any single reference.

### BLEU details

Combine the four precisions with the brevity penalty (BP):

$\text{BLEU} = \text{BP} \cdot \exp(\frac{1}{4} \sum_{n = 1}^4 \log p_n)$

$BP = \begin{cases} 1 & \text{if MT\_output\_length} > \text{reference\_output\_length}\\ \exp(1 - \text{reference\_output\_length}/\text{MT\_output\_length}) & \text{otherwise} \end{cases}$

The penalty exists because we don't want translations that are too short: short outputs get high precision too easily.
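Putting the pieces together, a minimal BLEU sketch (simplified: the reference length $r$ here is the shortest reference, whereas real BLEU uses the closest-length reference):

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clipped_precision(candidate, references, n):
    """p_n: candidate n-gram counts, each clipped by the max count
    of that n-gram in any single reference."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, N=4):
    """BP * exp(mean of log p_n for n = 1..N)."""
    c = len(candidate)
    r = min(len(ref) for ref in references)   # simplified choice of r
    bp = 1.0 if c > r else math.exp(1 - r / c)
    ps = [clipped_precision(candidate, references, n) for n in range(1, N + 1)]
    if min(ps) == 0:          # any zero precision sends the geometric mean to 0
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in ps) / N)
```

Clipping is what stops degenerate outputs like "the the the the" from scoring well: each "the" only counts as often as it occurs in a reference.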

## Attention Model Intuition

It is hard for the network to memorize a whole long sentence in a single fixed vector, so instead compute attention weights and predict each output word from a context built around the relevant part of the input.

## Attention Model

Use a BiRNN or BiLSTM.

$$\begin{aligned} a^{<t'>} &= (\overrightarrow a^{<t'>}, \overleftarrow a^{<t'>})\\ \sum_{t'} \alpha^{<t, t'>} &= 1\\ c^{<t>} &= \sum_{t'} \alpha^{<t, t'>} a^{<t'>} \end{aligned}$$

### Computing attention

$$\begin{aligned} \alpha^{<t, t'>} &= \text{amount of "attention" } y^{<t>} \text{ should pay to } a^{<t'>}\\ &= \frac{\exp(e^{<t, t'>})}{\sum_{t'' = 1}^{T_x} \exp(e^{<t, t''>})} \end{aligned}$$

Train a very small network to learn the scoring function: $e^{<t, t'>}$ is computed from $s^{<t - 1>}$ (the decoder's previous hidden state) and $a^{<t'>}$.
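The softmax over the scores, and the context vector it yields, in a minimal sketch (plain lists; `e` holds the scores $e^{<t, t'>}$ for one output step $t$):

```python
import math

def attention_weights(e):
    """alpha^{<t,t'>} = exp(e^{<t,t'>}) / sum_{t'} exp(e^{<t,t'>}).
    Subtracting the max keeps exp() numerically stable."""
    m = max(e)
    exps = [math.exp(v - m) for v in e]
    s = sum(exps)
    return [v / s for v in exps]

def context(alpha, a):
    """c^{<t>} = sum_{t'} alpha^{<t,t'>} a^{<t'>}, where a is the
    list of encoder activation vectors."""
    dim = len(a[0])
    return [sum(w * vec[i] for w, vec in zip(alpha, a)) for i in range(dim)]
```

By construction the weights sum to $1$, matching the constraint above.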

The complexity is $\mathcal O(T_x T_y)$, which is expensive (quadratic in the sequence lengths).

# Speech Recognition - Audio Data

## Speech recognition

$x(\text{audio clip}) \to y(\text{transcript})$

### Attention model for speech recognition

Generate the transcript character by character.

### CTC cost for speech recognition

CTC (Connectionist Temporal Classification): the output sequence has the same length as the (long) audio input, using repeated characters and blank tokens, e.g.

"ttt_h_eee___ ____qqq$\dots$" $\rightarrow$ "the quick brown fox"

Basic rule: collapse repeated characters that are not separated by a blank ("_"), then remove the blanks.
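The collapse rule as code: first merge runs of the same symbol, then drop blanks, so a blank between repeats preserves genuine double letters.

```python
def ctc_collapse(seq, blank="_"):
    """Collapse a CTC output string: 'ttt_h_eee' -> 'the'."""
    out = []
    prev = None
    for ch in seq:
        if ch != prev:          # skip repeats of the previous symbol
            if ch != blank:     # then drop the blank tokens
                out.append(ch)
        prev = ch
    return "".join(out)
```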

### Trigger Word Detection

Label the training targets so that the output is $1$ for the time steps just after the trigger word is spoken, and $0$ elsewhere.
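One common labeling scheme, as a sketch (the function and the `ones_after` window size are illustrative assumptions, beyond the notes' "let the output be $1$s"): after each step where the trigger word ends, set the next several labels to $1$ so the positive class is not a single frame drowned in zeros.

```python
def trigger_labels(num_steps, trigger_end_steps, ones_after=50):
    """Build the 0/1 target sequence for trigger word detection.
    trigger_end_steps: time steps at which the trigger word ends."""
    y = [0] * num_steps
    for t in trigger_end_steps:
        for k in range(t, min(t + ones_after, num_steps)):
            y[k] = 1
    return y
```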

posted @ 2021-08-23 23:21 zjp_shadow