Sequence Models - Recurrent Neural Networks

Examples of sequence data:

  • Speech recognition
  • Music generation
  • Sentiment classification
  • DNA sequence analysis
  • Machine translation
  • Video activity recognition
  • Named entity recognition

Recurrent Neural Network Model

Why not a standard network?

  • Inputs and outputs can have different lengths in different examples.
  • It doesn't share features learned across different positions of the text.

Weakness of RNN

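A unidirectional RNN only uses information from earlier in the sequence when making a prediction at time \(t\).

(A bidirectional RNN addresses this by also using the later inputs.)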

Forward Propagation


\[\begin{aligned} a^{<t>} &= g_1(w_{aa} a^{<t - 1>} + w_{ax} x^{<t>} + b_a)\\ &= g_1(w_a \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_a)\\ \hat y^{<t>} &= g_2(w_{ya} a^{<t>} + b_y) \end{aligned} \]

where \(w_a = [w_{aa} \mid w_{ax}]\) places the two weight matrices side by side.

  • the activation \(g_1\) is often \(\tanh\) in an RNN
  • \(g_2\) depends on the type of output:
    • binary classification problem: \(\text{sigmoid}\)
    • k-way classification problem: \(\text{softmax}\)
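
The forward-propagation equations above can be sketched in plain Python as follows. This is only an illustration: the helper names (`matvec`, `softmax`, `rnn_step`), the toy sizes and the weight values are assumptions, not part of the original notes.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(a_prev, x_t, W_a, b_a, W_ya, b_y):
    """One forward step; W_a acts on the stacked vector [a_prev; x_t]."""
    stacked = a_prev + x_t  # list concatenation plays the role of vertical stacking
    a_t = [math.tanh(s + b) for s, b in zip(matvec(W_a, stacked), b_a)]
    y_hat = softmax([s + b for s, b in zip(matvec(W_ya, a_t), b_y)])
    return a_t, y_hat

# Toy sizes (assumed): 2 hidden units, 3 input features, 2 output classes.
W_a = [[0.1, -0.2, 0.3, 0.0, 0.5],
       [0.4, 0.1, -0.3, 0.2, -0.1]]
b_a = [0.0, 0.1]
W_ya = [[0.5, -0.5], [-0.5, 0.5]]
b_y = [0.0, 0.0]

a = [0.0, 0.0]  # initial activation, taken to be the zero vector
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    a, y_hat = rnn_step(a, x_t, W_a, b_a, W_ya, b_y)
```

The same weights are reused at every time step, which is what lets the network share features across positions.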

Back Propagation


\[\begin{aligned} \mathcal L^{<t>} (\hat y^{<t>}, y^{<t>}) &= -y^{<t>} \log \hat y^{<t>} - (1 - y^{<t>}) \log (1 - \hat y^{<t>})\\ \mathcal L(\hat y, y) &= \sum_{t = 1}^{T_y} \mathcal L^{<t>} (\hat y^{<t>}, y^{<t>}) \end{aligned} \]
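
As a small sketch of the loss above, the total loss is just the per-step cross-entropy summed over time. The function names and the sample predictions are made up for illustration.

```python
import math

def step_loss(y_hat, y):
    """Binary cross-entropy for one time step (y is 0 or 1, 0 < y_hat < 1)."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Total loss: the sum of the per-step losses over all time steps."""
    return sum(step_loss(p, y) for p, y in zip(y_hats, ys))

# Made-up predictions and labels for a length-3 sequence.
total = sequence_loss([0.9, 0.2, 0.5], [1, 0, 1])
```

Backpropagation through time then differentiates this summed loss with respect to the shared parameters.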

Different Types of RNN

  • many-to-one architecture:

    Sentiment Classification

  • one-to-many architecture

    Music Generation

  • many-to-many architecture:

    Machine Translation: input and output can be different lengths (encoder, decoder).

Language Model and Sequence Generation

  • Language modelling

    gives the probability of a sentence: \(P(\text{sentence}) = ?\)

    basic job: estimates the probability of sequences \(P(y^{<1>}, \dots, y^{<T_y>})\)

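  • Training set: a large corpus of English text.

    • add \(\text{<EOS>}\) at the end of each sentence.
    • replace unknown words with \(\text{<UNK>}\)
  • Training with RNN model

    feed the previous target as the current input, i.e. replace \(x^{<i>}\) with \(y^{<i - 1>}\), so that each step predicts the next word given the preceding ones.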

    \[P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>}) P(y^{<2>} | y^{<1>}) P(y^{<3>} | y^{<1>}, y^{<2>}) \]
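
The chain-rule factorisation above can be sketched as a sum of log conditional probabilities. Here `cond_prob` is a hypothetical stand-in for the RNN's softmax output, and the uniform toy model and vocabulary are assumptions chosen only to make the arithmetic easy to check.

```python
import math

def sentence_log_prob(tokens, cond_prob):
    """Sum log P(y_t | y_1..y_{t-1}) over the sentence (chain rule)."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        log_p += math.log(cond_prob(tokens[:t], token))
    return log_p

# Hypothetical model: uniform over a 4-token vocabulary, ignoring the prefix.
VOCAB = ["cats", "sleep", "<UNK>", "<EOS>"]

def uniform_model(prefix, token):
    return 1.0 / len(VOCAB)

log_p = sentence_log_prob(["cats", "sleep", "<EOS>"], uniform_model)
prob = math.exp(log_p)  # equals (1/4) ** 3 for this toy model
```

Working in log space avoids underflow when many small probabilities are multiplied.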

Sampling novel sequences

  • Sampling a sequence from a trained RNN

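    Generate the sentence word by word: sample a word from the softmax output at each step, feed it in as the next input, and stop once \(\text{<EOS>}\) is sampled.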

  • Character-level language model

    \(\text{Vocabulary = [a, b, c, \dots]}\)
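
A minimal sketch of the sampling loop, assuming a function `next_dist` that returns a probability for each vocabulary entry given the tokens generated so far. In a real setting that would be the trained RNN; `toy_dist` below is a made-up stand-in.

```python
import random

def sample_sequence(next_dist, vocab, rng, max_len=20, eos="<EOS>"):
    """Sample tokens one at a time, feeding each sample back in as context."""
    tokens = []
    while len(tokens) < max_len:
        probs = next_dist(tokens)  # distribution over vocab given the prefix
        token = rng.choices(vocab, weights=probs, k=1)[0]
        tokens.append(token)
        if token == eos:
            break
    return tokens

VOCAB = ["a", "b", "c", "<EOS>"]

def toy_dist(prefix):
    # Hypothetical model: <EOS> becomes more likely as the sequence grows.
    p_eos = min(0.9, 0.1 * (len(prefix) + 1))
    rest = (1.0 - p_eos) / 3
    return [rest, rest, rest, p_eos]

sampled = sample_sequence(toy_dist, VOCAB, random.Random(0))
```

The `max_len` cap is a safeguard for the case where the end-of-sequence token is never drawn.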

Vanishing gradients with RNNs

Basic RNNs are not very good at capturing long-range dependencies.

Exploding gradients can also occur in backpropagation (commonly addressed by gradient clipping).
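
Gradient clipping, a common remedy for exploding gradients, can be sketched as follows. Clipping by the overall L2 norm is one common variant and is assumed here; the function name and threshold are illustrative.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient of norm 50 is rescaled to norm 5; its direction is unchanged.
clipped = clip_by_norm([30.0, -40.0], max_norm=5.0)
```

Clipping bounds the size of each update but does nothing for vanishing gradients, which is what the gated units below are meant to help with.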

Gated Recurrent Unit (GRU)



\(c = \text{memory cell}\) and \(c^{<t>} = a^{<t>}\)

\[\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1]\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + (1 - \Gamma_u) \times c^{<t - 1>} & (\text{element-wise}) \end{aligned} \]

\(\tilde{c}^{<t>}\) is a candidate for replacing \(c^{<t>}\)

\(\Gamma_u\) is the update gate; in practice it is very close to either \(0\) or \(1\) most of the time.

If \(\Gamma_u \approx 0\), then \(c^{<t>} \approx c^{<t - 1>}\), so the memory cell value is maintained almost exactly, even across many time steps. This helps the GRU:

  • address the vanishing gradient problem
  • learn even very long-range dependencies
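
To see how the update gate preserves memory, here is a sketch of the simplified GRU step for a single scalar memory cell. The weights are hypothetical; a strongly negative `b_u` is chosen so that the gate stays near 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(c_prev, x_t, w_c, b_c, w_u, b_u):
    """Simplified GRU step for one memory cell (scalar c and x)."""
    c_tilde = math.tanh(w_c[0] * c_prev + w_c[1] * x_t + b_c)  # candidate value
    gamma_u = sigmoid(w_u[0] * c_prev + w_u[1] * x_t + b_u)    # update gate
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev

# With the gate near 0, the cell keeps its value across 100 time steps.
c = 0.8
for _ in range(100):
    c = gru_step(c, x_t=1.0, w_c=(0.5, 0.5), b_c=0.0, w_u=(0.0, 0.0), b_u=-20.0)
```

Flipping the bias to a large positive value makes the gate close to 1, in which case the cell is overwritten by the candidate at each step.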

Full GRU

\[\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} \Gamma_r \times c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1]\\ \Gamma_r &= \sigma(w_r \begin{bmatrix} c^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_r)\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + (1 - \Gamma_u) \times c^{<t - 1>} &(\text{element-wise})\\ a^{<t>} &= c^{<t>} \end{aligned} \]

\(\Gamma_r\) is the relevance gate: it indicates how relevant \(c^{<t - 1>}\) is for computing the candidate \(\tilde{c}^{<t>}\).

Long Short Term Memory (LSTM)


\[\begin{aligned} \tilde{c}^{<t>} &= \tanh (w_c \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_c)\\ \Gamma_u &= \sigma(w_u \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_u) \in [0, 1] &(\text{update})\\ \Gamma_f &= \sigma(w_f \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_f) &(\text{forget})\\ \Gamma_o &= \sigma(w_o \begin{bmatrix} a^{<t - 1>}\\ x^{<t>} \end{bmatrix} + b_o) &(\text{output})\\ c^{<t>} &= \Gamma_u \times \tilde{c}^{<t>} + \Gamma_f \times c^{<t - 1>} &(\text{element-wise})\\ a^{<t>} &= \Gamma_o \times \tanh(c^{<t>}) \end{aligned} \]

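Peephole connection: the gates also take \(c^{<t - 1>}\) as input. The effect is element-wise: the fifth element of \(c^{<t - 1>}\) affects only the fifth element of each gate.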
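
The LSTM update equations (without peephole connections) can be sketched in plain Python as below. The parameter dictionary layout, the helper names and the all-zero parameter values are assumptions made so that the result is easy to verify by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(a_prev, c_prev, x_t, p):
    """One LSTM step with update, forget and output gates."""
    v = a_prev + x_t  # stacked [a_prev; x_t]
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], v)]
    g_u = [sigmoid(z) for z in affine(p["W_u"], p["b_u"], v)]
    g_f = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], v)]
    g_o = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], v)]
    # Element-wise: unit i of the cell only mixes with unit i of each gate.
    c_t = [u * ct + f * cp for u, ct, f, cp in zip(g_u, c_tilde, g_f, c_prev)]
    a_t = [o * math.tanh(c) for o, c in zip(g_o, c_t)]
    return a_t, c_t

# All-zero parameters (illustrative): every gate is 0.5 and the candidate is 0.
n_a, n_x = 2, 3
zeros = [[0.0] * (n_a + n_x) for _ in range(n_a)]
params = {}
for name in ("c", "u", "f", "o"):
    params["W_" + name] = zeros
    params["b_" + name] = [0.0] * n_a

a_t, c_t = lstm_step([0.0, 0.0], [2.0, -4.0], [1.0, 0.0, 0.0], params)
```

Unlike the GRU, the update and forget gates are separate, so the cell can keep the old value and add the new candidate at the same time.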

Bidirectional RNN

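\(\overrightarrow a^{<t>}\): forward activations, computed from left to right; \(\overleftarrow a^{<t>}\): backward activations, computed from right to left. Both are part of forward propagation.

Acyclic graph

\[\hat y^{<t>} = g(w_y \begin{bmatrix} \overrightarrow a^{<t>}\\ \overleftarrow{a}^{<t>} \end{bmatrix} + b_y) \]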

BRNN with LSTM blocks would be a pretty reasonable first thing to try
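
A toy sketch of the two passes in a bidirectional RNN, using scalar tanh cells. For brevity both directions share the same hypothetical weights `w_a` and `w_x`; in practice each direction would have its own parameters.

```python
import math

def brnn_states(xs, w_a=0.5, w_x=1.0):
    """Return a (forward, backward) activation pair for each time step."""
    T = len(xs)
    fwd, bwd = [0.0] * T, [0.0] * T
    a = 0.0
    for t in range(T):  # left to right
        a = math.tanh(w_a * a + w_x * xs[t])
        fwd[t] = a
    a = 0.0
    for t in reversed(range(T)):  # right to left
        a = math.tanh(w_a * a + w_x * xs[t])
        bwd[t] = a
    return list(zip(fwd, bwd))

# A single nonzero input in the middle of the sequence.
states = brnn_states([0.0, 1.0, 0.0])
```

At the first step only the backward activation has seen the nonzero input, and at the last step only the forward one has, so a prediction built from both can use the whole sequence. The cost is that the entire sequence must be available before any prediction is made.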

Deep RNNs


Homework: Improvise a Jazz Solo with an LSTM Network

You would like to create a jazz music piece specially for a friend's birthday. However, you don't know how to play any instruments, or how to compose music. Fortunately, you know deep learning and will solve this problem using an LSTM network!

You will train a network to generate novel jazz solos in a style representative of a body of performed work. 😎🎷

There's something coming into me when I saw it... Aye...

Exercise 1 - djmodel

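from tensorflow.keras.layers import Input, LSTM, Dense, Reshape, RepeatVector
from tensorflow.keras.models import Model

n_values = 90 # number of music values
n_a = 64      # hidden state size (assumed here; the value is not shown in this post)
reshaper = Reshape((1, n_values))                  # Used in Step 2.B of djmodel(), below
LSTM_cell = LSTM(n_a, return_state = True)         # Used in Step 2.C
densor = Dense(n_values, activation='softmax')     # Used in Step 2.D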

def djmodel(Tx, LSTM_cell, densor, reshaper):
    """
    Implement the djmodel composed of Tx LSTM cells where each cell is responsible
    for learning the following note based on the previous note and context.
    Each cell has the following schema: 
            [X_{t}, a_{t-1}, c0_{t-1}] -> RESHAPE() -> LSTM() -> DENSE()

    Arguments:
        Tx -- length of the sequences in the corpus
        LSTM_cell -- LSTM layer instance
        densor -- Dense layer instance
        reshaper -- Reshape layer instance

    Returns:
        model -- a keras instance model with inputs [X, a0, c0]
    """
    # Get the shape of input values
    n_values = densor.units
    # Get the number of the hidden state vector
    n_a = LSTM_cell.units
    # Define the input layer and specify the shape
    X = Input(shape=(Tx, n_values)) 
    # Define the initial hidden state a0 and initial cell state c0
    # using `Input`
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    ### START CODE HERE ### 
    # Step 1: Create empty list to append the outputs while you iterate (≈1 line)
    outputs = []
    # Step 2: Loop over tx
    for t in range(Tx):
        # Step 2.A: select the "t"th time step vector from X. 
        x = X[:, t, :]
        # Step 2.B: Use reshaper to reshape x to be (1, n_values) (≈1 line)
        x = reshaper(x)
        # Step 2.C: Perform one step of the LSTM_cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # Step 2.D: Apply densor to the hidden state output of LSTM_Cell
        out = densor(a)
        # Step 2.E: add the output to "outputs"
        outputs.append(out)
    # Step 3: Create model instance
    model = Model(inputs=[X, a0, c0], outputs=outputs)
    ### END CODE HERE ###
    return model

We will use:

  • optimizer: Adam optimizer
  • Loss function: categorical cross-entropy (for multi-class classification)
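from tensorflow.keras.optimizers import Adam

# Note: `learning_rate` replaces the older `lr` spelling; recent Keras releases may not accept `decay`.
opt = Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit([X, a0, c0], list(Y), epochs=100, verbose=0)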

Exercise 2 - music_inference_model

# GRADED FUNCTION: music_inference_model
import tensorflow as tf

def music_inference_model(LSTM_cell, densor, Ty=100):
    """
    Uses the trained "LSTM_cell" and "densor" from model() to generate a sequence of values.

    Arguments:
    LSTM_cell -- the trained "LSTM_cell" from model(), Keras layer object
    densor -- the trained "densor" from model(), Keras layer object
    Ty -- integer, number of time steps to generate

    Returns:
    inference_model -- Keras model instance
    """
    # Get the shape of input values
    n_values = densor.units
    # Get the number of the hidden state vector
    n_a = LSTM_cell.units
    # Define the input of your model with a shape 
    x0 = Input(shape=(1, n_values))
    # Define s0, initial hidden state for the decoder LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    x = x0

    ### START CODE HERE ###
    # Step 1: Create an empty list of "outputs" to later store your predicted values (≈1 line)
    outputs = []
    # Step 2: Loop over Ty and generate a value at every time step
    for t in range(Ty):
        # Step 2.A: Perform one step of LSTM_cell. Use "x", not "x0" (≈1 line)
        a, _, c = LSTM_cell(x, initial_state = [a, c])
        # Step 2.B: Apply Dense layer to the hidden state output of the LSTM_cell (≈1 line)
        out = densor(a)
        # Step 2.C: Append the prediction "out" to "outputs". out.shape = (None, 90) (≈1 line)
        outputs.append(out)
        # Step 2.D:
        # Select the next value according to "out",
        # Set "x" to be the one-hot representation of the selected value
        # See instructions above.
        x = tf.math.argmax(out, axis = -1)
        x = tf.one_hot(x, depth = n_values)
        # Step 2.E: 
        # Use RepeatVector(1) to convert x into a tensor with shape=(None, 1, 90)
        x = RepeatVector(1)(x)
    # Step 3: Create model instance with the correct "inputs" and "outputs" (≈1 line)
    inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)
    ### END CODE HERE ###
    return inference_model
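
The feedback loop in Step 2 (take the argmax of the output, one-hot encode it, feed it back in as the next input) can be sketched without Keras. Here `step_fn` is a hypothetical stand-in for one `LSTM_cell` + `densor` step, and `toy_step` is a made-up model that always prefers the value after the current one.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def one_hot(index, depth):
    return [1.0 if i == index else 0.0 for i in range(depth)]

def greedy_generate(step_fn, x0, state0, Ty):
    """Run Ty steps, feeding the one-hot argmax of each output back in."""
    x, state, indices = x0, state0, []
    for _ in range(Ty):
        probs, state = step_fn(x, state)
        idx = argmax(probs)
        indices.append(idx)
        x = one_hot(idx, len(probs))
    return indices

# Hypothetical stand-in for the trained network.
def toy_step(x, state):
    cur = argmax(x)
    n = len(x)
    probs = [0.1 / (n - 1)] * n
    probs[(cur + 1) % n] = 0.9
    return probs, state

indices = greedy_generate(toy_step, one_hot(0, 4), state0=None, Ty=6)
```

Because the selection is a deterministic argmax, the same initial input and state always produce the same sequence; sampling from the output distribution instead would give more varied results.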

Exercise 3 - predict_and_sample

# GRADED FUNCTION: predict_and_sample
import numpy as np
from tensorflow.keras.utils import to_categorical

def predict_and_sample(inference_model, x_initializer = x_initializer, a_initializer = a_initializer, 
                       c_initializer = c_initializer):
    """
    Predicts the next value of values using the inference model.

    Arguments:
    inference_model -- Keras model instance for inference time
    x_initializer -- numpy array of shape (1, 1, 90), one-hot vector initializing the values generation
    a_initializer -- numpy array of shape (1, n_a), initializing the hidden state of the LSTM_cell
    c_initializer -- numpy array of shape (1, n_a), initializing the cell state of the LSTM_cell

    Returns:
    results -- numpy-array of shape (Ty, 90), matrix of one-hot vectors representing the values generated
    indices -- numpy-array of shape (Ty, 1), matrix of indices representing the values generated
    """
    n_values = x_initializer.shape[2]
    ### START CODE HERE ###
    # Step 1: Use your inference model to predict an output sequence given x_initializer, a_initializer and c_initializer.
    pred = inference_model.predict([x_initializer, a_initializer, c_initializer])
    # Step 2: Convert "pred" into an np.array() of indices with the maximum probabilities
    indices = np.argmax(pred, axis = -1)
    # Step 3: Convert indices to one-hot vectors, the shape of the results should be (Ty, n_values)
    results = to_categorical(indices, num_classes = x_initializer.shape[-1])
    ### END CODE HERE ###
    return results, indices
posted @ 2021-08-13 23:34  zjp_shadow