http://blog.csdn.net/wuzqChom/article/details/77918780
https://zhuanlan.zhihu.com/p/27769667

tf.contrib.legacy_seq2seq.embedding_attention_seq2seq is called from tf.contrib.legacy_seq2seq.model_with_buckets inside seq2seq_model. model_with_buckets computes the outputs and the loss, while embedding_attention_seq2seq is responsible for computing the outputs (and the state). Its interface looks like this:

tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
    encoder_inputs,   # shape=[encoder_size, batch_size] = (6, 32); each element is a (?,) tensor
    decoder_inputs,   # shape=[decoder_size, batch_size] = (6, 32)
    cell,             # the network structure returned by MultiRNNCell
    num_encoder_symbols=source_vocab_size,  # vocabulary size, 10
    num_decoder_symbols=target_vocab_size,  # vocabulary size, 10
    embedding_size=size,                    # embedding dimension, 32
    output_projection=output_projection,    # sampled softmax is used, so this is a pair (w, b)
    feed_previous=do_decode,                # train or predict
    dtype=dtype)
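
For context, this is roughly how the call is wired up from model_with_buckets (a sketch based on the standard translate example, not copied from this code base; seq2seq_f, targets, target_weights, buckets and softmax_loss_function are assumed names, and cell, size, etc. come from the surrounding model code):

import tensorflow as tf

# Sketch: wrap the call above so model_with_buckets can invoke it per bucket.
def seq2seq_f(encoder_inputs, decoder_inputs, do_decode):
    return tf.contrib.legacy_seq2seq.embedding_attention_seq2seq(
        encoder_inputs, decoder_inputs, cell,
        num_encoder_symbols=source_vocab_size,
        num_decoder_symbols=target_vocab_size,
        embedding_size=size,
        output_projection=output_projection,
        feed_previous=do_decode,
        dtype=dtype)

# model_with_buckets builds one unrolled graph per bucket and returns outputs and losses.
outputs, losses = tf.contrib.legacy_seq2seq.model_with_buckets(
    encoder_inputs, decoder_inputs, targets, target_weights, buckets,
    lambda x, y: seq2seq_f(x, y, False),
    softmax_loss_function=softmax_loss_function)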

This takes us into the seq2seq.py source file, where the call chain is:

embedding_attention_seq2seq() -> embedding_attention_decoder() -> attention_decoder() -> attention()

  1. embedding_attention_seq2seq() converts encoder_inputs into embedded form (token ids become vectors). The code:
with variable_scope.variable_scope(
                scope or "embedding_attention_seq2seq", dtype=dtype) as scope:
    dtype = scope.dtype
    # Encoder.
    encoder_cell = core_rnn_cell.EmbeddingWrapper(
        cell,
        embedding_classes=num_encoder_symbols,  # 10
        embedding_size=embedding_size)  # 32; builds the embedding layer on top of the cell
    encoder_outputs, encoder_state = core_rnn.static_rnn(
        encoder_cell, encoder_inputs, dtype=dtype)  # run the wrapped cell over the embedded encoder_inputs; encoder_outputs are the per-step outputs

    # Compute attention_states.
    top_states = [
        array_ops.reshape(e, [-1, 1, cell.output_size]) for e in encoder_outputs
    ]
    attention_states = array_ops.concat(top_states, 1)

    # Decoder.
    output_size = None

    if isinstance(feed_previous, bool):
        return embedding_attention_decoder(
            decoder_inputs,
            encoder_state,
            attention_states,
            cell,
            num_decoder_symbols,
            embedding_size,
            num_heads=num_heads,  # 1; how many attention heads, i.e. how many times the weighted sum in formula (3) is computed
            output_size=output_size,  # None
            output_projection=output_projection,  # (w, b)
            feed_previous=feed_previous,
            initial_state_attention=initial_state_attention)  # False
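
To make the shapes concrete, here is a small numpy illustration (mine, not from the library) of how top_states and attention_states are built from the per-step encoder outputs:

import numpy as np

batch_size, encoder_size, output_size = 32, 6, 32
# static_rnn returns one [batch_size, output_size] tensor per encoder time step.
encoder_outputs = [np.random.randn(batch_size, output_size) for _ in range(encoder_size)]

# Each step becomes [batch_size, 1, output_size] ...
top_states = [e.reshape(-1, 1, output_size) for e in encoder_outputs]
# ... and concatenating along axis 1 gives [batch_size, encoder_size, output_size].
attention_states = np.concatenate(top_states, axis=1)
print(attention_states.shape)  # (32, 6, 32)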
  2. embedding_attention_decoder(), whose interface was described above, turns decoder_inputs into vectors via embedding_ops.embedding_lookup() (the encoder did this through the EmbeddingWrapper; here an embedding variable is created and looked up directly). The code:
if output_size is None:
    output_size = cell.output_size  # 32
if output_projection is not None:
    proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
    proj_biases.get_shape().assert_is_compatible_with([num_symbols])  # convert b to a tensor and check it matches the vocabulary size, 10

with variable_scope.variable_scope(
                scope or "embedding_attention_decoder", dtype=dtype) as scope:
    embedding = variable_scope.get_variable("embedding",
                                            [num_symbols, embedding_size])
    loop_function = _extract_argmax_and_embed(
        embedding, output_projection,
        update_embedding_for_previous) if feed_previous else None  # feed_previous is False here
    emb_inp = [
        embedding_ops.embedding_lookup(embedding, i) for i in decoder_inputs
    ]  # changes each decoder input from shape (?,) to (?, 32)
    return attention_decoder(
        emb_inp,
        initial_state,
        attention_states,
        cell,
        output_size=output_size,
        num_heads=num_heads,
        loop_function=loop_function,  # None
        initial_state_attention=initial_state_attention)  # False
  3. attention_decoder() is the core function.

In the attention diagram (from the linked posts, not reproduced here), z is the state in the code below and h is the hidden states (hidden, i.e. attention_states). Their weight parameters in the code are the linear layer of size attention_vec_size (applied to z) and the kernel k (applied to h). The actual score computation is delegated to the attention function. Within each decoding step:

  1. x (c in the figure) is the cell input, obtained by combining inp and attns.
  2. x and state are fed into the cell, giving a new state and cell_output.
  3. state is passed to the attention function, producing new attns.
  4. attns and cell_output are combined into the final output.
The beam search discussed later plugs in here: inp is no longer taken directly from decoder_inputs but is computed by loop_function from the previous output (prev).
with variable_scope.variable_scope(
                scope or "attention_decoder", dtype=dtype) as scope:
    dtype = scope.dtype
    # Dimensions are read off the input tensors.
    batch_size = array_ops.shape(decoder_inputs[0])[0]  # 32; decoder_inputs keeps the shape=[decoder_size, batch_size] (6, 32) layout
    attn_length = attention_states.get_shape()[1].value
    attn_size = attention_states.get_shape()[2].value  # same as embedding_size

    hidden = array_ops.reshape(attention_states,
                               [-1, attn_length, 1, attn_size])  # hidden is h (the attention_states)
    hidden_features = []
    v = []
    attention_vec_size = attn_size  # size of the attention query vector
    for a in xrange(num_heads):  # 1
        k = variable_scope.get_variable("AttnW_%d" % a,
                                        [1, 1, attn_size, attention_vec_size])
        hidden_features.append(nn_ops.conv2d(hidden, k, [1, 1, 1, 1], "SAME"))  # convolve the attention states with k
        v.append(
            variable_scope.get_variable("AttnV_%d" % a, [attention_vec_size]))  # v has the same size as attn_size

    state = initial_state

    # Preparation.
    outputs = []
    prev = None
    batch_attn_size = array_ops.stack([batch_size, attn_size])
    attns = [
        array_ops.zeros(
            batch_attn_size, dtype=dtype) for _ in xrange(num_heads)
    ]
    for a in attns:  # Ensure the second shape of attention vectors is set.
        a.set_shape([None, attn_size])
    if initial_state_attention:
        attns = attention(initial_state)

    for i, inp in enumerate(decoder_inputs):
        # Loop over time steps: attention is applied to each step's state, and together
        # with that step's decoder input it determines the input to the cell.
        if loop_function is not None and prev is not None:
            with variable_scope.variable_scope("loop_function", reuse=True):
                inp = loop_function(prev, i)
        input_size = inp.get_shape().with_rank(2)[1]  # 32
        x = linear([inp] + attns, input_size, True)  # x combines the input and the attention result
        # Run the RNN.
        # The input and the previous hidden state jointly determine the current hidden state and the decoder output.
        cell_output, state = cell(x, state)
        # Call the attention function.
        if i == 0 and initial_state_attention:
            with variable_scope.variable_scope(
                    variable_scope.get_variable_scope(), reuse=True):
                attns = attention(state)  # state is z, i.e. the decoder hidden state
        else:
            attns = attention(state)

        with variable_scope.variable_scope("AttnOutputProjection"):
            output = linear([cell_output] + attns, output_size, True)  # combine the attention result with the cell's own output
        outputs.append(output)  # the final outputs

return outputs, state
  4. Finally, the attention function.

The query argument is $d_t$ from formula (1).
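
The three formulas referenced throughout this section come from the figure/linked posts and are not reproduced above; reading them off the code below, they are essentially:

u_i^t = v^\top \tanh(W_1 h_i + W_2 d_t)        (1)
a_i^t = \operatorname{softmax}_i(u_i^t)        (2)
d_t'  = \sum_i a_i^t h_i                       (3)

where $h_i$ are the encoder states (hidden), $d_t$ is the decoder state (query), $W_1 h_i$ is hidden_features (the 1x1 convolution with k), $W_2 d_t$ is y, and the softmax runs over the encoder positions $i$.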

def attention(query):
    """Put attention masks on hidden using hidden_features and query."""
    ds = []  # holds the final results
    if nest.is_sequence(query):  # If the query is a tuple, flatten it.
        query_list = nest.flatten(query)
        for q in query_list:  # Check that ndims == 2 if specified.
            ndims = q.get_shape().ndims
            if ndims:
                assert ndims == 2
        query = array_ops.concat(query_list, 1)
    for a in xrange(num_heads):
        with variable_scope.variable_scope("Attention_%d" % a):
            # y is $W_2 d_t$ in formula (1)
            y = linear(query, attention_vec_size, True)
            y = array_ops.reshape(y, [-1, 1, 1, attention_vec_size])
            # Attention mask is a softmax of v^T * tanh(...).
            # s is the result of formula (1)
            s = math_ops.reduce_sum(v[a] * math_ops.tanh(hidden_features[a] + y),
                                    [2, 3])
            # formula (2)
            a = nn_ops.softmax(s)
            # Now calculate the attention-weighted vector d.
            # formula (3)
            d = math_ops.reduce_sum(
                array_ops.reshape(a, [-1, attn_length, 1, 1]) * hidden, [1, 2])
            ds.append(array_ops.reshape(d, [-1, attn_size]))
    return ds
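
A minimal numpy walk-through of the single-head shape bookkeeping (an illustration of mine, not the library code): hidden and hidden_features have shape [batch, attn_length, 1, attn_size], y is broadcast against every encoder position, and the two reduce_sum calls implement formulas (1) and (3):

import numpy as np

batch, attn_length, attn_size = 32, 6, 32
hidden = np.random.randn(batch, attn_length, 1, attn_size)   # h, as reshaped above
W1h    = np.random.randn(batch, attn_length, 1, attn_size)   # hidden_features[a] (conv2d with k)
v      = np.random.randn(attn_size)                          # AttnV_a
W2d    = np.random.randn(batch, attn_size)                   # y = linear(query, attention_vec_size)

y = W2d.reshape(batch, 1, 1, attn_size)                      # broadcast against every encoder position
s = np.sum(v * np.tanh(W1h + y), axis=(2, 3))                # formula (1): shape (batch, attn_length)
a = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)         # formula (2): softmax over encoder positions
d = np.sum(a.reshape(batch, attn_length, 1, 1) * hidden, axis=(1, 2))  # formula (3): (batch, attn_size)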

In GNMT

The last three of the four functions are unchanged. The structure is as follows:

In seq2seq_model.py, the cell for the whole network is first redefined with

cell = Stack_Residual_RNNCell.Stack_Residual_RNNCell(list_of_cell)

instead of using tf.contrib.rnn.MultiRNNCell directly. The new __call__ is:

def __call__(self, inputs, state, scope=None):
    with vs.variable_scope(scope or type(self).__name__):
        cur_state_pos = 0
        cur_inp = inputs
        if self._use_residual_connections:  # new
            past_inp = tf.zeros_like(cur_inp)  # past_inp keeps the previous cur_inp, so it can be added back in when the next cur_inp is computed
        new_states = []
        for i, cell in enumerate(self._cells):
            with vs.variable_scope("Cell%d" % i):
                if self._state_is_tuple:
                    if not nest.is_sequence(state):
                        raise ValueError(
                            "Expected state to be a tuple of length %d, but received: %s"
                            % (len(self.state_size), state))
                    cur_state = state[i]
                else:
                    cur_state = array_ops.slice(
                        state, [0, cur_state_pos], [-1, cell.state_size])
                    cur_state_pos += cell.state_size
                if self._use_residual_connections:  # new
                    cur_inp += past_inp  # this layer's input = previous layer's output + previous layer's input
                    past_inp = cur_inp
                cur_inp, new_state = cell(cur_inp, cur_state)
                new_states.append(new_state)
    new_states = (tuple(new_states) if self._state_is_tuple
                  else array_ops.concat(new_states, 1))
    return cur_inp, new_states
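
The effect of the past_inp bookkeeping is the GNMT-style residual recursion x_{i+1} = y_i + x_i: each layer's input is the previous layer's output plus the previous layer's input. A tiny scalar sketch of mine that mirrors the loop above:

# Stand-in "cells" to trace the recursion; not real RNN cells.
cells = [lambda x: 2 * x, lambda x: x + 1, lambda x: 3 * x]

cur_inp, past_inp = 1.0, 0.0          # inputs and zeros_like(inputs)
for cell in cells:
    cur_inp += past_inp               # x_i = y_{i-1} + x_{i-1}   (x_0 = inputs)
    past_inp = cur_inp                # remember this layer's input
    cur_inp = cell(cur_inp)           # y_i = cell_i(x_i)
# the layer inputs are 1, 3 (= 2 + 1), 7 (= 4 + 3): each one is output + input of the layer below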

Control then moves to embedding_attention_seq2seq, the first of the four functions above.
The first two layers are a bidirectional RNN, defined as follows to obtain this stage's outputs:

encoder_fw_cell = rnn_cell.EmbeddingWrapper(single_cell_1, embedding_classes=num_encoder_symbols,
                                            embedding_size=embedding_size / 2)
encoder_bw_cell = rnn_cell.EmbeddingWrapper(single_cell_2, embedding_classes=num_encoder_symbols,
                                            embedding_size=embedding_size / 2)
outputs, _, _ = rnn.bidirectional_rnn(encoder_fw_cell, encoder_bw_cell, encoder_inputs, dtype=dtype)

The remaining num_layers layers are then built with the Stack_Residual_RNNCell defined above (cell2). Its outputs encoder_outputs, encoder_state play the role of the post-embedding encoder outputs in the original code, and from that point on the code is the same as the original.
---- I don't understand why two cells (cell and cell2) need to be defined. Why can't the output of the bidirectional embedding layers be used directly?
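
For reference, a minimal sketch of how this stage is presumably wired up (the excerpt does not show it, so cell2, num_layers and the GRUCell choice are assumptions; outputs is the list returned by bidirectional_rnn above):

# Sketch only: stack the remaining num_layers residual layers on top of the bidirectional outputs.
cell2 = Stack_Residual_RNNCell.Stack_Residual_RNNCell(
    [tf.contrib.rnn.GRUCell(size) for _ in range(num_layers)])
encoder_outputs, encoder_state = core_rnn.static_rnn(cell2, outputs, dtype=dtype)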

In Neural_Conversation_Models

This project adds beam search to the attention setting; the only part it rewrites is the computation of inp:

if loop_function is not None:
    with variable_scope.variable_scope("loop_function", reuse=True):
        if prev is not None:
            inp = loop_function(prev, i, log_beam_probs, beam_path, beam_symbols)

That is, at every step beam search keeps the beam_size best candidates and feeds them back as inp. The loop_function itself:

def loop_function(prev, i, log_beam_probs, beam_path, beam_symbols):
    if output_projection is not None:
        prev = nn_ops.xw_plus_b(
            prev, output_projection[0], output_projection[1])
    # prev is projected to logits over the vocabulary (num_symbols)

    probs = tf.log(tf.nn.softmax(prev))  # log-softmax

    if i > 1:
        probs = tf.reshape(probs + log_beam_probs[-1],
                           [-1, beam_size * num_symbols])  # add the log-probs accumulated so far

    best_probs, indices = tf.nn.top_k(probs, beam_size)  # keep the beam_size best candidates
    indices = tf.stop_gradient(tf.squeeze(tf.reshape(indices, [-1, 1])))
    best_probs = tf.stop_gradient(tf.reshape(best_probs, [-1, 1]))

    symbols = indices % num_symbols  # the actual word ids
    beam_parent = indices // num_symbols  # which beam each one came from

    beam_symbols.append(symbols)
    beam_path.append(beam_parent)
    log_beam_probs.append(best_probs)

    emb_prev = embedding_ops.embedding_lookup(embedding, symbols)
    emb_prev = tf.reshape(emb_prev, [beam_size, embedding_size])  # note the output size!
    if not update_embedding:
        emb_prev = array_ops.stop_gradient(emb_prev)
    return emb_prev
return loop_function
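
The modulo/integer-division pair decodes each flat top_k index back into a (source beam, word id) pair, since probs was flattened to beam_size * num_symbols columns. A quick numeric check:

num_symbols, beam_size = 10, 2
flat_index = 13                             # one index returned by top_k over the flattened probs
symbol = flat_index % num_symbols           # 3  -> word id 3
beam_parent = flat_index // num_symbols     # 1  -> it was extended from beam 1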

What gets returned here has shape (beam_size, input_size), but prev is actually expected to be (batch_size, input_size), so the code does not run as-is...

posted on 2017-11-03 11:05 by yingtaomj