Attention Mechanisms: attention-scoring-functions Exercises

1、Modify keys in the toy example and visualize attention weights. Do additive attention and scaled dot-product attention still output the same attention weights? Why or why not?

They are the same. In the toy example, `keys` is an all-ones tensor of shape (2, 10, 2); after its linear layer it becomes (2, 10, 8), and because the input rows are identical (all ones), every row along the last dimension of size 8 is identical as well. `queries` has shape (2, 1, 20) and becomes (2, 1, 8) after its linear layer. For additive attention, `queries` and `keys` are expanded to (batch_size, num_queries, num_keys, num_hiddens): `queries` gains a num_keys axis, `keys` gains a num_queries axis, and broadcasting adds them into a tensor of shape (2, 1, 10, 8). Since every key row is identical and only the query differs between the 2 batches, the resulting scores are constant across keys within each batch. These scores are then masked with `valid_lens` (2 values, one per batch); because the scores within each batch are all equal, the masked softmax yields the same (uniform) attention weights for both mechanisms.
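
A minimal self-contained PyTorch sketch of the toy setup (masked_softmax, AdditiveAttention, and DotProductAttention are re-implemented here along the lines of the d2l chapter, so names and hyperparameters such as `num_hiddens=8` and the constant-3 keys are assumptions): as long as every key row stays identical, both mechanisms produce the same uniform weights over the valid positions.

```python
import math
import torch
from torch import nn

def masked_softmax(X, valid_lens):
    """Softmax over the last axis, masking out positions beyond valid_lens."""
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    shape = X.shape
    if valid_lens.dim() == 1:
        valid_lens = torch.repeat_interleave(valid_lens, shape[1])
    else:
        valid_lens = valid_lens.reshape(-1)
    mask = torch.arange(shape[-1], device=X.device)[None, :] < valid_lens[:, None]
    X = X.reshape(-1, shape[-1]).masked_fill(~mask, -1e6)
    return nn.functional.softmax(X.reshape(shape), dim=-1)

class AdditiveAttention(nn.Module):
    def __init__(self, key_size, query_size, num_hiddens, dropout):
        super().__init__()
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # Broadcast add: (batch, #queries, 1, h) + (batch, 1, #keys, h) -> (batch, #queries, #keys, h)
        features = torch.tanh(queries.unsqueeze(2) + keys.unsqueeze(1))
        scores = self.w_v(features).squeeze(-1)            # (batch, #queries, #keys)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

class DotProductAttention(nn.Module):
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

# Toy inputs as in the chapter, with keys modified but every row kept identical.
queries = torch.normal(0, 1, (2, 1, 20))
keys = torch.ones((2, 10, 2)) * 3.0
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(2, 1, 1)
valid_lens = torch.tensor([2, 6])

add_attn = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8, dropout=0)
add_attn.eval()
add_attn(queries, keys, values, valid_lens)

# Dot-product attention needs queries with the same feature size as the keys.
dot_attn = DotProductAttention(dropout=0)
dot_attn.eval()
dot_attn(torch.normal(0, 1, (2, 1, 2)), keys, values, valid_lens)

print(add_attn.attention_weights)   # uniform within each valid prefix
print(dot_attn.attention_weights)   # same uniform pattern
```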

3、When queries and keys have the same vector length, is vector summation a better design than dot product for the scoring function? Why or why not? (Which is better, dot product or summation?)

The dot product is a transformation between vector spaces, mapping a point from one space to another, whereas addition is merely a superposition of points within the same space. Accordingly, the dot product can express a wider range of query-key relations.
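
A small numeric sketch (my own illustration, not from the original answer) of one concrete limitation of a summation score: if the score is s(q, k) = sum(q + k) = sum(q) + sum(k), the query only contributes a constant shift that the softmax over keys cancels out, so the attention weights ignore the query entirely; the dot-product score q·k changes with how the query aligns with each key.

```python
import torch

# Two different queries and three keys of the same vector length.
queries = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
keys = torch.tensor([[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])

# "Summation" score: reduce the element-wise sum q + k to a scalar.
sum_scores = (queries[:, None, :] + keys[None, :, :]).sum(-1)   # shape (2, 3)
# Dot-product score.
dot_scores = queries @ keys.T                                   # shape (2, 3)

print(torch.softmax(sum_scores, dim=-1))  # both rows identical: the query has no effect
print(torch.softmax(dot_scores, dim=-1))  # rows differ: query-key interaction is captured
```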
