Attention Mechanisms: attention-scoring-functions Exercises

1、Modify keys in the toy example and visualize attention weights. Do additive attention and scaled dot-product attention still output the same attention weights? Why or why not?

They are the same. In the toy example, `keys` is an all-ones tensor of shape (2, 10, 2); after its linear layer it becomes (2, 10, 8), and because the input rows are identical (all ones), every row along the last dimension of size 8 is identical as well. `queries` has shape (2, 1, 20) and becomes (2, 1, 8) after its linear layer. For additive attention, `queries` and `keys` are expanded to (batch_size, num_queries, num_keys, num_hiddens): `queries` gains a num_keys axis, `keys` gains a num_queries axis, and broadcasting adds them into a tensor of shape (2, 1, 10, 8). Since every key row is identical and only the query differs between the 2 batches, the resulting scores are constant across keys within each batch. These scores are then masked with `valid_lens` (2 values, one per batch); because the scores within each batch are all equal, the masked softmax yields the same (uniform) attention weights for both mechanisms.
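
A minimal self-contained PyTorch sketch of the toy setup (masked_softmax, AdditiveAttention, and DotProductAttention are re-implemented here along the lines of the d2l chapter, so names and hyperparameters such as `num_hiddens=8` and the constant-3 keys are assumptions): as long as every key row stays identical, both mechanisms produce the same uniform weights over the valid positions.

```python
import math
import torch
from torch import nn

def masked_softmax(X, valid_lens):
    """Softmax over the last axis, masking out positions beyond valid_lens."""
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    shape = X.shape
    if valid_lens.dim() == 1:
        valid_lens = torch.repeat_interleave(valid_lens, shape[1])
    else:
        valid_lens = valid_lens.reshape(-1)
    mask = torch.arange(shape[-1], device=X.device)[None, :] < valid_lens[:, None]
    X = X.reshape(-1, shape[-1]).masked_fill(~mask, -1e6)
    return nn.functional.softmax(X.reshape(shape), dim=-1)

class AdditiveAttention(nn.Module):
    def __init__(self, key_size, query_size, num_hiddens, dropout):
        super().__init__()
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        # Broadcast add: (batch, #queries, 1, h) + (batch, 1, #keys, h) -> (batch, #queries, #keys, h)
        features = torch.tanh(queries.unsqueeze(2) + keys.unsqueeze(1))
        scores = self.w_v(features).squeeze(-1)            # (batch, #queries, #keys)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

class DotProductAttention(nn.Module):
    def __init__(self, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        d = queries.shape[-1]
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

# Toy inputs as in the chapter, with keys modified but every row kept identical.
queries = torch.normal(0, 1, (2, 1, 20))
keys = torch.ones((2, 10, 2)) * 3.0
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(2, 1, 1)
valid_lens = torch.tensor([2, 6])

add_attn = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8, dropout=0)
add_attn.eval()
add_attn(queries, keys, values, valid_lens)

# Dot-product attention needs queries with the same feature size as the keys.
dot_attn = DotProductAttention(dropout=0)
dot_attn.eval()
dot_attn(torch.normal(0, 1, (2, 1, 2)), keys, values, valid_lens)

print(add_attn.attention_weights)   # uniform within each valid prefix
print(dot_attn.attention_weights)   # same uniform pattern
```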

3、When queries and keys have the same vector length, is vector summation a better design than dot product for the scoring function? Why or why not? (Which is better, dot product or summation?)

The dot product is a transformation between vector spaces, mapping a point from one space to another, whereas addition is merely a superposition of points within the same space. Accordingly, the dot product can express a wider range of query-key relations.
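
A small numeric sketch (my own illustration, not from the original answer) of one concrete limitation of a summation score: if the score is s(q, k) = sum(q + k) = sum(q) + sum(k), the query only contributes a constant shift that the softmax over keys cancels out, so the attention weights ignore the query entirely; the dot-product score q·k changes with how the query aligns with each key.

```python
import torch

# Two different queries and three keys of the same vector length.
queries = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
keys = torch.tensor([[2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]])

# "Summation" score: reduce the element-wise sum q + k to a scalar.
sum_scores = (queries[:, None, :] + keys[None, :, :]).sum(-1)   # shape (2, 3)
# Dot-product score.
dot_scores = queries @ keys.T                                   # shape (2, 3)

print(torch.softmax(sum_scores, dim=-1))  # both rows identical: the query has no effect
print(torch.softmax(dot_scores, dim=-1))  # rows differ: query-key interaction is captured
```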
