2. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

1. Abstract

BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, this requires that both sentences are fed into the network, which causes a massive computational overhead: finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering.

In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy of BERT.

We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where they outperform other state-of-the-art sentence embedding methods.

2. Introduction

In this publication, we present Sentence-BERT (SBERT), a modification of the BERT network using siamese and triplet networks that is able to derive semantically meaningful sentence embeddings. This enables BERT to be used for certain new tasks for which it was not applicable up to now. These tasks include large-scale semantic similarity comparison, clustering, and information retrieval via semantic search.

BERT set new state-of-the-art performance on various sentence classification and sentence-pair regression tasks. BERT uses a cross-encoder: two sentences are passed to the transformer network and the target value is predicted. However, this setup is unsuitable for various pair regression tasks because there are too many possible combinations. With BERT, finding the pair with the highest similarity in a collection of n = 10,000 sentences requires n·(n−1)/2 = 49,995,000 inference computations. On a modern V100 GPU, this requires about 65 hours. Similarly, finding which of the over 40 million existing Quora questions is the most similar to a new question could be modeled as a pair-wise comparison with BERT; however, answering a single query would require over 50 hours.
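The arithmetic behind these figures can be checked with a short sketch. The throughput below is not a measured number: it is only the rate implied by the quoted "about 65 hours", and the variable names are illustrative.

```python
def num_pairs(n: int) -> int:
    """Number of unordered pairs among n sentences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

pairs = num_pairs(10_000)                      # 49,995,000 cross-encoder passes

# Rate implied by the quoted ~65 hours (an inference from the text, not a benchmark).
pairs_per_second = pairs / (65 * 3600)

# One new question compared against 40 million stored questions at that rate.
quora_hours = 40_000_000 / pairs_per_second / 3600

print(pairs, round(pairs_per_second), round(quora_hours))
```

At the implied rate of roughly 214 pairs per second, a single query against 40 million questions takes about 52 hours, which is consistent with the "over 50 hours" quoted above.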

A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT and to derive fixed-size sentence embeddings. The most commonly used approach is to average the BERT output layer (known as BERT embeddings) or to use the output of the first token (the [CLS] token). As we will show, this common practice yields rather bad sentence embeddings, often worse than averaging GloVe embeddings (Pennington et al., 2014).

To alleviate this issue, we developed SBERT. The siamese network architecture makes it possible to derive fixed-sized vectors for input sentences. Using a similarity measure like cosine-similarity or Manhattan / Euclidean distance, semantically similar sentences can be found. These similarity measures can be computed extremely efficiently on modern hardware, allowing SBERT to be used for semantic similarity search as well as for clustering. The effort for finding the most similar sentence pair in a collection of 10,000 sentences is reduced from 65 hours with BERT to the computation of 10,000 sentence embeddings (~5 seconds with SBERT) and computing cosine-similarity (~0.01 seconds). By using optimized index structures, finding the most similar Quora question can be reduced from 50 hours to a few milliseconds (Johnson et al., 2017).
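Once every sentence has a fixed-size vector, the pair search reduces to one matrix product. The following is a minimal brute-force sketch in NumPy, assuming the embeddings are already available as rows of an array. The function name and toy vectors are illustrative; this is not the optimized index structure mentioned above.

```python
import numpy as np

def most_similar_pair(embeddings: np.ndarray) -> tuple:
    """Return (i, j, cosine) for the most similar pair of distinct rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # L2-normalize each row
    sims = unit @ unit.T                              # all pairwise cosines
    np.fill_diagonal(sims, -np.inf)                   # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(min(i, j)), int(max(i, j)), float(sims[i, j])

# Toy 2-d vectors standing in for sentence embeddings.
toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
best = most_similar_pair(toy)
print(best[:2])
```

For the toy input, rows 0 and 2 point in almost the same direction and are returned as the best pair.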

We fine-tune SBERT on NLI data, which creates sentence embeddings that significantly outperform other state-of-the-art sentence embedding methods like InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018). On seven Semantic Textual Similarity (STS) tasks, SBERT achieves an improvement of 11.7 points compared to InferSent and 5.5 points compared to Universal Sentence Encoder. On SentEval (Conneau and Kiela, 2018), an evaluation toolkit for sentence embeddings, we achieve an improvement of 2.1 and 2.6 points, respectively.

SBERT can be adapted to a specific task. It sets new state-of-the-art performance on a challenging argument similarity dataset (Misra et al., 2016) and on a triplet dataset to distinguish sentences from different sections of a Wikipedia article (Dor et al., 2018).

The paper is structured in the following way: Section 3 presents SBERT, and section 4 evaluates SBERT on common STS tasks and on the challenging Argument Facet Similarity (AFS) corpus (Misra et al., 2016). Section 5 evaluates SBERT on SentEval. In section 6, we perform an ablation study to test some design aspects of SBERT. In section 7, we compare the computational efficiency of SBERT sentence embeddings to that of other state-of-the-art sentence embedding methods.

3. Related Work

We first introduce BERT, then we discuss state-of-the-art sentence embedding methods.

BERT (Devlin et al., 2018) is a pre-trained transformer network (Vaswani et al., 2017), which set new state-of-the-art results for various NLP tasks, including question answering, sentence classification, and sentence-pair regression. The input for BERT for sentence-pair regression consists of the two sentences, separated by a special [SEP] token. Multi-head attention over 12 (base-model) or 24 layers (large-model) is applied and the output is passed to a simple regression function to derive the final label. Using this setup, BERT set a new state-of-the-art performance on the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017). RoBERTa (Liu et al., 2019) showed that the performance of BERT can be further improved by small adaptations to the pre-training process. We also tested XLNet (Yang et al., 2019), but it led in general to worse results than BERT.

A large disadvantage of the BERT network structure is that no independent sentence embeddings are computed, which makes it difficult to derive sentence embeddings from BERT. To bypass this limitation, researchers passed single sentences through BERT and then derived a fixed-sized vector by either averaging the outputs (similar to average word embeddings) or by using the output of the special CLS token (for example: May et al. (2019); Zhang et al. (2019); Qiao et al. (2019)). These two options are also provided by the popular bert-as-a-service repository. To our knowledge, there has so far been no evaluation of whether these methods lead to useful sentence embeddings.

Sentence embeddings are a well-studied area with dozens of proposed methods. Skip-Thought (Kiros et al., 2015) trains an encoder-decoder architecture to predict the surrounding sentences. InferSent (Conneau et al., 2017) uses labeled data of the Stanford Natural Language Inference dataset (Bowman et al., 2015) and the Multi-Genre NLI dataset (Williams et al., 2018) to train a siamese BiLSTM network with max-pooling over the output. Conneau et al. showed that InferSent consistently outperforms unsupervised methods like Skip-Thought. Universal Sentence Encoder (Cer et al., 2018) trains a transformer network and augments unsupervised learning with training on SNLI. Hill et al. (2016) showed that the task on which sentence embeddings are trained significantly impacts their quality. Previous work (Conneau et al., 2017; Cer et al., 2018) found that the SNLI datasets are suitable for training sentence embeddings. Yang et al. (2018) presented a method to train on conversations from Reddit using siamese DAN and siamese transformer networks, which yielded good results on the STS benchmark dataset.
 
 
Humeau et al. (2019) address the run-time overhead of the cross-encoder from BERT and present a method (poly-encoders) to compute a score between m context vectors and pre-computed candidate embeddings using attention. This idea works for finding the highest scoring sentence in a larger collection. However, poly-encoders have the drawback that the score function is not symmetric and the computational overhead is too large for use-cases like clustering, which would require O(n^2) score computations.
 
Previous neural sentence embedding methods started the training from a random initialization. In this publication, we use the pre-trained BERT and RoBERTa networks and only fine-tune them to yield useful sentence embeddings. This significantly reduces the needed training time: SBERT can be tuned in less than 20 minutes, while yielding better results than comparable sentence embedding methods.
 

4. Model

SBERT adds a pooling operation to the output of BERT / RoBERTa to derive a fixed-sized sentence embedding. We experiment with three pooling strategies: using the output of the CLS-token, computing the mean of all output vectors (MEAN-strategy), and computing a max-over-time of the output vectors (MAX-strategy). The default configuration is MEAN.
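The three pooling strategies can be sketched on a plain array of token vectors. This is a minimal NumPy illustration, not the authors' implementation. The padding `mask` argument and the assumption that the CLS vector sits in row 0 are illustrative choices that the text above does not spell out.

```python
import numpy as np

def pool(token_vectors: np.ndarray, mask: np.ndarray, strategy: str = "MEAN") -> np.ndarray:
    """Reduce a (tokens, dim) matrix to a single (dim,) sentence vector.

    mask holds 1 for real tokens and 0 for padding (an assumed input format).
    """
    if strategy == "CLS":
        return token_vectors[0]                    # assumes CLS is the first token
    valid = token_vectors[mask.astype(bool)]       # drop padding rows
    if strategy == "MEAN":
        return valid.mean(axis=0)
    if strategy == "MAX":
        return valid.max(axis=0)                   # max-over-time per dimension
    raise ValueError(f"unknown pooling strategy: {strategy}")

tokens = np.array([[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]])   # last row is padding
mask = np.array([1, 1, 0])
print(pool(tokens, mask, "CLS"), pool(tokens, mask, "MEAN"), pool(tokens, mask, "MAX"))
```

Excluding padding rows matters for MEAN and MAX: without the mask, the padded row would shift both results.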

In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights such that the produced sentence embeddings are semantically meaningful and can be compared with cosine-similarity.

The network structure depends on the available training data. We experiment with the following structures and objective functions.

 

 

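(1) Classification objective function. We concatenate the sentence embeddings u and v with the element-wise difference |u − v| and multiply the result with the trainable weight $W_t \in \mathbb{R}^{3n \times k}$, where n is the dimension of the sentence embeddings and k is the number of labels. We optimize cross-entropy loss. This structure is depicted in Figure 1.
(2) Regression objective function. The cosine-similarity between the two sentence embeddings u and v is computed (Figure 2). We use mean squared error loss as the objective function.
(3) Triplet objective function. Given an anchor sentence a, a positive sentence p, and a negative sentence n, triplet loss tunes the network such that the distance between a and p is smaller than the distance between a and n. Mathematically, we minimize the following loss function:
$$\max\left(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\ 0\right)$$
with $s_x$ the sentence embedding for a/n/p, $\lVert \cdot \rVert$ a distance metric, and $\epsilon$ the margin. The margin $\epsilon$ ensures that $s_p$ is at least $\epsilon$ closer to $s_a$ than $s_n$. As metric we use Euclidean distance, and we set $\epsilon = 1$ in our experiments.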
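The three objectives can be written down for a single training example as follows. This is a hedged NumPy sketch of the loss values only, with no gradients or encoder. The function names are illustrative, the softmax over the k outputs is the usual reading of "cross-entropy loss", and in practice each loss would be averaged over a mini-batch.

```python
import numpy as np

def classification_loss(u, v, W, label):
    """Cross-entropy over k labels from the features (u, v, |u - v|)."""
    features = np.concatenate([u, v, np.abs(u - v)])   # shape (3n,)
    logits = features @ W                              # W has shape (3n, k)
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-log_probs[label])

def regression_loss(u, v, gold):
    """Squared error between cosine(u, v) and the gold similarity."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float((cos - gold) ** 2)

def triplet_loss(a, p, n, margin=1.0):
    """max(||a - p|| - ||a - n|| + margin, 0) with Euclidean distance."""
    return float(max(np.linalg.norm(a - p) - np.linalg.norm(a - n) + margin, 0.0))

u, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
W = np.zeros((6, 3))                                   # untrained weights, k = 3
print(classification_loss(u, v, W, label=0))           # log(3) for uniform output
print(regression_loss(u, v, gold=0.0))
print(triplet_loss(np.zeros(2), np.array([0.0, 1.0]), np.array([0.0, 1.5])))
```

In the last call the negative is only 0.5 further from the anchor than the positive, so the margin of 1 is violated by 0.5 and the loss is 0.5. Once the gap reaches the margin, the loss drops to zero.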

4.1. Training Details

We train SBERT on the combination of the SNLI (Bowman et al., 2015) and the Multi-Genre NLI (Williams et al., 2018) datasets. SNLI is a collection of 570,000 sentence pairs annotated with the labels contradiction, entailment, and neutral. MultiNLI contains 430,000 sentence pairs and covers a range of genres of spoken and written text. We fine-tune SBERT with a 3-way softmax classifier objective function for one epoch. We used a batch size of 16, the Adam optimizer with learning rate 2e−5, and a linear learning rate warm-up over 10% of the training data. Our default pooling strategy is MEAN.
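The warm-up can be illustrated with a small schedule function. This is a sketch under stated assumptions: the step count uses the rounded dataset sizes quoted above, and holding the rate constant after warm-up is an assumption, since the text only specifies the linear warm-up phase.

```python
def warmup_lr(step: int, total_steps: int, base_lr: float = 2e-5,
              warmup_fraction: float = 0.1) -> float:
    """Learning rate at a 0-indexed step, with linear warm-up."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear ramp up to base_lr
    return base_lr   # assumption: constant afterwards (not stated in the text)

# One epoch over roughly 570,000 + 430,000 pairs with batch size 16.
total_steps = (570_000 + 430_000) // 16
print(total_steps, warmup_lr(0, total_steps), warmup_lr(total_steps - 1, total_steps))
```

With these numbers one epoch has 62,500 updates, so the warm-up covers the first 6,250 of them.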

 

5. Evaluation - Semantic Textual Similarity

We evaluate the performance of SBERT for common Semantic Textual Similarity (STS) tasks. State-of-the-art methods often learn a (complex) regression function that maps sentence embeddings to a similarity score. However, these regression functions work pair-wise and, due to the combinatorial explosion, they are often not scalable once the collection of sentences reaches a certain size. Instead, we always use cosine-similarity to compare the similarity between two sentence embeddings. We also ran our experiments with negative Manhattan and negative Euclidean distances as similarity measures, but the results for all approaches remained roughly the same.

5.1. Unsupervised STS

We evaluate the performance of SBERT for STS without using any STS-specific training data. We use the STS tasks 2012 - 2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016), the STS benchmark (Cer et al., 2017), and the SICK-Relatedness dataset (Marelli et al., 2014). These datasets provide labels between 0 and 5 on the semantic relatedness of sentence pairs. We showed in (Reimers et al., 2016) that Pearson correlation is badly suited for STS. Instead, we compute the Spearman's rank correlation between the cosine-similarity of the sentence embeddings and the gold labels. The setup for the other sentence embedding methods is equivalent: the similarity is computed by cosine-similarity. The results are depicted in Table 1.
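Spearman's rank correlation is the Pearson correlation of the ranks, so it only rewards getting the ordering of the pairs right. A minimal NumPy sketch follows; the helper names and the toy scores are illustrative and are not data from the paper.

```python
import numpy as np

def average_ranks(values) -> np.ndarray:
    """Ranks starting at 1; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1                                   # extend the run of ties
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # mean rank of the run
        i = j + 1
    return ranks

def spearman(x, y) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

cosine_scores = [0.10, 0.40, 0.35, 0.80]   # toy model similarities
gold_labels = [0.0, 3.0, 2.0, 5.0]         # toy gold scores on the 0 to 5 scale
print(spearman(cosine_scores, gold_labels))
```

The toy cosine scores are not linearly related to the gold labels, yet they order the four pairs identically, so the rank correlation is perfect.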

The results show that directly using the output of BERT leads to rather poor performance. Averaging the BERT embeddings achieves an average correlation of only 54.81, and using the CLS-token output only achieves an average correlation of 29.19. Both are worse than computing average GloVe embeddings.

Using the described siamese network structure and fine-tuning mechanism substantially improves the correlation, outperforming both InferSent and Universal Sentence Encoder substantially. The only dataset where SBERT performs worse than Universal Sentence Encoder is SICK-R. Universal Sentence Encoder was trained on various datasets, including news, question-answer pages and discussion forums, which appears to be more suitable to the data of SICK-R. In contrast, SBERT was pre-trained only on Wikipedia (via BERT) and on NLI data.

While RoBERTa was able to improve the performance for several supervised tasks, we only observe minor differences between SBERT and SRoBERTa for generating sentence embeddings.

5.2. Supervised STS

The STS benchmark (STSb) (Cer et al., 2017) is a popular dataset to evaluate supervised STS systems. The data includes 8,628 sentence pairs from the three categories captions, news, and forums. It is divided into train (5,749), dev (1,500) and test (1,379). BERT set a new state-of-the-art performance on this dataset by passing both sentences to the network and using a simple regression method for the output.

We use the training set to fine-tune SBERT using the regression objective function. At prediction time, we compute the cosine-similarity between the sentence embeddings. All systems are trained with 10 random seeds to counter variances (Reimers and Gurevych, 2018).

The results are depicted in Table 2. We experimented with two setups: only training on STSb, and first training on NLI, then training on STSb. We observe that the latter strategy leads to a slight improvement of 1-2 points. This two-step approach had an especially large impact for the BERT cross-encoder, which improved the performance by 3-4 points. We do not observe a significant difference between BERT and RoBERTa.

5.3. Argument Facet Similarity

We evaluate SBERT on the Argument Facet Similarity (AFS) corpus by Misra et al. (2016). The AFS corpus contains 6,000 annotated sentential argument pairs from social media dialogs on three controversial topics: gun control, gay marriage, and death penalty. The data was annotated on a scale from 0 ("different topic") to 5 ("completely equivalent"). The similarity notion in the AFS corpus is fairly different from the similarity notion in the STS datasets from SemEval. STS data is usually descriptive, while AFS data are argumentative excerpts from dialogs. To be considered similar, arguments must not only make similar claims, but also provide similar reasoning. Further, the lexical gap between the sentences in AFS is much larger. Hence, simple unsupervised methods as well as state-of-the-art STS systems perform badly on this dataset (Reimers et al., 2019).

We evaluate SBERT on this dataset in two scenarios: 1) As proposed by Misra et al., we evaluate SBERT using 10-fold cross-validation. A drawback of this evaluation setup is that it is not clear how well approaches generalize to different topics. Hence, 2) we evaluate SBERT in a cross-topic setup. Two topics serve for training and the approach is evaluated on the left-out topic. We repeat this for all three topics and average the results.

SBERT is fine-tuned using the regression objective function. The similarity score is computed using cosine-similarity based on the sentence embeddings. We also provide the Pearson correlation r to make the results comparable to Misra et al. However, we showed (Reimers et al., 2016) that Pearson correlation has some serious drawbacks and should be avoided for comparing STS systems. The results are depicted in Table 3.

Unsupervised methods like tf-idf, average GloVe embeddings or InferSent perform rather badly on this dataset, with low scores. Training SBERT in the 10-fold cross-validation setup gives a performance that is nearly on par with BERT.

However, in the cross-topic evaluation, we observe a performance drop of SBERT by about 7 points Spearman correlation. To be considered similar, arguments should address the same claims and provide the same reasoning. BERT is able to use attention to compare both sentences directly (e.g. word-by-word comparison), while SBERT must map individual sentences from an unseen topic to a vector space such that arguments with similar claims and reasons are close. This is a much more challenging task, which appears to require more than just two topics for training to work on par with BERT.
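The cross-topic protocol is a leave-one-topic-out split. A minimal sketch, assuming the argument pairs are grouped by topic in a dictionary; the function name and the placeholder items are hypothetical and stand in for real annotated pairs.

```python
def cross_topic_splits(pairs_by_topic: dict):
    """Yield (held_out_topic, train_pairs, test_pairs), one split per topic."""
    for held_out in pairs_by_topic:
        train = [pair
                 for topic, pairs in pairs_by_topic.items() if topic != held_out
                 for pair in pairs]
        yield held_out, train, list(pairs_by_topic[held_out])

# Placeholder identifiers instead of real annotated argument pairs.
toy = {
    "gun control": ["g1", "g2"],
    "gay marriage": ["m1"],
    "death penalty": ["d1", "d2", "d3"],
}
splits = list(cross_topic_splits(toy))
for topic, train, test in splits:
    print(topic, len(train), len(test))
```

Each split trains on two topics and tests on the third; the reported number is then the average of the three test scores.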

 

5.4. Wikipedia Sections Distinction

Dor et al. (2018) use Wikipedia to create a thematically fine-grained train, dev and test set for sentence embedding methods. Wikipedia articles are separated into distinct sections focusing on certain aspects. Dor et al. assume that sentences in the same section are thematically closer than sentences in different sections. They use this to create a large dataset of weakly labeled sentence triplets: the anchor and the positive example come from the same section, while the negative example comes from a different section of the same article. For example, from the Alice Arnold article: Anchor: "Arnold joined the BBC Radio Drama Company in 1988.", positive: "Arnold gained media attention in May 2012.", negative: "Balding and Arnold are keen amateur golfers."

We use the dataset from Dor et al. We use the triplet objective, train SBERT for one epoch on the about 1.8 million training triplets and evaluate it on the 222,957 test triplets. Test triplets are from a distinct set of Wikipedia articles. As evaluation metric, we use accuracy: is the positive example closer to the anchor than the negative example? Results are presented in Table 4.

Dor et al. fine-tuned a BiLSTM architecture with triplet loss to derive sentence embeddings for this dataset. As the table shows, SBERT clearly outperforms the BiLSTM approach by Dor et al.
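The accuracy metric described above can be sketched directly on embedding arrays. Using Euclidean distance at evaluation time is an assumption carried over from the training objective; the text only asks whether the positive is closer to the anchor than the negative. The toy vectors are illustrative.

```python
import numpy as np

def triplet_accuracy(anchors: np.ndarray, positives: np.ndarray,
                     negatives: np.ndarray) -> float:
    """Fraction of triplets whose positive is closer to the anchor than the negative."""
    d_pos = np.linalg.norm(anchors - positives, axis=1)   # one distance per triplet
    d_neg = np.linalg.norm(anchors - negatives, axis=1)
    return float(np.mean(d_pos < d_neg))

anchors = np.array([[0.0, 0.0], [0.0, 0.0]])
positives = np.array([[1.0, 0.0], [3.0, 0.0]])
negatives = np.array([[2.0, 0.0], [1.0, 0.0]])
print(triplet_accuracy(anchors, positives, negatives))
```

In the toy data the first triplet is ordered correctly and the second is not, so the accuracy is 0.5.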

 

6. Evaluation - SentEval

SentEval (Conneau and Kiela, 2018) is a popular toolkit to evaluate the quality of sentence embeddings. Sentence embeddings are used as features for a logistic regression classifier. The logistic regression classifier is trained on various tasks in a 10-fold cross-validation setup and the prediction accuracy is computed for the test-fold.

The purpose of SBERT sentence embeddings is not to be used for transfer learning for other tasks. Here, we think fine-tuning BERT as described by Devlin et al. (2018) for new tasks is the more suitable method, as it updates all layers of the BERT network. However, SentEval can still give an impression of the quality of our sentence embeddings for various tasks.

We compare the SBERT sentence embeddings to other sentence embedding methods on the following seven SentEval transfer tasks:
• MR: Sentiment prediction for movie review snippets on a five-star scale (Pang and Lee, 2005).
• CR: Sentiment prediction of customer product reviews (Hu and Liu, 2004).
• SUBJ: Subjectivity prediction of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
• MPQA: Phrase-level opinion polarity classification from newswire (Wiebe et al., 2005).
• SST: Stanford Sentiment Treebank with binary labels (Socher et al., 2013).
• TREC: Fine-grained question-type classification from TREC (Li and Roth, 2002).
• MRPC: Microsoft Research Paraphrase Corpus from parallel news sources (Dolan et al., 2004).

The results can be found in Table 5. SBERT is able to achieve the best performance in 5 out of 7 tasks. The average performance increases by about 2 percentage points compared to InferSent as well as the Universal Sentence Encoder. Even though transfer learning is not the purpose of SBERT, it outperforms other state-of-the-art sentence embedding methods on this task.

 

It appears that the sentence embeddings from SBERT capture sentiment information well: we observe large improvements for all sentiment tasks (MR, CR, and SST) from SentEval in comparison to InferSent and Universal Sentence Encoder.

The only dataset where SBERT is significantly worse than Universal Sentence Encoder is the TREC dataset. Universal Sentence Encoder was pre-trained on question-answering data, which appears to be beneficial for the question-type classification task of the TREC dataset.

Average BERT embeddings or using the CLS-token output from a BERT network achieved bad results for various STS tasks (Table 1), worse than average GloVe embeddings. However, for SentEval, average BERT embeddings and the BERT CLS-token output achieve decent results (Table 5), outperforming average GloVe embeddings. The reason for this is the different setups. For the STS tasks, we used cosine-similarity to estimate the similarities between sentence embeddings. Cosine-similarity treats all dimensions equally. In contrast, SentEval fits a logistic regression classifier to the sentence embeddings. This allows certain dimensions to have a higher or lower impact on the classification result.

We conclude that average BERT embeddings / CLS-token output from BERT return sentence embeddings that are infeasible to be used with cosine-similarity or with Manhattan / Euclidean distance. For transfer learning, they yield slightly worse results than InferSent or Universal Sentence Encoder. However, using the described fine-tuning setup with a siamese network structure on NLI datasets yields sentence embeddings that achieve a new state-of-the-art for the SentEval toolkit.
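The contrast between the two setups can be made concrete with a toy case. The sketch below is an illustration only: the three-dimensional vectors are invented, and the fixed `weights` vector merely mimics what a fitted classifier could learn. It is not a claim about which dimensions of real BERT outputs behave this way.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented vectors: dimension 0 is large and identical for both sentences,
# while dimensions 1 and 2 carry the actual difference.
x = np.array([10.0, 1.0, 0.0])
y = np.array([10.0, 0.0, 1.0])

cos_all = cosine(x, y)                       # dominated by dimension 0

# A learned linear model is free to give dimension 0 no influence at all.
weights = np.array([0.0, 1.0, 1.0])          # hypothetical learned weights
cos_weighted = cosine(x * weights, y * weights)

print(round(cos_all, 3), cos_weighted)
```

With every dimension weighted equally the two vectors look almost identical (cosine of about 0.99), whereas after down-weighting the uninformative dimension they are orthogonal.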

7. Ablation Study

We have demonstrated strong empirical results for the quality of SBERT sentence embeddings. In this section, we perform an ablation study of different aspects of SBERT in order to get a better understanding of their relative importance.

We evaluated different pooling strategies (MEAN, MAX, and CLS). For the classification objective function, we evaluate different concatenation methods. For each possible configuration, we train SBERT with 10 different random seeds and average the performances.

The objective function (classification vs. regression) depends on the annotated dataset. For the classification objective function, we train SBERT-base on the SNLI and the Multi-NLI dataset. For the regression objective function, we train on the training set of the STS benchmark dataset. Performances are measured on the development split of the STS benchmark dataset. Results are shown in Table 6.

 

When trained with the classification objective function on NLI data, the pooling strategy has a rather minor impact. The impact of the concatenation mode is much larger. InferSent (Conneau et al., 2017) and Universal Sentence Encoder (Cer et al., 2018) both use (u, v, |u − v|, u ∗ v) as input for a softmax classifier. However, in our architecture, adding the element-wise u ∗ v decreased the performance.

The most important component is the elementwise difference |u − v|. Note, that the concatenation mode is only relevant for training the softmax classifier. At inference, when predicting similarities for the STS benchmark dataset, only the sentence embeddings u and v are used in combination with cosine-similarity. The element-wise difference measures the distance between the dimensions of the two sentence embeddings, ensuring that similar pairs are closer and dissimilar pairs are further apart.

When trained with the regression objective function, we observe that the pooling strategy has a large impact. There, the MAX strategy perform significantly worse than MEAN or CLS-token strategy. This is in contrast to (Conneau et al., 2017), who found it beneficial for the BiLSTM-layer of InferSent to use MAX instead of MEAN pooling.

在 NLI 数据上使用分类目标函数进行训练时,池化策略的影响相当小。拼接方式的影响要大得多。 InferSent (Conneau et al., 2017) 和 Universal Sentence Encoder (Cer et al., 2018) 都使用 (u, v, |u − v|, u ∗ v) 作为 softmax 分类器的输入。然而,在我们的架构中,添加逐元素乘积 u ∗ v 会降低性能。

最重要的部分是元素差异|u - v|。请注意,连接模式仅与训练 softmax 分类器相关。在推理时,在预测 STS 基准数据集的相似性时,只有句子嵌入 u 和 v 与余弦相似性结合使用。元素差异测量两个句子嵌入维度之间的距离,确保相似对更近,不同对更远。

当用回归目标函数训练时,我们观察到池化策略有很大的影响。在那里,MAX 策略的表现明显比 MEAN 或 CLS 令牌策略差。这与 (Conneau et al., 2017) 形成对比,后者发现 InferSent 的 BiLSTM 层使用 MAX 而不是 MEAN 池化是有益的。

8. Computational Efficiency

Sentence embeddings may need to be computed for millions of sentences, so a high computation speed is desirable. In this section, we compare SBERT to average GloVe embeddings, InferSent (Conneau et al., 2017), and Universal Sentence Encoder (Cer et al., 2018).

For our comparison we use the sentences from the STS benchmark (Cer et al., 2017). We compute average GloVe embeddings using a simple for-loop with Python dictionary lookups and NumPy. InferSent is based on PyTorch. For Universal Sentence Encoder, we use the TensorFlow Hub version, which is based on TensorFlow. SBERT is based on PyTorch. For improved computation of sentence embeddings, we implemented a smart batching strategy: sentences with similar lengths are grouped together and are only padded to the longest element in a mini-batch. This drastically reduces the computational overhead from padding tokens.

Performances were measured on a server with Intel i7-5820K CPU @ 3.30GHz, Nvidia Tesla V100 GPU, CUDA 9.2 and cuDNN. The results are depicted in Table 7.

可能需要为数百万个句子计算句子嵌入,因此需要高计算速度。在本节中,我们将 SBERT 与平均 GloVe 嵌入、InferSent(Conneau 等人,2017 年)和 Universal Sentence Encoder(Cer 等人,2018 年)进行比较。

为了进行比较,我们使用 STS 基准测试中的句子(Cer 等人,2017 年)。我们使用带有 Python 字典查找和 NumPy 的简单 for 循环来计算平均 GloVe 嵌入。 InferSent 基于 PyTorch。对于 Universal Sentence Encoder,我们使用基于 TensorFlow 的 TensorFlow Hub 版本。 SBERT 基于 PyTorch。为了改进句子嵌入的计算,我们实施了一种智能批处理策略:将长度相似的句子组合在一起,并仅填充到小批量中最长的元素。这大大减少了填充令牌的计算开销。

性能是在配备 Intel i7-5820K CPU @ 3.30GHz、Nvidia Tesla V100 GPU、CUDA 9.2 和 cuDNN 的服务器上测量的。结果如表 7 所示。
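The smart batching strategy described above can be sketched in a few lines. This is a minimal illustration, not SBERT's actual implementation: the function name `smart_batches`, the `pad_id` parameter, and the toy token-id lists are all assumptions made for the example.

```python
def smart_batches(token_id_lists, batch_size, pad_id=0):
    """Yield (original_indices, padded_batch) pairs.

    Sentences are sorted by length so that each mini-batch holds sentences
    of similar length, and each batch is padded only to its own longest
    element. All names here are hypothetical.
    """
    order = sorted(range(len(token_id_lists)), key=lambda i: len(token_id_lists[i]))
    for start in range(0, len(order), batch_size):
        indices = order[start:start + batch_size]
        longest = max(len(token_id_lists[i]) for i in indices)
        batch = [
            token_id_lists[i] + [pad_id] * (longest - len(token_id_lists[i]))
            for i in indices
        ]
        yield indices, batch


# Toy input: four "sentences" given as lists of token ids.
sentences = [[5, 6, 7, 8, 9, 10], [1], [2, 3], [4, 5, 6, 7, 8]]
batches = list(smart_batches(sentences, batch_size=2))

# Count padding tokens with smart batching vs. padding everything to the global maximum.
smart_padding = sum(row.count(0) for _, batch in batches for row in batch)
global_max = max(len(s) for s in sentences)
naive_padding = sum(global_max - len(s) for s in sentences)
print(smart_padding, naive_padding)  # 2 10
```

The returned indices let the caller restore the original sentence order after encoding.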

 

On CPU, InferSent is about 65% faster than SBERT. This is due to its much simpler network architecture: InferSent uses a single BiLSTM layer, while BERT uses 12 stacked transformer layers. However, an advantage of transformer networks is their computational efficiency on GPUs. There, SBERT with smart batching is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. Smart batching achieves a speed-up of 89% on CPU and 48% on GPU. Averaging GloVe embeddings is, by a large margin, the fastest method to compute sentence embeddings.

在 CPU 上,InferSent 比 SBERT 快约 65%。 这是由于其网络架构简单得多。 InferSent 使用单个 BiLSTM 层,而 BERT 使用 12 个堆叠的 Transformer 层。 然而,transformer 网络的一个优势是 GPU 上的计算效率。 在那里,具有智能批处理的 SBERT 比 InferSent 快约 9%,比 Universal Sentence Encoder 快约 55%。 智能批处理在 CPU 上实现了 89% 的加速,在 GPU 上实现了 48% 的加速。 平均 GloVe 嵌入显然是计算句子嵌入的最快方法,且优势明显。

9. Conclusion

We showed that BERT out-of-the-box maps sentences to a vector space that is rather unsuitable to be used with common similarity measures like cosine-similarity. The performance for seven STS tasks was below the performance of average GloVe embeddings.

To overcome this shortcoming, we presented Sentence-BERT (SBERT). SBERT fine-tunes BERT in a siamese / triplet network architecture. We evaluated the quality on various common benchmarks, where it could achieve a significant improvement over state-of-the-art sentence embedding methods. Replacing BERT with RoBERTa did not yield a significant improvement in our experiments.

SBERT is computationally efficient. On a GPU, it is about 9% faster than InferSent and about 55% faster than Universal Sentence Encoder. SBERT can be used for tasks that are computationally infeasible to model with BERT. For example, clustering 10,000 sentences with hierarchical clustering requires about 65 hours with BERT, as around 50 million sentence combinations must be computed. With SBERT, we were able to reduce the effort to about 5 seconds.

我们展示了 BERT 开箱即用地将句子映射到一个向量空间,该空间非常不适合与常见的相似性度量(如余弦相似性)一起使用。七个 STS 任务的性能低于平均 GloVe 嵌入的性能。

为了克服这个缺点,我们提出了 Sentence-BERT (SBERT)。 SBERT 在 siamese/triplet 网络架构中对 BERT 进行微调。我们评估了各种常见基准的质量,在那里它可以实现对最先进的句子嵌入方法的显着改进。在我们的实验中,用 RoBERTa 替换 BERT 并没有产生显着的改进。

SBERT 在计算上是高效的。在 GPU 上,它比 InferSent 快约 9%,比 Universal Sentence Encoder 快约 55%。 SBERT 可用于在计算上无法用 BERT 建模的任务。例如,使用层次聚类对 10,000 个句子进行聚类需要 BERT 大约 65 小时,因为必须计算大约 5000 万个句子组合。使用 SBERT,我们能够将工作时间减少到大约 5 秒。

10. Notes

Introduction

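BERT has shown strong results across the major NLP tasks, and semantic textual similarity (STS) is no exception. However, to compute the similarity of two sentences, BERT requires both sentences to be fed into the model together so that they can interact, which incurs a large computational cost. For example, finding the most similar pair among 10,000 sentences requires 10000 × 9999 / 2 (about 50 million) inference passes, which takes roughly 65 hours. This design makes BERT unsuitable for semantic similarity search and for unsupervised tasks such as clustering.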
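The pair count quoted above is simple to check. The sketch below only reproduces the arithmetic; the function name `pair_count` is an assumption introduced for illustration.

```python
from math import comb


def pair_count(n: int) -> int:
    """Number of unordered sentence pairs among n sentences: n * (n - 1) / 2."""
    return comb(n, 2)


n_sentences = 10_000
pairwise_passes = pair_count(n_sentences)  # BERT: one forward pass per sentence pair
per_sentence_passes = n_sentences          # SBERT: one forward pass per sentence

print(pairwise_passes)      # 49995000, i.e. about 50 million
print(per_sentence_passes)  # 10000
```

The pairwise count grows quadratically with the number of sentences, while the per-sentence count grows linearly.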

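In practice, a question-answering system is often configured by hand with a set of common, clearly worded questions and their answers. We call these configured questions "standard questions". When a user asks a question, the system typically computes its similarity to every standard question, finds the most similar one, and returns that question's answer, which completes one round of question answering. With BERT, every incoming user question has to be run through the model together with each standard question in the library. That is too slow to deploy in a real-time interactive system.

The authors propose the Sentence-BERT architecture to address this shortcoming. Put simply, it borrows the siamese network framework: different sentences are fed into two BERT models (which share parameters, so they can be regarded as the same BERT model) to obtain a sentence representation vector for each sentence. The resulting sentence vectors can be used for semantic similarity computation as well as for unsupervised clustering. For the same 10,000 sentences, finding the most similar pair requires only 10,000 BERT forward passes, and the whole computation takes about 5 seconds.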
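Once every sentence has its own vector, finding the most similar pair is a matter of comparing vectors with cosine similarity. The following is a minimal sketch in plain Python. The toy embeddings are made up for illustration (real SBERT vectors have hundreds of dimensions), the function names are assumptions, and the vectors are assumed to be non-zero.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_pair(embeddings):
    """Return (i, j, score) for the pair of vectors with the highest cosine similarity."""
    best = None
    for i, j in combinations(range(len(embeddings)), 2):
        score = cosine_similarity(embeddings[i], embeddings[j])
        if best is None or score > best[2]:
            best = (i, j, score)
    return best


# Hypothetical 2-dimensional sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
i, j, score = most_similar_pair(toy_embeddings)
print(i, j)  # 0 1
```

The pairwise loop still visits every pair, but each comparison is a cheap vector operation rather than a full BERT forward pass.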

From 65 hours to 5 seconds: that is a striking difference.

Model

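Next we look at the Sentence-BERT architecture in more detail. The authors define three strategies for deriving a sentence vector from BERT: the CLS vector, mean pooling, and max pooling.

CLS vector strategy: the output vector of the leading [CLS] token is used as the sentence vector.

Mean pooling strategy: all token vectors that BERT outputs for the sentence are averaged, and the mean vector is used as the sentence vector.

Max pooling strategy: the element-wise maximum is taken over all token vectors that BERT outputs for the sentence, and the resulting vector is used as the sentence vector.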
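The three pooling strategies can be sketched as follows. This is a minimal illustration in plain Python under two assumptions: `token_vectors` holds one output vector per token with the [CLS] vector first, and padding tokens have already been removed. The function names are hypothetical.

```python
def cls_pooling(token_vectors):
    """Use the vector of the first ([CLS]) token as the sentence vector."""
    return list(token_vectors[0])


def mean_pooling(token_vectors):
    """Average the token vectors dimension by dimension."""
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]


def max_pooling(token_vectors):
    """Take the maximum over tokens in each dimension."""
    return [max(dim) for dim in zip(*token_vectors)]


# Three hypothetical 2-dimensional token vectors; the first stands in for [CLS].
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(cls_pooling(tokens))   # [1.0, 4.0]
print(mean_pooling(tokens))  # [2.0, 2.0]
print(max_pooling(tokens))   # [3.0, 4.0]
```

All three return a vector with the same dimension as a single token vector, so the downstream objective functions are unaffected by the choice of pooling.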

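When fine-tuning BERT, the authors also define three objective functions, each used to train and optimize the model for a different kind of task: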

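(1) Classification Objective Function

As shown in Figure 1, we obtain the sentence vectors $u$ and $v$ of the two sentences, concatenate $u$, $v$, and their element-wise difference $|u - v|$, and multiply the concatenated vector by a trainable weight $W_t \in \mathbb{R}^{3n \times k}$:

$$o = \mathrm{softmax}(W_t(u, v, |u - v|))$$

where $n$ is the dimension of the sentence vectors and $k$ is the number of labels.

Figure 1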
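The classification objective can be sketched in plain Python as follows. Sentence vectors $u$ and $v$ are concatenated with their element-wise absolute difference, multiplied by a trainable weight matrix, and passed through a softmax, as in the paper. The weight values here are made up, and for convenience the matrix is stored as $k$ rows of length $3n$ (the transpose of the $3n \times k$ layout). Both are assumptions of this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_output(u, v, weights):
    """Compute softmax(W_t (u, v, |u - v|)); weights has k rows of length 3n."""
    features = list(u) + list(v) + [abs(a - b) for a, b in zip(u, v)]
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)


# Hypothetical example with n = 2 dimensions and k = 3 labels.
u = [1.0, 0.0]
v = [0.0, 1.0]
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
]
probs = classification_output(u, v, weights)
print(len(probs))  # 3
```

During training, these probabilities would be compared with the gold label using cross-entropy loss. At inference only $u$ and $v$ are kept and compared with cosine similarity.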

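(2) Regression Objective Function

As shown in Figure 2, this objective function directly computes the cosine similarity between the sentence vectors $u$ and $v$ of the two sentences.

Figure 2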
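A minimal sketch of the regression objective, under stated assumptions: the loss is the mean squared error between the cosine similarity and a gold score, which follows the paper's description and is not spelled out in this note. The gold scores are assumed to be scaled to the range of cosine similarity, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def regression_loss(pairs, gold_scores):
    """Mean squared error between cosine similarities and gold similarity scores."""
    errors = [
        (cosine_similarity(u, v) - gold) ** 2
        for (u, v), gold in zip(pairs, gold_scores)
    ]
    return sum(errors) / len(errors)


# Two hypothetical sentence pairs: identical vectors and orthogonal vectors.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
print(regression_loss(pairs, [1.0, 0.0]))  # 0.0
print(regression_loss(pairs, [0.5, 0.5]))  # 0.25
```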

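(3) Triplet Objective Function

For this objective function the model framework is modified so that it takes three sentences as input instead of two. Given an anchor sentence $a$, a positive sentence $p$, and a negative sentence $n$, the model is tuned so that the distance between $a$ and $p$ is smaller than the distance between $a$ and $n$. That is, it minimizes the objective $o$:

$$o = \max(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\ 0)$$

where $s_x$ is the vector of sentence $x$, $\lVert \cdot \rVert$ is a distance metric, and $\epsilon$ is the margin. In the paper, the distance metric is Euclidean distance and the margin is set to 1.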
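The triplet objective with Euclidean distance and a margin of 1 can be sketched directly from its definition. The vectors below are made up for illustration and the function name is an assumption. `math.dist` requires Python 3.8 or later.

```python
import math


def triplet_loss(s_a, s_p, s_n, margin=1.0):
    """max(||s_a - s_p|| - ||s_a - s_n|| + margin, 0) with Euclidean distance."""
    return max(math.dist(s_a, s_p) - math.dist(s_a, s_n) + margin, 0.0)


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

# Positive is closer than negative by more than the margin: no loss.
print(triplet_loss(anchor, near, far))  # 0.0
# Roles swapped: the positive is farther away, so the loss is 5 - 1 + 1.
print(triplet_loss(anchor, far, near))  # 5.0
```

The loss is zero once the positive is at least one margin closer to the anchor than the negative is, so triplets that already satisfy this contribute no gradient.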

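Summary

The authors ran extensive experiments comparing the three strategies for deriving sentence vectors and found mean pooling to be the best. They also validated the results on multiple datasets. Although SBERT does not match the accuracy of BERT when both sentences are fed in together, it outperforms other sentence embedding methods and is very fast.

Personally, I think this paper has real practical value for industry.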

 

posted @ 2021-08-04 17:03  jasonzhangxianrong