
Calculate Similarity — the most relevant Metrics in a Nutshell

—— A survey of how similarity is defined and computed

Zhang Zhibin 张芷彬

Many data science techniques are based on measuring similarity and dissimilarity between objects. For example, K-Nearest-Neighbors uses similarity to classify new data objects. In unsupervised learning, K-Means is a clustering method which uses the Euclidean distance to compute the distance between the cluster centroids and its assigned data points. Recommendation engines use neighborhood-based collaborative filtering methods which identify an individual's neighbors based on their similarity/dissimilarity to other users.

I will take a look at the most relevant similarity metrics in practice. Measuring similarity between objects can be performed in a number of ways.

Generally we can divide similarity metrics into two different groups:

  1. Similarity Based Metrics:
  • Pearson’s correlation
  • Spearman’s correlation
  • Kendall’s Tau
  • Cosine similarity
  • Jaccard similarity

  2. Distance Based Metrics:
  • Euclidean distance
  • Manhattan distance


Similarity Based Metrics

Similarity-based methods identify the most similar objects as those with the highest values, since a high similarity implies they live in closer neighborhoods.

Pearson’s Correlation

Correlation is a technique for investigating the relationship between two quantitative, continuous variables, for example, age and blood pressure. Pearson’s correlation coefficient is a measure of the strength and direction of a linear relationship. We calculate this metric for the vectors x and y in the following way:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the vectors x and y.

Pearson’s correlation can take values from -1 to +1. A value of exactly 1 or -1 occurs only when the relationship is perfectly linear; variables that merely increase or decrease together are not enough to produce a correlation of 1 or -1.

Code example:

import numpy as np
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# seed the random number generator
np.random.seed(42)

# prepare data
x = np.random.randn(15)
y = x + np.random.randn(15)

# plot x and y together with a fitted regression line
plt.scatter(x, y)
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)))
plt.xlabel('x')
plt.ylabel('y')
plt.show()

# calculate Pearson's correlation
corr, _ = pearsonr(x, y)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.810
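To connect the formula above to the library call, here is a minimal NumPy sketch (my own check, not part of the original post) that computes r directly from the definition; it reuses the x and y generated above and reproduces the same value:

import numpy as np

def pearson_manual(x, y):
    # Pearson's r computed directly from the definition: the sum of products of
    # centered values divided by the product of their root sums of squares
    xm, ym = x - x.mean(), y - y.mean()
    return np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2))

print('Pearsons correlation (manual): %.3f' % pearson_manual(x, y))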

Spearman’s Correlation

Spearman’s correlation is what is known as a non-parametric statistic, that is, a statistic whose distribution does not depend on parameters (statistics that follow normal distributions or binomial distributions are examples of parametric statistics). Very often, non-parametric statistics rank the data instead of using the original values. This is true for Spearman’s correlation coefficient, which is calculated similarly to Pearson’s correlation; the difference is that Spearman’s correlation uses the rank of each value.

To calculate Spearman’s correlation, we first need to map each of our data points to ranked values:

If the raw data are [0, -5, 4, 7], the ranked values will be [2, 1, 3, 4]. We can then calculate Spearman’s correlation in the following way:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and $n$ is the number of observations.
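To make the ranking step concrete, here is a small sketch (my own illustration, not from the original article) that ranks the example data with scipy.stats.rankdata and verifies that Spearman's correlation is simply Pearson's correlation computed on the ranks; the variables a and b are invented for the demo:

import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

# the ranking step from the text: raw data [0, -5, 4, 7] -> ranks [2, 1, 3, 4]
print(rankdata([0, -5, 4, 7]))   # [2. 1. 3. 4.]

# Spearman's correlation equals Pearson's correlation computed on the ranks
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = a ** 3                       # monotonic but clearly non-linear
corr_s, _ = spearmanr(a, b)
corr_p, _ = pearsonr(rankdata(a), rankdata(b))
print('Spearman: %.3f  Pearson on ranks: %.3f' % (corr_s, corr_p))   # both 1.000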

Spearman’s correlation measures monotonic relationships, so it can be perfect (+1 or -1) even when the relationship is not linear. It can take values from -1 to +1. The following plot clarifies the difference between Pearson’s and Spearman’s correlation.

Source: Wikipedia

For data exploration, I recommend calculating both Pearson’s and Spearman’s correlation, since comparing them can lead to interesting findings. If S > P (as in the plot above), the relationship is monotonic but not linear. Since linearity simplifies fitting a regression model to the dataset, we might want to transform the non-linear, monotonic data with a log transformation so that it appears linear.

Code example:

from scipy.stats import spearmanr
# calculate Spearman's correlation
corr, _ = spearmanr(x, y)
print('Spearmans correlation: %.3f' % corr)

Spearmans correlation: 0.836
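As an illustration of the S > P case and the suggested log-transformation (a sketch of my own with synthetic data, not part of the original post): a monotonic but exponential relationship gives a Spearman correlation of 1 while the Pearson correlation stays clearly lower, and log-transforming the dependent variable restores linearity:

import numpy as np
from scipy.stats import pearsonr, spearmanr

np.random.seed(0)
u = np.random.uniform(1, 10, 200)
v = np.exp(u)                     # monotonic, strongly non-linear

print('Pearson: %.3f' % pearsonr(u, v)[0])                               # noticeably below 1
print('Spearman: %.3f' % spearmanr(u, v)[0])                             # exactly 1 (monotonic)
print('Pearson after log-transform: %.3f' % pearsonr(u, np.log(v))[0])   # 1.000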

Cosine Similarity


The cosine similarity calculates the cosine of the angle between two vectors. In order to calculate the cosine similarity we use the following formula:

$$\cos(\theta) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

Recall the cosine function: on the left the red vectors point at different angles and the graph on the right shows the resulting function.

Accordingly, the cosine similarity can take on values between -1 and +1. If the vectors point in the exact same direction, the cosine similarity is +1. If the vectors point in opposite directions, the cosine similarity is -1.

The cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another irrespective of their size. The TF-IDF text analysis technique helps convert documents into vectors, where each value in the vector corresponds to the TF-IDF score of a word in the document. Each word has its own axis, and the cosine similarity then determines how similar the documents are.

Implementation in Python

We need to reshape the vectors x and y using .reshape(1, -1) to compute the cosine similarity for a single sample.

from sklearn.metrics.pairwise import cosine_similarity
cos_sim = cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
print('Cosine similarity: %.3f' % cos_sim[0][0])

Cosine similarity: 0.773
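For the text-analysis use case described above, a minimal sketch (the three toy documents are invented for this illustration) using scikit-learn's TfidfVectorizer together with cosine_similarity looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)   # one TF-IDF vector per document
sim = cosine_similarity(tfidf)                  # pairwise cosine similarities
print(sim.round(3))

The first two documents share several terms and therefore end up much more similar to each other than to the third one.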

Jaccard Similarity


Whereas the cosine similarity is used to compare two real-valued vectors, the Jaccard similarity is used to compare two binary vectors (sets).

In set theory it is often helpful to see a visualization of the formula:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The Jaccard similarity divides the size of the intersection by the size of the union of the sample sets.

Both cosine similarity and Jaccard similarity are common metrics for calculating text similarity. Calculating the Jaccard similarity is computationally more expensive, as it matches all the terms of one document against another document. The Jaccard similarity turns out to be useful for detecting duplicates.

Implementation in Python

from sklearn.metrics import jaccard_score
A = [1, 1, 1, 0]
B = [1, 1, 0, 1]
jacc = jaccard_score(A, B)
print('Jaccard similarity: %.3f' % jacc)

Jaccard similarity: 0.500
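Because the Jaccard similarity is defined on sets, it can also be computed directly on the term sets of two documents, which is how it is typically used for duplicate detection. A plain-Python sketch (the helper name jaccard_text and the sentences are invented for this example):

def jaccard_text(doc_a, doc_b):
    # Jaccard similarity on the sets of terms of two documents
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b)

print('Jaccard similarity: %.3f' % jaccard_text("the cat sat on the mat",
                                                "the cat sat on a mat"))   # 0.833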

Distance Based Metrics

Distance-based methods identify the most similar objects as those with the lowest values, since a small distance means the objects are close to each other.

Euclidean Distance


The Euclidean distance is a straight-line distance between two vectors.

For the two vectors x and y, this can be computed as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Compared to the cosine and Jaccard similarity, the Euclidean distance is not used very often in the context of NLP applications. It is appropriate for continuous numerical variables. The Euclidean distance is not scale invariant, so scaling the data prior to computing the distance is recommended. Additionally, the Euclidean distance amplifies the effect of redundant information in the dataset: if five variables are heavily correlated and we take all five as input, we effectively weight this redundancy effect by five.

Implementation in Python

from scipy.spatial import distance
dst = distance.euclidean(x, y)
print('Euclidean distance: %.3f' % dst)

Euclidean distance: 3.273
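Since the Euclidean distance is not scale invariant, standardizing the features first can change the result considerably. A small sketch of the recommended scaling step (the feature values are invented for this illustration) using scikit-learn's StandardScaler:

import numpy as np
from scipy.spatial import distance
from sklearn.preprocessing import StandardScaler

# two features on very different scales, e.g. height in cm and income in USD
X = np.array([[170.0, 30000.0],
              [175.0, 31000.0],
              [172.0, 90000.0]])

# the raw distance is dominated by the large-scale income column
print('raw:    %.1f' % distance.euclidean(X[0], X[1]))

# standardizing first puts both features on a comparable footing
X_std = StandardScaler().fit_transform(X)
print('scaled: %.3f' % distance.euclidean(X_std[0], X_std[1]))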

Manhattan Distance


Different from the Euclidean distance is the Manhattan distance, also called ‘cityblock’ distance, from one vector to another. You can imagine this metric as a way to compute the distance between two points when you are not able to go through buildings.

We calculate the Manhattan distance as follows:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

The green line gives you the Euclidean distance, while the purple line gives you the Manhattan distance.

Source: Quora

In many ML applications Euclidean distance is the metric of choice. However, for high dimensional data Manhattan distance is preferable as it yields more robust results.

Implementation in Python

from scipy.spatial import distance
dst = distance.cityblock(x, y)
print('Manhattan distance: %.3f' % dst)

Manhattan distance: 10.468
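As a quick sanity check connecting the formula above to the SciPy call (my own addition, reusing the x and y from the earlier code blocks), the same value can be computed directly as the sum of absolute coordinate differences:

import numpy as np

# Manhattan distance computed from the definition; matches distance.cityblock(x, y)
print('Manhattan distance (manual): %.3f' % np.sum(np.abs(x - y)))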
