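A simple way to understand collaborative filtering: people with similar interests tend to like similar things, and items with similar attributes can be recommended to people who like items of that kind. For example, if user A likes martial-arts films and user B also likes martial-arts films, then the martial-arts films that A liked and B has not seen can be recommended to B, and vice versa. This approach is called user-based collaborative filtering (User-User Collaborative Filtering Recommendation). As another example, if user A bought "Core Java, Volume I", then "Core Java, Volume II" and "Thinking in Java" can be recommended to that user. This approach is called item-based collaborative filtering (Item-Item Collaborative Filtering Recommendation).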
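Below are the recommendations Amazon shows when viewing the book "Core Java, Volume I":
[Image: Amazon recommendation results for "Core Java, Volume I"]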
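
To make the user-based idea concrete, here is a minimal sketch using plain Python sets. The users and film names are hypothetical placeholders, not data from this post. Each user is recommended whatever the other has liked and they have not yet seen.

```python
# Hypothetical viewing histories for two users with similar taste
liked_by_a = {"film_1", "film_2", "film_3"}
liked_by_b = {"film_2", "film_3", "film_4"}

# Recommend to B what A liked and B has not seen, and vice versa
recommend_to_b = liked_by_a - liked_by_b
recommend_to_a = liked_by_b - liked_by_a

print(sorted(recommend_to_b))
print(sorted(recommend_to_a))
```

A real recommender also has to decide how similar two users are before trusting one user's taste for the other, which is what the similarity measures in the rest of this post are for.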

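Following the book "Programming Collective Intelligence", the rest of this post implements user similarity calculation and recommendation based on Euclidean distance and on the Pearson correlation.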

Dataset: a table of the ratings users gave to movies (- means the user has not rated that movie):

Critic    Lady in the Water  Snakes on a Plane  Just My Luck  Superman Returns  You, Me and Dupree  The Night Listener
Lisa      2.5                3.5                3.0           3.5               2.5                 3.0
Gene      3.0                3.5                1.5           5.0               3.5                 3.0
Michael   2.5                3.0                -             3.5               -                   4.0
Claudia   -                  3.5                3.0           4.0               2.5                 4.5
Mick      3.0                4.0                2.0           3.0               2.0                 3.0
Jack      3.0                4.0                -             5.0               3.5                 3.0
Toby      -                  4.5                -             4.0               1.0                 -

Building the data dictionary


critics={'Lisa': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 
 'The Night Listener': 3.0},
'Gene': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 
 'You, Me and Dupree': 3.5}, 
'Michael': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0, 
 'You, Me and Dupree': 2.5},
'Mick': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0}, 
'Jack': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

Euclidean distance


from math import sqrt

# Return a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):

  # Collect the items rated by both people
  shared_items = [item for item in prefs[person1] if item in prefs[person2]]

  # With no ratings in common, return 0
  if len(shared_items) == 0:
    return 0

  # Sum the squared rating differences over the shared items
  sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                        for item in shared_items])

  # Map the distance into (0, 1]; identical ratings give 1
  return 1 / (1 + sqrt(sum_of_squares))

Compute the Euclidean-distance similarity between Lisa and Gene:

print(sim_distance(critics,'Lisa','Gene'))

Result:

0.294298055086
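
As a sanity check on that number, the arithmetic can be redone by hand. The sketch below is self-contained and copies only Lisa's and Gene's ratings from the dictionary. The variable names are my own, and it is meant as an illustration of the formula 1 / (1 + sqrt(sum of squared differences)) rather than as a replacement for the function in the post.

```python
from math import sqrt

# Ratings copied from the critics dictionary for the two users being compared
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'You, Me and Dupree': 3.5, 'The Night Listener': 3.0}

# Squared differences over the movies both have rated:
# 0.25 + 0 + 2.25 + 2.25 + 1.0 + 0 = 5.75
shared = [movie for movie in lisa if movie in gene]
sum_of_squares = sum((lisa[movie] - gene[movie]) ** 2 for movie in shared)

# Identical ratings give 1.0; the score shrinks toward 0 as the distance grows
similarity = 1 / (1 + sqrt(sum_of_squares))

print(sum_of_squares)
print(similarity)
```

The sum of squared differences is 5.75, and 1 / (1 + sqrt(5.75)) is approximately 0.2943, which agrees with the result above.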

Pearson correlation coefficient

# Return the Pearson correlation score for p1 and p2
def sim_pearson(prefs,p1,p2):
  # Get the list of mutually rated items
  si={}
  for item in prefs[p1]: 
    if item in prefs[p2]: si[item]=1

  # if they are no ratings in common, return 0
  if len(si)==0: return 0

  # Sum calculations
  n=len(si)

  # Sums of all the preferences
  sum1=sum([prefs[p1][it] for it in si])
  sum2=sum([prefs[p2][it] for it in si])

  # Sums of the squares
  sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
  sum2Sq=sum([pow(prefs[p2][it],2) for it in si]) 

  # Sum of the products
  pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])

  # Calculate r (Pearson score)
  num=pSum-(sum1*sum2/n)
  den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
  if den==0: return 0

  r=num/den

  return r

Compute the Pearson correlation coefficient between Lisa and Gene:

print(sim_pearson(critics,'Lisa','Gene'))

Result:

0.396059017191
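
The function above uses the running-sums form of the Pearson formula. As a cross-check, the sketch below computes the same coefficient in the mean-centered textbook form (covariance divided by the product of the standard deviations). The helper name pearson_centered is hypothetical, and the two lists hold Lisa's and Gene's ratings in the same movie order as the table.

```python
from math import sqrt

# Ratings in the order: Lady in the Water, Snakes on a Plane, Just My Luck,
# Superman Returns, You, Me and Dupree, The Night Listener
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]

def pearson_centered(xs, ys):
    """Pearson r computed from deviations around each list's mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(var_x * var_y)
    # A constant list has no variance, so the correlation is undefined; return 0
    return cov / den if den else 0.0

r = pearson_centered(lisa, gene)
print(r)
```

Both forms are algebraically equivalent, so this should agree with sim_pearson up to floating-point rounding. Unlike the Euclidean score, Pearson is insensitive to one critic rating consistently higher than another, because each critic's mean is subtracted first.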

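Selecting the top N

# Return the best matches for person from the preference dictionary
# The number of results and the similarity function are both optional parameters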
def topMatches(prefs, person, n=5, similarity=sim_pearson):
  scores = [(similarity(prefs, person, other), other)
      for other in prefs if other != person]
  scores.sort(reverse=True)
  return scores[0:n]

Return the 4 users whose tastes are most similar to Toby's:

print(topMatches(critics,'Toby',n=4))

Result:

yaopans-MacBook-Pro:ucas01 yaopan$ python  recommend.py 
[(0.9912407071619299, 'Lisa'), (0.9244734516419049, 'Mick'), (0.8934051474415647, 'Claudia'), (0.66284898035987, 'Jack')]
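
topMatches sorts every score and then slices off the first n. With only seven critics that is fine. For a much larger user base, one possible alternative (my suggestion, not something the post or the book uses) is the standard-library heapq.nlargest, which keeps only the n best entries. The sketch below applies it to the four (score, name) tuples printed above, deliberately shuffled.

```python
import heapq

# (similarity, name) pairs taken from the output above, in arbitrary order
scores = [
    (0.66284898035987, 'Jack'),
    (0.9912407071619299, 'Lisa'),
    (0.8934051474415647, 'Claudia'),
    (0.9244734516419049, 'Mick'),
]

# Tuples compare by their first element first, so this ranks by similarity
top2 = heapq.nlargest(2, scores)
print(top2)
```

The result is ordered from highest to lowest score, just like the sorted-and-sliced list.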

User-based recommendation

def getRecommendations(prefs,person,similarity=sim_pearson):
  totals={}
  simSums={}
  for other in prefs:
    # Compare against everyone else; skip the person themselves
    if other==person: continue
    sim=similarity(prefs,person,other)

    # Ignore similarity scores of zero or lower
    if sim<=0: continue
    for item in prefs[other]:

      # Only score movies this person has not seen yet
      if item not in prefs[person] or prefs[person][item]==0:
        # Similarity * rating
        totals.setdefault(item,0)
        totals[item]+=prefs[other][item]*sim
        # Sum of similarities
        simSums.setdefault(item,0)
        simSums[item]+=sim

  # Build the normalized list: weighted total divided by the similarity sum
  rankings=[(total/simSums[item],item) for item,total in totals.items()]

  # Return the list sorted from highest to lowest predicted rating
  rankings.sort(reverse=True)
  return rankings

Recommendations for Toby:

print(getRecommendations(critics,'Toby'))

Recommendation results

yaopans-MacBook-Pro:ucas01 yaopan$ python  recommend.py 
[(3.3477895267131013, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.5309807037655645, 'Just My Luck')]
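
To see where the top score comes from, the sketch below redoes the weighted average for The Night Listener by hand. Four of the similarity values are the ones printed by topMatches earlier. Gene's similarity to Toby does not appear in that output, so the rounded value used here is my own calculation with the same Pearson formula and should be treated as an assumption. Michael also rated the movie, but by my calculation his correlation with Toby is negative, so getRecommendations skips him.

```python
# Pearson similarity between Toby and each critic who rated the movie
similarity_to_toby = {
    'Lisa': 0.9912407071619299,
    'Mick': 0.9244734516419049,
    'Claudia': 0.8934051474415647,
    'Jack': 0.66284898035987,
    'Gene': 0.3812464,  # assumption: computed separately, rounded
}

# Each critic's rating of The Night Listener, from the critics dictionary
night_listener_ratings = {
    'Lisa': 3.0, 'Mick': 3.0, 'Claudia': 4.5, 'Jack': 3.0, 'Gene': 3.0,
}

# Weighted sum of ratings, then normalize by the sum of the weights
total = sum(similarity_to_toby[c] * night_listener_ratings[c]
            for c in night_listener_ratings)
sim_sum = sum(similarity_to_toby[c] for c in night_listener_ratings)
predicted = total / sim_sum

print(predicted)
```

Dividing by the sum of similarities matters: without it, a movie rated by many critics would score higher simply because more terms were added up.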

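Item-based recommendation

Item-based recommendation works like user-based recommendation, with the roles of items and users swapped.
Transformation function: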

def transformPrefs(prefs):
  result={}
  for person in prefs:
    for item in prefs[person]:
      result.setdefault(item,{})
      result[item][person]=prefs[person][item]
  return result
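
To illustrate what the transformation does, here is a small self-contained sketch on a two-critic subset of the data. It uses a hypothetical helper named transpose_prefs, written with collections.defaultdict instead of setdefault, and it is intended to behave the same way as transformPrefs above.

```python
from collections import defaultdict

def transpose_prefs(prefs):
    """Turn {person: {item: score}} into {item: {person: score}}."""
    result = defaultdict(dict)
    for person, ratings in prefs.items():
        for item, score in ratings.items():
            result[item][person] = score
    return dict(result)

# A two-critic subset of the critics dictionary
sample = {
    'Jack': {'Superman Returns': 5.0, 'Snakes on a Plane': 4.0},
    'Toby': {'Superman Returns': 4.0, 'Snakes on a Plane': 4.5},
}

by_movie = transpose_prefs(sample)
print(by_movie['Superman Returns'])
```

After the swap each movie maps to the critics who rated it, so the same similarity functions can compare movies instead of people.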

Return the movies most similar to Superman Returns:

movies=transformPrefs(critics)
print("Movies similar to Superman Returns:")
print(topMatches(movies,'Superman Returns'))

Result:

[(0.6579516949597695, 'You, Me and Dupree'), (0.4879500364742689, 'Lady in the Water'), (0.11180339887498941, 'Snakes on a Plane'), (-0.1798471947990544, 'The Night Listener'), (-0.42289003161103106, 'Just My Luck')]

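Movies with a negative score are the least related: a negative correlation means that critics who rated Superman Returns highly tended to give that movie a low rating.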
Code download:
recommend.py