CS100.1x-lab3_text_analysis_and_entity_resolution_student

This assignment is called Text Analysis and Entity Resolution, and it is considerably harder than the previous ones. The related ipynb file can be found on my GitHub.

Entity resolution is an important and difficult problem in data cleaning and data integration. In this assignment we apply Apache Spark and text-analysis techniques to entity resolution, which means finding records from different data sources that refer to the same entity; this step is necessary whenever data sources are fused together.

The data for this assignment comes from the metric-learning project and mainly consists of the following files:

  • Google.csv, the Google Products dataset
  • Amazon.csv, the Amazon dataset
  • Google_small.csv, 200 records sampled from the Google data
  • Amazon_small.csv, 200 records sampled from the Amazon data
  • Amazon_Google_perfectMapping.csv, the "gold standard" mapping
  • stopwords.txt, a list of common English words

In addition, the assignment provides some sample data for Part 1 and a table that maps every Google entity to its Amazon counterpart; this "gold standard" table is used to evaluate the performance of the algorithm.

Part 0 Preliminaries

Next we read the Google and Amazon data and convert them into RDDs. The two datasets have the following formats.

The file format of an Amazon line is:                                 
"id","title","description","manufacturer","price"                                
The file format of a Google line is:                             
"id","name","description","manufacturer","price"               

In this step we extract the ID column. In the Google dataset the ID is a URL, while in the Amazon dataset it is a string of digits and letters. Our first task is to turn the data into a pair RDD, where the ID is the key and the name/title, description, and manufacturer fields form the value.

import re
DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'

def removeQuotes(s):
    """ Remove quotation marks from an input string
    Args:
        s (str): input string that might have the quote "" characters
    Returns:
        str: a string without the quote characters
    """
    return ''.join(i for i in s if i!='"')


def parseDatafileLine(datafileLine):
    """ Parse a line of the data file using the specified regular expression pattern
    Args:
        datafileLine (str): input string that is a line from the data file
    Returns:
        str: a string parsed using the given regular expression and without the quote characters
    """
    match = re.search(DATAFILE_PATTERN, datafileLine)
    if match is None:
        print 'Invalid datafile line: %s' % datafileLine
        return (datafileLine, -1)
    elif match.group(1) == '"id"':
        print 'Header datafile line: %s' % datafileLine
        return (datafileLine, 0)
    else:
        product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
        return ((removeQuotes(match.group(1)), product), 1)
import sys
import os
from test_helper import Test

baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab3')

GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
STOPWORDS_PATH = 'stopwords.txt'

def parseData(filename):
    """ Parse a data file
    Args:
        filename (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed lines
    """
    return (sc
            .textFile(filename, 4, 0)
            .map(parseDatafileLine)
            .cache())

def loadData(path):
    """ Load a data file
    Args:
        path (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed valid lines
    """
    filename = os.path.join(baseDir, inputPath, path)
    raw = parseData(filename).cache()
    failed = (raw
              .filter(lambda s: s[1] == -1)
              .map(lambda s: s[0]))
    for line in failed.take(10):
        print '%s - Invalid datafile line: %s' % (path, line)
    valid = (raw
             .filter(lambda s: s[1] == 1)
             .map(lambda s: s[0])
             .cache())
    print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
                                                                                        raw.count(),
                                                                                        valid.count(),
                                                                                        failed.count())
    assert failed.count() == 0
    assert raw.count() == (valid.count() + 1)
    return valid

googleSmall = loadData(GOOGLE_SMALL_PATH)
google = loadData(GOOGLE_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)
amazon = loadData(AMAZON_PATH)

Running this code gives the following output.

Google_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Google.csv - Read 3227 lines, successfully parsed 3226 lines, failed to parse 0 lines
Amazon_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Amazon.csv - Read 1364 lines, successfully parsed 1363 lines, failed to parse 0 lines

Let's run this code to see what the data looks like.

for line in googleSmall.take(3):
    print 'google: %s: %s\n' % (line[0], line[1])

for line in amazonSmall.take(3):
    print 'amazon: %s: %s\n' % (line[0], line[1])
google: http://www.google.com/base/feeds/snippets/11448761432933644608: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!" 

google: http://www.google.com/base/feeds/snippets/8175198959985911471: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..." 

google: http://www.google.com/base/feeds/snippets/18445827127704822533: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"

amazon: b000jz4hqo: clickart 950 000 - premier image pack (dvd-rom)  "broderbund"

amazon: b0006zf55o: ca international - arcserve lap/desktop oem 30pk "oem arcserve backup v11.1 win 30u for laptops and desktops" "computer associates"

amazon: b00004tkvy: noah's ark activity center (jewel case ages 3-8)  "victory multimedia"

Part 1 ER as Text Similarity - Bags of Words

When resolving entities we often treat every record as one string and compute similarities between the strings. Here we use the bag-of-words approach, a simple and effective method in text analysis. The core idea is to treat a document as an unordered collection of words, or tokens. A token is the smallest unit we obtain after splitting the document; it can be a word, a number, an abbreviation, and so on.

When comparing the similarity of two documents, we look at how many tokens the two documents share. Likewise, when searching documents by keyword, we can simply check whether the tokenized document contains that key. The advantage of this approach is that it is fairly robust to word order and punctuation. A minimal sketch of the idea is shown below.
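
A minimal sketch of this bag-of-words comparison in plain Python, using two made-up strings (no Spark needed):

doc1 = set('the quick brown fox'.split())
doc2 = set('the lazy brown dog'.split())
shared = doc1 & doc2               # tokens the two documents have in common
print 'shared tokens:', shared     # set(['the', 'brown']) -- 2 tokens in common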

Tokenize a String

Now the graded part begins: whenever a comment contains TODO, that is a piece we have to implement ourselves. The function to implement here turns a string into a list of tokens; note that all tokens must be converted to lowercase.

# TODO: Replace <FILL IN> with appropriate code
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'

def simpleTokenize(string):
    """ A simple implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of tokens
    """
    return [x for x in filter(lambda x:len(x) > 0, re.split(split_regex,string.lower()))]

print simpleTokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]

This one is a bit tricky, so let me explain. filter(function, sequence) applies function(item) to each item in sequence and returns the items for which the result is True, collected into a list/string/tuple depending on the type of sequence. Here the function is lambda x: len(x) > 0 and the sequence is re.split(split_regex, string.lower()). Note that re.split behaves differently from str.split:

>>> 'hello, world'.split(',')
['hello', ' world']
>>> re.split(r'\W+', 'hello, world')
['hello', 'world']

Removing stopwords

In English, stopwords are words that add little to the meaning of a sentence, such as "the", "a", "is", "to". In the bag-of-words approach they are noise: because these words are so common, two unrelated sentences may be judged similar just because they share many stopwords. The environment provides a stopwords file; we read it, convert it to a set, and then just use in to test membership.

# TODO: Replace <FILL IN> with appropriate code
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords
    Args:
        string (str): input string
    Returns:
        list: a list of tokens without stopwords
    """
    return [x for x in simpleTokenize(string) if x not in stopwords]

print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]

Tokenizing the small datasets

This pulls the previous pieces together. To count all the tokens in a dataset, we simply add up the number of tokens in each record.

# TODO: Replace <FILL IN> with appropriate code
amazonRecToToken = amazonSmall.map(lambda (a,b): (a,tokenize(b)))
googleRecToToken = googleSmall.map(lambda (a,b): (a,tokenize(b)))

def countTokens(vendorRDD):
    """ Count and return the number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokenizedValue)): Pair tuple of record ID to tokenized output
    Returns:
        count: count of all tokens
    """
    return vendorRDD.map(lambda x: len(x[1])).reduce(lambda x,y: x+y)

totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
print 'There are %s tokens in the combined datasets' % totalTokens

Amazon record with the most tokens

Time to sort again, only this time we sort by the length of the value (the token list), from largest to smallest.

# TODO: Replace <FILL IN> with appropriate code
def findBiggestRecord(vendorRDD):
    """ Find and return the record with the largest number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokens)): input Pair Tuple of record ID and tokens
    Returns:
        list: a list of 1 Pair Tuple of record ID and tokens
    """
    return vendorRDD.takeOrdered(1, key=lambda x: -1*len(x[1]))

biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0], len(biggestRecordAmazon[0][1]))

Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF

Bag of words does not work that well in practice, because different words carry different importance in a document; put mathematically, they should have different weights. Using raw frequency alone as the weight is not sound, which is why the TF-IDF scheme exists. I recommend Ruan Yifeng's article TF-IDF与余弦相似性的应用, which explains it in very accessible terms. The formulas used in this lab are summarized below.
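
Concretely, the weighting implemented in this part works as follows (matching the tf and idfs functions defined below):

TF(t, d)     = (occurrences of t in document d) / (total number of tokens in d)
IDF(t)       = N / n(t), where N is the number of documents in the corpus and n(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) * IDF(t)

Note that this lab uses the raw ratio N / n(t) for IDF rather than the more common log-scaled variant.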

Implement a TF function

Here we implement the TF function. The input is a list of strings. We count how many times each word appears and the total number of words, then divide each word's count by the total; that is the TF in TF-IDF.

# TODO: Replace <FILL IN> with appropriate code
def tf(tokens):
    """ Compute TF
    Args:
        tokens (list of str): input list of tokens from tokenize
    Returns:
        dictionary: a dictionary of tokens to its TF values
    """
    dic = {}
    count = 0
    for word in tokens:
        if(word in dic):
            dic[word] += 1
        else:
            dic[word] = 1
        count += 1
    for key in dic:
        dic[key] = float(dic[key])/count
    return dic

print tf(tokenize(quickbrownfox)) # Should give { 'quick': 0.1666 ... }

Create a corpus

Here we merge the two RDDs into one; union() is all we need.

# TODO: Replace <FILL IN> with appropriate code
corpusRDD = amazonRecToToken.union(googleRecToToken)

Implement an IDFs function

The IDF computation needs the full corpus, which is why we combined the data in the previous step. You need to understand IDF to get this right: for each token, IDF is the total number of documents divided by the number of documents containing that token (for example, with 400 documents and a token appearing in 40 of them, its IDF is 10).

# TODO: Replace <FILL IN> with appropriate code
def idfs(corpus):
    """ Compute IDF
    Args:
        corpus (RDD): input corpus
    Returns:
        RDD: a RDD of (token, IDF value)
    """
    N = corpus.count()
    uniqueTokens = corpus.map(lambda x:(x[0],list(set(x[1]))))
    tokenCountPairTuple = uniqueTokens.flatMap(lambda x:x[1]).map(lambda x: (x,1))
    tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a,b : a+b)
    return (tokenSumPairTuple.map(lambda x:(x[0],N/float(x[1]))))

idfsSmall = idfs(amazonRecToToken.union(googleRecToToken))
uniqueTokenCount = idfsSmall.count()

print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount

Tokens with the smallest IDF

smallIDFTokens = idfsSmall.takeOrdered(11, lambda s: s[1])
print smallIDFTokens

IDF Histogram

import matplotlib.pyplot as plt

small_idf_values = idfsSmall.map(lambda s: s[1]).collect()
fig = plt.figure(figsize=(8,3))
plt.hist(small_idf_values, 50, log=True)
pass

Implement a TF-IDF function

This step ties the functions above together.

# TODO: Replace <FILL IN> with appropriate code
def tfidf(tokens, idfs):
    """ Compute TF-IDF
    Args:
        tokens (list of str): input list of tokens from tokenize
        idfs (dictionary): record to IDF value
    Returns:
        dictionary: a dictionary of records to TF-IDF values
    """
    tfs = tf(tokens)
    tfIdfDict = {t:tfs[t]*idfs[t] for t in tfs}
    return tfIdfDict

recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
idfsSmallWeights = idfsSmall.collectAsMap()
rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)

print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights

Part 3 ER as Text Similarity - Cosine Similarity

For the background on cosine similarity, again see Ruan Yifeng's article TF-IDF与余弦相似性的应用; the formula is recalled below.
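
For reference, the quantity implemented in the next cells is

cossim(a, b) = dotprod(a, b) / (norm(a) * norm(b))

where dotprod(a, b) sums a[k] * b[k] over the keys k shared by both dictionaries, and norm(a) = sqrt(dotprod(a, a)).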

Implement the components of a cosineSimilarity function

Implementing cosine similarity takes three steps: compute the dot product of the two vectors, compute the length (norm) of a vector, and then combine the two.

# TODO: Replace <FILL IN> with appropriate code
import math

def dotprod(a, b):
    """ Compute dot product
    Args:
        a (dictionary): first dictionary of record to value
        b (dictionary): second dictionary of record to value
    Returns:
        dotProd: result of the dot product with the two input dictionaries
    """  
    return sum(a[k]*b[k]for k in a if k in b)

def norm(a):
    """ Compute square root of the dot product
    Args:
        a (dictionary): a dictionary of record to value
    Returns:
        norm: the square root of the dot product of the dictionary with itself
    """
    count=0
    for key in a:
        count += a[key]*a[key]
    return math.sqrt(count)

def cossim(a, b):
    """ Compute cosine similarity
    Args:
        a (dictionary): first dictionary of record to value
        b (dictionary): second dictionary of record to value
    Returns:
        cossim: dot product of two dictionaries divided by the norm of the first dictionary and
                then by the norm of the second dictionary
    """
    return dotprod(a,b)/(norm(a)*norm(b))

testVec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
testVec2 = {'foo': 1, 'bar': 0, 'baz': 20 }
dp = dotprod(testVec1, testVec2)
nm = norm(testVec1)
print dp, nm

Implement a cosineSimilarity function

# TODO: Replace <FILL IN> with appropriate code
def cosineSimilarity(string1, string2, idfsDictionary):
    """ Compute cosine similarity between two strings
    Args:
        string1 (str): first string
        string2 (str): second string
        idfsDictionary (dictionary): a dictionary of IDF values
    Returns:
        cossim: cosine similarity value
    """
    w1 = tfidf(tokenize(string1),idfsDictionary) 
    w2 = tfidf(tokenize(string2),idfsDictionary) 
    return cossim(w1, w2)

cossimAdobe = cosineSimilarity('Adobe Photoshop',
                               'Adobe Illustrator',
                               idfsSmallWeights)

print cossimAdobe

Perform Entity Resolution

Here we compute the similarity between the records in the Google data and the records in the Amazon data, saving the results with (Google URL, Amazon ID) as the key and the cosine similarity as the value. We will compute this in two ways; the first does not use a broadcast variable.

This takes three steps: 1. compute all the pairs, in the format [ ((Google URL1, Google String1), (Amazon ID1, Amazon String1)), ((Google URL1, Google String1), (Amazon ID2, Amazon String2)), ((Google URL2, Google String2), (Amazon ID1, Amazon String1)), ... ]; 2. write a function that computes the cosine similarity for such a pair; 3. apply that function to the RDD.

# TODO: Replace <FILL IN> with appropriate code
crossSmall = (googleSmall
              .cartesian(amazonSmall)
              .cache())

def computeSimilarity(record):
    """ Compute similarity on a combination record
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallWeights)
    return (googleURL, amazonID, cs)

similarities = (crossSmall
                .map(lambda line:computeSimilarity(line))
                .cache())


def similar(amazonID, googleURL):
    """ Return similarity value
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similarities
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogle = similar('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogle

Perform Entity Resolution with Broadcast Variables

The step above is fine for a small dataset, but when the data gets large, Spark has to send the computed IDF weights to every worker. If we had not cached the results, Spark might recompute the similarities, which would mean sending the IDF values over and over again.

So we use a broadcast variable to solve this: the value only needs to be sent to each worker once. The code is almost the same as in the previous step.

# TODO: Replace <FILL IN> with appropriate code
def computeSimilarityBroadcast(record):
    """ Compute similarity on a combination record, using Broadcast variable
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallBroadcast.value)
    return (googleURL, amazonID, cs)

idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)
similaritiesBroadcast = (crossSmall
                         .map(lambda record:computeSimilarityBroadcast(record))
                         .cache())

def similarBroadcast(amazonID, googleURL):
    """ Return similarity value, computed using Broadcast variable
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similaritiesBroadcast
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogleBroadcast = similarBroadcast('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogleBroadcast

Perform a Gold Standard evaluation

Next we use the gold standard data to answer some questions; first we read and parse it.

GOLDFILE_PATTERN = '^(.+),(.+)'

# Parse each line of a data file using the specified regular expression pattern
def parse_goldfile_line(goldfile_line):
    """ Parse a line from the 'golden standard' data file
    Args:
        goldfile_line: a line of data
    Returns:
        pair: ((key, 'gold', 1 if successful or else 0))
    """
    match = re.search(GOLDFILE_PATTERN, goldfile_line)
    if match is None:
        print 'Invalid goldfile line: %s' % goldfile_line
        return (goldfile_line, -1)
    elif match.group(1) == '"idAmazon"':
        print 'Header datafile line: %s' % goldfile_line
        return (goldfile_line, 0)
    else:
        key = '%s %s' % (removeQuotes(match.group(1)), removeQuotes(match.group(2)))
        return ((key, 'gold'), 1)

goldfile = os.path.join(baseDir, inputPath, GOLD_STANDARD_PATH)
gsRaw = (sc
         .textFile(goldfile)
         .map(parse_goldfile_line)
         .cache())

gsFailed = (gsRaw
            .filter(lambda s: s[1] == -1)
            .map(lambda s: s[0]))
for line in gsFailed.take(10):
    print 'Invalid goldfile line: %s' % line

goldStandard = (gsRaw
                .filter(lambda s: s[1] == 1)
                .map(lambda s: s[0])
                .cache())

print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (gsRaw.count(),
                                                                                 goldStandard.count(),
                                                                                 gsFailed.count())
assert (gsFailed.count() == 0)
assert (gsRaw.count() == (goldStandard.count() + 1))

Next we use join() to combine the similarity RDD computed earlier with the gold standard RDD, count how many pairs there are, and compute the average similarity of those true duplicates. Then we compute the average similarity of the pairs that did not match.

# TODO: Replace <FILL IN> with appropriate code
sims = similaritiesBroadcast.map(lambda line:('%s %s' %(line[1],line[0]),line[2]))

trueDupsRDD = (sims.join(goldStandard)) 
trueDupsCount = trueDupsRDD.count() 
avgSimDups = trueDupsRDD.map(lambda (k,v):v[0]).mean()

nonDupsRDD = (sims
              .leftOuterJoin(goldStandard)
              .map(lambda (k, v): v[0] if v[1] is None else -1)
              .filter(lambda v: v != -1))
avgSimNon = nonDupsRDD.mean()

print 'There are %s true duplicates.' % trueDupsCount
print 'The average similarity of true duplicates is %s.' % avgSimDups
print 'And for non duplicates, it is %s.' % avgSimNon

Part 4 Scalable ER

The example above is not a fully distributed-friendly implementation, so its time complexity is very high. In this part we introduce an algorithm that scales much better in a distributed setting.

When computing the tokens and weights, most of the cost comes from comparing tokens between every pair of records. Here we use a data structure called an inverted index to avoid this quadratic growth in token comparisons. It maps the dataset from tokens to documents: the key is a token and the value is the set of documents that contain that token. A tiny sketch of the idea follows.
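
Here is a tiny sketch of an inverted index in plain Python, with two made-up documents (the document IDs and tokens are purely illustrative):

docs = [('doc1', ['adobe', 'photoshop']),
        ('doc2', ['adobe', 'illustrator'])]
inverted = {}
for docID, tokens in docs:
    for t in tokens:
        inverted.setdefault(t, []).append(docID)   # token -> documents containing it
print inverted   # e.g. {'adobe': ['doc1', 'doc2'], 'photoshop': ['doc1'], 'illustrator': ['doc2']}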

Tokenize the full dataset

# TODO: Replace <FILL IN> with appropriate code
amazonFullRecToToken = amazon.map(lambda (k,v):(k,tokenize(v)))
googleFullRecToToken = google.map(lambda (k,v):(k,tokenize(v)))
print 'Amazon full dataset is %s products, Google full dataset is %s products' % (amazonFullRecToToken.count(),
                                                                                    googleFullRecToToken.count())

Compute IDFs and TF-IDFs for the full datasets

This reuses the earlier code. What we need to do is combine the new RDDs, compute the IDFs, and publish them as a broadcast variable.

# TODO: Replace <FILL IN> with appropriate code
fullCorpusRDD = amazonFullRecToToken.union(googleFullRecToToken) 
idfsFull = idfs(fullCorpusRDD) 
idfsFullCount = idfsFull.count() 
print 'There are %s unique tokens in the full datasets.' % idfsFullCount

# Recompute IDFs for full dataset
idfsFullWeights = idfsFull.collectAsMap() 
idfsFullBroadcast = sc.broadcast(idfsFullWeights)

# Pre-compute TF-IDF weights.  Build mappings from record ID to weight vector.
amazonWeightsRDD = amazonFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
googleWeightsRDD = googleFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
print 'There are %s Amazon weights and %s Google weights.' % (amazonWeightsRDD.count(),
                                                              googleWeightsRDD.count())

Compute Norms for the weights from the full datasets

# TODO: Replace <FILL IN> with appropriate code
amazonNorms = amazonWeightsRDD.map(lambda (k,d):(k,norm(d))) 
amazonNormsBroadcast = sc.broadcast(amazonNorms.collectAsMap()) 
googleNorms = googleWeightsRDD.map(lambda (k,d):(k,norm(d))) 
googleNormsBroadcast = sc.broadcast(googleNorms.collectAsMap())

Create inverted indices from the full datasets

Two steps here: implement the invert function, which takes a record of the form (ID, weighted tokens) and emits a list of (token, ID) pairs; then apply it to the weight RDDs above to obtain the mapping from each token to the documents that contain it.

# TODO: Replace <FILL IN> with appropriate code
def invert(record):
    """ Invert (ID, tokens) to a list of (token, ID)
    Args:
        record: a pair, (ID, token vector)
    Returns:
        pairs: a list of pairs of token to ID
    """
    ID, tokenvector = record
    pairs = [(k,ID) for k in tokenvector]
    return (pairs)

amazonInvPairsRDD = (amazonWeightsRDD 
                        .flatMap(invert) 
                        .cache())

googleInvPairsRDD = (googleWeightsRDD 
                        .flatMap(invert) 
                        .cache())

print 'There are %s Amazon inverted pairs and %s Google inverted pairs.' % (amazonInvPairsRDD.count(),
                                                                            googleInvPairsRDD.count())

Identify common tokens from the full dataset

Here we join the Amazon and Google inverted-index RDDs on the token key, then swap and group the results so that each (Amazon ID, Google URL) pair maps to the tokens the two records have in common.

# TODO: Replace <FILL IN> with appropriate code
def swap(record):
    """ Swap (token, (ID, URL)) to ((ID, URL), token)
    Args:
        record: a pair, (token, (ID, URL))
    Returns:
        pair: ((ID, URL), token)
    """
    token = record[0]
    keys = record[1]
    return (keys, token)

commonTokens = (amazonInvPairsRDD 
                    .join(googleInvPairsRDD) 
                    .map(swap) 
                    .groupByKey() 
                    .cache())

print 'Found %d common tokens' % commonTokens.count()

Compute similarities on the full dataset

The final step of this part: combine the two weight RDDs computed earlier, amazonWeightsRDD and googleWeightsRDD, with the common-token result above to compute the cosine similarities.

# TODO: Replace <FILL IN> with appropriate code
amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collectAsMap()) 
googleWeightsBroadcast = sc.broadcast(googleWeightsRDD.collectAsMap())

def fastCosineSimilarity(record):
    """ Compute Cosine Similarity using Broadcast variables
    Args:
        record: ((ID, URL), token)
    Returns:
        pair: ((ID, URL), cosine similarity value)
    """
    amazonRec = record[0][0] 
    googleRec = record[0][1] 
    tokens = record[1]
    value = sum((amazonWeightsBroadcast.value[amazonRec][t])*(googleWeightsBroadcast.value[googleRec][t])\
            for t in tokens if t in amazonWeightsBroadcast.value[amazonRec] and t in googleWeightsBroadcast.value[googleRec])\
        /((amazonNormsBroadcast.value[amazonRec])*(googleNormsBroadcast.value[googleRec]))
    key = (amazonRec, googleRec)
    return (key, value)

similaritiesFullRDD = (commonTokens
                       .map(fastCosineSimilarity) 
                       .cache())

print similaritiesFullRDD.count()

Part 5 Analysis

That wraps up the computation; now we validate the results. We need to pick a threshold that decides whether a pair of records from the two datasets is the same entity. We judge this via precision and recall, and the F-score is generally used to measure how good the model is.

Counting True Positives, False Positives, and False Negatives

# Create an RDD of ((Amazon ID, Google URL), similarity score)
simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
assert (simsFullRDD.count() == 2441100)

# Create an RDD of just the similarity scores
simsFullValuesRDD = (simsFullRDD
                     .map(lambda x: x[1])
                     .cache())
assert (simsFullValuesRDD.count() == 2441100)

# Look up all similarity scores for true duplicates

# This helper function will return the similarity score for records that are in the gold standard and the simsFullRDD (True positives), and will return 0 for records that are in the gold standard but not in simsFullRDD (False Negatives).
def gs_value(record):
    if (record[1][1] is None):
        return 0
    else:
        return record[1][1]

# Join the gold standard and simsFullRDD, and then extract the similarities scores using the helper function
trueDupSimsRDD = (goldStandard
                  .leftOuterJoin(simsFullRDD)
                  .map(gs_value)
                  .cache())
print 'There are %s true duplicates.' % trueDupSimsRDD.count()
assert(trueDupSimsRDD.count() == 1300)

To choose a suitable threshold, we implement the counting functions using Spark accumulators. This is the first time accumulators appear in the course; a minimal sketch of a plain accumulator is shown below, ahead of the vector-valued version the lab code needs.
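
As a warm-up, here is a minimal sketch of a plain scalar accumulator, assuming only the existing SparkContext sc (the lab code below needs a vector-valued one):

total = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))   # tasks add into the accumulator
print total.value   # the driver reads the combined result: 10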

from pyspark.accumulators import AccumulatorParam
class VectorAccumulatorParam(AccumulatorParam):
    # Initialize the VectorAccumulator to 0
    def zero(self, value):
        return [0] * len(value)

    # Add two VectorAccumulator variables
    def addInPlace(self, val1, val2):
        for i in xrange(len(val1)):
            val1[i] += val2[i]
        return val1

# Return a list with entry x set to value and all other entries set to 0
def set_bit(x, value, length):
    bits = []
    for y in xrange(length):
        if (x == y):
          bits.append(value)
        else:
          bits.append(0)
    return bits

# Pre-bin counts of false positives for different threshold ranges
BINS = 101
nthresholds = 100
def bin(similarity):
    return int(similarity * nthresholds)

# fpCounts[i] = number of entries (possible false positives) where bin(similarity) == i
zeros = [0] * BINS
fpCounts = sc.accumulator(zeros, VectorAccumulatorParam())

def add_element(score):
    global fpCounts
    b = bin(score)
    fpCounts += set_bit(b, 1, BINS)

simsFullValuesRDD.foreach(add_element)

# Remove true positives from FP counts
def sub_element(score):
    global fpCounts
    b = bin(score)
    fpCounts += set_bit(b, -1, BINS)

trueDupSimsRDD.foreach(sub_element)

def falsepos(threshold):
    fpList = fpCounts.value
    return sum([fpList[b] for b in range(0, BINS) if float(b) / nthresholds >= threshold])

def falseneg(threshold):
    return trueDupSimsRDD.filter(lambda x: x < threshold).count()

def truepos(threshold):
    return trueDupSimsRDD.count() - falsenegDict[threshold]

Precision, Recall, and F-measures

# Precision = true-positives / (true-positives + false-positives)
# Recall = true-positives / (true-positives + false-negatives)
# F-measure = 2 x Recall x Precision / (Recall + Precision)

def precision(threshold):
    tp = trueposDict[threshold]
    return float(tp) / (tp + falseposDict[threshold])

def recall(threshold):
    tp = trueposDict[threshold]
    return float(tp) / (tp + falsenegDict[threshold])

def fmeasure(threshold):
    r = recall(threshold)
    p = precision(threshold)
    return 2 * r * p / (r + p)

Line Plots

thresholds = [float(n) / nthresholds for n in range(0, nthresholds)]
falseposDict = dict([(t, falsepos(t)) for t in thresholds])
falsenegDict = dict([(t, falseneg(t)) for t in thresholds])
trueposDict = dict([(t, truepos(t)) for t in thresholds])

precisions = [precision(t) for t in thresholds]
recalls = [recall(t) for t in thresholds]
fmeasures = [fmeasure(t) for t in thresholds]

print precisions[0], fmeasures[0]
assert (abs(precisions[0] - 0.000532546802671) < 0.0000001)
assert (abs(fmeasures[0] - 0.00106452669505) < 0.0000001)


fig = plt.figure()
plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.plot(thresholds, fmeasures)
plt.legend(['Precision', 'Recall', 'F-measure'])
pass

With state-of-the-art methods the F-score can reach about 60%, whereas here we only get about 40%. There are three directions for improvement: 1. use other features; 2. process the features with better models, such as stemming or n-grams (a tiny n-gram sketch follows); 3. switch to a different similarity measure.
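
For instance, a cheap way to experiment with the n-gram idea would be a small helper like this (a sketch, not part of the lab):

def ngrams(tokens, n=2):
    """ Return the n-grams of a token list as space-joined strings """
    return [' '.join(tokens[i:i + n]) for i in xrange(len(tokens) - n + 1)]

print ngrams(['adobe', 'photoshop', 'cs3'])   # ['adobe photoshop', 'photoshop cs3']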

posted @ 2017-04-15 16:31  james+zhao