CS100.1x-lab3_text_analysis_and_entity_resolution_student

This assignment is called Text Analysis and Entity Resolution, and it is considerably harder than the previous ones. The related ipynb file can be found on my GitHub.

Entity resolution is an important and difficult problem in data cleaning and data integration. In this assignment we apply Apache Spark and text-analysis techniques to entity resolution, which means finding records from different data sources that refer to the same entity; this step is necessary whenever data sources are fused together.

The data for this assignment comes from the metric-learning project and mainly consists of the following files:

  • Google.csv, the Google Products dataset
  • Amazon.csv, the Amazon dataset
  • Google_small.csv, 200 records sampled from the Google data
  • Amazon_small.csv, 200 records sampled from the Amazon data
  • Amazon_Google_perfectMapping.csv, the "gold standard" mapping
  • stopwords.txt, a list of common English words

In addition, the assignment provides some sample data for Part 1 and a table that maps every Google entity to its Amazon counterpart; this "gold standard" table is used to evaluate the performance of the algorithm.

Part 0 Preliminaries

Next we read the Google and Amazon data and convert them into RDDs. The two datasets have the following formats.

The file format of an Amazon line is:                                 
"id","title","description","manufacturer","price"                                
The file format of a Google line is:                             
"id","name","description","manufacturer","price"               

In this step we extract the ID column. In the Google dataset the ID is a URL, while in the Amazon dataset it is a string of digits and letters. Our first task is to turn the data into a pair RDD, where the ID is the key and the name/title, description, and manufacturer fields form the value.

import re
DATAFILE_PATTERN = '^(.+),"(.+)",(.*),(.*),(.*)'

def removeQuotes(s):
    """ Remove quotation marks from an input string
    Args:
        s (str): input string that might have the quote "" characters
    Returns:
        str: a string without the quote characters
    """
    return ''.join(i for i in s if i!='"')


def parseDatafileLine(datafileLine):
    """ Parse a line of the data file using the specified regular expression pattern
    Args:
        datafileLine (str): input string that is a line from the data file
    Returns:
        str: a string parsed using the given regular expression and without the quote characters
    """
    match = re.search(DATAFILE_PATTERN, datafileLine)
    if match is None:
        print 'Invalid datafile line: %s' % datafileLine
        return (datafileLine, -1)
    elif match.group(1) == '"id"':
        print 'Header datafile line: %s' % datafileLine
        return (datafileLine, 0)
    else:
        product = '%s %s %s' % (match.group(2), match.group(3), match.group(4))
        return ((removeQuotes(match.group(1)), product), 1)
import sys
import os
from test_helper import Test

baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab3')

GOOGLE_PATH = 'Google.csv'
GOOGLE_SMALL_PATH = 'Google_small.csv'
AMAZON_PATH = 'Amazon.csv'
AMAZON_SMALL_PATH = 'Amazon_small.csv'
GOLD_STANDARD_PATH = 'Amazon_Google_perfectMapping.csv'
STOPWORDS_PATH = 'stopwords.txt'

def parseData(filename):
    """ Parse a data file
    Args:
        filename (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed lines
    """
    return (sc
            .textFile(filename, 4, 0)
            .map(parseDatafileLine)
            .cache())

def loadData(path):
    """ Load a data file
    Args:
        path (str): input file name of the data file
    Returns:
        RDD: a RDD of parsed valid lines
    """
    filename = os.path.join(baseDir, inputPath, path)
    raw = parseData(filename).cache()
    failed = (raw
              .filter(lambda s: s[1] == -1)
              .map(lambda s: s[0]))
    for line in failed.take(10):
        print '%s - Invalid datafile line: %s' % (path, line)
    valid = (raw
             .filter(lambda s: s[1] == 1)
             .map(lambda s: s[0])
             .cache())
    print '%s - Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (path,
                                                                                        raw.count(),
                                                                                        valid.count(),
                                                                                        failed.count())
    assert failed.count() == 0
    assert raw.count() == (valid.count() + 1)
    return valid

googleSmall = loadData(GOOGLE_SMALL_PATH)
google = loadData(GOOGLE_PATH)
amazonSmall = loadData(AMAZON_SMALL_PATH)
amazon = loadData(AMAZON_PATH)

Running this code gives the following output.

Google_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Google.csv - Read 3227 lines, successfully parsed 3226 lines, failed to parse 0 lines
Amazon_small.csv - Read 201 lines, successfully parsed 200 lines, failed to parse 0 lines
Amazon.csv - Read 1364 lines, successfully parsed 1363 lines, failed to parse 0 lines

Let's run this code to see what the data looks like.

for line in googleSmall.take(3):
    print 'google: %s: %s\n' % (line[0], line[1])

for line in amazonSmall.take(3):
    print 'amazon: %s: %s\n' % (line[0], line[1])
google: http://www.google.com/base/feeds/snippets/11448761432933644608: spanish vocabulary builder "expand your vocabulary! contains fun lessons that both teach and entertain you'll quickly find yourself mastering new terms. includes games and more!" 

google: http://www.google.com/base/feeds/snippets/8175198959985911471: topics presents: museums of world "5 cd-rom set. step behind the velvet rope to examine some of the most treasured collections of antiquities art and inventions. includes the following the louvre - virtual visit 25 rooms in full screen interactive video detailed map of the louvre ..." 

google: http://www.google.com/base/feeds/snippets/18445827127704822533: sierrahome hse hallmark card studio special edition win 98 me 2000 xp "hallmark card studio special edition (win 98 me 2000 xp)" "sierrahome"

amazon: b000jz4hqo: clickart 950 000 - premier image pack (dvd-rom)  "broderbund"

amazon: b0006zf55o: ca international - arcserve lap/desktop oem 30pk "oem arcserve backup v11.1 win 30u for laptops and desktops" "computer associates"

amazon: b00004tkvy: noah's ark activity center (jewel case ages 3-8)  "victory multimedia"

Part 1 ER as Text Similarity - Bags of Words

When resolving entities we often treat every record as one string and compute similarities between the strings. Here we use the bag-of-words approach, a simple and effective method in text analysis. The core idea is to treat a document as an unordered collection of words, or tokens. A token is the smallest unit we obtain after splitting the document; it can be a word, a number, an abbreviation, and so on.

When comparing the similarity of two documents, we look at how many tokens the two documents share. Likewise, when searching documents by keyword, we can simply check whether the tokenized document contains that key. The advantage of this approach is that it is fairly robust to word order and punctuation. A minimal sketch of the idea is shown below.
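
A minimal sketch of this bag-of-words comparison in plain Python, using two made-up strings (no Spark needed):

doc1 = set('the quick brown fox'.split())
doc2 = set('the lazy brown dog'.split())
shared = doc1 & doc2               # tokens the two documents have in common
print 'shared tokens:', shared     # set(['the', 'brown']) -- 2 tokens in common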

Tokenize a String

Now the graded part begins: whenever a comment contains TODO, that is a piece we have to implement ourselves. The function to implement here turns a string into a list of tokens; note that all tokens must be converted to lowercase.

# TODO: Replace <FILL IN> with appropriate code
quickbrownfox = 'A quick brown fox jumps over the lazy dog.'
split_regex = r'\W+'

def simpleTokenize(string):
    """ A simple implementation of input string tokenization
    Args:
        string (str): input string
    Returns:
        list: a list of tokens
    """
    return [x for x in filter(lambda x:len(x) > 0, re.split(split_regex,string.lower()))]

print simpleTokenize(quickbrownfox) # Should give ['a', 'quick', 'brown', ... ]

This one is a bit tricky, so let me explain. filter(function, sequence) applies function(item) to each item in sequence and returns the items for which the result is True, collected into a list/string/tuple depending on the type of sequence. Here the function is lambda x: len(x) > 0 and the sequence is re.split(split_regex, string.lower()). Note that re.split behaves differently from str.split:

>>> 'hello, world'.split(',')
['hello', ' world']
>>> re.split(r'\W+', 'hello, world')
['hello', 'world']

Removing stopwords

In English, stopwords are words that add little to the meaning of a sentence, such as "the", "a", "is", "to". In the bag-of-words approach they are noise: because these words are so common, two unrelated sentences may be judged similar just because they share many stopwords. The environment provides a stopwords file; we read it, convert it to a set, and then just use in to test membership.

# TODO: Replace <FILL IN> with appropriate code
stopfile = os.path.join(baseDir, inputPath, STOPWORDS_PATH)
stopwords = set(sc.textFile(stopfile).collect())
print 'These are the stopwords: %s' % stopwords

def tokenize(string):
    """ An implementation of input string tokenization that excludes stopwords
    Args:
        string (str): input string
    Returns:
        list: a list of tokens without stopwords
    """
    return [x for x in simpleTokenize(string) if x not in stopwords]

print tokenize(quickbrownfox) # Should give ['quick', 'brown', ... ]

Tokenizing the small datasets

This pulls the previous pieces together. To count all the tokens in a dataset, we simply add up the number of tokens in each record.

# TODO: Replace <FILL IN> with appropriate code
amazonRecToToken = amazonSmall.map(lambda (a,b): (a,tokenize(b)))
googleRecToToken = googleSmall.map(lambda (a,b): (a,tokenize(b)))

def countTokens(vendorRDD):
    """ Count and return the number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokenizedValue)): Pair tuple of record ID to tokenized output
    Returns:
        count: count of all tokens
    """
    return vendorRDD.map(lambda x: len(x[1])).reduce(lambda x,y: x+y)

totalTokens = countTokens(amazonRecToToken) + countTokens(googleRecToToken)
print 'There are %s tokens in the combined datasets' % totalTokens

Amazon record with the most tokens

Time to sort again, only this time we sort by the length of the value (the token list), from largest to smallest.

# TODO: Replace <FILL IN> with appropriate code
def findBiggestRecord(vendorRDD):
    """ Find and return the record with the largest number of tokens
    Args:
        vendorRDD (RDD of (recordId, tokens)): input Pair Tuple of record ID and tokens
    Returns:
        list: a list of 1 Pair Tuple of record ID and tokens
    """
    return vendorRDD.takeOrdered(1, key=lambda x: -1*len(x[1]))

biggestRecordAmazon = findBiggestRecord(amazonRecToToken)
print 'The Amazon record with ID "%s" has the most tokens (%s)' % (biggestRecordAmazon[0][0], len(biggestRecordAmazon[0][1]))

Part 2: ER as Text Similarity - Weighted Bag-of-Words using TF-IDF

Bag of words does not work that well in practice, because different words carry different importance in a document; put mathematically, they should have different weights. Using raw frequency alone as the weight is not sound, which is why the TF-IDF scheme exists. I recommend Ruan Yifeng's article TF-IDF与余弦相似性的应用, which explains it in very accessible terms. The formulas used in this lab are summarized below.
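
Concretely, the weighting implemented in this part works as follows (matching the tf and idfs functions defined below):

TF(t, d)     = (occurrences of t in document d) / (total number of tokens in d)
IDF(t)       = N / n(t), where N is the number of documents in the corpus and n(t) is the number of documents containing t
TF-IDF(t, d) = TF(t, d) * IDF(t)

Note that this lab uses the raw ratio N / n(t) for IDF rather than the more common log-scaled variant.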

Implement a TF function

Here we implement the TF function. The input is a list of strings. We count how many times each word appears and the total number of words, then divide each word's count by the total; that is the TF in TF-IDF.

# TODO: Replace <FILL IN> with appropriate code
def tf(tokens):
    """ Compute TF
    Args:
        tokens (list of str): input list of tokens from tokenize
    Returns:
        dictionary: a dictionary of tokens to its TF values
    """
    dic = {}
    count = 0
    for word in tokens:
        if(word in dic):
            dic[word] += 1
        else:
            dic[word] = 1
        count += 1
    for key in dic:
        dic[key] = float(dic[key])/count
    return dic

print tf(tokenize(quickbrownfox)) # Should give { 'quick': 0.1666 ... }

Create a corpus

Here we merge the two RDDs into one; union() is all we need.

# TODO: Replace <FILL IN> with appropriate code
corpusRDD = amazonRecToToken.union(googleRecToToken)

Implement an IDFs function

The IDF computation needs the full corpus, which is why we combined the data in the previous step. You need to understand IDF to get this right: for each token, IDF is the total number of documents divided by the number of documents containing that token (for example, with 400 documents and a token appearing in 40 of them, its IDF is 10).

# TODO: Replace <FILL IN> with appropriate code
def idfs(corpus):
    """ Compute IDF
    Args:
        corpus (RDD): input corpus
    Returns:
        RDD: a RDD of (token, IDF value)
    """
    N = corpus.count()
    uniqueTokens = corpus.map(lambda x:(x[0],list(set(x[1]))))
    tokenCountPairTuple = uniqueTokens.flatMap(lambda x:x[1]).map(lambda x: (x,1))
    tokenSumPairTuple = tokenCountPairTuple.reduceByKey(lambda a,b : a+b)
    return (tokenSumPairTuple.map(lambda x:(x[0],N/float(x[1]))))

idfsSmall = idfs(amazonRecToToken.union(googleRecToToken))
uniqueTokenCount = idfsSmall.count()

print 'There are %s unique tokens in the small datasets.' % uniqueTokenCount

Tokens with the smallest IDF

smallIDFTokens = idfsSmall.takeOrdered(11, lambda s: s[1])
print smallIDFTokens

IDF Histogram

import matplotlib.pyplot as plt

small_idf_values = idfsSmall.map(lambda s: s[1]).collect()
fig = plt.figure(figsize=(8,3))
plt.hist(small_idf_values, 50, log=True)
pass

Implement a TF-IDF function

This step ties the functions above together.

# TODO: Replace <FILL IN> with appropriate code
def tfidf(tokens, idfs):
    """ Compute TF-IDF
    Args:
        tokens (list of str): input list of tokens from tokenize
        idfs (dictionary): record to IDF value
    Returns:
        dictionary: a dictionary of records to TF-IDF values
    """
    tfs = tf(tokens)
    tfIdfDict = {t:tfs[t]*idfs[t] for t in tfs}
    return tfIdfDict

recb000hkgj8k = amazonRecToToken.filter(lambda x: x[0] == 'b000hkgj8k').collect()[0][1]
idfsSmallWeights = idfsSmall.collectAsMap()
rec_b000hkgj8k_weights = tfidf(recb000hkgj8k, idfsSmallWeights)

print 'Amazon record "b000hkgj8k" has tokens and weights:\n%s' % rec_b000hkgj8k_weights

Part 3 ER as Text Similarity - Cosine Similarity

For the background on cosine similarity, again see Ruan Yifeng's article TF-IDF与余弦相似性的应用; the formula is recalled below.
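
For reference, the quantity implemented in the next cells is

cossim(a, b) = dotprod(a, b) / (norm(a) * norm(b))

where dotprod(a, b) sums a[k] * b[k] over the keys k shared by both dictionaries, and norm(a) = sqrt(dotprod(a, a)).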

Implement the components of a cosineSimilarity function

Implementing cosine similarity takes three steps: compute the dot product of the two vectors, compute the length (norm) of a vector, and then combine the two.

# TODO: Replace <FILL IN> with appropriate code
import math

def dotprod(a, b):
    """ Compute dot product
    Args:
        a (dictionary): first dictionary of record to value
        b (dictionary): second dictionary of record to value
    Returns:
        dotProd: result of the dot product with the two input dictionaries
    """  
    return sum(a[k]*b[k]for k in a if k in b)

def norm(a):
    """ Compute square root of the dot product
    Args:
        a (dictionary): a dictionary of record to value
    Returns:
        norm: the square root of the dot product of the dictionary with itself
    """
    count=0
    for key in a:
        count += a[key]*a[key]
    return math.sqrt(count)

def cossim(a, b):
    """ Compute cosine similarity
    Args:
        a (dictionary): first dictionary of record to value
        b (dictionary): second dictionary of record to value
    Returns:
        cossim: dot product of two dictionaries divided by the norm of the first dictionary and
                then by the norm of the second dictionary
    """
    return dotprod(a,b)/(norm(a)*norm(b))

testVec1 = {'foo': 2, 'bar': 3, 'baz': 5 }
testVec2 = {'foo': 1, 'bar': 0, 'baz': 20 }
dp = dotprod(testVec1, testVec2)
nm = norm(testVec1)
print dp, nm

Implement a cosineSimilarity function

# TODO: Replace <FILL IN> with appropriate code
def cosineSimilarity(string1, string2, idfsDictionary):
    """ Compute cosine similarity between two strings
    Args:
        string1 (str): first string
        string2 (str): second string
        idfsDictionary (dictionary): a dictionary of IDF values
    Returns:
        cossim: cosine similarity value
    """
    w1 = tfidf(tokenize(string1),idfsDictionary) 
    w2 = tfidf(tokenize(string2),idfsDictionary) 
    return cossim(w1, w2)

cossimAdobe = cosineSimilarity('Adobe Photoshop',
                               'Adobe Illustrator',
                               idfsSmallWeights)

print cossimAdobe

Perform Entity Resolution

Here we compute the similarity between the records in the Google data and the records in the Amazon data, saving the results with (Google URL, Amazon ID) as the key and the cosine similarity as the value. We will compute this in two ways; the first does not use a broadcast variable.

This takes three steps: 1. compute all the pairs, in the format [ ((Google URL1, Google String1), (Amazon ID1, Amazon String1)), ((Google URL1, Google String1), (Amazon ID2, Amazon String2)), ((Google URL2, Google String2), (Amazon ID1, Amazon String1)), ... ]; 2. write a function that computes the cosine similarity for such a pair; 3. apply that function to the RDD.

# TODO: Replace <FILL IN> with appropriate code
crossSmall = (googleSmall
              .cartesian(amazonSmall)
              .cache())

def computeSimilarity(record):
    """ Compute similarity on a combination record
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallWeights)
    return (googleURL, amazonID, cs)

similarities = (crossSmall
                .map(lambda line:computeSimilarity(line))
                .cache())


def similar(amazonID, googleURL):
    """ Return similarity value
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similarities
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogle = similar('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogle

Perform Entity Resolution with Broadcast Variables

The step above is fine for a small dataset, but when the data gets large, Spark has to send the computed IDF weights to every worker. If we had not cached the results, Spark might recompute the similarities, which would mean sending the IDF values over and over again.

So we use a broadcast variable to solve this: the value only needs to be sent to each worker once. The code is almost the same as in the previous step.

# TODO: Replace <FILL IN> with appropriate code
def computeSimilarityBroadcast(record):
    """ Compute similarity on a combination record, using Broadcast variable
    Args:
        record: a pair, (google record, amazon record)
    Returns:
        pair: a pair, (google URL, amazon ID, cosine similarity value)
    """
    googleRec = record[0]
    amazonRec = record[1]
    googleURL = googleRec[0]
    amazonID = amazonRec[0]
    googleValue = googleRec[1]
    amazonValue = amazonRec[1]
    cs = cosineSimilarity(googleValue,amazonValue,idfsSmallBroadcast.value)
    return (googleURL, amazonID, cs)

idfsSmallBroadcast = sc.broadcast(idfsSmallWeights)
similaritiesBroadcast = (crossSmall
                         .map(lambda record:computeSimilarityBroadcast(record))
                         .cache())

def similarBroadcast(amazonID, googleURL):
    """ Return similarity value, computed using Broadcast variable
    Args:
        amazonID: amazon ID
        googleURL: google URL
    Returns:
        similar: cosine similarity value
    """
    return (similaritiesBroadcast
            .filter(lambda record: (record[0] == googleURL and record[1] == amazonID))
            .collect()[0][2])

similarityAmazonGoogleBroadcast = similarBroadcast('b000o24l3q', 'http://www.google.com/base/feeds/snippets/17242822440574356561')
print 'Requested similarity is %s.' % similarityAmazonGoogleBroadcast

Perform a Gold Standard evaluation

Next we use the gold standard data to answer some questions; first we read and parse it.

GOLDFILE_PATTERN = '^(.+),(.+)'

# Parse each line of a data file using the specified regular expression pattern
def parse_goldfile_line(goldfile_line):
    """ Parse a line from the 'golden standard' data file
    Args:
        goldfile_line: a line of data
    Returns:
        pair: ((key, 'gold', 1 if successful or else 0))
    """
    match = re.search(GOLDFILE_PATTERN, goldfile_line)
    if match is None:
        print 'Invalid goldfile line: %s' % goldfile_line
        return (goldfile_line, -1)
    elif match.group(1) == '"idAmazon"':
        print 'Header datafile line: %s' % goldfile_line
        return (goldfile_line, 0)
    else:
        key = '%s %s' % (removeQuotes(match.group(1)), removeQuotes(match.group(2)))
        return ((key, 'gold'), 1)

goldfile = os.path.join(baseDir, inputPath, GOLD_STANDARD_PATH)
gsRaw = (sc
         .textFile(goldfile)
         .map(parse_goldfile_line)
         .cache())

gsFailed = (gsRaw
            .filter(lambda s: s[1] == -1)
            .map(lambda s: s[0]))
for line in gsFailed.take(10):
    print 'Invalid goldfile line: %s' % line

goldStandard = (gsRaw
                .filter(lambda s: s[1] == 1)
                .map(lambda s: s[0])
                .cache())

print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (gsRaw.count(),
                                                                                 goldStandard.count(),
                                                                                 gsFailed.count())
assert (gsFailed.count() == 0)
assert (gsRaw.count() == (goldStandard.count() + 1))

Next we use join() to combine the similarity RDD computed earlier with the gold standard RDD, count how many pairs there are, and compute the average similarity of those true duplicates. Then we compute the average similarity of the pairs that did not match.

# TODO: Replace <FILL IN> with appropriate code
sims = similaritiesBroadcast.map(lambda line:('%s %s' %(line[1],line[0]),line[2]))

trueDupsRDD = (sims.join(goldStandard)) 
trueDupsCount = trueDupsRDD.count() 
avgSimDups = trueDupsRDD.map(lambda (k,v):v[0]).mean()

nonDupsRDD = (sims
              .leftOuterJoin(goldStandard)
              .map(lambda (k, v): v[0] if v[1] is None else -1)
              .filter(lambda v: v != -1))
avgSimNon = nonDupsRDD.mean()

print 'There are %s true duplicates.' % trueDupsCount
print 'The average similarity of true duplicates is %s.' % avgSimDups
print 'And for non duplicates, it is %s.' % avgSimNon

Part 4 Scalable ER

The example above is not a fully distributed-friendly implementation, so its time complexity is very high. In this part we introduce an algorithm that scales much better in a distributed setting.

When computing the tokens and weights, most of the cost comes from comparing tokens between every pair of records. Here we use a data structure called an inverted index to avoid this quadratic growth in token comparisons. It maps the dataset from tokens to documents: the key is a token and the value is the set of documents that contain that token. A tiny sketch of the idea follows.
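
Here is a tiny sketch of an inverted index in plain Python, with two made-up documents (the document IDs and tokens are purely illustrative):

docs = [('doc1', ['adobe', 'photoshop']),
        ('doc2', ['adobe', 'illustrator'])]
inverted = {}
for docID, tokens in docs:
    for t in tokens:
        inverted.setdefault(t, []).append(docID)   # token -> documents containing it
print inverted   # e.g. {'adobe': ['doc1', 'doc2'], 'photoshop': ['doc1'], 'illustrator': ['doc2']}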

Tokenize the full dataset

# TODO: Replace <FILL IN> with appropriate code
amazonFullRecToToken = amazon.map(lambda (k,v):(k,tokenize(v)))
googleFullRecToToken = google.map(lambda (k,v):(k,tokenize(v)))
print 'Amazon full dataset is %s products, Google full dataset is %s products' % (amazonFullRecToToken.count(),
                                                                                    googleFullRecToToken.count())

Compute IDFs and TF-IDFs for the full datasets

This reuses the earlier code. What we need to do is combine the new RDDs, compute the IDFs, and publish them as a broadcast variable.

# TODO: Replace <FILL IN> with appropriate code
fullCorpusRDD = amazonFullRecToToken.union(googleFullRecToToken) 
idfsFull = idfs(fullCorpusRDD) 
idfsFullCount = idfsFull.count() 
print 'There are %s unique tokens in the full datasets.' % idfsFullCount

# Recompute IDFs for full dataset
idfsFullWeights = idfsFull.collectAsMap() 
idfsFullBroadcast = sc.broadcast(idfsFullWeights)

# Pre-compute TF-IDF weights.  Build mappings from record ID to weight vector.
amazonWeightsRDD = amazonFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
googleWeightsRDD = googleFullRecToToken.map(lambda (k,v):(k,tfidf(v, idfsFullBroadcast.value))) 
print 'There are %s Amazon weights and %s Google weights.' % (amazonWeightsRDD.count(),
                                                              googleWeightsRDD.count())

Compute Norms for the weights from the full datasets

# TODO: Replace <FILL IN> with appropriate code
amazonNorms = amazonWeightsRDD.map(lambda (k,d):(k,norm(d))) 
amazonNormsBroadcast = sc.broadcast(amazonNorms.collectAsMap()) 
googleNorms = googleWeightsRDD.map(lambda (k,d):(k,norm(d))) 
googleNormsBroadcast = sc.broadcast(googleNorms.collectAsMap())

Create inverted indices from the full datasets

Two steps here: implement the invert function, which takes a record of the form (ID, weighted tokens) and emits a list of (token, ID) pairs; then apply it to the weight RDDs above to obtain the mapping from each token to the documents that contain it.

# TODO: Replace <FILL IN> with appropriate code
def invert(record):
    """ Invert (ID, tokens) to a list of (token, ID)
    Args:
        record: a pair, (ID, token vector)
    Returns:
        pairs: a list of pairs of token to ID
    """
    ID, tokenvector = record
    pairs = [(k,ID) for k in tokenvector]
    return (pairs)

amazonInvPairsRDD = (amazonWeightsRDD 
                        .flatMap(invert) 
                        .cache())

googleInvPairsRDD = (googleWeightsRDD 
                        .flatMap(invert) 
                        .cache())

print 'There are %s Amazon inverted pairs and %s Google inverted pairs.' % (amazonInvPairsRDD.count(),
                                                                            googleInvPairsRDD.count())

Identify common tokens from the full dataset

Here we join the Amazon and Google inverted-index RDDs on the token key, then swap and group the results so that each (Amazon ID, Google URL) pair maps to the tokens the two records have in common.

# TODO: Replace <FILL IN> with appropriate code
def swap(record):
    """ Swap (token, (ID, URL)) to ((ID, URL), token)
    Args:
        record: a pair, (token, (ID, URL))
    Returns:
        pair: ((ID, URL), token)
    """
    token = record[0]
    keys = record[1]
    return (keys, token)

commonTokens = (amazonInvPairsRDD 
                    .join(googleInvPairsRDD) 
                    .map(swap) 
                    .groupByKey() 
                    .cache())

print 'Found %d common tokens' % commonTokens.count()

Compute similarities on the full dataset

The final step of this part: combine the two weight RDDs computed earlier, amazonWeightsRDD and googleWeightsRDD, with the common-token result above to compute the cosine similarities.

# TODO: Replace <FILL IN> with appropriate code
amazonWeightsBroadcast = sc.broadcast(amazonWeightsRDD.collectAsMap()) 
googleWeightsBroadcast = sc.broadcast(googleWeightsRDD.collectAsMap())

def fastCosineSimilarity(record):
    """ Compute Cosine Similarity using Broadcast variables
    Args:
        record: ((ID, URL), token)
    Returns:
        pair: ((ID, URL), cosine similarity value)
    """
    amazonRec = record[0][0] 
    googleRec = record[0][1] 
    tokens = record[1]
    value = sum((amazonWeightsBroadcast.value[amazonRec][t])*(googleWeightsBroadcast.value[googleRec][t])\
            for t in tokens if t in amazonWeightsBroadcast.value[amazonRec] and t in googleWeightsBroadcast.value[googleRec])\
        /((amazonNormsBroadcast.value[amazonRec])*(googleNormsBroadcast.value[googleRec]))
    key = (amazonRec, googleRec)
    return (key, value)

similaritiesFullRDD = (commonTokens
                       .map(fastCosineSimilarity) 
                       .cache())

print similaritiesFullRDD.count()

Part 5 Analysis

That wraps up the computation; now we validate the results. We need to pick a threshold that decides whether a pair of records from the two datasets is the same entity. We judge this via precision and recall, and the F-score is generally used to measure how good the model is.

Counting True Positives, False Positives, and False Negatives

# Create an RDD of ((Amazon ID, Google URL), similarity score)
simsFullRDD = similaritiesFullRDD.map(lambda x: ("%s %s" % (x[0][0], x[0][1]), x[1]))
assert (simsFullRDD.count() == 2441100)

# Create an RDD of just the similarity scores
simsFullValuesRDD = (simsFullRDD
                     .map(lambda x: x[1])
                     .cache())
assert (simsFullValuesRDD.count() == 2441100)

# Look up all similarity scores for true duplicates

# This helper function will return the similarity score for records that are in the gold standard and the simsFullRDD (True positives), and will return 0 for records that are in the gold standard but not in simsFullRDD (False Negatives).
def gs_value(record):
    if (record[1][1] is None):
        return 0
    else:
        return record[1][1]

# Join the gold standard and simsFullRDD, and then extract the similarities scores using the helper function
trueDupSimsRDD = (goldStandard
                  .leftOuterJoin(simsFullRDD)
                  .map(gs_value)
                  .cache())
print 'There are %s true duplicates.' % trueDupSimsRDD.count()
assert(trueDupSimsRDD.count() == 1300)

To choose a suitable threshold, we implement the counting functions using Spark accumulators. This is the first time accumulators appear in the course; a minimal sketch of a plain accumulator is shown below, ahead of the vector-valued version the lab code needs.
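
As a warm-up, here is a minimal sketch of a plain scalar accumulator, assuming only the existing SparkContext sc (the lab code below needs a vector-valued one):

total = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))   # tasks add into the accumulator
print total.value   # the driver reads the combined result: 10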

from pyspark.accumulators import AccumulatorParam
class VectorAccumulatorParam(AccumulatorParam):
    # Initialize the VectorAccumulator to 0
    def zero(self, value):
        return [0] * len(value)

    # Add two VectorAccumulator variables
    def addInPlace(self, val1, val2):
        for i in xrange(len(val1)):
            val1[i] += val2[i]
        return val1

# Return a list with entry x set to value and all other entries set to 0
def set_bit(x, value, length):
    bits = []
    for y in xrange(length):
        if (x == y):
          bits.append(value)
        else:
          bits.append(0)
    return bits

# Pre-bin counts of false positives for different threshold ranges
BINS = 101
nthresholds = 100
def bin(similarity):
    return int(similarity * nthresholds)

# fpCounts[i] = number of entries (possible false positives) where bin(similarity) == i
zeros = [0] * BINS
fpCounts = sc.accumulator(zeros, VectorAccumulatorParam())

def add_element(score):
    global fpCounts
    b = bin(score)
    fpCounts += set_bit(b, 1, BINS)

simsFullValuesRDD.foreach(add_element)

# Remove true positives from FP counts
def sub_element(score):
    global fpCounts
    b = bin(score)
    fpCounts += set_bit(b, -1, BINS)

trueDupSimsRDD.foreach(sub_element)

def falsepos(threshold):
    fpList = fpCounts.value
    return sum([fpList[b] for b in range(0, BINS) if float(b) / nthresholds >= threshold])

def falseneg(threshold):
    return trueDupSimsRDD.filter(lambda x: x < threshold).count()

def truepos(threshold):
    return trueDupSimsRDD.count() - falsenegDict[threshold]

Precision, Recall, and F-measures

# Precision = true-positives / (true-positives + false-positives)
# Recall = true-positives / (true-positives + false-negatives)
# F-measure = 2 x Recall x Precision / (Recall + Precision)

def precision(threshold):
    tp = trueposDict[threshold]
    return float(tp) / (tp + falseposDict[threshold])

def recall(threshold):
    tp = trueposDict[threshold]
    return float(tp) / (tp + falsenegDict[threshold])

def fmeasure(threshold):
    r = recall(threshold)
    p = precision(threshold)
    return 2 * r * p / (r + p)

Line Plots

thresholds = [float(n) / nthresholds for n in range(0, nthresholds)]
falseposDict = dict([(t, falsepos(t)) for t in thresholds])
falsenegDict = dict([(t, falseneg(t)) for t in thresholds])
trueposDict = dict([(t, truepos(t)) for t in thresholds])

precisions = [precision(t) for t in thresholds]
recalls = [recall(t) for t in thresholds]
fmeasures = [fmeasure(t) for t in thresholds]

print precisions[0], fmeasures[0]
assert (abs(precisions[0] - 0.000532546802671) < 0.0000001)
assert (abs(fmeasures[0] - 0.00106452669505) < 0.0000001)


fig = plt.figure()
plt.plot(thresholds, precisions)
plt.plot(thresholds, recalls)
plt.plot(thresholds, fmeasures)
plt.legend(['Precision', 'Recall', 'F-measure'])
pass

With state-of-the-art methods the F-score can reach about 60%, whereas here we only get about 40%. There are three directions for improvement: 1. use other features; 2. process the features with better models, such as stemming or n-grams (a tiny n-gram sketch follows); 3. switch to a different similarity measure.
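
For instance, a cheap way to experiment with the n-gram idea would be a small helper like this (a sketch, not part of the lab):

def ngrams(tokens, n=2):
    """ Return the n-grams of a token list as space-joined strings """
    return [' '.join(tokens[i:i + n]) for i in xrange(len(tokens) - n + 1)]

print ngrams(['adobe', 'photoshop', 'cs3'])   # ['adobe photoshop', 'photoshop cs3']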

posted @ 2017-04-15 16:31  james+zhao