集体智慧编程-第三章-得到词汇在指定博客源出现的次数

为聚类算法准备数据的常见做法是定义一组公共的数值型属性，可以利用这些属性对数据项进行比较。
在当前数据集中，被用来聚类的是一系列博客。
原文中是给了现成的数据集，由于网站访问不到，这里我使用任意几组数据集进行测试。
feedlist.txt

https://blog.csdn.net/liang19890820/rss/list
https://blog.csdn.net/xiaoquantouer/rss/list
https://blog.csdn.net/u011240877/rss/list
https://blog.csdn.net/sunhuaqiangl/rss/list

feedparser包在我上个博客中给出了安装方法。

下面代码执行结果：

blogdata.txt

好，正文来了，下面的代码是运行成功的版本，对书上的跑不通的部分进行了更改。

import feedparser as fe
import re

#对订阅源中的单词进行计数

def getwordcounts(url):
    d = fe.parse(url)
    wc = {}
    for e in d.entries:
        if 'summary' in e:
            summary = e.summary
        else:
            summary = e.description

        words = getwords(e.title + ' ' + summary)
        for word in words:
            wc.setdefault(word, 0)
            wc[word] += 1
        return d.feed.title, wc

#函数getwordcounts将摘要传给getwords,后者会将所有的html标记剥离掉，并将非字母字符作为分隔符拆分出来，再将结果以列表的形式加以返回。
def getwords(html):
    # 去除所有HTML标记
    txt = re.compile(r'<[^>]+>').sub('', html)
    # 利用所有非字母字符拆分出单词
    words = re.compile(r'[^A-Z^a-z]').split(txt)
    # 转化成小写形式
    return [word.lower() for word in words if word != '']

#主体代码
#循环遍历订阅源并生成数据集代码的第一部分遍历feedlist.txt的文件中的每一行，然后生成针对每个博客的单词统计，以及出现这些单词的博客数目（apcount)
apcount = {}
wordcounts = {}
feedList = []
with open("feedlist.txt") as lines:
    for line in lines:
        feedList.append(line)

for feedurl in feedList:
    title, wc = getwordcounts(feedurl)
    wordcounts[title] = wc
    for word, count in wc.items():
        apcount.setdefault(word, 0)
        if count > 1:
            apcount[word] += 1
    #print(apcount[word])
#将10%定为下界，将50%定为上界，选择介于这个百分比范围内的单词

wordlist = []
for  w,bc in apcount.items():
    frac = float(bc)/len(feedList)
    #if frac < 0.5 and frac > 0.1 :
    wordlist.append(w)
    print(apcount.items)
#利用上述单词列表和博客列表建立一个文本文件，其中包含一个大的矩阵，记录着针对每个博客的所有单词的统计情况：
#书上是open直接使用，我改成了with open的形式

with open('blogdata.txt','w') as out:
    out.write('Blog')
    for word in wordlist: out.write('\t%s' % word)
    out.write('\n')
    for blog,wc in wordcounts.items():
        out.write(blog)
        for word in wordlist:
            if word in wc: out.write('\t%d' % wc[word])
            else: out.write('\t0')
    out.write('\n')

posted @ 2020-08-28 20:21 求进步的娃阅读(224) 评论(0) 收藏举报

刷新页面返回顶部

求进步的娃

集体智慧编程-第三章-得到词汇在指定博客源出现的次数

公告