Hadoop Comprehensive Assignment

 

1. Use Hive to run a word-frequency count on the text file produced by the web-crawler assignment (or on the English novel downloaded for the English word-frequency exercise).

# Read the raw text produced by the crawler assignment
f = open('note.txt', 'r', encoding='utf-8')
song = f.read()
f.close()

def writeFilenote(content):
    # Save the cleaned text to newnote.txt, ready to upload to HDFS for Hive
    f = open('newnote.txt', 'w', encoding='utf-8')
    f.write(content)
    f.close()

# Punctuation to strip out
symbol = ''',.?!’;?!:"“”-%$'''

# Common stop words to drop from the text
exclude = '''
a an the in on to at and of is was are were i he she you your they us their our it or for be too do no
that s so as but it's
'''.split()

# Replace every punctuation mark with a space
for i in symbol:
    song = song.replace(i, ' ')

# Drop the stop words before saving
words = [w for w in song.lower().split() if w not in exclude]
song = ' '.join(words)

writeFilenote(song)
print(song)

First use Python to strip the punctuation and stop words out of the text, then save the result as newnote.txt.

Then came a whole series of Hive operations, and the result is shown in the figure below. (I won't paste the process, since it's pretty much the same as last time.)
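For the record, a minimal sketch of what those Hive steps could look like, assuming the cleaned newnote.txt has already been uploaded to HDFS; the path /user/hadoop/newnote.txt and the table names docs and word_count are hypothetical, not the exact ones used here:

-- load the cleaned text as a one-column table (hypothetical names and path)
CREATE TABLE docs (line STRING);
LOAD DATA INPATH '/user/hadoop/newnote.txt' OVERWRITE INTO TABLE docs;

-- split each line into words and count how often each word appears
CREATE TABLE word_count AS
SELECT word, COUNT(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
WHERE word <> ''
GROUP BY word
ORDER BY cnt DESC;

SELECT * FROM word_count LIMIT 20;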

2. Use Hive to analyze the csv file produced by the web-crawler assignment, and write a blog post describing the analysis process and results.

import requests
import pandas
from bs4 import BeautifulSoup


def getNewsDetail(newsUrl):
    # Fetch one news page and pull out the fields we want to analyse
    resd = requests.get(newsUrl)
    resd.encoding = 'utf-8'
    soupd = BeautifulSoup(resd.text, 'html.parser')
    NewsDict = {}
    NewsDict['source'] = soupd.select('.comeFrom')[0].select('a')[0].text
    NewsDict['title'] = soupd.select('.headline')[0].text
    NewsDict['time'] = soupd.select('#pubtime_baidu')[0].text
    # NewsDict['content'] = soupd.select('.artical-main-content')[0].text
    return NewsDict


def Get_page(url):
    # Fetch one list page and collect the details of every news item on it
    res = requests.get(url)
    res.encoding = 'utf-8'
    pagelist = []
    soup = BeautifulSoup(res.text, 'html.parser')
    for new in soup.select('.tag-list-box')[0].select('.list'):
        newsUrl = new.select('.list-content')[0].select('.name')[0].select('.n1')[0].select('a')[0]['href']
        pagelist.append(getNewsDetail(newsUrl))
    return pagelist


# Crawl list pages 1-24 and collect every news item
total = []
for i in range(1, 25):
    total.extend(Get_page('https://voice.hupu.com/nba/tag/3023-{}.html'.format(i)))

# Save everything to one csv file
pan = pandas.DataFrame(total)
pan.to_csv('result3.csv')

 

Because the title field was so long that it broke the layout further on, I just deleted it. Also, after exporting the table, the first column (the id/index) came out empty, which conflicted with my later steps, so I deleted that column too.

I looked it up on Baidu, and it seems the fix is to pass index=False, as in to_csv(csv, index=False). The table that comes out is exactly what I wanted.

 

Then it's just the basic steps: open it in Excel, delete the first row, and so on. Here is my own pre-processing script; my skills are limited, so it ends up looking a lot like the teacher's. The sh file is as follows:

 

At first I didn't notice that the time field needed to be changed, so the format came out wrong; only after fixing it was the data correct.

That one small step cost me a lot of time.

Then I noticed the time was still wrong, so I went back and adjusted the field range handled in pre_deal.sh; the result is shown in the figure.

 

Then on with the usual "fierce as a tiger" series of Hive operations:

Then I saw this and just froze: my time column came out as NULL.

Staying positive, I changed the time column's definition to string, and then it output normally.
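A minimal sketch of what that corrected table definition might look like; the table name nbanews, the HDFS path /user/hadoop/nbanews, the comma delimiter, and the assumption that only the source and time columns remain are all my guesses, not the exact script used here:

-- hypothetical external table over the cleaned csv; note `time` is STRING, not DATE
CREATE EXTERNAL TABLE IF NOT EXISTS nbanews (
    source STRING,
    `time` STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/nbanews';

SELECT * FROM nbanews LIMIT 10;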

 

Then on to the data analysis. I originally wanted to extract the number of news posts published between May 10 and May 14, but the result was unexpectedly 0. Apparently that kind of range query only works on a date column, not on a string, and since I didn't want to start over and crawl a different site, I changed the analysis to simply counting how many rows the table contains.
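The final query is then just a row count over the same hypothetical table as above:

-- total number of news records crawled
SELECT COUNT(*) FROM nbanews;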

 

Summary: although the final result is simple, the process was tortuous. If you don't want to hit pitfalls later, you have to take the early steps carefully, one at a time; otherwise, like me, you end up redoing the steps above several times, which was infuriating. Next time, before picking a site to crawl, I'll check what data types the scraped data actually has, because that affects the final analysis. I tripped over it this time, and I'll choose my crawl target more carefully next time.

 

posted on 2018-05-16 20:42 by 163-王晓峰