A Complete Course Project

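Goal: scrape data from the 4399 mini-game site and analyze which mini-games are popular.

Target URL: http://www.4399.com/gamehw.htm

Approach: inspect the page source in Google Chrome, analyze it, and use Python to scrape the data we need.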

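The figure below shows the 4399 mini-game home page. The home page has many different sections and the information is cluttered, so scraping it would not serve our goal. After looking over the whole page, I picked the "最新好玩游戏列表" (latest fun games list) section; the rest of this post analyzes the games listed in that section.

The figure below shows the fun-games page: on the left is the rendered page with its many mini-games, and on the right is the page's source code.

First find which tags in the source hold the data we need; then we can read it. As the figure below shows, the data sits in the sibling tags directly under the element with the class "tm_list". Searching the source shows that this class occurs twice: the first occurrence holds the most fun games and the second holds the moderately fun games. We only need the data from the first.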
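To make the selection step concrete, here is a minimal offline sketch. The HTML string is invented for illustration: the `<ul>`/`<li>` structure, the links, and the game names are all assumptions, and the real page's markup may differ. It only shows that `select('.tm_list')` returns every matching block in document order, so index 0 picks the first one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two blocks share the class "tm_list", as described for the real page
SAMPLE_HTML = """
<ul class="tm_list"><li><a href="/flash/1.htm">Game A</a></li><li><a href="/flash/2.htm">Game B</a></li></ul>
<ul class="tm_list"><li><a href="/flash/3.htm">Game C</a></li></ul>
"""

def first_block_names(html):
    """Return the link text of each direct child of the first .tm_list block."""
    soup = BeautifulSoup(html, 'html.parser')
    blocks = soup.select('.tm_list')   # all matches, in document order
    first = blocks[0]                  # keep only the first block
    names = []
    for child in first.find_all(True, recursive=False):  # direct child tags only
        link = child.select_one('a')
        if link is not None:           # skip children without a link
            names.append(link.text)
    return names

names = first_block_names(SAMPLE_HTML)
print(names)
```

Iterating over direct child tags, rather than over the block itself, avoids the whitespace text nodes that sit between tags.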

Based on the analysis above, the extraction code is as follows:

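import requests
from bs4 import BeautifulSoup

def get(url):
    res = requests.get(url)
    res.encoding = 'gb2312'  # decode the page as GB2312
    soup = BeautifulSoup(res.text, 'html.parser')

    tm = soup.select('.tm_list')[0]  # first tm_list block: the most fun games
    # Iterate over the direct child tags only; text nodes between them are skipped
    for games in tm.find_all(True, recursive=False):
        links = games.select('a')
        if links:  # skip children that contain no link
            print(links[0].text)

gameurl = 'http://www.4399.com/flash/gamehw.htm'
get(gameurl)  # get() prints the names itself and returns nothing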

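Running the code above prints the names of the most fun mini-games, as shown in the figure below:

Next, merge these names into a single string. The code is as follows: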

import requests
from bs4 import BeautifulSoup

def get(url, txt):
    res = requests.get(url)
    res.encoding = 'gb2312'  # decode the page as GB2312
    soup = BeautifulSoup(res.text, 'html.parser')

    tm = soup.select('.tm_list')[0]  # first tm_list block
    for games in tm.find_all(True, recursive=False):
        links = games.select('a')
        if links:  # skip children that contain no link
            txt = txt + links[0].text  # append the game name to the string

    return txt

gameurl = 'http://www.4399.com/flash/gamehw.htm'
txt = ''
print(get(gameurl, txt))

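Running the modified code, we can see in the figure below that the game names have been merged together. Next we can segment the text with jieba for further analysis.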
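One detail worth noting when merging names: plain concatenation leaves no boundary between adjacent names, so a segmenter could in principle form a word from the end of one name and the start of the next. A minimal sketch of an alternative, assuming the names have already been collected into a list (the sample names here are made up), joins them with a space instead:

```python
def merge_names(names, sep=" "):
    """Join game names into one string, keeping a separator between names."""
    # Strip stray whitespace and drop empty entries before joining
    cleaned = [name.strip() for name in names if name.strip()]
    return sep.join(cleaned)

# Hypothetical sample names, not scraped data
sample = ["森林冒险", " 僵尸猎人 ", "", "怪物岛中文版"]
merged = merge_names(sample)
print(merged)
```

If the counting step ignores single-character tokens, a one-character separator should drop out of the word counts on its own.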

After segmenting and sorting, the result looks like this (see the complete code for details):
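The counting and sorting step can be sketched on its own with the standard library's `collections.Counter`, used here as a swapped-in alternative to a hand-written dictionary. The token list is a made-up stand-in for the output of `jieba.lcut`, so the counts are illustrative only.

```python
from collections import Counter

def top_words(tokens, n=25):
    """Count tokens longer than one character and return the n most common."""
    counts = Counter(tok for tok in tokens if len(tok) > 1)
    return counts.most_common(n)  # list of (word, count), highest count first

# Hypothetical tokens standing in for a jieba segmentation result
tokens = ["冒险", "的", "僵尸", "冒险", "中文版", "僵尸", "冒险", "之"]
result = top_words(tokens, n=2)
for word, count in result:
    print("{:<5}{:>5}".format(word, count))
```

`most_common(n)` simply returns fewer entries when there are fewer than `n` distinct words, so no bounds check is needed.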

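Generating a word cloud lets us compare the data more intuitively. Here are a few observations on the result:

Judging by word frequency, the most popular games are adventure games.
"中文版" (Chinese version) also stands out; my guess is that many of the mini-games were localized into Chinese.
Next, the words that catch the eye are "怪物" (monster), "僵尸" (zombie), and "猎人" (hunter). Combined with the first observation, adventure games with these elements appear to be the most popular.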

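Finally, generate a txt file and write the scraped data into it; the result is shown in the figure below. This makes the list easy to browse and lets us quickly find a game we want to play.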
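As a sketch of the file-writing step, the snippet below resolves each relative link against the page URL with `urllib.parse.urljoin` and writes one `name:link` line per game. The game names, links, and the temporary file location are placeholders chosen for illustration, not scraped values.

```python
import os
import tempfile
from urllib.parse import urljoin

def save_games(games, page_url, path):
    """Write 'name:absolute_url' lines for (name, href) pairs to a UTF-8 text file."""
    with open(path, 'w', encoding='utf-8') as f:
        for name, href in games:
            # urljoin resolves a relative href against the page it came from
            f.write("{}:{}\n".format(name, urljoin(page_url, href)))

page_url = 'http://www.4399.com/flash/gamehw.htm'
games = [("Game A", "/flash/1.htm"), ("Game B", "2.htm")]  # hypothetical entries
out_path = os.path.join(tempfile.gettempdir(), "games_demo.txt")
save_games(games, page_url, out_path)

with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Simply concatenating the page URL and the href would append the link to the end of `gamehw.htm`, which is why `urljoin` is used here.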

The complete code is as follows:

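import requests
from bs4 import BeautifulSoup
import jieba
from urllib.parse import urljoin
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get(url, txt):
    res = requests.get(url)
    res.encoding = 'gb2312'  # decode the page as GB2312
    soup = BeautifulSoup(res.text, 'html.parser')

    tm = soup.select('.tm_list')[0]  # first tm_list block: the most fun games

    # Create the txt file (append mode) and write one "name:link" entry per game
    with open(r"c:\games.txt", 'a', encoding='utf-8') as text:
        for games in tm.find_all(True, recursive=False):
            link = games.select_one('a')
            if link is None or not link.get('href'):
                continue  # skip children without a usable link
            game = link.text  # game name
            gurl = link['href']  # game link
            txt = txt + game  # merge the game names
            # urljoin resolves the link against the page URL
            text.write(game + ':' + urljoin(url, gurl) + '\n\n')

    words = jieba.lcut(txt)  # segment the merged names with jieba
    counts = {}
    for word in words:
        if len(word) == 1:  # skip single-character tokens
            continue
        counts[word] = counts.get(word, 0) + 1

    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for word, count in items[:25]:  # top 25, or fewer if there are fewer words
        print("{:<5}{:>5}".format(word, count))

    w = " ".join(words)  # join with spaces so WordCloud can split the words

    # Note: WordCloud's default font has no Chinese glyphs; pass font_path=
    # pointing at a Chinese font on your system to render the words properly
    wc = WordCloud().generate(w)

    plt.imshow(wc)
    plt.axis("off")
    plt.show()

gameurl = 'http://www.4399.com/flash/gamehw.htm'
txt = ''
get(gameurl, txt)  # get() prints, writes the file and shows the plot; it returns nothing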

posted @ 2017-10-24 16:04 08邢琼佳
