A Beginner's Scrapy Crawler Example

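There is already plenty of material about Scrapy online, including a translation of the official documentation by some very capable people, so I won't reinvent the wheel. I wrote a small crawler of my own, hit quite a few pitfalls, and learned a lot along the way. I'm sharing that here and keeping it as a memo for myself.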

Its main job is to scrape the name, publication date, description, ID, and so on of each vulnerability in the CNVD vulnerability database.

First, my environment:

Ubuntu 16.04
Python 2.7
Scrapy 1.0.3

1. How do you set a custom User-Agent in Scrapy?

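Some sites use the User-Agent (UA) for anti-crawling checks. Scrapy already provides a hook for this: first register the class that handles requests in settings.py, then implement that class.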

Add the following to settings.py:

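# Code found online; please let me know if it infringes
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # project name (cnvd), module file name (middlewares), class name in that file (RotateUserAgentMiddleware)
    'cnvd.middlewares.RotateUserAgentMiddleware': 400,
}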

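Create a new file middlewares.py in the same directory (the name must match the cnvd.middlewares path used in settings.py), with the following code: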

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Code found online; please let me know if it infringes

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random UA from the list
        ua = random.choice(self.user_agent_list)
        if ua:
            # Set the User-Agent header on the request (no effect if one is already set)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list composes chrome, IE, firefox, Mozilla, opera, netscape
    # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
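
One pitfall worth checking in a list like this: Python silently joins adjacent string literals, so a missing comma between two entries produces one long, invalid User-Agent instead of an error. Below is a minimal sketch of a sanity check using only the standard library. The helper name find_merged_user_agents, the shortened sample strings, and the heuristic (each entry should contain "Mozilla/5.0" exactly once, at the start) are my own assumptions, not part of Scrapy.

```python
def find_merged_user_agents(user_agents, prefix="Mozilla/5.0"):
    """Return entries that look like two UA strings glued together.

    Heuristic (an assumption for this sketch): each entry should contain
    the prefix exactly once, at the very start.
    """
    return [ua for ua in user_agents
            if not ua.startswith(prefix) or ua.count(prefix) != 1]


# Shortened, made-up sample entries. The first two literals have no comma
# between them, so Python concatenates them into a single string.
sample = [
    "Mozilla/5.0 (Windows NT 6.1) Chrome/22.0.1207.1 Safari/537.1"
    "Mozilla/5.0 (X11; CrOS i686) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.2) Chrome/20.0.1090.0 Safari/536.6",
]

merged = find_merged_user_agents(sample)
print(len(sample))  # 2 entries, not 3
print(len(merged))  # 1 suspicious entry
```

Running a check like this once against the real list is cheaper than discovering a malformed header in the server logs.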

2. How do you log in Scrapy?

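Search results online and the official examples use log.msg("log msg", level=log.DEBUG), but when I tried it I got a warning that the method is deprecated and that Python's logging module should be used instead. Usage: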

import logging

logging.debug("msg")
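
The module-level logging.debug call goes through the root logger. A slightly more structured variant, sketched below, uses a named logger so that messages can be filtered by level and traced back to their source. The logger name "cnvd.demo" and the format string are made up for illustration. Scrapy spiders also expose a standard logger as self.logger, which is used in the same way.

```python
import io
import logging

# A named logger; "cnvd.demo" is just an illustrative name.
logger = logging.getLogger("cnvd.demo")
logger.setLevel(logging.INFO)   # DEBUG messages will be dropped
logger.propagate = False        # keep the demo output out of the root logger

# Send records to an in-memory stream so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

logger.debug("not shown, below the INFO threshold")
logger.info("fetched %d items", 20)
logger.warning("page %s returned no rows", 3)

output = stream.getvalue()
print(output)
```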

3. What does yield mean?

I only half understand Python myself and never quite got how this works, nor can I explain it well, so see the code comments.

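# Inside a spider callback: Scrapy sends a request to url and parses the response with index_parse
yield scrapy.Request(url, callback=self.index_parse)

# Implement the callback
def index_parse(self, response):
    pass  # extract data from response here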
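
To make yield less mysterious: a function that contains yield is a generator. Calling it runs nothing; each time the caller asks for the next value, the function resumes until the next yield and hands that value back. Scrapy iterates over the spider's callback in a similar way and schedules every Request it receives. The following is a simplified, synchronous sketch in plain Python with no Scrapy involved. FakeRequest, parse, and the example.com URLs are stand-ins invented for illustration.

```python
from collections import namedtuple

# Stand-in for scrapy.Request; only here to keep the sketch self-contained.
FakeRequest = namedtuple("FakeRequest", ["url", "callback"])

events = []


def index_parse(response):
    events.append("parsed " + response)


def parse(urls):
    # A generator: the body does not run until someone iterates over it.
    for url in urls:
        events.append("yielding " + url)
        yield FakeRequest(url, index_parse)


gen = parse(["http://example.com/1", "http://example.com/2"])
assert events == []  # nothing has executed yet

# A rough picture of the framework's side: pull requests one by one,
# "download" them, then hand the response to the callback.
for request in gen:
    request.callback("response for " + request.url)

print(events)
```

The printed list alternates between "yielding" and "parsed" entries, which shows that the generator only advances when the consumer asks for the next request.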

4. A few XPath tips

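# Get the href attribute of the a tag
response.xpath("//tr[@class='current']/td/a/@href")
# Find the tag whose text is xxx; when matching Chinese text, prefix the string with u
# ('漏洞描述' is the "vulnerability description" label on the page)
response.xpath(u"//td[text()='漏洞描述']")
# Find all following sibling nodes of the matched node
response.xpath(u"//td[text()='漏洞描述']/following-sibling::*")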

Ways to test an XPath:

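1. Run scrapy shell <url>, which opens an interactive terminal where response.xpath("") shows what matches on that page. Personally I don't like this approach: it's all in a terminal and hard to read.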

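2. Use the Firefox plugin WebDriver Element Locator. Once installed, right-clicking a selected element on the page shows its XPath. Unfortunately it isn't much use, because most of the time the result isn't in the form we want.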

3. Use the Firefox plugins Firebug and FirePath. A FirePath box appears in the Firebug panel where you can type an XPath and check whether it matches. This is the approach I prefer.
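
A fourth option, if you would rather script the check, is to run the expression against a small hand-written fragment. The sketch below targets Python 3 and uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath: attribute steps such as /@href and the following-sibling axis are not available, so both are emulated by hand here. The SAMPLE fragment and the /detail/1 link are hypothetical, not the real CNVD markup, and the results may differ from what Scrapy's own selectors return on a real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a detail page.
SAMPLE = u"""
<table>
  <tr class="current"><td><a href="/detail/1">demo</a></td></tr>
  <tr><td>漏洞描述</td><td>example description</td><td>extra cell</td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)


def hrefs_of_current_rows(root):
    # Same intent as //tr[@class='current']/td/a/@href
    return [a.get("href") for a in root.findall(".//tr[@class='current']/td/a")]


def following_siblings(root, label):
    # Same intent as //td[text()=label]/following-sibling::*
    result = []
    for row in root.iter("tr"):
        cells = list(row)
        for i, cell in enumerate(cells):
            if cell.tag == "td" and cell.text == label:
                result.extend(cells[i + 1:])
    return result


hrefs = hrefs_of_current_rows(root)
sibling_texts = [c.text for c in following_siblings(root, u"漏洞描述")]
print(hrefs)
print(sibling_texts)
```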

5. Database operations

Here is the database code:

import logging
from cnvd.items import CnvdItem
import MySQLdb
import MySQLdb.cursors


class CnvdPipeline(object):
    def __init__(self):
        # Connect to the local MySQL database
        self.conn = MySQLdb.connect(user='root', passwd='123456', db='cnvddb', host='localhost', charset='utf8')
        self.cursor = self.conn.cursor()
        # Empty the table at the start of every run
        self.cursor.execute("truncate table cnvd")
        self.conn.commit()

    def process_item(self, item, spider):
        # Parameterized insert; the driver escapes the values
        self.cursor.execute("insert into cnvd(cnvd_id, name, time, description) values(%s, %s, %s, %s)",
                            (item['cnvd_id'], item['name'], item['time'], item['description']))
        self.conn.commit()
        logging.debug(item['name'])
        logging.debug(item['cnvd_id'])
        logging.debug(item['time'])
        return item
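
For comparison, here is a minimal sketch of the same idea that can be run without a MySQL server. It assumes sqlite3 (in memory) as a stand-in for MySQL, opens and closes the connection in the open_spider and close_spider hooks that Scrapy calls on item pipelines, and skips duplicate IDs with INSERT OR IGNORE, which is SQLite syntax (MySQL spells it INSERT IGNORE). The class name SqlitePipeline and the sample item are made up for illustration.

```python
import sqlite3


class SqlitePipeline(object):
    """Pipeline-shaped class; sqlite3 stands in for MySQL in this sketch."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection here.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE cnvd ("
            "cnvd_id TEXT PRIMARY KEY, name TEXT, time TEXT, description TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query; INSERT OR IGNORE skips rows whose
        # cnvd_id already exists.
        self.conn.execute(
            "INSERT OR IGNORE INTO cnvd (cnvd_id, name, time, description) "
            "VALUES (?, ?, ?, ?)",
            (item["cnvd_id"], item["name"], item["time"], item["description"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Usage without Scrapy: drive the hooks by hand with a placeholder item.
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
item = {"cnvd_id": "DEMO-1", "name": "demo", "time": "2017-03-10",
        "description": "placeholder"}
pipeline.process_item(item, spider=None)
pipeline.process_item(item, spider=None)  # duplicate, ignored
row_count = pipeline.conn.execute("SELECT COUNT(*) FROM cnvd").fetchone()[0]
print(row_count)
pipeline.close_spider(spider=None)
```

Relying on a primary key for de-duplication keeps the check inside the database, so the pipeline does not need to query before every insert.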

Summary:

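I once wrote a crawler with urllib2 to scrape the CNNVD vulnerability database. After it had run for a day, the site's vulnerability pages stopped loading for me, although they still worked over my phone's 4G connection, so my guess is that the company IP had been blacklisted. I wanted to reimplement it with Scrapy, but with the pages unreachable it was hard to debug. So I picked the CNVD database instead, although I only tested against a single page and didn't crawl everything, for fear of being blocked again. Please be careful when you test as well.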
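
If you want to lower the chance of getting blocked while testing, Scrapy's built-in throttling settings are a reasonable starting point. The fragment below is only a sketch for settings.py: the setting names are standard Scrapy settings, but the values are illustrative guesses of mine, not something tuned or verified against CNVD.

```python
# settings.py (fragment): values are illustrative, not tuned for any site
ROBOTSTXT_OBEY = True                # respect the site's robots.txt
DOWNLOAD_DELAY = 2                   # seconds to wait between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server latency
```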

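The attachment contains the full code. It's only a small demo; still to do are database de-duplication, countering anti-crawling measures, scraping the remaining vulnerability fields, and so on. As a demo, though, it's good enough.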

Full code

posted @ 2017-03-10 15:04 Gordon0918
