Python Web Scraping with the Scrapy Framework (Part 1): Downloading Wallpaper Images

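This project is for learning and reference only and is not intended for any commercial use.

If it infringes on your rights, leave a comment and I will delete it immediately.

Project name: scraping images from the Netbian wallpaper site (彼岸图网)

Framework: Scrapy

Python version: 3.6

I am new to web scraping, so the code has plenty of rough edges; corrections are welcome.

Only the spider file is covered here. The code for the whole project can be cloned or downloaded from my GitHub: https://github.com/Dawn-bin/netbian_spider.git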

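Browsing the site, each page shows 21 images, and the pager at the bottom lists more than 1,000 pages, which comes to over 20,000 images.

Clicking one of the images opens a separate page that contains only that image. Its resolution is higher than the thumbnail on the listing page, so this is the image we want.

First, look at how the listing-page URLs are structured.

The URL of the second page is: http://pic.netbian.com/index_2.html

Simple enough. The third page is: http://pic.netbian.com/index_3.html

So the URLs to crawl could simply be constructed by brute force, but that feels too crude.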
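
For reference, the brute-force construction could look roughly like this. It is a minimal sketch that assumes the pattern seen for pages 2 and 3 holds for every page; the helper name page_url is made up for illustration and is not part of the project.

```python
INDEX = 'http://pic.netbian.com'  # site root, as used in the article


def page_url(page):
    """Return the listing-page URL for a 1-based page number.

    Assumption: page 1 is the site root and page n (n >= 2) is
    /index_n.html, the pattern observed for pages 2 and 3.
    """
    if page < 1:
        raise ValueError('page numbers start at 1')
    if page == 1:
        return INDEX
    return '{0}/index_{1}.html'.format(INDEX, page)


# Build the URLs of the first three listing pages
urls = [page_url(n) for n in range(1, 4)]
print(urls)
```

The drawback is that the total page count has to be known in advance, which is why following the "next page" link is the more robust choice.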

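On the first page (http://pic.netbian.com), press F12 to open the developer tools and use the element picker (the small pointer icon in the top-left corner) to select the "next page" link.

It points to <a href='/index_2.html' class='prev'> 下一页 </a> (下一页 means "next page"). Clicking the link in href opens exactly the next page we want (http://pic.netbian.com/index_2.html).

Once the page source has been fetched, the content of href can be extracted directly with XPath: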

//a[contains(text(), '下一页')]/@href

This expression was checked with the XPath Helper extension for Google Chrome.

Build the URL of the next page:

self.next_page = self.index + html_content.xpath("//a[contains(text(), '下一页')]/@href")[0]
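
One thing to watch in the line above: xpath() returns a list, and a page without a "next page" link (presumably the last one) yields an empty list, so indexing [0] would raise IndexError. A guarded version might look like the sketch below. It uses urljoin from the standard library instead of string concatenation; the helper name build_next_page is hypothetical, and the hrefs argument stands in for the list returned by xpath().

```python
from urllib.parse import urljoin

INDEX = 'http://pic.netbian.com'  # site root, as used in the article


def build_next_page(index, hrefs):
    """Return the absolute URL of the next page, or None if there is none.

    hrefs is the (possibly empty) list of href values matched by the
    "next page" XPath expression.
    """
    if not hrefs:
        return None
    return urljoin(index, hrefs[0])


print(build_next_page(INDEX, ['/index_2.html']))
print(build_next_page(INDEX, []))
```

Returning None for the missing link gives the spider a clear signal to stop following pages.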

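Next, collect the URL of each image's own page.

Press F12 again and use the element picker to select one of the images.

The img tag exposes an image URL directly, but that is only the small thumbnail from the listing page, not what we want.

Clicking the URL in the surrounding href opens the image's own page.

Use XPath again to extract the URLs of those pages: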

html_url = html_content.xpath("//ul[@class='clearfix']/li/a[@target='_blank']/@href")

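The returned html_url is a list. Each page has 21 images, so it should contain 21 URLs.

Loop over the URLs in the list, request each one, and extract the image URL, name, and category from the returned page source: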

picture_url = html_content.xpath("//div/a[@target='_blank']/img/@src")[0]                
picture_name = html_content.xpath("//div/a[@target='_blank']/img/@title")[0]                 
picture_class = html_content.xpath("//div/span/a/text()")[1]
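
To try these three lookups without touching the network, they can be run against a small hand-written snippet. The sketch below is only an illustration: the markup is a made-up, well-formed stand-in (not the site's real HTML), and it uses the standard library's ElementTree, whose limited XPath subset happens to cover these paths, rather than the lxml calls used in the project. The helper name extract_fields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an image page (assumed structure, fake values)
SAMPLE = """
<html><body>
  <div class="photo">
    <a href="#" target="_blank">
      <img src="/uploads/sample.jpg" title="Sample wallpaper"/>
    </a>
    <span><a href="/">Home</a> <a href="/landscape/">Landscape</a></span>
  </div>
</body></html>
""".strip()


def extract_fields(root):
    """Return (image URL, name, category), or None if anything is missing."""
    img = root.find(".//div/a[@target='_blank']/img")
    links = root.findall(".//div/span/a")
    # The article takes the second link text ([1]) as the category
    if img is None or len(links) < 2:
        return None
    return img.get('src'), img.get('title'), links[1].text


root = ET.fromstring(SAMPLE)
print(extract_fields(root))
```

Checking for missing nodes before indexing avoids an IndexError when a page does not have the expected layout.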

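The image URL can then be requested and the image saved.

A brief overview of how a spider runs in Scrapy:

1. Scrapy must be installed first. See: https://blog.csdn.net/yctjin/article/details/70658811

2. allowed_domains = [ ] restricts the domains the spider may crawl. Entries are bare domain names without a scheme, so for this project allowed_domains = ['pic.netbian.com'] limits the spider to pic.netbian.com.

3. start_urls = [ ] holds the URLs the spider starts from. When the spider starts, Scrapy requests them and passes each response to parse automatically.

4. Start the spider: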

> cd ./netbian/spiders
> scrapy crawl netbianspider

In a command prompt, cd into the spiders folder inside the netbian folder.

Then run scrapy crawl netbianspider (netbianspider is the spider's name).
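
Regarding item 2, Scrapy expects allowed_domains entries to be bare host names rather than full URLs. One way to avoid a mismatch is to derive them from the start URLs with the standard library, as in this small sketch (the helper name domain_of is made up for illustration):

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the host name of a URL, without scheme, port, or path."""
    return urlparse(url).hostname


start_urls = ['http://pic.netbian.com']
# Host names only, deduplicated and sorted for a stable order
allowed_domains = sorted({domain_of(u) for u in start_urls})
print(allowed_domains)
```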

How to organize this many images by category:

category_dir = os.path.join(self.storePath, str(picture_class))
# Create a folder named after picture_class if it does not exist yet
os.makedirs(category_dir, exist_ok=True)
# Save the image
with open(os.path.join(category_dir, picture_name + '.jpg'), 'wb') as f:
    f.write(html_content)
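
The same idea as a self-contained sketch that can be run anywhere. It writes into a temporary directory instead of the project's storePath, and the helper names safe_name and store_picture are hypothetical; the character set being replaced is an assumption based on the characters Windows forbids in file names plus the tab character.

```python
import os
import re
import tempfile


def safe_name(name):
    """Replace characters that are not allowed in Windows file names."""
    return re.sub(r'[\\/:*?"<>|\t]', ' ', name).strip()


def store_picture(store_path, picture_class, picture_name, data):
    """Write data to <store_path>/<category>/<name>.jpg and return the path."""
    category_dir = os.path.join(store_path, safe_name(str(picture_class)))
    os.makedirs(category_dir, exist_ok=True)  # no error if it already exists
    target = os.path.join(category_dir, safe_name(picture_name) + '.jpg')
    with open(target, 'wb') as f:
        f.write(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = store_picture(tmp, 'Landscape', 'lake\tview', b'fake image bytes')
    print(os.path.relpath(path, tmp))
```

Using os.path.join keeps the code independent of the path separator, and os.makedirs with exist_ok=True removes the need for a separate existence check.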

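In testing, more than 99% of the images were downloaded successfully.

Of course, the high-quality originals still require a registered account to download.

Here is the content of spider.py: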

# -*- coding: utf-8 -*-
import os
import re                                        # regular expressions

import requests
import scrapy
from lxml import etree
from scrapy import Request

from netbian._header import getHeaders           # getHeaders is defined in _header.py


class NetbianSpider(scrapy.Spider):
    name = 'netbianspider'                       # spider name
    # allowed_domains = ['pic.netbian.com']      # restrict the spider (domain only, no scheme)
    index = 'http://pic.netbian.com'             # site root used to build URLs
    start_urls = ['http://pic.netbian.com']      # URL where the crawl starts
    storePath = r'E:\picture'                    # directory where images are saved
    next_page = None                             # URL of the next page
    page_num = 0                                 # number of pages crawled

    # Request a URL and return the response body as bytes (None on failure)
    def getHTMLText(self, url):
        headers = getHeaders()                   # get a header; the function lives in _header.py
        try:
            r = requests.get(url, headers=headers, timeout=5)
            r.raise_for_status()                 # raises on an HTTP error status, jumping to except
            return r.content
        except requests.RequestException:
            # print the error
            print(u"\n{0}:requests error...\n".format(url))
            return None

    # Takes the image URL, name, and category,
    # then requests and saves the image
    def pictureStore(self, picture_url, picture_name, picture_class):
        # replace characters that are illegal in file names;
        # some image names turned out to contain \t
        picture_name = re.sub(r'[\\/:*?"<>|\t]', ' ', picture_name)
        url = self.index + picture_url           # build the image URL
        html_content = self.getHTMLText(url)     # here the body is the binary image data
        if html_content is not None:             # check whether the request succeeded
            category_dir = os.path.join(self.storePath, str(picture_class))
            os.makedirs(category_dir, exist_ok=True)   # create the category folder if missing
            with open(os.path.join(category_dir, picture_name + '.jpg'), 'wb') as f:
                f.write(html_content)            # save the image
        else:
            # record images whose request failed in a text file
            with open("url_Error.txt", 'a', encoding='utf-8') as f:
                f.write(str(picture_class) + ' ' + str(picture_name) + ' ' + picture_url + '\n')

    def parse(self, response):
        html_content = etree.HTML(response.text)
        # extract the URLs of the individual image pages from the listing page; returns a list
        html_url = html_content.xpath("//ul[@class='clearfix']/li/a[@target='_blank']/@href")
        # extract the URL of the next page; the list is empty when there is no such link
        next_href = html_content.xpath("//a[contains(text(), '下一页')]/@href")
        self.next_page = self.index + next_href[0] if next_href else None
        if self.next_page is not None:
            # get the page number and print progress (1035 is the hard-coded total page count)
            count = re.findall(r'index_(\d+)\.html', self.next_page)
            if count:
                page = int(count[0]) - 1
                print(u"\nProgress: {0:.2f}%    page {1}\n".format(page * 100 / 1035, page))
        for url in html_url:                             # iterate over the list
            url = self.index + url                       # build the URL of the image page
            html_PicturePage = self.getHTMLText(url)     # fetch the page source
            if html_PicturePage is not None:             # skip pages that failed to load

                html_content = etree.HTML(html_PicturePage)   # parse with etree

                # extract the image URL, name, and category
                picture_url = html_content.xpath("//div/a[@target='_blank']/img/@src")[0]
                picture_name = html_content.xpath("//div/a[@target='_blank']/img/@title")[0]
                picture_class = html_content.xpath("//div/span/a/text()")[1]

                self.pictureStore(picture_url, picture_name, picture_class)   # save
            # else:
            #     html_url.append(url)                   # re-queue the URL on failure
            # In testing, URLs re-queued after a failure did not succeed on retry either

        # print(self.next_page)
        if self.next_page is not None:
            # call back into parse; the crawl ends when there is no next page
            yield Request(self.next_page, headers=getHeaders(), callback=self.parse)

Posted 2018-08-16 15:29 by Dawn-bin
