The Scrapy Framework

Posted on 2019-10-09 16:59 by 大白不白 · Views: 159 · Comments: 0

How Scrapy's components fit together:

spiders send requests ==> engine ==> scheduler queues them ==> Downloader fetches the response ==> response goes back to spiders ==> spiders extract the data into items, which are handed to the pipelines.

Create a project (scrapy startproject xxx): scaffold a new crawler project

Define the targets (edit items.py): declare the fields you intend to scrape

Write the spider (spiders/xxspider.py): implement the spider and start crawling

Store the results (pipelines.py): design pipelines to persist the scraped data

 

Running the crawler:

From the command line: scrapy crawl myspider

From PyCharm, run a small launcher script instead:

    from scrapy import cmdline
    cmdline.execute('scrapy crawl myspider'.split(' '))

Pipelines:

First enable the pipeline in settings.py (the number, 0-1000, sets the execution order; lower values run first):

ITEM_PIPELINES = {
    # 'mySpider.pipelines.mySpiderPipelines': 100,
    'mySpider.pipelines.MyspiderPipeline': 300,
}

Then implement it in pipelines.py:

import json

class MyspiderPipeline(object):
    def __init__(self):
        self.filename = open('teacher.json', 'w', encoding='utf8')

    # process each item
    def process_item(self, item, spider):
        jsontxt = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontxt)
        # return the item so any later pipelines still receive it
        return item

    # called when the spider finishes
    def close_spider(self, spider):
        self.filename.close()

Requesting the next page via a callback (in myspider.py, outside the for loop):

# re-enqueue a new request with the scheduler; the downloader will fetch it
yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
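The offset pagination above is just URL arithmetic; a minimal sketch of it outside Scrapy (the base URL and page size here are made-up values for illustration):

```python
# Hypothetical base URL and page size, for illustration only.
base_url = "http://example.com/list?start="
page_size = 20

def next_page_urls(start, pages):
    """Build the URLs that successive yield scrapy.Request calls would target."""
    return [base_url + str(start + i * page_size) for i in range(pages)]

urls = next_page_urls(0, 3)
```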

 

Setting default request headers (in settings.py):

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

 

Setting the download delay (in settings.py; uncomment to throttle requests):

# DOWNLOAD_DELAY = 3

 


Pipeline for downloading images (subclassing Scrapy's ImagesPipeline):

import os
import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.pipelines.images import ImagesPipeline

class MyImagesPipeline(ImagesPipeline):  # renamed so it does not shadow the base class
    # read the variable set in settings.py
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item["imagelink"]
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # keep only the paths of successful downloads
        image_path = [x["path"] for ok, x in results if ok]

        # rename the downloaded file to <nickname>.jpg
        os.rename(self.IMAGES_STORE + "/" + image_path[0],
                  self.IMAGES_STORE + "/" + item["nickname"] + ".jpg")
        item["imagePath"] = self.IMAGES_STORE + "/" + item["nickname"] + ".jpg"

        return item
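The rename step above is plain os.rename path handling; a self-contained sketch using a temporary directory (the file and nickname are made-up values):

```python
import os
import tempfile

# Simulate the pipeline's rename: a hash-named download becomes <nickname>.jpg
store = tempfile.mkdtemp()
downloaded = os.path.join(store, "f3c1a9.jpg")  # hypothetical hash-named file
open(downloaded, "w").close()

nickname = "teacher01"
target = os.path.join(store, nickname + ".jpg")
os.rename(downloaded, target)
```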

Points to note:

url = re.sub(r'\d+', str(page), response.url)

re.sub(pattern, repl, s) replaces every match of pattern in s with repl (here, the page number embedded in the URL).

content = json.dumps(dict(item), ensure_ascii=False)

With ensure_ascii=False, Chinese characters are written as-is rather than being escaped to \uXXXX sequences.
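Both idioms are plain standard-library calls and can be checked outside Scrapy (the URL and item below are made-up examples):

```python
import json
import re

# Swap the page number embedded in a URL for a new one
url = re.sub(r'\d+', str(3), "http://example.com/page/1.html")

# ensure_ascii=False keeps Chinese characters readable in the JSON output
item = {"name": "老师"}
content = json.dumps(item, ensure_ascii=False)
```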

 
