scrapy

An asynchronous crawling framework built on top of Twisted

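Five components (in practice you mostly write the spider and the pipeline):

spider (defines where to crawl via start_urls and parses the returned response)
engine (the go-between that coordinates all the other components)
scheduler (enqueues and dequeues request objects)
downloader (downloads the data)
pipeline (saves the data)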
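The division of labour above can be sketched as a toy model in plain Python. This is not Scrapy's real code: the class names, the FAKE_WEB dictionary and the synchronous loop are all assumptions made for illustration (the real engine is asynchronous and driven by Twisted).

```python
from collections import deque

# Hypothetical pages the toy "downloader" can fetch; no network involved.
FAKE_WEB = {
    "http://example.com/1": "first page",
    "http://example.com/2": "second page",
}


class Scheduler:
    """Enqueues and dequeues requests (here: plain URL strings)."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, url):
        self.queue.append(url)

    def dequeue(self):
        return self.queue.popleft() if self.queue else None


class Downloader:
    """Returns the response body for a request."""

    def fetch(self, url):
        return FAKE_WEB[url]


class Spider:
    """Says where to start and how to parse a response."""

    start_urls = list(FAKE_WEB)

    def parse(self, url, body):
        return {"url": url, "length": len(body)}


class Pipeline:
    """Stores the parsed items."""

    def __init__(self):
        self.saved = []

    def process_item(self, item):
        self.saved.append(item)


def run_engine(spider, scheduler, downloader, pipeline):
    """The engine only passes objects between the other components."""
    for url in spider.start_urls:
        scheduler.enqueue(url)
    while True:
        url = scheduler.dequeue()
        if url is None:
            break
        body = downloader.fetch(url)
        pipeline.process_item(spider.parse(url, body))
    return pipeline.saved


items = run_engine(Spider(), Scheduler(), Downloader(), Pipeline())
print(items)
```

Only Spider.parse and Pipeline.process_item contain project-specific logic here, which mirrors why those are the two parts you usually write yourself.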

Basic usage

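Create a project (scrapy startproject xxx)
Create a spider (scrapy genspider xxx xxx.com)
Run the spider (scrapy crawl xxx)
Use an Item as the container class for scraped fields: item["xxx"] = response.xpath(...).extract()
Store the data with JsonItemExporter: open a file, exporter = JsonItemExporter(file), exporter.start_exporting(), exporter.export_item(item), exporter.finish_exporting(), file.close(), then enable the pipeline under ITEM_PIPELINES in settings.py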

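Other usage

settings

Set the maximum concurrency (CONCURRENT_REQUESTS)
Set the download delay (DOWNLOAD_DELAY)
Enable middlewares (SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES)
Enable pipelines (ITEM_PIPELINES)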
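As a rough illustration, a settings.py fragment covering these four points could look like the following. The setting names are Scrapy's own; the values and the myproject.* class paths are hypothetical.

```python
# settings.py (fragment); values and class paths are illustrative only
CONCURRENT_REQUESTS = 8   # maximum number of concurrent requests
DOWNLOAD_DELAY = 1.0      # seconds to wait between requests to the same site

# Keys are import paths, values are priorities that decide the order
# in which the components run.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyUserAgentMiddleware": 543,  # hypothetical
}
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MyprojectSpiderMiddleware": 543,  # hypothetical
}
ITEM_PIPELINES = {
    "myproject.pipelines.JsonExportPipeline": 300,  # hypothetical
}
```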

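Spider middleware and downloader middleware
proxy: set request.meta["proxy"]
User-Agent: set the User-Agent request header
cookies: paste a cookie copied from the browser, or log in automatically with a username and password
meta: pass data from one page's callback to the next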
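A downloader middleware is one common place to set the proxy and the User-Agent. The sketch below is only an illustration: the class name, the proxy address and the User-Agent strings are made up, and the class has to be registered under DOWNLOADER_MIDDLEWARES before Scrapy will call it.

```python
import random

# Hypothetical values; replace them with a real proxy and real UA strings.
PROXY_URL = "http://127.0.0.1:8888"
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


class ProxyUserAgentMiddleware:
    def process_request(self, request, spider):
        # Route the request through the proxy.
        request.meta["proxy"] = PROXY_URL
        # Pick a User-Agent at random for each request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None lets Scrapy continue processing the request.
        return None
```

The same meta dictionary carries data between callbacks: pass meta={...} when building a scrapy.Request and read it back as response.meta in the next callback. Cookies can likewise be passed through the cookies argument of scrapy.Request.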

CrawlSpider

Create the spider: scrapy genspider -t crawl xxx zzz.com

scrapy_redis

posted @ 2022-11-20 23:25 千里兮兮