Web Crawlers - Post Category (Page 2) - 市丸银

Persisting Scrapy data to an Excel spreadsheet

Summary: Prerequisite (to prevent garbled characters): ITEM_PIPELINES = { 'xpc.pipelines.ExcelPipeline': 300, } Method 1: 1. Install openpyxl: conda install openpyxl 2. Pipeline: from openpyxl import Workbook ...

posted @ 2019-11-15 17:21 市丸银 Views (667) Comments (0) Recommendations (0)

Fixing Chinese text that turns into \u-prefixed escape sequences when Scrapy stores data in a JSON file

Summary: Add FEED_EXPORT_ENCODING = 'utf-8' to settings.py.

posted @ 2019-11-15 16:08 市丸银 Views (519) Comments (0) Recommendations (0)
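The effect of this setting can be reproduced with the standard library alone. This is a minimal sketch, assuming (as I understand Scrapy's JSON exporters) that leaving FEED_EXPORT_ENCODING unset produces ASCII-escaped JSON, while 'utf-8' writes the characters as they are. The sample item is made up for illustration.

```python
import json

# Hypothetical scraped item containing Chinese text.
item = {"title": "爬虫"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept readable.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs, so the escaped file is not corrupted, just hard to read.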

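Creating a crawler project with Anaconda

Summary: # Install: conda env list; conda create -n <envname>; conda activate <envname>; conda install scrapy; scrapy # check that the installation succeeded. # Create the project: cd /d target-directory; scrapy startproject ...

posted @ 2019-11-14 11:22 市丸银 Views (194) Comments (0) Recommendations (0)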

BeautifulSoup

Summary: Official site: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Beginner tutorial: http://www.jsphp.net/python/show-24-214-1.html My own notes: https://i-beta.cnblogs.com/diarie ...

posted @ 2019-11-13 09:31 市丸银 Views (155) Comments (0) Recommendations (0)

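requests

Summary: Official site: https://requests.kennethreitz.org//zh_CN/latest/user/quickstart.html Test site: httpbin.org Note: see the official docs for everything else. 1. Requests with headers 2. Requests with cookies 3. Requests with Basic auth (auth)

posted @ 2019-11-13 09:29 市丸银 Views (154) Comments (0) Recommendations (0)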
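To see what the three kinds of request listed above amount to on the wire, here is a standard-library sketch that builds (but never sends) a request carrying custom headers, a cookie, and Basic auth. The helper name build_request, the credentials, and the cookie values are made up for illustration. With requests itself the same thing is normally written as requests.get(url, headers=..., cookies=..., auth=(user, password)).

```python
import base64
from urllib.request import Request


def build_request(url, user, password, cookies):
    """Build (but do not send) a request with headers, cookies and Basic auth."""
    # Basic auth is just "user:password", base64-encoded, in the Authorization header.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    # Cookies travel as a single "name=value; name=value" header.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(
        url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Cookie": cookie_header,
            "Authorization": f"Basic {token}",
        },
    )


# Hypothetical credentials against the httpbin.org test site.
req = build_request(
    "http://httpbin.org/basic-auth/user/passwd", "user", "passwd", {"k1": "v1"}
)
print(req.header_items())
```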

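The scrapy-redis component

Summary: Core idea: a shared crawl queue. Purpose: distributed crawling. I. Installation: pip3 install -i https://pypi.douban.com/simple scrapy-redis II. Deduplication 1. Configuration file: scrapy deduplication DUPEFILTER_KEY = 'dupefilter:%(timestamp)s' ...

posted @ 2019-10-28 23:47 市丸银 Views (222) Comments (0) Recommendations (0)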
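For context, a settings.py fragment that switches a project to the shared queue and shared dedup filter might look like the sketch below. The setting names come from the scrapy-redis README. The Redis address is a placeholder, and this is not necessarily the configuration used in the full post.

```python
# settings.py fragment (sketch, values are assumptions)

# Use the scrapy-redis scheduler so every worker pulls from one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Store request fingerprints in Redis so deduplication is shared as well.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprints in Redis when the spider stops.
SCHEDULER_PERSIST = True

# Hypothetical Redis location; point this at your own server.
REDIS_URL = "redis://localhost:6379"
```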

Scrapy signals

Summary: 1. Class 2. Configuration file

posted @ 2019-10-28 23:24 市丸银 Views (246) Comments (0) Recommendations (0)

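Starting Scrapy spiders with custom commands

Summary: I. Running a single spider: typing the command in a terminal every time you run Scrapy is tedious, so create manager.py (any name works) in the project directory. II. Running all spiders: 1. Create a commands directory (any name) at the same level as spiders 2. Create crawlall.py inside it, which defines what the command runs 3. Configuration file 4. manager.py ...

posted @ 2019-10-28 23:11 市丸银 Views (261) Comments (0) Recommendations (0)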

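Scrapy middleware

Summary: I. Downloader middleware 1. Use cases: proxies; USER_AGENT (can simply be set in the settings file) 2. Defining the class a. process_request returns None: execution order md1 request -> md2 request -> md2 response -> md1 response b. process ...

posted @ 2019-10-28 22:56 市丸银 Views (245) Comments (0) Recommendations (0)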
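The execution order noted above (md1 request -> md2 request -> md2 response -> md1 response) can be illustrated without Scrapy. This is a toy sketch: the Middleware class and the download() helper are made up for illustration, and they only model the case where process_request returns None.

```python
class Middleware:
    """Toy downloader middleware that records when its hooks run."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append(f"{self.name} request")
        return None  # None means "carry on to the next middleware"

    def process_response(self, request, response):
        self.log.append(f"{self.name} response")
        return response


def download(request, middlewares, fetch):
    """Run request hooks in order, fetch, then run response hooks in reverse."""
    for mw in middlewares:
        mw.process_request(request)
    response = fetch(request)
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


log = []
chain = [Middleware("md1", log), Middleware("md2", log)]
result = download("fake request", chain, fetch=lambda req: "fake response")
print(" -> ".join(log))
```

Requests pass through the chain in configured order and responses come back in the opposite order, which is why md2 sees the response before md1.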

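Introduction to Scrapy

Summary: I. Architecture diagram II. Workflow 1. The engine takes a URL from the scheduler for crawling 2. The engine wraps the URL in a request (start_requests) and passes it to the downloader 3. The downloader fetches the resource and wraps it in a Response 4. The spider parses (parse) the Response 5. The parsed entities (yield Item) are handed to the pipelin ...

posted @ 2019-10-27 23:25 市丸银 Views (149) Comments (0) Recommendations (0)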
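The workflow steps above can be mimicked in a few lines of plain Python. This is only a toy model of the data flow, not Scrapy's real engine: fake_download, parse, and run are hypothetical stand-ins, and nothing is fetched from the network.

```python
from collections import deque


def fake_download(url):
    """Stand-in for the downloader: returns a dict playing the role of a Response."""
    return {"url": url, "body": f"<html>{url}</html>"}


def parse(response):
    """Stand-in for a spider's parse(): yields one item per response."""
    yield {"source": response["url"], "length": len(response["body"])}


def run(start_urls):
    scheduler = deque(start_urls)  # the scheduler holds pending URLs
    stored = []                    # plays the role of the pipeline's storage
    while scheduler:
        url = scheduler.popleft()        # step 1: engine takes a URL from the scheduler
        response = fake_download(url)    # steps 2-3: downloader returns a Response
        for item in parse(response):     # step 4: the spider parses the Response
            stored.append(item)          # step 5: items are handed to the pipeline
    return stored


items = run(["http://quotes.toscrape.com"])
print(items)
```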

XPath parsing in Scrapy

Summary: I. Using xpath outside the Scrapy framework. Tracing through response: HtmlResponse -> TextResponse -> self.selector.xpath(query, **kwargs) -> selector(self) -> from scrapy.selector import Selector ...

posted @ 2019-10-27 23:04 市丸银 Views (2948) Comments (0) Recommendations (0)

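Setting a proxy in Scrapy

Summary: Where proxies are set: downloader middleware. I. Built-in proxy (pro: simple; con: can only proxy one IP) 1. Source analysis: process_request(self, request, spider) runs before the downloader executes; the _set_proxy method (sets the proxy) -> self.proxies[scheme] -> self.proxies ...

posted @ 2019-10-27 22:15 市丸银 Views (2618) Comments (0) Recommendations (0)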
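As far as I can tell, the self.proxies table in the built-in middleware is filled from the standard proxy environment variables through urllib's proxy lookup, which would also explain the "one IP" limitation: one fixed address per scheme. The sketch below shows that lookup with the standard library only. The proxy address is a placeholder, and getproxies_environment() is used instead of getproxies() so the result does not depend on operating-system settings.

```python
import os
from urllib.request import getproxies_environment

# Hypothetical proxy address; these must be set before the crawl starts.
os.environ["http_proxy"] = "http://127.0.0.1:8888"
os.environ["https_proxy"] = "http://127.0.0.1:8888"

# Maps each scheme ("http", "https", ...) to the proxy URL found in the environment.
proxies = getproxies_environment()
print(proxies["http"], proxies["https"])
```

Rotating between several proxies needs a custom downloader middleware that sets request.meta['proxy'] per request.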

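Customizing start requests in Scrapy

Summary: How the Scrapy engine fetches the start URLs from the spider: 1. It calls the start_requests method (from the parent class) and takes the return value 2. It turns the return value into an iterator with iter() 3. It calls __next__() to pull out each value 4. It puts all returned values into the scheduler. To customize, override start_requests in the spider class: from scrapy import ...

posted @ 2019-10-26 20:00 市丸银 Views (218) Comments (0) Recommendations (0)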
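The four steps above explain why an overridden start_requests may either yield requests or return a list: the result is passed through iter() either way. This is a toy sketch in plain Python. The spider classes, the dict-based "requests", and schedule_start_requests are made up for illustration and are not Scrapy's real classes.

```python
from collections import deque


class GeneratorSpider:
    """start_requests written as a generator (uses yield)."""
    start_urls = ["http://quotes.toscrape.com/page/1/", "http://quotes.toscrape.com/page/2/"]

    def start_requests(self):
        for url in self.start_urls:
            yield {"url": url, "callback": "parse"}


class ListSpider(GeneratorSpider):
    """start_requests written to return a plain list instead."""

    def start_requests(self):
        return [{"url": url, "callback": "parse"} for url in self.start_urls]


def schedule_start_requests(spider):
    scheduler = deque()
    result = spider.start_requests()   # step 1: call start_requests
    iterator = iter(result)            # step 2: turn the result into an iterator
    while True:
        try:
            request = next(iterator)   # step 3: next() calls iterator.__next__()
        except StopIteration:
            break
        scheduler.append(request)      # step 4: hand the request to the scheduler
    return scheduler


from_generator = schedule_start_requests(GeneratorSpider())
from_list = schedule_start_requests(ListSpider())
print(len(from_generator), len(from_list))
```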

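Depth and priority in Scrapy

Summary: I. Depth: configuration file settings.py II. Priority: configuration file. When the priority setting is positive, the deeper the request, the lower its priority. In the source code, priority III. Source analysis 1. Depth. Premise: scrapy yield Request object -> middleware -> scheduler... If the yielded Request object has no meta value set, meta defaults to N ...

posted @ 2019-10-26 16:29 市丸银 Views (1437) Comments (0) Recommendations (0)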
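The relationship between depth and priority can be sketched as below. This models my understanding of the depth middleware (depth is parent depth plus one, and priority is reduced by depth times DEPTH_PRIORITY). The dict-based requests and the process_child helper are made up for illustration, and the setting values are arbitrary.

```python
# Values you would normally put in settings.py (arbitrary for this sketch).
DEPTH_LIMIT = 3      # requests deeper than this are dropped; 0 means no limit
DEPTH_PRIORITY = 1   # positive: deeper requests get lower priority


def process_child(parent_meta, child):
    """Assign depth and priority to a request yielded while parsing a response."""
    # A response without a recorded depth is treated as depth 0.
    depth = parent_meta.get("depth", 0) + 1
    if DEPTH_LIMIT and depth > DEPTH_LIMIT:
        return None  # too deep: the request is discarded
    child["meta"]["depth"] = depth
    child["priority"] -= depth * DEPTH_PRIORITY
    return child


first = process_child({}, {"meta": {}, "priority": 0})
second = process_child(first["meta"], {"meta": {}, "priority": 0})
too_deep = process_child({"depth": 3}, {"meta": {}, "priority": 0})
print(first, second, too_deep)
```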

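Deduplication in Scrapy

Summary: I. Native 1. Module 2. RFPDupeFilter methods a. request_seen. Core: every time the spider yields a Request object, request_seen runs once. Purpose: deduplication, so the same URL is visited only once. Implementation: convert the URL into a fixed-length, unique value; if that value already exists, return True to indicate that it has already ...

posted @ 2019-10-25 23:45 市丸银 Views (726) Comments (0) Recommendations (0)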
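The request_seen idea described above fits in a dozen lines. This is a simplified sketch, not the real RFPDupeFilter: it fingerprints only the URL string with SHA-1, whereas Scrapy's fingerprint also takes things like the request method and body into account. The class name SimpleDupeFilter is made up.

```python
import hashlib


class SimpleDupeFilter:
    """Toy duplicate filter keyed on a fixed-length hash of the URL."""

    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, url):
        # SHA-1 turns a URL of any length into a 40-character hex string.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def request_seen(self, url):
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True   # already visited: the request should be dropped
        self.fingerprints.add(fp)
        return False      # first visit: the request may proceed


dupefilter = SimpleDupeFilter()
first_visit = dupefilter.request_seen("http://quotes.toscrape.com/page/1/")
second_visit = dupefilter.request_seen("http://quotes.toscrape.com/page/1/")
print(first_visit, second_visit)
```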

Persistence in Scrapy (items + pipelines)

Summary: I. items: holds the scraped content. items.py: import scrapy class QuoteItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() text = scrapy ...

posted @ 2019-10-23 23:13 市丸银 Views (322) Comments (0) Recommendations (0)
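On the pipelines side, a pipeline is an ordinary class with open_spider, process_item, and close_spider hooks, so one can be sketched and exercised without Scrapy installed. The JsonLinesPipeline class, its default file name, and the demo() helper below are assumptions for illustration and are not taken from the full post.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Write each item as one JSON object per line."""

    def __init__(self, path="quotes.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # return the item so later pipelines still receive it

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close.
        self.file.close()


def demo():
    """Drive the pipeline by hand, the way Scrapy would during a crawl."""
    path = os.path.join(tempfile.mkdtemp(), "quotes.jl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    pipeline.process_item({"text": "hello", "author": "someone"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


rows = demo()
print(rows)
```

In a real project the class would be enabled through ITEM_PIPELINES in settings.py.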

Basic usage of Scrapy

Summary: Target: http://quotes.toscrape.com Single page: # -*- coding: utf-8 -*- import scrapy class QuoteSpider(scrapy.Spider): name = 'quote' allowed_domains = ['quotes.to ...

posted @ 2019-10-23 22:41 市丸银 Views (170) Comments (0) Recommendations (0)

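Installing the Scrapy framework and creating a project

Summary: Overview: a large, full-featured crawling component. With Anaconda: conda install -c conda-forge scrapy I. Installation: Windows 1. Download https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (wait patiently for the page to refresh) pip3 instal ...

posted @ 2019-10-22 22:47 市丸银 Views (214) Comments (0) Recommendations (0)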

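Making requests with requests

Summary: requests: forging browser requests. Requests: 1. get: requests.get( url='', params={ 'k1': 'v1', 'k2': 'v2' } ), i.e. url?k1=v1&k2=v2 2. post: requests.post( url='', # data: the submitted data data={key: value}, # request headers headers={}, # the cookies value needs to be taken from the get req ...

posted @ 2019-10-22 15:28 市丸银 Views (174) Comments (0) Recommendations (0)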
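The params-to-query-string mapping described for get can be checked with the standard library. This sketch uses urllib.parse.urlencode rather than requests itself. The with_params helper is made up, and httpbin.org/get is used only as a sample address.

```python
from urllib.parse import urlencode


def with_params(url, params):
    """Append params to url as a query string, as a GET request would."""
    return f"{url}?{urlencode(params)}"


full_url = with_params("http://httpbin.org/get", {"k1": "v1", "k2": "v2"})
print(full_url)
```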

Simple crawler usage

Summary: I. Background knowledge II. Examples

posted @ 2019-10-19 22:37 市丸银 Views (200) Comments (0) Recommendations (0)
