道高一尺

July 7, 2017

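Summary: 1. The spider opens a page and obtains one or more requests, which the scrapy engine passes to the scheduler. When requests are numerous and arrive quickly, they form a request queue in the scheduler, which arranges their execution. 2. The scheduler takes requests out in a certain order and passes them on via the engine, the downloader ...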
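The request flow described in the summary can be pictured with a toy model. This is only a sketch using the standard library, not Scrapy's own code: `ToyScheduler`, `toy_download`, `toy_spider_parse`, `run_engine`, and the example.com URLs are all hypothetical names, and the first-in, first-out order is an assumption (the summary only says "a certain order").

```python
from collections import deque


class ToyScheduler:
    """FIFO queue of pending requests (a stand-in for the real scheduler)."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, url):
        self._queue.append(url)

    def next_request(self):
        # Return the oldest pending request, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None


def toy_download(url):
    # Stand-in for the downloader: builds a fake response, no network access.
    return {"url": url, "body": f"<html>page for {url}</html>"}


def toy_spider_parse(response):
    # Stand-in for a spider callback: the start page yields two new requests.
    if response["url"] == "http://example.com/":
        yield "http://example.com/a"
        yield "http://example.com/b"


def run_engine(start_urls):
    """Engine loop: spider -> scheduler -> downloader -> spider."""
    scheduler = ToyScheduler()
    for url in start_urls:
        scheduler.enqueue(url)
    crawled = []
    while (url := scheduler.next_request()) is not None:
        response = toy_download(url)
        crawled.append(response["url"])
        # New requests found by the spider go back into the scheduler queue.
        for new_url in toy_spider_parse(response):
            scheduler.enqueue(new_url)
    return crawled


order = run_engine(["http://example.com/"])
print(order)
```

The queue matters because the spider can produce requests faster than the downloader can serve them; the scheduler is where they wait.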

posted @ 2017-07-07 11:20 道高一尺 Views(5626) Comments(0) Recommendations(0)

July 4, 2017

Techniques for handling JS

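Summary: There are currently two ways to handle JS. 1. Execute the JS with a third-party tool such as selenium, which drives a browser to load all of the JS and return the result. 2. Simulate the JS execution by hand: 2.1) find the JS link, which you can check in IDLE with print(u'*******'); 2.2) simulate the JS execution and extract the data from it, which is generally returned in JSON format.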
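For the second approach, once the JS link is known, the remaining work is usually parsing its response. A minimal sketch, assuming the endpoint returns either plain JSON or JSON wrapped in a callback (JSONP); `extract_json`, the callback name, and the sample payload are hypothetical.

```python
import json
import re


def extract_json(payload):
    """Parse a body that is plain JSON or JSONP of the form callback(...)."""
    text = payload.strip()
    # If the body looks like name(...), keep only what is inside the parentheses.
    match = re.fullmatch(r"[\w$.]+\((.*)\)\s*;?", text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)


# Hypothetical response body fetched from the JS link.
sample_body = (
    'jsonp_callback({"items": [{"id": 1, "title": "first"}, '
    '{"id": 2, "title": "second"}]});'
)

data = extract_json(sample_body)
titles = [item["title"] for item in data["items"]]
print(titles)
```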

posted @ 2017-07-04 15:44 道高一尺 Views(303) Comments(0) Recommendations(0)

Two ways to log in to a website

Summary: There are currently two workable ways to crawl pages that require a login. Method 1: pass the username and password in a FormRequest. Method 2: add a cookie.
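At the HTTP level the two methods look roughly like the sketch below, which uses only the standard library and sends nothing over the network. In Scrapy they would correspond to `scrapy.FormRequest(formdata=...)` and to the `cookies` argument of `scrapy.Request`. The URLs, field names, and cookie name are assumptions for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

LOGIN_URL = "http://example.com/login"      # hypothetical login endpoint
PROFILE_URL = "http://example.com/profile"  # hypothetical protected page


def build_form_login(username, password):
    """Method 1: POST the credentials as a URL-encoded form body."""
    body = urlencode({"username": username, "password": password}).encode("utf-8")
    return Request(
        LOGIN_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_cookie_request(cookies):
    """Method 2: reuse an existing session by sending its cookies."""
    cookie_header = "; ".join(f"{key}={value}" for key, value in cookies.items())
    return Request(PROFILE_URL, headers={"Cookie": cookie_header})


form_req = build_form_login("alice", "secret")
cookie_req = build_cookie_request({"sessionid": "abc123"})

print(form_req.get_method(), form_req.data)
print(cookie_req.get_header("Cookie"))
```

The cookie route avoids handling the login form at all, but the cookie has to be copied from a logged-in browser session and will eventually expire.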

posted @ 2017-07-04 14:54 道高一尺 Views(1420) Comments(0) Recommendations(0)

Batch-downloading images with scrapy

Summary:
    # -*- coding: utf-8 -*-
    import scrapy
    from rihan.items import RihanItem

    class RihanspiderSpider(scrapy.Spider):
        name = "rihanspider"
        # allowed_domains = ["*******"]
        start_urls = [******...

posted @ 2017-07-04 08:40 道高一尺 Views(1321) Comments(0) Recommendations(0)

July 3, 2017

[Repost] Converting relative paths to absolute paths when downloading images with scrapy

Summary: Reposted from: http://blog.csdn.net/hjy_six/article/details/6862648
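The linked article is not reproduced here, so the following is only one common way to do the conversion, not necessarily the one it describes: `urllib.parse.urljoin` resolves a relative image path against the URL of the page it was found on. The page URL and image paths are made up for illustration; inside a Scrapy callback, `response.urljoin(src)` does the same job using the response's own URL as the base.

```python
from urllib.parse import urljoin

# Hypothetical page URL and image src values scraped from it.
page_url = "http://example.com/gallery/page1.html"
srcs = [
    "images/a.jpg",                  # relative to the page's directory
    "/static/b.jpg",                 # relative to the site root
    "../c.jpg",                      # one directory up
    "http://cdn.example.com/d.jpg",  # already absolute, left unchanged
]

absolute = [urljoin(page_url, src) for src in srcs]
for url in absolute:
    print(url)
```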

posted @ 2017-07-03 16:23 道高一尺 Views(1171) Comments(0) Recommendations(0)

Crawling IPs from the Xici website with scrapy

Summary:
    # Crawling IPs from the Xici website with scrapy
    # -*- coding: utf-8 -*-
    import scrapy
    from xici.items import XiciItem

    class XicispiderSpider(scrapy.Spider):
        name = "xicispider"
        # allowed_domains takes domain names only, so the /nn path is dropped
        allowed_domains = ["www.xicidaili.com"]...

posted @ 2017-07-03 11:43 道高一尺 Views(1104) Comments(0) Recommendations(0)

July 2, 2017

How to use logging

Summary: How to use logging
1. Simple usage
    >>> import logging
    >>> logging.warning('this is a warning')
    WARNING:root:this is a warning
2. A general way to record logs, which lets you pass in the log level
    >>> import logging
    >>> logging.log(logging.WARNING,"this is a...
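The two calls in the summary can also be tried on a named logger, which keeps the root logger untouched. A minimal sketch: the logger name "demo", the in-memory stream, and the message texts are choices made for this example; the format string mirrors the `LEVEL:name:message` layout seen in the output above.

```python
import io
import logging

# Write log records to an in-memory buffer so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)  # let every level through
logger.propagate = False        # do not also send records to the root logger
logger.addHandler(handler)

# 1. Level-specific shortcut.
logger.warning("this is a warning")
# 2. General form: the level is passed as the first argument.
logger.log(logging.WARNING, "this is also a warning")
logger.log(logging.INFO, "informational message")

output = stream.getvalue().splitlines()
print(output)
```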

posted @ 2017-07-02 16:38 道高一尺 Views(445) Comments(0) Recommendations(0)

Response in scrapy

Summary: Initialization parameters: class scrapy.http.Response(url[, status=200, headers, body, flags]). Other members: url, status, headers, body, request, meta, flags, copy(), replace(). Subclasses: class scrapy.http.TextResponse(url...

posted @ 2017-07-02 16:10 道高一尺 Views(3799) Comments(0) Recommendations(0)

Request in scrapy

Summary: Request in scrapy. Initialization parameters: class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback]). 1. Ways to generate a Request: def par...

posted @ 2017-07-02 16:05 道高一尺 Views(4122) Comments(0) Recommendations(0)

Attributes and methods of scrapy.Spider

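Summary: Attributes and methods of scrapy.Spider
Attributes:
  name: the spider's name, which must be unique
  allowed_domains: the permitted domains, which limit the scope of the crawl
  start_urls: the initial URLs
  custom_settings: per-spider settings, which override the global settings
  crawler: the crawler that the spider is bound to
  settings: the settings instance, containing all of the project's configuration variables
  logger: the logger instance, used to print debugging...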

posted @ 2017-07-02 12:08 道高一尺 Views(2836) Comments(0) Recommendations(0)
