Post category - spider

Summary: import json import httpx # note: h2 must be installed (pip install h2) data = { 'page': '2' } headers={ 'method': 'POST', 'authority': '', 'scheme': 'https', 'path': '/api/c…
posted @ 2021-10-18 15:14 Mr_Smith Views(570) Comments(0) Recommended(0)
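The truncated snippet above can be fleshed out into a minimal sketch. The endpoint URL and header values below are placeholders (the original path is cut off mid-string), and the live request is kept inside a function so nothing hits the network on import; `http2=True` requires both `httpx` and `h2` (`pip install httpx h2`).

```python
# Hedged sketch of an HTTP/2 POST with httpx; URL and header values are
# placeholders, not taken from the original post.
data = {"page": "2"}
headers = {
    "user-agent": "Mozilla/5.0",  # assumed UA string
}

def post_page(url: str) -> str:
    """Send the form data over HTTP/2 and return the response body."""
    import httpx  # requires: pip install httpx h2
    with httpx.Client(http2=True, headers=headers) as client:
        resp = client.post(url, data=data)
        resp.raise_for_status()
        return resp.text

# Usage: post_page("https://example.com/api/...")  <- substitute the real endpoint
```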
Summary: COOKIES_ENABLED Default: True Whether the cookies middleware is enabled. If disabled, cookies are not sent to the web server. COOKIES_DEBUG Default: False If enabled, Scrapy logs every cookie sent in requests (the Cookie header) and re…
posted @ 2019-08-26 19:30 Mr_Smith Views(367) Comments(0) Recommended(0)
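In `settings.py` these two options look like the fragment below (the values shown are Scrapy's defaults; flip `COOKIES_DEBUG` on only while debugging cookie traffic):

```python
# settings.py -- cookie handling (values shown are Scrapy's defaults)
COOKIES_ENABLED = True   # route cookies through the CookiesMiddleware
COOKIES_DEBUG = False    # when True, log every Cookie / Set-Cookie header
```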
Summary: import execjs # eval and compile both set up a JS environment e = execjs.eval('a = new Array(1,2,3)') # executes JS code directly print(e) x = execjs.compile(''' function add(x,y){ ret…
posted @ 2019-08-12 20:20 Mr_Smith Views(446) Comments(0) Recommended(0)
Summary: from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings # note: this script must sit in the same directory as scrapy.cfg if __name__ == '__main__': process = CrawlerProcess(get_project_settings()) ...
posted @ 2019-08-12 18:33 Mr_Smith Views(126) Comments(0) Recommended(0)
Summary: import js2py # instantiate an environment object for running JS context_js_obj = js2py.EvalJs() js_str = """ function A(a,b){ return a+b } """ # pass in js_str and execute it context_js_obj.execute(js_str) result = context_js_obj.A...
posted @ 2019-08-03 00:12 Mr_Smith Views(1303) Comments(0) Recommended(0)
Summary: import threading import time from queue import Queue from multiprocessing.dummy import Pool import requests from lxml import etree class QiuBaiSpider(object): # 1. the target site and request headers def __init__(s...
posted @ 2019-08-01 18:38 Mr_Smith Views(147) Comments(0) Recommended(0)
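The core pattern of this thread-pool variant — map a fetch-and-parse function over a URL list with `multiprocessing.dummy.Pool` — can be sketched with only the standard library; the fake fetch below stands in for the `requests`/`lxml` calls:

```python
from multiprocessing.dummy import Pool  # thread pool with the Pool API

def fetch_and_parse(url):
    # Stand-in for requests.get + etree.HTML parsing; returns a fake "item".
    return {"url": url, "status": 200}

urls = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs

with Pool(3) as pool:                        # 3 worker threads
    items = pool.map(fetch_and_parse, urls)  # map preserves input order
```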
Summary: import threading import time from queue import Queue import requests from lxml import etree class QiuBaiSpider(object): # 1. the target site and request headers def __init__(self): self.base_url = 'https:/...
posted @ 2019-07-31 15:48 Mr_Smith Views(153) Comments(0) Recommended(0)
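The queue-based version boils down to a producer/consumer pattern: fill a `Queue` with URLs, let daemon worker threads drain it, and `join()` until every task is done. A stdlib-only sketch with a simulated fetch:

```python
import threading
from queue import Queue

url_queue: Queue = Queue()
results = []

def worker():
    # Each worker pulls URLs until the queue is drained; the fetch is simulated.
    while True:
        url = url_queue.get()
        results.append({"url": url, "html": "<html>...</html>"})  # fake fetch
        url_queue.task_done()                # mark this unit of work finished

for n in range(1, 11):                       # placeholder URL list
    url_queue.put(f"https://example.com/page/{n}")

for _ in range(4):                           # 4 daemon worker threads
    threading.Thread(target=worker, daemon=True).start()

url_queue.join()                             # block until every task_done()
```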
Summary: import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from redis import Redis from incrementPro.items import IncrementproItem class MovieSpider(C...
posted @ 2019-05-17 22:52 Mr_Smith Views(191) Comments(0) Recommended(0)
Summary: pip install scrapy-redis; scrapy genspider -t crawl xxx www.xxx.com class ChoutiSpider(RedisCrawlSpider): name = 'chouti' # allowed_domains = ['www.chouti.com'] # start_urls = ['http://www.ch...
posted @ 2019-05-16 21:36 Mr_Smith Views(146) Comments(0) Recommended(0)
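A `RedisCrawlSpider` only becomes distributed once `settings.py` points the scheduler and dupe filter at Redis. A minimal wiring fragment, with the host/port as assumptions (a local Redis):

```python
# settings.py -- minimal scrapy-redis wiring (host/port are assumptions)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # shared request queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared dedup set
SCHEDULER_PERSIST = True        # keep the queue/fingerprints across runs
REDIS_HOST = "127.0.0.1"        # assumed local Redis
REDIS_PORT = 6379
```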
Summary: 1. Create a scrapy project: scrapy startproject projectName 2. Create a spider file: scrapy genspider -t crawl spiderName www.xxx.com # -*- coding: utf-8 -*- import scrapy from scra…
posted @ 2019-05-16 13:11 Mr_Smith Views(611) Comments(0) Recommended(0)
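What a `LinkExtractor` does under a `Rule` — pull out hrefs matching a pattern so the spider can follow them — can be sketched with the standard library (the markup and the `allow` pattern are made up for illustration):

```python
import re

# Sample markup standing in for a crawled response body.
html = """
<a href="/page/1.html">1</a>
<a href="/page/2.html">2</a>
<a href="/about">about</a>
"""

# Rough analogue of LinkExtractor(allow=r'/page/\d+\.html')
allow = re.compile(r'/page/\d+\.html')
links = [href for href in re.findall(r'href="([^"]+)"', html) if allow.search(href)]
```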
Summary: class MovieSpider(scrapy.Spider): name = 'movie' allowed_domains = ['www.id97.com'] start_urls = ['http://www.id97.com/'] def parse(self, response): div_list = response.xpath...
posted @ 2019-05-16 11:24 Mr_Smith Views(871) Comments(0) Recommended(0)
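The `parse` method's extraction step can be exercised offline with `xml.etree` as a stand-in for `response.xpath`; the markup and field names below are invented, not taken from the site:

```python
import xml.etree.ElementTree as ET

# Well-formed sample standing in for the page's div list.
html = """<html><body>
  <div class="item"><h1>Movie A</h1><span>2019</span></div>
  <div class="item"><h1>Movie B</h1><span>2018</span></div>
</body></html>"""

root = ET.fromstring(html)
items = []
for div in root.iter("div"):          # analogue of response.xpath('//div')
    items.append({
        "name": div.find("h1").text,  # analogue of ./h1/text()
        "year": div.find("span").text,
    })
```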
Summary: Increase concurrency: Scrapy's default number of concurrent requests is 16; raise it as needed, e.g. set CONCURRENT_REQUESTS = 100 in the settings file. Lower the log level: running Scrapy produces a large volume of log output; to cut CPU usage, set the log level to INFO or ERROR in the settings file: LOG_LEVEL = 'INFO'...
posted @ 2019-05-16 11:18 Mr_Smith Views(172) Comments(0) Recommended(0)
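Collected in one `settings.py` fragment, the tuning described above looks like this. The abstract is truncated, so the last three lines (cookies, retry, timeout) are the usual companions of this recipe, added here as assumptions:

```python
# settings.py -- performance tuning knobs
CONCURRENT_REQUESTS = 100   # raise concurrency from Scrapy's default of 16
LOG_LEVEL = 'INFO'          # cut log volume (or 'ERROR')
COOKIES_ENABLED = False     # assumption: skip cookie processing when not needed
RETRY_ENABLED = False       # assumption: don't retry failed pages
DOWNLOAD_TIMEOUT = 10       # assumption: give up on slow responses after 10 s
```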
Summary: # create the project scrapy startproject demo # generate a spider cd demo scrapy genspider first www.baidu.com # in settings, set ROBOTSTXT_OBEY = False USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/...
posted @ 2019-05-14 16:10 Mr_Smith Views(153) Comments(0) Recommended(0)
Summary: # simulated login to renren.com import requests import urllib from lxml import etree # get a Session object session = requests.Session() # download the captcha image headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit...
posted @ 2019-05-14 10:01 Mr_Smith Views(1427) Comments(0) Recommended(0)
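The load-bearing detail in such a login flow is that `requests.Session` stores the server's `Set-Cookie` value and replays it on later requests. Parsing a header of that shape can be sketched with the stdlib; the header value here is fabricated, not real renren.com output:

```python
from http.cookies import SimpleCookie

# A fabricated Set-Cookie value like the one a login response would carry.
raw = "sessionid=abc123; Path=/; HttpOnly"
jar = SimpleCookie()
jar.load(raw)                          # parse name, value, and attributes

session_id = jar["sessionid"].value    # the token the Session replays later
```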
Summary: You need to download a webdriver build matching your Chrome version: http://chromedriver.storage.googleapis.com/index.html
posted @ 2019-05-13 21:43 Mr_Smith Views(1918) Comments(0) Recommended(0)
Summary: import requests from lxml import etree url='https://bj.58.com/shunyi/ershoufang/?PGTID=0d30000c-0047-6aa6-0218-69d1ed59a77b&ClickID=3' headers = {'User-…
posted @ 2019-05-11 00:07 Mr_Smith Views(241) Comments(0) Recommended(0)
Summary: Regex parsing. A review of commonly used regular expressions:
posted @ 2019-05-09 21:53 Mr_Smith Views(1702) Comments(0) Recommended(0)
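A few of the patterns such a review typically covers, exercised with the stdlib `re` module (the examples are mine, not from the post):

```python
import re

# Greedy vs. non-greedy: .* grabs as much as possible, .*? as little.
assert re.findall(r'<(.*)>', '<a><b>') == ['a><b']
assert re.findall(r'<(.*?)>', '<a><b>') == ['a', 'b']

# Character classes and quantifiers: pull simple numbers out of text.
assert re.findall(r'\d+', 'price: 42 yuan, qty: 7') == ['42', '7']

# re.S lets . cross newlines -- essential when slicing HTML blocks.
html = '<div>line1\nline2</div>'
assert re.search(r'<div>(.*)</div>', html, re.S).group(1) == 'line1\nline2'
```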
Summary: # scrape Baidu Translate results import requests url = 'https://fanyi.baidu.com/sug' wd = input('enter a word:') data = { 'kw':wd } response = requests.post(url=url,data=data) print(response.json()) # response.text : string # ...
posted @ 2019-05-09 15:21 Mr_Smith Views(141) Comments(0) Recommended(0)
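The snippet above needs the live API to run; its two mechanical pieces — form-encoding the POST body and decoding the JSON reply — can be exercised offline with the stdlib. The canned payload below is a guess at the response shape, not real API output:

```python
import json
from urllib.parse import urlencode

# What requests.post(url, data={'kw': wd}) sends as the request body:
body = urlencode({"kw": "dog"})

# Canned text standing in for response.json(); the real schema may differ.
canned = '{"errno": 0, "data": [{"k": "dog", "v": "n. a dog"}]}'
parsed = json.loads(canned)
first_hit = parsed["data"][0]["k"]   # first suggestion key
```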
Summary: anaconda jupyter notebook
posted @ 2019-05-07 23:31 Mr_Smith Views(141) Comments(0) Recommended(0)