Post category - spider
Abstract: import json import httpx # note: pip install h2 is required data = { 'page': '2' } headers = { 'method': 'POST', 'authority': '', 'scheme': 'https', 'path': '/api/c…
Read full post
Abstract: COOKIES_ENABLED Default: True. Whether to enable the cookies middleware; if disabled, cookies are not sent to the web server. COOKIES_DEBUG Default: False. If enabled, Scrapy logs all cookies sent in requests (the Cookie request header) and re…
Read full post
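In a Scrapy project these two options live in the project's settings.py. A minimal fragment showing both switches (values here are illustrative, not the defaults):

```python
# settings.py fragment (Scrapy project settings)

# Defaults to True; set to False to stop sending/receiving cookies entirely.
COOKIES_ENABLED = True

# Defaults to False; when True, Scrapy logs every Cookie header it sends
# and every Set-Cookie header it receives -- useful while debugging logins.
COOKIES_DEBUG = True
```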
Abstract: import execjs # eval and compile both build a JS execution environment e = execjs.eval('a = new Array(1,2,3)') # JS code can be executed directly print(e) x = execjs.compile(''' function add(x,y){ ret…
Read full post
Abstract: from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings # note: this script must sit in the same directory as scrapy.cfg if __name__ == '__main__': process = CrawlerProcess(get_project_settings()) ...
Read full post
Abstract: import js2py # instantiate an object that provides a JS execution environment context_js_obj = js2py.EvalJs() js_str = """ function A(a,b){ return a+b } """ # pass in js_str and execute it context_js_obj.execute(js_str) result = context_js_obj.A...
Read full post
Abstract: import threading import time from queue import Queue from multiprocessing.dummy import Pool import requests from lxml import etree class QiuBaiSpider(object): # 1. the site to crawl and the request headers def __init__(s…
Read full post
Abstract: import threading import time from queue import Queue import requests from lxml import etree class QiuBaiSpider(object): # 1. the site to crawl and the request headers def __init__(self): self.base_url = 'https:/...
Read full post
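The two threaded-spider posts above follow the same producer/consumer pattern: worker threads pull URLs from one queue.Queue and push results onto another. A stdlib-only sketch of that skeleton (the fetch step is stubbed; a real spider would call requests.get and parse with lxml):

```python
import threading
from queue import Queue

def fetch(url):
    # Stub: a real spider would do requests.get(url, headers=...) here.
    return f"<html>{url}</html>"

def worker(url_q, result_q):
    while True:
        url = url_q.get()
        if url is None:          # sentinel: shut this worker down
            url_q.task_done()
            break
        result_q.put(fetch(url))
        url_q.task_done()

def crawl(urls, n_workers=3):
    url_q, result_q = Queue(), Queue()
    threads = [threading.Thread(target=worker, args=(url_q, result_q))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for u in urls:
        url_q.put(u)
    for _ in threads:
        url_q.put(None)          # one sentinel per worker
    url_q.join()                 # wait until every item is marked done
    return [result_q.get() for _ in range(result_q.qsize())]

pages = crawl([f"https://example.com/page/{i}" for i in range(1, 6)])
print(len(pages))  # 5
```

The sentinel-per-worker shutdown is what the posts' while-loops around queue.get() are doing implicitly; task_done()/join() guarantee the main thread only returns once every URL has been processed.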
Abstract: import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from redis import Redis from incrementPro.items import IncrementproItem class MovieSpider(C...
Read full post
Abstract: pip install scrapy-redis; scrapy genspider -t crawl xxx www.xxx.com class ChoutiSpider(RedisCrawlSpider): name = 'chouti' # allowed_domains = ['www.chouti.com'] # start_urls = ['http://www.ch…
Read full post
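For a RedisCrawlSpider like the one above, scrapy-redis also needs a few settings.py entries so scheduling and deduplication go through Redis. A typical fragment (the Redis host/port are assumptions for a local setup):

```python
# settings.py fragment for a scrapy-redis project

# Route scheduling and duplicate filtering through Redis so that
# multiple spider processes can share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set across runs instead of clearing them.
SCHEDULER_PERSIST = True

# Where the shared Redis instance lives (adjust to your deployment).
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```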
Abstract: 1. Create a scrapy project: scrapy startproject projectName 2. Create the crawl spider file: scrapy genspider -t crawl spiderName www.xxx.com # -*- coding: utf-8 -*- import scrapy from scra…
Read full post
摘要:class MovieSpider(scrapy.Spider): name = 'movie' allowed_domains = ['www.id97.com'] start_urls = ['http://www.id97.com/'] def parse(self, response): div_list = response.xpath...
阅读全文
Abstract: Increase concurrency: by default Scrapy's concurrency (CONCURRENT_REQUESTS) is 16 and can be raised as needed; setting CONCURRENT_REQUESTS = 100 in the settings file allows 100 concurrent requests. Lower the log level: running Scrapy produces a lot of log output, so to reduce CPU usage set the log level to INFO or ERROR in the settings file: LOG_LEVEL = 'INFO'...
Read full post
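The tuning tips above are all plain settings.py assignments. A fragment collecting them, with a few commonly paired switches added as assumptions (the truncated post may or may not list these):

```python
# settings.py fragment: performance tuning

CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
LOG_LEVEL = 'INFO'          # less log output -> less CPU spent on logging

# Commonly paired tuning settings (my assumption, not quoted from the post):
COOKIES_ENABLED = False     # skip cookie processing when the site doesn't need it
RETRY_ENABLED = False       # don't retry failed pages
DOWNLOAD_TIMEOUT = 10       # give up on very slow responses quickly
```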
Abstract: # create the project scrapy startproject demo # generate a spider cd demo scrapy genspider first www.baidu.com # in settings.py set ROBOTSTXT_OBEY = False USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/...
Read full post
Abstract: # simulated login to renren.com import requests import urllib from lxml import etree # get a Session object session = requests.Session() # download the captcha image headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit...
Read full post
Abstract: You need to download the webdriver that matches your Chrome browser version: http://chromedriver.storage.googleapis.com/index.html
Read full post
Abstract: import requests from lxml import etree url='https://bj.58.com/shunyi/ershoufang/?PGTID=0d30000c-0047-6aa6-0218-69d1ed59a77b&ClickID=3' headers = {'User-…
Read full post
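The requests + lxml pattern in this post boils down to: fetch the HTML, build an etree, run XPath over it. A sketch with the HTML inlined so it runs without the network (in the post, the string would come from requests.get(url, headers=headers).text; the markup and listing titles below are made up for illustration):

```python
from lxml import etree

# In the real spider this string comes from requests.get(url, headers=headers).text
html = """
<ul>
  <li class="house"><a href="/item/1">Two-bedroom, Shunyi</a></li>
  <li class="house"><a href="/item/2">Three-bedroom, Shunyi</a></li>
</ul>
"""

tree = etree.HTML(html)                                  # parse into an element tree
titles = tree.xpath('//li[@class="house"]/a/text()')     # listing titles
links = tree.xpath('//li[@class="house"]/a/@href')       # detail-page links
print(titles)  # ['Two-bedroom, Shunyi', 'Three-bedroom, Shunyi']
```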
Abstract: # fetch Baidu Translate suggestion results import requests url = 'https://fanyi.baidu.com/sug' wd = input('enter a word:') data = { 'kw':wd } response = requests.post(url=url,data=data) print(response.json()) # response.text: string #...
Read full post
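The key step in this post is sending form-encoded POST data. A network-free sketch of what requests.post(url, data={'kw': wd}) puts on the wire (the helper name is mine, not from the post):

```python
from urllib.parse import urlencode

def build_sug_request(word):
    """Build the URL and form-encoded body that requests.post(url, data=...)
    would send to Baidu Translate's suggestion endpoint."""
    url = "https://fanyi.baidu.com/sug"
    body = urlencode({"kw": word})   # e.g. "kw=spider"
    return url, body

url, body = build_sug_request("spider")
print(url, body)

# With requests installed and a network connection, the actual call is:
#   import requests
#   resp = requests.post(url, data={"kw": "spider"})
#   print(resp.json())   # the post parses the JSON response this way
```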
