Post category - spider
Abstract: import json import httpx # note: pip install h2 is required data = { 'page': '2' } headers = { 'method': 'POST', 'authority': '', 'scheme': 'https', 'path': '/api/c…
Read full post
Abstract: COOKIES_ENABLED Default: True. Whether to enable the cookies middleware; if disabled, cookies are not sent to the web server. COOKIES_DEBUG Default: False. If enabled, Scrapy logs all cookies sent in requests (the Cookie request header) and re…
Read full post
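In a Scrapy project these two options live in the project's settings.py. A minimal fragment showing both switches (values here are illustrative, not the defaults):

```python
# settings.py fragment (Scrapy project settings)

# Defaults to True; set to False to stop sending/receiving cookies entirely.
COOKIES_ENABLED = True

# Defaults to False; when True, Scrapy logs every Cookie header it sends
# and every Set-Cookie header it receives -- useful while debugging logins.
COOKIES_DEBUG = True
```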
Abstract: import execjs # eval and compile both build a JS execution environment e = execjs.eval('a = new Array(1,2,3)') # JS code can be executed directly print(e) x = execjs.compile(''' function add(x,y){ ret…
Read full post
Abstract: from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings # note: this script must sit in the same directory as scrapy.cfg if __name__ == '__main__': process = CrawlerProcess(get_project_settings()) ...
Read full post
Abstract: import js2py # instantiate an object that provides a JS execution environment context_js_obj = js2py.EvalJs() js_str = """ function A(a,b){ return a+b } """ # pass in js_str and execute it context_js_obj.execute(js_str) result = context_js_obj.A...
Read full post
Abstract: import threading import time from queue import Queue from multiprocessing.dummy import Pool import requests from lxml import etree class QiuBaiSpider(object): # 1. the site to crawl and the request headers def __init__(s…
Read full post
Abstract: import threading import time from queue import Queue import requests from lxml import etree class QiuBaiSpider(object): # 1. the site to crawl and the request headers def __init__(self): self.base_url = 'https:/...
Read full post
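The two threaded-spider posts above follow the same producer/consumer pattern: worker threads pull URLs from one queue.Queue and push results onto another. A stdlib-only sketch of that skeleton (the fetch step is stubbed; a real spider would call requests.get and parse with lxml):

```python
import threading
from queue import Queue

def fetch(url):
    # Stub: a real spider would do requests.get(url, headers=...) here.
    return f"<html>{url}</html>"

def worker(url_q, result_q):
    while True:
        url = url_q.get()
        if url is None:          # sentinel: shut this worker down
            url_q.task_done()
            break
        result_q.put(fetch(url))
        url_q.task_done()

def crawl(urls, n_workers=3):
    url_q, result_q = Queue(), Queue()
    threads = [threading.Thread(target=worker, args=(url_q, result_q))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for u in urls:
        url_q.put(u)
    for _ in threads:
        url_q.put(None)          # one sentinel per worker
    url_q.join()                 # wait until every item is marked done
    return [result_q.get() for _ in range(result_q.qsize())]

pages = crawl([f"https://example.com/page/{i}" for i in range(1, 6)])
print(len(pages))  # 5
```

The sentinel-per-worker shutdown is what the posts' while-loops around queue.get() are doing implicitly; task_done()/join() guarantee the main thread only returns once every URL has been processed.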
Abstract: import scrapy from scrapy.linkextractors import LinkExtractor from scrapy.spiders import CrawlSpider, Rule from redis import Redis from incrementPro.items import IncrementproItem class MovieSpider(C...
Read full post
Abstract: pip install scrapy-redis; scrapy genspider -t crawl xxx www.xxx.com class ChoutiSpider(RedisCrawlSpider): name = 'chouti' # allowed_domains = ['www.chouti.com'] # start_urls = ['http://www.ch…
Read full post
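For a RedisCrawlSpider like the one above, scrapy-redis also needs a few settings.py entries so scheduling and deduplication go through Redis. A typical fragment (the Redis host/port are assumptions for a local setup):

```python
# settings.py fragment for a scrapy-redis project

# Route scheduling and duplicate filtering through Redis so that
# multiple spider processes can share one request queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and dedup set across runs instead of clearing them.
SCHEDULER_PERSIST = True

# Where the shared Redis instance lives (adjust to your deployment).
REDIS_HOST = "127.0.0.1"
REDIS_PORT = 6379
```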
Abstract: 1. Create a scrapy project: scrapy startproject projectName 2. Create the crawl spider file: scrapy genspider -t crawl spiderName www.xxx.com # -*- coding: utf-8 -*- import scrapy from scra…
Read full post
摘要:class MovieSpider(scrapy.Spider): name = 'movie' allowed_domains = ['www.id97.com'] start_urls = ['http://www.id97.com/'] def parse(self, response): div_list = response.xpath...
阅读全文
Abstract: Increase concurrency: by default Scrapy's concurrency (CONCURRENT_REQUESTS) is 16 and can be raised as needed; setting CONCURRENT_REQUESTS = 100 in the settings file allows 100 concurrent requests. Lower the log level: running Scrapy produces a lot of log output, so to reduce CPU usage set the log level to INFO or ERROR in the settings file: LOG_LEVEL = 'INFO'...
Read full post
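The tuning tips above are all plain settings.py assignments. A fragment collecting them, with a few commonly paired switches added as assumptions (the truncated post may or may not list these):

```python
# settings.py fragment: performance tuning

CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
LOG_LEVEL = 'INFO'          # less log output -> less CPU spent on logging

# Commonly paired tuning settings (my assumption, not quoted from the post):
COOKIES_ENABLED = False     # skip cookie processing when the site doesn't need it
RETRY_ENABLED = False       # don't retry failed pages
DOWNLOAD_TIMEOUT = 10       # give up on very slow responses quickly
```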
Abstract: # create the project scrapy startproject demo # generate a spider cd demo scrapy genspider first www.baidu.com # in settings.py set ROBOTSTXT_OBEY = False USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/...
Read full post
Abstract: # simulated login to renren.com import requests import urllib from lxml import etree # get a Session object session = requests.Session() # download the captcha image headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit...
Read full post
Abstract: You need to download the webdriver that matches your Chrome browser version: http://chromedriver.storage.googleapis.com/index.html
Read full post
Abstract: import requests from lxml import etree url='https://bj.58.com/shunyi/ershoufang/?PGTID=0d30000c-0047-6aa6-0218-69d1ed59a77b&ClickID=3' headers = {'User-…
Read full post
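The requests + lxml pattern in this post boils down to: fetch the HTML, build an etree, run XPath over it. A sketch with the HTML inlined so it runs without the network (in the post, the string would come from requests.get(url, headers=headers).text; the markup and listing titles below are made up for illustration):

```python
from lxml import etree

# In the real spider this string comes from requests.get(url, headers=headers).text
html = """
<ul>
  <li class="house"><a href="/item/1">Two-bedroom, Shunyi</a></li>
  <li class="house"><a href="/item/2">Three-bedroom, Shunyi</a></li>
</ul>
"""

tree = etree.HTML(html)                                  # parse into an element tree
titles = tree.xpath('//li[@class="house"]/a/text()')     # listing titles
links = tree.xpath('//li[@class="house"]/a/@href')       # detail-page links
print(titles)  # ['Two-bedroom, Shunyi', 'Three-bedroom, Shunyi']
```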
Abstract: # fetch Baidu Translate suggestion results import requests url = 'https://fanyi.baidu.com/sug' wd = input('enter a word:') data = { 'kw':wd } response = requests.post(url=url,data=data) print(response.json()) # response.text: string #...
Read full post
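The key step in this post is sending form-encoded POST data. A network-free sketch of what requests.post(url, data={'kw': wd}) puts on the wire (the helper name is mine, not from the post):

```python
from urllib.parse import urlencode

def build_sug_request(word):
    """Build the URL and form-encoded body that requests.post(url, data=...)
    would send to Baidu Translate's suggestion endpoint."""
    url = "https://fanyi.baidu.com/sug"
    body = urlencode({"kw": word})   # e.g. "kw=spider"
    return url, body

url, body = build_sug_request("spider")
print(url, body)

# With requests installed and a network connection, the actual call is:
#   import requests
#   resp = requests.post(url, data={"kw": "spider"})
#   print(resp.json())   # the post parses the JSON response this way
```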
