Crawlers - Post Category - liuxianglong

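crawler-studio: an open-source crawler monitoring platform

Summary: Project URL and installation instructions: https://github.com/crawler-studio/crawler-studio Introduction: Crawler-Studio, built on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django, DRF, and Vue.js, is an open-source distributed S Read more

posted @ 2022-11-23 10:04 liuxianglong Views(412) Comments(0) Recommended(0)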

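Installing frida on Mac (using the NetEase MuMu emulator)

Summary: 1. Install frida. Local environment: Mac or Win10 (AMD64), both work the same way; Python 3.6.4. pip install frida If it fails with: ERROR: Command errored out with exit status 1 Fix: install the Wordcloud.whl file; download URL: htt Read more

posted @ 2020-05-27 15:25 liuxianglong Views(2279) Comments(0) Recommended(0)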

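What encoding starts with \\u followed by two characters?

Summary: I recently ran into an odd encoding, shown below: It is actually a kind of binary code. I received it as a string, and it has to be processed as follows before it displays correctly: words = [word.replace('u', '') for word in str.split('\\')] words.remove('') transfer = Read more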
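
The processing described above can be sketched as follows. This is a minimal sketch, assuming the input is a string of `\u`-prefixed two-digit hex pairs that encode UTF-8 bytes; the helper name `decode_u_hex_pairs` and the sample input are hypothetical, since the original example and the rest of the code are cut off in this summary.

```python
def decode_u_hex_pairs(text: str, encoding: str = "utf-8") -> str:
    """Decode a string made of backslash-u-prefixed hex byte pairs."""
    # Split on the backslash, drop empty pieces, and strip the leading 'u'.
    pairs = [piece[1:] for piece in text.split("\\") if piece]
    # Join the hex pairs into raw bytes, then decode them as text.
    return bytes.fromhex("".join(pairs)).decode(encoding)


sample = r"\ue4\ub8\uad\ue6\u96\u87"  # hypothetical input, not from the post
print(decode_u_hex_pairs(sample))
```

If the bytes were produced with a different encoding (GBK, for instance), passing that name as `encoding` would be the only change needed.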

posted @ 2020-05-27 11:13 liuxianglong Views(4476) Comments(0) Recommended(0)

es: settings_exception error when creating a mapping

Summary: After a long search, I finally found that mapping needs a trailing s (mappings). Read more
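
As an illustration, an index-creation body with the plural key might look like the sketch below. The field names, the shard setting, and the `_doc` type name (ES 6.x style) are assumptions made for the example, not taken from the post.

```python
import json

# Hypothetical index body; the field names and the "_doc" type are assumptions.
index_body = {
    "settings": {"number_of_shards": 1},
    "mappings": {  # plural: "mappings", not "mapping"
        "_doc": {
            "properties": {
                "title": {"type": "text"},
                "posted_at": {"type": "date"},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```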

posted @ 2020-04-15 17:50 liuxianglong Views(239) Comments(0) Recommended(0)

How to install chromedriver

Summary: https://cuiqingcai.com/5135.html?tdsourcetag=s_pctim_aiomsg Read more

posted @ 2020-04-15 15:01 liuxianglong Views(365) Comments(0) Recommended(0)

raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))

Summary: es cannot insert a scrapy item type directly; it only accepts a dict, so convert the item with dict(item) first. Read more
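
The conversion can be illustrated without scrapy or es installed. In this sketch, `FakeItem` is a hypothetical stand-in for a scrapy item (dict-like, but not a `dict`), and `json.dumps` stands in for the client's serializer; both are assumptions chosen to keep the example self-contained.

```python
import json
from collections.abc import MutableMapping


class FakeItem(MutableMapping):
    """Hypothetical stand-in for a scrapy Item: dict-like, but not a dict."""

    def __init__(self, **fields):
        self._values = dict(fields)

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value

    def __delitem__(self, key):
        del self._values[key]

    def __iter__(self):
        return iter(self._values)

    def __len__(self):
        return len(self._values)


item = FakeItem(title="example", views=935)

try:
    json.dumps(item)  # a plain serializer rejects the dict-like object
except TypeError as exc:
    print("serialization failed:", exc)

document = dict(item)  # convert to a real dict before indexing
print(json.dumps(document))
```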

posted @ 2020-03-15 00:20 liuxianglong Views(935) Comments(0) Recommended(1)

scrapy-redis settings

Summary: SCHEDULER = 'scrapy_redis.scheduler.Scheduler' DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter' REDIS_HOST = 'xxxx' REDIS_PORT = xxxx REDIS_ Read more

posted @ 2020-02-28 20:36 liuxianglong Views(293) Comments(0) Recommended(0)

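es 6.x scroll usage

Summary: We can use from + size to fetch all the data, but with large data volumes this is very expensive; in that case, use the scroll operation instead. 1. First, send a scroll POST request with the parameter scroll=1m (1m means 1 minute): POST /twitter/_search?scroll=1m { Read more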
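
The scroll loop can be sketched as below. This is a minimal sketch, not the post's code: `post` is a hypothetical transport hook (any function that sends a JSON body to a path and returns the parsed response), and `fake_post` is an in-memory stand-in so the example runs without a cluster.

```python
from typing import Callable, Iterator

# post(path, body) -> parsed JSON response; a hypothetical transport hook.
Post = Callable[[str, dict], dict]


def scroll_hits(post: Post, index: str, query: dict, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit of a search by following the scroll cursor."""
    # Step 1: open the scroll context with the initial search request.
    page = post(f"/{index}/_search?scroll={keep_alive}", query)
    while page["hits"]["hits"]:
        yield from page["hits"]["hits"]
        # Step 2: ask for the next batch using the latest scroll id.
        page = post(
            "/_search/scroll",
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )


def fake_post(path: str, body: dict) -> dict:
    """In-memory stand-in for an Elasticsearch cluster, for demonstration."""
    if path.startswith("/twitter/_search"):
        return {"_scroll_id": "c1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}}
    if body["scroll_id"] == "c1":
        return {"_scroll_id": "c2", "hits": {"hits": [{"_id": "3"}]}}
    return {"_scroll_id": "c2", "hits": {"hits": []}}


ids = [hit["_id"] for hit in scroll_hits(fake_post, "twitter", {"size": 2})]
print(ids)
```

The loop stops when a batch comes back empty, and it always sends the most recent `_scroll_id`, since the id may change between batches.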

posted @ 2020-02-28 15:16 liuxianglong Views(1589) Comments(0) Recommended(0)

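scrapy-redis raises invalid literal for int() with base 10:

Summary: The REDIS_URL I set in scrapy settings.py looked like this; the password contains special characters, so the connection to the redis server failed: REDIS_URL = 'redis://:^*,dfdas.*,@192.168.10.34:6379/1' Some people online suggest encoding the password first and decoding it when connecting, Read more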
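
The encode-then-decode idea mentioned above can be sketched with the standard library alone. The password below is hypothetical, and whether a given Redis client decodes percent-encoded credentials depends on the client and its version, so this only shows that the encoded URL parses cleanly and that the password can be recovered.

```python
from urllib.parse import quote, unquote, urlsplit

password = "p#ss/w:rd@1"  # hypothetical password, not the one from the post

# Percent-encode every reserved character so the URL structure stays intact.
redis_url = f"redis://:{quote(password, safe='')}@192.168.10.34:6379/1"

parts = urlsplit(redis_url)
print(redis_url)
print(parts.hostname, parts.port)
print(unquote(parts.password) == password)
```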

posted @ 2020-02-22 23:14 liuxianglong Views(359) Comments(0) Recommended(0)

scrapyd configuration file

Summary: Configuration file Scrapyd searches for configuration files in the following locations, and parses them in order with the latest one taking more prior Read more

posted @ 2020-01-27 15:50 liuxianglong Views(1413) Comments(0) Recommended(0)

scrapy: proxy with authentication

Summary: The official method: from w3lib.http import basic_auth_header class CustomProxyMiddleware(object): def process_request(self, request, spider): request.meta['proxy'] Read more
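
A self-contained sketch of such a middleware follows. It is not the post's full code, which is cut off above: the helper `proxy_auth_header` is a hypothetical standard-library stand-in for `basic_auth_header`, the proxy URL and credentials are placeholders, and a `SimpleNamespace` stands in for a scrapy request so the example runs without scrapy installed.

```python
import base64
from types import SimpleNamespace


def proxy_auth_header(username: str, password: str) -> bytes:
    """Build a Basic credential value for the Proxy-Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return b"Basic " + token


class CustomProxyMiddleware:
    # Hypothetical proxy endpoint and credentials; replace with real values.
    proxy_url = "http://proxy.example.com:8080"
    username = "user"
    password = "pass"

    def process_request(self, request, spider):
        # Route the request through the proxy and attach the credentials.
        request.meta["proxy"] = self.proxy_url
        request.headers["Proxy-Authorization"] = proxy_auth_header(
            self.username, self.password
        )


# Duck-typed stand-in for a scrapy Request, so the sketch runs without scrapy.
request = SimpleNamespace(meta={}, headers={})
CustomProxyMiddleware().process_request(request, spider=None)
print(request.meta["proxy"], request.headers["Proxy-Authorization"])
```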

posted @ 2020-01-17 18:09 liuxianglong Views(449) Comments(0) Recommended(0)

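Why does scrapy use yield item rather than yield dict to pass data?

Summary: In practice, yield dict works just as well as yield item, so why does the official documentation use yield item? Here is the official explanation: The main goal in scraping is to extract structured data from unstructured sources, typi Read more

posted @ 2020-01-08 20:18 liuxianglong Views(1181) Comments(0) Recommended(0)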

Scrapy payload request returns a 400 error

Summary: First, the format for sending a payload request with Scrapy is as follows: def start_requests(self): querystr = { "ctoken": "U-ang1zmpP6c3VO4", "sceneKey": "DEFAULT", "pdKey": "P_ECTBILL_QUOTATION1", Read more
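
The rest of the post is cut off, so the author's fix is not shown here. As a hedged sketch of one common way to build a JSON payload request, the helper below (hypothetical name `json_post_kwargs`, hypothetical endpoint) serializes the payload with `json.dumps` and sets a JSON content type; the resulting keyword arguments match parameters that `scrapy.Request` accepts.

```python
import json


def json_post_kwargs(url: str, payload: dict) -> dict:
    """Build keyword arguments for a JSON-body POST request."""
    return {
        "url": url,
        "method": "POST",
        # Serialize the payload to a JSON string rather than form-encoding it.
        "body": json.dumps(payload),
        "headers": {"Content-Type": "application/json"},
    }


# Hypothetical endpoint; only two payload keys from the post are reproduced.
kwargs = json_post_kwargs(
    "https://example.com/api/query",
    {"sceneKey": "DEFAULT", "pdKey": "P_ECTBILL_QUOTATION1"},
)
print(kwargs["body"])
# Inside a spider: yield scrapy.Request(callback=self.parse, **kwargs)
```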

posted @ 2019-12-11 19:10 liuxianglong Views(467) Comments(0) Recommended(0)

Debugging XPath in Chrome

Summary: Read more

posted @ 2019-11-29 16:22 liuxianglong Views(211) Comments(0) Recommended(0)

CrawlerRunner produces no log output

Summary: Official logging documentation: https://docs.scrapy.org/en/latest/topics/logging.html#scrapy.utils.log.configure_logging One point that is easy to overlook: when you use the CrawlerProcess class, scrapy loads settings. Read more

posted @ 2019-11-26 16:45 liuxianglong Views(420) Comments(0) Recommended(0)

Calling scrapy from celery

Summary: My environment: celery 3.1.25, python 3.6.9, Windows 10. The celery tasks code is as follows, where QuotesSpider is the name of the spider class in my scrapy project: from celery_app import app from scrapy.crawler import Cra Read more

posted @ 2019-09-20 17:37 liuxianglong Views(1684) Comments(2) Recommended(0)

Installing celery on Windows: a guide to avoiding the pitfalls; this post is all you need

Summary: Read more

posted @ 2019-09-19 20:47 liuxianglong Views(1044) Comments(0) Recommended(0)

How to install ElasticSearch on Windows

Summary: 1. https://www.oracle.com/technetwork/java/javase/downloads/jdk12-downloads-5295953.html Download and install JDK 12 from this page; a newer version may be available. 2. https://www.elastic.co/cn/downloads/pa Read more

posted @ 2019-08-07 16:17 liuxianglong Views(305) Comments(0) Recommended(0)

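Docker on Win7: installer files and installation issues

Summary: I have been playing with crawlers lately and needed docker, but for Win7 the official site only supports Docker Toolbox. It took me a long time to find the installer on the official site, so I uploaded it to Baidu Netdisk for everyone's convenience. Download link: https://pan.baidu.com/s/1kB1yM2pjLakA61x80RX8sg Extraction code: q0dx After installation, open the desktop's D Read more

posted @ 2019-06-28 11:54 liuxianglong Views(906) Comments(0) Recommended(0)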

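Fix for the pyspider web preview pane being too small on Windows

Summary: Environment: Windows 7 + chrome + pyspider. Fix: the web preview pane is too small because the CSS height property of the page element is overridden with 60px. Location of the CSS file: C:\Users\Administrator\AppData\Local\Programs\Python\Python37\L Read more

posted @ 2019-06-28 11:18 liuxianglong Views(987) Comments(0) Recommended(0)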
