第一种报错:exceptions.AttributeError: type object 'Scheduler' has no attribute 'from_crawler'
|
|
scheduler = self.scheduler_cls.from_crawler(self.crawler)
exceptions.AttributeError: type object 'Scheduler' has no attribute 'from_crawler'
|
第二种报错:ValueError: ("Failed to instantiate dupefilter class '%s': %s", 'scrapy.dupefilters.RFPDupeFilter', TypeError("__init__() got an unexpected keyword argument 'server'",
|
|
TypeError: __init__() got an unexpected keyword argument 'server'
|
前几天在用scrapy写一个抓取博客园的爬虫,测试发现抓取的url有大量的重复url.无奈,技术比较LOW,翻阅scrapy文档,该设置的也都设置了,还是没有解决掉url重复的问题。因此,只好用scrapy-redis来写了。以前没有编程底子,英语很LOW,尼玛,各种莫名的报错,要死的节奏,幸运的是问题还是解决了。记录下踩到的坑。至此,抓取博客园的scrapy-reids爬虫90多分钟抓取18万条数据,还算可以了,是4线程运行。
第一种详细报错:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
Traceback (most recent call last):
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\defer.py", line 1125, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\python\failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 87, in crawl
yield self.engine.close()
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 100, in close
return self._close_all_spiders()
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in _close_all_spiders
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 340, in <listcomp>
dfds = [self.close_spider(s, reason='shutdown') for s in self.open_spiders]
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 298, in close_spider
dfd = slot.close()
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 44, in close
self._maybe_fire_closing()
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\core\engine.py", line 51, in _maybe_fire_closing
self.heartbeat.stop()
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\twisted\internet\task.py", line 202, in stop
assert self.running, ("Tried to stop a LoopingCall that was "
AssertionError: Tried to stop a LoopingCall that was not running.
Process finished with exit code 0
|
坑史:对于这个报错,有点摸不着头脑,我试了N遍,包括scrapy-redis案例,自己写测试,还有一些github上的案例。尼玛,就是放我MAC上报这个错。最后scrapy 版本降级后解决了这个问题。现在的版本是:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
|
Scrapy 1.0.5 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
|
资料来源:
http://stackoverflow.com/questions/14154700/why-the-scrapy-redis-not-work
https://github.com/LiuXingMing/SinaSpider/issues/8
第二种详细报错:
1
2
3
4
5
6
7
8
9
10
11
12
13
|
<code>builtins.AssertionError: Tried to stop a LoopingCall that was not running.
2016-07-17 01:04:50 [twisted] CRITICAL:
Traceback (most recent call last):
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy_redis\scheduler.py", line 120, in open
debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
TypeError: __init__() got an unexpected keyword argument 'server'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\dell\AppData\Local\Programs\Python\Python35-32\lib\site-packages\scrapy\crawler.py", line 74, in crawl
yield self.engine.open_spider(self.spider, start_requests)
<span style="color: #ff0000;">ValueError: ("Failed to instantiate dupefilter class '%s': %s", 'scrapy.dupefilters.RFPDupeFilter', TypeError("__init__() got an unexpected keyword argument 'server'",))</span></code>
|
解决方法:
检查后发现setting.py中缺少此设置:加上即可
|
|
<code>DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"</code>
|
我之前的setting.py设置是:
|
|
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_ORDER = 'BFO'
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
|
完整的setting.py设置:
|
|
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_ORDER = 'BFO'
SCHEDULER_PERSIST = True
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = None
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
|
参考来源:https://github.com/rolando/scrapy-redis/issues/59
PS:终极解决方法,google 一下。
新浪爬虫:
https://github.com/LiuXingMing/SinaSpider
http://blog.csdn.net/bone_ace/article/details/53379904
支持原创,盗版可耻,转载请注明出处!