Suppose the spiders folder contains several spider files, each with its own unique name:

name.py     name = 'name'

name1.py    name = 'name1'

name2.py    name = 'name2'

...
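
Each of those files is an ordinary Scrapy spider; only the unique name attribute matters for what follows. A minimal sketch of one such file (the class name, URL and parse logic are placeholders, not from the original post):

# name.py — minimal sketch; only the unique `name` attribute is significant
import scrapy

class NameSpider(scrapy.Spider):
    name = 'name'
    start_urls = ['http://example.com']  # placeholder URL

    def parse(self, response):
        yield {'url': response.url}      # placeholder parse logic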

Here, following the previous post http://www.cnblogs.com/chaihy/p/9044574.html,

the list to crawl is built from a conditional query. When querying you can set WHERE/LIMIT ranges such as the first 1000 rows, rows 1000-2000, rows 2000-3000, and so on, one range per spider file. Crawling those files at the same time is then effectively multi-process handling, as sketched below.
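
A rough sketch of that idea, in which each spider file hard-codes its own row slice so running them all at once covers the whole table concurrently. The database connection, table and column names below are assumptions, not from the original post:

# name1.py — sketch only: this spider takes rows 1000-2000;
# name.py and name2.py would use other LIMIT slices
import pymysql
import scrapy

class Name1Spider(scrapy.Spider):
    name = 'name1'

    def start_requests(self):
        # connection parameters and table/column names are illustrative
        conn = pymysql.connect(host='localhost', user='root',
                               password='', database='test')
        with conn.cursor() as cursor:
            cursor.execute("SELECT url FROM task_list LIMIT 1000, 1000")
            rows = cursor.fetchall()
        conn.close()
        for (url,) in rows:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'url': response.url}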

First, create a commands folder at the same level as the spiders folder (see the layout sketch below).

Inside the commands folder create two files:

        crawlall.py

        __init__.py (an empty file)
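
With those files in place, the project layout looks roughly like this (the project/package name is a placeholder):

project/
    scrapy.cfg
    project/
        __init__.py
        settings.py
        commands/
            __init__.py
            crawlall.py
        spiders/
            __init__.py
            name.py
            name1.py
            name2.py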

The content of crawlall.py is as follows (it picks up all of the spiders under the spiders folder):

from scrapy.commands import ScrapyCommand
from scrapy.crawler import CrawlerRunner
from scrapy.utils.conf import arglist_to_dict


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            # turn repeated -a NAME=VALUE options into a dict of spider arguments
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            pass
            # raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # settings = get_project_settings()
        spider_loader = self.crawler_process.spider_loader
        # schedule every spider in the project (or only the names passed on the command line)
        for spidername in args or spider_loader.list():
            print("*********crawlall spidername************" + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        # start all scheduled crawls in the same process
        self.crawler_process.start()

settings.py configuration:

COMMANDS_MODULE = 'project.commands'
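
Here 'project' must be replaced with your actual project package name (the folder that contains settings.py). For example, if the package is called myproject (a placeholder name):

# settings.py
COMMANDS_MODULE = 'myproject.commands'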

Run the command:

scrapy crawlall
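
With no arguments it runs every spider returned by the spider loader; passing spider names or -a arguments should also work (the argument names below are illustrative):

scrapy crawlall                          # run all spiders under spiders/
scrapy crawlall name name2               # run only the listed spiders
scrapy crawlall -a start=0 -a end=1000   # pass the same arguments to every spider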