
Tutorial

  1. You can do things like setting a download delay between each request, limiting the number of concurrent requests per domain or per IP, and even using an auto-throttling extension that tries to figure out these values automatically (see the settings sketch after this list).
  2. The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback (a minimal spider illustrating this follows the list).
  3. As a shortcut for creating Request objects you can use response.follow. Unlike scrapy.Request, response.follow supports relative URLs directly - no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request (sketched after the list).
  4. Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured with the DUPEFILTER_CLASS setting (a short snippet follows the list).
  5. As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class: a generic spider that implements a small rules engine on top of which you can write your crawlers (a sketch follows the list).
  6. Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks (see the cb_kwargs sketch after the list).
  7. Using spider arguments

    You can provide command line arguments to your spiders by using the -a option when running them:

    scrapy crawl quotes -o quotes-humor.json -a tag=humor

    These arguments become attributes of the spider, so inside the spider you can read them with:

    tag = getattr(self, 'tag', None)

    You can learn more about handling spider arguments in the Scrapy documentation; a sketch of how the tag argument can drive the start URL follows the list.
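
For note 1, a minimal settings.py sketch of those politeness knobs. The setting names are standard Scrapy settings; the values are only illustrative, not recommendations.

    # settings.py - illustrative values only
    DOWNLOAD_DELAY = 2                    # seconds to wait between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap on parallel requests per domain
    CONCURRENT_REQUESTS_PER_IP = 0        # if non-zero, this cap is used instead, per IP

    # Let the AutoThrottle extension adjust the delay from observed latencies.
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 5
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0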
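
For note 2, a sketch of a spider that never assigns parse() to a request: the built-in start_requests() creates requests without a callback, so Scrapy routes every response to parse(). The CSS selectors assume the quotes.toscrape.com markup used in the tutorial.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        # The default start_requests() turns these URLs into Requests
        # without assigning a callback...
        start_urls = ["http://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            # ...so each response lands here, in the default callback.
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}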
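
For note 3, the same sketch extended with pagination via response.follow, again assuming the tutorial site's markup.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").get()}

            # The "next" href is relative (e.g. /page/2/); response.follow
            # resolves it against response.url, so urljoin() is not needed.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                # follow() only builds the Request - it still has to be yielded.
                yield response.follow(next_page, callback=self.parse)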
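
For note 4, the snippet below only spells out Scrapy's default duplicate filter so the setting is visible; the dont_filter flag on an individual Request is the per-request escape hatch.

    # settings.py - this is already the default; swap in your own class to change
    # how already-seen URLs are detected.
    DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"

    # Inside a spider, a single request can opt out of the duplicate check:
    #     yield scrapy.Request(url, callback=self.parse, dont_filter=True)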
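
For note 5, a CrawlSpider sketch: the rules engine follows pagination links and hands every author page to a callback. The spider name and selectors are my own assumptions based on the tutorial site, not part of the original notes.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class AuthorSpider(CrawlSpider):
        name = "authors_crawl"
        start_urls = ["http://quotes.toscrape.com/"]

        rules = (
            # Follow pagination links; no callback needed for these pages.
            Rule(LinkExtractor(restrict_css="li.next")),
            # Send every author page found along the way to parse_author().
            Rule(LinkExtractor(restrict_css=".author + a"), callback="parse_author"),
        )

        def parse_author(self, response):
            yield {
                "name": response.css("h3.author-title::text").get(default="").strip(),
                "born": response.css(".author-born-date::text").get(),
            }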
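
For note 6, a sketch of the "pass additional data to the callbacks" pattern using cb_kwargs (available in recent Scrapy versions): the quote data collected on the listing page travels with the request to the author page, where the item is completed. Selectors again assume the tutorial site.

    import scrapy

    class QuotesWithAuthorSpider(scrapy.Spider):
        name = "quotes_with_author"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            for quote in response.css("div.quote"):
                item = {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
                author_href = quote.css(".author + a::attr(href)").get()
                # Carry the half-built item along to the next callback.
                yield response.follow(
                    author_href, callback=self.parse_author, cb_kwargs={"item": item}
                )

        def parse_author(self, response, item):
            # Finish the item with data from the second page, then emit it.
            item["born"] = response.css(".author-born-date::text").get()
            yield item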
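
For note 7, the two fragments above in context: a sketch where the tag argument, if supplied with -a, narrows the crawl to one tag page.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"

        def start_requests(self):
            url = "http://quotes.toscrape.com/"
            # `scrapy crawl quotes -a tag=humor` makes self.tag == "humor".
            tag = getattr(self, "tag", None)
            if tag is not None:
                url = url + "tag/" + tag + "/"
            yield scrapy.Request(url, callback=self.parse)

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }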

Command line tool

 
