scrapy shell

shell

  • Syntax: scrapy shell [url]
  • Requires project: no

Starts the Scrapy shell for the given URL (if given) or empty if no URL is given. Also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute file paths. See Scrapy shell for more info.

Supported options:

  • --spider=SPIDER: bypass spider autodetection and force use of specific spider
  • -c code: evaluate the code in the shell, print the result and exit
  • --no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.

Usage example:

1. Basic shell usage

scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://jigsaw.w3.org/HTTP/300/301.html"

-s: overrides a setting; ROBOTSTXT_OBEY=False tells Scrapy not to obey robots.txt

--no-redirect: do not follow redirects

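2. When the URL carries query parameters, it must be wrapped in quotes; otherwise everything after the first & is cut off (compare the two crawl logs below)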

scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery"
#2018-09-10 15:45:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery> (referer: None)


scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery
#2018-09-10 15:55:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.toutiao.com/search_content/?offset=20> (referer: None)
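
The truncation happens because an unquoted & is a control operator in POSIX shells: the command is cut at the first & and that part is run in the background. When the command line is assembled by a script rather than typed by hand, a minimal sketch like the one below shows one way to quote the URL. It uses only the Python standard library; the helper name build_shell_command is hypothetical and not part of Scrapy, and the URL is a shortened version of the one above.

```python
import shlex


def build_shell_command(url, settings=None, no_redirect=False):
    """Return a POSIX-shell-safe `scrapy shell` command line for `url`.

    Hypothetical helper for illustration: it only builds a string and
    does not run Scrapy.
    """
    parts = ["scrapy", "shell"]
    for name, value in (settings or {}).items():
        parts += ["-s", f"{name}={value}"]
    if no_redirect:
        parts.append("--no-redirect")
    parts.append(url)
    # shlex.quote wraps any argument that contains shell metacharacters
    # (such as & or ?) in single quotes so the shell passes it through intact.
    return " ".join(shlex.quote(part) for part in parts)


url = "https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D"
command = build_shell_command(url, {"ROBOTSTXT_OBEY": "False"}, no_redirect=True)
print(command)
```

Splitting the printed command again with shlex.split gives back the full URL as a single argument, which is what the quoted form of the command achieves by hand.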

3. When the shell encounters a redirect

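scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://jigsaw.w3.org/HTTP/300/301.html"
# Inside the shell, fetch() can also skip redirects with redirect=False:
#fetch('https://www.toutiao.com/group/6599226921567912452/',redirect=False)
# Follow the redirect manually using the Location header of the 3xx response:
fetch(response.headers['Location'])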
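
A Location header is not always an absolute URL, and Scrapy exposes header values as bytes, so it can help to normalise the value before fetching it. The following is a minimal sketch using only the standard library; resolve_location is a hypothetical helper name, and the header value b"Overview.html" is an assumed example, not what the server above necessarily returns.

```python
from urllib.parse import urljoin


def resolve_location(base_url, location):
    """Turn a Location header value into an absolute URL string."""
    if isinstance(location, bytes):
        # Assumption: the header bytes decode cleanly as latin-1.
        location = location.decode("latin-1")
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against the URL of the response that carried the header.
    return urljoin(base_url, location)


print(resolve_location("https://jigsaw.w3.org/HTTP/300/301.html", b"Overview.html"))
# prints https://jigsaw.w3.org/HTTP/300/Overview.html
```

Inside the shell this could be used as fetch(resolve_location(response.url, response.headers['Location'])).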

4. Adding a User-Agent

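# Option 1: set the User-Agent for the whole session on the command line
scrapy shell -s USER_AGENT='Mozilla/5.0'
# Option 2: inside the shell, pass headers to fetch()
fetch('http://www.baidu.com', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'})
# Option 3: build a Request yourself (the URL must include a scheme such as http://)
from scrapy import Request
req = Request('http://yoururl.com', headers={"header1": "value1"})
fetch(req)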

5. Clearing the shell screen

import os
os.system("clear")
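
The clear command exists on Linux and macOS, while the Windows console uses cls instead. A small sketch that picks the command by platform might look like this; clear_command and clear_screen are hypothetical helper names, and the choice assumes that every non-Windows platform provides clear.

```python
import os


def clear_command():
    """Return the console clear command for the current platform."""
    # os.name is "nt" on Windows; other platforms are assumed to have `clear`.
    return "cls" if os.name == "nt" else "clear"


def clear_screen():
    """Clear the terminal by running the platform's clear command."""
    os.system(clear_command())
```

Calling clear_screen() from inside the Scrapy shell then behaves like the two lines above on Unix-like systems.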

 

posted @ 2018-09-10 15:59  逐梦客!