scrapy shell

shell

Syntax: scrapy shell [url]
Requires project: no

Starts the Scrapy shell for the given URL, or an empty shell if no URL is given. Also supports UNIX-style local file paths, either relative with ./ or ../ prefixes or absolute file paths. See Scrapy shell for more info.

Supported options:

--spider=SPIDER: bypass spider autodetection and force use of specific spider
-c code: evaluate the code in the shell, print the result and exit
--no-redirect: do not follow HTTP 3xx redirects (default is to follow them); this only affects the URL you may pass as argument on the command line; once you are inside the shell, fetch(url) will still follow HTTP redirects by default.

Usage example:

1. Basic shell usage

scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://jigsaw.w3.org/HTTP/300/301.html"

-s: overrides a setting; ROBOTSTXT_OBEY=False tells Scrapy not to obey robots.txt

--no-redirect: do not follow redirects

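2. When the URL contains query parameters, it must be wrapped in quotes; otherwise the system shell treats & as a command separator and every parameter after the first & is cut off (compare the two log lines below)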

scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery"
#2018-09-10 15:45:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery> (referer: None)


scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect https://www.toutiao.com/search_content/?offset=20&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery
#2018-09-10 15:55:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.toutiao.com/search_content/?offset=20> (referer: None)
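
When the command is assembled from a script rather than typed by hand, the quoting can be left to Python's standard library. A minimal sketch, assuming the command will be run by a POSIX shell; the variable names are illustrative and not part of Scrapy:

```python
import shlex

# The URL from the example above; it contains "&", which an unquoted
# POSIX shell would treat as a command separator.
url = ("https://www.toutiao.com/search_content/?offset=20&format=json"
       "&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=3&from=gallery")

# shlex.quote wraps the URL in single quotes because it contains unsafe characters.
command = "scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect " + shlex.quote(url)
print(command)

# Splitting the command the way a POSIX shell would gives the URL back intact.
assert shlex.split(command)[-1] == url
```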

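3. Handling a redirect in the shell

# On the command line: start the shell without following the redirect
scrapy shell -s ROBOTSTXT_OBEY=False --no-redirect "https://jigsaw.w3.org/HTTP/300/301.html"
# Inside the shell, redirects can also be disabled per request:
#fetch('https://www.toutiao.com/group/6599226921567912452/',redirect=False)
# Follow the redirect manually. Header values are bytes, so decode first;
# urljoin resolves a relative Location against the current URL.
fetch(response.urljoin(response.headers['Location'].decode()))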
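
The manual step, reading Location and turning it into a URL that fetch() accepts, can be sketched with the standard library alone. This is an illustration under assumptions: resolve_location is a hypothetical helper, not a Scrapy function, and the Location values below are made up for the example rather than taken from a real server response. It assumes the header value arrives as bytes, which is how Scrapy's response.headers returns it.

```python
from urllib.parse import urljoin


def resolve_location(base_url: str, location: bytes) -> str:
    """Decode a Location header value and resolve it against the request URL."""
    return urljoin(base_url, location.decode("utf-8"))


base = "https://jigsaw.w3.org/HTTP/300/301.html"
# Relative redirect target (made-up value)
print(resolve_location(base, b"/HTTP/300/Overview.html"))
# Absolute redirect target (made-up value)
print(resolve_location(base, b"https://example.com/new"))
```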

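4. Adding a User-Agent

# On the command line: set the User-Agent for the whole shell session
scrapy shell -s USER_AGENT='Mozilla/5.0'

# Inside the shell: pass headers for a single fetch() call
fetch('http://www.baidu.com', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36'})

# Inside the shell: build a Request with custom headers and fetch it
from scrapy import Request
req = Request('http://yoururl.com', headers={"header1": "value1"})  # Request URLs need a scheme
fetch(req)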
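
Request expects an absolute URL with a scheme, so a bare host name such as yoururl.com is rejected when the Request is built. A small sketch of a guard, assuming http is an acceptable default; with_scheme is a hypothetical helper, not part of Scrapy:

```python
def with_scheme(url: str, default: str = "http") -> str:
    """Return url unchanged if it already has a scheme, else prepend one."""
    return url if "://" in url else f"{default}://{url}"


print(with_scheme("yoururl.com"))
print(with_scheme("https://yoururl.com"))
```

Inside the shell, the result could then be passed on, for example as Request(with_scheme('yoururl.com'), headers={"header1": "value1"}).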

5. Clearing the shell screen

import os
os.system("clear")
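
The clear command exists on Unix-like systems only; on Windows the equivalent is cls. A sketch of a portable variant, where clear_command and clear_screen are hypothetical helper names:

```python
import os


def clear_command() -> str:
    """Pick the terminal clear command for the current platform."""
    return "cls" if os.name == "nt" else "clear"


def clear_screen() -> int:
    """Clear the terminal and return the command's exit status."""
    return os.system(clear_command())
```

Calling clear_screen() inside the shell then clears the screen on either platform.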

posted @ 2018-09-10 15:59 逐梦客！
