Scrapy Distributed Crawler for Building a Search Engine (imooc) -- Crawling Zhihu (Part 2)

Simulating a Zhihu login with Scrapy

Have Scrapy generate the zhihu.py file automatically from the command line

  • First, change into the project directory


  • Then activate the virtual environment


  • Create zhihu.py with the genspider command (the skeleton it generates is sketched below):
scrapy genspider zhihu www.zhihu.com

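For reference, genspider with the default basic template produces roughly this skeleton (a sketch; the exact output depends on your Scrapy version):

# -*- coding: utf-8 -*-
import scrapy


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    def parse(self, response):
        pass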

 

  • Create a main.py file so that the spider can be run and debugged from the IDE:
# coding: utf-8

from scrapy.cmdline import execute  # execute() runs Scrapy command-line commands from a script

import sys
import os

# os.path.abspath(__file__) is the absolute path of this file (main.py);
# os.path.dirname() strips the file name, leaving its parent directory.
# Appending that directory to sys.path lets Python locate the project.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# equivalent to running "scrapy crawl zhihu" from the command line
execute(["scrapy", "crawl", "zhihu"])
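To make the path handling concrete, here is a tiny standalone sketch; the /home/user/ArticleSpider location is only a hypothetical example of where main.py might live:

import os

main_py = "/home/user/ArticleSpider/main.py"  # hypothetical location of main.py

print(os.path.abspath(main_py))                   # /home/user/ArticleSpider/main.py
print(os.path.dirname(os.path.abspath(main_py)))  # /home/user/ArticleSpider  (the project root)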

 

  • Before running main.py for debugging, update settings.py (tell the spider not to obey the robots.txt protocol, otherwise many URLs would be filtered out):
ROBOTSTXT_OBEY = False

 

  • Note: by default the regex dot does not match newline characters, so re.match can only succeed within the first line of the HTML; pass re.DOTALL so the dot matches newlines as well and the pattern can span the whole response:
match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
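A minimal illustration of the difference, using a made-up HTML snippet rather than a real Zhihu response:

import re

html = '<html>\n<input type="hidden" name="_xsrf" value="abc123"/>\n</html>'

# without re.DOTALL, '.' stops at the first newline, so nothing matches
print(re.match('.*name="_xsrf" value="(.*?)"', html))  # None

# with re.DOTALL, '.' also matches '\n', so the pattern spans every line
print(re.match('.*name="_xsrf" value="(.*?)"', html, re.DOTALL).group(1))  # abc123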

 

  • The final zhihu.py code:
# -*- coding: utf-8 -*-
import scrapy
import re
import json


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    }

    def parse(self, response):
        pass

    def start_requests(self):
        # request the login page first; the response is handled by login()
        return [scrapy.Request('https://www.zhihu.com/signup?next=%2F', callback=self.login, headers=self.headers)]

    def login(self, response):
        # extract the _xsrf token from the login page
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = match_obj.group(1)
        else:
            return

        if xsrf:
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "your_phone_number",  # use your own credentials here
                "password": "your_password"
            }

            return [scrapy.FormRequest(
                url=post_url,
                formdata=post_data,
                headers=self.headers,
                callback=self.check_login  # pass the function itself; adding () would call it immediately
            )]

    def check_login(self, response):
        # check the JSON returned by the server to see whether login succeeded
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登陆成功":  # "登陆成功" means "login successful"
            for url in self.start_urls:
                # make_requests_from_url() accepts no extra arguments, so build the Request directly
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)
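To see what check_login expects, here is the same check run against a made-up success payload (the actual shape of Zhihu's response may differ):

import json

fake_response_text = '{"msg": "登陆成功"}'  # fabricated example payload, not a captured response

text_json = json.loads(fake_response_text)
if "msg" in text_json and text_json["msg"] == "登陆成功":
    print("login succeeded; re-request start_urls with the logged-in session")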

 

 
posted @ 2018-01-21 14:20  迟暮有话说