Building a Search Engine with a Scrapy Distributed Crawler (慕课网 / imooc): Crawling Zhihu (Part 1)
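Section 1: How sessions and cookies work
Differences between a session and a cookie
- A cookie is the browser's local storage mechanism; it stores data as key-value pairs.
- HTTP is a stateless protocol: the server answers each request as it arrives and does not track who sent it (a stateless request).
- Stateful requests: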
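A stateful request can be sketched roughly as follows. This is a minimal offline illustration, not how any particular server implements it: the server issues a session ID in a Set-Cookie header, the browser sends it back in a Cookie header, and the server uses it to look up who is calling. The cookie name `sessionid`, the in-memory `SESSIONS` store, and both helper functions are assumptions made for this example.

```python
from http.cookies import SimpleCookie
import uuid

# Hypothetical server-side session store: session ID -> user data.
SESSIONS = {}

def server_login(username):
    """Create a session and return the value of the Set-Cookie header."""
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"user": username}
    cookie = SimpleCookie()
    cookie["sessionid"] = session_id
    return cookie["sessionid"].OutputString()

def server_handle(cookie_header):
    """Identify the caller from the Cookie header, if one was sent."""
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    morsel = cookie.get("sessionid")
    if morsel is None or morsel.value not in SESSIONS:
        return "anonymous"
    return SESSIONS[morsel.value]["user"]

# The browser stores the key-value pair and sends it back on later requests.
browser_cookie = server_login("alice").split(";")[0]

print(server_handle(None))            # anonymous (no cookie: stateless)
print(server_handle(browser_cookie))  # alice (cookie carries the session ID)
```

The cookie itself only holds the key; the state lives on the server, which is the practical difference between the two.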
Section 2:
- Status codes:
- zhihu_login_requests.py, source listing 1:

    # coding: utf-8

    import re
    import requests
    try:
        import cookielib  # Python 2
    except ImportError:
        import http.cookiejar as cookielib  # Python 3

    def get_xsrf():
        # First attempt: fetch the home page and print it to look for the _xsrf token.
        response = requests.get("https://www.zhihu.com")
        print(response.text)
        return ""

    def zhihu_login(account, password):
        # Log in to Zhihu
        if re.match(r"^1\d{10}", account):
            print("手机号码登陆")  # "logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": "",
                "phone_num": account,
                "password": password
            }

    get_xsrf()
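- Result: the server returns a 500 error, because the request carries the client library's default headers rather than the headers a browser would send.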

    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    <html><body><h1>500 Server Error</h1>
    An internal server error occured.
    </body></html>
- Fix for the 500 error: add request headers

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }
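To see why the default headers give the client away, it helps to compare what an HTTP library sends on its own with what is sent once a header dict is supplied. The sketch below uses the standard library's `urllib` instead of `requests` so that it runs offline and without third-party packages; `requests` behaves analogously and identifies itself as `python-requests/<version>` by default. The `browser_agent` value is a placeholder, not a real browser string.

```python
import urllib.request

# Headers that urllib's opener adds by itself when the caller sets none.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # begins with "Python-urllib/"

# Placeholder: paste the User-Agent string of a real browser here.
browser_agent = "Mozilla/5.0 (placeholder browser string)"
header = {
    "Referer": "https://www.zhihu.com",
    "User-Agent": browser_agent,
}

# Build the request object without sending it, then inspect its headers.
# urllib normalizes header names with str.capitalize(), hence "User-agent".
request = urllib.request.Request("https://www.zhihu.com", headers=header)
print(request.get_header("User-agent"))
print(request.get_header("Referer"))
```

A server that rejects obviously scripted clients only needs to look at the first value, which is why overriding `User-Agent` is usually the first thing to try.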
- Connect through a session (note: response.text must be encoded as UTF-8 before it is written to the file)

    # coding: utf-8

    import re
    import requests
    try:
        import cookielib  # Python 2
    except ImportError:
        import http.cookiejar as cookielib  # Python 3

    # Keep cookies in a session and persist them to cookie.txt between runs.
    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    try:
        session.cookies.load(ignore_discard=True)
    except IOError:
        print("cookie未能加载")  # "could not load cookies", e.g. on the first run

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    def get_xsrf():
        # Fetch the home page and extract the _xsrf token from it.
        response = session.get("https://www.zhihu.com", headers=header)
        # re.search scans the whole page, so the token is found even when
        # it is not on the first line of the HTML.
        match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
        if match_obj:
            return match_obj.group(1)  # return the token so zhihu_login can send it
        return ""

    def get_index():
        # Save the home page fetched through the session as index_page.html.
        response = session.get("https://www.zhihu.com", headers=header)
        with open("index_page.html", "wb") as f:
            f.write(response.text.encode("utf-8"))
        print("ok")

    def zhihu_login(account, password):
        # Log in to Zhihu
        if re.match(r"^1\d{10}", account):
            print("手机号码登陆")  # "logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": get_xsrf(),
                "phone_num": account,
                "password": password
            }
            response = session.post(post_url, data=post_data, headers=header)
            session.cookies.save()

    # Placeholder credentials: substitute your own 11-digit phone number and password.
    zhihu_login("10000000000", "your_password")
    get_index()
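The `_xsrf` extraction step is sensitive to how the regular expression is applied to multi-line HTML, so it is worth trying offline first. The sketch below assumes a simplified, hypothetical page (`SAMPLE_HTML`); the real page is much larger and may not contain this field at all. `extract_xsrf` is an illustrative helper, not part of the course code.

```python
import re

# Hypothetical, simplified page source used only for this demonstration.
SAMPLE_HTML = '''<html>
<body>
<form>
<input type="hidden" name="_xsrf" value="abc123"/>
</form>
</body>
</html>'''

PATTERN = 'name="_xsrf" value="(.*?)"'

# re.match anchors at the start of the text, and "." does not cross
# newlines by default, so the leading ".*" never gets past line one.
print(re.match(".*" + PATTERN, SAMPLE_HTML))  # None

# With re.DOTALL, "." also matches newlines and the match succeeds.
print(re.match(".*" + PATTERN, SAMPLE_HTML, re.DOTALL).group(1))  # abc123

def extract_xsrf(html):
    """Return the _xsrf value, or "" when the field is absent."""
    match_obj = re.search(PATTERN, html)  # scans the whole text
    return match_obj.group(1) if match_obj else ""

print(extract_xsrf(SAMPLE_HTML))  # abc123
```

Either `re.DOTALL` or `re.search` works; `re.search` avoids the greedy leading `.*` and is usually the simpler choice.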
- Result: two new files, cookie.txt and index_page.html, are created

    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    手机号码登陆
    ok
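One detail about cookie.txt is worth checking: `LWPCookieJar.save()` skips cookies marked as discardable (session cookies without an expiry) unless `ignore_discard=True` is passed, so the file can exist and still hold no cookies. Whether Zhihu's login cookies fall into that category is not stated here, so the following is only a sketch of the library's behavior. It uses Python 3's `http.cookiejar` (the module that `cookielib` became); the cookie name and value and both helper functions are made up for the demonstration.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar

def make_session_cookie(name, value, domain):
    """Build a cookie with no expiry, i.e. one marked discard=True."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

def saved_names(ignore_discard):
    """Save a jar to a temp file, reload it, and return the cookie names."""
    path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
    jar = LWPCookieJar(filename=path)
    jar.set_cookie(make_session_cookie("demo", "42", "www.zhihu.com"))
    jar.save(ignore_discard=ignore_discard)

    reloaded = LWPCookieJar(filename=path)
    reloaded.load(ignore_discard=True)
    return [cookie.name for cookie in reloaded]

print(saved_names(ignore_discard=False))  # []
print(saved_names(ignore_discard=True))   # ['demo']
```

If a later run does not appear to be logged in even though cookie.txt exists, opening the file and checking whether it contains any `Set-Cookie3` lines is a quick way to tell whether this is the cause.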