Building a Search Engine with a Distributed Scrapy Crawler (imooc): Crawling Zhihu (Part 1)

Section 1: How sessions and cookies work

　　Differences between session and cookie

A cookie is the browser's local storage mechanism (data is stored as key-value pairs).
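To make the key-value idea concrete, here is a minimal sketch using the `http.cookies` module from the Python 3 standard library. The header string and the cookie names in it are made up for illustration; they are not real Zhihu cookies.

```python
from http.cookies import SimpleCookie

# A hypothetical Cookie header, roughly as a browser might send it.
raw_header = "sessionid=abc123; theme=dark"


def parse_cookie_header(header):
    """Parse a Cookie header string into a plain dict of name -> value."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = parse_cookie_header(raw_header)
print(cookies)
```

Each cookie is just a name mapped to a string value, which is why a dict is a natural way to hold them.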

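HTTP is a stateless protocol: when the server receives a request it simply returns a response, without regard to who sent it (a stateless request).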

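Stateful requests: the server gives each client an identifier, the browser stores it in a cookie and sends it back with every later request, and the server uses it to look up that client's session data.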
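The interplay between a cookie (kept by the client) and a session (kept by the server) can be sketched without any network code. Everything below is a toy model under stated assumptions: `SESSIONS`, `handle_request`, and the `session_id` cookie name are hypothetical names chosen for this example, not part of any real server.

```python
import uuid

# Server-side session store: session ID -> data about that client.
SESSIONS = {}


def handle_request(cookies):
    """Toy server handler: returns (response_text, cookies_to_set)."""
    session_id = cookies.get("session_id")
    if session_id in SESSIONS:
        # Known client: the cookie lets the server find its state.
        SESSIONS[session_id]["visits"] += 1
        return "welcome back, visit %d" % SESSIONS[session_id]["visits"], {}
    # Unknown client: create a session and hand its ID back as a cookie.
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"visits": 1}
    return "hello, new visitor", {"session_id": session_id}


# Client side: the "browser" keeps cookies and sends them on every request.
browser_cookies = {}
first, set_cookies = handle_request(browser_cookies)
browser_cookies.update(set_cookies)
second, _ = handle_request(browser_cookies)
print(first)
print(second)
```

The protocol itself stays stateless; the state lives in the server's store, and the cookie is only the key that finds it.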

Section 2:

Status codes:
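As a quick reference, HTTP status codes are grouped by their first digit. The sketch below uses `http.HTTPStatus` from the Python 3 standard library to look up the standard reason phrase; the `describe` helper is a hypothetical name introduced only for this example.

```python
from http import HTTPStatus


def describe(code):
    """Return the class of a status code plus its standard reason phrase."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes[code // 100], HTTPStatus(code).phrase


for code in (200, 302, 404, 500):
    print(code, describe(code))
```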

zhihu_login_requests.py, source version 1:

# coding: utf-8

import re
import requests

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3


def get_xsrf():
    # For now just fetch the home page and print it; no token is extracted yet
    response = requests.get("https://www.zhihu.com")
    print(response.text)
    return ""


def zhihu_login(account, password):
    # Log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("Logging in with phone number")
        post_url = "https://www.zhihu.com/signup?next=%2F"
        post_data = {
            "_xsrf": "",
            "phone_num": account,
            "password": password
        }


get_xsrf()

Output (a 500 error is returned because the request carries the script's default request headers, not a browser's):

C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
<html><body><h1>500 Server Error</h1>
An internal server error occured.
</body></html>

Fixing the 500 error: add request headers

agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    "User-Agent": agent
}
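Why the headers matter: an HTTP client library announces itself in the User-Agent header unless told otherwise (requests, for example, sends a value beginning with `python-requests`), so the server can tell the request did not come from a browser. The sketch below shows the same effect with `urllib.request` from the Python 3 standard library instead of requests, so that it runs without third-party packages and without touching the network. The user-agent string in it is a placeholder for illustration, not the one from this post.

```python
import urllib.request

# Default headers urllib attaches when the caller supplies none.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers)

# Placeholder user-agent string, for illustration only.
agent = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"
header = {
    "Referer": "https://www.zhihu.com",
    "User-Agent": agent,
}

# Build the request object without sending it, then inspect its headers.
req = urllib.request.Request("https://www.zhihu.com", headers=header)
print(req.header_items())
```

Note that urllib normalizes header names to the form `User-agent`; header names are case-insensitive in HTTP, so this does not change what the server sees.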

Connecting through a session (note: response.text must be encoded as UTF-8 before it is written to a file)

# coding: utf-8

import re
import requests

try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

# One session object carries cookies across all the requests below
session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
try:
    session.cookies.load(ignore_discard=True)
except IOError:
    print("Could not load cookies")

agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
header = {
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    "User-Agent": agent
}


def get_xsrf():
    # Get the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    # re.search scans the whole page; re.match with a leading ".*"
    # cannot get past the first newline
    match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
    if match_obj:
        return match_obj.group(1)
    else:
        return ""


def get_index():
    # Fetch the home page through the session and save it to index_page.html
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")


def zhihu_login(account, password):
    # Log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("Logging in with phone number")
        post_url = "https://www.zhihu.com/signup?next=%2F"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password
        }

        response = session.post(post_url, data=post_data, headers=header)

        session.cookies.save()


# Placeholders: substitute your own 11-digit phone number and password
zhihu_login("1xxxxxxxxxx", "your_password")
get_index()

Output (cookie.txt and index_page.html are newly created):

C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
Logging in with phone number
ok
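The cookie.txt file comes from `LWPCookieJar`, which can write its cookies to disk and read them back on a later run. The sketch below exercises that round trip with the Python 3 standard library alone. The cookie name, value, and domain are made up, and `make_cookie` is a hypothetical helper for this example. One detail worth knowing: a cookie without an expiry time is a session cookie, and `save()` skips it unless `ignore_discard=True` is passed. Whether Zhihu's cookies carry an expiry is not stated here, so treat this as a caveat to check rather than a confirmed problem.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_cookie(name, value, domain):
    """Build a session cookie (no expiry time) for the given domain."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


workdir = tempfile.mkdtemp()
kept_path = os.path.join(workdir, "cookie_kept.txt")
default_path = os.path.join(workdir, "cookie_default.txt")

jar = LWPCookieJar()
jar.set_cookie(make_cookie("demo_token", "abc123", "example.com"))

# ignore_discard=True writes session cookies too.
jar.save(filename=kept_path, ignore_discard=True)
# The default (ignore_discard=False) leaves session cookies out of the file.
jar.save(filename=default_path)

# Load each file into a fresh jar, as a later run of the script would.
restored = LWPCookieJar()
restored.load(filename=kept_path, ignore_discard=True)
dropped = LWPCookieJar()
dropped.load(filename=default_path, ignore_discard=True)

names = [c.name for c in restored]
print(names)
print(len(dropped))
```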

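One pitfall when pulling the `_xsrf` value out of the page: `re.match` only matches at the start of the string, and `.` does not cross newlines by default, so a pattern beginning with `.*` finds nothing once the token sits below the first line of the HTML. The sketch below demonstrates this on a made-up page fragment; the real Zhihu markup may differ, and `extract_xsrf` is a hypothetical helper name.

```python
import re

# Hypothetical page fragment; the real markup may differ.
html = """<html>
<body>
<form>
<input type="hidden" name="_xsrf" value="d3adb33f"/>
</form>
</body>
</html>"""

pattern = 'name="_xsrf" value="(.*?)"'


def extract_xsrf(text):
    """Return the _xsrf token found anywhere in text, or "" if absent."""
    match_obj = re.search(pattern, text)
    return match_obj.group(1) if match_obj else ""


# Anchored at the start, and '.' stops at the first newline: no match.
anchored = re.match(".*" + pattern, html)
# re.DOTALL lets '.' span lines, so the anchored form works as well.
anchored_dotall = re.match(".*" + pattern, html, re.DOTALL)

print(anchored)
print(anchored_dotall.group(1))
print(extract_xsrf(html))
```

Either `re.search` or `re.match` with `re.DOTALL` works; `re.search` is the simpler choice because it does not need the leading `.*` at all.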
Posted 2018-01-21 14:09 by 迟暮有话说
