Building a Search Engine with a Distributed Scrapy Crawler (imooc): Crawling Zhihu (Part 1)

 Section 1: How sessions and cookies work

  Differences between sessions and cookies

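  • A cookie is the browser's local storage mechanism (key-value pairs).
  • HTTP is a stateless protocol: the server handles each request and returns a response without tracking who sent it (a stateless request).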

 

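  •  Stateful requests: the server issues an identifier that the browser stores in a cookie and sends back with later requests, so the server can look up the session data it keeps for that client.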
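A stateful exchange can be sketched without any network at all. This is only a toy model, not how Zhihu or any real framework implements sessions: `handle_request`, the `SESSIONS` dict and the `sessionid` cookie name are all hypothetical names chosen for the illustration.

```python
import uuid
from http.cookies import SimpleCookie

# Server-side session store: session id -> per-client data (hypothetical, in memory).
SESSIONS = {}

def handle_request(cookie_header):
    """Toy server handler: returns (body, Set-Cookie value or None)."""
    cookie = SimpleCookie(cookie_header or "")
    morsel = cookie.get("sessionid")
    if morsel is not None and morsel.value in SESSIONS:
        # Known client: the cookie value points at data kept on the server.
        session = SESSIONS[morsel.value]
        session["visits"] += 1
        return "welcome back, visit %d" % session["visits"], None
    # Unknown client: create a session and ask the browser to store its id.
    session_id = uuid.uuid4().hex
    SESSIONS[session_id] = {"visits": 1}
    reply = SimpleCookie()
    reply["sessionid"] = session_id
    return "hello, new visitor", reply["sessionid"].OutputString()

# The first request carries no cookie, so the server cannot recognise the client.
body1, set_cookie = handle_request(None)
# The browser stores the cookie and sends it back on the next request.
body2, _ = handle_request(set_cookie)
print(body1)
print(body2)
```

The cookie itself only holds the identifier; the visit counter lives in the server-side session.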


Section 2:

  • Status codes:
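The notes do not list specific codes here, so the selection below is an assumption: a few codes a crawler commonly runs into, looked up through the standard library's `http.HTTPStatus`. `describe` is a hypothetical helper written for this sketch.

```python
from http import HTTPStatus

# The first digit of a status code gives its class.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe(code):
    """Return 'code phrase (class)' for a standard HTTP status code."""
    status = HTTPStatus(code)
    return "%d %s (%s)" % (code, status.phrase, CLASSES[code // 100])

for code in (200, 301, 302, 403, 404, 500):
    print(describe(code))
```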

 

  •  Source of zhihu_login_requests.py, version 1:
    # coding: utf-8

    import re
    import requests
    try:
        import cookielib  # Python 2
    except ImportError:
        import http.cookiejar as cookielib  # Python 3

    def get_xsrf():
        # For now, just fetch the home page and print it
        response = requests.get("https://www.zhihu.com")
        print(response.text)
        return ""

    def zhihu_login(account, password):
        # Log in to Zhihu
        if re.match(r"^1\d{10}", account):
            print("手机号码登陆")  # "logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": "",
                "phone_num": account,
                "password": password
            }

    get_xsrf()
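A small aside on the account check: `^1\d{10}` only anchors the start of the string, so longer inputs also pass. The sketch below compares it with a variant that also anchors the end. The stricter pattern, the `looks_like_phone` helper and the sample values are assumptions added for illustration, not part of the course code.

```python
import re

LOOSE = re.compile(r"^1\d{10}")    # pattern from the login code: start anchor only
STRICT = re.compile(r"^1\d{10}$")  # assumption: also require the string to end there

def looks_like_phone(account, pattern=STRICT):
    """True when the account string matches the given phone-number pattern."""
    return pattern.match(account) is not None

# Made-up sample inputs: 11 digits, 15 digits, and an email address.
samples = ["13800000000", "138000000001234", "user@example.com"]
results = [(s, looks_like_phone(s, LOOSE), looks_like_phone(s, STRICT)) for s in samples]
for sample, loose, strict in results:
    print(sample, loose, strict)
```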

 

 

  •  Result (a 500 error comes back, because the request goes out with the requests library's default headers rather than browser headers):
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    <html><body><h1>500 Server Error</h1>
    An internal server error occured.
    </body></html>

 

  • Fixing the 500 error: add request headers
    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }
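To see why the default headers give a script away, here is a rough offline sketch. It uses the standard library's `urllib.request` instead of requests, only so that it runs without a network or third-party packages; requests behaves similarly and identifies itself as `python-requests/...` unless told otherwise. `BROWSER_AGENT` is a placeholder string, not a real browser signature.

```python
from urllib.request import Request, build_opener

URL = "https://www.zhihu.com"
BROWSER_AGENT = "Mozilla/5.0 (placeholder browser agent)"  # hypothetical value

# What a plain opener would send as User-Agent when no header is supplied.
default_agent = dict(build_opener().addheaders)["User-agent"]

# The same kind of header dict as in the notes, attached to a request object.
header = {
    "Referer": "https://www.zhihu.com",
    "User-Agent": BROWSER_AGENT,
}
request = Request(URL, headers=header)

print("default:", default_agent)
# urllib stores header names capitalised as "User-agent".
print("custom: ", request.get_header("User-agent"))
```

Nothing is sent here; the request object is only built and inspected.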

 

  • Connect through a session (note: response.text must be encoded as UTF-8 before it is written to a file)
    # coding: utf-8

    import re
    import requests
    try:
        import cookielib  # Python 2
    except ImportError:
        import http.cookiejar as cookielib  # Python 3

    # One session for every request, with its cookies persisted to cookie.txt
    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    try:
        session.cookies.load(ignore_discard=True)
    except IOError:  # missing or unreadable cookie file
        print("cookie未能加载")  # "the cookie could not be loaded"

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    def get_xsrf():
        # Get the xsrf code from the home page
        response = session.get("https://www.zhihu.com", headers=header)
        # re.search scans the whole page; re.match with ".*" stops at the first newline
        match_obj = re.search('name="_xsrf" value="(.*?)"', response.text)
        if match_obj:
            return match_obj.group(1)
        return ""

    def get_index():
        # Fetch the home page through the session and save it to index_page.html
        response = session.get("https://www.zhihu.com", headers=header)
        with open("index_page.html", "wb") as f:
            f.write(response.text.encode("utf-8"))
        print("ok")

    def zhihu_login(account, password):
        # Log in to Zhihu
        if re.match(r"^1\d{10}", account):
            print("手机号码登陆")  # "logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": get_xsrf(),
                "phone_num": account,
                "password": password
            }
            response = session.post(post_url, data=post_data, headers=header)
            session.cookies.save()

    # Placeholder credentials; substitute your own account and password
    zhihu_login("13800000000", "your_password")
    get_index()
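The token lookup in `get_xsrf` depends on how the regex is applied to a multi-line page. The sketch below compares three variants on a made-up HTML fragment. `SAMPLE_HTML`, its token value and the `extract_xsrf` helper are hypothetical, and the real Zhihu markup may look different.

```python
import re

# Hypothetical page fragment, used only to exercise the patterns.
SAMPLE_HTML = """<html>
<body>
<form method="post">
<input type="hidden" name="_xsrf" value="abc123"/>
</form>
</body>
</html>"""

TOKEN = 'name="_xsrf" value="(.*?)"'

def extract_xsrf(html):
    """Return the token value, or an empty string when it is absent."""
    found = re.search(TOKEN, html)
    return found.group(1) if found else ""

# re.match starts at position 0 and "." does not cross newlines by default,
# so this only works when the token sits on the first line.
line_bound = re.match(".*" + TOKEN, SAMPLE_HTML)
# With re.DOTALL, "." also matches newlines and the prefix can span lines.
dotall = re.match(".*" + TOKEN, SAMPLE_HTML, re.DOTALL)

print(line_bound)
print(dotall.group(1))
print(extract_xsrf(SAMPLE_HTML))
print(repr(extract_xsrf("<html></html>")))
```

Either `re.DOTALL` or `re.search` gets past the newline problem; `re.search` simply avoids the leading `.*`.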

     

  • Result (cookie.txt and index_page.html are newly created)
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    手机号码登陆
    ok
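One detail about cookie.txt that is easy to miss: `LWPCookieJar.save()` skips cookies marked as discardable (session cookies) unless `ignore_discard=True` is passed, just as `load()` needs the same flag to read them back. The offline sketch below shows both cases. It is only an illustration: `make_cookie`, the cookie name `demo_token` and its value are invented, and it says nothing about which cookies Zhihu actually sets.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar

def make_cookie(name, value, domain="www.zhihu.com"):
    """Build a session cookie (discard=True, no expiry) by hand."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

def round_trip(ignore_discard_on_save):
    """Save one session cookie, reload the file, return the cookie names found."""
    path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
    jar = LWPCookieJar(filename=path)
    jar.set_cookie(make_cookie("demo_token", "abc123"))
    jar.save(ignore_discard=ignore_discard_on_save)

    restored = LWPCookieJar(filename=path)
    restored.load(ignore_discard=True)
    return [cookie.name for cookie in restored]

kept = round_trip(ignore_discard_on_save=True)
dropped = round_trip(ignore_discard_on_save=False)
print(kept)     # ['demo_token']
print(dropped)  # []
```

If a saved login does not survive a restart, checking whether the relevant cookies were written at all is a reasonable first step.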

 

posted @ 2018-01-21 14:09  迟暮有话说