Crawler Lesson One: "The requests and BeautifulSoup Modules" and "Logging In to a Website Without a Browser"

Commonly used requests methods:

import requests
requests.get()   # send a GET request

requests.post()  # send a POST request

---------------------------------------------------------------
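Difference between obj.content and obj.text (both are attributes, not methods):
text is a Unicode string (str); use text when the page is going to be parsed with BeautifulSoup.
content is raw bytes; use content when writing a file to local disk.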
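
To make the bytes/str distinction concrete, here is a minimal offline sketch. It builds a Response by hand rather than fetching a page; setting the private _content attribute is only a trick to avoid network access and is an assumption of this example, not something the original post does.

```python
import requests

# Build a Response by hand so the example runs without a network.
# Assumption: writing to the private _content attribute is a testing trick only.
resp = requests.Response()
resp._content = "爬虫".encode("gbk")   # the raw body, as bytes
resp.encoding = "gbk"                  # tell requests how to decode it

raw = resp.content     # bytes, suitable for open(path, 'wb').write(...)
decoded = resp.text    # str, suitable for BeautifulSoup(...)

print(type(raw).__name__, type(decoded).__name__)
```

With a real page, the same two attributes are read from the object returned by requests.get().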

Commonly used BeautifulSoup features:

from bs4 import BeautifulSoup
Create the object (response is the value returned by requests.get()):
soup=BeautifulSoup(response.text,'html.parser')

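soup.find() and its parameters

soup.find('h3') # find the first h3 tag in the whole HTML document

Parameters of find() and find_all():
soup.find_all(attrs={'class':'meta-title'}) # class cannot be passed directly as a keyword argument because it is a reserved word in Python (the keyword class_ also works). attrs can match several attributes at once, e.g. attrs={'name':'authenticity_token','class':'meta-title'}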
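
The calls above can be tried without a network on a small HTML string. This is only a sketch: the markup below is made up for the example and does not come from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example.
html = """
<div id="wrap">
  <h3>First heading</h3>
  <a class="meta-title" href="/post/1">Post one</a>
  <a class="meta-title" href="/post/2">Post two</a>
  <input name="authenticity_token" value="abc123">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_h3 = soup.find("h3")                             # first match, or None
links = soup.find_all(attrs={"class": "meta-title"})   # every match, as a list
same_links = soup.find_all(class_="meta-title")        # keyword form of the same query
token = soup.find("input", attrs={"name": "authenticity_token"}).get("value")

hrefs = [a.get("href") for a in links]
print(first_h3.get_text(), hrefs, token)
```

find() returns a single tag (or None when nothing matches), while find_all() always returns a list, so only the latter can be looped over directly.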

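∆ Crawling the articles on one page of Jobbole (伯乐在线):

import requests
from bs4 import BeautifulSoup

response=requests.get('http://python.jobbole.com')

soup=BeautifulSoup(response.text,'html.parser')

wrap=soup.find(id='widgets-homepage-fulwidth') # a tag with a unique ID (any other unique marker works too); look it up with the browser's "Inspect" tool

target=wrap.find_all(attrs={'class':'meta-title'}) # search inside that tag for the a tags, then read each one's href and text to get the link and the title
for i in target:
    print(i.get('href'))
    print(i.get_text())

Result: each article's link is printed, followed by its title.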

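∆ Logging in to GitHub # GitHub sends a cookie to the client as soon as the page is visited, so the login request must carry that cookie

import requests
from bs4 import BeautifulSoup

account='your-username'   # placeholder: fill in your own account
password='your-password'  # placeholder: fill in your own password

# fetch the login page first to obtain the token
r1=requests.get('http://github.com/login')

# extract the token
s1=BeautifulSoup(r1.text,'html.parser')
csrf_token=s1.find(name='input',attrs={'name':'authenticity_token'}).get('value')

r1_cookie_dict=r1.cookies.get_dict() # this cookie may be needed again later if a page refuses to open

# POST the username, password and token to the server
r2=requests.post(
    'https://github.com/session',
    data={
        'utf8':'✓', # the value was garbled in the original; a check mark is assumed here
        'authenticity_token':csrf_token,
        'login':account,
        'password':password,
        'commit':'Sign in'
    },
    cookies=r1_cookie_dict
)

# logging in to GitHub returns another cookie, the authorized one
r2_cookie_dict=r2.cookies.get_dict() # the cookies as a dict

# some sites need both the cookie from the first visit and the cookie from the login, so both are merged here
cookie_dict={}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)

# visit a page that requires login
r3=requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict,
)
print(r3.text) # check whether the login worked

Result: search the output in PyCharm for your own account information. If it is there, the login succeeded.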
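
As an alternative to merging cookie dicts by hand, requests also offers requests.Session, which stores the cookies from every response and sends them with later requests. The sketch below only demonstrates the cookie jar offline; the cookie names and values are made up, and open_settings is a hypothetical helper that is defined but not called because it needs network access and a real login.

```python
import requests

session = requests.Session()

# Stand-ins for cookies that two responses would have set (made-up values).
# With real traffic, session.get() and session.post() fill the jar automatically.
session.cookies.update({"first_visit": "abc"})
session.cookies.update({"logged_in": "xyz"})

merged = session.cookies.get_dict()
print(merged)


def open_settings(sess):
    """Hypothetical helper: fetch a page that requires login, reusing the jar."""
    return sess.get("https://github.com/settings/emails")
```

With a session, the three steps (open the login page, POST the form, open a protected page) become three calls on the same object, and no cookies argument is needed.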

∆ Logging in to 51CTO # 51CTO also requires a cookie at login

1. Find the login endpoint:

import requests
from bs4 import BeautifulSoup
# form fields captured from the browser during a login:
"""
_csrf:NWZPeWpYS2ZqKwlPWBUxCAwJGEsrbH0qAQ8ALCECH14MICNUUww6LA==
LoginForm[username]:震荡
LoginForm[password]:fda 
LoginForm[rememberMe]:0
login-button:登 录
"""
account='your-username'   # placeholder: fill in your own account
password='your-password'  # placeholder: fill in your own password

cto=requests.get('http://home.51cto.com/index/')

soup=BeautifulSoup(cto.text,'html.parser')

# the first input inside the login form holds the _csrf token
token=soup.find(id='login-base').find('input').get('value')

# cookie issued on the first visit
cookie=cto.cookies.get_dict()

req=requests.post(
    'http://home.51cto.com/index',
    data={
        '_csrf':token,
        'LoginForm[username]':account,
        'LoginForm[password]':password,
        'LoginForm[rememberMe]':0,
        'login-button':'登 录'
    },
    cookies=cookie
)

# cookie issued after logging in
cookie2=req.cookies.get_dict()
cookie_dict={}
cookie_dict.update(cookie)
cookie_dict.update(cookie2)

# visit the personal page to verify that the login succeeded
result=requests.get(
    'http://home.51cto.com/space',
    cookies=cookie_dict
)
print(result.text)
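
Taking the first input inside the form works only as long as the token stays first in the markup. A slightly less fragile variant looks the field up by its name. The sketch below runs offline on a made-up form; the real 51CTO markup may differ, and extract_csrf is a hypothetical helper, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical login form; the real page's markup may differ.
html = """
<form id="login-base">
  <input type="text" name="LoginForm[username]">
  <input type="hidden" name="_csrf" value="token-123">
</form>
"""


def extract_csrf(page_html, field_name="_csrf"):
    """Return the value of the input named field_name, or None if absent."""
    soup = BeautifulSoup(page_html, "html.parser")
    field = soup.find("input", attrs={"name": field_name})
    return field.get("value") if field else None


token = extract_csrf(html)
print(token)
```

Returning None instead of raising lets the caller decide what to do when the page layout changes and the field is no longer found.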

∆ Writing the crawled data to files

import os
import requests
from bs4 import BeautifulSoup

response=requests.get('http://www.autohome.com.cn/news/')

# decode the body of the response
response.encoding='gbk'  # the page is GBK-encoded while PyCharm works in UTF-8; the charset can be read from the page's meta tag

soup=BeautifulSoup(response.text,'html.parser') # soup now wraps the whole HTML document

tag=soup.find(id='auto-channel-lazyload-article')

os.makedirs('img',exist_ok=True) # the output directory must exist before files are written

# find every news item: title, summary, URL, image
text=tag.find_all(name='li')
for i in text:
    title=i.find('h3')
    if not title:
        continue
    summary=i.find('p').text
    link=i.find('a').get('href')
    img=i.find('img').get('src') # to download an image, send one more request to the server (as a browser does) and write the body to disk
    if img.startswith('//'):
        img='http:'+img # assumption: a protocol-relative src needs a scheme before requests can fetch it
    res=requests.get(img)
    title=title.get_text().replace('/','').replace(" ","")

    file_name='img/%s.jpg'%(title)

    with open(file_name,'wb') as f:
        f.write(res.content)
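
Two details of the loop above are easy to get wrong: the src of an image may be relative or protocol-relative, and a title may contain characters that are not valid in a file name. The sketch below shows one possible way to handle both with the standard library only. absolute_url and safe_file_name are hypothetical helpers, and the sample inputs are made up.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "http://www.autohome.com.cn/news/"


def absolute_url(src, base=PAGE_URL):
    """Resolve a relative or protocol-relative src against the page URL."""
    return urljoin(base, src)


def safe_file_name(title, ext=".jpg"):
    """Drop whitespace and characters that are unsafe in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', "", title)
    return (cleaned or "untitled") + ext


print(absolute_url("//img.example.com/a.jpg"))
print(safe_file_name("A/B test: 1"))
```

In the loop, img would be passed through absolute_url() before requests.get(), and the file name would be built with safe_file_name(title) instead of the two replace() calls.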

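Appendix:

The difference between two kinds of websites:

The first kind sends a cookie to the client as soon as the site is opened, and every later operation by the client must carry that cookie.

e.g. 51CTO sends a cookie to the client when the page is opened. The login request must carry this cookie, and after logging in, the cookie obtained from the login is used for all other operations.

The second kind sends no cookie when the site is opened, so the login request does not need to carry one.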
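
For the first kind of site, the cookies from the first visit and from the login are combined before later requests. A minimal sketch of that merge follows; merge_cookies is a hypothetical helper and the cookie names and values are made up. When the same name appears twice, the later dict wins, which is why the login cookie is passed last.

```python
def merge_cookies(*cookie_dicts):
    """Merge cookie dicts from left to right; later values override earlier ones."""
    merged = {}
    for cookies in cookie_dicts:
        merged.update(cookies)
    return merged


first_visit = {"sid": "anon-1", "lang": "zh"}   # set when the page is opened
after_login = {"sid": "user-9", "auth": "ok"}   # set by the login response

all_cookies = merge_cookies(first_visit, after_login)
print(all_cookies)
```

For the second kind of site, the first dict is simply empty and the result equals the login cookies.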

------------------------------------------------------------------------------------------------

posted @ 2017-08-28 17:10 tonycloud
