Python Crawler, Day 1
1. What is a crawler?
A crawler fetches a website's information based on its URL.
2. How do we crawl a website's information?
a. Impersonate a browser, send an HTTP request to a URL, and get back the response string
(import the requests module and use it to impersonate a browser; install it with: pip3 install requests)
a1. Download a page
response = requests.get(url='URL')
response.encoding = response.apparent_encoding   # detect the page's encoding and decode with it
response.text      # the response body as text
response.content   # the response body as bytes
response = requests.post(
    url='URL',
    data={
        # form fields as a dict
    },
    headers={},   # request headers
    cookies={}
)
cookies_dict = response.cookies.get_dict()
Note: 1. impersonate the browser convincingly; 2. analyze the real request (URL, form data, headers) in the browser first. A small sketch follows below.
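A minimal sketch of "impersonating the browser" with requests; the URL is a placeholder and the User-Agent value is simply the one used later in Example 2:

import requests

# Headers copied from a real browser request (F12 -> Network tab)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

response = requests.get(
    url='https://example.com/',   # placeholder URL, not from the original notes
    headers=headers,
)
response.encoding = response.apparent_encoding
print(response.status_code)                  # 200 means the download succeeded
cookies_dict = response.cookies.get_dict()   # cookies to reuse in later requests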
b. Parse: extract the specific content you want with BeautifulSoup
(import the BeautifulSoup module to parse the HTML string; install it with: pip3 install beautifulsoup4)
soup = BeautifulSoup('<html>........</html>', 'html.parser')   # the 'lxml' parser is faster but must be installed separately
div = soup.find(name='tag name')                  # find the first matching tag
div = soup.find(name='tag name', id='il')
div = soup.find(name='tag name', class_='il')     # use class_ because class is a Python keyword
div = soup.find(name='div', id='auto-channel-lazyload-article')
div.text          # the tag's text
div.attrs         # all attributes, as a dict
div.get('href')   # a single attribute
divs = soup.find_all(name='tag name')             # find all matching tags
divs = soup.find_all(name='tag name', id='il')
divs = soup.find_all(name='tag name', class_='il')
divs = soup.find_all(name='div', id='auto-channel-lazyload-article')
find_all returns a list (a ResultSet): access its elements by index or iteration; you cannot call find on the list itself, only on the individual tags inside it (see the sketch below).
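A self-contained sketch of the find / find_all calls above; the HTML snippet is made up for illustration:

from bs4 import BeautifulSoup

html = '<div id="news"><a class="il" href="/a/1">first</a><a class="il" href="/a/2">second</a></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find(name='div', id='news')          # first matching tag -> a Tag object
links = div.find_all(name='a', class_='il')     # all matching tags -> a list-like ResultSet
for a in links:
    print(a.text, a.get('href'), a.attrs)       # text, one attribute, all attributes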
Example 1:
import requests
from bs4 import BeautifulSoup

# 1. Download the page
ret = requests.get(url='URL')
ret.encoding = ret.apparent_encoding
# print(ret.text)

# 2. Parse: extract the desired content with BeautifulSoup
soup = BeautifulSoup(ret.text, 'html.parser')   # the 'lxml' parser is faster but must be installed separately
div = soup.find(name='div', id='auto-channel-lazyload-article')
li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    if not h3:
        continue
    p = li.find(name='p')
    a = li.find('a')
    img = li.find('img')
    src = img.get('src')
    file_name = src.rsplit('__', maxsplit=1)[1]
    # the src in the page is protocol-relative, so prepend https:
    ret_img = requests.get(url='https:' + src)
    with open(file_name, 'wb') as f:
        f.write(ret_img.content)
    print(h3.text, a.get('href'))
    print(p.text)
Example 2:
import requests
from bs4 import BeautifulSoup

# Download the login page to get the token and the cookies
r1 = requests.get(
    url='********'
)
s1 = BeautifulSoup(r1.text, 'html.parser')
token = s1.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
cookies_dict = r1.cookies.get_dict()

# Log in
r2 = requests.post(
    url='********',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': 'account',
        'password': 'password'
    },
    headers={
        'Host': '*********',
        'Origin': '*********',
        'Referer': '***********',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    },
    cookies=cookies_dict
)

# Request the profile page with the same cookies
r3 = requests.get(
    url='************',
    headers={
        'Host': '*******',
        'Referer': '*******',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    },
    cookies=cookies_dict
)

# Parse: extract the desired content
soup3 = BeautifulSoup(r3.text, 'html.parser')
div = soup3.find(name='div', attrs={'class': 'vcard-names-container py-3 js-sticky js-user-profile-sticky-fields '})
# This is a freshly registered account with only the name filled in, so we only fetch the name
name = div.find(name='span')
print(name.text)
Approach: to crawl a site's information, make your requests look as much like a real browser as possible; if the impersonation is convincing, simple anti-crawling checks will not stop you.
Step 1: download the page; step 2: parse it.
Note: a few small pitfalls:
1. Check whether the browser's F12 developer tools are in desktop or mobile mode; this can change the requests you see.
2. Make sure the URL you request matches the one the browser actually requests.
3. Pay attention to how the cookies and the token are obtained (a Session-based variation is sketched after this list).
4. Make sure the headers imitate the browser's convincingly.
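A variation on Example 2, not from the original notes: requests.Session keeps cookies across requests automatically, so cookies_dict does not have to be passed by hand; all URLs below are placeholders:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
})

# GET the login page; the session stores whatever cookies it receives
r1 = session.get('https://example.com/login')                     # placeholder URL
token = BeautifulSoup(r1.text, 'html.parser').find(
    name='input', attrs={'name': 'authenticity_token'}).get('value')

# POST the login form; the stored cookies are sent back automatically
r2 = session.post('https://example.com/session', data={           # placeholder URL
    'authenticity_token': token,
    'login': 'account',
    'password': 'password',
})

# later requests in the same session stay logged in
r3 = session.get('https://example.com/profile')                   # placeholder URL
print(r3.status_code)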