Python Crawler, Day 1
1. What is a crawler?
A crawler fetches a website's information based on its URL.
2. How do we crawl a website's information?
a. Impersonate a browser, send an HTTP request to a URL, and get back the response string
(import the requests module and use it to impersonate a browser; install it with: pip3 install requests)
a1. Download a page
response = requests.get(url='URL')
response.encoding = response.apparent_encoding   # detect the page's encoding and decode with it
response.text      # the response body as text
response.content   # the response body as bytes
response = requests.post(
    url='URL',
    data={
        # form fields as a dict
    },
    headers={},   # request headers
    cookies={}
)
cookies_dict = response.cookies.get_dict()
Note: 1. impersonate the browser convincingly; 2. analyze the real request (URL, form data, headers) in the browser first. A small sketch follows below.
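A minimal sketch of "impersonating the browser" with requests; the URL is a placeholder and the User-Agent value is simply the one used later in Example 2:

import requests

# Headers copied from a real browser request (F12 -> Network tab)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}

response = requests.get(
    url='https://example.com/',   # placeholder URL, not from the original notes
    headers=headers,
)
response.encoding = response.apparent_encoding
print(response.status_code)                  # 200 means the download succeeded
cookies_dict = response.cookies.get_dict()   # cookies to reuse in later requests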
b. Parse: extract the specific content you want with BeautifulSoup
(import the BeautifulSoup module to parse the HTML string; install it with: pip3 install beautifulsoup4)
soup = BeautifulSoup('<html>........</html>', 'html.parser')   # the 'lxml' parser is faster but must be installed separately
div = soup.find(name='tag name')                  # find the first matching tag
div = soup.find(name='tag name', id='il')
div = soup.find(name='tag name', class_='il')     # use class_ because class is a Python keyword
div = soup.find(name='div', id='auto-channel-lazyload-article')
div.text          # the tag's text
div.attrs         # all attributes, as a dict
div.get('href')   # a single attribute
divs = soup.find_all(name='tag name')             # find all matching tags
divs = soup.find_all(name='tag name', id='il')
divs = soup.find_all(name='tag name', class_='il')
divs = soup.find_all(name='div', id='auto-channel-lazyload-article')
find_all returns a list (a ResultSet): access its elements by index or iteration; you cannot call find on the list itself, only on the individual tags inside it (see the sketch below).
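A self-contained sketch of the find / find_all calls above; the HTML snippet is made up for illustration:

from bs4 import BeautifulSoup

html = '<div id="news"><a class="il" href="/a/1">first</a><a class="il" href="/a/2">second</a></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find(name='div', id='news')          # first matching tag -> a Tag object
links = div.find_all(name='a', class_='il')     # all matching tags -> a list-like ResultSet
for a in links:
    print(a.text, a.get('href'), a.attrs)       # text, one attribute, all attributes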
Example 1:
import requests
from bs4 import BeautifulSoup

# 1. Download the page
ret = requests.get(url='URL')
ret.encoding = ret.apparent_encoding
# print(ret.text)

# 2. Parse: extract the desired content with BeautifulSoup
soup = BeautifulSoup(ret.text, 'html.parser')   # the 'lxml' parser is faster but must be installed separately
div = soup.find(name='div', id='auto-channel-lazyload-article')
li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    if not h3:
        continue
    p = li.find(name='p')
    a = li.find('a')
    img = li.find('img')
    src = img.get('src')
    file_name = src.rsplit('__', maxsplit=1)[1]
    # the src in the page is protocol-relative, so prepend https:
    ret_img = requests.get(url='https:' + src)
    with open(file_name, 'wb') as f:
        f.write(ret_img.content)
    print(h3.text, a.get('href'))
    print(p.text)
Example 2:
import requests
from bs4 import BeautifulSoup

# Download the login page to get the token and the cookies
r1 = requests.get(
    url='********'
)
s1 = BeautifulSoup(r1.text, 'html.parser')
token = s1.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
cookies_dict = r1.cookies.get_dict()

# Log in
r2 = requests.post(
    url='********',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': 'account',
        'password': 'password'
    },
    headers={
        'Host': '*********',
        'Origin': '*********',
        'Referer': '***********',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    },
    cookies=cookies_dict
)

# Request the profile page with the same cookies
r3 = requests.get(
    url='************',
    headers={
        'Host': '*******',
        'Referer': '*******',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
    },
    cookies=cookies_dict
)

# Parse: extract the desired content
soup3 = BeautifulSoup(r3.text, 'html.parser')
div = soup3.find(name='div', attrs={'class': 'vcard-names-container py-3 js-sticky js-user-profile-sticky-fields '})
# This is a freshly registered account with only the name filled in, so we only fetch the name
name = div.find(name='span')
print(name.text)
Approach: to crawl a site's information, make your requests look as much like a real browser as possible; if the impersonation is convincing, simple anti-crawling checks will not stop you.
Step 1: download the page; step 2: parse it.
Note: a few small pitfalls:
1. Check whether the browser's F12 developer tools are in desktop or mobile mode; this can change the requests you see.
2. Make sure the URL you request matches the one the browser actually requests.
3. Pay attention to how the cookies and the token are obtained (a Session-based variation is sketched after this list).
4. Make sure the headers imitate the browser's convincingly.
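A variation on Example 2, not from the original notes: requests.Session keeps cookies across requests automatically, so cookies_dict does not have to be passed by hand; all URLs below are placeholders:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
})

# GET the login page; the session stores whatever cookies it receives
r1 = session.get('https://example.com/login')                     # placeholder URL
token = BeautifulSoup(r1.text, 'html.parser').find(
    name='input', attrs={'name': 'authenticity_token'}).get('value')

# POST the login form; the stored cookies are sent back automatically
r2 = session.post('https://example.com/session', data={           # placeholder URL
    'authenticity_token': token,
    'login': 'account',
    'password': 'password',
})

# later requests in the same session stay logged in
r3 = session.get('https://example.com/profile')                   # placeholder URL
print(r3.status_code)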