Web Scraping

requests

Official Chinese documentation: https://2.python-requests.org/zh_CN/latest/
In web scraping, requests is typically used to send HTTP requests and handle the responses.

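# Import the requests module
import requests

# Send a GET request to Baidu; the returned Response object is named r
r = requests.get('https://www.baidu.com/')

# HTTP status code of the response; 200 means the request succeeded
print(r.status_code)

# Body of the response as text (the page's HTML source)
print(r.text)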

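# For responses whose body is JSON (for example, data returned by an Ajax endpoint),
# r.json() parses the body into Python objects.
# The Baidu homepage above returns HTML, so calling r.json() on it raises an error;
# uncomment the line below only when r holds a JSON response.
# print(r.json())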

# GET request with custom headers
headers = {'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'}
r = requests.get('https://www.baidu.com/', headers=headers)

# POST request with custom headers and form data
data = {'users': 'abc', 'password': '123'}
r = requests.post('https://www.weibo.com', data=data, headers=headers)
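
The calls above assume every request succeeds and that the body has the expected type. As a minimal sketch, the two helpers below (the names `fetch` and `json_or_none` are illustrative, not part of requests) add a timeout, turn 4xx/5xx statuses into exceptions, and only parse JSON when the response claims to be JSON.

```python
import requests


def fetch(url, headers=None, timeout=10):
    """Send a GET request; raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response


def json_or_none(response):
    """Return the parsed JSON body, or None when the body is not JSON."""
    content_type = response.headers.get('Content-Type', '')
    if 'json' not in content_type.lower():
        return None
    try:
        return response.json()
    except ValueError:
        # JSON decoding errors raised by requests are subclasses of ValueError
        return None
```

With these helpers, `json_or_none(fetch(url))` returns `None` for an ordinary HTML page instead of raising, so the caller can fall back to `response.text`.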

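For sites that require a login, we often need to keep a session alive; otherwise we would have to log in again before every request, which is inefficient. requests makes this straightforward as well.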

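# Keep a session alive
# Create a Session object
sess = requests.Session()
# Log in first (data and headers are the ones defined earlier)
sess.post('maybe a login url', data=data, headers=headers)
# Then visit other URLs within the same session; cookies set at login are reused
sess.get('other urls')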
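
To see what a session actually carries from one request to the next, the sketch below prepares a request without sending it. The cookie name and value and the example.com URL are hypothetical stand-ins for whatever a real login response would set.

```python
import requests

sess = requests.Session()

# Headers set on the session are sent with every request made through it
sess.headers.update({
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'
})

# Stand-in for the cookie a real login response would set (hypothetical values)
sess.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

# Build the request without sending it, to inspect what would go on the wire
prepared = sess.prepare_request(
    requests.Request('GET', 'https://example.com/profile')
)
print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))
```

Because both the header and the cookie live on the session, later calls such as `sess.get(...)` do not need to pass them again.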

beautifulsoup

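Once requests has fetched the HTML of the whole page, there is more work to do: what we need is usually only a small part of the data on that page, so we have to parse the HTML and then filter out the data we want. This is where beautifulsoup comes in.
beautifulsoup locates elements by tag plus attributes. For example, if we want the Baidu logo, inspecting the page's HTML shows that the logo image sits under a div tag and is marked by the attribute class=index-logo-srcnew.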
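
The logo lookup described above can be sketched as follows. The markup is a hypothetical, simplified stand-in for the real page (only the class name comes from the text), and the built-in `html.parser` is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; only the class name is taken from the text above
html = """
<div id="logo-wrapper">
  <img class="index-logo-srcnew" src="https://example.com/logo.png">
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Locate the element by its class attribute, whatever tag carries it
logo = soup.find(attrs={'class': 'index-logo-srcnew'})

# find returns None when nothing matches, so guard before reading attributes
logo_src = logo.get('src') if logo is not None else None
print(logo_src)
```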

beautifulsoup usage

from bs4 import BeautifulSoup

# Parse a piece of HTML; the sample used here is the "Alice in Wonderland" snippet from the official documentation
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
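# Parse with the lxml parser (requires the lxml package to be installed)
soup = BeautifulSoup(html, 'lxml')

# Get the <title> tag
print(soup.title)

# Get the text inside the tag
print(soup.title.text)

# Locate by tag name
print(soup.find_all('a'))

# Locate by attribute
print(soup.find_all(attrs={'id': 'link1'}))

# Locate by tag name plus attribute
print(soup.find_all('a', id='link1'))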
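
The calls above print whole tags. In practice we usually want a tag's text or one of its attributes, so here is a small sketch on a shortened copy of the sample HTML (it uses the built-in `html.parser`, an assumption made so that it runs without lxml).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first match, or None when nothing matches
first_link = soup.find('a')
missing = soup.find('a', id='no-such-id')

# find_all returns a list; class_ (with underscore) filters on the class attribute
links = [
    (a.get_text(strip=True), a['href'])
    for a in soup.find_all('a', class_='sister')
]
print(first_link['id'], missing, links)
```

Note the difference in return types: code that indexes into the result of `find_all` should check that the list is not empty, and code that uses `find` should check for `None`.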

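Example

First, find where the element you want sits in the page

Press F12 to open the developer tools and select the Elements tab to view the page source;
Click the arrow in the top-left corner of the developer tools, then click the information you need on the page; the source view jumps straight to the corresponding position;
In the source view, keep moving up through the parent tags until the selected tag covers all the information you need;
Note that tag and its attributes; they are used below to parse and filter the data.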

from bs4 import BeautifulSoup
import requests


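# URL of the page
url = 'http://newgame.17173.com/game-list-0-0-0-0-0-0-0-0-0-0-1-2.html'

# Send the request; r is the response
r = requests.get(url)

# r.text holds the page's HTML source
# Parse it with the lxml parser
soup = BeautifulSoup(r.text, 'lxml')

# Locate in two steps: first find the region that holds all the information
info_list = soup.find_all(attrs={'class': 'ptlist ptlist-pc'})

# Within that region, get the game titles; find_all returns a list
tit_list = info_list[0].find_all(attrs={'class': 'tit'})

# Loop over the results and print each game title
# .text returns the text content; newline characters in it are removed
for title in tit_list:
    print(title.text.replace('\n', ''))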
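
The script above fails with an IndexError if the page layout changes and `info_list` comes back empty. One way to make it easier to check is to separate parsing from fetching, as in the sketch below. The function name `extract_titles` and the sample markup are hypothetical; only the two class names come from the example, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup


def extract_titles(page_html):
    """Return the game titles found inside the 'ptlist ptlist-pc' region."""
    soup = BeautifulSoup(page_html, 'html.parser')
    container = soup.find(attrs={'class': 'ptlist ptlist-pc'})
    if container is None:
        # Region not found (layout changed, or an error page was returned)
        return []
    return [
        tag.get_text(strip=True)
        for tag in container.find_all(attrs={'class': 'tit'})
    ]


# Hypothetical markup that mimics the structure assumed by the example
sample_html = """
<ul class="ptlist ptlist-pc">
  <li><div class="tit"><a href="#">Game A</a></div></li>
  <li><div class="tit">
      <a href="#">Game B</a>
  </div></li>
</ul>
<div class="tit">Outside the list</div>
"""

titles = extract_titles(sample_html)
print(titles)
```

With the real page, the same function would be called as `extract_titles(r.text)`. `get_text(strip=True)` plays the role of `.text.replace('\n', '')` and also trims surrounding spaces.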

posted @ 2024-04-26 20:16 Lctrl