A First Look at Web Crawlers in Python

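Preface

In the information age, being able to quickly collect and look up the data you need is an important skill, and a web crawler is a good illustration of that: it lets you pull the knowledge you want from the Internet while also improving your ability to read code.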

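1 What is a crawler

If we think of the internet as one big spider web, the data sits at the nodes of that web, and a crawler is a small spider that moves along the web catching its prey (the data).

A crawler is a program that sends requests to a website, obtains the resources, then analyzes them and extracts the useful data.

Technically, a crawler is a program that imitates a browser's requests to a site, analyzes the HTML, JSON, or binary data (images, videos) the site returns, filters out the useful data, and downloads it locally.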

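2 Basic crawler workflow

2.1 Send the request

Use the third-party requests library to send a request to the target site.

A request contains request headers, a request body, and so on.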
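
To see what a request carries before anything goes over the network, a prepared request can be inspected. This is only a minimal sketch, assuming the requests library is installed; the URL and form fields are placeholders borrowed from later in this post, and the User-Agent string is shortened for illustration.

```python
import requests

# Build a request object without sending it (no network access needed).
req = requests.Request(
    method="POST",
    url="http://www.xxx.com/",               # placeholder URL
    headers={"User-Agent": "Mozilla/5.0"},   # shortened UA string (assumption)
    data={"user": "abc", "pwd": "123"},      # form fields become the request body
)
prepared = req.prepare()

print(prepared.method)          # request method
print(prepared.url)             # target URL
print(dict(prepared.headers))   # request headers
print(prepared.body)            # request body, form-encoded
```

Preparing the request rather than sending it keeps the example offline, so the headers and body can be checked before pointing the crawler at a real site.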

2.2 Get the response

If the server responds normally, you get back a response.

A response contains data in formats such as HTML or JSON.

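2.3 Parse the content

Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup, pyquery, and lxml.

Parsing JSON data: the json module.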
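
The two parsing routes can be sketched with the standard library alone. The HTML fragment and JSON payload below are made up for illustration (they stand in for a real response body), and the regular expressions only fit this simple fragment; for real pages a parsing library such as lxml is usually more robust.

```python
import json
import re

# Hypothetical HTML fragment standing in for response.text
html = '''
<div class="video-play"><video src="//example.com/a.mp4"></video></div>
<span class="video-title">Clip A</span>
<div class="video-play"><video src="//example.com/b.mp4"></video></div>
<span class="video-title">Clip B</span>
'''

# Regular expressions: capture every src attribute and every title text
src_list = re.findall(r'<video src="([^"]+)"', html)
title_list = re.findall(r'<span class="video-title">([^<]+)</span>', html)

# Hypothetical JSON payload standing in for a JSON response body
payload = '{"code": 200, "items": [{"id": 1, "name": "Clip A"}]}'
data = json.loads(payload)          # JSON text -> Python dict
first_name = data["items"][0]["name"]

print(src_list, title_list, first_name)
```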

2.4 Save the data

Store the parsed data in a database (MySQL, MongoDB, Redis).
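
The post names MySQL, MongoDB, and Redis; the sketch below swaps in SQLite from the standard library purely so that it runs without a database server. The table name, columns, and records are assumptions made up for the example.

```python
import sqlite3

# Hypothetical parsed records: (title, url) pairs
records = [
    ("Clip A", "https://example.com/a.mp4"),
    ("Clip B", "https://example.com/b.mp4"),
]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE videos (title TEXT, url TEXT)")
# Parameterised inserts avoid building SQL by string concatenation
conn.executemany("INSERT INTO videos (title, url) VALUES (?, ?)", records)
conn.commit()

rows = conn.execute("SELECT title, url FROM videos ORDER BY title").fetchall()
print(rows)
conn.close()
```

The same connect, insert, commit pattern carries over to a MySQL driver, though the placeholder syntax and connection arguments differ by driver.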

3 The requests module

3.1 Request methods: GET, POST, PUT, DELETE, and so on

requests.get requests.post
......
requests.request(method='POST', url='http://www.xxx.com/')

3.2 Request parameter: url

A URL (Uniform Resource Locator) identifies a unique resource on the internet.
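
As a small illustration of what a URL is made of, the standard library can split one into its parts. The URL here is assembled the same way the video example in this post builds its page addresses.

```python
from urllib.parse import urlsplit

# Page URL in the form used by the video example in this post
url = "https://ibaotu.com/shipin/7-0-0-0-0-1"
parts = urlsplit(url)

print(parts.scheme)   # protocol used to reach the resource
print(parts.netloc)   # host that serves it
print(parts.path)     # location of the resource on that host
```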

3.3 Request headers: headers

headers={
'user-agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}

3.4 Request body: data, used by POST requests to send data to the backend

data={'user':'abc','pwd':'123'}

3.5 Proxies: proxies, lets you send requests to the server through an IP proxy

proxie_dict = {
"http": "http://xxx.xxx.xxx.xxx",
"https": "http://xxx.xxx.xxx.xxx",
}
requests.get("https://www.xxx.com", proxies=proxie_dict)

3.6 cookie

cookies={"":""}
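
A filled-in cookie dictionary might look like the sketch below. The cookie name and value are hypothetical (a real site defines its own), and the request is only prepared, not sent, so the resulting Cookie header can be inspected offline. It assumes the requests library is installed.

```python
import requests

# Hypothetical cookie name and value; a real site defines its own
cookies = {"sessionid": "abc123"}

# Prepare the request without sending it, then look at the Cookie header
prepared = requests.Request(
    "GET", "http://www.xxx.com/", cookies=cookies
).prepare()

cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

When actually sending, the same dictionary is passed as requests.get(url, cookies=cookies).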

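4 Response status codes

200: success

301: redirect (moved permanently)

404: the resource does not exist

403: access forbidden

503: server error (service unavailable)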
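
The standard library ships a registry of status codes, which is handy for turning the numbers above into their official phrases. The is_success helper is a hypothetical convenience function added for this sketch, not part of any library.

```python
from http import HTTPStatus

# Look up the official reason phrase for each code listed above
for code in (200, 301, 403, 404, 503):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Return True for 2xx codes, which indicate success."""
    return 200 <= code < 300
```

With requests, the code of a response is available as response.status_code, so a crawler can check it before trying to parse the body.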

5 A complete request example

import requests

requests.post(
    url="http://www.xxx.com/",
    headers={
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    },
    proxies={
        "http": "http://xxx.xxx.xxx.xxx",
        "https": "http://xxx.xxx.xxx.xxx",
    },
    data={
        'user': 'abc',
        'pwd': '123'
    }
)

An example of using the requests module to crawl a video site

import requests
from lxml import etree


class BaoTuDownloading(object):
    def __init__(self, max_pages=3):
        self.url = 'https://ibaotu.com/shipin/'
        # Request headers
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'}
        self.index = 1
        # max_pages is an added assumption: the original recursed with no stop condition
        self.max_pages = max_pages

    def start_requests(self):
        # Walk the list pages one by one until max_pages is reached
        while self.index <= self.max_pages:
            response = requests.get(self.url + '7-0-0-0-0-' + str(self.index), headers=self.headers)
            html = etree.HTML(response.text)
            self.index += 1
            self.xpath_data(html)

    def xpath_data(self, html):
        src_list = html.xpath('//div[@class="video-play"]/video/@src')
        title_list = html.xpath('//span[@class="video-title"]/text()')
        # Pair each title with its link
        for src, title in zip(src_list, title_list):
            src_url = 'https:' + src
            title_name = title + '.mp4'
            respon = requests.get(src_url, headers=self.headers)
            print('Downloading ' + title_name)
            # Write the raw bytes of the video to a local file
            with open(title_name, 'wb') as f:
                f.write(respon.content)


if __name__ == '__main__':
    b = BaoTuDownloading()
    b.start_requests()

posted @ 2019-03-19 20:17 KIV
