Web Scraping 1: The requests and BeautifulSoup Modules

A first look at web scraping:

s8day132 Web scraping


Review:
    Interview topics:
        - Part 1:
            - Python basics
            - Functions
            - Object-oriented programming
        - Part 2:
            - Modules
                - Which modules do you use most often?
                    - re/json/logging/os/sys
                    - requests/beautifulsoup4
                - re (regular expressions)
                    - Write a common regex: email / mobile number / IP
                    - Greedy matching
                - Question: given the path "E:\mac苹果系统工具"? Hint: os
                - Creating and deleting files
                - Installing third-party packages:
                    - pip package manager
                    - Installing from source
                        - Download
                        - Unpack
                        - python setup.py build
                        - python setup.py install
            - Network programming
                - The 7-layer OSI model
                - Three-way handshake, four-way teardown
                - TCP and UDP
            - Concurrent programming
                - Differences between processes, threads, and coroutines
                - The GIL
                - Process pools and thread pools
    Flask framework
        - 1. What are the differences between Django and Flask?
        - 2. Flask's built-in components
            ...

        - 3. How is Flask's context management implemented?

        - 4. Why create the LocalStack class, or why does the Local object store data as {111:{stack:[ctx, ]}}?
            In a web application:
                - Single-threaded server:
                    {
                        111:{stack: [ctx, ]}
                    }
                - Multi-threaded server:
                    {
                        111:{stack: [ctx, ]}
                        112:{stack: [ctx, ]}
                    }
            Offline script:
                with app01.app_context():
                    print(current_app)
                    with app02.app_context():
                        print(current_app)
                    print(current_app)
            PS: implementing a stack
        - 5. Third-party Flask components:
            - flask-session
            - flask-sqlalchemy
            - flask-migrate
            - flask-script
            - DBUtils
            - wtforms
            - Custom auth; see the flask-login component

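For the regex items in the review list (email / mobile number / IP, and greedy matching), here is a minimal sketch. The patterns are illustrative assumptions, not complete validators: the phone pattern assumes 11-digit mainland China mobile numbers, and the email pattern is deliberately loose.

```python
import re

# Illustrative patterns only; real-world validation usually needs stricter rules.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^1[3-9]\d{9}$")  # assumption: 11-digit mainland China mobile number
OCTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"  # one IPv4 octet, 0 to 255
IPV4_RE = re.compile(r"^" + OCTET + r"(\." + OCTET + r"){3}$")


def is_email(text):
    return EMAIL_RE.match(text) is not None


def is_phone(text):
    return PHONE_RE.match(text) is not None


def is_ipv4(text):
    return IPV4_RE.match(text) is not None


# Greedy vs. non-greedy: .* grabs as much as it can, .*? as little as it can.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>(.*)</b>", html)
lazy = re.findall(r"<b>(.*?)</b>", html)

print(is_email("alex@example.com"), is_phone("13800138000"), is_ipv4("192.168.1.1"))
print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```

The greedy pattern runs to the last `</b>` in the string, which is why it returns a single oversized match; adding `?` after the quantifier makes it stop at the first one.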
Today's topics:
    1. Introduction to web scraping

    2. Scraping news from Autohome (汽车之家)

    3. requests

    4. bs4

Details:
    1. Introduction: what is a web scraper?
        A program that fetches a website's content for a given URL.

        Note: scraping can be illegal.

    2. Scraping news from Autohome
        a. Pose as a browser, send an HTTP request to a URL, and read the returned string
           pip3 install requests

           response = requests.get(url='address')
           response.content    # raw bytes of the response body
           response.encoding = response.apparent_encoding
           response.text       # body decoded using response.encoding

        b. bs4: parse the HTML string
           pip3 install beautifulsoup4

           soup = BeautifulSoup('<html>....</html>',"html.parser")

           div = soup.find(name='tag name')
           div = soup.find(name='tag name',id='i1')
           div = soup.find(name='tag name',class_='i1')
           div = soup.find(name='div',attrs={'id':'auto-channel-lazyload-article','class':'id'})

           div.text
           div.attrs
           div.get('href')


           divs = soup.find_all(name='tag name')
           divs = soup.find_all(name='tag name',id='i1')
           divs = soup.find_all(name='tag name',class_='i1')
           divs = soup.find_all(name='div',attrs={'id':'auto-channel-lazyload-article','class':'id'})

           divs is a list
           divs[0]

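The find / find_all calls listed above can be tried without touching the network. This is a small sketch that parses an inline HTML string; the markup, the `title` class, and the link paths are made up for illustration and are not Autohome's real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded news page.
HTML = """
<div id="auto-channel-lazyload-article">
  <ul>
    <li><a class="title" href="/news/1.html">First headline</a></li>
    <li><a class="title" href="/news/2.html">Second headline</a></li>
    <li><span>no link here</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find returns the first matching tag, or None when nothing matches.
container = soup.find(name="div", id="auto-channel-lazyload-article")

news = []
# find_all returns a list-like collection of every matching tag.
for li in container.find_all(name="li"):
    link = li.find(name="a", class_="title")  # note class_, since class is a keyword
    if link is None:
        continue  # skip list items that have no headline link
    news.append((link.text, link.get("href")))

print(news)
```

Checking for `None` before reading `.text` or calling `.get()` matters on real pages, where some list items are ads or placeholders with a different structure.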
    3. 
        Problem 1: upvote every news item on the Chouti (dig.chouti.com) page

            response = requests.post(
                url="xx",
                data={

                },
                headers={},
                cookies={}
            )

            cookie_dict = response.cookies.get_dict()

            Notes:
                - Pose as a browser
                - Analyze the requests the browser sends

        Problem 2: log in to GitHub automatically and view your account

        Problem 3: CAPTCHAs; use an in-house team or a paid service

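To see what "posing as a browser" and passing cookies put on the wire, one option is to build a request and inspect it without sending it. A minimal sketch, assuming a made-up URL, form field, and cookie value; the real ones have to be read from the browser's Network tab.

```python
import requests

# Hypothetical values for illustration only.
URL = "https://example.com/vote"
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36",
}
cookie_dict = {"session_id": "abc123"}  # in practice: response.cookies.get_dict()

request = requests.Request(
    method="POST",
    url=URL,
    data={"linksId": "42"},
    headers=BROWSER_HEADERS,
    cookies=cookie_dict,
)
prepared = request.prepare()  # builds the request; nothing is sent

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])  # session_id=abc123
print(prepared.body)               # linksId=42
```

Comparing this output with the request shown in the browser is a quick way to do the "analyze the requests" step: any header or cookie the browser sends and the script does not is a candidate reason for the site rejecting the script.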
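    4. Modules
        requests
            method:
            url:
            params:
            data:
            json:
            headers:
            cookies:
            proxies: use a proxy when your IP gets blocked

            files: file upload
            auth: basic authentication
            timeout: timeout in seconds. With a (connect, read) tuple, the first value is the connection timeout and the second is the read timeout (how long to wait for the server to send data); a single value applies to both
            allow_redirects: True (follow redirects until the final URL is reached)
            stream: for downloading large files
                ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
                for i in ret.iter_content():
                    print(i)
                from contextlib import closing
                with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
                    # handle the response here
                    for i in r.iter_content():
                        print(i)

            cert: client-side certificate file. (Class note on server certificates: large sites usually pay a trusted authority for theirs so that users need to do nothing; on small sites users often have to download and install the certificate before they can visit.)
            verify: whether to verify the server's certificate

        Reference: http://www.cnblogs.com/wupeiqi/articles/6283017.html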
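A short sketch of three of the parameters above: `params`, `timeout`, and `stream`. The helper names `preview_url` and `download` are invented for this example. Only `preview_url` is called here, because `download` needs a reachable server.

```python
import requests


def preview_url(url, params):
    """Return the final URL requests would use, without sending anything."""
    return requests.Request("GET", url, params=params).prepare().url


def download(url, path, chunk_size=8192):
    """Stream url to path in chunks so a large file never sits fully in memory."""
    # timeout=(connect, read): up to 3 s to connect, up to 10 s for each wait on data
    with requests.get(url, stream=True, timeout=(3, 10)) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)


# params are URL-encoded and appended as the query string.
print(preview_url("http://httpbin.org/get", {"page": 2, "q": "news"}))
# http://httpbin.org/get?page=2&q=news
```

Passing an explicit `chunk_size` to `iter_content` is worth doing for downloads; the default yields one byte at a time.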

    5. Preview
        the bs4 module


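Homework:
    1. Interview questions

        - Part 1:
            - Python basics
            - Functions
            - Object-oriented programming
        - Part 2:
            - Modules
                - 1. Which modules do you use most often?
                    - re/json/logging/os/sys
                    - requests/beautifulsoup4
                - 2. re (regular expressions)
                    - Write a common regex: email / mobile number / IP
                    - Greedy matching
                - 3. Given the path "E:\mac苹果系统工具"? Hint: os
                - 4. Creating and deleting files
                - 5. Installing third-party packages:
                    - pip package manager
                    - Installing from source
                        - Download
                        - Unpack
                        - python setup.py build
                        - python setup.py install
            - Network programming
                - 6. The 7-layer OSI model
                - 7. Three-way handshake, four-way teardown
                - 8. TCP and UDP
            - Concurrent programming
                - 9. Differences between processes, threads, and coroutines
                - 10. The GIL
                - 11. Process pools and thread pools
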

    2. Flask
        - 1. What are the differences between Django and Flask?
        - 2. Flask's built-in components
            ...

        - 3. How is Flask's context management implemented?

        - 4. Why create the LocalStack class, or why does the Local object store data as {111:{stack:[ctx, ]}}?
            In a web application:
            '''
            LocalStack exists to maintain the data in {11:{stack:[2,3,]}}
            as a stack.
            In a web application the list is not really needed, because it only ever holds one item.
            The stack only comes into play in offline scripts that nest with statements.
            '''
                - Single-threaded server:
                    {
                        111:{stack: [ctx, ]}
                    }
                - Multi-threaded server:
                    {
                        111:{stack: [ctx, ]}
                        112:{stack: [ctx, ]}
                    }
            Offline script:
                with app01.app_context():
                    print(current_app)
                    with app02.app_context():
                        print(current_app)
                    print(current_app)
            PS: implementing a stack
        - 5. Third-party Flask components:
            - flask-session
            - flask-sqlalchemy
            - flask-migrate
            - flask-script
            - DBUtils
            - wtforms
            - Custom auth; see the flask-login component
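The LocalStack question (item 4 above) is easier to follow with a toy version. This is a simplified sketch of the idea only, not Flask's or Werkzeug's actual code: `SimpleLocalStack` and `AppContext` are made-up names, and the storage is a plain dict keyed by thread id, matching the `{111:{stack:[ctx, ]}}` shape in the notes.

```python
import threading


class SimpleLocalStack:
    """Per-thread stacks stored as {thread_id: {"stack": [...]}}."""

    def __init__(self):
        self._storage = {}

    def push(self, item):
        ident = threading.get_ident()
        self._storage.setdefault(ident, {"stack": []})["stack"].append(item)

    def pop(self):
        ident = threading.get_ident()
        stack = self._storage[ident]["stack"]
        item = stack.pop()
        if not stack:
            del self._storage[ident]  # drop the entry once its stack is empty
        return item

    @property
    def top(self):
        entry = self._storage.get(threading.get_ident())
        return entry["stack"][-1] if entry else None


class AppContext:
    """Hypothetical stand-in for app.app_context()."""

    def __init__(self, stack, name):
        self.stack = stack
        self.name = name

    def __enter__(self):
        self.stack.push(self.name)
        return self

    def __exit__(self, *exc):
        self.stack.pop()
        return False


stack = SimpleLocalStack()

# Nested with blocks, as in the offline script: the top of the stack is "current_app".
seen = []
with AppContext(stack, "app01"):
    seen.append(stack.top)
    with AppContext(stack, "app02"):
        seen.append(stack.top)
    seen.append(stack.top)
print(seen)  # ['app01', 'app02', 'app01']

# Another thread has its own key, so it does not see the main thread's context.
other = []
worker = threading.Thread(target=lambda: other.append(stack.top))
with AppContext(stack, "app01"):
    worker.start()
    worker.join()
print(other)  # [None]
```

With one request per thread the inner list never grows past one item, which is the point made in the notes; the nested `with` case is what requires a stack rather than a single slot.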


    3. Log in to GitHub automatically and view your profile

    4. Blog example

    5. Preview
        the bs4 module

    Focus: the examples

Summary

More details: Scrapy workflow diagram

Key parameters:

method

url

params

data

json

headers

proxies

import requests
from bs4 import BeautifulSoup

# Fetch a page
ret = requests.get(
    'https://blog.csdn.net/wbin233/article/details/73222027#%E5%BC%80%E5%90%AFqq%E9%82%AE%E7%AE%B1smtp%E6%9C%8D%E5%8A%A1')

# ret.content  # raw bytes of the response body
ret.encoding = ret.apparent_encoding
# print(ret.text)  # text is the body decoded to a string


# Download every image on the Douban Books home page
page = requests.get('https://book.douban.com/')
soup = BeautifulSoup(page.text, 'html.parser')
for img in soup.find_all(name='img'):
    addr_img = img.get('src')
    if not addr_img:
        continue  # skip <img> tags that have no src attribute

    file_name = addr_img.rsplit('/', maxsplit=1)[1]
    img_resp = requests.get(url=addr_img)
    # Create one file per image and write the image bytes into it
    with open(file_name, 'wb') as f:
        f.write(img_resp.content)

Fetching page content
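The image loop above splits the `src` value on `/` and requests it directly, which only works when `src` is an absolute URL without a query string. A possible refinement, sketched with the standard library; `image_target` and the sample `src` values are invented for this example.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def image_target(page_url, src):
    """Return (absolute_url, file_name) for an <img> src found on page_url."""
    absolute = urljoin(page_url, src)  # resolves relative and //host/path forms
    file_name = posixpath.basename(urlparse(absolute).path)  # ignores any ?query part
    return absolute, file_name


page = "https://book.douban.com/"
samples = (
    "https://img.example.com/a/cover.jpg",  # already absolute
    "//img.example.com/b/logo.png?v=3",     # protocol-relative, with a query string
    "/static/icon.gif",                     # relative to the site root
)
for src in samples:
    print(image_target(page, src))
```

The first element of each pair is what would be passed to `requests.get`, and the second is a file name that is safe from stray `?v=3` suffixes.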

import requests
from bs4 import BeautifulSoup

# Log in to GitHub, then follow everyone listed on a stargazers page.
# A Session carries cookies from one request to the next, which the login depends on.
session = requests.Session()

# 1. Load the login page and read the hidden CSRF token from the form
login_page = session.get(url='https://github.com/login')
login_soup = BeautifulSoup(login_page.text, 'html.parser')
token = login_soup.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
# print(token)

# 2. Submit the login form. The form posts to github.com/session, not /login;
#    the session URL shows up in the browser when you click "Sign in".
session.post(
    url='https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'login': 'dream-blue',
        'password': '<your-password>',  # use your own credentials
        'authenticity_token': token,
    }
)

# 3. Fetch the stargazers page
response = session.get(
    url='https://github.com/gohugoio/hugo/stargazers',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
    }
)

soup = BeautifulSoup(response.text, 'html.parser')
repos = soup.find(attrs={'id': 'repos'})
items = repos.find(name='ol').find_all(name='li')
for item in items:
    link = item.find('div', attrs={'class': 'd-inline-block'}).find('a', attrs={'class': 'd-inline-block'})
    user = link.get('href').split('/')[1]
    # print(user)
    # Follow each user found on the page.
    # The URL is as written in these notes; confirm the real follow request in the browser's Network tab.
    follow_resp = session.post(
        url='https://github.com/gohugoio/hugo/target=%s' % user,
        data={
            'authenticity_token': token
        }
    )
    print(follow_resp)

# 4. Load the profile page; once logged in it contains the personal information
profile = session.get(url='https://github.com/dream-blue')
profile_soup = BeautifulSoup(profile.text, 'html.parser')
# print(profile_soup.text)

Logging in to GitHub

import requests
from bs4 import BeautifulSoup

# The site has an anti-scraping check: it inspects the User-Agent to identify the client.
# Anything that is neither a phone nor a desktop browser is treated as a crawler and gets no data,
# so every request below sends a browser User-Agent.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
}

# 1. Visit Chouti first to get a cookie; at this point it is not yet authorized
obj0 = requests.get(
    url='https://dig.chouti.com/r/scoff/hot/1',
    headers=HEADERS
)
cookie_ord = obj0.cookies.get_dict()
# print(cookie_ord)

# 2. Send the username and password together with that cookie; logging in authorizes the cookie
fh = requests.post(
    url='https://dig.chouti.com/login',
    data={
        'phone': '<your-phone-number>',  # use your own credentials
        'password': '<your-password>',
        'oneMonth': '1'
    },
    headers=HEADERS,
    cookies=cookie_ord
)

# 3. Walk pages 3 to 5 and upvote every item on them
for i in range(3, 6):
    opk = requests.get(
        url='https://dig.chouti.com/r/scoff/hot/%s' % i,
        headers=HEADERS
    )

    su = BeautifulSoup(opk.text, 'html.parser')
    f1 = su.find(attrs={'id': 'content-list'})
    f2 = f1.find_all(attrs={'class': 'item'})
    for obj in f2:
        fo = obj.find(attrs={'class': 'part2'})
        jk = fo.get('share-linkid')

        # Upvote each item by its id
        qqx = requests.post(
            url='https://dig.chouti.com/r/pic/hot/1?linksId=%s' % jk,
            # The linksId parameter is visible in the request shown in the browser's Network tab
            headers=HEADERS,
            cookies=cookie_ord
        )
        # print(qqx.text)

import requests
from bs4 import BeautifulSoup

ret = requests.get(url='https://www.douban.com/gallery/topic/3243/')

obj = BeautifulSoup(ret.text, 'html.parser')
fh = obj.find(attrs={'class': 'item-note'})
# print(1,fh)
# fh0=fh.children
# res=fh.find_all(name='div')
# print(res)
"""
The like section on this Douban page sits inside a div that carries a bare data-reactroot
attribute (written without a key=value form). That attribute marks content rendered by React
in the browser, so the data below it is most likely missing from the HTML that requests
receives and cannot be read this way.
"""
# for i in fh0:
#     print(i.text)
# print(fh0.content)
# for i in res:
#     print(i.text)


# ============================================================================
# obj0 = requests.get(
#     url='https://dig.chouti.com/r/scoff/hot/1',
#     headers={
#         'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'
#     }
# )
# opk = BeautifulSoup(obj0.text, 'html.parser')
# obj1 = opk.find(attrs={'id': 'content-list'})
# obj2=obj1.find_all(attrs={'class':'item'})
# for i in obj2:
#     fh=i.find(attrs={'class':'part2'})
#     fk=fh.get('share-linkid')
#     print(fk)

Login test

from requests.auth import HTTPBasicAuth, HTTPDigestAuth
import requests

ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('peter', 'pwotbgndkm'))
print(ret.text)

The auth module
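What `HTTPBasicAuth` does can be checked offline: it adds an `Authorization` header containing the Base64 form of `user:password`. A minimal sketch with made-up credentials; the request is only prepared, never sent.

```python
import base64

import requests
from requests.auth import HTTPBasicAuth

USER, PASSWORD = "demo-user", "demo-pass"  # made-up credentials for illustration

prepared = requests.Request(
    "GET",
    "https://api.github.com/user",
    auth=HTTPBasicAuth(USER, PASSWORD),
).prepare()  # builds the request without sending it

# Basic auth is just "Basic " + base64("user:password").
token = base64.b64encode(f"{USER}:{PASSWORD}".encode("latin1")).decode("ascii")
expected = "Basic " + token

print(prepared.headers["Authorization"])
print(prepared.headers["Authorization"] == expected)
```

Because Base64 is an encoding and not encryption, basic auth is only reasonable over HTTPS.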

import requests, json

# data parameter with a dict
# Shown in browser dev tools as: Form Data
# Request header: Content-Type: application/x-www-form-urlencoded
# Request body:   name=wusir&age=29
obj = requests.post(
    url='http://www.oldboyedu.com',
    data={
        'name': 'wusir',
        'age': 29
    },
    headers={},
    cookies={}
)

# json parameter
# Request header: Content-Type: application/json
# Request body:   '{"name": "alex", "age": 32}'
requests.post(
    url='http://www.oldboyedu.com',
    json={
        'name': 'alex',
        'age': 32
    },
    headers={},
    cookies={}
)

# Request body: name=wusir&age=32
requests.post(
    url='http://www.oldboyedu.com',
    data={
        'name': 'wusir',
        'age': 32
    },
    headers={},
    cookies={}
)

# Request body: '{"name": "alex", "age": 20}'
# A string passed to data is sent as-is, and no Content-Type header is set for it
requests.post(
    url='http://www.oldboyedu.com',
    data=json.dumps({
        'name': 'alex',
        'age': 20
    }),
    headers={},
    cookies={}
)

The requests module
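The difference between `data=` and `json=` in the listing above can be confirmed without contacting the server by preparing each request and reading back its header and body. A small sketch; the `describe` helper is invented for this example.

```python
import json

import requests

URL = "http://www.oldboyedu.com"
payload = {"name": "alex", "age": 32}


def describe(**kwargs):
    """Prepare a POST without sending it and return (Content-Type, body text)."""
    prepared = requests.Request("POST", URL, **kwargs).prepare()
    body = prepared.body
    if isinstance(body, bytes):  # the JSON body may be bytes, depending on the requests version
        body = body.decode("utf-8")
    return prepared.headers.get("Content-Type"), body


form = describe(data=payload)             # dict -> form-encoded body
as_json = describe(json=payload)          # dict -> JSON body, JSON content type
raw = describe(data=json.dumps(payload))  # str  -> sent as-is, no content type

print(form)
print(as_json)
print(raw)
```

The third case is the one that tends to surprise people: the body is valid JSON, but a server that dispatches on Content-Type may not parse it unless the header is set explicitly.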

posted @ 2018-05-07 23:10 dream-子皿
