1. Web scraping
A scraper essentially imitates the various behaviors of a browser.
- Concurrency options (a minimal asyncio sketch follows this list):
    - Async IO: gevent / Twisted / asyncio / aiohttp
    - A custom async IO module
    - IO multiplexing: select (on Linux, epoll)
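A minimal sketch of the asyncio + aiohttp option, assuming Python 3.7+ and the aiohttp package; the URLs are placeholders, not sites from these notes:

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # fetch one page and return its body as text
        async with session.get(url) as resp:
            return await resp.text()

    async def main(urls):
        # one shared session; all downloads run concurrently
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*(fetch(session, u) for u in urls))

    pages = asyncio.run(main(['https://example.com', 'https://example.org']))
    print([len(p) for p in pages])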
Scrapy framework
Overview: asynchronous IO, built on Twisted
- Build a custom crawler framework based on the Scrapy source code
- Use Scrapy directly (a minimal spider sketch follows this list)
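A minimal Scrapy spider sketch targeting the same Autohome news page used in the example below; the spider name and selectors are illustrative, and .get() assumes a recent Scrapy version (older releases use extract_first()):

    import scrapy

    class NewsSpider(scrapy.Spider):
        name = 'news'
        start_urls = ['https://www.autohome.com.cn/news/']

        def parse(self, response):
            # yield one item per news entry, mirroring the BeautifulSoup example below
            for li in response.css('#auto-channel-lazyload-article li'):
                yield {
                    'title': li.css('h3::text').get(),
                    'href': li.css('a::attr(href)').get(),
                    'img': li.css('img::attr(src)').get(),
                }

Run it with something like `scrapy runspider news_spider.py -o news.json`.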
2. Examples
Basic usage:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    url='https://www.autohome.com.cn/news/'
)
response.encoding = response.apparent_encoding  # fix garbled text: decode with the encoding the page was actually sent in

soup = BeautifulSoup(response.text, features='html.parser')  # in production prefer the lxml parser engine for better performance
target = soup.find(id='auto-channel-lazyload-article')  # find: returns the first match
li_list = target.find_all('li')  # find_all: returns a list of matches

for i in li_list:
    a = i.find('a')
    img = i.find('img')
    if a:
        print(a.attrs.get('href'))  # attrs: the tag's attribute dict
        txt = a.find('h3').text     # .text: the tag's text content (a str)
        print(txt)
        print(img.attrs.get('src'))
Summary:
soup = BeautifulSoup('<html>...</html>', features='html.parser')

v1 = soup.find('div')               # first <div>
v1 = soup.find(id='i1')             # first tag with id="i1"
v1 = soup.find('div', id='i1')      # first <div> with id="i1"

v2 = soup.find_all('div')
v2 = soup.find_all(id='i1')
v2 = soup.find_all('div', id='i1')  # find_all returns a list

obj = v1
obj = v2[0]

obj.text   # text content
obj.attrs  # attribute dict
Auto login (dig.chouti.com):
post_dict = {
    'phone': '111111111',
    'password': 'xxx',
    'oneMonth': 1
}
response = requests.post(
    url='http://dig.chouti.com/login',
    data=post_dict
)
print(response.text)
cookie_dict = response.cookies.get_dict()  # cookies returned by the login response
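A short sketch of reusing those cookies on a follow-up request, continuing from the snippet above; the target URL here is just the site front page, not a specific endpoint from these notes:

    # pass the cookies obtained from the login response along with a later request
    r2 = requests.get(
        url='http://dig.chouti.com/',
        cookies=cookie_dict
    )
    print(r2.status_code)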
3. requests module parameters
- Method relationships
      requests.get(.....)
      requests.post(.....)
      requests.put(.....)
      requests.delete(.....)
      ...
      requests.request('POST', ...)   # the methods above are thin wrappers around requests.request
- Parameters of requests.request
    - method: HTTP method
    - url: target URL
    - params: parameters passed in the URL (the GET query string)
          requests.request(
              method='GET',
              url='http://www.oldboyedu.com',
              params={'k1': 'v1', 'k2': 'v2'}
          )
          # http://www.oldboyedu.com?k1=v1&k2=v2
    - data: data passed in the request body, form-encoded
          requests.request(
              method='POST',
              url='http://www.oldboyedu.com',
              params={'k1': 'v1', 'k2': 'v2'},
              data={'use': 'alex', 'pwd': '123', 'x': [11, 2, 3]}
          )
          # request header: content-type: application/x-www-form-urlencoded
          # request body:   use=alex&pwd=123&x=11&x=2&x=3
    - json: data passed in the request body, JSON-serialized
          requests.request(
              method='POST',
              url='http://www.oldboyedu.com',
              params={'k1': 'v1', 'k2': 'v2'},
              json={'use': 'alex', 'pwd': '123'}
          )
          # request header: content-type: application/json
          # request body:   '{"use": "alex", "pwd": "123"}'
          # PS: use json instead of data when the dict contains nested dicts
    - headers: request headers
          requests.request(
              method='POST',
              url='http://www.oldboyedu.com',
              params={'k1': 'v1', 'k2': 'v2'},
              json={'use': 'alex', 'pwd': '123'},
              headers={
                  'Referer': 'http://dig.chouti.com/',
                  'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
              }
          )
    - cookies: cookies to send with the request
    - files: file upload
    - auth: basic authentication (an encoded username/password added to the headers)
    - timeout: timeout for the connection and the response
    - allow_redirects: whether to follow redirects
    - proxies: proxy servers
    - verify: whether to verify the SSL certificate (verify=False skips verification)
    - cert: client certificate file
    - stream: stream the response (download large files in chunks)
    - session: requests.Session() keeps cookies and connection state across requests (see the sketch below)
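A minimal requests.Session() sketch, reusing the chouti login from the earlier example; cookies set by one response are sent automatically on later requests made through the same session:

    import requests

    session = requests.Session()
    r1 = session.get('http://dig.chouti.com/')   # any cookies set here are stored on the session
    r2 = session.post(
        'http://dig.chouti.com/login',           # ...and sent automatically with this request
        data={'phone': '111111111', 'password': 'xxx', 'oneMonth': 1}
    )
    print(session.cookies.get_dict())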
4. Auto login to Cnblogs (博客园):
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import json
import base64

import rsa
import requests


def js_encrypt(text):
    # RSA-encrypt the credentials the same way the site's login JavaScript does
    b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
    der = base64.standard_b64decode(b64der)
    pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
    v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
    value = base64.encodebytes(v1).replace(b'\n', b'')
    value = value.decode('utf8')
    return value


session = requests.Session()

# 1. Fetch the sign-in page and extract the VerificationToken
i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1)

# 2. Post the encrypted username/password as JSON, carrying the token in the headers
form_data = {
    'input1': js_encrypt('wptawy'),
    'input2': js_encrypt('asdfasdf'),
    'remember': False
}
i2 = session.post(
    url='https://passport.cnblogs.com/user/signin',
    data=json.dumps(form_data),
    headers={
        'Content-Type': 'application/json; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
        'VerificationToken': verification_token
    }
)

# 3. Use the logged-in session to open a page that requires authentication
i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx')
print(i3.text)
5. Auto login to Zhihu (知乎):
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1. Fetch the sign-in page and extract the _xsrf token from the hidden form field
i1 = session.get(
    url='https://www.zhihu.com/#signin',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)
soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value')

# 2. Download the captcha image and ask the user to type it in
current_time = time.time()
i2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': current_time, 'type': 'login'},
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)
with open('zhihu.gif', 'wb') as f:
    f.write(i2.content)
captcha = input('Open zhihu.gif, then enter the captcha shown: ')

# 3. Post the login form with the xsrf token, captcha, and credentials
form_data = {
    '_xsrf': xsrf,
    'password': 'xxooxxoo',
    'captcha': captcha,
    'email': '424662508@qq.com'
}
i3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

# 4. Confirm the login by reading the nickname from the profile page
i4 = session.get(
    url='https://www.zhihu.com/settings/profile',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)
soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span', class_='name').string
print(nick_name)
http://www.cnblogs.com/wupeiqi/articles/6283017.html
High-performance crawlers:
http://www.cnblogs.com/wupeiqi/articles/6229292.html