- What is a proxy?
- Why use a proxy?
- Sometimes we need to send high-frequency requests to a website. The site's server can detect this abnormal behavior and add the requesting machine's IP address to a blacklist; further requests from that IP are then rejected, and we can no longer scrape the site.
- With a proxy, the request the website ultimately receives is sent by the proxy server, so the IP the website sees is the proxy server's IP, not our own client's IP.
- How to get a proxy: two options are recommended here; the examples below use Zhima proxy (芝麻代理).
- Test: searching "ip" on Sogou normally returns the local machine's IP. When we send the request through a proxy, the proxy's IP is returned instead.
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36',
}
url = 'https://www.sogou.com/web?query=ip'

# Send the request through a proxy server
# proxies = {'protocol': 'ip:port'}
# The proxy IP below was generated by Zhima proxy
proxies = {'https': '58.46.249.5:4231'}
page_text = requests.get(url=url, headers=headers, proxies=proxies).text

# Parse the page and extract the IP shown in Sogou's result box
tree = etree.HTML(page_text)
data = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
print(data)
```
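If Sogou's page structure changes, the XPath above breaks. As a simpler sanity check (not part of the original tutorial), you can hit an IP-echo endpoint that returns JSON, such as httpbin.org/ip; the endpoint choice and the timeout are just assumptions for illustration:

```python
import requests

# Replace with a live proxy from your provider; short-lived proxies expire quickly
proxies = {'https': '58.46.249.5:4231'}

try:
    # httpbin.org/ip echoes back the IP the request arrived from;
    # with the proxy applied, this should be the proxy's IP, not yours.
    resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
    print(resp.json()['origin'])
except requests.exceptions.RequestException as e:
    # connection errors are common with expired or overloaded proxies
    print('proxy request failed:', e)
```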
- Build a proxy pool and randomly pick one proxy from it for each request.
- 1. Request 20 IPs from the Zhima proxy API: http://webapi.http.zhimacangku.com/getip?num=20&type=2&pro=440000&city=441900&yys=0&port=11&time=1&ts=0&ys=0&cs=0&lb=1&sb=0&pb=4&mr=1&regions=
- 2. Open that URL in a browser; the returned data looks like this:

```json
{"code":0,"data":[{"ip":"113.78.24.87","port":4221},{"ip":"113.78.26.116","port":4221},{"ip":"113.77.87.173","port":4221},{"ip":"119.128.172.23","port":4221},{"ip":"113.77.85.100","port":4221},{"ip":"119.128.173.109","port":4221},{"ip":"113.77.86.112","port":4221},{"ip":"119.128.173.93","port":4221},{"ip":"113.78.26.67","port":4221},{"ip":"113.77.87.204","port":4221},{"ip":"113.77.85.134","port":4221},{"ip":"113.77.85.26","port":4221},{"ip":"113.78.26.187","port":4221},{"ip":"113.77.86.86","port":4221},{"ip":"113.78.27.182","port":4221},{"ip":"113.78.25.220","port":4221},{"ip":"119.128.173.250","port":4221},{"ip":"113.77.86.218","port":4221},{"ip":"113.77.87.73","port":4221},{"ip":"113.78.25.186","port":4221}],"msg":"0","success":true}
```

- 3. For each request, randomly select one IP from the proxy pool:
```python
import random

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.80 Safari/537.36',
}

# Fetch the proxy data from the Zhima proxy API
proxy_url = 'http://webapi.http.zhimacangku.com/getip?num=20&type=2&pro=440000&city=441900&yys=0&port=11&time=1&ts=0&ys=0&cs=0&lb=1&sb=0&pb=4&mr=1&regions='
json_data = requests.get(url=proxy_url, headers=headers).json()['data']

# Parse the response and build the proxy pool: a list of proxy dicts,
# one of which is picked at random for every request
proxy_list = []
for data in json_data:
    ip = data['ip']
    port = data['port']
    dic = {'https': ip + ':' + str(port)}  # e.g. {'https': '111.1.1.1:1234'}
    proxy_list.append(dic)

url = 'https://www.sogou.com/web?query=ip'
# Send 100 requests, each through a randomly chosen proxy from the pool
for i in range(100):
    page_text = requests.get(url=url, headers=headers, proxies=random.choice(proxy_list)).text
    tree = etree.HTML(page_text)
    data = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
    print(data)
```
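Short-lived proxies like these often die mid-run, so a single bad pick would make the loop above raise an exception. A possible hardening step (my own sketch, not from the original; the helper name `fetch_via_pool`, the retry count, and the timeout are arbitrary assumptions) is to retry with another proxy and drop any proxy that fails:

```python
import random

import requests

def fetch_via_pool(url, headers, proxy_list, max_tries=5, timeout=5):
    """Try up to max_tries random proxies; remove any proxy that errors out."""
    for _ in range(max_tries):
        if not proxy_list:
            raise RuntimeError('proxy pool is empty')
        proxy = random.choice(proxy_list)
        try:
            resp = requests.get(url, headers=headers, proxies=proxy, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.exceptions.RequestException:
            # the proxy is unreachable or rejected the request: drop it from the pool
            proxy_list.remove(proxy)
    raise RuntimeError('all attempted proxies failed')

# Usage with the pool built above:
# page_text = fetch_via_pool('https://www.sogou.com/web?query=ip', headers, proxy_list)
```

This way a dead proxy is only tried once, and the scraper keeps running as long as at least one proxy in the pool is still alive.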