A Simple POST Web Scraper
1. Scraping Target
http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1
The task: search for a target enterprise in the [单位名称] (organization name) field and collect all of the search results.
2. Approach
With the developer console open (F12), enter an arbitrary company name and click the search button:
Watch for the [jggs_list.jspx] entry under the Network tab and inspect its Headers panel on the right, which shows the following:
General:
Request URL: http://tzls.hazw.gov.cn/jggs_list.jspx
Request Method: POST
Status Code: 200 OK
Remote Address: 222.143.21.177:80
Referrer Policy: strict-origin-when-cross-origin

Response Headers:
Content-Language: zh-CN
Content-Type: text/html;charset=UTF-8
Date: Wed, 10 Nov 2021 08:10:35 GMT
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=0851EB0C048D289DE640AC7EB1851181; HttpOnly
Set-Cookie: clientlanguage=zh_CN; Path=/
Transfer-Encoding: chunked

Form Data:
xmmc:
projectid:
spsx:
spdwmc: 扶沟牧原农牧有限公司
A cURL-to-Python converter site is used here to generate the scraper code:
First, in the console, right-click the [jggs_list.jspx] request and choose Copy as cURL (bash), as shown in the figure below:
Then paste the copied command into the converter site mentioned above to generate the code:
The generated code is as follows:
import requests

cookies = {
    'JSESSIONID': '0851EB0C048D289DE640AC7EB1851181',
    'clientlanguage': 'zh_CN',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://tzls.hazw.gov.cn',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Referer': 'http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

data = {
    'xmmc': '',
    'projectid': '',
    'spsx': '',
    'spdwmc': '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8',
}

response = requests.post('http://tzls.hazw.gov.cn/jggs_list.jspx', headers=headers, cookies=cookies, data=data, verify=False)
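As a quick sanity check (my addition, not part of the generated output), the replayed request can be verified before building the full scraper:

# Assumes the generated code above has just been run
print(response.status_code)        # expect 200
print('暂无数据' in response.text)  # '暂无数据' means "no data"; False when the search found hits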
Note that 'spdwmc': '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8' in the generated code is the company name being searched (see the decoding check below). To query a different company, change this one field and repeat the POST.
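For readability: those \uXXXX escapes are simply the Unicode code points of the Chinese characters, as this illustrative one-liner confirms:

# '扶沟牧原农牧有限公司' written out as \uXXXX escapes
assert '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8' == '扶沟牧原农牧有限公司'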
On top of this, pandas reads the company names from a spreadsheet, and the returned HTML is parsed to produce the final results.
3. Full Code
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

cookies = {
    'clientlanguage': 'zh_CN',
    'JSESSIONID': 'A0543CD017C770E424424EF38DB58BCF',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://tzls.hazw.gov.cn',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Referer': 'http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

url = 'http://tzls.hazw.gov.cn/jggs_list.jspx'

# Parse a result page: collect the text of every <td> cell,
# dropping the last cell (the pagination footer).
def mathch(text):
    res = []
    soup = BeautifulSoup(text, "html.parser")
    for t in soup.find_all('td'):
        res.append(t.get_text().strip())
    return res[:-1]

# Read the total hit count from the '共 N 条' text on the page
# and convert it to a page count (10 records per page).
def pages(tests):
    startindex = tests.find('共') + 2  # skip '共' and the character after it
    endindex = tests.find('条', startindex)
    lines = int(tests[startindex:endindex])
    return (lines + 9) // 10  # ceil(lines / 10); int(lines/10)+1 overcounts on exact multiples

df = pd.read_excel('all_enterprise_names.xlsx')
ans = []
for _, row in df.iterrows():
    name = row.to_list()[0]
    # First POST: check whether there are any hits and how many pages
    print("Name=", name)
    # The search fields travel in the query string; pageNo goes in the form body
    params = (('xmmc', ''), ('projectid', ''), ('spsx', ''), ('spdwmc', name))
    data = {'pageNo': '1'}
    response = requests.post(url, headers=headers, params=params,
                             cookies=cookies, data=data, verify=False, timeout=7)
    tests = response.text
    # Special-case: skip companies for which the site reports no data ("暂无数据")
    if '暂无数据' in tests:
        continue
    res = mathch(tests)
    # Fetch any remaining pages
    pageNums = pages(tests)
    for i in range(2, pageNums + 1):
        time.sleep(3)  # throttle to avoid an IP ban or dropped responses
        print(i, "/", pageNums)
        data = {'pageNo': str(i)}
        response = requests.post(url, headers=headers, params=params,
                                 cookies=cookies, data=data, verify=False, timeout=7)
        res = res + mathch(response.text)
    ans.append(res)

result = pd.DataFrame(ans)
result.to_excel('result.xlsx')
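One fragility worth noting: the hardcoded JSESSIONID cookie expires once the captured browser session does. A minimal sketch (my addition, not part of the original post) that lets requests obtain fresh cookies automatically by visiting the search page first:

import requests

# Assumption: headers, url and params are the objects defined in the script above
session = requests.Session()
session.headers.update(headers)
# GET the listing page first; the server sets JSESSIONID/clientlanguage on the session
session.get('http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1', timeout=7)
# Later POSTs reuse those cookies, so the hardcoded cookies dict becomes unnecessary
response = session.post(url, params=params, data={'pageNo': '1'}, timeout=7)

The search loop then calls session.post in place of requests.post, with everything else unchanged.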
4. Results
