A Simple Web POST Crawler

1. Crawler Target

http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1

(Screenshot: the search page of the target site)

The task: search for each target enterprise in the [单位名称] (unit name) field and collect all of its search results.

2. Approach

With the developer console open (F12), enter any company name and click Search:

(Screenshot: the search request captured in DevTools)

Look for the [jggs_list.jspx] entry under the Network tab; the Headers panel on the right shows the following:

Request URL: http://tzls.hazw.gov.cn/jggs_list.jspx
Request Method: POST
Status Code: 200 OK
Remote Address: 222.143.21.177:80
Referrer Policy: strict-origin-when-cross-origin
Content-Language: zh-CN
Content-Type: text/html;charset=UTF-8
Date: Wed, 10 Nov 2021 08:10:35 GMT
Server: Apache-Coyote/1.1
SET-COOKIE: JSESSIONID=0851EB0C048D289DE640AC7EB1851181; HttpOnly
Set-Cookie: clientlanguage=zh_CN; Path=/
Transfer-Encoding: chunked

Form Data:

xmmc: 
projectid: 
spsx: 
spdwmc: 扶沟牧原农牧有限公司
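These four form fields are what the browser puts in the POST body. As a quick sketch (standard library only; the field values are the ones captured above), the body is just their URL-encoded concatenation, with the Chinese name percent-encoded as UTF-8:

```python
from urllib.parse import urlencode

# The four form fields captured in DevTools; only spdwmc (unit name) is filled in
fields = {
    'xmmc': '',
    'projectid': '',
    'spsx': '',
    'spdwmc': '扶沟牧原农牧有限公司',
}

# urlencode produces exactly the application/x-www-form-urlencoded body
body = urlencode(fields)
print(body)  # xmmc=&projectid=&spsx=&spdwmc=%E6%89%B6...
```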

Here an online conversion tool is used to generate the crawler code:

cURL (bash) to Python converter

First, in the Network panel, copy the [jggs_list.jspx] request as cURL (bash), as shown below:

(Screenshot: "Copy as cURL (bash)" in the Network panel)

Then paste it into the converter site mentioned above to generate the code:

(Screenshot: the generated Python code on the converter site)

The generated code:

import requests

cookies = {
    'JSESSIONID': '0851EB0C048D289DE640AC7EB1851181',
    'clientlanguage': 'zh_CN',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://tzls.hazw.gov.cn',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Referer': 'http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

data = {
  'xmmc': '',
  'projectid': '',
  'spsx': '',
  'spdwmc': '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8'
}

response = requests.post('http://tzls.hazw.gov.cn/jggs_list.jspx', headers=headers, cookies=cookies, data=data, verify=False)

Note that 'spdwmc': '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8' in the code above is just the Unicode-escaped company name (扶沟牧原农牧有限公司). To query other companies, change this one field and repeat the POST.
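As a sanity check, the \u escapes are only an ASCII-safe spelling of the same string; Python decodes them transparently, so the field can be written as a plain readable string for any target company:

```python
# The escaped literal from the generated code decodes to the company name
escaped = '\u6276\u6C9F\u7267\u539F\u519C\u7267\u6709\u9650\u516C\u53F8'
print(escaped)  # 扶沟牧原农牧有限公司

# The two spellings are the same string, so a plain literal works just as well
data = {'xmmc': '', 'projectid': '', 'spsx': '', 'spdwmc': '扶沟牧原农牧有限公司'}
assert data['spdwmc'] == escaped
```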

On top of that, use pandas to read in the company names and parse the returned HTML pages to extract the results.
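The result list comes back as an HTML table, so parsing reduces to collecting the text of every td cell. A minimal sketch with BeautifulSoup, on a hypothetical fragment shaped like the site's table:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment in the shape of the site's result table
html = '<table><tr><td> 项目A </td><td>2017-02-01</td></tr></table>'

soup = BeautifulSoup(html, 'html.parser')
# get_text() pulls the cell text; strip() removes the surrounding whitespace
cells = [td.get_text().strip() for td in soup.find_all('td')]
print(cells)  # ['项目A', '2017-02-01']
```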

3. Code

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd
import time

cookies = {
    'clientlanguage': 'zh_CN',
    'JSESSIONID': 'A0543CD017C770E424424EF38DB58BCF',
}

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'Origin': 'http://tzls.hazw.gov.cn',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Referer': 'http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

url = 'http://tzls.hazw.gov.cn/jggs_list.jspx'
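A note on the hard-coded cookies: a copied JSESSIONID eventually expires. An alternative (a sketch, not part of the original code) is a requests.Session, which stores whatever cookies the server sets on a first GET and reuses them on later POSTs automatically:

```python
import requests

# A Session keeps server-set cookies (e.g. JSESSIONID) in its own cookie jar,
# so nothing needs to be pasted from DevTools
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# An initial GET to the search page would populate the cookie jar:
# session.get('http://tzls.hazw.gov.cn/jggs.jspx?apply_date_begin=2017-01-31&pageNo=1')
# Subsequent session.post(...) calls then send the same cookies automatically.
```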

# Parse one result page: collect the text of every <td> cell
def parse_rows(text):
    soup = BeautifulSoup(text, "html.parser")
    rows = [td.get_text().strip() for td in soup.find_all('td')]
    return rows[:-1]  # drop the trailing pager cell

def page_count(text):
    # Read N out of the page's "共 N 条" (N records in total) counter, 10 records per page
    start = text.find('共') + 2
    end = text.find('条', start)
    total = int(text[start:end])
    return (total + 9) // 10  # ceiling division; int(total/10)+1 over-counts exact multiples

df = pd.read_excel('all_enterprise_names.xlsx')

ans = []
for _, row in df.iterrows():
    name = row.iloc[0]
    print("Name=", name)
    # First POST: find out whether there are any results and how many pages
    params = (('xmmc', ''), ('projectid', ''), ('spsx', ''), ('spdwmc', name))
    data = {'pageNo': '1'}
    response = requests.post(url, headers=headers, params=params,
                             cookies=cookies, data=data, verify=False, timeout=7)
    text = response.text

    # Skip companies that return "暂无数据" (no data)
    if re.search('暂无数据', text):
        continue
    res = parse_rows(text)

    # Fetch the remaining pages, if any
    page_nums = page_count(text)
    for i in range(2, page_nums + 1):
        time.sleep(3)  # throttle to avoid an IP ban or dropped responses
        print(i, "/", page_nums)
        data = {'pageNo': str(i)}
        response = requests.post(url, headers=headers, params=params,
                                 cookies=cookies, data=data, verify=False, timeout=7)
        res += parse_rows(response.text)
    ans.append(res)

result = pd.DataFrame(ans)
result.to_excel('result.xlsx')
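One subtlety in the pagination math: with 10 rows per page, the number of pages is the ceiling of the record count divided by 10, which integer arithmetic expresses as (n + 9) // 10. A standalone check of the idiom (helper name is illustrative):

```python
def pages_needed(total, per_page=10):
    # Ceiling division without floats: ceil(total / per_page)
    return (total + per_page - 1) // per_page

print(pages_needed(1))    # 1
print(pages_needed(10))   # 1  (exactly one full page, not two)
print(pages_needed(25))   # 3
```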

4. Results

(Screenshot: the exported result spreadsheet)
posted @ 2021-11-10 16:36  自倚修行