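Selenium Lesson 6 Homework - Search for the most recently posted jobs and scrape up to the first 10 results, then open each job's page and get the job description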

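Task:
Open 51job at http://www.51job.com
Enter the search keyword "python" and select "上海" (Shanghai) as the region.
Search for the most recently posted jobs and scrape up to the first 10 results. Then open each job's page and get the job description.
Notes:
Some companies' job detail pages do not use the standard layout; these can be skipped.
The site has anti-crawler detection, so add some sleep calls between operations. Acting too quickly makes it easy to be flagged as a crawler.
Save the final result to a file in CSV format. (Search online for how to write CSV files in Python.)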
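
For the CSV requirement, a minimal sketch using Python's standard csv module might look like the following. The helper names save_rows and load_rows are made up for illustration and are not part of the homework; only csv.writer, csv.reader, writerow and writerows come from the standard library. The utf-8-sig encoding is an assumption that the file will be opened in Excel.

```python
import csv


def save_rows(path, header, rows):
    """Write a header line followed by the data rows to a CSV file."""
    # newline='' lets the csv module control line endings, which avoids
    # blank lines between rows on Windows. utf-8-sig writes a BOM so that
    # Excel detects the encoding of Chinese text.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read the file back as a list of rows (each row is a list of strings)."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return list(csv.reader(f))
```

Calling save_rows('job.csv', header, rows) with a list of row lists writes one line per row; fields that contain commas or quotes are quoted automatically by the writer.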

Code

import csv
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Create the WebDriver instance.
# Selenium 4 takes the chromedriver path through a Service object.
wd = webdriver.Chrome(service=Service(r'C:\Program Files\Google\Chrome\Application\chromedriver.exe'))
wd.implicitly_wait(30)

# Open the site; the implicit wait blocks until the search bar is present
wd.get('http://www.51job.com')
wd.find_element(By.CSS_SELECTOR, '.top_wrap')

# Enter the search keyword "python"
wd.find_element(By.CSS_SELECTOR, '#kwdselectid').send_keys('python')

# Open the city picker and deselect any cities that are already selected
wd.find_element(By.CSS_SELECTOR, '#work_position_click').click()
time.sleep(2)
selected_city = wd.find_elements(By.CSS_SELECTOR, '#work_position_click_multiple_selected .ttag em')
for c in selected_city:
    c.click()

# Select Shanghai
wd.find_element(By.CSS_SELECTOR, '#work_position_click_center_right_list_category_000000_020000').click()

# Click Save
wd.find_element(By.CSS_SELECTOR, '#work_position_click_bottom_save').click()
time.sleep(2)

# Click Search
wd.find_element(By.CSS_SELECTOR, '.top_wrap button').click()
time.sleep(2)

info = []
# Keep at most the first 10 results
jobs = wd.find_elements(By.CSS_SELECTOR, '.j_joblist .e')[:10]
for job in jobs:
    # Job title, company name, posting date and salary, in page order
    contents = job.find_elements(By.CSS_SELECTOR, '.jname, .cname, .time, .sal')
    content_list = [content.text for content in contents]

    # URL of the job detail page
    url = job.find_element(By.CSS_SELECTOR, 'a').get_attribute('href')

    # Fetch and parse the detail page. Pages without a standard "job_msg"
    # block, or pages that fail to load, get an empty description.
    description = ''
    try:
        r = requests.get(url, timeout=10)
        r.encoding = 'gbk'
        soup = BeautifulSoup(r.text, 'html.parser')
        jd = soup.find_all('div', class_='job_msg')
        # Remove all whitespace, including \n and \xa0
        description = ''.join(''.join(j.get_text().split()) for j in jd)
    except requests.RequestException:
        pass
    content_list.append(description)
    info.append(content_list)

    # Pause between requests so the script is less likely to be flagged as a crawler
    time.sleep(1)
print(info)
wd.quit()

# Header order as chosen by the author; it assumes the four scraped fields
# appear on the page in this order.
tag = ['Job title', 'Posted date', 'Salary', 'Company', 'Job description']
# utf-8-sig writes a BOM so that Excel detects the encoding of Chinese text
with open('job.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)  # create a CSV writer
    writer.writerow(tag)
    writer.writerows(info)
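
The script cleans up each job description by splitting it on whitespace and joining the pieces with an empty string. That removes \n and \xa0, but it also removes the spaces between English words. The sketch below compares that approach with joining on a single space; the function names and the sample text are invented for illustration.

```python
def strip_all_whitespace(text):
    """Same approach as the script: drop every whitespace character."""
    return ''.join(text.split())


def collapse_whitespace(text):
    """Alternative: squeeze each run of whitespace into a single space."""
    return ' '.join(text.split())


# Hypothetical description text, not real scraped output
sample = '岗位职责：\n\xa0\xa01. Python web development\n\n  2. 数据分析 '
print(strip_all_whitespace(sample))
print(collapse_whitespace(sample))
```

Which variant is preferable depends on the data: for descriptions that are mostly Chinese the first is compact, while the second keeps English phrases readable.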

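The script uses fixed pauses between steps. One possible variation is to randomize each pause so the timing looks less mechanical. The helper below is a sketch under that assumption; polite_sleep and its default bounds are hypothetical, and the post does not verify whether randomized delays actually help against this site's detection.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    if min_seconds < 0 or max_seconds < min_seconds:
        raise ValueError('expected 0 <= min_seconds <= max_seconds')
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A call such as polite_sleep() could replace a time.sleep(2) line between page actions or between detail-page requests.
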
posted @ 2021-10-14 15:56 minka
