Python Web Scraping: Crawling Tencent Careers with Selenium

Target: the Tencent Careers site

 

Analysis

The content we want to scrape sits in the child divs of the large div with class="recruit-wrap recruit-margin".

XPath locator:

//*[@class="recruit-wrap recruit-margin"]/div
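The XPath can be sanity-checked offline with the standard library, no browser needed. The HTML snippet below is invented for illustration and only mimics the class structure; note that xml.etree requires a leading `.//` instead of `//` for a document-wide search.

```python
import xml.etree.ElementTree as ET

# Invented sample markup (assumption: it only mimics the page's class structure)
html = """
<html><body>
  <div class="recruit-wrap recruit-margin">
    <div>Job A</div>
    <div>Job B</div>
  </div>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree's XPath subset uses .// where full XPath uses //
jobs = root.findall(".//*[@class='recruit-wrap recruit-margin']/div")
print(len(jobs))  # 2
```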

Getting started

Import the libraries


from selenium import webdriver
from selenium.webdriver.common.by import By

Open the page in the browser

driver = webdriver.Chrome()
driver.get('https://careers.tencent.com/search.html?index=1&keyword=python')

Locate the content

div_list = driver.find_elements(By.XPATH, '//*[@class="recruit-wrap recruit-margin"]/div')

A note on why we do it this way: why use find_elements() when find_element_by_xpath() looks so much more convenient? The answer is that find_element_by_xpath() no longer works in current versions of Selenium. Running it produces a deprecation warning telling you to use find_elements() instead, and if you don't switch, the code won't run at all; you'll also notice the editor draws a strikethrough line through the old method name as you type it.

That is why the extra import line

from selenium.webdriver.common.by import By

is needed at the top: it exists to replace the find_element_by_xxxx() family. To get the corresponding locator, pass By.XXXX (where XXXX is the locator type you want) as the first argument. For example, if I want the behaviour of

find_element_by_name()

I can now write

find_element(By.NAME, ...)
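The mapping from the removed helper methods to By constants can be written out explicitly. Selenium's By constants are plain strings, so the table below mirrors their values directly and runs without a browser (the string values shown are Selenium's actual By values):

```python
# Old Selenium helper method -> new find_element(By.X, arg) call.
# By constants are plain strings; mirrored here so no browser is needed.
OLD_TO_NEW = {
    'find_element_by_xpath':        ('By.XPATH', 'xpath'),
    'find_element_by_name':         ('By.NAME', 'name'),
    'find_element_by_link_text':    ('By.LINK_TEXT', 'link text'),
    'find_element_by_css_selector': ('By.CSS_SELECTOR', 'css selector'),
}

for old, (const, value) in OLD_TO_NEW.items():
    print(f'{old}(arg)  ->  find_element({const}, arg)   # {const} == {value!r}')
```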

Extract the content

one_job_info_list = div.text.split('\n')
item = {}
item['title'] = one_job_info_list[0].strip()
item['tips'] = one_job_info_list[1].strip()
item['text'] = one_job_info_list[2].strip()
print(item)
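The split/strip logic can be exercised on a fabricated stand-in for div.text. The sample text below is invented; the real listing may contain more lines, which this indexing would ignore.

```python
# Fabricated stand-in for div.text (assumption about the page's line layout)
sample_text = 'Backend Developer (Python)\nTech | Shenzhen | 2022-02-10\nDesign and build backend services'

one_job_info_list = sample_text.split('\n')
item = {}
item['title'] = one_job_info_list[0].strip()
item['tips'] = one_job_info_list[1].strip()
item['text'] = one_job_info_list[2].strip()
print(item)
```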

 

Full code


from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://careers.tencent.com/search.html?index=1&keyword=python')
driver.implicitly_wait(10)  # the job list is rendered by JavaScript, so give it time to appear
div_list = driver.find_elements(By.XPATH, '//*[@class="recruit-wrap recruit-margin"]/div')
for div in div_list:
    one_job_info_list = div.text.split('\n')
    item = {}
    item['title'] = one_job_info_list[0].strip()
    item['tips'] = one_job_info_list[1].strip()
    item['text'] = one_job_info_list[2].strip()
    print(item)
driver.quit()

Next, crawl every page

Analysis

 

As you can see, the only part of the URL that changes is the value after index=, so with requests you could write:

import requests

for page in range(1, 10):
    url = 'https://careers.tencent.com/search.html?index={}&keyword=python'.format(page)
    response = requests.get(url=url)
    response.encoding = response.apparent_encoding
    html = response.text
    print(html)

In the Selenium version, you only need to build the same URLs and load each one:

for page in range(1, 10):
    url = 'https://careers.tencent.com/search.html?index={}&keyword=python'.format(page)
    driver.get(url)

and then repeat the extraction step above for each page.
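The URLs the loop generates can be checked without launching a browser:

```python
# Build the nine page URLs exactly as the loop above does
urls = [
    'https://careers.tencent.com/search.html?index={}&keyword=python'.format(page)
    for page in range(1, 10)
]

print(urls[0])    # first page: index=1
print(len(urls))  # 9 -- range(1, 10) covers pages 1 through 9
```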

posted @ 2022-02-12 21:09  冷巷