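Scraping Tencent job openings with Scrapy
1. First, decide which URL to crawl and which fields to extract, and look at how the URLs are built. After some analysis:
https://hr.tencent.com/position.php?&start= # the value after start= begins at 0 and increases by 10 with every page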
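The paging rule can be sketched in plain Python. This is only an illustration of the URL pattern described above; `page_urls` and its parameters are hypothetical names, not part of Scrapy or of the project.

```python
BASE_URL = "https://hr.tencent.com/position.php?&start="


def page_urls(last_offset, step=10):
    """Return the listing URLs for offsets 0, step, 2*step, ... up to last_offset."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + 1, step)]


# The first three pages of the listing
urls = page_urls(20)
for u in urls:
    print(u)
```

Each page shows 10 positions, so a listing of N positions needs offsets 0, 10, ... up to the largest multiple of 10 below N.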
Then we inspect the page elements to work out the XPath of each field we want to extract.
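To make the idea of per-field paths concrete, here is a minimal sketch that uses only the standard library. The table markup is a hypothetical, simplified stand-in for the real page (rows with class `even` or `odd`, five cells each), and `parse_rows` is an illustrative helper. `xml.etree.ElementTree` supports only a subset of XPath, so the class filter is done in Python here; Scrapy's own selectors accept full XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for two rows of the listing table.
SAMPLE = """
<table>
  <tr class="even">
    <td><a href="position_detail.php?id=36331">25667-腾讯云渠道项目经理(深圳)</a></td>
    <td>产品/项目类</td><td>6</td><td>深圳</td><td>2018-01-26</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=36335">TEG07-PHP开发工程师(深圳)</a></td>
    <td>技术类</td><td>3</td><td>深圳</td><td>2018-01-26</td>
  </tr>
</table>
"""

# Fields found in cells 2 to 5, in this order
FIELDS = ["positionType", "peopleNum", "workLocation", "publishTime"]


def parse_rows(markup):
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        if tr.get("class") not in ("even", "odd"):
            continue  # skip header or footer rows
        link = tr.find("td[1]/a")  # the first cell holds the link to the detail page
        row = {"positionname": link.text, "positionlink": link.get("href")}
        for index, field in enumerate(FIELDS, start=2):
            row[field] = tr.find("td[%d]" % index).text
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
print(rows[0]["positionname"], rows[0]["workLocation"])
```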
Enough talk, let's go straight to the code.
First, create the project from the command line:
scrapy startproject tencent
This gives us:
tencent
├── scrapy.cfg
└── tencent
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
As you can see, there is no spider code yet. We can generate a template with this command:
scrapy genspider TencentSpider hr.tencent.com
# Afterwards you will see
.
├── scrapy.cfg
└── tencent
    ├── __init__.py
    ├── __init__.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── settings.pyc
    └── spiders
        ├── __init__.py
        ├── __init__.pyc
        └── TencentSpider.py
Open TencentSpider.py and you will see the generated template:
# -*- coding: utf-8 -*-
import scrapy


class TencentspiderSpider(scrapy.Spider):
    name = 'TencentSpider'  # the name can be changed
    allowed_domains = ['hr.tencent.com']
    # This URL is usually edited. Several URLs may be listed here;
    # each one is requested once when the spider starts.
    start_urls = ['http://hr.tencent.com/']

    def parse(self, response):
        pass
Next we edit items.py to declare the fields we want to collect.
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Position name
    positionname = scrapy.Field()
    # Link to the detail page
    positionlink = scrapy.Field()
    # Position category
    positionType = scrapy.Field()
    # Number of openings
    peopleNum = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()
    # Only a few fields, to keep things simple
With the fields defined, we can edit the spider file.
import scrapy
from tencent.items import TencentItem


class TencentpositionSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["tencent.com"]
    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # Create the item object
            item = TencentItem()
            # extract_first() returns the default instead of raising IndexError
            # when a cell is empty, so one blank cell does not abort the page.
            # Position name
            item['positionname'] = each.xpath("./td[1]/a/text()").extract_first(default="")
            # Link to the detail page
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract_first(default="")
            # Position category
            item['positionType'] = each.xpath("./td[2]/text()").extract_first(default="")
            # Number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract_first(default="")
            # Work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract_first(default="")
            # Publication date
            item['publishTime'] = each.xpath("./td[5]/text()").extract_first(default="")
            yield item  # yield hands the item over to pipelines.py for processing

        # The site lists at most 1680 positions, so stop paging at that offset
        if self.offset < 1680:
            self.offset += 10
            # After a page has been processed, request the next one:
            # self.offset grows by 10, the new URL is built from it,
            # and self.parse is called again to handle the response
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
Once the spider is written, we can write the item pipeline in pipelines.py, which stores the scraped data.
import json


class TencentPipeline(object):
    def __init__(self):
        # Binary mode, because process_item writes UTF-8 encoded bytes
        self.filename = open("tencent.json", "wb")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
We can also use the codecs module for the file handling. Compared with Python 2's built-in open, codecs.open simply takes an additional encoding parameter.
import json
import codecs


class TencentPipeline(object):
    def __init__(self):
        self.filename = codecs.open("tencent.json", "w", encoding='utf-8')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text)  # no need to call encode() here
        return item

    def close_spider(self, spider):
        self.filename.close()
For now we simply save the data to a local file.
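Because the pipeline writes each item as a JSON object followed by a comma and a newline, tencent.json as a whole is not one valid JSON document. A minimal sketch of reading such output back line by line, assuming exactly that format; `load_items` is a hypothetical helper, not part of Scrapy:

```python
import json


def load_items(text):
    """Parse the pipeline's output: one JSON object per line, each followed by a comma."""
    items = []
    for line in text.splitlines():
        line = line.strip().rstrip(",")  # drop the trailing comma the pipeline appends
        if line:
            items.append(json.loads(line))
    return items


sample = (
    '{"positionname": "TEG07-PHP开发工程师(深圳)", "peopleNum": "3"},\n'
    '{"positionname": "MIG16-车联网架构师", "peopleNum": "1"},\n'
)
items = load_items(sample)
print(len(items), items[0]["peopleNum"])
```

Writing one object per line without the trailing comma (the JSON Lines convention) would make this stripping step unnecessary.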
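After these steps we can try running the code. It may fail at this point: we have not configured settings.py yet, so the server may block our requests.
A few simple changes to the settings are enough: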
# Obey robots.txt rules
ROBOTSTXT_OBEY = True  # whether to obey robots.txt rules; we usually do not
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32  # maximum number of concurrent requests (not threads); the default of 16 is usually left unchanged
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1  # download delay in seconds; usually set to a small value, depending on the site
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers (we will use these often):
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'tencent.middlewares.MyCustomSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'tencent.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Enable the item pipeline here
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
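The generated comments in the settings mention the AutoThrottle options. As a hedged sketch, a fragment like the following could be added to settings.py so that Scrapy adapts the delay to the server's response times instead of relying only on a fixed DOWNLOAD_DELAY. The numbers are illustrative assumptions, not values from the original project.

```python
# settings.py fragment (a sketch; the numbers are illustrative, not tuned values)
AUTOTHROTTLE_ENABLED = True    # let Scrapy adjust the delay based on observed latency
AUTOTHROTTLE_START_DELAY = 1   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10    # upper bound for the delay when latency is high
```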
With all of this in place we can start the spider:
scrapy crawl tencent  # tencent is the value of name in our spider file
{"positionname": "20718-Linux嵌入式开发工程师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36141&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "深圳"},
{"positionname": "25667-腾讯云渠道高级市场运营经理(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36307&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "市场类", "workLocation": "深圳"},
{"positionname": "25667-腾讯云渠道项目经理(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36331&keywords=&tid=0&lid=0", "peopleNum": "6", "positionType": "产品/项目类", "workLocation": "深圳"},
{"positionname": "WXG08-121 微信模式识别自然语言处理算法工程师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36340&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "WXG08-121 微信机器学习算法工程师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36339&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "WXG08-113 微信搜索后台开发高级工程师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36338&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "21530-腾讯音乐薪酬经理(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36199&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "职能类", "workLocation": "深圳"},
{"positionname": "22087-体育电竞直播导演(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33941&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "内容编辑类", "workLocation": "北京"},
{"positionname": "MIG16-数据分析产品经理", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36032&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-腾讯地图产品经理(探索项目)(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33843&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-手图驾车业务产品策划经理", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35379&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-腾讯地图产品经理(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33570&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-腾讯地图搜索产品经理", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33844&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-腾讯地图产品运营——内容运营(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33548&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-腾讯地图产品运营——活动运营(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=33547&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "北京"},
{"positionname": "MIG16-车联网搜索推荐算法工程师", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35549&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "MIG16-车联网搜索推荐算法工程师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35847&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "MIG16-车联网架构师", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35550&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "MIG16-车联网架构师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35845&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "MIG16-地图后台高级开发工程师(北京)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=35316&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "北京"},
{"positionname": "SNG11-云安全产品后台研发工程师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36337&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "深圳"},
{"positionname": "23294-电竞产品高级视觉设计师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36336&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "设计类", "workLocation": "深圳"},
{"positionname": "TEG07-PHP开发工程师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36335&keywords=&tid=0&lid=0", "peopleNum": "3", "positionType": "技术类", "workLocation": "深圳"},
{"positionname": "21062-创新类游戏高级策划(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36334&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "深圳"},
{"positionname": "21062-游戏高级交互设计师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36333&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "设计类", "workLocation": "深圳"},
{"positionname": "21062-移动游戏高级后台开发工程师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36332&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "深圳"},
{"positionname": "SNG01-QQ后台高级数据分析工程师(深圳)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36330&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "深圳"},
{"positionname": "20503-优图项目经理(上海)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36329&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "产品/项目类", "workLocation": "上海"},
{"positionname": "20503-人脸分析研发工程师(上海)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36328&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "上海"},
{"positionname": "20503-人脸识别研发工程师(上海)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36327&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "上海"},
{"positionname": "20503-优图解决方案架构师(上海)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36325&keywords=&tid=0&lid=0", "peopleNum": "1", "positionType": "技术类", "workLocation": "上海"},
{"positionname": "20503-视频监控人体算法研发工程师(上海)", "publishTime": "2018-01-26", "positionlink": "position_detail.php?id=36326&keywords=&tid=0&lid=0", "peopleNum": "2", "positionType": "技术类", "workLocation": "上海"},
This is only part of the data. If you get far fewer records than the site shows, try adjusting the download delay.
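The positionlink values in the output are relative to the listing page. A minimal sketch of turning them into absolute URLs with the Python 3 standard library (on Python 2 the same function lives in the `urlparse` module); `absolute_link` and `LIST_URL` are hypothetical names used only for illustration:

```python
from urllib.parse import urljoin

# The listing page the relative links were extracted from
LIST_URL = "https://hr.tencent.com/position.php?&start=0"


def absolute_link(positionlink, page_url=LIST_URL):
    """Resolve a relative detail-page link against the URL of the listing page."""
    return urljoin(page_url, positionlink)


full = absolute_link("position_detail.php?id=36141&keywords=&tid=0&lid=0")
print(full)
```

Inside a spider callback, `response.urljoin(link)` performs the same resolution against the URL of the current response.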
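This example only scrapes static data, and we built the page URLs ourselves, which is far from enough for real needs.
In the next example we will extract links from a page, follow them, and then extract and follow further links, using the CrawlSpider class together with LinkExtractor and Rule.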
