Crawling approach
- Clarify the crawling requirements: we want to scrape each movie's title, director, release date, genre, country of production, number of ratings, and synopsis.
- First, analyze the page. The data is not loaded via Ajax, so we can send requests directly to the page URL. From the source of the first-level (list) page we can extract the title, director, release date, genre, country of production, and number of ratings, but not the synopsis. We therefore also grab the detail-page URL from the list page and send a second-level request to the detail page (the spider below carries the partially filled item along via the request's meta).
The above analyzes the source of a single Douban list page. Since the data is not Ajax-loaded, pagination requires observing the URL pattern across pages: comparing the URLs of different pages, the only part that changes is start=\d+, so we can build a generic URL template for paging:
https://movie.douban.com/top250?start=%d&filter=
When we want to crawl a given number of pages, we substitute the corresponding offset into the %d placeholder (each page advances start by 25).
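As a minimal sketch of this idea (assuming the usual 25 entries per Top 250 list page; the helper name is illustrative), the page URLs could be generated like this:

```python
# Hypothetical helper: build list-page URLs from the template above.
# Assumes each page shows 25 movies, so start takes the values 0, 25, 50, ...
url_template = 'https://movie.douban.com/top250?start=%d&filter='

def page_urls(num_pages):
    return [url_template % (i * 25) for i in range(num_pages)]

# e.g. page_urls(3) yields the URLs of the first three pages
```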
- Alternatively, inspecting the page shows that clicking "后页" (next page) jumps to the next page, and the anchor behind it stores the relative link to that page in the source; we can complete this link against the site URL to implement pagination (see the note after the spider code below).
- The above is a brief analysis of the page source before crawling. With the requirements understood and the crawl order sorted out, we can start writing the spider.
```python
# -*- coding: utf-8 -*-
import scrapy
from doubanpro.items import DoubanproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def detail_parse(self, response):
        # Receive the item passed along from parse() via meta
        item = response.meta["item"]
        desc = response.xpath('//*[@id="link-report"]/span[1]/span/text()[1]').extract_first()
        item["desc"] = desc
        yield item

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div/div[1]/ol/li')
        for li in li_list:
            detail_url = li.xpath('./div/div[1]/a/@href').extract_first()
            title = li.xpath('./div/div[2]/div[1]/a/span[1]/text()').extract_first()
            info = li.xpath('./div/div[2]/div[2]/p[1]/text()').extract_first()
            ranking = li.xpath('./div/div[1]/em/text()').extract_first()
            score = li.xpath('./div/div[2]/div[2]/div/span[2]/text()').extract_first()
            score_num = li.xpath('./div/div[2]/div[2]/div/span[4]/text()').extract_first()
            if title:
                item = DoubanproItem()
                item["title"] = title
                item["ranking"] = ranking
                item["score"] = score
                item["score_num"] = score_num
                item["info"] = info
                # Request the detail page for the synopsis, carrying the item in meta
                yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={"item": item})

        # Follow the "后页" (next page) link to crawl the remaining pages
        next_url = response.xpath('//*[@id="content"]/div/div[1]/div[2]/span[3]/a/@href').extract()
        if next_url:
            next_url = 'https://movie.douban.com/top250' + next_url[0]
            yield scrapy.Request(next_url, callback=self.parse, dont_filter=True)
```
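One design note: the spider above rebuilds the next-page URL by string concatenation. A sketch of a more robust alternative would be to let response.urljoin resolve the relative href against the current page, assuming the "后页" anchor sits inside a span with class next (as the Top 250 list markup appears to at the time of writing):

```python
# Inside parse(): follow the "后页" link via response.urljoin, which resolves
# the relative href against the URL of the current response.
next_href = response.xpath('//span[@class="next"]/a/@href').extract_first()
if next_href:
    yield scrapy.Request(response.urljoin(next_href), callback=self.parse, dont_filter=True)
```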
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    ranking = scrapy.Field()
    score = scrapy.Field()
    score_num = scrapy.Field()
    info = scrapy.Field()
    desc = scrapy.Field()
```
```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class DoubanproPipeline(object):
    f = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.f = open("./movie.txt", "w", encoding="utf-8")
        print("Spider started")

    def process_item(self, item, spider):
        # Write one record: title, then ranking/score/ratings count, then info line
        self.f.write(
            item["title"] + "\n"
            + item["ranking"] + "\t" + item["score"] + "\t" + item["score_num"] + "\n"
            + item["info"]
        )
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.f.close()
        print("Spider finished")
```
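For the pipeline to run, it has to be enabled in the project's settings.py. A minimal sketch of the relevant settings (the user-agent string and the priority value 300 are illustrative assumptions; Douban generally refuses requests carrying the default Scrapy user agent):

```python
# settings.py (excerpt)
ROBOTSTXT_OBEY = False  # assumption: these pages are disallowed for generic bots
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # illustrative UA string

ITEM_PIPELINES = {
    'doubanpro.pipelines.DoubanproPipeline': 300,  # 300 is the conventional priority
}
```

The crawl is then started from the project directory with scrapy crawl movie.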