Scraping Lianjia housing listings with Python (Part 3)
I previously wrote a scraper for Lianjia's Beijing second-hand housing data. I had planned to wrap everything up today, but something came up and I had to go out, which delayed things. The goal now is to scrape all of Beijing's second-hand housing listings and store them in MongoDB.
The first step is to start from 'https://bj.lianjia.com' and collect every listing URL grouped by district and by subway line, store those URLs, and then analyze each of them in the next step. Every listing carries a data-housecode attribute that identifies which property it is. To avoid storing duplicate listings, the data-housecode is saved as the identifier of each property; when a listing is about to be written to MongoDB, the data-housecode is checked first to see whether that listing has already been stored. If it has, it is not inserted again; otherwise the listing is inserted into MongoDB.
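To make that deduplication step concrete, below is a minimal sketch of what such a MongoDB pipeline could look like in pipelines.py; it is not the project's actual pipeline code. The database name 'House', the collection name 'ershoufang', and the pipeline class name are assumptions; the 'flag' field holding the data-housecode matches the spider shown later.

    # pipelines.py - minimal sketch, assuming pymongo is installed and MongoDB runs locally.
    # Database/collection names and the pipeline class name are illustrative.
    import pymongo


    class HousePipeline(object):
        def open_spider(self, spider):
            self.client = pymongo.MongoClient('mongodb://localhost:27017')
            self.collection = self.client['House']['ershoufang']

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Skip the insert if a listing with the same data-housecode is already stored
            if self.collection.find_one({'flag': item['flag']}) is None:
                self.collection.insert_one(dict(item))
            return item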
I am still using the Scrapy framework; the only change is in spider.py, which now follows all listing URLs grouped by district and by subway line. Of course, the districts and subway lines could be subdivided even further...
The overall workflow of the crawler is roughly as follows:
Within Scrapy, the flow works like this: spider.py hands the URLs it wants to fetch to the scheduler, and the downloader module issues the Requests and downloads the data. If a download fails, the result is reported to the Scrapy engine, which re-issues the request later. The downloader passes the downloaded data back to the spider, which processes it, extracts the district and subway-line URLs that need to be followed, and hands each of those URLs back to the Scrapy engine so they are crawled by the same mechanism. For every listing the spider records its identifier and its details, and the processed results are returned as items, which are then stored in MongoDB.
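Since the spider fills a HouseItem for every listing, the corresponding items.py would look roughly like the sketch below. The field names match the keys used in spider.py; the original items.py itself is not shown in this post.

    # items.py - field definitions matching the keys used in spider.py
    import scrapy


    class HouseItem(scrapy.Item):
        flag = scrapy.Field()        # data-housecode, unique listing id
        address = scrapy.Field()
        house_type = scrapy.Field()
        area = scrapy.Field()
        toward = scrapy.Field()
        decorate = scrapy.Field()
        elevate = scrapy.Field()
        interest = scrapy.Field()
        watch = scrapy.Field()
        publish = scrapy.Field()
        build = scrapy.Field()
        local = scrapy.Field()
        advantage = scrapy.Field()
        price = scrapy.Field()
        unit = scrapy.Field()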
The code in spider.py is as follows:
# -*- coding: utf-8 -*-
import re

import scrapy
from bs4 import BeautifulSoup
from lxml import etree
from scrapy.http import Request

from House.items import HouseItem


class spider(scrapy.Spider):
    name = 'House'
    url = 'https://bj.lianjia.com'
    base_url = 'https://bj.lianjia.com/ershoufang'

    def start_requests(self):
        print(self.base_url)
        yield Request(self.base_url, self.get_area_url, dont_filter=True)

    def get_area_url(self, response):
        # Collect the district and subway-line listing URLs from the filter bar
        selector = etree.HTML(response.text)
        results = selector.xpath('//dd/div/div/a/@href')
        for each in results:
            if 'lianjia' not in each:
                url = self.url + each
            else:
                url = each
            print(url)
            yield Request(url, self.get_total_page, dont_filter=True)

    def get_total_page(self, response):
        # The total number of result pages is stored in the page-data attribute
        soup = BeautifulSoup(response.text, 'lxml')
        total_page = soup.find_all('div', class_='page-box house-lst-page-box')
        res = r'<div .*? page-data=\'{\"totalPage\":(.*?),"curPage":.*?}\' page-url=".*?'
        total_num = re.findall(res, str(total_page), re.S | re.M)
        for i in range(1, int(total_num[0]) + 1):  # +1 so the last page is included
            url = response.url + 'pg' + str(i)
            print(url)
            yield Request(url, self.parse, dont_filter=True)

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        message1 = soup.find_all('div', class_='houseInfo')
        message2 = soup.find_all('div', class_='followInfo')
        message3 = soup.find_all('div', class_='positionInfo')
        message4 = soup.find_all('div', class_='title')
        message5 = soup.find_all('div', class_='totalPrice')
        message6 = soup.find_all('div', class_='unitPrice')
        message7 = soup.find_all(name='a', attrs={'class': 'img'})

        # data-housecode uniquely identifies each listing and is used for deduplication
        flags = [each.get('data-housecode') for each in message7]

        for flag, each, each1, each2, each3, each4, each5 in zip(
                flags, message1, message2, message3, message4, message5, message6):
            item = HouseItem()
            item['flag'] = flag

            # houseInfo: address | layout | area | orientation | decoration | elevator
            parts = each.get_text().split('|')
            item['address'] = parts[0].strip()
            item['house_type'] = parts[1].strip()
            item['area'] = parts[2].strip()
            item['toward'] = parts[3].strip()
            item['decorate'] = parts[4].strip()
            item['elevate'] = parts[5].strip() if len(parts) > 5 else 'None'

            # followInfo: followers / viewings / publish date
            parts = each1.get_text().split('/')
            item['interest'] = parts[0].strip()
            item['watch'] = parts[1].strip()
            item['publish'] = parts[2].strip()

            # positionInfo: building info - location
            parts = each2.get_text().split('-')
            item['build'] = parts[0].strip()
            item['local'] = parts[1].strip()

            item['advantage'] = each3.get_text().strip()
            item['price'] = each4.get_text().strip()
            item['unit'] = each5.get_text().strip()

            print(item['flag'], item['address'], item['house_type'], item['area'],
                  item['toward'], item['decorate'], item['elevate'], item['interest'],
                  item['watch'], item['publish'], item['build'], item['local'],
                  item['advantage'], item['price'], item['unit'])
            yield item
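Finally, for the items to actually reach MongoDB, the pipeline has to be enabled in settings.py. The snippet below is only a sketch that assumes the pipeline class from the earlier sketch; the priority value 300 is arbitrary.

    # settings.py - enable the MongoDB pipeline (class name from the sketch above; assumed)
    ITEM_PIPELINES = {
        'House.pipelines.HousePipeline': 300,
    }

The crawler can then be started with scrapy crawl House, House being the spider's name attribute.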