Scraping Shenzhen Average Housing Prices with Python 3: Using Real Data to Support Property-Purchase Decisions (Part 1)

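After the earlier small exercises, today I'm going to take on a somewhat more complex project. I recently saw a news story claiming that Shenzhen housing prices had "fallen off a cliff", with the average price dropping by 46 yuan a month... So I decided to try scraping real for-sale listing data from the web and analyzing it, to help anyone who wants to buy a home in Shenzhen make a better-informed decision.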

First, let's search Baidu for the top 3 home-sale websites (I remain skeptical of Baidu's paid ranking $_$).

After some filtering, I decided to scrape listing prices from three websites: Lianjia (链家), Qfang (Q房网), and Fang.com (房天下).

First, the code for scraping Lianjia:

from bs4 import BeautifulSoup
import requests
import csv
from requests.exceptions import RequestException


def get_one_page(page):
    url = "https://sz.lianjia.com/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'sz.lianjia.com',
        'Referer': 'https://www.lianjia.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    newUrl = url + 'ershoufang/' + 'pg' + str(page)

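    try:
        response = requests.get(newUrl, headers=headers)
    except RequestException as e:
        # response is never assigned when the request fails, so report the exception and stop
        print("error: " + str(e))
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Fields to scrape: community name, floor area, average unit price, and the detail-page link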

    for item in soup.select('li .clear'):
        detailed_info = item.select('div .houseInfo')[0].text
        community_name = detailed_info.split('|')[0].strip()
        area = detailed_info.split('|')[2].strip()
        average_price = item.select('div .unitPrice span')[0].text
        detailed_url = item.select('a')[0].get('href')
        print("%s\t%s\t%s\t%s"%(community_name, area, average_price, detailed_url))


def main():
    get_one_page(2)


if __name__ == '__main__':
    main()

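Running the test prints one tab-separated line per listing: community name, area, average price, and detail URL.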
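One fragile spot in the Lianjia code is `detailed_info.split('|')[2]`, which raises IndexError whenever a listing has fewer than three fields. The helper below is a minimal sketch of a more defensive split. The name `parse_house_info` and the sample strings are made up for illustration, and the field order is only what the scraper above already assumes.

```python
def parse_house_info(detailed_info):
    """Split a '|'-separated houseInfo string into (community_name, area).

    Returns None when the string has fewer than three fields, instead of
    raising IndexError the way a bare split('|')[2] would.
    """
    fields = [field.strip() for field in detailed_info.split('|')]
    if len(fields) < 3:
        return None
    return fields[0], fields[2]


# Hypothetical sample strings in the layout the scraper above assumes.
print(parse_house_info('Example Garden | 3室2厅 | 89.5平米 | 南 | 精装'))
print(parse_house_info('Example Garden'))
```

Inside the loop, a listing for which the helper returns None could simply be skipped with `continue`.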

Next, the basic code for scraping Qfang:

from bs4 import BeautifulSoup
import requests
import csv
import re
from requests.exceptions import RequestException


def get_one_page(page):
    url = "https://shenzhen.qfang.com/sale/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
        'Host': 'shenzhen.qfang.com',
        'Referer': 'https://www.qfang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    newUrl = url + 'f' + str(page)

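    try:
        response = requests.get(newUrl, headers=headers)
    except RequestException as e:
        # response is never assigned when the request fails, so report the exception and stop
        print("error: " + str(e))
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, floor area, average unit price, and the detail-page link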
    price_list = []
    for item in soup.select('div .show-price'):
        average_price = item.select('p')[0].text
        price_list.append(average_price)

    index = 0
    for item in soup.select('div .show-detail'):
        detailed_url = 'https://shenzhen.qfang.com/sale' + item.select('a')[0].get('href')
        # While scraping the area, some values turned out to be missing: the area is sometimes in the
        # 4th span tag and sometimes in the 5th, so take both and filter with a regex
        regax = re.compile('(.*?)平米')
        result = item.select('span')[3].text + item.select('span')[4].text
        area = re.findall(regax, result)[0]
        community_name = (item.find_all(target = '_blank')[0].text).split(' ')[0]
        average_price = price_list[index]
        index += 1

        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))



def main():
    get_one_page(1)


if __name__ == '__main__':
    main()

Running the test for Qfang prints the same four tab-separated fields for each listing.
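The pattern `(.*?)平米` applied to two concatenated spans can pick up text from the neighbouring span when the area sits in the second one. As a rough sketch of a stricter alternative, the helper below checks each span text separately and captures only the number in front of 平米. The name `find_area` and the sample span texts are assumptions for illustration, not taken from the real page.

```python
import re

# Capture an integer or decimal number directly followed by "平米".
AREA_PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*平米')


def find_area(span_texts):
    """Return the first floor area (as a string) found in any span text, else None."""
    for text in span_texts:
        match = AREA_PATTERN.search(text)
        if match:
            return match.group(1)
    return None


# Hypothetical span contents: the area may sit in different positions.
print(find_area(['3室2厅', '89.5平米']))
print(find_area(['精装', '3室2厅', '76平米']))
```

In the scraper, the input could be built with something like `[span.text for span in item.select('span')]`, which also avoids hard-coding indexes 3 and 4.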

Finally, the basic code for scraping Fang.com:

from bs4 import BeautifulSoup
import requests
import csv

from requests.exceptions import RequestException
import re


def get_one_page(page):
    url = "http://esf.sz.fang.com/house/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'esf.sz.fang.com',
        'Referer': 'https://www.fang.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
    newUrl = url + 'i3' + str(page)

    try:
        response = requests.get(newUrl, headers=headers)
    except RequestException as e:
        # response is never assigned when the request fails, so report the exception and stop
        print("error: " + str(e))
        return

    # Scraping with a regex:
    # regax = re.compile('<p class="title"><a href=(.*?) target=.*?><p class="mt10">'
    #                    '<a target=.*? title=(.*?)>.*?<div class="area alignR">'
    #                    '<p>(.*?)</p>.*?<p class="danjia alignR mt5">(.*?)<span>.*?', re.S)
    # result = re.findall(regax,response.text)

    # Scraping with BeautifulSoup:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Fields to scrape: community name, floor area, average unit price, and the detail-page link
    for item in soup.find_all('dd',{'class':'info rel floatr'}):
        if item.select('h3'):
            continue
        community_name = item.select('p')[2].select('a')[0].get('title')
        area = re.findall(r'\d+',item.find_all('div',{'class':'area alignR'})[0].select('p')[0].text)[0] # regex strips the extra garbled characters
        average_price = re.findall(r'\d+',item.find_all('p',{'class':'danjia alignR mt5'})[0].text)[0] # regex strips the extra garbled characters
        detailed_url = 'http://esf.sz.fang.com' + item.find_all('p',{'class':'title'})[0].select('a')[0].get('href')
        print("%s\t%s\t%s\t%s" % (community_name, area, average_price, detailed_url))

def main():
    get_one_page(2)


if __name__ == '__main__':
    main()

Running the test for Fang.com again prints the four tab-separated fields for each listing.

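While scraping, it becomes clear that the basic structure is the same across the different real-estate sites: first imitate a browser by sending browser-like headers, then inspect the DOM by hand, and finally extract the key fields with re or bs4. The main difficulties are cleaning the scraped data (which calls for regular expressions) and, on Qfang, the fact that the DOM layout changes from listing to listing, so bs4 and re have to be combined flexibly to get high-quality data.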
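The shared structure described here (download a page, parse it, collect the rows) can be written once and reused for all three sites. The following is only a sketch under that assumption: `scrape_pages` and the stand-in `fake_fetch` and `fake_parse` functions are hypothetical, and a real version would plug in site-specific functions that return the HTML and a list of row dicts instead of printing.

```python
def scrape_pages(fetch, parse, pages):
    """Run fetch -> parse for each page number and collect all parsed rows.

    fetch(page) returns the page HTML, or None on failure.
    parse(html) returns a list of row dicts.
    """
    rows = []
    for page in pages:
        html = fetch(page)
        if html is None:
            continue  # skip pages that failed to download
        rows.extend(parse(html))
    return rows


# Stand-in functions so the sketch runs without network access.
def fake_fetch(page):
    return None if page == 2 else 'page-%d' % page


def fake_parse(html):
    return [{'source': html}]


print(scrape_pages(fake_fetch, fake_parse, range(1, 4)))
```

Separating download from parsing this way also makes each site's parser easy to test against saved HTML.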

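Once the data was scraped, I went back and forth between MySQL, Excel, and MongoDB for storage. After some searching on Baidu and Google, I decided to store it in MongoDB, a NoSQL ("Not only SQL") database, for the following reason: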

If your DB is 3NF and you don't do any joins (you're just selecting a bunch of tables and putting all the objects together, AKA what most people do in a web app), MongoDB would probably kick ass for you.

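In other words, when there are no complex relationships between tables, MongoDB is a very good fit for storing the data. According to the official MongoDB manual, MongoDB has many advantages, including high performance, high scalability, fast queries, and no need for a strict schema to store and organize data.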

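Of course MongoDB has plenty of drawbacks too, but the data scraped in this post is structured, static data that can go straight into the database without designing any table relationships.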

First, let's install MongoDB. :-) taking a deserved break!

After that, I installed a MongoDB visualization plugin and imported the pymongo package.

For now, all we need from Mongo is to insert the scraped data, with the following code:

Tip: result is stored as a dictionary.

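from pymongo import MongoClient


def store_in_db(result):
    conn = MongoClient('localhost', 27017)
    db = conn.test  # database name is "test"
    my_set = db.text_set  # collection name is "text_set"
    # Collection.insert() was removed in PyMongo 4; insert_one() stores a single dict
    my_set.insert_one(result)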
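Since each page yields many listings, it may be cheaper to send a whole page of rows in one call than to insert them one by one. The sketch below assumes each row is a dict and takes the collection as a parameter, so the client can be created once and reused. `store_rows` is a hypothetical helper, while `insert_many` is the PyMongo collection method for bulk inserts.

```python
def store_rows(collection, rows):
    """Insert scraped rows into a collection in one call; return how many were sent."""
    # Copy each row: PyMongo adds an "_id" key to the dicts it inserts.
    documents = [dict(row) for row in rows]
    if not documents:
        return 0  # insert_many rejects an empty list, so skip the call
    collection.insert_many(documents)
    return len(documents)


# With PyMongo installed and a local server running, usage might look like:
#   from pymongo import MongoClient
#   client = MongoClient('localhost', 27017)
#   store_rows(client.test.text_set, [{'community_name': 'Example Garden'}])
```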

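Next, because we need to pull a large amount of data from these sites, we need various anti-anti-scraping measures. First, imitate a browser. Second, since most sites' anti-scraping rules cap how many requests a single IP can make in a given period, we need an IP pool (mentioned in the previous exercise). I collected some currently working free proxy IPs and saved them in a txt file: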

Then, for each page request, pick a random IP from the pool and set it as the requests proxy:

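proxy_ip = random.choice(ip_list)
ip = {'https': proxy_ip}  # proxies mapping for requests; the key is the URL scheme, without a colon
try:
    response = requests.get(newUrl, headers=headers, proxies=ip)
except RequestException as e:
    # response is never assigned when the request fails, so report the exception instead
    print("error: " + str(e))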
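For completeness, here is a small sketch of how the IP pool might be loaded and turned into a proxies mapping. The file layout (one proxy URL per line, with optional `#` comments) is an assumption, and `load_proxies`, `pick_proxy`, and the sample addresses are made up for illustration.

```python
import random


def load_proxies(text):
    """Parse one proxy per line, ignoring blank lines and '#' comments."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            proxies.append(line)
    return proxies


def pick_proxy(ip_list, rng=random):
    """Build a requests-style proxies mapping from one randomly chosen proxy."""
    proxy_ip = rng.choice(ip_list)
    return {'http': proxy_ip, 'https': proxy_ip}


# Hypothetical pool contents; in practice this text would be read from the txt file.
sample = "# free proxies\nhttp://192.0.2.10:8080\n\nhttp://192.0.2.11:3128\n"
ip_list = load_proxies(sample)
print(pick_proxy(ip_list))
```

The mapping covers both schemes so that the proxy is used whether the target URL is http or https.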

Then set a reasonable sleep time to avoid triggering anti-scraping measures:

def getdata(page, ip_list):
    for i in range(page):
        result = get_one_page(i, ip_list)
        store_in_db(result)
        print()
        # after every 10 pages, pause for 100 seconds
        if i >= 10 and i % 10 == 0:
            time.sleep(100)
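A fixed pause every 10 pages produces a very regular request pattern. One possible variation, which is not what the code above does, is to add a short random pause between ordinary pages as well. The sketch below only computes the delay; `pause_seconds` and its default values are assumptions chosen for illustration.

```python
import random


def pause_seconds(page_index, batch_size=10, long_pause=100, rng=random):
    """Return how long to sleep after fetching the page at page_index."""
    if page_index > 0 and page_index % batch_size == 0:
        return long_pause  # long rest after each full batch of pages
    return rng.uniform(1.0, 3.0)  # short random pause between ordinary pages


# In the crawl loop this could be used as: time.sleep(pause_seconds(i))
print(pause_seconds(10))
```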

OK, first run the complete Lianjia code, fetch 100 pages of data, and store it in MongoDB:

Lianjia's static pages only show 100 pages of results, roughly 3,000 records in total. Next, collect the page data from Qfang and Fang.com in the same way.

Summary:

The run took about half an hour and collected more than 10,000 records, without being blocked by any anti-scraping measures.

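Known defect: when the network is unstable, some requests may get no response. A set should be used to record the requests that got no response, so that they can be re-sent through requests.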

Full code: https://github.com/wy9884255/src

The next post will use this data for analysis, as practice in processing data with Python.

posted @ 2018-05-11 10:57 InsistPy
