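2023 Data Collection and Fusion Practice: Assignment 2

Assignment 2

Gitee link: https://gitee.com/crazypsz/spider/commit/566b31106cde3cd68bd87c63e851b299542e6565

Task 1

Experiment

Requirement: crawl the 7-day weather forecast for a given set of cities from the China Weather website (http://www.weather.com.cn) and save it in a database.

Approach:

First decide how the data will be laid out in the database

No.	Region	Date	Weather	Temperature

Create a dedicated class, WeatherDB, for storing the data

The class has the following main methods:

openDB(): creates the database and initializes the connection (con) and the cursor (cursor)
closeDB(): commits and closes the database
insert(): takes five values and inserts them as one row
show(): prints the contents of the database to the console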

import sqlite3


class WeatherDB:
    def openDB(self):
        # Raw string so the backslashes in the Windows path are kept literally
        self.con=sqlite3.connect(r"G:\database\weathers.db")
        self.cursor=self.con.cursor()
        self.cursor.execute("create table if not exists weathers (id varchar(16),wCity varchar(16),wDate varchar(16),wWeather varchar(64),wTemp varchar(32),constraint pk_weather primary key (id))")
        # "if not exists" never raises for an existing table, so clear old rows
        # explicitly; otherwise a second run collides on the primary key
        self.cursor.execute("delete from weathers")

    def closeDB(self):
        self.con.commit()
        self.con.close()


    def insert(self, count,city, date, weather, temp):
        try:
            self.cursor.execute("insert into weathers (id,wCity,wDate,wWeather,wTemp) values (?,?,?,?,?)",
                                (count,city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        self.cursor.execute("select * from weathers")
        rows = self.cursor.fetchall()
        print("%-16s%-16s%-16s%-32s%-16s" % ("id","city", "date", "weather", "temp"))
        # print("{0:{4}^16}{1:{4}^16}{2:{4}^32}{3:{4}^32}".format("city","date",\
        # "weather","temp",chr(12288)))
        for row in rows:
            print("%-16s%-16s%-16s%-32s%-16s" % (row[0], row[1], row[2], row[3],row[4]))
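The primary key on id is worth a quick check, because it is what makes a repeated id fail. Below is a minimal sketch of the same table and insert statement, assuming an in-memory database (":memory:") instead of the file on drive G: so that it can be run anywhere. The row values are only sample data.

```python
import sqlite3

# In-memory database: an assumption for this sketch, nothing is written to disk
con = sqlite3.connect(":memory:")
cursor = con.cursor()
cursor.execute(
    "create table if not exists weathers ("
    "id varchar(16), wCity varchar(16), wDate varchar(16), "
    "wWeather varchar(64), wTemp varchar(32), "
    "constraint pk_weather primary key (id))"
)

insert_sql = "insert into weathers (id,wCity,wDate,wWeather,wTemp) values (?,?,?,?,?)"
cursor.execute(insert_sql, (0, "北京", "9日（今天）", "晴", "10℃"))

# Inserting the same id a second time violates the primary key
try:
    cursor.execute(insert_sql, (0, "北京", "10日（明天）", "晴", "22℃/10℃"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

cursor.execute("select * from weathers")
rows = cursor.fetchall()
con.close()

print(duplicate_rejected, rows)
```

This is why the table has to be emptied (or the counter continued) before the crawler is run a second time against the same file.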

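Write a class that crawls the weather website

Initialization: the request headers and the mapping from city names to city codes
forecastCity(): fetches the page for one city and parses the forecast data out of its source
process(): takes the list of cities to crawl and calls forecastCity() for each of them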

import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit


class WeatherForecast:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode = {"北京": "101010100", "上海": "101020100", "广州": "101280101", "深圳": "101280601"}
        self.count=0
    def forecastCity(self, city):
        if city not in self.cityCode.keys():
            print(city + " code cannot be found")
            return

        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    temp = li.select("p[class='tem']")[0].text.strip()
                    # print(city,date,weather,temp)
                    self.db.insert(self.count,city, date, weather, temp)
                    self.count=self.count+1
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

    def process(self, cities):
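        self.db = WeatherDB()  # create the weather database object, db
        self.db.openDB()  # open the database

        for city in cities:
            self.forecastCity(city)  # crawl and store the forecast for each city in turn
            # self.db.show()  # print the data in the database
        self.db.show()
        self.db.closeDB()  # close the database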

Main program

ws = WeatherForecast()  # create the forecast crawler object, ws
ws.process(["北京", "上海", "广州", "深圳"])  # crawl and store the forecasts for the given cities
print("completed")

Output

id	city	date	weather	temp
0	北京	9日（今天）	晴	10℃
1	北京	10日（明天）	晴	22℃/10℃
2	北京	11日（后天）	晴转多云	23℃/10℃
3	北京	12日（周四）	多云转小雨	21℃/12℃
4	北京	13日（周五）	晴	22℃/10℃

...

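Reflections

Parsing the page in this experiment was not very difficult. What was different this time is that the data has to be stored in a database, which was a good chance to get familiar with database operations. The code is really only a demo: if a city has no entry in the city-to-code mapping, its data cannot be crawled. If you are interested, you could build a mapping for every city in the country and develop a small app on top of this crawler.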
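To make the mapping limitation concrete, here is a minimal sketch that checks the requested cities against the mapping before any request is sent. The helper names forecast_url and split_known are made up for this illustration and are not part of the code above. The URL pattern and the four city codes are the ones used in forecastCity().

```python
CITY_CODES = {"北京": "101010100", "上海": "101020100",
              "广州": "101280101", "深圳": "101280601"}


def forecast_url(city, city_codes=CITY_CODES):
    """Return the forecast page URL for a city, or None if its code is unknown."""
    code = city_codes.get(city)
    if code is None:
        return None
    return "http://www.weather.com.cn/weather/" + code + ".shtml"


def split_known(cities, city_codes=CITY_CODES):
    """Separate the cities that can be crawled from those missing in the mapping."""
    known = [c for c in cities if c in city_codes]
    unknown = [c for c in cities if c not in city_codes]
    return known, unknown


known, unknown = split_known(["北京", "福州"])
print(known, unknown)
print(forecast_url("北京"))
```

Extending the crawler to more cities then only means adding entries to the dictionary.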

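Task 2

Experiment

Requirement: use the requests and BeautifulSoup libraries to crawl targeted stock information and store it in a database.

Candidate sites: Eastmoney: https://www.eastmoney.com/

Sina Finance stocks: http://finance.sina.com.cn/stock/

Approach:

Open the Eastmoney site, press F12, click the Network tab, refresh the page, and look through the responses returned by the server. The data we need turns out to be in a JS file.

Using the URL of that file, parse the JS file and convert its payload into Python data types. Each record is a dictionary, but its keys (f2, f14 and so on) are not self-explanatory, so you have to work out the name of each key from its value. The function that crawls a single page is: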

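import json
import requests


# Appends one page of results to the module-level dict `dic` defined in the main block
def get_one_pagesource(page_num=1):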
    base_url =r'http://36.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124017913335698798893_1696658430311&pn={}&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1696658430312'
    url = base_url.format(page_num)
    headers= {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36'
    }
    response = requests.get(url,headers=headers)
    text = response.content.decode('utf-8')
    # The response is JSONP, i.e. jQuery...({...}); drop the callback name and "("
    text = text.split('(',maxsplit=1)[1]
    # Drop the trailing ");" so that only the JSON object is left
    text = text[:-2]
    json_file = json.loads(text)
    data_list= json_file['data']['diff']
    for data in data_list:
        dic['名称'].append(data['f14'])
        dic['最新价格'].append(data['f2'])
        dic['涨跌幅'].append(str(data['f3'])+"%")
        dic['涨跌额'].append(data['f4'])
        dic['成交量'].append(data['f5']/1000.0)
        dic['成交额'].append(data['f6']/100000000.0)
        dic['振幅'].append(str(data['f7'])+"%")
        dic['最高'].append(data['f15'])
        dic['最低'].append(data['f16'])
        dic['今开'].append(data['f17'])
        dic['昨收'].append(data['f18'])
        dic['量比'].append(data['f10'])
        dic['代码'].append(data['f12'])
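The step in the function above that strips the callback wrapper can be tried offline. The following is a minimal sketch, assuming the response always has the shape callback({...}); and using a regular expression instead of split and slicing. The function name unwrap_jsonp and the sample string are made up for illustration; only the field names f2, f12 and f14 come from the code above.

```python
import json
import re


def unwrap_jsonp(text):
    """Strip a JSONP wrapper such as cb({...}); and parse the JSON inside."""
    match = re.match(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Hypothetical, shortened response body (values are not real market data)
sample = 'jQuery1124_1696658430311({"data":{"diff":[{"f12":"000001","f14":"Example","f2":10.5}]}});'
payload = unwrap_jsonp(sample)
first = payload["data"]["diff"][0]
print(first["f12"], first["f14"], first["f2"])
```

Compared with slicing off the last two characters, the pattern also tolerates a missing semicolon or trailing whitespace, and it fails loudly when the body is not JSONP at all.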

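Crawling multiple pages: in the URL above only the pn parameter changes between pages, so changing pn is enough to reach the other pages. Because there are so many attributes, the data is simply saved to a local xlsx file.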

import pandas

if __name__ =="__main__":
    num =3  # number of pages to crawl
    dic = {"代码":[], "名称":[],"最新价格":[], "涨跌额":[], "涨跌幅":[],"成交量":[], "成交额":[], "振幅":[], "最高":[], "最低":[],
               "今开":[], "昨收":[], "量比":[]}

    for page_num in range(1,num+1):
        get_one_pagesource(page_num)

    # Every list in dic has the same length, so the dict maps directly to columns
    df = pandas.DataFrame(dic)
    # Writing .xlsx requires an Excel engine such as openpyxl to be installed
    df.to_excel('./沪深京A股.xlsx')
    print('Crawl finished')
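Since pn is the only query parameter that changes, the page URLs can also be built from a parameter dictionary rather than by formatting one long string. This is a minimal sketch using only the standard library. The parameter set is deliberately trimmed for readability (cb, ut, fs and most fields are left out), so whether the server accepts exactly this reduced query is an assumption that has not been verified. The function name page_url is made up.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://36.push2.eastmoney.com/api/qt/clist/get"


def page_url(page_num, page_size=20):
    """Build the list URL for one page; only pn differs between pages."""
    params = {
        "pn": page_num,      # page number
        "pz": page_size,     # rows per page
        "fid": "f3",
        "fields": "f2,f3,f12,f14",  # shortened field list for this sketch
    }
    return BASE + "?" + urlencode(params)


urls = [page_url(n) for n in range(1, 4)]
for u in urls:
    print(u)

# Read pn back out of the third URL to confirm that it changed
third_pn = parse_qs(urlsplit(urls[2]).query)["pn"]
```

With requests, the same dictionary can be passed through the params argument of requests.get, which does the encoding itself.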

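Output: the generated local file

Reflections

The hard part of this task is probably the network capture: if you do not look carefully, you will not spot the data API. In real work most of the time would likely go into finding the API and getting around anti-scraping measures; by that point, parsing the page is the easiest part.

Task 3

Experiment

Requirement: crawl the information for every university in the 2021 main ranking of Chinese universities (https://www.shanghairanking.cn/rankings/bcur/2021) and store it in a database. Also record the F12 debugging and analysis process in the browser as a GIF and add it to the blog post.

Tip: analyze the requests the site sends and find the API that returns the data

Approach

Open the site, press F12, click the Network tab, refresh, and look for the data API. The data is in payload.js.

Look at the source of a single record in the file

Each field can be matched directly with regular expressions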

import requests
import re
import pandas
url=r'https://www.shanghairanking.cn/_nuxt/static/1695811954/rankings/bcur/2021/payload.js'
# url = r'https://www.shanghairanking.cn/_nuxt/static/1695811954/payload.js'
headers={
    'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/'
}
response = requests.get(url,headers=headers)
response.encoding='utf-8'
content = response.text
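u_name = re.findall(r'univNameCn:"(.*?)"',content)
# Note: score and province may come back as short variable names instead of
# values, because payload.js passes many values in as function arguments
u_score = re.findall(r'score:(.*?),', content)
u_province = re.findall(r'province:(.*?),', content)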
df = pandas.DataFrame()
df['学校']= u_name
df['分数']=u_score
df['省份']=u_province
print(df)
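To see what these three patterns capture, they can be run against a small fragment. The fragment below is made up for illustration and only imitates the field layout (univNameCn, province, score); the names and numbers are not real ranking data. It assumes, as seems to be the case in payload.js, that some values are written as one-letter variables rather than literals.

```python
import re

# Hypothetical fragment in the style of payload.js (not real data)
sample = ('univData:[{univNameCn:"University A",province:q,score:90.5,rank:1},'
          '{univNameCn:"University B",province:q,score:88.1,rank:2}]')

names = re.findall(r'univNameCn:"(.*?)"', sample)
scores = re.findall(r'score:(.*?),', sample)
provinces = re.findall(r'province:(.*?),', sample)

print(names)
print(scores)
print(provinces)
```

The names and scores come out as expected, but province yields the variable name q, so those values still need to be looked up before they mean anything.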

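Output

Another approach: after fetching the JS file, execute it with js2py to get Python data types directly, so there is no need to parse it with regular expressions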

import requests
import js2py
from io import StringIO
from contextlib import redirect_stdout
url = r'https://www.shanghairanking.cn/_nuxt/static/1695811954/rankings/bcur/2021/payload.js'
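r = requests.get(url, timeout=20)
r.raise_for_status()  # stop on a non-200 response instead of failing later on an undefined html
r.encoding = 'utf-8'
html = r.text
# Strip the __NUXT_JSONP__("/rankings/bcur/2021", ( ... )); wrapper and keep the inner function call
js_function =html[len('__NUXT_JSONP__("/rankings/bcur/2021", ('):-3]
# js2py.eval_js(js_function)  # this call raises an error
js_code= f"console.log({js_function})"
pyf = js2py.translate_js(js_code)  # translate the JS into Python source
f = StringIO()  # in-memory buffer that captures what console.log prints
with redirect_stdout(f):
    exec(pyf)  # run the translated code
s = f.getvalue()
print(s)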
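If js2py is not available, a lighter alternative (a different technique from the one above, not a replacement for it) is to pair the function's parameter names with the arguments it is called with and use that table to resolve the one-letter variables. This is a minimal sketch on a made-up sample. It assumes the arguments are plain JSON literals and contain no nested parentheses, which the real payload.js may not satisfy.

```python
import json
import re

# Hypothetical miniature of the payload.js structure (not real data)
sample = ('__NUXT_JSONP__("/demo", (function(a,b){return {univData:['
          '{univNameCn:"University A",province:a,score:b}]}}("Province X",90.5)));')

# Parameter names from "function(a,b)"
params = re.search(r'function\((.*?)\)', sample).group(1).split(',')

# Argument list from the trailing call: }("Province X",90.5)));
args_text = re.search(r'\}\((.*)\)\)\);\s*$', sample, re.DOTALL).group(1)
args = json.loads('[' + args_text + ']')  # assumes JSON-compatible literals

lookup = dict(zip(params, args))

# Resolve the raw regex captures through the lookup table;
# anything that is not a known variable name is kept as-is
province_raw = re.findall(r'province:(.*?),', sample)
provinces = [lookup.get(p, p) for p in province_raw]

print(lookup)
print(provinces)
```

Executing the file, as the js2py version does, stays the more robust option when the arguments are not simple literals.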

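Reflections

When I started, I was torn over which parsing method to use. At first I thought BeautifulSoup's find would be the quickest. After getting the data out with BeautifulSoup, I tried writing regular expressions for the same job and found the regex approach simpler. The real reason I did not think of regex first is that I am still not familiar enough with regular expressions. As for the second approach, I had not originally planned to convert the data inside the JS file; I tried it after a hint from the teacher.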

posted @ 2023-10-09 21:28 crazypsz
