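Data Acquisition and Fusion Technology: Lab 2

Gitee repository: https://gitee.com/a2625113421/data-acquisition-practice-ii

Assignment ①

1) Requirement: crawl the 7-day weather forecast for a given set of cities from China Weather (http://www.weather.com.cn) and save it to a database.

Output:

No.	City	Date	Weather	Temperature
1	北京	七日（今天）	晴间多云，北部山区有阵雨或雷阵雨转晴转多云	31℃/17℃

Solution (reproducing the textbook code):

1. Database class
First, design a weather database class with methods to open the database, close it, insert a row, and show its contents.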

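import sqlite3


class WeatherDB:
    def openDB(self):
        """Open weathers.db and make sure an empty weathers table exists."""
        self.con = sqlite3.connect("weathers.db")
        self.cursor = self.con.cursor()
        try:
            # (wCity, wDate) is the primary key, so each city has one row per date.
            self.cursor.execute("create table weathers (wCity varchar(16),wDate varchar(16),wWeather varchar(64),"
                                "wTemp varchar(32),constraint pk_weather primary key (wCity,wDate))")
        except sqlite3.OperationalError:
            # The table is left over from an earlier run: clear the old rows.
            self.cursor.execute("delete from weathers")

    def closeDB(self):
        """Commit pending inserts and close the connection."""
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        """Insert one forecast row; print the error instead of raising."""
        try:
            self.cursor.execute("insert into weathers (wCity,wDate,wWeather,wTemp) values (?,?,?,?)",
                                (city, date, weather, temp))
        except Exception as err:
            print(err)

    def show(self):
        """Print every stored row as an aligned table."""
        self.cursor.execute("select * from weathers")
        rows = self.cursor.fetchall()
        # chr(12288) is the full-width space, used as the fill character so CJK text lines up.
        tplt = "{0:^10}{1:{5}^10}{2:{5}^10}{3:{5}^20}{4:{5}^10}"
        print(tplt.format("序号", "地区", "日期", "天气信息", "温度", chr(12288)))
        for i, row in enumerate(rows, start=1):
            print(tplt.format(i, row[0], row[1], row[2], row[3], chr(12288)))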
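
The composite primary key on (wCity, wDate) is what keeps a city from getting two rows for the same date. The following is a minimal sketch of that behaviour, not part of the textbook code: it uses an in-memory database, a hypothetical module-level insert helper, and made-up sample rows.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "create table if not exists weathers ("
    "wCity varchar(16), wDate varchar(16), wWeather varchar(64), wTemp varchar(32), "
    "constraint pk_weather primary key (wCity, wDate))"
)


def insert(city, date, weather, temp):
    """Insert one forecast row; return False if (city, date) already exists."""
    try:
        cur.execute(
            "insert into weathers (wCity, wDate, wWeather, wTemp) values (?, ?, ?, ?)",
            (city, date, weather, temp),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when the primary key (wCity, wDate) is duplicated.
        return False


# Illustrative rows only; the second one reuses the same key.
first = insert("北京", "7日（今天）", "晴", "31℃/17℃")
second = insert("北京", "7日（今天）", "多云", "30℃/18℃")
row_count = cur.execute("select count(*) from weathers").fetchone()[0]
print(first, second, row_count)
```

Using "create table if not exists" is an alternative to the try/except around table creation; it keeps old rows, so it only fits when clearing the table on each run is not wanted.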

2. Weather forecast crawler class

import urllib.request

from bs4 import BeautifulSoup, UnicodeDammit


class WeatherForecast:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        # City name -> city code used in the forecast page URL.
        self.cityCode = {"北京": "101010100", "上海": "101020100", "广州": "101280101", "深圳": "101280601","福州": "101230101"}

    def forecastCity(self, city):
        """Download the 7-day forecast page for one city and store each day."""
        if city not in self.cityCode:
            print(city + " code cannot be found")
            return
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            # Let UnicodeDammit pick between utf-8 and gbk when decoding the page.
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            # Each <li> under ul.t.clearfix holds one day of the forecast.
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    temp = li.select("p[class='tem']")[0].text.strip()
                    self.db.insert(city, date, weather, temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)

    def process(self, cities):
        """Crawl every city in the list, print the table, then close the database."""
        self.db = WeatherDB()
        self.db.openDB()
        for city in cities:
            self.forecastCity(city)

        self.db.show()
        self.db.closeDB()
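
The crawler stores the temperature as a single text field such as 31℃/17℃. If numeric values are needed later, the text can be split after the fact. This is a small sketch under that assumption; parse_temps is a hypothetical helper, not part of the textbook code.

```python
import re


def parse_temps(temp):
    """Return every integer temperature found in the text, in order of appearance."""
    return tuple(int(value) for value in re.findall(r"-?\d+", temp))


both = parse_temps("31℃/17℃")
single = parse_temps("-3℃")
print(both, single)
```

Returning a tuple of whatever numbers are present avoids guessing which value is missing when the page shows only one temperature.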

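3. Results

2) Reflections

In this assignment I crawled weather data for several cities and used SQLite for the first time. I learned that SQLite is a small embedded database that is well suited to beginners.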

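Assignment ②

1) Crawl stock information

– Requirement: use the requests and BeautifulSoup libraries to crawl stock information from a target site and save it to a database.
– Candidate site: Eastmoney: http://quote.eastmoney.com/center/gridlist.html#hs_a_board
– Tip: open the F12 developer tools in Chrome to capture network traffic, find the URL that loads the stock list, and inspect the values the API returns; adjust the API request parameters to fit the required output.
– The fields f1, f2, and so on in the URL's request parameters return different values; drop the ones that are not needed.
– Reference: https://zhuanlan.zhihu.com/p/50099084
– Output:

No.	Code	Name	Latest price	Change %	Change	Volume	Turnover	Amplitude	High	Low	Open	Prev. close
1	688093	N世华	28.47	62.22%	10.92	26.13万	7.6亿	22.34	32.0	28.08	30.2	17.55
2	......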

Implementation steps:
1. Send a request to the API and get the response text:

import requests


def get_html_request(page):
    try:
        # pn is the page number and pz the page size, so the page index goes into pn.
        url = "http://73.push2.eastmoney.com/api/qt/clist/get?cb=jQuery112407189399274148378_1635171689880&pn="\
              +str(page)+"&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1635171689881"
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
    except requests.RequestException as err:
        # Report the failure and return empty text instead of an error message
        # that a caller could mistake for page content.
        print(err)
        return ""
    return r.text
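
Because the requirement notes that unneeded request parameters can be dropped, the long URL can also be assembled from a dictionary. The sketch below is illustrative only: build_url is a hypothetical helper, it assumes pn is the page number and pz the page size, and it has not been verified that the endpoint accepts this reduced parameter set (in particular without the cb callback and the ut token).

```python
from urllib.parse import urlencode

BASE_URL = "http://73.push2.eastmoney.com/api/qt/clist/get"


def build_url(page, page_size=20):
    """Build the stock-list URL for one page, keeping only the fields used later."""
    params = {
        "pn": page,        # assumed: page number
        "pz": page_size,   # assumed: rows per page
        "po": 1,
        "np": 1,
        "fltt": 2,
        "invt": 2,
        "fid": "f3",
        "fs": "m:1+t:2,m:1+t:23",
        "fields": "f2,f3,f4,f5,f6,f7,f12,f14,f15,f16,f17,f18",
    }
    # Keep ':', '+' and ',' unescaped so the query looks like the captured URL.
    return BASE_URL + "?" + urlencode(params, safe=":+,")


url = build_url(3)
print(url)
```

Keeping the parameters in one dictionary makes it easier to see which fields are requested and to change the page without string concatenation.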

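2. Parse the data with json.loads

import json
import re


def get_stock(data):
    # Grab the JSON array of per-stock objects from the response text.
    match = re.search(r'\[.*]', data)
    if match is None:
        return []
    # Each stock is a flat {...} object, so a non-greedy match separates them.
    records = [json.loads(x) for x in re.findall(r'{.*?}', match.group(0))]
    # Output column -> field key in the API response.
    fields = {"股票代码": 'f12', "股票名称": 'f14', "最新报价": 'f2', "涨跌幅": 'f3', "涨跌额": 'f4', "成交量": 'f5', "成交额": 'f6',
              "振幅": 'f7', "最高": 'f15', "最低": 'f16', "今开": 'f17', "昨收": 'f18'}
    stocks = []
    for record in records:
        stocks.append([record[key] for key in fields.values()])
    return stocks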
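
Since the request URL carries a cb=jQuery... callback, the response is presumably JSONP, that is, JSON wrapped in a function call. An alternative to splitting objects with a regular expression is to strip the wrapper and decode the whole payload once. The sketch below assumes that wrapper shape; parse_jsonp is a hypothetical helper, and the sample text and its data/diff layout are invented for illustration, not captured from the API.

```python
import json
import re


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}); and decode the JSON inside."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text, re.S)
    # Fall back to the raw text when there is no wrapper (plain JSON).
    payload = match.group(1) if match else text
    return json.loads(payload)


# Illustrative response only; the real layout should be checked in the F12 network panel.
sample = 'jQuery1124_1635171689880({"data":{"total":1,"diff":[{"f12":"688093","f14":"N世华","f2":28.47}]}});'
result = parse_jsonp(sample)
rows = result["data"]["diff"]
print(rows[0]["f12"], rows[0]["f14"], rows[0]["f2"])
```

Decoding the full payload also copes with nested objects, which the non-greedy {.*?} pattern would cut short.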

3. The database class is similar to the one in Assignment ①
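
The post does not show the stock database class, so the following is only a sketch of what a similar store could look like. The table name, the column names, and the save_stocks helper are all assumptions; the single row reuses the sample line from the expected output, stored as text.

```python
import sqlite3

# Hypothetical column names, one per output column (the running number is not stored).
COLUMNS = ("sCode", "sName", "sPrice", "sChangePct", "sChange", "sVolume",
           "sTurnover", "sAmplitude", "sHigh", "sLow", "sOpen", "sPrevClose")


def save_stocks(con, stocks):
    """Create the stocks table if needed and insert one row per stock."""
    column_defs = ", ".join(f"{name} text" for name in COLUMNS)
    con.execute(f"create table if not exists stocks ({column_defs}, primary key (sCode))")
    placeholders = ", ".join("?" for _ in COLUMNS)
    # "insert or replace" overwrites a stock that was already stored under the same code.
    con.executemany(f"insert or replace into stocks values ({placeholders})", stocks)
    con.commit()


con = sqlite3.connect(":memory:")
save_stocks(con, [["688093", "N世华", "28.47", "62.22%", "10.92", "26.13万",
                   "7.6亿", "22.34", "32.0", "28.08", "30.2", "17.55"]])
saved = con.execute("select sCode, sName, sPrice from stocks").fetchall()
print(saved)
```

executemany takes the list of rows produced by the parsing step directly, so no per-row loop is needed.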
4. Results

2) Reflections

Writing this program made me more fluent with SQLite, and it was also my first attempt at crawling data in JSON format.

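Assignment ③

1) Crawl university ranking information

– Requirement: crawl the information for all universities in the 2021 Best Chinese Universities Ranking main list (https://www.shanghairanking.cn/rankings/bcur/2021), store it in a database, and include a GIF recording of the F12 debugging analysis in the blog post.
– Tip: analyze the requests the site sends and find the API that returns the data.
– Output:

Rank	University	Total score
1	清华大学	969.2

Solution
1. Analyze the site's network requests to find the API that returns the data (GIF):

2. Send a request to the page and get the response text: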

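import requests


def get_html_request(url):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
    except requests.RequestException as err:
        # Report the failure and return empty text instead of an error message
        # that a caller could mistake for page content.
        print(err)
        return ""
    return r.text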

3. Match the data with regular expressions

import re


def get_university(data):
    # re.findall already returns lists, so no copying loop is needed.
    names = re.findall(r'univNameCn:"(.*?)"', data)
    scores = re.findall(r'score:(.*?),', data)
    return names, scores
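
The two lists still have to be combined into the rank, university, total score rows of the expected output. A minimal sketch of that step follows, assuming the entries appear in ranking order and that names and scores line up one to one. get_rankings is a hypothetical helper, and the second university in the sample text is invented.

```python
import re


def get_rankings(data):
    """Pair each university name with its score and number the pairs from 1."""
    names = re.findall(r'univNameCn:"(.*?)"', data)
    scores = re.findall(r'score:(.*?),', data)
    # zip stops at the shorter list, so an unmatched name or score is dropped.
    return [(rank, name, score)
            for rank, (name, score) in enumerate(zip(names, scores), start=1)]


# Illustrative text in the shape the patterns expect; only the first entry comes from the post.
sample = 'univNameCn:"清华大学",score:969.2,univNameCn:"Example University",score:100.0,'
rankings = get_rankings(sample)
for rank, name, score in rankings:
    print(rank, name, score, sep="\t")
```

The scores stay as strings here; converting them with float is only safe once it is confirmed that every captured value is numeric.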

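4. Results

2) Reflections:

This lab further strengthened my ability to analyze a site and find the API that serves its data, and it improved my understanding and use of regular expressions and the sqlite3 library.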

posted @ 2021-10-25 23:32 oxoxoox
