Data Fusion and Acquisition Technology: Lab 2

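Assignment 1:
· Requirement: scrape the 7-day weather forecast for a given set of cities from China Weather (http://www.weather.com.cn) and save it in a database.
· Output: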

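No.	City	Date	Weather	Temperature
1	Beijing	7 (today)	Sunny to partly cloudy, showers or thunderstorms in the northern mountains, turning sunny then cloudy	31℃/17℃
2	Beijing	8 (tomorrow)	Cloudy turning sunny, scattered showers or thunderstorms in the north, turning sunny	34℃/20℃
3	Beijing	9 (day after tomorrow)	Sunny turning cloudy	36℃/22℃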
4 ...

· Implementation
1. Fetch the page content

import urllib.request

def getpage_text(url):
    """Fetch url and return the decoded page text (None if the request fails)."""
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 10.0 x104; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        data = resp.read()
        unicodeData = data.decode()  # decodes as UTF-8 by default
        return unicodeData
    except Exception as err:
        print("err", err)

2. Build a WeatherForecast class to extract the weather information from the page
`
import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

class WeatherForecast:
    def __init__(self):
        # Alternative User-Agent kept from the original post:
        # "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}
        # City name -> city code used in the forecast URL
        self.cityCode = {"北京": "101010100", "上海": "101020100", "广州": "101280101", "深圳": "101280601"}

    def forecastCity(self, city):
        if city not in self.cityCode:
            print(city + " code cannot be found")
            return
        url = "http://www.weather.com.cn/weather/" + self.cityCode[city] + ".shtml"
        try:
            req = urllib.request.Request(url, headers=self.headers)
            data = urllib.request.urlopen(req)
            data = data.read()
            # Let UnicodeDammit pick between UTF-8 and GBK
            dammit = UnicodeDammit(data, ["utf-8", "gbk"])
            data = dammit.unicode_markup
            soup = BeautifulSoup(data, "lxml")
            # One <li> per forecast day
            lis = soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date = li.select('h1')[0].text
                    weather = li.select('p[class="wea"]')[0].text
                    temp = li.select('p[class="tem"] span')[0].text + "/" + li.select('p[class="tem"] i')[0].text
                    print(city, date, weather, temp)
                    # self.db must be set before this method is called; the post does not show that code
                    self.db.insert(city, date, weather, temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)`
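The class above calls self.db.insert, and step 3 calls ws.process, but neither is shown in this post. A minimal sketch of what that storage layer might look like follows. The WeatherDB class, its method names, the weathers table schema, the weathers.db filename and the standalone process helper are all assumptions made for illustration, not the author's code.

```python
import sqlite3


class WeatherDB:
    """Hypothetical storage helper; the schema and names are assumptions."""

    def openDB(self, path="weathers.db"):
        self.con = sqlite3.connect(path)
        self.cursor = self.con.cursor()
        # (wCity, wDate) is the primary key so a re-run does not duplicate rows
        self.cursor.execute(
            "create table if not exists weathers ("
            "wCity varchar(16), wDate varchar(16), "
            "wWeather varchar(64), wTemp varchar(32), "
            "primary key (wCity, wDate))"
        )

    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self, city, date, weather, temp):
        # "insert or replace" overwrites the row for an existing (city, date)
        self.cursor.execute(
            "insert or replace into weathers (wCity, wDate, wWeather, wTemp) "
            "values (?, ?, ?, ?)",
            (city, date, weather, temp),
        )

    def show(self):
        # Return every stored row as a list of tuples
        self.cursor.execute("select wCity, wDate, wWeather, wTemp from weathers")
        return self.cursor.fetchall()


def process(forecaster, cities, path="weathers.db"):
    """Open the database, fetch every city, then commit and close.

    forecaster is any object with a forecastCity(city) method that writes
    through its db attribute, as WeatherForecast does.
    """
    forecaster.db = WeatherDB()
    forecaster.db.openDB(path)
    for city in cities:
        forecaster.forecastCity(city)
    forecaster.db.closeDB()
```

Written as a method on WeatherForecast, process would take self in place of forecaster. Calling show() before closeDB() is also one way to check what was stored.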

3. Output

ws = WeatherForecast()
ws.process(["北京", "上海", "广州", "深圳"])
print("completed")

4. View the results in the database

5. Gitee link
Gitee link for Assignment 1

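Reflections
I gained a first understanding of how to store scraped data in a database, which lays the groundwork for the rest of the course.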

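Assignment 2:
· Requirement: use requests and an extraction method of your choice to scrape stock information from a chosen site and store it in a database.
· Candidate sites:
Eastmoney: https://www.eastmoney.com/
Sina Stock: http://finance.sina.com.cn/stock/
· Output:

No.	Stock code	Stock name	Latest price	Change (%)	Change amount	Volume	Turnover	Amplitude
1	688093	N世华	28.47	62.22%	10.92	261.3K	760M	22.3%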
2...

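· Implementation
Tip: open the Chrome DevTools (F12) and capture the network traffic to find the URL that loads the stock list, then inspect the values the API returns and adjust the request parameters as needed. In the URL, each entry of the fields parameter (f1, f2, ...) selects a different value, so entries that are not needed can be removed.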
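The tip above can be sketched in code: build the request URL from a parameter dictionary so the field list is easy to trim, then strip the JSONP callback wrapper from the response before parsing it. The base URL and parameter values are copied from the request shown in this post; the build_url and strip_jsonp names, the "cb" placeholder callback and the trimmed field list are assumptions for illustration, and whether the server accepts this exact combination has not been verified here.

```python
import json
import re
from urllib.parse import urlencode

BASE_URL = "https://9.push2.eastmoney.com/api/qt/clist/get"

# Only the fields that the parsing step in this post reads
DEFAULT_FIELDS = ("f12", "f14", "f2", "f3", "f4", "f6", "f15", "f16")


def build_url(page, fields=DEFAULT_FIELDS, callback="cb"):
    """Return the list URL for one page with only the requested fields."""
    params = {
        "cb": callback,  # JSONP callback name (placeholder)
        "pn": page,      # page number
        "pz": 20,        # page size
        "po": 1,
        "np": 1,
        "ut": "bd1d9ddb04089700cf9c27f6f7426281",
        "fltt": 2,
        "invt": 2,
        "fid": "f3",
        "fs": "m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23",
        "fields": ",".join(fields),
    }
    # Keep ':', '+' and ',' literal so the query matches the captured request
    return BASE_URL + "?" + urlencode(params, safe=":+,")


def strip_jsonp(text):
    """Return the JSON object inside a callback(...) wrapper as a dict."""
    match = re.search(r"\((.*)\)", text, re.S)
    if match is None:
        raise ValueError("response is not wrapped in a JSONP callback")
    return json.loads(match.group(1))
```

Building the query from a dictionary keeps each parameter on its own line, which makes it easier to see what was removed when experimenting with the field list.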

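Fetching the page data

import urllib.request
from bs4 import UnicodeDammit

# The post does not show its headers dict; this reuses the User-Agent from Assignment 1.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}

def getHtml(page):
    # str() lets page be an int or a str without a TypeError on concatenation
    url = ("https://9.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124025500369212952667_1634094365855&pn="
           + str(page)
           + "&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1634094365856")
    r = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(r)
    html = html.read()
    dammit = UnicodeDammit(html, ["utf-8", "gbk"])
    html = dammit.unicode_markup
    return html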
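Parsing the stock information
`import re

def anahtml(html):
    """Extract one row per stock from the raw API response text."""
    page_datas = []
    exID = re.compile('"f12":"(.*?)",')  # stock code
    num = re.findall(exID, html)

    exName = re.compile('"f14":"(.*?)",')  # stock name
    name = re.findall(exName, html)

    exPrice = re.compile('"f2":(.*?),')  # latest price
    price = re.findall(exPrice, html)

    exRate = re.compile('"f3":(.*?),')  # change in percent
    changeRate = re.findall(exRate, html)

    exChange = re.compile('"f4":(.*?),')  # change amount
    change = re.findall(exChange, html)

    exAmount = re.compile('"f6":(.*?),')  # turnover
    CPRICE = re.findall(exAmount, html)

    exMax = re.compile('"f15":(.*?),')  # daily high
    MAX = re.findall(exMax, html)

    exMin = re.compile('"f16":(.*?),')  # daily low
    MIN = re.findall(exMin, html)

    # The listing in the post ends above. Assembling the columns into rows is an
    # assumed completion that follows the column order of the shares table.
    for row in zip(num, name, price, changeRate, change, CPRICE, MAX, MIN):
        page_datas.append(list(row))
    return page_datas`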
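As an alternative to one regular expression per field, the response can be parsed as JSON and the fields read by key. The sketch below assumes the JSONP wrapper has already been removed and that the records sit under data.diff as a list of objects keyed by field name; that payload shape, the rows_from_payload name and the sample values are assumptions for illustration, not taken from the post or from a real response.

```python
import json

# Field order follows the shares table: code, name, price, change %,
# change amount, turnover, high, low
FIELDS = ["f12", "f14", "f2", "f3", "f4", "f6", "f15", "f16"]


def rows_from_payload(payload):
    """Turn a parsed API payload into a list of string rows."""
    data = payload.get("data") or {}  # tolerate a null "data" entry
    rows = []
    for item in data.get("diff", []):
        # Missing keys become empty strings so every row has the same width
        rows.append([str(item.get(key, "")) for key in FIELDS])
    return rows


# Made-up record used only to show the call; not real market data
sample = json.loads(
    '{"data": {"diff": [{"f12": "000001", "f14": "Example", "f2": 10.5, '
    '"f3": 1.2, "f4": 0.12, "f6": 1000.0, "f15": 11.0, "f16": 10.0}]}}'
)
print(rows_from_payload(sample))
```

Reading by key avoids the case where a pattern such as `"f2":(.*?),` misses a field that happens to be the last one in its object and is therefore followed by `}` rather than a comma.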

Creating the database and saving the data

`import sqlite3

def data_save(datalist, dbpath):
    init_db(dbpath)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()
    # Placeholders let sqlite3 handle quoting, so names containing quotes are safe
    sql = '''
        insert into shares(
            num,name,price,changeRate,change,currentPrice,max,min
        )
        values(?,?,?,?,?,?,?,?)
    '''
    for data in datalist:
        cur.execute(sql, data)
    conn.commit()
    cur.close()
    conn.close()
    print("Saved successfully!")

def init_db(dbpath):
    # "if not exists" keeps a second run from failing on an existing table
    sql = '''
        create table if not exists shares(
            num text,name text,price text,
            changeRate text, change text,currentPrice text,
            max text,min text
        );
    '''
    cet = sqlite3.connect(dbpath)
    cu = cet.cursor()
    cu.execute(sql)
    cet.commit()
    cet.close()`

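Results

· Gitee link
Gitee link for Assignment 2

Reflections
This assignment gave me a better understanding of how to use sqlite3. Capturing network requests with the F12 tools also deepened my understanding of how data is scraped.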

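Assignment 3

1) Requirement

Scrape the information for all institutions in the 2021 Best Chinese Universities Ranking main list (https://www.shanghairanking.cn/rankings/bcur/2021) and store it in a database. Also record the F12 debugging and analysis process in the browser as a GIF and add it to the blog post.
Tip: analyze the requests the site sends and identify the API that returns the data.
· Output:

Rank	University	Total score
1	Tsinghua University	969.2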

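Implementation
Fetching the page data

import urllib.request
from bs4 import BeautifulSoup, UnicodeDammit

# The post does not show its headers dict; this reuses the User-Agent from Assignment 1.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}

def getHtml():
    url = "https://www.shanghairanking.cn/rankings/bcur/2021"
    req = urllib.request.Request(url=url, headers=headers)
    page_text = urllib.request.urlopen(req)
    page_text = page_text.read()
    dammit = UnicodeDammit(page_text, ["utf-8", "gbk"])
    page_text = dammit.unicode_markup
    return page_text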

Parsing the data

def GETPAGE_TEXT(html):
    soup = BeautifulSoup(html, 'html.parser')
    lis = soup.find_all("tr")
    DATA_LIST = []
    # Start at 1 to skip the table header row
    for j in range(1, len(lis)):
        tr = lis[j]
        td = tr.find_all("td")
        score = td[4].text.strip()
        name = td[1].find('a').text.strip()
        rank = td[0].find('div').text.strip()
        DATA_LIST.append([rank, name, score])
    return DATA_LIST

Saving the data

# Dlist is the list returned by GETPAGE_TEXT; cur is an open sqlite3 cursor
sql = '''
    insert into rank(
        id,name,score
    )
    values(?,?,?)
'''
for data in Dlist:
    # Placeholders let sqlite3 handle quoting of the values
    cur.execute(sql, data)
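The save step can also be sketched end to end as one function. It assumes a SQLite table named rank with the id, name and score columns used in the insert statement above; the save_ranking name, the university.db filename and the text column types are assumptions for illustration.

```python
import sqlite3


def save_ranking(rows, dbpath="university.db"):
    """Insert (rank, name, score) rows and return the table's row count."""
    conn = sqlite3.connect(dbpath)
    try:
        cur = conn.cursor()
        # The table name is quoted so it is always read as an identifier
        cur.execute(
            'create table if not exists "rank" (id text, name text, score text)'
        )
        # executemany binds each row to the three placeholders in turn
        cur.executemany(
            'insert into "rank" (id, name, score) values (?, ?, ?)', rows
        )
        conn.commit()
        cur.execute('select count(*) from "rank"')
        return cur.fetchone()[0]
    finally:
        conn.close()
```

For example, save_ranking(GETPAGE_TEXT(getHtml())) would store whatever rows the parser returned. Running it twice against the same file appends the rows again, since this sketch defines no primary key.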

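Results

Gitee link for Assignment 3

· Reflections
I learned how to debug web pages and how to scrape data using packet-capture tools and JSON. I ran into many errors and obstacles while experimenting, and eventually overcame them by continuing to explore.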

posted @ 2021-10-26 20:11 鸿影虹影
