102302110 高悦 Assignment 2

• Assignment ①: Requirement: scrape the 7-day weather forecast for a given set of cities from the China Weather site (http://www.weather.com.cn) and save it to a database.
1. Code and Process
Inspect the markup at the relevant location in the page:
[screenshot]

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
import urllib.request
import sqlite3

class WeatherDB:
    def openDB(self):
        self.con=sqlite3.connect("weathers.db")
        self.cursor=self.con.cursor()
        try:
            # First run: create the table; the composite key (wCity,wDate) blocks duplicate rows
            self.cursor.execute("create table weathers (wCity varchar(16),wDate varchar(16),wWeather varchar(64),wTemp varchar(32),constraint pk_weather primary key (wCity,wDate))")
        except Exception:
            # Table already exists: clear out the previous run's rows
            self.cursor.execute("delete from weathers")
    def closeDB(self):
        self.con.commit()
        self.con.close()

    def insert(self,city,date,weather,temp):
        try:
            self.cursor.execute("insert into weathers (wCity,wDate,wWeather,wTemp) values (?,?,?,?)" ,(city,date,weather,temp))
        except Exception as err:
            print(err)
    def show(self):
        self.cursor.execute("select * from weathers")
        rows=self.cursor.fetchall()
        print("%-16s%-16s%-32s%-16s" % ("city","date","weather","temp"))
        for row in rows:
            print("%-16s%-16s%-32s%-16s" % (row[0],row[1],row[2],row[3]))

class WeatherForecast:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        self.cityCode={"福州":"101230101","厦门":"101230201","广州":"101280101","深圳":"101280601"}
    def forecastCity(self,city):
        if city not in self.cityCode:
            print(city+" code cannot be found")
            return

        url="http://www.weather.com.cn/weather/"+self.cityCode[city]+".shtml"
        try:
            req=urllib.request.Request(url,headers=self.headers)
            data=urllib.request.urlopen(req)
            data=data.read()
            # Let UnicodeDammit detect the page encoding (utf-8 or gbk) before parsing
            dammit=UnicodeDammit(data,["utf-8","gbk"])
            data=dammit.unicode_markup
            soup=BeautifulSoup(data,"lxml")
            # Each li under the 7-day forecast list holds one day's data
            lis=soup.select("ul[class='t clearfix'] li")
            for li in lis:
                try:
                    date=li.select('h1')[0].text
                    weather=li.select('p[class="wea"]')[0].text
                    temp=li.select('p[class="tem"] span')[0].text+"/"+li.select('p[class="tem"] i')[0].text
                    print(city,date,weather,temp)
                    self.db.insert(city,date,weather,temp)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
    def process(self,cities):
        self.db=WeatherDB()
        self.db.openDB()

        for city in cities:
            self.forecastCity(city)

        #self.db.show()
        self.db.closeDB()

ws=WeatherForecast()
ws.process(["福州","厦门","广州","深圳"])
print("completed")




[screenshot]

2. Reflections: Grabbing tags like ul[class='t clearfix'] li with BeautifulSoup's select method and extracting the date, weather, and temperature directly is much more intuitive than digging the data out with regular expressions.
• Assignment ②: Use the requests and BeautifulSoup libraries to scrape stock information from a target site and store it in a database.
1. Code and Process
First, open the Eastmoney site (东方财富网) and press F12 to inspect the stock data:
[screenshot]

As in the previous assignment, first save the page's JSON response locally and analyze that copy, so that repeated requests do not get us blacklisted by the site; a minimal caching sketch follows below.
[screenshot]
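A minimal sketch of this fetch-once-and-cache idea. The URL here is a placeholder for whatever request the F12 Network panel actually shows, not necessarily the one in the screenshot:

import os
import requests

# Placeholder for the real list-API request captured in F12 (an assumption)
API_URL = "http://82.push2.eastmoney.com/api/qt/clist/get?pn=1&pz=20&np=1&fltt=2&fields=f2,f12,f14&fs=m:1+t:2"
CACHE = "stocks_page1.json"

# Fetch once and cache to disk; re-runs read the local file, so the site sees
# only a single request while we iterate on the parsing code
if not os.path.exists(CACHE):
    resp = requests.get(API_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    with open(CACHE, "w", encoding="utf-8") as f:
        f.write(resp.text)

with open(CACHE, encoding="utf-8") as f:
    raw_json = f.read()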

First, crawl the pages one by one and save the stock records.
Then clean the scraped data and convert the units.
Finally, write the data into the database. A sketch of the whole flow is given below.
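A minimal sketch of that three-step flow, assuming the Eastmoney list API discovered in F12; the endpoint, the field ids such as f12/f14/f2, and the table layout are my assumptions, not something confirmed by the screenshots:

import requests
import sqlite3

URL = "http://82.push2.eastmoney.com/api/qt/clist/get"   # assumed list-API endpoint
HEADERS = {"User-Agent": "Mozilla/5.0"}

def fetch_page(page):
    # Assumed field ids: f12=code, f14=name, f2=latest price, f5=volume, f6=turnover
    params = {"pn": page, "pz": 20, "po": 1, "np": 1, "fltt": 2,
              "fields": "f2,f5,f6,f12,f14",
              "fs": "m:0 t:6,m:0 t:80,m:1 t:2,m:1 t:23"}
    data = requests.get(URL, params=params, headers=HEADERS, timeout=10).json()
    return (data.get("data") or {}).get("diff", [])

def yuan_to_yi(value):
    # Unit conversion: the API reports turnover in yuan; store it in 亿 (1e8 yuan)
    return round(value / 1e8, 4)

con = sqlite3.connect("stocks.db")
con.execute("create table if not exists stocks "
            "(code text primary key, name text, price real, volume real, turnover_yi real)")
for page in range(1, 4):                      # step 1: crawl page by page
    for row in fetch_page(page):
        con.execute("insert or replace into stocks values (?,?,?,?,?)",
                    (row["f12"], row["f14"], row["f2"],
                     row["f5"], yuan_to_yi(row["f6"])))   # step 2: unit conversion
con.commit()                                  # step 3: persist everything
con.close()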

[screenshot]

2. Reflections: Calling the page's API directly makes it quick and intuitive to analyze the data interface and speeds up parsing. It taught me that working out a target site's data source before crawling (plain HTML, an API, or JS-loaded content) saves a lot of detours.
• Assignment ③: Scrape the 2021 main list of the Chinese university rankings.
1. Code and Process
Analyze the data API in the F12 developer tools; the payload is in JSON format:
[screenshot]

As you can see, the API payload uses formal parameters and actual arguments: the parameter list sits at the head of the payload and the argument list at its tail. Pairing each formal parameter with its actual argument, one by one, yields the mapping tables below (a sketch that automates the pairing follows after them).

province_map = {
    "q": "北京", "D": "上海", "k": "江苏", "y": "安徽", "w": "湖南",
    "s": "陕西", "F": "福建", "u": "广东", "n": "山东", "v": "湖北",
    "x": "浙江", "r": "辽宁", "M": "重庆", "K": "甘肃", "C": "吉林",
    "I": "广西", "N": "天津", "B": "黑龙江", "z": "江西", "H": "云南",
    "t": "四川", "o": "河南", "J": "贵州", "L": "新疆", "G": "海南",
    "aA": "青海", "aB": "宁夏", "aC": "西藏"
}


category_map = {
    "f": "综合", "e": "理工", "h": "师范", "m": "农业", "S": "林业",
    "i": "医药", "a": "财经", "j": "政法", "g": "民族", "d": "语言",
    "b": "艺术", "c": "体育"
}
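
The two tables above were matched by hand, but the head-to-tail pairing can also be automated. A rough sketch, assuming the payload has the usual Nuxt-style wrapper function(a, b, c, ...){...}(arg1, arg2, ...); both the regexes and this shape are my assumptions about the file, not something shown above:

import re

def build_param_map(js_content):
    # Head of the payload: the formal parameter list, e.g. "function(a, b, c, ...)"
    head = re.search(r"function\s*\(([^)]*)\)", js_content)
    # Tail of the payload: the actual argument list in the closing call "}(arg1, arg2, ...)"
    tail = re.search(r"\}\s*\(([^)]*)\)\s*\)*\s*;?\s*$", js_content)
    if not head or not tail:
        return {}
    params = [p.strip() for p in head.group(1).split(",")]
    args = [a.strip().strip('"') for a in tail.group(1).split(",")]
    # Pair each formal parameter with its argument: {"q": "北京", "f": "综合", ...}
    return dict(zip(params, args))

Splitting the argument list on bare commas is naive (it breaks as soon as an argument string itself contains a comma), which is partly why building the two tables by hand, as above, is the more dependable route.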

Parse the data and extract the key fields:

import re

def extract_universities(js_content):
    if not js_content:
        return []
    
    # Regex for the target fields (school name, category, province, score)
    pattern = re.compile(
        r'univNameCn:"([^"]+)"'
        r'.*?'  # non-greedy: skip unrelated content in between
        r'univCategory:(\w+)'
        r'.*?'
        r'province:(\w+)'
        r'.*?'
        r'score:([\d.]+)'
        , re.DOTALL  # allow matches to span multiple lines
    )
    
    matches = pattern.findall(js_content)
    universities = []
    # Use the running index as the rank (starting from 1)
    for i, (name_cn, category_var, prov_var, score) in enumerate(matches, start=1):
        universities.append({
            "rank": i,
            "school_name": name_cn,
            "province": province_map.get(prov_var, prov_var),  
            "category": category_map.get(category_var, category_var), 
            "total_score": float(score)  
        })
    return universities

Finally, import the extracted data into the database; a minimal sketch is shown below.
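This sketch feeds the dictionaries returned by extract_universities into SQLite; the local file name payload.js and the table layout are assumptions:

import sqlite3

# Read the saved payload and extract the records with the function defined above
with open("payload.js", "r", encoding="utf-8") as f:   # assumed local copy of the API payload
    universities = extract_universities(f.read())

con = sqlite3.connect("universities.db")
con.execute("create table if not exists ranking "
            "(rank integer primary key, school_name text, province text,"
            " category text, total_score real)")
con.executemany(
    "insert or replace into ranking values (:rank,:school_name,:province,:category,:total_score)",
    universities)
con.commit()
con.close()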
Database output:
[screenshot]

2. Reflections
While processing the payload, my first idea was to convert the JSONP-style content into plain JSON so it could be parsed directly with json.loads(), but after repeated attempts I found that the extremely long prefix at the head made the conversion unreliable, so I fell back to regular expressions.
