数据采集作业1

1.作业①:

要求：用requests和BeautifulSoup库方法定向爬取给定网址（http://www.shanghairanking.cn/rankings/bcur/2020）的数据，屏幕打印爬取的大学排名信息。
输出信息：

排名	学校名称	省市	学校类型	总分
1	清华大学	北京	综合	852.5
2	...	...	...	...

1.1作业代码和图片

import requests
from bs4 import BeautifulSoup
#目标URL
url = 'https://www.shanghairanking.cn/rankings/bcur/2020'
rank_list = []
try:
    #发送请求并获取网页内容
    response = requests.get(url)
    response.encoding = 'utf-8'
    html_content = response.text
    soup = BeautifulSoup(html_content, "html.parser")
    table = soup.find('table', attrs= {'class': 'rk-table'})
    if table:
        trs = table.find_all('tr')[1:] #跳过表头
        for tr in trs:
            tds = tr.find_all('td')
            if len(tds) >= 5:
                rank = int(tds[0].text.strip())
                name_span = tds[1].find('span', attrs = {'class': 'name-cn'})
                name = name_span.text.strip() #获取学校的中文名称
                province_city = tds[2].text.strip()
                campus_type = tds[3].text.strip()
                score = float(tds[4].text.strip())
                rank_list.append([rank, name, province_city, campus_type, score])
except Exception as err:
    print(err)
#打印结果
print("{:^10}\t{:^15}\t{:^10}\t{:^10}\t{:^10}".format("排名", "学校名称", "省市", "类型", "总分"))
print("-" * 75)
for u in rank_list:
    print("{:^10}\t{:^15}\t{:^10}\t{:^10}\t{:^10.1f}".format(u[0], u[1], u[2], u[3], u[4]))

1.2心得体会

通过编写代码实践，实现了从 Python 爬虫理论知识到实际应用的跨越。我对requests库的请求发送、响应处理，以及BeautifulSoup库的标签定位、文本提取等操作更加熟练，能够根据实际网页结构灵活调整代码逻辑。例如，在定位表格和span标签时，灵活运用find与find_all方法，并结合attrs参数指定类名，这些操作从最初的 “对照文档编写” 转变为 “根据需求灵活运用”，技术熟练度显著提升。

作业②:

要求：用requests和re库方法设计某个商城（自已选择）商品比价定向爬虫，爬取该商城，以关键词“书包”搜索页面的数据，爬取商品名称和价格。
输出信息：

序号	价格	商品名
1	65.00	xxx
2.....

2.1作业代码和图片

import requests
import re

url = 'https://search.dangdang.com/?key=%CA%E9%B0%FC&category_id=10009684#J_tab'

try:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    data = response.text
    table = re.search(r'<ul class="bigimg cloth_shoplist" id="component_59".*?>(.*?)</ul>', data, re.S)

    if table:
        content = table.group(1)
        print("序号\t价格\t\t商品名称")
        lis = re.findall(r'<li.*?>(.*?)</li>', content, re.S)
        #序号变量i从1开始
        i = 1
        for li in lis:
            price_bag = re.search(r'<span class="price_n">&yen;(.*?)</span>', li)
            name_bag = re.search(r'<a title="(.*?)"', li)
            if price_bag and name_bag:
                price = price_bag.group(1).strip()
                names = name_bag.group(1).strip()
                #直接处理整个名称字符串
                clean_name = re.sub(r'\s+', ' ', names)
                print("{}\t{}\t\t{}".format(i, price, clean_name))
            else:
                print("{}\t未找到商品价格或名称".format(i))
            i += 1
    else:
        print("未找到商品列表容器")

except Exception as err:
    print(err)

2.2心得体会

通过实践，加深了我对正则表达式逻辑的理解。在数据提取环节，正则表达式成为核心工具：用re.search定位商品列表的ul容器，再以re.findall提取每条商品的li标签，最后针对价格（price_n类）和商品名称（title属性）用re.search精准匹配。过程中，用re.sub替换多余空格清洗商品名称，以及通过条件判断过滤无效数据，让我体会到正则表达式在复杂 HTML 文本中 “精准筛选” 的优势，也明白数据清洗对提升信息可用性的重要性。

作业③：

要求：爬取一个给定网页（ https://news.fzu.edu.cn/yxfd.htm）或者自选网页的所有JPEG、JPG或PNG格式图片文件
输出信息：将自选网页内的所有JPEG、JPG或PNG格式图片件保存在一个文件夹中

3.1作业代码和图片

import requests
import os
from bs4 import BeautifulSoup

if not os.path.exists('fzu_images'):
    os.makedirs('fzu_images')

url = 'https://news.fzu.edu.cn/yxfd.htm'

try:
    #发送请求并获取网页内容
    response = requests.get(url)
    response.raise_for_status()
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, "html.parser")
    data = soup.find('section', attrs={'class': 'n_tutu'})
    imgs = data.find_all('img')
    for img in imgs:
        img_src = img.get('src')
        if img_src.endswith(('.jpg','.jpeg','.png')):
            img_url = "https://news.fzu.edu.cn/" + img_src
            filename = os.path.basename(img_url)
            with open ('fzu_images/'+filename,'wb') as f:
                f.write(requests.get(img_url).content)
                print(f"下载成功:{filename}")

except Exception as err:
    print(err)

3.2心得体会

通过实践，我掌握了图片类文件下载的基本方法，也加强了我对BeautifulSoup的认识与运用。

posted @ 2025-10-25 21:07 cjx2133 阅读(13) 评论(0) 收藏举报

刷新页面返回顶部

2025-project