Data Collection and Fusion: Assignment 1

Data Collection and Fusion: First Assignment
102102125 肖辰恺

Task ①:

o Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the university ranking information to the screen.

Code:

import urllib.request
from bs4 import BeautifulSoup

# Download the ranking page and decode it as UTF-8.
url = "http://www.shanghairanking.cn/rankings/bcur/2020"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')

# The ranking rows live inside the table's <tbody>.
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('tbody')
if table:
    rows = table.find_all('tr')

    # Header: rank, university name, province/city, type, total score.
    print("排名\t学校名称\t省市\t学校类型\t总分")
    for row in rows:
        cols = row.find_all('td')
        # Skip rows that do not have the five expected cells.
        if len(cols) >= 5:
            rank = cols[0].get_text(strip=True)
            # Prefer the link text when the name cell contains an <a> tag.
            name_link = cols[1].find('a')
            name_cell = name_link if name_link is not None else cols[1]
            uni_name = name_cell.get_text(strip=True)
            province = cols[2].get_text(strip=True)
            uni_type = cols[3].get_text(strip=True)
            score = cols[4].get_text(strip=True)
            print(f"{rank}\t{uni_name}\t{province}\t{uni_type}\t{score}")
else:
    # Message: "ranking table not found".
    print("未找到排名信息表格")

The output is shown below:

![Screenshot 2023-09-22 10.32.27](/Users/xiaochenkai/Desktop/截屏2023-09-22 10.32.27.png)

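Reflections

This code scrapes the 2020 Chinese university ranking from the ShanghaiRanking site (shanghairanking.cn). It first downloads the page with the urllib.request module (the task statement names requests, but the code above uses urllib.request), then parses the HTML with BeautifulSoup. It locates the ranking table through its <tbody> tag and iterates over every row (<tr> tag), reading each university's rank, name, province, type, and total score from the first five <td> cells. Each record is printed to the console as one tab-separated line. If no <tbody> is found, a message saying so is printed instead. It is a simple web-scraping example that shows the data-extraction process from request to output.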
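Tab-separated output tends to drift out of alignment when the columns contain Chinese text of varying length. The sketch below is one possible way to format the same five fields into fixed-width columns by padding the Chinese columns with the full-width space U+3000. The function name format_row, the column widths, and the sample rows are all hypothetical and chosen only for illustration; they are not taken from the ranking page.

```python
# Minimal sketch: fixed-width columns for mixed ASCII/CJK text.
# Assumption: the terminal font renders U+3000 as wide as one CJK character.
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def format_row(rank, name, province, uni_type, score):
    """Return one line; CJK columns are padded with U+3000, ASCII ones with spaces."""
    return (f"{rank:<6}"
            f"{name:{FULLWIDTH_SPACE}<12}"
            f"{province:{FULLWIDTH_SPACE}<6}"
            f"{uni_type:{FULLWIDTH_SPACE}<6}"
            f"{score:>8}")


if __name__ == "__main__":
    # Hypothetical placeholder rows, not real ranking data.
    sample_rows = [
        ("1", "示例大学甲", "某省", "综合", "100.0"),
        ("2", "示例理工大学乙", "某市", "理工", "98.5"),
    ]
    print(format_row("排名", "学校名称", "省市", "类型", "总分"))
    for row in sample_rows:
        print(format_row(*row))
```

In the scraper, the print call inside the loop could be swapped for print(format_row(rank, uni_name, province, uni_type, score)) if aligned output is preferred over tabs.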

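Task ②:

o Requirement: use the requests and re libraries to build a targeted price-comparison crawler for an online store of your own choosing. Crawl the store's search results page for the keyword “书包” (schoolbag) and extract the product names and prices.

Code: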

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent so the request is not rejected as a bot.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# JD.com search results for the keyword "书包" (schoolbag).
url = "https://search.jd.com/Search?keyword=书包&enc=utf-8"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    page_content = response.text
    soup = BeautifulSoup(page_content, 'html.parser')
    # Each product is an <li class="gl-item">; keep at most the first 60.
    product_info_list = soup.find_all('li', class_='gl-item')[:60]

    for idx, product_info in enumerate(product_info_list):
        product_name_tag = product_info.find('div', class_='p-name')
        product_price_tag = product_info.find('div', class_='p-price')

        if product_name_tag and product_price_tag:
            # The name is in an <em> tag, the price in an <i> tag.
            product_name = product_name_tag.find('em')
            product_price = product_price_tag.find('i')

            if product_name and product_price:
                print(f"{idx + 1}. 物品名称: {product_name.get_text()}, 物品价格: {product_price.get_text()}")
            else:
                # Message: "item not recognized".
                print(f"{idx + 1}. 不能识别该物品")
        else:
            print(f"{idx + 1}. 不能识别该物品")

else:
    print("Fail")

The output is shown below:

![Screenshot 2023-09-22 10.32.20](/Users/xiaochenkai/Desktop/截屏2023-09-22 10.32.20.png)

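Reflections

This code scrapes JD.com search results for the keyword “书包” (schoolbag). It sends a GET request with the requests library and sets a User-Agent header so the request looks like it comes from an ordinary browser. If the status code is 200 (success), it parses the page with BeautifulSoup; note that the task statement names the re library, while the code above uses BeautifulSoup instead. It then takes at most the first 60 li tags with class gl-item, each of which represents one product. For each product it tries to read the name from the em tag inside the p-name div and the price from the i tag inside the p-price div. When both are found they are printed; otherwise the code prints “不能识别该物品” (item not recognized). If the request fails, it prints “Fail”.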
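Because the task statement asks for requests and re, the same extraction could also be sketched with regular expressions alone. The example below is a rough sketch under stated assumptions, not a tested JD.com scraper: it assumes the markup follows the structure the code above relies on (li.gl-item containing div.p-name > em and div.p-price > i), and it runs on SAMPLE_HTML, a hypothetical hand-written fixture, rather than on a live response. The names extract_products and SAMPLE_HTML are invented for illustration.

```python
import html
import re

# Split the page at the start of each product <li class="... gl-item ...">,
# so that nested <li> tags inside a product do not confuse the match.
ITEM_SPLIT_RE = re.compile(r'<li\b[^>]*\bclass="[^"]*\bgl-item\b')
NAME_RE = re.compile(
    r'<div\b[^>]*\bclass="[^"]*\bp-name\b[^"]*"[^>]*>.*?<em\b[^>]*>(.*?)</em>', re.S)
PRICE_RE = re.compile(
    r'<div\b[^>]*\bclass="[^"]*\bp-price\b[^"]*"[^>]*>.*?<i\b[^>]*>(.*?)</i>', re.S)
TAG_RE = re.compile(r'<[^>]+>')


def extract_products(page, limit=60):
    """Return up to `limit` (name, price) string pairs; incomplete items are skipped."""
    products = []
    for chunk in ITEM_SPLIT_RE.split(page)[1:limit + 1]:
        name_match = NAME_RE.search(chunk)
        price_match = PRICE_RE.search(chunk)
        if name_match and price_match:
            # Drop inline tags (e.g. keyword highlighting) and decode entities.
            name = html.unescape(TAG_RE.sub('', name_match.group(1))).strip()
            price = price_match.group(1).strip()
            products.append((name, price))
    return products


# Hypothetical fixture that mimics the assumed structure; not real JD.com data.
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="p-price"><strong><em>￥</em><i>59.00</i></strong></div>
    <div class="p-name"><a href="#"><em>示例<font class="skcolor_ljg">书包</font> A款</em></a></div>
  </li>
  <li class="gl-item">
    <div class="p-name"><a href="#"><em>示例书包 B款</em></a></div>
  </li>
</ul>
"""

if __name__ == "__main__":
    for idx, (name, price) in enumerate(extract_products(SAMPLE_HTML), start=1):
        print(f"{idx}. {name}: {price}")
```

With a live page, extract_products(response.text) would take the place of the BeautifulSoup lookups. For an actual price comparison, the pairs could then be ordered with sorted(products, key=lambda p: float(p[1])), assuming every price string is a plain decimal number.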

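Task ③:

o Requirement: download all JPEG and JPG files from a given web page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or from a page of your own choosing.

o Output: save all JPEG and JPG files from the chosen page in a single folder.

Code: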

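import os
import requests
from bs4 import BeautifulSoup

URL = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'

# Fetch the page; raise an exception on 4xx/5xx responses.
response = requests.get(URL)
response.raise_for_status()

soup = BeautifulSoup(response.content, 'html.parser')

# Every <img> tag is a candidate image.
img_tags = soup.find_all('img')

# Create the output folder if it does not exist yet.
if not os.path.exists('downloaded_images'):
    os.makedirs('downloaded_images')

for img in img_tags:
    # Some <img> tags have no src attribute; skip them instead of raising KeyError.
    img_url = img.get('src')
    if not img_url:
        continue
    # Keep only JPEG/JPG files (case-insensitive match on the extension).
    if img_url.lower().endswith(('.jpeg', '.jpg')):
        # Turn a relative path into an absolute URL.
        if not img_url.startswith(('http:', 'https:')):
            img_url = requests.compat.urljoin(URL, img_url)
        img_response = requests.get(img_url, stream=True)
        img_response.raise_for_status()
        # Stream the image to disk in 8 KB chunks.
        with open(os.path.join('downloaded_images', os.path.basename(img_url)), 'wb') as f:
            for chunk in img_response.iter_content(8192):
                f.write(chunk)

# Message: "download complete!"
print("下载完成!")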

The images downloaded by the run are shown below:

![Screenshot 2023-09-22 10.33.20](/Users/xiaochenkai/Desktop/截屏2023-09-22 10.33.20.png)

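Reflections

This code downloads every .jpeg and .jpg image from the given page (https://xcb.fzu.edu.cn/info/1071/4481.htm).

First, it sends an HTTP request with the requests library to fetch the page. If the request fails (for example, 404 Not Found or 500 Internal Server Error), raise_for_status() raises an exception.

Next, it parses the returned page with BeautifulSoup and collects all <img> tags, whose src attributes hold the image URLs.

Before downloading anything, the code checks whether a folder named 'downloaded_images' exists and creates it if it does not.

Then, for each <img> tag, it checks whether the image URL ends in .jpeg or .jpg. If the URL is not absolute (that is, it does not start with http: or https:), the code uses requests.compat.urljoin to combine the page URL with the image path into a complete URL.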
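The URL check and the URL completion described above can be tried out in isolation. The sketch below uses urllib.parse from the standard library (in Python 3, requests.compat.urljoin is the same function as urllib.parse.urljoin). The helper names is_jpeg_url and resolve, and the example image paths, are hypothetical. Unlike a plain endswith test on the whole URL, this variant compares only the path component, so an uppercase extension or a trailing query string is still recognized; that is a design choice of this sketch.

```python
from urllib.parse import urljoin, urlsplit

PAGE_URL = "https://xcb.fzu.edu.cn/info/1071/4481.htm"


def is_jpeg_url(url):
    """True if the path part of the URL ends in .jpg or .jpeg (case-insensitive)."""
    path = urlsplit(url).path.lower()
    return path.endswith((".jpg", ".jpeg"))


def resolve(base_url, src):
    """Turn a relative, root-relative, or protocol-relative src into an absolute URL."""
    return urljoin(base_url, src)


if __name__ == "__main__":
    # Hypothetical src values covering the common forms.
    for src in ["/images/a.jpg", "../../images/b.JPEG?v=2", "//cdn.example.com/c.jpg",
                "https://example.com/d.jpg", "logo.png"]:
        print(src, is_jpeg_url(src), resolve(PAGE_URL, src))
```

Since urljoin returns an absolute URL unchanged, it can be applied to every src without first testing for an http: or https: prefix.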

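Finally, the code sends a new request to download the image. The image is written in binary mode into the 'downloaded_images' folder, and the file name is the last segment of the URL.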
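Taking the last segment of the URL as the file name has two corner cases: a query string would end up in the name, and two images with the same name in different directories would overwrite each other. The sketch below shows one possible way to handle both with the standard library. The helper names filename_from_url and unique_path, the fallback name image.jpg, and the example URLs are hypothetical.

```python
import os
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image.jpg"):
    """Use the last path segment as the file name, ignoring query and fragment."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or default


def unique_path(folder, name, taken):
    """Return folder/name, adding _1, _2, ... if the name was already used.

    `taken` is a set of names handed out so far; it is updated in place.
    """
    stem, ext = os.path.splitext(name)
    candidate, counter = name, 1
    while candidate in taken:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    taken.add(candidate)
    return os.path.join(folder, candidate)


if __name__ == "__main__":
    used = set()
    for url in ["https://example.com/a/photo.jpg?size=large",
                "https://example.com/b/photo.jpg"]:
        print(unique_path("downloaded_images", filename_from_url(url), used))
```

In the download loop, the argument to open could then be unique_path('downloaded_images', filename_from_url(img_url), used), with used created once before the loop.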

Once all images have been downloaded, the code prints “下载完成!” (download complete).

posted @ 2023-09-22 10:55 拾霜
