102102125 肖辰恺: Data Collection and Fusion Technology, Assignment 1

I. Assignment Content
• Assignment ①: Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.

The code is as follows:

```python
import urllib.request
from bs4 import BeautifulSoup

# Fetch the 2020 Best Chinese Universities Ranking page
url = "http://www.shanghairanking.cn/rankings/bcur/2020"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
html = response.read().decode('utf-8')

# Each <tr> inside <tbody> is one university; the columns are
# rank, name, province, type and total score
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('tbody')
if table:
    rows = table.find_all('tr')
    print("排名\t学校名称\t省市\t学校类型\t总分")
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 5:
            rank = cols[0].get_text(strip=True)
            # The university name is wrapped in an <a> tag when present
            uni_name = cols[1].find('a').get_text(strip=True) if cols[1].find('a') else cols[1].get_text(strip=True)
            province = cols[2].get_text(strip=True)
            uni_type = cols[3].get_text(strip=True)
            score = cols[4].get_text(strip=True)
            print(f"{rank}\t{uni_name}\t{province}\t{uni_type}\t{score}")
else:
    print("未找到排名信息表格")
```
The output is as follows: ![](https://img2023.cnblogs.com/blog/3286208/202309/3286208-20230921164420182-529848552.png)
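
The assignment names the requests library, while the code above fetches the page with urllib.request. Below is a minimal sketch of the same fetch done with requests, keeping the parsing logic unchanged; the timeout value and variable names are just illustrative choices, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the ranking page with requests instead of urllib.request
url = "http://www.shanghairanking.cn/rankings/bcur/2020"
resp = requests.get(url, timeout=10)
resp.raise_for_status()          # stop early on HTTP errors
resp.encoding = 'utf-8'          # the page is served as UTF-8

# Parsing then proceeds exactly as in the code above
soup = BeautifulSoup(resp.text, 'html.parser')
table = soup.find('tbody')
```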

• Assignment ②: Requirement: use the requests and re libraries to design a price-comparison crawler for an online store of your choice, crawl the store's search results for the keyword "书包" (schoolbag), and extract the product names and prices.

The code is as follows:

```python
import requests
from bs4 import BeautifulSoup

# Pretend to be a normal browser so the search page is returned
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# JD search results for the keyword "书包"
url = "https://search.jd.com/Search?keyword=书包&enc=utf-8"
response = requests.get(url, headers=headers)

if response.status_code == 200:
    page_content = response.text
    soup = BeautifulSoup(page_content, 'html.parser')
    # Each <li class="gl-item"> is one product; keep at most 60 items
    product_info_list = soup.find_all('li', class_='gl-item')[:60]

    for idx, product_info in enumerate(product_info_list):
        product_name_tag = product_info.find('div', class_='p-name')
        product_price_tag = product_info.find('div', class_='p-price')

        if product_name_tag and product_price_tag:
            # The name text sits in an <em>, the price in an <i>
            product_name = product_name_tag.find('em')
            product_price = product_price_tag.find('i')

            if product_name and product_price:
                print(f"{idx + 1}. 物品名称: {product_name.get_text()}, 物品价格: {product_price.get_text()}")
            else:
                print(f"{idx + 1}. 不能识别该物品")
        else:
            print(f"{idx + 1}. 不能识别该物品")
else:
    print("Fail")
```
The output is as follows: ![](https://img2023.cnblogs.com/blog/3286208/202309/3286208-20230921164353041-1895737883.png)
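
The brief asks for the re library, whereas the code above parses the page with BeautifulSoup. A rough regex-based sketch of the same extraction is shown below; the patterns assume the p-name/p-price markup used above and may need adjusting if JD changes its HTML, and the shortened User-Agent string is only a placeholder.

```python
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
url = "https://search.jd.com/Search?keyword=书包&enc=utf-8"
html = requests.get(url, headers=headers).text

# Product names live in <em> inside div.p-name, prices in <i> inside div.p-price
names = re.findall(r'<div class="p-name[^"]*">.*?<em>(.*?)</em>', html, re.S)
prices = re.findall(r'<div class="p-price">.*?<i[^>]*>(.*?)</i>', html, re.S)

for idx, (name, price) in enumerate(zip(names, prices), start=1):
    clean_name = re.sub(r'<[^>]+>', '', name).strip()  # drop nested highlight tags
    print(f"{idx}. {clean_name}  {price}")
```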

• Assignment ③: Requirement: crawl all JPEG and JPG image files from a given web page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or from a page of your choice.

The code is as follows:

```python
import os
import requests
from bs4 import BeautifulSoup

URL = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'

# Fetch the page and fail fast on HTTP errors
response = requests.get(URL)
response.raise_for_status()

soup = BeautifulSoup(response.content, 'html.parser')
img_tags = soup.find_all('img')

# Put all downloaded files in a local folder
if not os.path.exists('downloaded_images'):
    os.makedirs('downloaded_images')

for img in img_tags:
    img_url = img.get('src')
    if img_url and (img_url.endswith('.jpeg') or img_url.endswith('.jpg')):
        # Resolve relative paths against the page URL
        if not img_url.startswith(('http:', 'https:')):
            img_url = requests.compat.urljoin(URL, img_url)
        # Stream each image to disk in 8 KB chunks
        img_response = requests.get(img_url, stream=True)
        img_response.raise_for_status()
        with open(os.path.join('downloaded_images', os.path.basename(img_url)), 'wb') as f:
            for chunk in img_response.iter_content(8192):
                f.write(chunk)

print("下载完成!")
```
The downloaded images after running the code: ![](https://img2023.cnblogs.com/blog/3286208/202309/3286208-20230921164452215-1244622782.png)
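
The endswith check above is case-sensitive and looks at the raw src string, so links such as photo.JPG or photo.jpg?v=2 would be skipped. A small sketch of a more tolerant check follows; is_jpeg is a hypothetical helper name and the example URLs are made up, neither is part of the original code.

```python
from urllib.parse import urlparse

def is_jpeg(url: str) -> bool:
    # Compare only the URL path, ignoring query strings, and match case-insensitively
    path = urlparse(url).path.lower()
    return path.endswith(('.jpg', '.jpeg'))

# Example: both of these return True
print(is_jpeg('photo.JPG'), is_jpeg('https://example.com/a/b.jpeg?v=2'))
```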
posted @ 2023-09-21 15:47 拾霜