102102125 Xiao Chenkai · Data Collection and Fusion Technology, Assignment 1
1. Assignment Content
• Assignment ①: Requirement: use the requests and BeautifulSoup libraries to crawl the given site (http://www.shanghairanking.cn/rankings/bcur/2020) and print the crawled university ranking information to the screen.
The code is as follows:
```python
import requests
from bs4 import BeautifulSoup

url = "http://www.shanghairanking.cn/rankings/bcur/2020"
response = requests.get(url)
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('tbody')
if table:
    rows = table.find_all('tr')
    print("Rank\tUniversity\tProvince\tType\tScore")
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 5:
            rank = cols[0].get_text(strip=True)
            # The university name sits inside an <a> tag when one is present
            uni_name = cols[1].find('a').get_text(strip=True) if cols[1].find('a') else cols[1].get_text(strip=True)
            province = cols[2].get_text(strip=True)
            uni_type = cols[3].get_text(strip=True)
            score = cols[4].get_text(strip=True)
            print(f"{rank}\t{uni_name}\t{province}\t{uni_type}\t{score}")
else:
    print("Ranking table not found")
```
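To make the scraped rankings easier to compare later, the printed rows can also be written to CSV. A minimal stdlib-only sketch, using made-up example rows in the same column order the scraper prints:

```python
import csv
import io

# Hypothetical example rows in the scraper's column order
# (rank, name, province, type, score); real values come from the crawl.
rows = [
    ("1", "Tsinghua University", "Beijing", "Comprehensive", "852.5"),
    ("2", "Peking University", "Beijing", "Comprehensive", "746.7"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Rank", "University", "Province", "Type", "Score"])
writer.writerows(rows)
print(buf.getvalue(), end="")
# In the real script, replace io.StringIO() with
# open("rankings.csv", "w", newline="", encoding="utf-8")
```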
• Assignment ②: Requirement: use the requests and re libraries to design a price-comparison crawler for a shopping site of your choice; crawl that site's search results for the keyword "书包" (backpack) and extract product names and prices.
The code is as follows:
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
url = "https://search.jd.com/Search?keyword=书包&enc=utf-8"
response = requests.get(url, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    product_info_list = soup.find_all('li', class_='gl-item')[:60]  # limit to 60 items
    for idx, product_info in enumerate(product_info_list):
        product_name_tag = product_info.find('div', class_='p-name')
        product_price_tag = product_info.find('div', class_='p-price')
        if product_name_tag and product_price_tag:
            product_name = product_name_tag.find('em')
            product_price = product_price_tag.find('i')
            if product_name and product_price:
                print(f"{idx + 1}. Name: {product_name.get_text()}, Price: {product_price.get_text()}")
            else:
                print(f"{idx + 1}. Could not parse this item")
        else:
            print(f"{idx + 1}. Could not parse this item")
else:
    print("Request failed")
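The assignment asks for re rather than BeautifulSoup, and the same name/price extraction can be done with a regular expression. A minimal sketch over an illustrative HTML fragment (the live JD markup is more complex, so the pattern would need adjusting in practice):

```python
import re

# Illustrative fragment mimicking the p-price / p-name structure;
# product names and prices here are made up.
sample = '''
<li class="gl-item">
  <div class="p-price"><i>59.00</i></div>
  <div class="p-name"><em>school backpack</em></div>
</li>
<li class="gl-item">
  <div class="p-price"><i>129.00</i></div>
  <div class="p-name"><em>travel backpack</em></div>
</li>
'''

# re.S lets '.' span newlines, so one match covers a whole <li> block;
# non-greedy '.*?' keeps each match inside a single item.
pattern = re.compile(
    r'<div class="p-price"><i>([\d.]+)</i></div>.*?'
    r'<div class="p-name"><em>(.*?)</em></div>',
    re.S,
)
for idx, (price, name) in enumerate(pattern.findall(sample), 1):
    print(f"{idx}. Name: {name}, Price: {price}")
```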
• Assignment ③: Requirement: crawl all JPEG- and JPG-format files from a given page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or a page of your choice.
The code is as follows:
```python
import os
import requests
from bs4 import BeautifulSoup

URL = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
response = requests.get(URL)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
img_tags = soup.find_all('img')

os.makedirs('downloaded_images', exist_ok=True)

for img in img_tags:
    img_url = img.get('src')  # some <img> tags may lack a src attribute
    if not img_url:
        continue
    if img_url.lower().endswith(('.jpeg', '.jpg')):
        # Resolve relative paths against the page URL
        if not img_url.startswith(('http:', 'https:')):
            img_url = requests.compat.urljoin(URL, img_url)
        img_response = requests.get(img_url, stream=True)
        img_response.raise_for_status()
        with open(os.path.join('downloaded_images', os.path.basename(img_url)), 'wb') as f:
            for chunk in img_response.iter_content(8192):
                f.write(chunk)
print("Download complete!")
```
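One caveat with suffix checks on the raw URL: image links often carry query strings (e.g. photo.jpg?v=2), which endswith misses. A small helper that inspects only the URL path's extension, sketched on made-up URLs:

```python
import os
from urllib.parse import urljoin, urlparse

def is_jpeg(url: str) -> bool:
    # Look only at the path component, so query strings and uppercase
    # extensions are handled correctly.
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext in ('.jpg', '.jpeg')

# Illustrative src values; base is the assignment's page URL.
base = 'https://xcb.fzu.edu.cn/info/1071/4481.htm'
for src in ('/images/a.JPG', 'b.jpeg?v=2', 'c.png'):
    print(urljoin(base, src), is_jpeg(src))
```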
