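2023 Data Collection and Fusion Technology Practice: Assignment 1

Assignment ①

Requirements

o Requirement: use the requests and BeautifulSoup libraries to crawl the given URL (http://www.shanghairanking.cn/rankings/bcur/2020) and print the scraped university ranking information to the screen.
o Output:
Rank  University  Province/City  Type  Score
1  Tsinghua University  Beijing  Comprehensive  852.5
2 ......

Code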

import json
import requests

# JSON API behind the ranking page
url = 'https://www.shanghairanking.cn/api/pub/v1/bcur?bcur_type=11&year=2020'
html = requests.get(url).text  # response body as text (JSON, despite the variable name)
# print(html)
uni = json.loads(html)
# print(uni)
uni = uni['data']['rankings']  # one entry per university
univers = []
num = int(input('How many top universities to fetch: '))
num = min(num, len(uni))  # avoid an IndexError when the input exceeds the list length
for i in range(num):
    rank = uni[i]['rankOverall']  # rank
    name = uni[i]['univNameCn']  # university name (Chinese)
    province = uni[i]['province']  # province
    score = uni[i]['score']  # total score
    category = uni[i]['univCategory']  # university type
    univers.append([rank, name, province, category, score])  # collect one row

print("Rank", "University", "Province/City", "Type", "Score")
for i in range(num):  # print the collected rows
    print(univers[i][0], univers[i][1], univers[i][2], univers[i][3], univers[i][4])
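
The print loop above separates fields with single spaces, so the columns drift as name lengths vary. A minimal sketch of aligned output, assuming it is acceptable to pad the Chinese fields with the full-width space (U+3000); `format_row` and the column widths are hypothetical and not part of the original code:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, as wide as one CJK character

def format_row(rank, name, province, category, score, fill=FULLWIDTH_SPACE):
    """Return one ranking row with left-aligned, fixed-width columns."""
    # The rank is padded with ASCII spaces; the Chinese fields use `fill`.
    return f"{rank:<6}{name:{fill}<12}{province:{fill}<6}{category:{fill}<6}{score}"

if __name__ == "__main__":
    print(format_row(1, "清华大学", "北京", "综合", 852.5))
```

Whether the columns line up exactly still depends on the terminal font, so treat the widths as a starting point.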

Reflections

I used the site's API, which returned the data very directly. In future I will look for an API first.
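
The assignment names BeautifulSoup, while the code above reads the JSON API instead. For comparison, here is a minimal sketch of the HTML route. It assumes the ranking is rendered as a table with one `<tr>` per university and the five fields in consecutive `<td>` cells. `SAMPLE_HTML` and `parse_ranking` are made up for illustration and do not reproduce the real page markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the live page may be structured differently.
SAMPLE_HTML = """
<table>
  <thead>
    <tr><th>排名</th><th>学校名称</th><th>省市</th><th>学校类型</th><th>总分</th></tr>
  </thead>
  <tbody>
    <tr><td> 1 </td><td><a>清华大学</a></td><td>北京</td><td>综合</td><td>852.5</td></tr>
  </tbody>
</table>
"""

def parse_ranking(html):
    """Return a list of [rank, name, province, category, score] string rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # Header rows use <th>, so they yield no <td> cells and are skipped.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) >= 5:
            rows.append(cells[:5])
    return rows

if __name__ == "__main__":
    for row in parse_ranking(SAMPLE_HTML):
        print(*row)
```

If the table turns out to be filled in by JavaScript after the page loads, this approach would see no rows, and the API route above would remain the simpler choice.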

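Assignment ②

Requirements

o Requirement: use the requests and re libraries to build a targeted price-comparison crawler for an online store of your own choice. Crawl the store's search results page for the keyword "书包" (school bag) and extract each product's name and price.
o Output:
No.  Price  Product name
1  65.00  xxx
2 ......

Code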

import urllib.request
from bs4 import BeautifulSoup
import re

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.81"}
url="https://search.jd.com/Search?keyword=%E4%B9%A6%E5%8C%85&enc=utf-8&wq=shu&pvid=ea9bef8ed6114b1a9106f35315b8bf24"

req = urllib.request.Request(url, headers=headers)  # attach the User-Agent header to the request
html = urllib.request.urlopen(req)
html = html.read()
html = html.decode()

soup = BeautifulSoup(html, "html.parser")
i = 0
# Inspecting the HTML shows that each JD product sits in an <li> with a numeric data-sku attribute
lis = soup.find_all("li", {"data-sku": re.compile(r"\d+")})
print("No.        Product name                        Price")
for li in lis:
    price1 = li.find("div", attrs={"class": "p-price"}).find("strong").find("i")  # the price is under class="p-price"
    price = price1.text
    name1 = li.find("div", attrs={"class": "p-name"}).find("a").find("em")
    name = name1.text.strip()  # product name with leading and trailing whitespace removed
    i = i + 1
    t = '\t'
    print(i, t, name, t, price)

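Reflections

I could not get Taobao to work, but JD was fairly simple: once I found where the data sits in the HTML, the scrape went quickly. I have now learned how to use BeautifulSoup.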
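
The assignment asks for requests and re, while the code above relies on BeautifulSoup. Here is a minimal sketch of the regex route. It assumes each product keeps its price in an `<i>` under `class="p-price"` and its name in an `<em>` under `class="p-name"`, as the selectors above suggest. `SAMPLE_HTML`, `extract_items` and both patterns are illustrative assumptions, and the live markup may differ:

```python
import re

# Hypothetical sample markup modeled on the selectors used above.
SAMPLE_HTML = """
<li data-sku="100001" class="gl-item">
  <div class="p-price"><strong><em>￥</em><i>65.00</i></strong></div>
  <div class="p-name"><a href="#"><em>学生<font class="skcolor_ljg">书包</font> 大容量</em></a></div>
</li>
<li data-sku="100002" class="gl-item">
  <div class="p-price"><strong><em>￥</em><i>129.00</i></strong></div>
  <div class="p-name"><a href="#"><em>双肩<font class="skcolor_ljg">书包</font></em></a></div>
</li>
"""

# One match per product: group 1 is the price, group 2 the raw name markup.
ITEM_RE = re.compile(
    r'<li[^>]*data-sku="\d+"[^>]*>.*?'
    r'class="p-price".*?<i[^>]*>([\d.]+)</i>.*?'
    r'class="p-name".*?<a[^>]*>\s*<em>(.*?)</em>\s*</a>',
    re.S,  # let "." span line breaks inside one <li>
)
TAG_RE = re.compile(r"<[^>]+>")  # strips nested tags such as <font> from the name

def extract_items(html):
    """Return a list of (name, price) tuples found in the search page HTML."""
    items = []
    for price, raw_name in ITEM_RE.findall(html):
        name = TAG_RE.sub("", raw_name).strip()
        items.append((name, price))
    return items

if __name__ == "__main__":
    for no, (name, price) in enumerate(extract_items(SAMPLE_HTML), start=1):
        print(no, price, name, sep="\t")
```

Regex matching on HTML breaks more easily than a parser when the markup changes, which may be why the BeautifulSoup version felt simpler.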

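Assignment ③

Requirements

o Requirement: download all JPEG and JPG files from a given web page (https://xcb.fzu.edu.cn/info/1071/4481.htm) or from a page of your own choice.
o Output: save all JPEG and JPG files from the chosen page into a single folder.

Code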

import os
import urllib.request
from bs4 import BeautifulSoup
import urllib.parse

headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.81"}
url = "https://xcb.fzu.edu.cn/info/1071/4481.htm"

req = urllib.request.Request(url, headers=headers)  # attach the User-Agent header to the request
html = urllib.request.urlopen(req)
html = html.read()
html = html.decode()  # fetch and decode the page
# print(html)

photo_urls = []  # collected image URLs
soup = BeautifulSoup(html, "lxml")
i = 0
lis = soup.find_all("p", {"class": "vsbcontent_img"})  # paragraphs that hold the images
for li in lis:
    # print(li)
    photo = li.find("img")["src"]  # image link as written in the page
    photo = urllib.parse.urljoin(url, photo)  # resolve against the page URL to get an absolute link
    photo_urls.append(photo)
os.makedirs("photo", exist_ok=True)  # urlretrieve does not create the target folder
for photo_url in photo_urls:
    # Save into the photo folder under the current directory as photo0.jpg, photo1.jpg, ...
    urllib.request.urlretrieve(photo_url, os.path.join("photo", f"photo{i}.jpg"))
    i += 1

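Reflections

At first the images would not download. It turned out the extracted links lacked the https:// site prefix, so I added it to each address by hand and the download worked. I understand BeautifulSoup better now, and I also learned the function for downloading images (urllib.request.urlretrieve).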
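
The fix described above, adding the site address to each link, can also be done with `urllib.parse.urljoin`, which handles both relative and already-absolute links. Here is a minimal sketch that also keeps only JPEG/JPG links, as the assignment asks. `absolute_jpeg_urls` and the sample paths are hypothetical:

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://xcb.fzu.edu.cn/info/1071/4481.htm"

def absolute_jpeg_urls(page_url, srcs):
    """Resolve each src against page_url and keep only .jpg/.jpeg links."""
    result = []
    for src in srcs:
        full = urljoin(page_url, src)  # absolute links pass through unchanged
        path = urlparse(full).path.lower()  # ignore any query string, compare case-insensitively
        if path.endswith((".jpg", ".jpeg")):
            result.append(full)
    return result

if __name__ == "__main__":
    # Made-up src values for illustration only.
    samples = ["/__local/a/b/pic.jpg", "../../images/logo.png", "photo.JPEG"]
    for link in absolute_jpeg_urls(PAGE_URL, samples):
        print(link)
```

Filtering on the extension is only a heuristic: a server can serve a JPEG from a URL that ends differently, so checking the Content-Type response header would be a stricter alternative.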

Posted 2023-09-21 16:53 by EDG-Yiper
