「Data Collection」Lab 1

I. Assignment ①

  • Output form:
2020 Rank   Tier     School         Total Score
1           Top 2%   中国人民大学     1069.0
2           ....     ...........    ......

1. Fetch the page source: getHTMLTextUrllib(url)

import urllib.request

def getHTMLTextUrllib(url):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; "
                                 "en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        data = resp.read()
        unicodeData = data.decode()  # assumes the page is UTF-8
        # Alternative when the charset is unknown:
        # dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        # unicodeData = dammit.unicode_markup
        return unicodeData
    except Exception as err:
        print(err)
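
The commented-out lines point at Beautiful Soup's UnicodeDammit, an alternative decoder for pages that do not declare their charset. A minimal sketch (the sample bytes below are a made-up GBK string, not data from the ranking site):

from bs4 import UnicodeDammit

# UnicodeDammit tries the listed encodings in order and falls back to
# detection, which helps when a page's charset is unknown.
data = "中文".encode("gbk")
dammit = UnicodeDammit(data, ["utf-8", "gbk"])
print(dammit.original_encoding)  # gbk
print(dammit.unicode_markup)     # 中文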

2. Build regular expressions for each field (a quick check of the first pattern follows the list)

  • 2020 rank: rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>(\n\s*?\d*\s*?)<\/div><\/td>',html)
  • Tier: level = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<!----><\/td>',html)
  • School name: name = re.findall(r'class="name-cn" data-v-b80b4d60>(.*?)<\/a>',html)
  • Total score: score = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<\/td>',html)
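
As the sanity check: running the ranking pattern against a hand-written fragment that mimics the page's markup (the fragment is reconstructed from the pattern itself, not captured from the live site):

import re

html = ('<td data-v-68e330ae><div class="ranking" data-v-68e330ae>\n'
        '    1\n'
        '</div></td>')
rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>(\n\s*?\d*\s*?)<\/div><\/td>', html)
print([r.strip() for r in rank])  # ['1']

The capture group deliberately admits the newline and spaces the page renders around the number; forgetting them is exactly the pitfall described in the reflections below.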

3. Store the extracted fields in ulist: fillUnivList(ulist, html)

import re

def fillUnivList(ulist, html):
    try:
        # Capture rank, tier, school name and total score; each group keeps
        # the surrounding newline/whitespace, stripped off below.
        rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>(\n\s*?\d*\s*?)<\/div><\/td>', html)
        level = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<!----><\/td>', html)
        name = re.findall(r'class="name-cn" data-v-b80b4d60>(.*?)<\/a>', html)
        score = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<\/td>', html)

        for i in range(len(rank)):
            Rank = rank[i].strip()
            Level = level[i].strip()
            Name = name[i].strip()
            Score = score[i].strip()
            ulist.append([Rank, Level, Name, Score])

    except Exception as err:
        print(err)

4. Print the list: printUnivList(ulist, 20)

def printUnivList(ulist, num):
    # When mixing Chinese and Western text, pad the Chinese columns with the
    # fullwidth space chr(12288) so they line up; it is passed as argument 4
    # and used as the fill character in the format specs.
    tplt = "{0:^6}\t{1:{4}^6}\t{2:{4}^10}\t{3:^6}"
    print(tplt.format("2020 Rank", "Tier", "School", "Total Score", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
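
Putting the three pieces together, a minimal driver might look like this; the URL is an assumption about the 2020 main-ranking page, since the post does not state it:

def main():
    url = 'https://www.shanghairanking.cn/rankings/bcur/2020'  # assumed target page
    ulist = []
    html = getHTMLTextUrllib(url)
    fillUnivList(ulist, html)
    printUnivList(ulist, 20)

main()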

5. Output

6. Reflections

  • I am still not fluent with regular expressions; when matching the content of <div class="ranking" data-v-68e330ae=""> 1 </div> I overlooked the newline and spaces around the number and wasted a lot of time...
  • I need more practice writing regular expressions.

II. Assignment ②

  • Requirement: use the requests and Beautiful Soup library methods to scrape the real-time AQI report from the data service site.
  • Output form:
No.   City   AQI   PM2.5   SO2   NO2   CO    Primary Pollutant
1     北京    55    6       5     1.0   225   -
2     ....   ....  ....    ....  ....  ....  ....

1. Fetch the page source: getHTMLText(url)

import requests

def getHTMLText(url):
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()                 # raise on 4xx/5xx responses
        resp.encoding = resp.apparent_encoding  # guess the charset from the body
        return resp.text
    except Exception:
        return 'request failed'
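
For reference, raise_for_status() is what routes HTTP error codes into the except branch, and apparent_encoding re-guesses the charset from the body, which matters for GBK-encoded Chinese pages. A small sketch of the failure path (httpbin.org is used purely as an assumed test endpoint):

import requests

try:
    resp = requests.get('https://httpbin.org/status/404', timeout=30)
    resp.raise_for_status()  # raises requests.exceptions.HTTPError on a 404
except requests.exceptions.HTTPError as err:
    print('request failed:', err)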

2. Extract the data with Beautiful Soup

import bs4
from bs4 import BeautifulSoup

# Store the table rows from the HTML page in ulist.
def fillAQIList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    i = 1  # running row number
    for tr in soup.find('tbody').children:
        # children also yields the whitespace between tags as plain strings,
        # so keep only real <tr> tags (this check needs the bs4 module).
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            # Append one AQI record; tds[3] and tds[7] are columns the
            # assignment does not ask for, so they are skipped.
            ulist.append([str(i), tds[0].text.strip(), tds[1].text.strip(),
                          tds[2].text.strip(), tds[4].text.strip(),
                          tds[5].text.strip(), tds[6].text.strip(),
                          tds[8].text.strip()])
            i += 1
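
Why the isinstance check matters: .children yields the newlines between tags as NavigableString nodes, not just Tag objects. A minimal demonstration on a hand-written snippet:

import bs4
from bs4 import BeautifulSoup

snippet = "<tbody>\n<tr><td>北京</td></tr>\n</tbody>"
soup = BeautifulSoup(snippet, "html.parser")
for child in soup.find("tbody").children:
    print(type(child).__name__)  # NavigableString, Tag, NavigableString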

3. Print the list: printAQIList(ulist, num)

def printAQIList(ulist, num):
    # Pad the Chinese columns (city, primary pollutant) with the fullwidth
    # space chr(12288), passed as argument 8 and used as the fill character.
    tplt = "{0:^4}\t{1:{8}^4}\t{2:^4}\t{3:^6}\t{4:^4}\t{5:^4}\t{6:^4}\t{7:{8}^8}"
    print(tplt.format("No.", "City", "AQI", "PM2.5", "SO2", "NO2", "CO",
                      "Primary Pollutant", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
                          chr(12288)))
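
The {1:{8}^4}-style format spec takes its fill character from argument 8, the fullwidth space. A one-line illustration:

# chr(12288) is the CJK fullwidth space: the same visual width as a Chinese
# character, so padded Chinese columns stay aligned.
print("{0:{1}^10}".format("北京", chr(12288)))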

4. Output

5. Reflections

  • Having already written a similar crawler before, I finished this assignment quickly and ran into no difficulties.

III. Assignment ③

  • Requirement: use urllib.request (or requests) together with re to crawl all images on a given page, the Fuzhou University news site
  • Output form: save all jpg files from the chosen page into a single folder

1. Fetch the page source: getHTMLText(url)

  • requests version:

import requests

def getHTMLText(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
    }
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception as err:
        return err
  • urllib.request version:

import urllib.request

def getHTMLTextUrllib(url):
    # Same browser User-Agent as the requests version above.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/94.0.4606.61 Safari/537.36'}
    try:
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        data = resp.read().decode()
        return data
    except Exception as err:
        return err

2. Build a regular expression to match jpg images

reg = r'<img src="/(.*?)\.jpg"'
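
A quick look at what the pattern captures, using a hypothetical <img> tag shaped like the news page's markup (the path follows the /attach/... form quoted in the error log below):

import re

html = '<img src="/attach/2021/09/26/433747.jpg" alt="news">'
reg = r'<img src="/(.*?)\.jpg"'
print(re.findall(reg, html))  # ['attach/2021/09/26/433747']

The leading slash stays outside the capture, which is why SavePics below re-prefixes the site root when rebuilding the full URL.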

3. Save the images to a folder: SavePics(html)

import re
import requests

def SavePics(html):
    reg = r'<img src="/(.*?)\.jpg"'
    img_list = re.findall(reg, html)
    i = 0  # image counter
    for imgurl in img_list:
        i += 1
        # Rebuild the absolute URL from the captured relative path.
        imgurl = 'http://news.fzu.edu.cn/' + imgurl + '.jpg'
        print(i, imgurl)
        # Download the image with requests.
        try:
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/News/' + 'image_' + str(i) + '.jpg'  # save path
            with open(file_path, 'wb') as f:  # image data is binary, so write in 'wb' mode
                f.write(response.content)
                print('success')
        except Exception as err:
            print(err)
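
A minimal driver tying the fetcher and the downloader together; the news-site URL is taken from the absolute URL built inside SavePics:

def main():
    url = 'http://news.fzu.edu.cn/'
    html = getHTMLText(url)
    SavePics(html)

main()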

4. Output


5. Reflections

  • I hit the following error while coding:
1 http://attach/2021/09/26/433747.jpg
HTTPConnectionPool(host='attach', port=80): Max retries exceeded with url: /2021/09/26/433747.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D1D22DB50>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
  • After several failed attempts at a fix, I copied the image URL into the browser, which also refused to load it; changing the URL to http://news.fzu.edu.cn/attach/2021/09/26/433747.jpg solved the problem (a more general fix is sketched after this list)...
  • I am still not proficient with regular expressions and should practice more after class to deepen my understanding.
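
A sketch of a more general fix: resolving each captured src against the page URL with urllib.parse.urljoin handles absolute and relative paths alike, so a path like attach/... can never be mistaken for a hostname:

from urllib.parse import urljoin

page_url = 'http://news.fzu.edu.cn/'
for src in ['/attach/2021/09/26/433747.jpg', 'attach/2021/09/26/433747.jpg']:
    print(urljoin(page_url, src))
# both resolve to http://news.fzu.edu.cn/attach/2021/09/26/433747.jpg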

Code repository: https://gitee.com/sevennnn/internet-worm

posted @ 2021-09-29 15:56  Sevennnn