Python Web Crawler: Scraping Bilibili's Original Videos and Anime Videos

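I. Background of the Topic

Why was this topic chosen, and what is the expected goal of the data analysis? (10 points) Describe it in terms of society, economy, technology, data sources, and so on (within 200 characters).

Reason for the topic: A crawler is a program that automatically fetches information from the internet and extracts what is valuable to us. I chose this topic because, as informatization advances, the big-data era demands more and more data collection, and the processing load grows with it. As a result, crawler-related jobs are increasing, so learning this course well lays a solid foundation for future employment. Among today's video sites, Bilibili has a comparatively young audience, which makes it worth crawling: it gives a closer look at what young people currently like to watch.

Expected goals: become proficient at scraping web page information; clean, de-duplicate, and process the stored data; store it persistently in an updatable form; produce simple visualizations of the data; and finally, assuming a customer request, provide the corresponding data quickly and conveniently.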

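II. Design of the Topic-Focused Web Crawler (10 points)

1. Name of the topic-focused web crawler

A program that scrapes information about Bilibili's original videos and anime videos and processes the results

2. Content scraped and analysis of data characteristics

Content: the ranking of popular original videos on Bilibili (video title, rank, play count, danmaku count, uploader's screen name, video URL, uploader's space URL); the ranking of popular anime on Bilibili (rank, anime title, play count, danmaku count, latest episode, anime URL).

Data characteristics analysis: bar charts for the top ten entries (video title vs. play count, video rank vs. danmaku count, anime title vs. play count, anime rank vs. danmaku count).

3. Overview of the design (approach and technical difficulties)

Approach:

1. Scrape the content and data from Bilibili and analyze it

2. Clean the data and compute statistics

3. Store the data in a MySQL database

Technical difficulties: locating the tags and attributes that hold each piece of information on the page; writing custom functions with def; cleaning and de-duplicating the data stored in the CSV files; converting specific fields to integers (rank, play count, danmaku count); fixing up URLs (for example, adding "https://" and removing the redundant "//"); and learning to use the sklearn machine-learning library and the selenium library.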
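The URL fix-up listed among the technical difficulties can be sketched as a small helper. This is only an illustration under assumptions: the name `normalize_url` is hypothetical and does not appear in the original program, which strips the slashes inline instead.

```python
def normalize_url(href: str) -> str:
    """Turn a protocol-relative or whitespace-padded href into an absolute https URL."""
    href = href.strip()                      # drop surrounding spaces and newlines
    if href.startswith("//"):                # protocol-relative link, e.g. //www.bilibili.com/...
        return "https:" + href
    if href.startswith(("http://", "https://")):
        return href                          # already absolute, leave untouched
    return "https://" + href.lstrip("/")     # bare host/path, prepend the scheme


print(normalize_url("//www.bilibili.com/video/BV1op4y1X7N2"))
```

Keeping the fix in one function means the same rule is applied to both the video URL and the uploader's space URL.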

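III. Structural Analysis of the Target Pages (10 points)

This project scrapes two ranking pages on the same site, Bilibili's original-video ranking and its anime ranking. Their URLs are "https://www.bilibili.com/v/popular/rank/all" and "https://www.bilibili.com/v/popular/rank/bangumi".

Scheme: https

Host: www.bilibili.com

Path: /v/popular/rank/all (original videos) and /v/popular/rank/bangumi (anime)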
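The scheme, host, and path breakdown can be checked with the standard library. The sketch below is an illustration only; the names `RANK_URLS` and `describe` are hypothetical and not part of the original program.

```python
from urllib.parse import urlsplit

# The two ranking pages named in this report
RANK_URLS = [
    "https://www.bilibili.com/v/popular/rank/all",
    "https://www.bilibili.com/v/popular/rank/bangumi",
]


def describe(url):
    """Return the (scheme, host, path) triple of a URL."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


for url in RANK_URLS:
    print(describe(url))
```

Both URLs share the scheme and host and differ only in the last path segment.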

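The root of each target page is the <html> element.

The <head> of both ranking pages contains five kinds of tags: <meta>, <title>, <script>, <link>, and <style>. The <head> element defines the document header and is the container for all header elements. (Figure attached.)

1. HTML page parsing

This project mainly parses the <body>, which contains four kinds of tags: <svg>, <div>, <script>, and <style>. After locating the data, I determined that the content to scrape sits in the <li ... class="rank-item"> tags inside <div id="app">.

The following code fetches everything in the <li ... class="rank-item"> tags: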

import requests
from bs4 import BeautifulSoup

url = 'https://www.bilibili.com/v/popular/rank/all'
bdata = requests.get(url).text
soup = BeautifulSoup(bdata, 'html.parser')
items = soup.find_all('li', {'class': 'rank-item'})  # extract the ranking entries
print(items)

2. Finding and traversing nodes (tags) (draw the node tree when necessary)

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.bilibili.com/v/popular/rank/all')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')

# Traversal:
print(soup.contents)       # child nodes of the whole tag tree
print(soup.body.contents)  # child nodes of the body tag
print(soup.head)           # the head tag

# Lookup:
print(soup.title)            # look up a tag, here the title tag
print(soup.li['class'])      # look up an attribute by tag name, here the class of the first li tag
print(soup.find_all('li'))   # look up elements by tag name, here every li tag

Node tree diagram:

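IV. Crawler Program Design (60 points)

The crawler program must include each of the parts below, with source code and reasonably detailed comments, and a screenshot of the output after each part.

1. Data scraping and collection

① Obtaining the bvid URL

② Obtaining the aid

③ Scraping the pages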

# Import libraries
import requests
from bs4 import BeautifulSoup
import csv
import datetime
import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from selenium import webdriver
from time import sleep
import matplotlib

url = 'https://www.bilibili.com/v/popular/rank/all'
# Send the request
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

# Record type for one entry of the original-video ranking
class Video:
    def __init__(self, rank, title, visit, barrage, up_id, url, space):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.up_id = up_id
        self.url = url
        self.space = space

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.up_id, self.url, self.space]

    @staticmethod
    def csv_title():
        return ['排名', '标题', '播放量', '弹幕量', 'Up_ID', 'URL', '作者空间']

# Extract the ranking entries
items = soup.find_all('li', {'class': 'rank-item'})
# List of extracted Video objects
videos = []
for itm in items:
    title = itm.find('a', {'class': 'title'}).text               # video title
    rank = itm.find('i', {'class': 'num'}).text                  # rank
    visit = itm.find_all('span')[3].text                         # play count
    barrage = itm.find_all('span')[4].text                       # danmaku count
    up_id = itm.find('span', {'class': 'data-box up-name'}).text  # uploader name
    url = itm.find_all('a')[1].get('href')                       # video URL
    space = itm.find_all('a')[2].get('href')                     # uploader's space URL
    v = Video(rank, title, visit, barrage, up_id, url, space)
    videos.append(v)

# Timestamp suffix for the file name
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
# File name
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
# Write the data to the file
with open(file_name1, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
114 
url = 'https://www.bilibili.com/v/popular/rank/bangumi'
# Send the request
response = requests.get(url)
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

# Record type for one entry of the anime ranking (redefines Video with different fields)
class Video:
    def __init__(self, rank, title, visit, barrage, new_word, url):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.new_word = new_word
        self.url = url

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.new_word, self.url]

    @staticmethod
    def csv_title():
        return ['排名', '标题', '播放量', '弹幕量', '更新话数至', 'URL']

# Extract the ranking entries
items = soup.find_all('li', {'class': 'rank-item'})
# List of extracted Video objects
videos = []
for itm in items:
    rank = itm.find('i', {'class': 'num'}).text               # rank
    title = itm.find('a', {'class': 'title'}).text            # anime title
    url = itm.find_all('a')[0].get('href')                    # anime URL
    visit = itm.find_all('span')[2].text                      # play count
    barrage = itm.find_all('span')[3].text                    # danmaku count
    new_word = itm.find('span', {'class': 'data-box'}).text   # latest episode
    v = Video(rank, title, visit, barrage, new_word, url)
    videos.append(v)

# Timestamp suffix for the file name
now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
# File name
file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
# Write the data to the file
with open(file_name2, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

④ Cleaning the data

# Import libraries
import pandas as pd

file_name1 = '哔哩哔哩视频top100_20211215_154744.csv'
file_name2 = '哔哩哔哩番剧top50_20211215_154745.csv'

# Load the scraped data for cleaning and processing
paiming1 = pd.read_csv(file_name1, encoding="utf_8_sig")
paiming2 = pd.read_csv(file_name2, encoding="utf_8_sig")
print(paiming1.head())
print(paiming2.head())

# Look for duplicate rows
print(paiming1.duplicated())
print(paiming2.duplicated())

# Look for null and missing values
print(paiming1['标题'].isnull().value_counts())
print(paiming2['标题'].isnull().value_counts())
print(paiming1['URL'].isnull().value_counts())
print(paiming2['URL'].isnull().value_counts())
print(paiming1['播放量'].isnull().value_counts())
print(paiming2['播放量'].isnull().value_counts())
print(paiming1['弹幕量'].isnull().value_counts())
print(paiming2['弹幕量'].isnull().value_counts())

2. Storing the data in a MySQL database

① Scraping the site

import re
import time

import requests
from bs4 import BeautifulSoup

# Scrape Bilibili's daily ranking
def BilibiliNews():
    newsList = []
    # Spoof the request header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'}
    res = requests.get('https://www.bilibili.com/ranking/all/0/0/3', headers=headers)  # request the page
    soup = BeautifulSoup(res.text, 'html.parser')  # parse the page
    result = soup.find_all(class_='rank-item')  # tags that hold the ranking entries
    startTime = time.strftime("%Y-%m-%d", time.localtime())  # record the scrape time
    for i in result:
        try:
            num = int(i.find(class_='num').text)  # current rank
            con = i.find(class_='content')
            title = con.find(class_='title').text  # title
            detail = con.find(class_='detail').find_all(class_='data-box')
            play = detail[0].text.strip()  # play count
            view = detail[1].text.strip()  # danmaku count
            # Both fields can look like "15.5万", so convert them to the equivalent integer for storage
            if play[-1] == '万':
                play = int(float(play[:-1]) * 10000)
            if view[-1] == '万':
                view = int(float(view[:-1]) * 10000)
            # Guard against entries whose counts are not displayed
            if view == '--':
                view = 0
            if play == '--':
                play = 0
            author = detail[2].text  # uploader
            url = con.find(class_='title')['href']  # video link
            BV = re.findall(r'https://www.bilibili.com/video/(.*)', url)[0]  # extract the BV number with a regular expression
            pts = int(con.find(class_='pts').find('div').text)  # overall score of the video
            newsList.append([num, title, author, play, view, BV, pts, startTime])  # add the row to the list
        except Exception:
            # Skip entries that do not match the expected layout
            continue
    return newsList  # return the list of rows
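The count conversion above can also be pulled out into one function that covers both the 万 (ten thousand) and 亿 (hundred million) suffixes. This is a minimal sketch under assumptions: the name `parse_count` and the `UNITS` table are hypothetical, and `Decimal` is used here only to avoid floating-point rounding; the original code multiplies floats instead.

```python
from decimal import Decimal

# Multipliers for the Chinese unit suffixes used on the ranking page
UNITS = {"万": 10_000, "亿": 100_000_000}


def parse_count(text: str) -> int:
    """Convert '15.5万', '1.2亿', '987' or '--' into an integer."""
    text = text.strip()
    if text in ("", "--"):           # count hidden or missing on the page
        return 0
    unit = UNITS.get(text[-1])
    if unit is None:                 # plain digits, no suffix
        return int(text)
    return int(Decimal(text[:-1]) * unit)


for sample in ["15.5万", "155万", "1.2亿", "--", " 987\n"]:
    print(repr(sample), parse_count(sample))
```

Because the suffix is multiplied rather than replaced by a string of zeros, values with no decimal point (such as "155万") are converted correctly as well.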

② Creating the table

mysql> create table BILIBILI(
    -> NUM INT,
    -> TITLE CHAR(80),
    -> UP CHAR(20),
    -> VIEW INT,
    -> COMMENT INT,
    -> BV_NUMBER CHAR(20),
    -> SCORE INT,
    -> EXECUTION_TIME DATETIME);

③ Inserting the data into MySQL

import time

import pymysql

def GetMessageInMySQL():
    # Connect to the database
    db = pymysql.connect(host="cdb-cdjhisi3hih.cd.tencentcdb.com", port=10056, user="root", password="xxxxxx", database="weixinNews", charset='utf8')
    cursor = db.cursor()  # create a cursor
    news = BilibiliNews()  # fetch the ranking rows with BilibiliNews()
    # Insert statement; the column order matches the rows returned by BilibiliNews()
    sql = "INSERT INTO BILIBILI(NUM,TITLE,UP,`VIEW`,`COMMENT`,BV_NUMBER,SCORE,EXECUTION_TIME) VALUES (%s,%s,%s,%s,%s,%s,%s,%s)"
    timebegin = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())  # record the start time to help trace errors
    try:
        # Run the statement; executemany inserts the rows in a batch
        cursor.executemany(sql, news)
        # Commit to the database
        db.commit()
        print(timebegin + "成功！")
    except Exception:
        # Roll back if an error occurs
        db.rollback()
        print(timebegin + "失败！")
    # Close the cursor
    cursor.close()
    # Close the database connection
    db.close()
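The batch-insert-with-rollback pattern can be tried without a MySQL server. The sketch below swaps in the standard-library `sqlite3` module purely as a stand-in, so it is not the program's actual storage code: the table `ranking`, its columns, the function `save_rows`, and the sample rows are all hypothetical, and `sqlite3` uses `?` placeholders where `pymysql` uses `%s`.

```python
import sqlite3

INSERT_SQL = "INSERT INTO ranking(num, title, plays, bv, scraped_at) VALUES (?, ?, ?, ?, ?)"


def save_rows(conn, rows):
    """Insert all rows as one batch; undo the whole batch if any row fails."""
    try:
        conn.executemany(INSERT_SQL, rows)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranking(num INTEGER PRIMARY KEY, title TEXT, plays INTEGER, bv TEXT, scraped_at TEXT)"
)

# Made-up sample rows, only for the demonstration
good = [(1, "video A", 155000, "BV1op4y1X7N2", "2021-12-15"),
        (2, "video B", 98000, "BV-placeholder", "2021-12-15")]
bad = [(3, "video C", 10, "BV-placeholder", "2021-12-15"),
       (3, "video D", 20, "BV-placeholder", "2021-12-15")]   # duplicate primary key

ok1 = save_rows(conn, good)
ok2 = save_rows(conn, bad)
count = conn.execute("SELECT COUNT(*) FROM ranking").fetchone()[0]
print(ok1, ok2, count)
```

The second batch fails on its second row, and the rollback also discards its first row, so only the two rows of the first batch remain.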

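④ Scheduled scraping with schedule

import time

import schedule

# startFunction is not shown in this report; it is expected to run the scrape-and-insert step
# Record when the program starts
time1 = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
print("开始爬取信息,程序正常执行：" + time1)
# Run the job every 20 minutes
schedule.every(20).minutes.do(startFunction)
# Poll the schedule and run any job that is due
while True:
    schedule.run_pending()
    time.sleep(1)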

3. Developing the server side with Flask

① Keyword extraction with jieba and an ECharts word cloud

from collections import Counter
from pyecharts import WordCloud  # import path of the older pyecharts 0.5.x releases
import jieba.analyse

# Split the (word, count) pairs into two lists
def counter2list(counter):
    keyList, valueList = [], []
    for c in counter:
        keyList.append(c[0])
        valueList.append(c[1])
    return keyList, valueList

# Extract keywords with jieba and accumulate their weights
def extractTag(content, tagsList):
    if content:
        tags = jieba.analyse.extract_tags(content, topK=100, withWeight=True)
        for tex, weight in tags:
            tagsList[tex] += int(weight * 10000)

def drawWorldCloud(content, count):
    outputFile = './测试词云.html'
    cloud = WordCloud('词云图', width=1000, height=600, title_pos='center')
    cloud.add(
        ' ', content, count,
        shape='circle',
        background_color='white',
        max_words=200
    )
    cloud.render(outputFile)

if __name__ == '__main__':
    c = Counter()  # container for the accumulated weights
    filePath = './新建文本文档.txt'  # path of the document to analyze
    with open(filePath, encoding='utf-8') as file_object:
        contents = file_object.read()
        extractTag(contents, c)
    contentList, countList = counter2list(c.most_common(200))
    drawWorldCloud(contentList, countList)
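The weight accumulation and the split into two parallel lists can be tried on their own, without jieba or pyecharts installed. In this sketch the weighted tags are hand-written stand-ins for what a keyword extractor might return, and the names `add_weights` and `split_top` are hypothetical.

```python
from collections import Counter


def add_weights(counter, weighted_tags, scale=10000):
    """Accumulate (word, weight) pairs into the counter as scaled integers."""
    for word, weight in weighted_tags:
        counter[word] += int(weight * scale)


def split_top(counter, n):
    """Return the n heaviest words and their counts as two parallel lists."""
    top = counter.most_common(n)
    if not top:
        return [], []
    words, counts = zip(*top)
    return list(words), list(counts)


c = Counter()
add_weights(c, [("动漫", 0.5), ("视频", 0.25)])   # made-up weights for illustration
add_weights(c, [("动漫", 0.25)])
words, counts = split_top(c, 2)
print(words, counts)
```

Using `zip(*top)` replaces the explicit loop that builds the two lists, and the empty-counter guard avoids unpacking an empty result.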

② Request parameters received by Flask

# Inside a Flask view function that handles a submitted form
username = request.form.get("username")
password = request.form.get("password", type=str, default=None)
cpuCount = request.form.get("cpuCount", type=int, default=None)
memorySize = request.form.get("memorySize", type=int, default=None)

③ Scraping comments by BV number

# -*- coding: utf-8 -*-
from urllib.request import urlopen, Request
from http.client import HTTPResponse
from bs4 import BeautifulSoup
import gzip
import json


def get_all_comments_by_bv(bv: str, time_order=False) -> tuple:
    """
    Given a Bilibili BV number, return the comment list of that video (including replies to comments).
    :param bv: the video's BV number
    :param time_order: whether to return comments in time order; by default they are ordered by popularity
    :return: a 3-tuple: the list of all comments (replies are kept in their original dict form),
             the video's AV number (string), and the number of comments actually collected (including replies)
    """
    video_url = 'https://www.bilibili.com/video/' + bv
    headers = {
        'Host': 'www.bilibili.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Accept-Encoding': 'gzip',  # only gzip, because the body is decompressed with gzip below
        'Connection': 'keep-alive',
        'Cookie': '',
        'Upgrade-Insecure-Requests': '1',
        'Cache-Control': 'max-age=0',
        'TE': 'Trailers',
    }
    rep = Request(url=video_url, headers=headers)  # fetch the page
    html_response = urlopen(rep)  # type: HTTPResponse
    html_content = gzip.decompress(html_response.read()).decode(encoding='utf-8')
    bs = BeautifulSoup(markup=html_content, features='html.parser')
    comment_meta = bs.find(name='meta', attrs={'itemprop': 'commentCount'})
    av_meta = bs.find(name='meta', attrs={'property': 'og:url'})
    comment_count = int(comment_meta.attrs['content'])  # total number of comments
    av_number = av_meta.attrs['content'].split('av')[-1][:-1]  # AV number
    print(f'视频 {bv} 的AV号是 {av_number} ，元数据中显示本视频共有 {comment_count} 条评论（包括评论的评论）。')

    page_num = 1
    replies_count = 0
    res = []
    while True:
        # Sort by time: type=1&sort=0
        # Sort by popularity: type=1&sort=2
        comment_url = (f'https://api.bilibili.com/x/v2/reply?pn={page_num}&type=1&oid={av_number}'
                       f'&sort={0 if time_order else 2}')
        comment_response = urlopen(comment_url)  # type: HTTPResponse
        comments = json.loads(comment_response.read().decode('utf-8'))  # type: dict
        comments = comments.get('data').get('replies')  # type: list
        if comments is None:
            break
        replies_count += len(comments)
        for c in comments:  # type: dict
            if c.get('replies'):
                rp_id = c.get('rpid')
                rp_num = 10
                rp_page = 1
                while True:  # fetch the replies under this comment
                    reply_url = (f'https://api.bilibili.com/x/v2/reply/reply?'
                                 f'type=1&pn={rp_page}&oid={av_number}&ps={rp_num}&root={rp_id}')
                    reply_response = urlopen(reply_url)  # type: HTTPResponse
                    reply_reply = json.loads(reply_response.read().decode('utf-8'))  # type: dict
                    reply_reply = reply_reply.get('data').get('replies')  # type: list
                    if reply_reply is None:
                        break
                    replies_count += len(reply_reply)
                    for r in reply_reply:  # type: dict
                        res.append(r)
                    if len(reply_reply) < rp_num:
                        break
                    rp_page += 1
            # Keep every top-level comment, with or without replies
            c.pop('replies', None)
            res.append(c)
        if replies_count >= comment_count:
            break
        page_num += 1

    print(f'实际获取视频 {bv} 的评论总共 {replies_count} 条。')
    return res, av_number, replies_count


if __name__ == '__main__':
    cts, av, cnt = get_all_comments_by_bv('BV1op4y1X7N2')
    for i in cts:
        print(i.get('content').get('message'))
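The page-by-page loop in the function above follows a general pattern: request page 1, 2, 3 and so on until a page comes back empty or enough items have been collected. A minimal offline sketch of that pattern is shown here; `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake fetcher merely stands in for a network call.

```python
def collect_pages(fetch_page, expected_total=None):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty or the expected total is reached."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page)
        if not batch:                 # None or empty list: no more pages
            break
        items.extend(batch)
        if expected_total is not None and len(items) >= expected_total:
            break                     # collected as many items as the metadata announced
        page += 1
    return items


# Stand-in for the remote API: three pages of made-up items
FAKE_PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page):
    return FAKE_PAGES.get(page)


all_items = collect_pages(fake_fetch)
first_four = collect_pages(fake_fetch, expected_total=4)
print(all_items, first_four)
```

Passing the fetcher in as a parameter lets the stopping conditions be checked without touching the network.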

2. Data analysis and visualization (for example bar charts, histograms, scatter plots, box plots, distribution plots)

# Data analysis and visualization
filename1 = file_name1
filename2 = file_name2

# Bar colors shared by all charts
colors = ["red", "yellow", "green", "blue", "black", "gold", "pink", "purple", "violet", "chocolate"]

# Use a font that can render the Chinese titles and labels
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']

def to_int(text):
    # Convert a count such as "15.5万" or "1.2亿" to an integer
    text = text.strip()
    if text.endswith('亿'):
        return round(float(text[:-1]) * 100000000)
    if text.endswith('万'):
        return round(float(text[:-1]) * 10000)
    return int(text)

with open(filename1, encoding="utf_8_sig") as f1:
    # Create the reader by passing the open file object to csv.reader()
    reader1 = csv.reader(f1)
    # next() is called once, so it returns only the first line: the header row
    header_row1 = next(reader1)
    print(header_row1)
    # Print the index of each column header
    for index, column_header in enumerate(header_row1):
        print(index, column_header)
    # Empty lists, one per column
    title1 = []
    rank1 = []
    highs1 = []
    url1 = []
    visit1 = []
    space1 = []
    up_id1 = []
    for row in reader1:
        rank1.append(row[0])
        title1.append(row[1])
        visit1.append(row[2].strip())
        highs1.append(row[3].strip())
        up_id1.append(row[4].strip())
        url1.append(row[5].strip().strip('/'))
        space1.append(row[6].strip().strip('/'))
    visit_list_new1 = list(map(to_int, visit1))
    highs_list_new1 = list(map(to_int, highs1))
    print(highs_list_new1)

# Top 10 videos: rank (x axis) vs. danmaku count (y axis)
x = np.array(rank1[0:10])
y = np.array(highs_list_new1[0:10])
# Bar chart with one color per bar and a fixed bar width
plt.bar(x, y, color=colors, width=0.5)
plt.show()

# Top 10 videos: title (x axis) vs. play count (y axis)
x = np.array(title1[0:10])
y = np.array(visit_list_new1[0:10])
plt.bar(x, y, color=colors, width=0.5)
plt.show()

# Same chart on a larger canvas, with title, axis labels, and grid, saved to disk
fig = plt.figure(figsize=(15, 8))
plt.title("各视频播放量")
plt.xlabel("视频名称")
plt.ylabel("播放量")
plt.grid(True)
plt.bar(x, y, color=colors, width=0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")

with open(filename2, encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index, column_header in enumerate(header_row2):
        print(index, column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))
    print(highs2)
    visit_list_new2 = list(map(to_int, visit2))
    highs_list_new2 = list(map(to_int, highs2))
    print(highs_list_new2)

# Top 10 anime: rank (x axis) vs. danmaku count (y axis)
x = np.array(rank2[0:10])
y = np.array(highs_list_new2[0:10])
plt.bar(x, y, color=colors, width=0.5)
plt.show()

# Top 10 anime: title (x axis) vs. play count (y axis)
x = np.array(title2[0:10])
y = np.array(visit_list_new2[0:10])
plt.bar(x, y, color=colors, width=0.5)
plt.show()

# Same chart on a larger canvas, with title, axis labels, and grid, saved to disk
fig = plt.figure(figsize=(15, 8))
plt.title("番剧播放量")
plt.xlabel("番剧名称")
plt.ylabel("播放量")
plt.grid(True)
plt.bar(x, y, color=colors, width=0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")

3. Based on the relationships in the data, compute the correlation coefficient between two variables, draw a scatter plot, and fit a regression equation between them (simple or multiple).

import csv

import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

file_name2 = '哔哩哔哩番剧top50_20211215_154745.csv'
filename2 = file_name2

def to_int(text):
    # Convert a count such as "15.5万" or "1.2亿" to an integer
    text = text.strip()
    if text.endswith('亿'):
        return round(float(text[:-1]) * 100000000)
    if text.endswith('万'):
        return round(float(text[:-1]) * 10000)
    return int(text)

with open(filename2, encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)
    print(header_row2)
    for index, column_header in enumerate(header_row2):
        print(index, column_header)
    visit2 = []
    highs2 = []
    for row in reader2:
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
    print(highs2)
    visit_list_new2 = list(map(to_int, visit2))
    highs_list_new2 = list(map(to_int, highs2))

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(highs_list_new2, visit_list_new2))

# Build the data set from the top 10 entries
examDict = {'弹幕量': highs_list_new2[0:10],
            '播放量': visit_list_new2[0:10]}
# Convert it to a DataFrame
examDf = DataFrame(examDict)

# Scatter plot of danmaku count against play count
plt.scatter(examDf.弹幕量, examDf.播放量, color='b', label="Ranking data")
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
plt.show()

# Correlation coefficients
rDf = examDf.corr()
print(rDf)

exam_X = examDf.弹幕量  # feature
exam_Y = examDf.播放量  # label

# Split the data into a training set and a test set
# X_train / X_test are the training / test features, Y_train / Y_test the training / test labels,
# and train_size is the share of the data used for training
X_train, X_test, Y_train, Y_test = train_test_split(exam_X, exam_Y, train_size=.8)

print("原始数据特征:", exam_X.shape,
      ",训练数据特征:", X_train.shape,
      ",测试数据特征:", X_test.shape)
print("原始数据标签:", exam_Y.shape,
      ",训练数据标签:", Y_train.shape,
      ",测试数据标签:", Y_test.shape)

# Scatter plot of the split
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
plt.savefig("tests.jpg")
plt.show()

model = LinearRegression()
# The model needs a two-dimensional feature array, but there is only one feature here,
# so reshape(-1, 1) turns the series into a single column; -1 lets the row count be inferred
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
model.fit(X_train, Y_train)

a = model.intercept_  # intercept
b = model.coef_       # regression coefficient
print("最佳拟合线:截距", a, ",回归系数：", b)

# Predictions on the training data
y_train_pred = model.predict(X_train)
# Draw the best-fit line from the training predictions
plt.plot(X_train, y_train_pred, color='black', linewidth=3, label="best line")
# Test data as a scatter plot
plt.scatter(X_test, Y_test, color='red', label="test data")
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Play count")
plt.savefig("lines.jpg")
plt.show()

# Coefficient of determination on the test set
score = model.score(X_test, Y_test)
print(score)
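The correlation coefficient and the fitted line can also be computed by hand, which makes explicit what `corr()` and `LinearRegression` return for a single feature. The sketch below uses only the standard library; the function names and the four sample points are made up for illustration and are not scraped data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up points lying exactly on plays = 2 + 10 * danmaku
danmaku = [1, 2, 3, 4]
plays = [12, 22, 32, 42]

r = pearson_r(danmaku, plays)
intercept, slope = fit_line(danmaku, plays)
print(r, intercept, slope)
```

For points on a perfect line the coefficient is 1 and the fit recovers the intercept and slope; real ranking data would scatter around the line and give a coefficient below 1.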

4. Data persistence

# Excerpts from the code above: the CSV files and chart images written to disk
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
with open(file_name1, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
with open(file_name2, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")    # save the video chart
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the anime chart

5. Complete program: all the parts above combined

import requests
from bs4 import BeautifulSoup
import csv
import datetime
import pandas as pd
import numpy as np
from matplotlib import rcParams
import matplotlib.pyplot as plt
import matplotlib.font_manager as font_manager
from selenium import webdriver
from time import sleep
import matplotlib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from pandas import DataFrame, Series

url = 'https://www.bilibili.com/v/popular/rank/all'
response = requests.get(url)  # send the request
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

class Video:  # record type for one entry of the original-video ranking
    def __init__(self, rank, title, visit, barrage, up_id, url, space):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.up_id = up_id
        self.url = url
        self.space = space

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.up_id, self.url, self.space]

    @staticmethod
    def csv_title():
        return ['排名', '标题', '播放量', '弹幕量', 'Up_ID', 'URL', '作者空间']

items = soup.find_all('li', {'class': 'rank-item'})  # extract the ranking entries
videos = []  # list of extracted Video objects
for itm in items:
    title = itm.find('a', {'class': 'title'}).text               # video title
    rank = itm.find('i', {'class': 'num'}).text                  # rank
    visit = itm.find_all('span')[3].text                         # play count
    barrage = itm.find_all('span')[4].text                       # danmaku count
    up_id = itm.find('span', {'class': 'data-box up-name'}).text  # uploader name
    url = itm.find_all('a')[1].get('href')                       # video URL
    space = itm.find_all('a')[2].get('href')                     # uploader's space URL
    v = Video(rank, title, visit, barrage, up_id, url, space)
    videos.append(v)

now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name1 = f'哔哩哔哩视频top100_{now_str}.csv'
with open(file_name1, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())

url = 'https://www.bilibili.com/v/popular/rank/bangumi'
response = requests.get(url)  # send the request
html_text = response.text
soup = BeautifulSoup(html_text, 'html.parser')

class Video:  # record type for one entry of the anime ranking (redefines Video)
    def __init__(self, rank, title, visit, barrage, new_word, url):
        self.rank = rank
        self.title = title
        self.visit = visit
        self.barrage = barrage
        self.new_word = new_word
        self.url = url

    def to_csv(self):
        return [self.rank, self.title, self.visit, self.barrage, self.new_word, self.url]

    @staticmethod
    def csv_title():
        return ['排名', '标题', '播放量', '弹幕量', '更新话数至', 'URL']

items = soup.find_all('li', {'class': 'rank-item'})  # extract the ranking entries
videos = []  # list of extracted Video objects
for itm in items:
    rank = itm.find('i', {'class': 'num'}).text               # rank
    title = itm.find('a', {'class': 'title'}).text            # anime title
    url = itm.find_all('a')[0].get('href')                    # anime URL
    visit = itm.find_all('span')[2].text                      # play count
    barrage = itm.find_all('span')[3].text                    # danmaku count
    new_word = itm.find('span', {'class': 'data-box'}).text   # latest episode
    v = Video(rank, title, visit, barrage, new_word, url)
    videos.append(v)

now_str = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
file_name2 = f'哔哩哔哩番剧top50_{now_str}.csv'
with open(file_name2, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(Video.csv_title())
    for v in videos:
        writer.writerow(v.to_csv())
174 
# Load the scraped data for cleaning and processing
paiming1 = pd.read_csv(file_name1, encoding="utf_8_sig")
paiming2 = pd.read_csv(file_name2, encoding="utf_8_sig")
print(paiming1.head())
print(paiming2.head())
print(paiming1.duplicated())  # look for duplicate rows
print(paiming2.duplicated())
print(paiming1['标题'].isnull().value_counts())  # look for null and missing values
print(paiming2['标题'].isnull().value_counts())
print(paiming1['URL'].isnull().value_counts())
print(paiming2['URL'].isnull().value_counts())
print(paiming1['播放量'].isnull().value_counts())
print(paiming2['播放量'].isnull().value_counts())
print(paiming1['弹幕量'].isnull().value_counts())
print(paiming2['弹幕量'].isnull().value_counts())
202 
# Data analysis and visualization
filename1 = file_name1
filename2 = file_name2
with open(filename1,encoding="utf_8_sig") as f1:
    reader1 = csv.reader(f1)  # csv.reader wraps the open file object
    header_row1 = next(reader1)  # the first next() call returns the header row
    print(header_row1)
    for index,column_header in enumerate(header_row1):  # show the index of each column
        print(index,column_header)
    title1 = []
    rank1 = []
    highs1 = []
    url1 = []
    visit1 = []
    space1 = []
    up_id1 = []
    for row in reader1:
        rank1.append(row[0])
        title1.append(row[1])
        visit1.append(row[2].strip())
        highs1.append(row[3].strip())
        up_id1.append(row[4].strip())
        url1.append(row[5].strip().strip('/'))  # drop the leading '//'
        space1.append(row[6].strip().strip('/'))
# Convert counts such as '12.3万' to integers by textual replacement.
# NOTE: this is only correct when a value with a unit has exactly one decimal digit.
visit1 = [v.replace('万','000').replace('.','') for v in visit1]
visit_list_new1 = list(map(int, visit1))
highs1 = [h.replace('万','000').replace('.','') for h in highs1]
highs_list_new1 = list(map(int, highs1))
print(highs_list_new1)

# Use a font with Chinese glyphs so that the video titles on the x axis render correctly
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']
bar_colors = ["red","yellow","green","blue","black","gold","pink","purple","violet","Chocolate"]  # one color per bar

# Bar chart: rank against danmaku count for the top 10 videos
x = np.array(rank1[0:10])  # x axis: rank
y = np.array(highs_list_new1[0:10])  # y axis: danmaku count
plt.bar(x,y,color=bar_colors,width=0.5)
plt.show()

# Bar chart: title against view count for the top 10 videos
x = np.array(title1[0:10])  # x axis: video title
y = np.array(visit_list_new1[0:10])  # y axis: view count
plt.bar(x,y,color=bar_colors,width=0.5)
plt.show()

# The same chart on a larger canvas, with a title and axis labels, saved to disk
fig = plt.figure(figsize=(15,8))  # canvas size
plt.title("Views per video")  # main title
plt.xlabel("Video title")  # x-axis and y-axis labels
plt.ylabel("Views")
plt.grid(True)  # show grid lines
plt.bar(x,y,color=bar_colors,width=0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-up-v.png")  # save the figure

with open(filename2,encoding="utf_8_sig") as f2:
    reader2 = csv.reader(f2)
    header_row2 = next(reader2)  # header row
    print(header_row2)
    for index,column_header in enumerate(header_row2):
        print(index,column_header)
    rank2 = []
    title2 = []
    highs2 = []
    url2 = []
    visit2 = []
    new_word2 = []
    for row in reader2:
        rank2.append(row[0])
        title2.append(row[1])
        visit2.append(row[2].strip())
        highs2.append(row[3].strip())
        new_word2.append(row[4])
        url2.append(row[5].strip().strip('/'))  # drop the leading '//'
print(highs2)
# Convert counts such as '12.3万' or '1.2亿' to integers by textual replacement.
# NOTE: this is only correct when a value with a unit has exactly one decimal digit.
visit2 = [v.replace('万','000').replace('亿','0000000').replace('.','') for v in visit2]
visit_list_new2 = list(map(int, visit2))
highs2 = [h.replace('万','000').replace('.','') for h in highs2]
highs_list_new2 = list(map(int, highs2))
print(highs_list_new2)

# Use a font with Chinese glyphs so that the anime titles on the x axis render correctly
matplotlib.rcParams['font.sans-serif'] = ['KaiTi']
bar_colors = ["red","yellow","green","blue","black","gold","pink","purple","violet","Chocolate"]  # one color per bar

# Bar chart: rank against danmaku count for the top 10 anime
x = np.array(rank2[0:10])  # x axis: rank
y = np.array(highs_list_new2[0:10])  # y axis: danmaku count
plt.bar(x,y,color=bar_colors,width=0.5)
plt.show()

# Bar chart: title against view count for the top 10 anime
x = np.array(title2[0:10])  # x axis: anime title
y = np.array(visit_list_new2[0:10])  # y axis: view count
plt.bar(x,y,color=bar_colors,width=0.5)
plt.show()

# The same chart on a larger canvas, with a title and axis labels, saved to disk
fig = plt.figure(figsize=(15,8))  # canvas size
plt.title("Anime view counts")  # main title
plt.xlabel("Anime title")  # x-axis and y-axis labels
plt.ylabel("Views")
plt.grid(True)  # show grid lines
plt.bar(x,y,color=bar_colors,width=0.6)
plt.savefig(r"C:\Users\24390\Desktop\bilibili-draw-v.png")  # save the figure

# Save the (danmaku count, view count) pairs of the anime ranking
with open('output.csv','w',newline='',encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows(zip(highs_list_new2,visit_list_new2))

# Build the data set from the top 10 anime
examDict = {'弹幕量':highs_list_new2[0:10],  # danmaku count
            '播放量':visit_list_new2[0:10]}  # view count
# Convert it to a DataFrame
examDf = pd.DataFrame(examDict)
# Scatter plot: view count on the x axis, danmaku count on the y axis
plt.scatter(examDf.播放量,examDf.弹幕量,color='b',label="top-10 anime")
plt.xlabel("Views")
plt.ylabel("Danmaku count")
plt.show()
# Correlation matrix of the two columns
rDf = examDf.corr()
print(rDf)

exam_X = examDf.弹幕量  # feature: danmaku count
exam_Y = examDf.播放量  # label: view count
# Split the data set into a training set and a test set.
# X_train/X_test are the training/test features, Y_train/Y_test the matching labels;
# train_size is the fraction of samples used for training.
X_train,X_test,Y_train,Y_test = train_test_split(exam_X,exam_Y,train_size=.8)

print("Original features:",exam_X.shape,
      ", training features:",X_train.shape,
      ", test features:",X_test.shape)

print("Original labels:",exam_Y.shape,
      ", training labels:",Y_train.shape,
      ", test labels:",Y_test.shape)

# Scatter plot of the training and test samples
plt.scatter(X_train, Y_train, color="blue", label="train data")
plt.scatter(X_test, Y_test, color="red", label="test data")

# Legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Views")
# Save and show the figure
plt.savefig("tests.jpg")
plt.show()

model = LinearRegression()

# scikit-learn expects a 2-D feature array, so calling model.fit(X_train,Y_train)
# directly on the 1-D Series raises an error. There is only one feature here,
# so reshape(-1,1) turns it into a single column; -1 lets NumPy infer the number of rows.
X_train = X_train.values.reshape(-1,1)
X_test = X_test.values.reshape(-1,1)

model.fit(X_train,Y_train)
a = model.intercept_  # intercept
b = model.coef_  # regression coefficient
print("Best-fit line: intercept",a,", regression coefficient:",b)

# Predictions on the training data
y_train_pred = model.predict(X_train)
# Draw the best-fit line through the training predictions
plt.plot(X_train, y_train_pred, color='black', linewidth=3, label="best line")

# Scatter plot of the test data
plt.scatter(X_test, Y_test, color='red', label="test data")

# Legend and axis labels
plt.legend(loc=2)
plt.xlabel("Danmaku count")
plt.ylabel("Views")
# Save and show the figure
plt.savefig("lines.jpg")
plt.show()

score = model.score(X_test,Y_test)  # coefficient of determination (R^2) on the test set
print(score)

print(title1[1],title2[1])
print("Would you like to watch an uploader (UP) video, watch an anime, or look up an uploader's space page?\nEnter 1 for an uploader video, 2 for an anime, 3 for an uploader's space page.")
z = int(input())
if z == 2:
    print(title2)
    print('Enter the anime you want to watch:')
    name = input()
    if name in title2:
        i = title2.index(name)  # position of the first exact match
        print(i)
        print(url2[i])
        to_url2 = url2[i]
        d = webdriver.Chrome()  # start Chrome and keep the driver in d
        d.get('https://'+to_url2)  # open the page in the current window
        sleep(2)
    else:
        print('No matching title was found')
elif z == 1:
    print(title1)
    print('Enter the uploader video you want to watch:')
    name = input()
    if name in title1:
        i = title1.index(name)  # position of the first exact match
        print(i)
        print(url1[i])
        to_url1 = url1[i]
        d = webdriver.Chrome()  # start Chrome and keep the driver in d
        d.get('https://'+to_url1)  # open the page in the current window
        sleep(2)
    else:
        print('No matching title was found')
elif z == 3:
    print(up_id1)
    print("Enter the uploader whose space page you want to look up:")
    name = input()
    if name in up_id1:
        i = up_id1.index(name)  # position of the first exact match
        print(i)
        print(space1[i])
        to_space11 = space1[i]
        d = webdriver.Chrome()  # start Chrome and keep the driver in d
        d.get('https://'+to_space11)  # open the page in the current window
        sleep(2)
    else:
        print('No matching uploader was found')
else:
    print('Invalid input')
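
The count conversion in the program above replaces the unit with zeros and deletes the decimal point, which only gives the right number when the value has exactly one decimal digit (for example "12万" would become 12000). A minimal sketch of a more general conversion follows. It assumes the page only uses the units 万 (10^4) and 亿 (10^8); the helper name `parse_count` and the `UNITS` table are hypothetical and not part of the original program.

```python
from decimal import Decimal

# Multipliers for the Chinese magnitude suffixes (assumed to be the only ones used)
UNITS = {'万': 10**4, '亿': 10**8}

def parse_count(text):
    """Convert a display count such as '12.3万' or '1.25亿' to an int."""
    text = text.strip()
    multiplier = 1
    if text and text[-1] in UNITS:
        multiplier = UNITS[text[-1]]
        text = text[:-1]
    # Decimal avoids binary floating-point rounding when scaling the value
    return int(Decimal(text) * multiplier)

print(parse_count('12.3万'))   # 123000
print(parse_count('12万'))     # 120000
print(parse_count('1.25亿'))   # 125000000
print(parse_count(' 9876\n'))  # 9876
```

Applied to the raw strings read from the CSV, a helper like this could stand in for the string replacement, for example `[parse_count(v) for v in visit1]`.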

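III. Summary

1. What conclusions can be drawn from the analysis and visualization of the topic data? Were the expected goals achieved?

Conclusion: What impressed me most in this course project was that, whenever I ran into a problem, I could look up the cause of the bug online and resolve it. A course project can also be combined with machine learning and other areas: the scatter plots and bar charts drawn here are not confined to the crawler topic but also involve an application of machine learning, which showed me how broad the knowledge behind a project topic is and how widely it applies.

Goals: First, I need to master the basic steps of a web crawler, namely sending requests and storing the results. Collecting information, extracting it, and visualizing it is also what I will focus on learning next. Persisting the data reduces how often the scraped data has to be cleaned and processed. This project made me realize that I need a deeper understanding of Python in order to spot my weak points quickly and work on them, and to keep improving my Python skills.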

posted @ 2021-12-30 13:52 林鑫颖
