Crawling and Analyzing the SMZDM (什么值得买) Recommendation Site

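1. Background

SMZDM ("什么值得买", "what is worth buying") is a consumer-content community that combines shopping guide, media, tool, and community functions. It uses high-quality consumer content to introduce cost-effective, well-reviewed goods and services, and aims to give users efficient, accurate, neutral, and professional support for purchase decisions. This project crawls and analyzes the home appliance category of SMZDM.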

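2. Design of the topic-focused web crawler

  2.1 Crawler name

     SMZDM home appliance crawler

  2.2 Content to crawl and data characteristics

     Content crawled: appliance name, price, selling platform, and short description

     Data characteristics: web page text

  2.3 Design overview (approach and technical difficulties)

  Approach:

     Inspect the page to find where the target content sits, then send a GET request with the requests library to fetch the page data;

     extract the data and iterate over it; store the data by writing it out in a batch with open() and a for loop.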
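The storage step can be sketched with the standard library's csv module, which quotes any field that contains a comma (product descriptions often do). This is only a minimal sketch under assumptions: the helper name save_rows and the sample row are hypothetical and not part of the original crawler.

```python
import csv
import io

FIELDS = ["name", "price", "info", "platform"]


def save_rows(stream, rows):
    """Write a header line plus one CSV line per row dict to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical sample row; a real run would pass rows scraped from the page.
sample = [{"name": "Air conditioner, 1.5 HP", "price": "1999",
           "info": "wall-mounted, heating", "platform": "JD"}]
buffer = io.StringIO()
save_rows(buffer, sample)
csv_text = buffer.getvalue()
```

When writing to a real file instead of an in-memory buffer, the csv documentation recommends opening it with newline='' so that line endings are not doubled on Windows.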

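   Technical difficulties:

     Handling data exceptions, which was hard because I was not yet fluent in the basics; reading the page content, which requires learning the third-party lxml library (its etree module and XPath queries); and the overall system design, which takes some thought.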
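One recurring data exception is an XPath text query that matches nothing: the result is an empty list, and string operations on it then fail. A small guard, sketched here with a hypothetical helper name, returns a default value instead:

```python
def text_or_default(nodes, default=""):
    """Return the last non-empty, stripped string in nodes, or default if none."""
    cleaned = [n.strip() for n in nodes if n and n.strip()]
    return cleaned[-1] if cleaned else default


# Usage with plain lists standing in for XPath results.
found = text_or_default(["  Fridge A ", "\n"])
missing = text_or_default([], default="N/A")
```

Taking the last string rather than the first is a design choice for this sketch; either works as long as it is applied consistently to every field.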

3. Structural analysis of the target pages

  3.1 Crawler design

    3.1.1 Structure of the target pages

     Target URL: https://www.smzdm.com/

    

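    3.1.2 HTML page parsing

     The listing pages of the large-appliance category follow this URL pattern, where page is the page number:

                 url = 'https://www.smzdm.com/fenlei/dajiadian/p'+str(page)+'/#feed-main'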
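As a small illustration of the URL pattern, the sketch below builds the list of page URLs for the first few pages. The function name page_urls is hypothetical; the pattern itself is the one shown above.

```python
BASE = "https://www.smzdm.com/fenlei/dajiadian/p{}/#feed-main"


def page_urls(pages):
    """Return the listing URLs for pages 1 through pages (inclusive)."""
    return [BASE.format(n) for n in range(1, pages + 1)]


urls = page_urls(3)
for u in urls:
    print(u)
```

The "#feed-main" fragment is handled by the browser and is not sent to the server, so it does not affect which page is returned.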

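     The XPath of each field, where n is the position of the product in the list:

     Product name: //*[@id='feed-main-list']/li[n]/div/div[2]/h5/a

     Price: //*[@id='feed-main-list']/li[n]/div/div[2]/div[1]/a

     Description: //*[@id='feed-main-list']/li[n]/div/div[2]/div[3]

     Platform: //*[@id='feed-main-list']/li[n]/div/div[2]/div[4]/div[2]/span/a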

    

 

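    3.1.3 Locating nodes

    Use the browser's Inspect Element tool to locate the target element, then copy its XPath.

    Traversal: a for loop pulls out the content that each XPath locates.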
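The traversal idea can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and a simplified, hypothetical snippet in place of the real page markup. Real pages are rarely well-formed XML, so an HTML-tolerant parser such as lxml is still needed in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the product list markup.
SNIPPET = """
<ul id="feed-main-list">
  <li><h5><a>Fridge A</a></h5><div class="price">2999</div></li>
  <li><h5><a>TV B</a></h5><div class="price">1899</div></li>
  <li><h5><a>Washer C</a></h5></li>
</ul>
"""

root = ET.fromstring(SNIPPET)
rows = []
for li in root.findall("./li"):  # one iteration per product card
    # Paths are relative to the current <li>, so no position counter is needed.
    name = li.findtext("./h5/a", default="")
    price = li.findtext("./div[@class='price']", default="")
    rows.append((name, price))
```

Looping over the list items and querying relative to each one avoids hard-coding how many products a page holds, and a missing field yields an empty string instead of an error.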

4. Crawler program design

   4.1 Data crawling and collection

     Code:

import csv
import random
import re
import time

import requests
from lxml import etree


# Pool of User-Agent strings; one is picked at random for each request
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
]
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Connection': 'keep-alive',
    'cookie': '__ckguid=6Le4K3rwaxuugb5yVhv92q; __jsluid_s=fd7f40803fa69a378c424c4e06af60a7; device_id=19645643031640395328534362a4734d47f9281a0183984955fef55b1d; Hm_lvt_9b7ac3d38f30fe89ff0b8a0546904e58=1640395329,1640483005; Hm_lpvt_9b7ac3d38f30fe89ff0b8a0546904e58=1640483005;',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2'
}

# Create Jiadian.csv and write the header row once
with open("Jiadian.csv", "w", encoding='utf-8', newline='') as file:
    csv.writer(file).writerow(["name", "price", "info", "platform"])


def last_text(nodes):
    """Return the last XPath text result, stripped, or '' when nothing matched."""
    return nodes[-1].strip() if nodes else ''


def Jiadian(page):
    page = int(page)
    for page_no in range(1, page + 1):
        # Request the listing page, using the loop's page number and a fresh User-Agent
        headers['User-Agent'] = random.choice(USER_AGENTS)
        url = 'https://www.smzdm.com/fenlei/dajiadian/p' + str(page_no) + '/#feed-main'
        try:
            res = requests.get(url, headers=headers, timeout=3)
        except requests.RequestException as exc:
            print('Request failed for', url, ':', exc)
            continue
        res.encoding = 'utf-8'
        html = etree.HTML(res.text)
        if html is None:
            continue
        # Each <li> under #feed-main-list is one product card
        items = html.xpath("//*[@id='feed-main-list']/li")
        with open("Jiadian.csv", "a", encoding='utf-8', newline='') as f2:
            writer = csv.writer(f2)
            for item in items:
                # Appliance name, price, short description (info), selling platform
                name = last_text(item.xpath("./div/div[2]/h5/a/text()"))
                price_text = last_text(item.xpath("./div/div[2]/div[1]/a/text()"))
                # Keep only the number, dropping suffixes such as "元包邮(需用券)"
                match = re.search(r"\d+(?:\.\d+)?", price_text)
                price = match.group() if match else ''
                info = last_text(item.xpath("./div/div[2]/div[3]/text()"))
                platform = last_text(item.xpath("./div/div[2]/div[4]/div[2]/span/a/text()"))
                if not name:
                    continue  # skip cards that do not match the expected layout
                # Save the row to Jiadian.csv in the same order as the header
                writer.writerow([name, price, info, platform])
                print(name, '\n', 'Price:', price, '\n', 'Description:', info, '\n', 'Platform:', platform, '\n')
        # Pause between pages to avoid getting the IP banned
        time.sleep(3)


if __name__ == '__main__':
    page = input("Number of pages to crawl: ")
    Jiadian(page)
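Price strings on the page carry suffixes such as "元包邮 (需用券)". Note that str.strip(chars) treats its argument as a set of characters to trim from both ends, not as a literal suffix, so a regular expression is a more predictable tool for this step. The following is a sketch of a price parser that can be tested offline; the function name and the sample strings are hypothetical.

```python
import re

# One number, optionally with thousands separators and a decimal part.
_NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def parse_price(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


examples = ["1999元包邮(需用券)", "1,299.5元(包邮、", " 89.9元包邮 (需用券)", "暂无报价"]
parsed = [parse_price(s) for s in examples]
```

Returning None for strings without a number lets the analysis step drop such rows explicitly instead of carrying an empty string into a numeric column.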

 

     Crawler run demonstration:

  

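   4.2 Data analysis

     Loading the data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Jiadian.xlsx holds the crawl results (presumably Jiadian.csv saved as a spreadsheet)
Jiadian = pd.read_excel('Jiadian.xlsx')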

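  Data cleaning:

# Drop duplicate products
Jiadian = Jiadian.drop_duplicates('name')
# Treat blank strings as missing values
Jiadian = Jiadian.replace(r'^\s*$', np.nan, regex=True)
# Drop rows that contain missing values
Jiadian = Jiadian.dropna(axis=0)
# Drop the unused 'platform' column
Jiadian = Jiadian.drop(['platform'], axis=1)
print(Jiadian)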

    Classifying products by platform:

# Split the data by platform. In this data set the 'inform' column is the one
# that carries the platform name, so the keyword match is done on it.
# DataFrame.append was removed in pandas 2.0, so boolean masks are used instead.
inform = Jiadian['inform'].astype(str)
京东 = Jiadian[inform.str.contains("京东", regex=False)].reset_index(drop=True)  # JD.com
小米有品 = Jiadian[inform.str.contains("小米有品", regex=False)].reset_index(drop=True)  # Xiaomi Youpin
天猫精选 = Jiadian[inform.str.contains("天猫精选", regex=False)].reset_index(drop=True)  # Tmall
顺电网上商城 = Jiadian[inform.str.contains("顺电网上商城", regex=False)].reset_index(drop=True)  # Sundan online store
苏宁易购 = Jiadian[inform.str.contains("苏宁易购", regex=False)].reset_index(drop=True)  # Suning
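The classification rule (the first matching platform keyword wins, and each row lands in at most one group) can be checked without pandas. The sketch below is a dependency-free illustration with hypothetical rows, not the analysis code itself.

```python
PLATFORMS = ["京东", "小米有品", "天猫精选", "顺电网上商城", "苏宁易购"]


def classify(rows, key="inform"):
    """Group row dicts by the first platform keyword found in row[key]."""
    groups = {name: [] for name in PLATFORMS}
    for row in rows:
        for name in PLATFORMS:
            if name in row[key]:
                groups[name].append(row)
                break  # first match wins, like an if/elif chain
    return groups


# Hypothetical rows standing in for the cleaned data.
rows = [
    {"name": "TV A", "price": 1899, "inform": "京东"},
    {"name": "Heater B", "price": 499, "inform": "天猫精选"},
    {"name": "Fridge C", "price": 2999, "inform": "京东"},
    {"name": "Fan D", "price": 99, "inform": "拼多多"},
]
groups = classify(rows)
```

Rows whose platform is not in the list (the last sample row here) are simply left out of every group, which is worth keeping in mind when comparing group sizes with the total row count.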

 

  

 

  

  

  4.3 Data visualization

   Line charts:

# JD.com prices, sorted in descending order
data = 京东.sort_values('price', ascending=False)
x = data['name']
y = data['price']
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese product names
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10, 4), dpi=80)
plt.plot(x, y, 's-', color='plum', label="Price")  # 's-': square markers joined by a line
plt.xticks(rotation=90, fontsize=12)
plt.legend(loc="best")
plt.title("JD.com appliance price trend", fontsize=18)
plt.ylabel("Price (yuan)", fontsize=12)
plt.show()

# All SMZDM prices, sorted in descending order
data = Jiadian.sort_values('price', ascending=False)
x = data['name']
y = data['price']
fig = plt.figure(figsize=(10, 4), dpi=80)
plt.plot(x, y, 's-', color='plum', label="Price")
plt.xticks(rotation=90, fontsize=12)
plt.legend(loc="best")
plt.title("SMZDM appliance price trend", fontsize=18)
plt.ylabel("Price (yuan)", fontsize=12)
plt.show()

 

  

   Bar charts:

# Tmall prices
x = 天猫精选['name']
y = 天猫精选['price']
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese product names
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10, 8), dpi=80)
plt.bar(x, y, alpha=0.2, width=0.6, color='b', lw=3, label="Price")
plt.xticks(rotation=90)
plt.legend(loc="best")
plt.title("Tmall appliance prices (bar chart)", fontsize=18)
plt.ylabel("Price (yuan)", fontsize=12)
plt.show()

# JD.com prices
x = 京东['name']
y = 京东['price']
fig = plt.figure(figsize=(10, 8), dpi=80)
plt.bar(x, y, alpha=0.2, width=0.6, color='b', lw=3, label="Price")
plt.xticks(rotation=90)
plt.legend(loc="best")
plt.title("JD.com appliance prices (bar chart)", fontsize=18)
plt.ylabel("Price (yuan)", fontsize=12)
plt.show()

 

  

  Horizontal bar charts:

# Tmall prices
x = 天猫精选['name']
y = 天猫精选['price']
fig = plt.figure(figsize=(16, 8), dpi=80)
plt.barh(x, y, alpha=0.2, height=0.6, color='coral', label="Price")
plt.title("Tmall appliance prices (horizontal bar chart)", fontsize=18)
plt.legend(loc="best")
plt.xticks(rotation=90)
plt.xlabel("Price (yuan)", fontsize=12)  # x-axis label
plt.show()

# JD.com prices
x = 京东['name']
y = 京东['price']
fig = plt.figure(figsize=(16, 8), dpi=80)
plt.barh(x, y, alpha=0.2, height=0.6, color='coral', label="Price")
plt.title("JD.com appliance prices (horizontal bar chart)", fontsize=18)
plt.legend(loc="best")
plt.xticks(rotation=90)
plt.xlabel("Price (yuan)", fontsize=12)  # x-axis label
plt.show()

 

  

   Scatter plots:

# Tmall prices
x = 天猫精选['name']
y = 天猫精选['price']
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese text
fig = plt.figure(figsize=(10, 6), dpi=80)
plt.scatter(x, y, color='c', marker='o', s=60, alpha=1)
plt.xticks([])  # hide the product names; there are too many to read
plt.ylabel("Price (yuan)", fontsize=12)  # y-axis label
plt.title("Tmall appliance prices (scatter plot)", fontsize=16)
plt.show()

# All SMZDM prices
x = Jiadian['name']
y = Jiadian['price']
fig = plt.figure(figsize=(10, 6), dpi=80)
plt.scatter(x, y, color='c', marker='o', s=60, alpha=1)
plt.xticks([])  # hide the product names; there are too many to read
plt.ylabel("Price (yuan)", fontsize=12)  # y-axis label
plt.title("SMZDM appliance prices (scatter plot)", fontsize=16)
plt.show()

 

  

  Box plots:

# Tmall prices
y = 天猫精选['price']
plt.boxplot(y)
plt.title("Tmall appliance prices (box plot)", fontsize=16)
plt.show()

# JD.com prices
y = 京东['price']
plt.boxplot(y)
plt.title("JD.com appliance prices (box plot)", fontsize=16)
plt.show()

# All SMZDM prices
y = Jiadian['price']
plt.boxplot(y)
plt.title("SMZDM appliance prices (box plot)", fontsize=16)
plt.show()
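To read a box plot it helps to know the numbers behind it. The sketch below computes the quartiles and the conventional 1.5 × IQR outlier fences with the standard library; the "inclusive" method uses linear interpolation and should agree with the quartiles matplotlib draws by default. The function name and the sample prices are hypothetical.

```python
import statistics


def box_stats(prices):
    """Return the quartiles and the values lying beyond the 1.5 * IQR fences."""
    q1, median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [p for p in prices if p < low or p > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical prices in yuan; one expensive item stands out.
stats = box_stats([100, 200, 300, 400, 500, 600, 7000])
print(stats)
```

In a price list like this, a single expensive appliance shows up as an outlier point above the upper whisker rather than stretching the box itself.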

 

  

   Pie chart:

# Count products per platform instead of hard-coding the numbers.
# In the original run the counts were 19, 5, 2, 2, 1
# (京东, 天猫精选, 顺电, 苏宁, 小米).
counts = Jiadian['inform'].value_counts()

plt.pie(counts,
        labels=counts.index,  # slice labels
        colors=["#d5695d", "#5d8ca8", "#65a479", "#a564c9", "#FFE4C4"],  # slice colors
       )
plt.title("Share of appliances by platform", fontsize=14)
plt.show()
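The percentages behind the pie chart can be computed directly from the per-platform counts. The sketch below uses collections.Counter and rebuilds the platform list from the counts and labels given in the pie chart code (19, 5, 2, 2, 1); the function name platform_shares is hypothetical.

```python
from collections import Counter


def platform_shares(platforms):
    """Return (platform, count, percent) tuples, most frequent first."""
    counts = Counter(platforms)
    total = sum(counts.values())
    return [(name, n, round(100 * n / total, 1)) for name, n in counts.most_common()]


# Platform list rebuilt from the counts used for the pie chart.
platforms = (["京东"] * 19 + ["天猫精选"] * 5 + ["顺电"] * 2
             + ["苏宁"] * 2 + ["小米精选"] * 1)
shares = platform_shares(platforms)
for name, n, pct in shares:
    print(name, n, pct)
```

With these counts JD.com accounts for roughly two thirds of the 29 products, which is the figure the summary's remark about JD.com rests on.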

 

  

 

   Regression (least-squares quadratic fit):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese text correctly

# Variables. Note: this assumes the data has a numeric 'Updatetime' column;
# the crawler in Section 4.1 does not collect one, so it has to be added first.
gsgm = Jiadian.loc[:, 'Updatetime'].to_numpy(dtype=float)
zprs = Jiadian.loc[:, 'price'].to_numpy(dtype=float)


# Model: a quadratic a*x^2 + b*x + c
def func(params, x):
    a, b, c = params
    return a * x * x + b * x + c


# Residuals that leastsq minimizes
def error_func(params, x, y):
    return func(params, x) - y


def main():
    plt.figure(figsize=(8, 6))
    P0 = [1, 9.0, 1]  # initial guess for a, b, c
    Para = leastsq(error_func, P0, args=(gsgm, zprs))
    a, b, c = Para[0]
    print("a=", a, "b=", b, "c=", c)
    # Plot the samples
    plt.scatter(gsgm, zprs, color="green", label="Sample data", linewidth=2)
    # Evaluate the fitted curve over the range of the data
    x = np.linspace(gsgm.min(), gsgm.max(), 100)
    y = a * x * x + b * x + c
    plt.plot(x, y, color="red", label="Fitted curve", linewidth=2)
    # Axis labels
    plt.xlabel('Updatetime')
    plt.ylabel('price')
    # Title
    plt.title("Regression of price on Updatetime")
    plt.grid()
    plt.legend()
    plt.show()


main()
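For comparison, a plain straight-line fit has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below is a dependency-free illustration with hypothetical sample points, not a replacement for the leastsq call.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The function needs at least two distinct x values; with identical x values sxx is zero and the slope is undefined.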

 

  

 

  4.4 Text analysis

   Word clouds:

# Word cloud of product names
import wordcloud as wc
import matplotlib.pyplot as plt

# Configure the word cloud
word_cloud = wc.WordCloud(
    background_color='black',
    font_path='msyhbd.ttc',  # a font file that contains Chinese glyphs
    max_font_size=300,
    random_state=50,
)
text = " ".join(Jiadian['name'].astype(str))
# Draw the word cloud
fig = plt.figure(figsize=(8, 4), dpi=80)
word_cloud.generate(text)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

# Word cloud of the 'inform' column, shaped by a background image
import numpy as np
import wordcloud as wc
import matplotlib.pyplot as plt
from PIL import Image

# Background image used as the mask
mask = np.array(Image.open("Jiadian.jpg"))
# Configure the word cloud
word_cloud = wc.WordCloud(
    mask=mask,
    background_color='black',
    font_path='msyhbd.ttc',  # a font file that contains Chinese glyphs
    max_font_size=300,
    random_state=50,
)
text = " ".join(Jiadian['inform'].astype(str))
# Draw the word cloud
fig = plt.figure(figsize=(10, 5), dpi=80)
word_cloud.generate(text)
plt.imshow(word_cloud)
plt.axis('off')
plt.show()

 

  

   4.5 Code summary

     The complete program consists of the listings in Sections 4.1 through 4.4, run in that order.

 

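5. Summary

The analysis shows that in the Tmall group the Gree wall-mounted fan heater has the highest price, possibly because it is winter and demand is high.

In the JD.com group the Midea air conditioner has the highest price; this model also has a heating function. In the overall descending price ranking, most items are wall-mounted air conditioners and LCD TVs, which suggests that these two product types are the staples of large-appliance purchases.

The pie chart shows that most of the crawled data comes from JD.com, which suggests that JD.com has quietly become the default reference point when buying appliances.

The results met expectations. At the start I was not fluent with many of the functions; for example, drawing the pie chart required classifying and counting the detailed records, and I had to consult reference material to understand how. The overall system design still ended with a small problem: when designing the regression chart I found that the data had not been prepared well enough and that I had not thought through enough aspects in advance, which made the needed values hard to extract. Even so, this project taught me a lot about HTML parsing with libraries such as BeautifulSoup and lxml, which will help a great deal with future crawler design.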

 

 

Posted 2021-12-27 18:21 by 周发发