Big Data Analysis: Drug Sales Data from an E-commerce Platform

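I. Background

　　We have grown used to buying clothes, digital products and household appliances online, but few people buy medicines online. According to a survey report by the China Online Pharmacy Council, the pharmaceutical B2C market reached 400 million yuan in 2022, and only 5 online pharmacies had sales of 50 million yuan or more. In the same year the pharmaceutical industry as a whole reached 371.8 billion yuan, so online drug sales were only a small fraction of offline pharmacy sales and still have considerable room to grow. The continuing development of big data affects every aspect of consumers' lives and also challenges companies' marketing models. Quantitative analysis of consumer data, using techniques such as correlation analysis and single-factor analysis, can uncover information that is genuinely meaningful to a company. This requires a company to revise its sales plan and find a reasonable one with limited human and material resources. For pharmaceutical companies, big data brings both risk and opportunity. A company should take its stage of development and the characteristics of its drugs into account, aim to maximize customer value, use information technology as the means, follow changes in market demand for drugs, grasp consumers' individual needs, market precisely, build effective two-way interaction with consumers, obtain their feedback promptly, integrate traditional and new media publicity resources, and choose a marketing strategy suited to its own development.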

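II. Design of the Big Data Analysis

1. Content and characteristics of the dataset

　　This dataset is a drug sales dataset from an online e-commerce platform. It has 7 fields: 购药时间 (purchase time, string), 社保卡号 (social security card number, int), 商品编码 (product code, int), 商品名称 (product name, string), 销售数量 (quantity sold, int), 应收金额 (amount receivable, int) and 实收金额 (amount received, int).

2. Overview of the analysis plan for this course project

(1) Preprocess and clean the data

(2) Analyze and visualize the data

(3) Fill missing values with a random forest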

III. Data Analysis Steps

1. Data source

　　The dataset was collected from Kaggle. Source URL:

　　https://www.kaggle.com/datasets/jack20216915/yaopin

Import the libraries

import pandas as pd
import stylecloud
from PIL import Image
from collections import Counter
from pyecharts.charts import Bar
from pyecharts.charts import Line
from pyecharts.charts import Calendar
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode
from pyecharts.globals import SymbolType

Read the dataset

df = pd.read_excel('电商平台药品销售数据.xlsx')
df.head(10)

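2. Data cleaning

　　Data cleaning is an indispensable part of any data analysis, and its quality directly affects model performance and the final conclusions. In practice, it typically takes 50% to 80% of the total analysis time.

(1) View index, data types and memory information

　　info() prints a concise summary of a DataFrame: the index dtype, the dtype of each column, the number of non-null values and the memory usage.

df.info()

(2) Count null values

　　isnull() takes no arguments; it is simply called on the df object. It returns a table of the same shape in which every cell is True or False, with NaN cells mapped to True. Chaining sum() then counts the True values in each column.

df.isnull().sum()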
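To make the mechanics concrete, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration and are not taken from the dataset). It shows the Boolean mask that isnull() returns and how sum() turns it into a per-column count.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    "购药时间": ["2016-01-01 星期五", None, "2016-01-02 星期六"],
    "实收金额": [2.8, 3.0, np.nan],
})

mask = toy.isnull()        # same shape as toy, True where a value is missing
null_counts = mask.sum()   # True counts as 1, giving the number of nulls per column
print(null_counts)
```

Because the mask has the same shape as the frame, it can also be reused to select the affected rows.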

(3) Output the rows that contain null values

　　The purchase time (购药时间) is needed in the later analysis, so rows where it is null are deleted.
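The complete listing at the end of this report performs this step with isnull() and dropna(subset=['购药时间']). The sketch below reproduces the same two calls on an invented toy frame, so the values are assumptions and only the calls mirror the report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented
df = pd.DataFrame({
    "购药时间": ["2016-01-01 星期五", None, "2016-01-02 星期六"],
    "社保卡号": [1616528.0, 1616528.0, np.nan],
    "销售数量": [3.0, 1.0, 2.0],
})

rows_with_nulls = df[df.isnull().any(axis=1)]   # rows containing at least one null
df1 = df.copy()
df1 = df1.dropna(subset=["购药时间"])           # drop only rows whose purchase time is missing
print(rows_with_nulls)
print(df1)
```

Restricting dropna() with subset keeps rows that are missing only the card number, which the next step fills instead of discarding.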

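(4) Fill 社保卡号 with "0000"

fillna() fills the missing values in a DataFrame with a specified value.

df1 = df.dropna(subset=['购药时间']).copy()  # rows without a purchase time are dropped, as described in (3)
df1['社保卡号'] = df1['社保卡号'].fillna('0000')
df1.isnull().sum()

Now there are no null values left.

(5) 社保卡号 and 商品编码 are strings of digits and should be of type str; 销售数量 should be of type int

df1['社保卡号'] = df1['社保卡号'].astype(str)
df1['商品编码'] = df1['商品编码'].astype(str)
df1['销售数量'] = df1['销售数量'].astype(int)
df1.info()
df1.head()

　　Although 社保卡号 and 商品编码 are cast to str here, they were read from the spreadsheet as float, so the resulting strings still carry a decimal point. To avoid this, the data type of each column can be specified when the file is read (note that if the data contains null values, conversion to a numeric type will fail):

df_tmp = pd.read_excel('电商平台药品销售数据.xlsx', converters={'社保卡号':str, '商品编码':str, '销售数量':int})
df_tmp.head()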
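The caveat about null values can be seen on a small invented Series. The sketch below shows the integer cast failing when a NaN is present, and two ways around it: removing the nulls first, or using pandas' nullable "Int64" dtype. The second option is a suggestion added here and is not used in the report.

```python
import numpy as np
import pandas as pd

qty = pd.Series([3.0, np.nan, 1.0], name="销售数量")   # invented values

try:
    qty.astype(int)            # NaN has no integer representation
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print("astype(int) failed:", exc)

# Option 1: drop (or fill) the nulls first, then convert
qty_clean = qty.dropna().astype(int)
# Option 2: the nullable integer dtype keeps the gap as <NA>
qty_nullable = qty.astype("Int64")
print(qty_clean.tolist(), qty_nullable.tolist())
```

The report follows option 1: it drops the rows without a purchase time before casting 销售数量 to int.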

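(6) Distribution of 销售数量, 应收金额 and 实收金额

df2 = df_tmp.copy()
df2 = df2.dropna(subset=['购药时间'])
df2['社保卡号'] = df2['社保卡号'].fillna('0000')
df2['销售数量'] = df2['销售数量'].astype(int)
df2[['销售数量','应收金额','实收金额']].describe()

The data contains negative values, which is clearly unreasonable. Let us look at the rows where they occur.

df2.loc[(df2['销售数量'] < 0)]

(7) Convert negative values to positive

abs() is the pandas Series method that returns the absolute value of each element, so negative values become positive. The last line selects any rows that are still negative and should come back empty.

df2['销售数量'] = df2['销售数量'].abs()
df2['应收金额'] = df2['应收金额'].abs()
df2['实收金额'] = df2['实收金额'].abs()
df2.loc[(df2['销售数量'] < 0) | (df2['应收金额'] < 0) | (df2['实收金额'] < 0)].sum()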
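The same check and correction can be written once for all three columns. This is a minimal sketch on invented values; like the report, it treats the negative entries as sign errors.

```python
import pandas as pd

# Invented values; the second row is negative in every amount column
toy = pd.DataFrame({
    "销售数量": [2, -1, 3],
    "应收金额": [5.6, -2.8, 9.0],
    "实收金额": [5.0, -2.5, 9.0],
})
cols = ["销售数量", "应收金额", "实收金额"]

negative_rows = toy[(toy[cols] < 0).any(axis=1)]   # rows with a negative in any of the three columns
toy[cols] = toy[cols].abs()                        # flip the signs in one call
remaining = int((toy[cols] < 0).sum().sum())       # number of negative cells left
print(negative_rows)
print(remaining)
```

Keeping negative_rows around before overwriting makes it possible to inspect what was changed.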

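(8) Column splitting (split 购药时间 into two columns)

　　str.split() splits each string on the given separator; with expand=True the parts are returned as separate columns. Here 购药时间 is split at the space into a purchase date (购药日期) and a day of the week (星期).

df3 = df2.copy()
df3[['购药日期', '星期']] = df3['购药时间'].str.split(' ', n=1, expand=True)  # n=1: split at the first space only
df3 = df3[['购药日期', '星期','社保卡号','商品编码', '商品名称', '销售数量', '应收金额', '实收金额' ]]
df3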
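A small sketch of the split on two invented strings. The "date, space, weekday" layout is an assumption inferred from how the report splits the column.

```python
import pandas as pd

# Invented sample values in the assumed "date weekday" layout
s = pd.Series(["2016-01-01 星期五", "2016-01-02 星期六"], name="购药时间")

# n=1 splits at the first space only, so exactly two columns come back
parts = s.str.split(" ", n=1, expand=True)
parts.columns = ["购药日期", "星期"]
print(parts)
```

Passing n as a keyword matters because recent pandas versions no longer accept it positionally.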

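(9) Time range of the data

　　unique() removes duplicate elements and returns the distinct values as an array, in order of first appearance.

len(df3['购药日期'].unique())
df3.groupby('购药日期').sum(numeric_only=True)

　　There are 201 purchase dates in total, ranging from 2016-01-01 to 2016-07-19.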
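The number of distinct dates and the two end points can also be read off directly once the dates are parsed. A minimal sketch on invented dates follows; it does not use the dataset, so it cannot reproduce the figure of 201.

```python
import pandas as pd

# Invented dates; the real column is df3['购药日期']
dates = pd.Series(["2016-01-01", "2016-01-01", "2016-01-03", "2016-07-19"], name="购药日期")
parsed = pd.to_datetime(dates)

n_days = parsed.nunique()                 # number of distinct purchase dates
first, last = parsed.min(), parsed.max()  # earliest and latest date
print(n_days, first.date(), last.date())
```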

3. Data visualization

　　"pyecharts is a library for generating Echarts charts. Echarts is a data visualization JS library open-sourced by Baidu. Charts produced with Echarts look great, and pyecharts bridges it with Python so that charts can be generated from data directly in Python."
　　pyecharts can render interactive charts, which look good in online reports and make the data easy to read: hovering the mouse over a chart shows values, labels and so on.

(1) Bar chart of drug sales by day of the week

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#ed1941'}], false)"""

g1 = df3.groupby('星期').sum(numeric_only=True)
x_data = list(g1.index)
y_data = g1['销售数量'].values.tolist()
b1 = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis('',y_data ,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='一周各天药品销量',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
            yaxis_opts=opts.AxisOpts(name="销量",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b1.render('一周各天药品销量柱状图.html')

　　The chart gives a clear view of drug sales for each day of the week. Daily volumes are broadly similar, with Friday and Saturday being the purchase peaks.
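One detail worth noting is that groupby() sorts its keys as strings, so weekday labels do not come out in Monday-to-Sunday order on the x-axis. The sketch below restores calendar order with reindex(). The weekday labels in the 星期一 to 星期日 form and the numbers are assumptions made for illustration.

```python
import pandas as pd

# Invented rows; labels assumed to be of the form 星期一 ... 星期日
toy = pd.DataFrame({
    "星期": ["星期五", "星期一", "星期五", "星期日", "星期三"],
    "销售数量": [2, 1, 3, 4, 1],
})
week_order = ["星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日"]

by_day = toy.groupby("星期")["销售数量"].sum()
by_day = by_day.reindex(week_order, fill_value=0)   # calendar order; days with no rows become 0
x_data = by_day.index.tolist()
y_data = by_day.tolist()
print(list(zip(x_data, y_data)))
```

The resulting x_data and y_data can be passed to add_xaxis() and add_yaxis() exactly as in the chart code.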

(2) Bar chart of the top ten drugs by sales volume

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#08519c'}], false)"""

g2 = df3.groupby('商品名称').sum(numeric_only=True).sort_values(by='销售数量', ascending=False)
x_data = list(g2.index)[:10]
y_data = g2['销售数量'].values.tolist()[:10]
b2 = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis('',y_data ,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='药品销量前十',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
            yaxis_opts=opts.AxisOpts(name="销量",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b2.render('药品销量前十柱状图.html')

　　The chart shows that drugs for hypertension and angina, such as 苯磺酸氨氯地平片 (安内真) (amlodipine besylate tablets), 开博通 (captopril) and 酒石酸美托洛尔片 (倍他乐克) (metoprolol tartrate tablets, Betaloc), are bought in relatively large quantities.

(3) Bar chart of the top ten drugs by sales revenue

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#871F78'}], false)"""

g3 = df3.groupby('商品名称').sum(numeric_only=True).sort_values(by='实收金额', ascending=False)
x_data = list(g3.index)[:10]
y_data = g3['实收金额'].values.tolist()[:10]
b3 = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis('',y_data ,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='药品销售额前十',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
            yaxis_opts=opts.AxisOpts(name="销售额",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b3.render('药品销售额前十柱状图.html')

　　The bar chart of the top ten drugs by revenue shows that 开博通 has the highest sales revenue, at 37671.

(4) Orders for each day of the week

# Set the styles
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#25BEAD'}, {offset: 1, color: '#ed1941'}], false)"""

area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#25BEAD'}, {offset: 1, color: '#3fbbff0d'}], false)"
)
# Orders for each day of the week
df_week = df3.groupby(['星期'])['实收金额'].count()
week_x_data = df_week.index.tolist()  # plain list, so the labels serialize cleanly
week_y_data = df_week.values.tolist()

line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=week_x_data)
    .add_yaxis(
        series_name="",
        y_axis=week_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="一周每天订单量",
            pos_top="2%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63", font_weight=900),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render('一周每天订单量分析图.html')

　　This line chart shows the number of orders for each day of the week.

(5) Orders by day of the month

# Orders by day of the month
df3['购药日期'] = pd.to_datetime(df3['购药日期'])
df_day = df3.groupby(df3['购药日期'].dt.day)['星期'].count()
day_x_data = [str(i) for i in list(df_day.index)]
day_y_data = df_day.values.tolist()
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=day_x_data)
    .add_yaxis(
        series_name="",
        y_axis=day_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="自然月每日订单量",
            pos_top="5%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63", font_weight=900),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render('自然月每天订单数量分析图.html')

　　The chart shows that the 5th, 15th and 25th are sales peaks, the 15th of each month most of all.

(6) Orders per month

# Orders per month
df_month = df3.groupby(df3['购药日期'].dt.month)['星期'].count()
day_x_data = [str(i)+'月' for i in list(df_month.index)]
day_y_data = df_month.values.tolist()
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=day_x_data)
    .add_yaxis(
        series_name="",
        y_axis=day_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="black"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="每月订单量",
            pos_top="2%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63", font_weight=900),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render('每月订单数量分析图.html')

　　Here we can see that January and April had more drug orders than the other months.

(7) Daily orders in May

# Daily orders in May
df_may = df3[df3['购药日期'].dt.month == 5]  # keep only May's rows
df_day = df_may.groupby(df_may['购药日期'].dt.day)['星期'].count()
# One [date, count] pair for every day in May that has orders
data = [['2016-05-%02d' % day, int(count)] for day, count in df_day.items()]
Cal = (
    Calendar(init_opts=opts.InitOpts(width="800px", height="500px"))
    .add(
        series_name="五月每日订单量分布情况",
        yaxis_data=data,
        calendar_opts=opts.CalendarOpts(
             pos_top='20%',
             pos_left='5%',
             range_="2016-05",
             cell_size=40,
             # Label styles for year, month and day
             daylabel_opts=opts.CalendarDayLabelOpts(name_map="cn",
                                                     margin=20,
                                                     label_font_size=14,
                                                     label_color='#EB1934',
                                                     label_font_weight='bold'
                                                    ),
             monthlabel_opts=opts.CalendarMonthLabelOpts(name_map="cn",
                                                         margin=20,
                                                         label_font_size=14,
                                                         label_color='#EB1934',
                                                         label_font_weight='bold',
                                                         is_show=False
                                                        ),
             yearlabel_opts=opts.CalendarYearLabelOpts(is_show=False),
        ),
        tooltip_opts=opts.TooltipOpts(formatter='{c}'),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            pos_top="2%",
            pos_left="center",
            title=""
        ),
        visualmap_opts=opts.VisualMapOpts(
            orient="horizontal",
            max_=800,
            pos_bottom='10%',
            is_piecewise=True,
            pieces=[{"min": 600},
                    {"min": 300, "max": 599},
                    {"min": 200, "max": 299},
                    {"min": 160, "max": 199},
                    {"min": 100, "max": 159},
                    {"max": 99}],
            range_color=['#ffeda0','#fed976','#fd8d3c','#fc4e2a','#e31a1c','#b10026']

        ),
        legend_opts=opts.LegendOpts(is_show=True,
                                    pos_top='5%',
                                    item_width = 50,
                                    item_height = 30,
                                    textstyle_opts=opts.TextStyleOpts(font_size=16,color='#EB1934'),
                                    legend_icon ='path://M465.621333 469.333333l-97.813333-114.133333a21.333333 21.333333 0 1 1 32.384-27.733333L512 457.856l111.786667-130.432a21.333333 21.333333 0 1 1 32.426666 27.776L558.357333 469.333333h81.493334c11.84 0 21.461333 9.472 21.461333 21.333334 0 11.776-9.6 21.333333-21.482667 21.333333H533.333333v85.333333h106.517334c11.861333 0 21.482667 9.472 21.482666 21.333334 0 11.776-9.6 21.333333-21.482666 21.333333H533.333333v127.850667c0 11.861333-9.472 21.482667-21.333333 21.482666-11.776 0-21.333333-9.578667-21.333333-21.482666V640h-106.517334A21.354667 21.354667 0 0 1 362.666667 618.666667c0-11.776 9.6-21.333333 21.482666-21.333334H490.666667v-85.333333h-106.517334A21.354667 21.354667 0 0 1 362.666667 490.666667c0-11.776 9.6-21.333333 21.482666-21.333334h81.472zM298.666667 127.957333C298.666667 104.405333 317.824 85.333333 341.12 85.333333h341.76C706.304 85.333333 725.333333 104.490667 725.333333 127.957333v42.752A42.645333 42.645333 0 0 1 682.88 213.333333H341.12C317.696 213.333333 298.666667 194.176 298.666667 170.709333V127.957333zM341.333333 170.666667h341.333334V128H341.333333v42.666667z m-105.173333-42.666667v42.666667H170.752L170.666667 895.893333 853.333333 896V170.773333L789.909333 170.666667V128h63.296C876.842667 128 896 147.072 896 170.773333v725.12C896 919.509333 877.013333 938.666667 853.333333 938.666667H170.666667a42.666667 42.666667 0 0 1-42.666667-42.773334V170.773333C128 147.157333 147.114667 128 170.752 128h65.408z'
                                   ),
    )
)
Cal.render('五月每日订单量分析图.html')

　　Here May's daily orders are singled out and plotted as a calendar heat map. As the top-ten chart in (2) showed, drugs for hypertension and angina such as 苯磺酸氨氯地平片 (安内真), 开博通 and 酒石酸美托洛尔片 (倍他乐克) are bought in relatively large quantities.

(8) Word cloud of the drug sales data

# Word cloud
g = df3.groupby('商品名称')['销售数量'].sum()
drug_list = []
for name, qty in g.items():
    drug_list += [name] * int(qty)  # repeat each drug name once per unit sold
stylecloud.gen_stylecloud(
    text=' '.join(drug_list),
    font_path=r'STXINWEI.TTF',
    palette='cartocolors.qualitative.Bold_5',  # color scheme
    icon_name='fas fa-lock',  # icon used as the mask shape
#     background_color='black',
    max_font_size=200,
    output_name='药品销量.png',
    )
Image.open("药品销量.png")

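4. Filling missing values with a random forest

　　The idea is as follows. A random forest can perform regression, so a column that contains missing values can be treated as the label. When several columns have missing values, they are filled in order of their number of missing values, fewest first, because a column with fewer gaps places lower demands on the features and is predicted more accurately. The remaining columns, together with the original label, form a new feature matrix (the original label normally has no missing values). Missing values inside this feature matrix are filled with 0 using numpy, pandas or sklearn's SimpleImputer, since 0 has relatively little effect on the data. The new label column is then split, by whether a value is present, into Ytrain and Ytest, and the feature matrix is split at the same row positions into Xtrain and Xtest. RandomForestRegressor() is trained on Xtrain and Ytrain, and its predict interface produces the final Y. Ytest itself is used only to identify the rows: the predictions are the fill-in values and simply overwrite it. When many columns have missing values, the same prediction is repeated in a loop. Values filled this way tend to be more accurate than filling with 0, the mean, the median or the mode.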

from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Use the numeric columns only: RandomForestRegressor cannot take string columns
data_copy = df.select_dtypes(include='number').reset_index(drop=True)
# Column positions ordered by number of missing values, fewest first
sindex = np.argsort(data_copy.isnull().sum().values)

# Fill the missing values with a random forest, one column at a time
for i in sindex:
    if data_copy.iloc[:, i].isnull().sum() == 0:
        continue
    fillc = data_copy.iloc[:, i]                             # column being filled (the label)
    features = data_copy.drop(columns=data_copy.columns[i])  # all other columns

    # Fill the missing values in the feature matrix with 0
    features_0 = SimpleImputer(missing_values=np.nan
                              ,strategy="constant"
                              ,fill_value=0
                              ).fit_transform(features)
    known = fillc.notnull().values   # True where the label is present
    Xtrain, Ytrain = features_0[known], fillc[known]
    Xtest = features_0[~known]

    rfr = RandomForestRegressor()
    rfr.fit(Xtrain, Ytrain)
    Ypredict = rfr.predict(Xtest)

    data_copy.loc[~known, data_copy.columns[i]] = Ypredict
data_copy.isnull().sum()

5. Complete code

import pandas as pd
import stylecloud
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from PIL import Image
from collections import Counter
from pyecharts.charts import Bar
from pyecharts.charts import Line
from pyecharts.charts import Calendar
from pyecharts import options as opts
from pyecharts.commons.utils import JsCode
from pyecharts.globals import SymbolType
df = pd.read_excel('电商平台药品销售数据.xlsx')
df.head(10)
# Take the values of the column with missing data; sklearn needs a 2-D feature matrix, so reshape(-1, 1) adds a dimension
sums = df['实收金额'].values.reshape(-1, 1)
# Fill with the mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean = imp_mean.fit_transform(sums)

df['实收金额'] = imp_mean.ravel()
df
df.info()
df.shape
df.isnull().sum()
df[df.isnull().T.any()]
df1 = df.copy()
df1 = df1.dropna(subset=['购药时间'])
df1[df1.isnull().T.any()]
df1['社保卡号'] = df1['社保卡号'].fillna('0000')
df1.isnull().sum()
df1['社保卡号'] = df1['社保卡号'].astype(str)
df1['商品编码'] = df1['商品编码'].astype(str)
df1['销售数量'] = df1['销售数量'].astype(int)
df1.info()
df1.head()
df_tmp = pd.read_excel('电商平台药品销售数据.xlsx', converters={'社保卡号':str, '商品编码':str, '销售数量':int})
df_tmp.head()
df2 = df_tmp.copy()
df2 = df2.dropna(subset=['购药时间'])
df2['社保卡号'] = df2['社保卡号'].fillna('0000')
df2['销售数量'] = df2['销售数量'].astype(int)
df2[['销售数量','应收金额','实收金额']].describe()
df2.loc[(df2['销售数量'] < 0)]
df2['销售数量'] = df2['销售数量'].abs()
df2['应收金额'] = df2['应收金额'].abs()
df2['实收金额'] = df2['实收金额'].abs()
df2.loc[(df2['销售数量'] < 0) | (df2['应收金额'] < 0) | (df2['实收金额'] < 0)].sum()
df3 = df2.copy()
df3[['购药日期', '星期']] = df3['购药时间'].str.split(' ', n=1, expand=True)
df3 = df3[['购药日期', '星期','社保卡号','商品编码', '商品名称', '销售数量', '应收金额', '实收金额' ]]
df3
len(df3['购药日期'].unique())
df3.groupby('购药日期').sum(numeric_only=True)
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#ed1941'}], false)"""

g1 = df3.groupby('星期').sum(numeric_only=True)
x_data = list(g1.index)
y_data = g1['销售数量'].values.tolist()
b1 = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis('',y_data ,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='一周各天药品销量',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
            yaxis_opts=opts.AxisOpts(name="销量",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b1.render('一周各天药品销量柱状图.html')

color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#FFFFFF'}, {offset: 1, color: '#08519c'}], false)"""

g2 = df3.groupby('商品名称').sum(numeric_only=True).sort_values(by='销售数量', ascending=False)
x_data = list(g2.index)[:10]
y_data = g2['销售数量'].values.tolist()[:10]
b2 = (
        Bar()
        .add_xaxis(x_data)
        .add_yaxis('',y_data ,itemstyle_opts=opts.ItemStyleOpts(color=JsCode(color_js)))
        .set_global_opts(title_opts=opts.TitleOpts(title='药品销量前十',pos_top='2%',pos_left = 'center'),
            legend_opts=opts.LegendOpts(is_show=False),
            xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
            yaxis_opts=opts.AxisOpts(name="销量",name_location='middle',name_gap=50,name_textstyle_opts=opts.TextStyleOpts(font_size=16)))

    )
b2.render('药品销量前十柱状图改.html')

# Set the styles
color_js = """new echarts.graphic.LinearGradient(0, 1, 0, 0,
    [{offset: 0, color: '#25BEAD'}, {offset: 1, color: '#ed1941'}], false)"""

area_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "
    "[{offset: 0, color: '#25BEAD'}, {offset: 1, color: '#3fbbff0d'}], false)"
)
# Orders for each day of the week
df_week = df3.groupby(['星期'])['实收金额'].count()
week_x_data = df_week.index.tolist()  # plain list, so the labels serialize cleanly
week_y_data = df_week.values.tolist()

line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=week_x_data)
    .add_yaxis(
        series_name="",
        y_axis=week_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="white"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="一周每天订单量",
            pos_top="2%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63", font_weight=900),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render('一周每天订单量分析图.html')

# Orders per month
df3['购药日期'] = pd.to_datetime(df3['购药日期'])  # the .dt accessor below needs a datetime column
df_month = df3.groupby(df3['购药日期'].dt.month)['星期'].count()
day_x_data = [str(i)+'月' for i in list(df_month.index)]
day_y_data = df_month.values.tolist()
line1 = (
    Line(init_opts=opts.InitOpts(bg_color=JsCode(color_js)))
    .add_xaxis(xaxis_data=day_x_data)
    .add_yaxis(
        series_name="",
        y_axis=day_y_data,
        is_smooth=True,
        is_symbol_show=True,
        symbol="circle",
        symbol_size=6,
        linestyle_opts=opts.LineStyleOpts(color="#fff"),
        label_opts=opts.LabelOpts(is_show=True, position="top", color="black"),
        itemstyle_opts=opts.ItemStyleOpts(
            color="red", border_color="#fff", border_width=3
        ),
        tooltip_opts=opts.TooltipOpts(is_show=False),
        areastyle_opts=opts.AreaStyleOpts(color=JsCode(area_color_js), opacity=1),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title="每月订单量",
            pos_top="2%",
            pos_left="center",
            title_textstyle_opts=opts.TextStyleOpts(color="#fff", font_size=16),
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            boundary_gap=True,
            axislabel_opts=opts.LabelOpts(margin=30, color="#ffffff63", font_weight=900),
            axisline_opts=opts.AxisLineOpts(is_show=False),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=25,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        yaxis_opts=opts.AxisOpts(
            type_="value",
            position="left",
            axislabel_opts=opts.LabelOpts(margin=20, color="#ffffff63"),
            axisline_opts=opts.AxisLineOpts(
                linestyle_opts=opts.LineStyleOpts(width=2, color="#fff")
            ),
            axistick_opts=opts.AxisTickOpts(
                is_show=True,
                length=15,
                linestyle_opts=opts.LineStyleOpts(color="#ffffff1f"),
            ),
            splitline_opts=opts.SplitLineOpts(
                is_show=True, linestyle_opts=opts.LineStyleOpts(color="#ffffff1f")
            ),
        ),
        legend_opts=opts.LegendOpts(is_show=False),
    )
)
line1.render('每月订单数量分析图.html')
# Daily order counts, drawn on a May 2016 calendar grid
colors = ['#C9DA36','#9ECB3C','#6DBC49','#37B44E','#3DBA78','#7D3990','#A63F98','#C31C88','#F57A34','#FA8F2F','#CF7B25','#CF7B25','#FF5733','#C70039']
# Note: this groups by day of month over all of df3. Unless df3 has already been
# restricted to May, each cell pools that day across every month in the data;
# filter df3 to May first if true May-only counts are wanted.
df_day = df3.groupby(df3['购药日期'].dt.day)['星期'].count()
day_x_data = [str(i) for i in list(df_day.index)]
day_y_data = df_day.values.tolist()
times = [x.strftime('%Y-%m-%d') for x in list(pd.date_range('20160501', '20160531'))]
# Pair each date with its count (zip stops at the shorter of the two lists)
data = [[date, count] for date, count in zip(times, day_y_data)]
Cal = (
    Calendar(init_opts=opts.InitOpts(width="800px", height="500px"))
    .add(
        series_name="五月每日订单量分布情况",
        yaxis_data=data,
        calendar_opts=opts.CalendarOpts(
             pos_top='20%',
             pos_left='5%',
             range_="2016-05",
             cell_size=40,
             # Styles for the day, month and year labels
             daylabel_opts=opts.CalendarDayLabelOpts(name_map="cn",
                                                     margin=20,
                                                     label_font_size=14,
                                                     label_color='#EB1934',
                                                     label_font_weight='bold'
                                                    ),
             monthlabel_opts=opts.CalendarMonthLabelOpts(name_map="cn",
                                                         margin=20,
                                                         label_font_size=14,
                                                         label_color='#EB1934',
                                                         label_font_weight='bold',
                                                         is_show=False
                                                        ),
             yearlabel_opts=opts.CalendarYearLabelOpts(is_show=False),
        ),
        # Pass a TooltipOpts object rather than a bare format string
        tooltip_opts=opts.TooltipOpts(formatter='{c}'),
    )
    .set_global_opts(
        title_opts=opts.TitleOpts(
            pos_top="2%",
            pos_left="center",
            title=""
        ),
        visualmap_opts=opts.VisualMapOpts(
            orient="horizontal",
            max_=800,
            pos_bottom='10%',
            is_piecewise=True,
            pieces=[{"min": 600},
                    {"min": 300, "max": 599},
                    {"min": 200, "max": 299},
                    {"min": 160, "max": 199},
                    {"min": 100, "max": 159},
                    {"max": 99}],
            range_color=['#ffeda0','#fed976','#fd8d3c','#fc4e2a','#e31a1c','#b10026']
        ),
        legend_opts=opts.LegendOpts(is_show=True,
                                    pos_top='5%',
                                    item_width = 50,
                                    item_height = 30,
                                    textstyle_opts=opts.TextStyleOpts(font_size=16,color='#EB1934'),
                                    # Custom SVG icon; the 'path://' prefix must appear only once
                                    legend_icon ='path://M465.621333 469.333333l-97.813333-114.133333a21.333333 21.333333 0 1 1 32.384-27.733333L512 457.856l111.786667-130.432a21.333333 21.333333 0 1 1 32.426666 27.776L558.357333 469.333333h81.493334c11.84 0 21.461333 9.472 21.461333 21.333334 0 11.776-9.6 21.333333-21.482667 21.333333H533.333333v85.333333h106.517334c11.861333 0 21.482667 9.472 21.482666 21.333334 0 11.776-9.6 21.333333-21.482666 21.333333H533.333333v127.850667c0 11.861333-9.472 21.482667-21.333333 21.482666-11.776 0-21.333333-9.578667-21.333333-21.482666V640h-106.517334A21.354667 21.354667 0 0 1 362.666667 618.666667c0-11.776 9.6-21.333333 21.482666-21.333334H490.666667v-85.333333h-106.517334A21.354667 21.354667 0 0 1 362.666667 490.666667c0-11.776 9.6-21.333333 21.482666-21.333334h81.472zM298.666667 127.957333C298.666667 104.405333 317.824 85.333333 341.12 85.333333h341.76C706.304 85.333333 725.333333 104.490667 725.333333 127.957333v42.752A42.645333 42.645333 0 0 1 682.88 213.333333H341.12C317.696 213.333333 298.666667 194.176 298.666667 170.709333V127.957333zM341.333333 170.666667h341.333334V128H341.333333v42.666667z m-105.173333-42.666667v42.666667H170.752L170.666667 895.893333 853.333333 896V170.773333L789.909333 170.666667V128h63.296C876.842667 128 896 147.072 896 170.773333v725.12C896 919.509333 877.013333 938.666667 853.333333 938.666667H170.666667a42.666667 42.666667 0 0 1-42.666667-42.773334V170.773333C128 147.157333 147.114667 128 170.752 128h65.408z'
                                   ),
    )
)
Cal.render('五月每日订单量分析图.html')

# Word cloud of drug names, weighted by units sold
# Sum only the quantity column; summing every column can fail on non-numeric ones
sales = df3.groupby('商品名称')['销售数量'].sum()
drug_list = []
for name, qty in sales.items():
    # Repeat each name once per unit sold so word frequency reflects sales volume
    drug_list += [name] * int(qty)
stylecloud.gen_stylecloud(
    text=' '.join(drug_list),
    font_path=r'STXINWEI.TTF',
    palette='cartocolors.qualitative.Bold_5',  # colour palette
    icon_name='fas fa-lock',  # icon used as the mask shape
    # background_color='black',
    max_font_size=200,
    output_name='药品销量.png',
)
Image.open("药品销量.png")

# Fill missing values with a random forest
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

data_copy = df.copy()
data_copy.drop(data_copy.columns[0], axis=1, inplace=True)
# The regressor needs numeric input, so keep only the numeric columns
data_copy = data_copy.select_dtypes(include='number')
# Column positions ordered from fewest to most missing values
sindex = np.argsort(data_copy.isnull().sum()).values
for i in sindex:
    if data_copy.iloc[:, i].isnull().sum() == 0:
        continue
    fillc = data_copy.iloc[:, i]  # the column being filled
    features = data_copy.iloc[:, data_copy.columns != data_copy.columns[i]]

    # Missing values in the feature matrix are temporarily filled with 0
    features_0 = SimpleImputer(missing_values=np.nan,
                               strategy="constant",
                               fill_value=0).fit_transform(features)
    # Boolean masks select rows by position, whatever the DataFrame index looks like
    missing = fillc.isnull().values
    Xtrain, Ytrain = features_0[~missing], fillc[~missing]
    Xtest = features_0[missing]

    rfc = RandomForestRegressor()
    rfc.fit(Xtrain, Ytrain)
    Ypredict = rfc.predict(Xtest)

    data_copy.loc[missing, data_copy.columns[i]] = Ypredict
data_copy.isnull().sum()
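
The listing above fills the gaps but does not show how trustworthy the filled values are. One common way to check is to hide some values that are actually known, predict them with the same kind of model, and compare. The sketch below is only an illustration under stated assumptions: the `demo` frame is synthetic (its column names, prices and the flat 10% discount are invented, not taken from the dataset), and `holdout_mae` is a hypothetical helper, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the numeric sales columns (all values are invented).
rng = np.random.default_rng(0)
n = 300
quantity = rng.integers(1, 6, size=n)
unit_price = rng.choice([5.0, 12.5, 28.0], size=n)
receivable = quantity * unit_price
paid = receivable * 0.9  # assumption: a flat 10% discount

demo = pd.DataFrame({
    "quantity": quantity.astype(float),
    "receivable": receivable,
    "paid": paid,
})


def holdout_mae(frame, target, frac=0.2, seed=0):
    """Hide a fraction of the known target values, predict them, return the MAE."""
    known = frame[frame[target].notnull()]
    hidden = known.sample(frac=frac, random_state=seed)
    train = known.drop(hidden.index)
    features = [c for c in frame.columns if c != target]
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(train[features], train[target])
    predicted = model.predict(hidden[features])
    return float(np.mean(np.abs(predicted - hidden[target].to_numpy())))


mae = holdout_mae(demo, "paid")
print(f"hold-out MAE for 'paid': {mae:.3f}")
```

A small error relative to the typical size of the column suggests the imputed values are usable; a large one suggests the other columns carry too little information to reconstruct it.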

IV. Summary

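In this big-data analysis of drug sales volume, analysing and mining the data led to the following conclusion: a time-series analysis of the sales data shows that drug sales volume fluctuates seasonally, generally running higher in spring and autumn and lower in summer and winter.

I gained a lot from completing this project. First, I learned how to carry out a big-data analysis, including how to clean data and how to use data-analysis tools for analysis and visualization. Second, I learned how to draw useful conclusions from the results and turn them into recommendations.

For future work, I suggest analysing the data in more depth, for example using regression analysis to predict trends in drug sales volume more accurately and to further optimize the sales strategy. I also suggest updating the data in real time, so that it better reflects market changes and allows timely adjustments.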
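
As a starting point for the regression analysis suggested above, a straight-line fit to monthly totals is about the simplest possible trend model. The sketch below is a hedged illustration only: the `monthly_sales` figures are made up for the example and are not results from the dataset, and a linear fit cannot capture the seasonal pattern described in this summary.

```python
import numpy as np

# Hypothetical monthly sales totals, for illustration only (not from the dataset).
months = np.arange(1, 8)  # months 1..7
monthly_sales = np.array([110.0, 125.0, 128.0, 142.0, 150.0, 158.0, 171.0])

# Least-squares fit of sales = slope * month + intercept
slope, intercept = np.polyfit(months, monthly_sales, deg=1)

# Extrapolate one month ahead
next_month = 8
forecast = slope * next_month + intercept

print(f"trend: {slope:.2f} units per month")
print(f"forecast for month {next_month}: {forecast:.1f}")
```

With real data, the monthly totals could be taken from the per-month grouping already used for the line chart, and a model with seasonal terms would likely be a better fit than a single straight line.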

posted @ 2022-12-22 22:05 serviceii
