夜的独白


Data source: Google Play Store apps dataset (extraction code: 38jk)

Analysis of Google Play Store app data

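1. Loading the data

  • Import the libraries used for the analysis
  • Before loading the data, skim the raw file in a text editor
  • Once the data is loaded, start by inspecting it with shape, head, count, describe and info
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Load the file
    # This analysis only uses 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type'
    df = pd.read_csv('./googleplaystore.csv', usecols=(0, 1, 2, 3, 4, 5, 6))
    
    # Take a quick look at the first rows
    print(df.head())
    # Number of rows and columns
    print(df.shape)
    # Number of non-null values in each column
    print(df.count())
    
    # Use describe and info to get a rough picture of the distribution
    print(df.describe())
    # info() prints its own report and returns None, hence the trailing 'None' in the output
    print(df.info())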
                                               App        Category  Rating  \
    0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
    1                                Coloring book moana  ART_AND_DESIGN     3.9   
    2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
    3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
    4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   
    
      Reviews  Size     Installs  Type  
    0     159   19M      10,000+  Free  
    1     967   14M     500,000+  Free  
    2   87510  8.7M   5,000,000+  Free  
    3  215644   25M  50,000,000+  Free  
    4     967  2.8M     100,000+  Free  
    (10841, 7)
    App         10841
    Category    10841
    Rating       9367
    Reviews     10841
    Size        10841
    Installs    10841
    Type        10840
    dtype: int64
                Rating
    count  9367.000000
    mean      4.193338
    std       0.537431
    min       1.000000
    25%       4.000000
    50%       4.300000
    75%       4.500000
    max      19.000000
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10841 entries, 0 to 10840
    Data columns (total 7 columns):
    App         10841 non-null object
    Category    10841 non-null object
    Rating      9367 non-null float64
    Reviews     10841 non-null object
    Size        10841 non-null object
    Installs    10841 non-null object
    Type        10840 non-null object
    dtypes: float64(1), object(6)
    memory usage: 592.9+ KB
    None
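  • From the output above:
  • The data has 10841 rows in total
  • Rating and Type have missing values
  • Rating contains an outlier of 19
  • The 'M' and 'k' suffixes in Size and the '+' in Installs need to be processed before the columns can be used in calculations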
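
The checks in the list above can be reproduced on a small frame. The following is a minimal sketch, assuming a toy DataFrame named `toy` with made-up values (it is not the real dataset): it counts missing values per column and flags ratings outside the 1 to 5 range seen in the valid data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real CSV (values are made up for illustration).
toy = pd.DataFrame({
    'Rating': [4.1, np.nan, 19.0, 3.9],
    'Type': ['Free', 'Paid', None, 'Free'],
})

# Missing values per column.
missing = toy.isna().sum()
print(missing)

# Ratings outside the 1-5 range. NaN rows are not flagged here,
# because comparisons with NaN evaluate to False.
out_of_range = toy[(toy['Rating'] < 1) | (toy['Rating'] > 5)]
print(out_of_range)
```

On the real data the same filter would be applied to `df['Rating']`.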

2. Data cleaning # App

  • Check whether there are duplicate values
    print(df['App'].unique().size)
    9660
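  • There are duplicates (9660 unique names in 10841 rows). Do not delete them yet: clean the columns with abnormal values first, so that deduplication does not end up keeping a row with bad values in other columns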
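
The number of surplus rows can also be counted directly. This is a minimal sketch on a made-up frame (`toy` is hypothetical): `nunique` gives the number of distinct names and `duplicated` marks every repeat after the first occurrence, so the two always add up to the row count. By the same arithmetic, 10841 rows with 9660 unique names leave 1181 repeated rows.

```python
import pandas as pd

# Made-up app names; 'A' appears three times and 'B' twice.
toy = pd.DataFrame({'App': ['A', 'B', 'A', 'C', 'B', 'A']})

n_unique = toy['App'].nunique()          # distinct names
n_extra = toy['App'].duplicated().sum()  # repeats after the first occurrence

print(n_unique, n_extra, len(toy))
```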

3. Data cleaning # Category

    print(df['Category'].value_counts(dropna=False))
    print(df[df['Category'] == '1.9'])
    FAMILY                 1972
    GAME                   1144
    TOOLS                   843
    MEDICAL                 463
    BUSINESS                460
    PRODUCTIVITY            424
    PERSONALIZATION         392
    COMMUNICATION           387
    SPORTS                  384
    LIFESTYLE               382
    FINANCE                 366
    HEALTH_AND_FITNESS      341
    PHOTOGRAPHY             335
    SOCIAL                  295
    NEWS_AND_MAGAZINES      283
    SHOPPING                260
    TRAVEL_AND_LOCAL        258
    DATING                  234
    BOOKS_AND_REFERENCE     231
    VIDEO_PLAYERS           175
    EDUCATION               156
    ENTERTAINMENT           149
    MAPS_AND_NAVIGATION     137
    FOOD_AND_DRINK          127
    HOUSE_AND_HOME           88
    AUTO_AND_VEHICLES        85
    LIBRARIES_AND_DEMO       85
    WEATHER                  82
    ART_AND_DESIGN           65
    EVENTS                   64
    COMICS                   60
    PARENTING                60
    BEAUTY                   53
    1.9                       1
    Name: Category, dtype: int64
                                               App Category  Rating Reviews  \
    10472  Life Made WI-Fi Touchscreen Photo Frame      1.9    19.0    3.0M   
    
             Size Installs Type  
    10472  1,000+     Free    0  
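  • There is one abnormal row. On inspection its Category value is missing, which shifted the remaining values one column to the left, so drop this row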
    df.drop(index=10472, inplace=True)
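
A row like this can also be found without knowing the bad value in advance. The sketch below assumes, based on the value_counts output above, that valid category labels contain only uppercase letters and underscores; the frame `toy` and its values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['a', 'b', 'c'],
    'Category': ['ART_AND_DESIGN', '1.9', 'GAME'],
})

# Assumption: a valid label is made of uppercase letters and underscores only.
is_valid = toy['Category'].str.match(r'^[A-Z_]+$')

bad_rows = toy[~is_valid]   # rows to inspect by hand
cleaned = toy[is_valid]     # rows that pass the check

print(bad_rows)
```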

4. Data cleaning # Rating

    print(df['Rating'].value_counts(dropna=False))
    NaN     1474
    4.4     1109
    4.3     1076
    4.5     1038
    4.2      952
    4.6      823
    4.1      708
    4.0      568
    4.7      499
    3.9      386
    3.8      303
    5.0      274
    3.7      239
    4.8      234
    3.6      174
    3.5      163
    3.4      128
    3.3      102
    4.9       87
    3.0       83
    3.1       69
    3.2       64
    2.9       45
    2.8       42
    2.6       25
    2.7       25
    2.5       21
    2.3       20
    2.4       19
    1.0       16
    2.2       14
    1.9       13
    2.0       12
    1.8        8
    1.7        8
    2.1        8
    1.6        4
    1.5        3
    1.4        3
    1.2        1
    Name: Rating, dtype: int64
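  • There are 1474 NaN values in total; fill them with the mean
    # Assign the result back; fillna(inplace=True) on a column selection triggers a warning in recent pandas
    df['Rating'] = df['Rating'].fillna(df['Rating'].mean())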
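
The effect of mean imputation is easy to see on a small series. This is a minimal sketch with made-up ratings: filling with the mean leaves the column mean unchanged but reduces the standard deviation, because every filled value sits exactly at the centre of the distribution.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.0, np.nan, 5.0, np.nan, 3.0])

mean_before = ratings.mean()           # NaN values are skipped
filled = ratings.fillna(mean_before)   # returns a new Series

print(filled.tolist())
print(ratings.std(), filled.std())
```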

5. Data cleaning # Reviews

    print(df['Rating'].value_counts(dropna=False))
    print(df['Reviews'].str.isnumeric().sum())
    4.193338     1474
    4.400000     1109
    4.300000     1076
    4.500000     1038
    4.200000      952
    4.600000      823
    4.100000      708
    4.000000      568
    4.700000      499
    3.900000      386
    3.800000      303
    5.000000      274
    3.700000      239
    4.800000      234
    3.600000      174
    3.500000      163
    3.400000      128
    3.300000      102
    4.900000       87
    3.000000       83
    3.100000       69
    3.200000       64
    2.900000       45
    2.800000       42
    2.700000       25
    2.600000       25
    2.500000       21
    2.300000       20
    2.400000       19
    1.000000       16
    2.200000       14
    1.900000       13
    2.000000       12
    2.100000        8
    1.800000        8
    1.700000        8
    1.600000        4
    1.400000        3
    1.500000        3
    1.200000        1
    Name: Rating, dtype: int64
    10840
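  • The Reviews values are spread over a wide range, and str.isnumeric confirms that all 10840 of them are numeric strings
  • Convert Reviews to 'i8' (64-bit integer) to simplify the later analysis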
    df['Reviews'] = df['Reviews'].astype('i8')
    print(df.describe())
                 Rating       Reviews
    count  10840.000000  1.084000e+04
    mean       4.191757  4.441529e+05
    std        0.478907  2.927761e+06
    min        1.000000  0.000000e+00
    25%        4.100000  3.800000e+01
    50%        4.200000  2.094000e+03
    75%        4.500000  5.477550e+04
    max        5.000000  7.815831e+07

6. Data cleaning # Size

    print(df['Size'].value_counts())
    Varies with device    1695
    11M                    198
    12M                    196
    14M                    194
    13M                    191
    15M                    184
    17M                    160
    19M                    154
    26M                    149
    16M                    149
    25M                    143
    20M                    139
    21M                    138
    10M                    136
    24M                    136
    18M                    133
    23M                    117
    22M                    114
    29M                    103
    27M                     97
    28M                     95
    30M                     84
    33M                     79
    3.3M                    77
    37M                     76
    35M                     72
    31M                     70
    2.9M                    69
    2.3M                    68
    2.5M                    68
                          ... 
    809k                     1
    39k                      1
    691k                     1
    241k                     1
    954k                     1
    378k                     1
    203k                     1
    887k                     1
    754k                     1
    253k                     1
    11k                      1
    787k                     1
    992k                     1
    626k                     1
    857k                     1
    54k                      1
    862k                     1
    743k                     1
    642k                     1
    234k                     1
    313k                     1
    82k                      1
    549k                     1
    400k                     1
    240k                     1
    778k                     1
    161k                     1
    478k                     1
    89k                      1
    154k                     1
    Name: Size, Length: 461, dtype: int64
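  • The values carry 'M' and 'k' suffixes that need converting, and 1695 rows contain the string 'Varies with device'
  • Replace 'Varies with device' with '0'
  • Convert Size to f8 (64-bit float)
  • Then replace the 0 values with the mean
    # Turn the suffixes into exponents so the strings parse as floats: '19M' -> '19e+6', '809k' -> '809e+3'
    df['Size'] = df['Size'].str.replace('M', 'e+6')
    df['Size'] = df['Size'].str.replace('k', 'e+3')
    # Replace the remaining non-numeric string
    df['Size'] = df['Size'].str.replace('Varies with device', '0')
    # Convert the data type
    df['Size'] = df['Size'].astype('f8')
    # Replace the 0 placeholders with the column mean (this mean still includes the zeros).
    # Assign the result back rather than calling replace(inplace=True) on a column selection.
    df['Size'] = df['Size'].replace(0, df['Size'].mean())
    df['Size']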
    0        1.900000e+07
    1        1.400000e+07
    2        8.700000e+06
    3        2.500000e+07
    4        2.800000e+06
    5        5.600000e+06
    6        1.900000e+07
    7        2.900000e+07
    8        3.300000e+07
    9        3.100000e+06
    10       2.800000e+07
    11       1.200000e+07
    12       2.000000e+07
    13       2.100000e+07
    14       3.700000e+07
    15       2.700000e+06
    16       5.500000e+06
    17       1.700000e+07
    18       3.900000e+07
    19       3.100000e+07
    20       1.400000e+07
    21       1.200000e+07
    22       4.200000e+06
    23       7.000000e+06
    24       2.300000e+07
    25       6.000000e+06
    26       2.500000e+07
    27       6.100000e+06
    28       4.600000e+06
    29       4.200000e+06
                 ...     
    10811    3.900000e+06
    10812    1.300000e+07
    10813    2.700000e+06
    10814    3.100000e+07
    10815    4.900000e+06
    10816    6.800000e+06
    10817    8.000000e+06
    10818    1.500000e+06
    10819    3.600000e+06
    10820    8.600000e+06
    10821    2.500000e+06
    10822    3.100000e+06
    10823    2.900000e+06
    10824    8.200000e+07
    10825    7.700000e+06
    10826    1.815209e+07
    10827    1.300000e+07
    10828    1.300000e+07
    10829    7.400000e+06
    10830    2.300000e+06
    10831    9.800000e+06
    10832    5.820000e+05
    10833    6.190000e+05
    10834    2.600000e+06
    10835    9.600000e+06
    10836    5.300000e+07
    10837    3.600000e+06
    10838    9.500000e+06
    10839    1.815209e+07
    10840    1.900000e+07
    Name: Size, Length: 10840, dtype: float64
    print(df.describe())
                 Rating       Reviews          Size
    count  10840.000000  1.084000e+04  1.084000e+04
    mean       4.191757  4.441529e+05  2.099045e+07
    std        0.478907  2.927761e+06  2.078345e+07
    min        1.000000  0.000000e+00  8.500000e+03
    25%        4.100000  3.800000e+01  5.900000e+06
    50%        4.200000  2.094000e+03  1.800000e+07
    75%        4.500000  5.477550e+04  2.600000e+07
    max        5.000000  7.815831e+07  1.000000e+08
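
The steps above fill the unknown sizes with a mean that still counts the 0 placeholders, which pulls the fill value down. The following is a sketch of an alternative, not the method used in this post: a hypothetical helper `parse_size` maps unknown sizes to NaN, so that `mean` skips them automatically. The sample strings are made up.

```python
import numpy as np
import pandas as pd

def parse_size(text):
    """Convert strings such as '19M' or '201k' to bytes; return NaN for anything else."""
    multipliers = {'M': 1e6, 'k': 1e3}
    suffix = text[-1]
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return np.nan

sizes = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])

parsed = sizes.map(parse_size)          # unknown sizes become NaN
filled = parsed.fillna(parsed.mean())   # mean() ignores NaN

print(filled)
```

With this variant the fill value is the mean of the known sizes only.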

7. Data cleaning # Installs

  • Look at the distribution first
    print(df['Installs'].value_counts())
    1,000,000+        1579
    10,000,000+       1252
    100,000+          1169
    10,000+           1054
    1,000+             907
    5,000,000+         752
    100+               719
    500,000+           539
    50,000+            479
    5,000+             477
    100,000,000+       409
    10+                386
    500+               330
    50,000,000+        289
    50+                205
    5+                  82
    500,000,000+        72
    1+                  67
    1,000,000,000+      58
    0+                  14
    0                    1
    Name: Installs, dtype: int64
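  • There are only a few distinct values, so strip the extra characters directly
    # '+' is a regex metacharacter, so request a literal replacement explicitly
    df['Installs'] = df['Installs'].str.replace('+', '', regex=False)
    df['Installs'] = df['Installs'].str.replace(',', '', regex=False)
  • Convert the data type to 'i8'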
    df['Installs'] = df['Installs'].astype('i8')
    print(df.describe())
                Rating       Reviews          Size      Installs
    count  10840.000000  1.084000e+04  1.084000e+04  1.084000e+04
    mean       4.191757  4.441529e+05  2.099045e+07  1.546434e+07
    std        0.478907  2.927761e+06  2.078345e+07  8.502936e+07
    min        1.000000  0.000000e+00  8.500000e+03  0.000000e+00
    25%        4.100000  3.800000e+01  5.900000e+06  1.000000e+03
    50%        4.200000  2.094000e+03  1.800000e+07  1.000000e+05
    75%        4.500000  5.477550e+04  2.600000e+07  5.000000e+06
    max        5.000000  7.815831e+07  1.000000e+08  1.000000e+09
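
The Installs conversion can be checked on a handful of strings before running it on the full column. This is a minimal sketch with sample values taken from the value_counts output above; the series name `installs` is made up. Since the original values end in '+', the resulting integers are presumably lower bounds of install buckets rather than exact counts.

```python
import pandas as pd

installs = pd.Series(['10,000+', '1,000,000+', '0+', '0'])

cleaned = (
    installs
    .str.replace('+', '', regex=False)   # literal '+', not a regex quantifier
    .str.replace(',', '', regex=False)   # drop thousands separators
    .astype('int64')
)

print(cleaned.tolist())
```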

8. Data cleaning # Type

  • info() reported a missing value in Type, so pass dropna=False to value_counts to include NaN in the counts
    print(df['Type'].value_counts(dropna=False))
    print(df[df['Type'].isnull()])
    Free    10039
    Paid      800
    NaN         1
    Name: Type, dtype: int64
                                App Category    Rating  Reviews          Size  \
    9148  Command & Conquer: Rivals   FAMILY  4.191757        0  1.815209e+07   
    
          Installs Type  
    9148         0  NaN  
    
  • Drop this row
    df.drop(index=9148, inplace=True)
  • Finally, drop the rows with duplicate App names
    df.drop_duplicates('App', inplace=True)
  • Cleaning is finished and the analysis can begin
  • Overall picture
    print(df.describe())
                Rating       Reviews          Size      Installs
    count  9658.000000  9.658000e+03  9.658000e+03  9.658000e+03
    mean      4.176046  2.166150e+05  2.011053e+07  7.778312e+06
    std       0.494383  1.831413e+06  2.040865e+07  5.376100e+07
    min       1.000000  0.000000e+00  8.500000e+03  0.000000e+00
    25%       4.000000  2.500000e+01  5.300000e+06  1.000000e+03
    50%       4.200000  9.670000e+02  1.600000e+07  1.000000e+05
    75%       4.500000  2.940800e+04  2.500000e+07  1.000000e+06
    max       5.000000  7.815831e+07  1.000000e+08  1.000000e+09
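
`drop_duplicates('App')` keeps the first occurrence of each name, whatever the other columns contain. The sketch below, on a made-up frame, contrasts that default with one possible alternative (an assumption, not what this post does): keeping the row with the most reviews for each app.

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['A', 'B', 'A', 'B'],
    'Reviews': [10, 500, 30, 200],
})

# Default behaviour: keep the first row seen for each App.
first_kept = toy.drop_duplicates('App')

# Alternative: sort so the most-reviewed row comes first, then deduplicate.
most_reviewed = (
    toy.sort_values('Reviews', ascending=False)
       .drop_duplicates('App')
       .sort_index()
)

print(first_kept)
print(most_reviewed)
```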

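9. Data analysis # Category&App

  • Number of categories
    print(df.Category.unique().size)
    33
  • Count the apps in each category and sort the result; this shows which categories are most popular with developers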
    Category_App_count = df.groupby('Category').count().sort_values('App', ascending=False)['App']
    print(Category_App_count)
    plt.figure(figsize=(20,10),dpi=80)
    Category_App_count.plot(kind='barh')
    plt.savefig('./Category_App_count.png')
    plt.show()
    Category
    FAMILY                 1831
    GAME                    959
    TOOLS                   827
    BUSINESS                420
    MEDICAL                 395
    PERSONALIZATION         376
    PRODUCTIVITY            374
    LIFESTYLE               369
    FINANCE                 345
    SPORTS                  325
    COMMUNICATION           315
    HEALTH_AND_FITNESS      288
    PHOTOGRAPHY             281
    NEWS_AND_MAGAZINES      254
    SOCIAL                  239
    BOOKS_AND_REFERENCE     222
    TRAVEL_AND_LOCAL        219
    SHOPPING                202
    DATING                  171
    VIDEO_PLAYERS           163
    MAPS_AND_NAVIGATION     131
    EDUCATION               119
    FOOD_AND_DRINK          112
    ENTERTAINMENT           102
    AUTO_AND_VEHICLES        85
    LIBRARIES_AND_DEMO       84
    WEATHER                  79
    HOUSE_AND_HOME           74
    EVENTS                   64
    ART_AND_DESIGN           64
    PARENTING                60
    COMICS                   56
    BEAUTY                   53
    Name: App, dtype: int64
  • Visualization of the app counts for the 33 categories
    ![Category](https://img-blog.csdnimg.cn/20190730105320410.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0ZERkNKQkFGU0E=,size_16,color_FFFFFF,t_70)

  • Visualization of the top ten categories by app count

    count_top_10 = df.groupby('Category').count()['App'].sort_values(ascending=False)[:10]
    print(count_top_10)
    plt.figure(figsize=(20,10),dpi=80)
    x = count_top_10.index
    y = count_top_10.values
    # Add data labels above the bars
    for a, b in zip(x, y):
        plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
    plt.bar(x, y, width=0.5)
    plt.savefig('./count_top_10.png')
    plt.show()
    Category
    FAMILY             1831
    GAME                959
    TOOLS               827
    BUSINESS            420
    MEDICAL             395
    PERSONALIZATION     376
    PRODUCTIVITY        374
    LIFESTYLE           369
    FINANCE             345
    SPORTS              325
    Name: App, dtype: int64

![count_top_10](https://img-blog.csdnimg.cn/20190730110109437.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0ZERkNKQkFGU0E=,size_16,color_FFFFFF,t_70)
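
The per-category counts can also be obtained with `value_counts`, which counts and sorts in one call. This is a minimal sketch on a made-up frame (`toy` is hypothetical) showing that both routes give the same numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd', 'e', 'f'],
    'Category': ['GAME', 'TOOLS', 'GAME', 'FAMILY', 'GAME', 'TOOLS'],
})

# Route used in this post: group, count, sort.
by_groupby = toy.groupby('Category')['App'].count().sort_values(ascending=False)

# Shorter route: value_counts sorts in descending order by default.
by_value_counts = toy['Category'].value_counts()

top_2 = by_value_counts.head(2)
print(top_2)
```

The two differ only when App contains missing values, since `count` skips NaN in the counted column.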

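10. Data analysis # Category&Installs

  • Rank the 33 categories by installs
  • Visualize the top 10 categories by installs
    # Rank the 33 categories by mean installs per app
    # (select the column before aggregating so non-numeric columns are not averaged)
    Category_Installs_mean = df.groupby('Category')['Installs'].mean().sort_values(ascending=False)
    print(Category_Installs_mean)
    # Visualize the top 10 categories by mean installs
    mean_top_10 = df.groupby('Category')['Installs'].mean().sort_values(ascending=False)[:10]
    print(mean_top_10)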
    plt.figure(figsize=(20,10),dpi=80)
    x = mean_top_10.index
    y = mean_top_10.values.astype('i8')
    # Add data labels above the bars
    for a, b in zip(x, y):
        plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
    plt.bar(x, y, width=0.5)
    plt.savefig('./mean_top_10.png')
    plt.show()
    Category
    COMMUNICATION          3.504215e+07
    VIDEO_PLAYERS          2.409143e+07
    SOCIAL                 2.296179e+07
    ENTERTAINMENT          2.072216e+07
    PHOTOGRAPHY            1.654501e+07
    PRODUCTIVITY           1.548955e+07
    GAME                   1.447229e+07
    TRAVEL_AND_LOCAL       1.321866e+07
    TOOLS                  9.675661e+06
    NEWS_AND_MAGAZINES     9.327629e+06
    BOOKS_AND_REFERENCE    7.504367e+06
    SHOPPING               6.932420e+06
    WEATHER                4.570893e+06
    PERSONALIZATION        4.075784e+06
    HEALTH_AND_FITNESS     3.972300e+06
    MAPS_AND_NAVIGATION    3.841846e+06
    SPORTS                 3.373768e+06
    EDUCATION              2.965983e+06
    FAMILY                 2.418319e+06
    FOOD_AND_DRINK         1.891060e+06
    ART_AND_DESIGN         1.786533e+06
    BUSINESS               1.659916e+06
    LIFESTYLE              1.365375e+06
    FINANCE                1.319851e+06
    HOUSE_AND_HOME         1.313682e+06
    DATING                 8.241293e+05
    COMICS                 8.032348e+05
    LIBRARIES_AND_DEMO     6.309037e+05
    AUTO_AND_VEHICLES      6.250613e+05
    PARENTING              5.253518e+05
    BEAUTY                 5.131519e+05
    EVENTS                 2.495806e+05
    MEDICAL                9.669159e+04
    Name: Installs, dtype: float64
    
    Category
    COMMUNICATION         3.504215e+07
    VIDEO_PLAYERS         2.409143e+07
    SOCIAL                2.296179e+07
    ENTERTAINMENT         2.072216e+07
    PHOTOGRAPHY           1.654501e+07
    PRODUCTIVITY          1.548955e+07
    GAME                  1.447229e+07
    TRAVEL_AND_LOCAL      1.321866e+07
    TOOLS                 9.675661e+06
    NEWS_AND_MAGAZINES    9.327629e+06
    Name: Installs, dtype: float64

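![Category&Installs](https://img-blog.csdnimg.cn/2019073011184390.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0ZERkNKQkFGU0E=,size_16,color_FFFFFF,t_70)

  • Conclusion: entertainment and social-type categories (COMMUNICATION, VIDEO_PLAYERS, SOCIAL, ENTERTAINMENT) have the highest mean installs per app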
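
The ranking above is based on the mean, and the install counts in this dataset range from 0 to 1,000,000,000, so a few very large apps can dominate a category's mean. The sketch below uses made-up categories 'X' and 'Y' to show how the mean and the median can disagree. It is an optional cross-check, not part of the original analysis.

```python
import pandas as pd

toy = pd.DataFrame({
    'Category': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Installs': [1000, 1000, 1000000000, 5000000, 5000000, 5000000],
})

# Mean and median installs per category, side by side.
stats = toy.groupby('Category')['Installs'].agg(['mean', 'median'])
print(stats)

top_by_mean = stats['mean'].idxmax()      # driven by the single huge app in X
top_by_median = stats['median'].idxmax()  # reflects the typical app
```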

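11. Data analysis # Category&Reviews

  • Rank the 33 categories by number of reviews
  • Visualize the top 10 categories by number of reviews
    # Rank the 33 categories by mean reviews per app
    # (select the column before aggregating so non-numeric columns are not averaged)
    Category_Reviews_mean = df.groupby('Category')['Reviews'].mean().sort_values(ascending=False)
    print(Category_Reviews_mean)
    # Top 10 categories by mean reviews, used for the chart
    top_mean_10 = df.groupby('Category')['Reviews'].mean().sort_values(ascending=False)[:10]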
    print(top_mean_10)
    
    plt.figure(figsize=(20,10),dpi=80)
    x = top_mean_10.index
    y = top_mean_10.values.astype('i8')
    # Add data labels above the bars
    for a, b in zip(x, y):
        plt.text(a, b, b, ha='center', va='bottom', fontsize=12)
    plt.bar(x, y, width=0.5)
    plt.savefig('./top_mean_10.png')
    plt.show()
    Category
    SOCIAL                 953672.807531
    COMMUNICATION          907337.676190
    GAME                   648903.763295
    VIDEO_PLAYERS          414015.754601
    PHOTOGRAPHY            374915.551601
    ENTERTAINMENT          340810.294118
    TOOLS                  277335.644498
    SHOPPING               220553.118812
    WEATHER                155634.987342
    PRODUCTIVITY           148638.098930
    PERSONALIZATION        142401.808511
    MAPS_AND_NAVIGATION    135337.007634
    TRAVEL_AND_LOCAL       122464.570776
    EDUCATION              112303.764706
    SPORTS                 108765.578462
    NEWS_AND_MAGAZINES      91063.889764
    FAMILY                  78550.239214
    BOOKS_AND_REFERENCE     75321.234234
    HEALTH_AND_FITNESS      74171.371528
    FOOD_AND_DRINK          56473.464286
    COMICS                  41822.696429
    FINANCE                 36701.756522
    LIFESTYLE               32066.859079
    HOUSE_AND_HOME          26079.013514
    BUSINESS                23548.202381
    ART_AND_DESIGN          22175.046875
    DATING                  21190.315789
    PARENTING               15972.183333
    AUTO_AND_VEHICLES       13690.188235
    LIBRARIES_AND_DEMO      10795.607143
    BEAUTY                   7476.226415
    MEDICAL                  2994.863291
    EVENTS                   2515.906250
    Name: Reviews, dtype: float64
    Category
    SOCIAL           953672.807531
    COMMUNICATION    907337.676190
    GAME             648903.763295
    VIDEO_PLAYERS    414015.754601
    PHOTOGRAPHY      374915.551601
    ENTERTAINMENT    340810.294118
    TOOLS            277335.644498
    SHOPPING         220553.118812
    WEATHER          155634.987342
    PRODUCTIVITY     148638.098930
    Name: Reviews, dtype: float64

![Category&Reviews](https://img-blog.csdnimg.cn/2019073011360221.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0ZERkNKQkFGU0E=,size_16,color_FFFFFF,t_70)

  • Conclusion: SOCIAL, COMMUNICATION, GAME and VIDEO_PLAYERS apps receive the most reviews per app on average

12. Data analysis # Category&Rating

  • Mean rating by category
    Category_Rating_mean = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)
    print(Category_Rating_mean)
    Category
    EVENTS                 4.363178
    EDUCATION              4.362956
    ART_AND_DESIGN         4.349614
    BOOKS_AND_REFERENCE    4.308393
    PERSONALIZATION        4.303077
    PARENTING              4.281960
    BEAUTY                 4.260553
    GAME                   4.244643
    SOCIAL                 4.238926
    WEATHER                4.238510
    HEALTH_AND_FITNESS     4.235199
    SHOPPING               4.225835
    SPORTS                 4.211275
    AUTO_AND_VEHICLES      4.190601
    PRODUCTIVITY           4.185022
    COMICS                 4.181848
    LIBRARIES_AND_DEMO     4.181371
    FAMILY                 4.181137
    FOOD_AND_DRINK         4.175461
    MEDICAL                4.173252
    PHOTOGRAPHY            4.159614
    HOUSE_AND_HOME         4.156771
    NEWS_AND_MAGAZINES     4.135385
    ENTERTAINMENT          4.135294
    COMMUNICATION          4.134647
    BUSINESS               4.133347
    FINANCE                4.125060
    LIFESTYLE              4.111489
    TRAVEL_AND_LOCAL       4.087380
    TOOLS                  4.059615
    VIDEO_PLAYERS          4.058137
    MAPS_AND_NAVIGATION    4.051854
    DATING                 4.018100
    Name: Rating, dtype: float64

12. Data analysis # Category&Type

  • Breakdown by Type
    print(df.groupby('Type')['App'].count())
    print(df.groupby('Type')['Installs'].sum().sort_values(ascending=False))
    Type
    Free    8902
    Paid     756
    Name: App, dtype: int64
    Type
    Free    75065572646
    Paid       57364881
    Name: Installs, dtype: int64
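  • Free apps are the large majority and paid apps a small share (8902 vs 756 apps); free is still the mainstream
  • Analyze Category and Type together
    df.groupby(['Type', 'Category'])['Reviews'].sum().sort_values(ascending=False)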
    Type  Category           
    Free  GAME                   620725858
          COMMUNICATION          285727154
          TOOLS                  229184641
          SOCIAL                 227927559
          FAMILY                 140192916
          PHOTOGRAPHY            105236039
          VIDEO_PLAYERS           67471201
          PRODUCTIVITY            55418928
          PERSONALIZATION         53249927
          SHOPPING                44551246
          SPORTS                  35198178
          ENTERTAINMENT           34752641
          TRAVEL_AND_LOCAL        26801668
          NEWS_AND_MAGAZINES      23130027
          HEALTH_AND_FITNESS      21315562
          MAPS_AND_NAVIGATION     17721960
          BOOKS_AND_REFERENCE     16719518
          EDUCATION               13329503
          FINANCE                 12638908
          WEATHER                 12158723
          LIFESTYLE               11785249
          BUSINESS                 9865113
          FOOD_AND_DRINK           6321631
    Paid  FAMILY                   3632572
    Free  DATING                   3621936
          COMICS                   2342071
          HOUSE_AND_HOME           1929847
    Paid  GAME                     1572851
    Free  ART_AND_DESIGN           1417037
          MEDICAL                  1162965
                                   ...    
          BEAUTY                    396240
    Paid  PERSONALIZATION           293153
          TOOLS                     171937
          PRODUCTIVITY              171721
    Free  EVENTS                    161018
    Paid  SPORTS                    150635
          WEATHER                   136441
          PHOTOGRAPHY               115231
          COMMUNICATION              84214
          LIFESTYLE                  47422
          HEALTH_AND_FITNESS         45793
          EDUCATION                  34645
          BUSINESS                   25132
          FINANCE                    23198
          MEDICAL                    20006
          TRAVEL_AND_LOCAL           18073
          VIDEO_PLAYERS              13367
          ENTERTAINMENT              10009
          PARENTING                   8366
          MAPS_AND_NAVIGATION         7188
          AUTO_AND_VEHICLES           4163
          FOOD_AND_DRINK              3397
          ART_AND_DESIGN              2166
          BOOKS_AND_REFERENCE         1796
          DATING                      1608
          SHOPPING                     484
          SOCIAL                       242
          NEWS_AND_MAGAZINES           201
          LIBRARIES_AND_DEMO             4
          EVENTS                         0
    Name: Reviews, Length: 63, dtype: int64
  • Review-to-install ratio
    Type_Category = df.groupby(['Type', 'Category'])[['Reviews', 'Installs']].mean()
    print((Type_Category['Reviews'] / Type_Category['Installs']).sort_values(ascending=False))
    Type  Category           
    Paid  VIDEO_PLAYERS          0.188268
          FAMILY                 0.175913
          WEATHER                0.168031
          PARENTING              0.166986
          DATING                 0.141674
          ART_AND_DESIGN         0.135375
          FINANCE                0.124988
          PRODUCTIVITY           0.121611
          SPORTS                 0.121107
          BUSINESS               0.118115
          TOOLS                  0.099533
          TRAVEL_AND_LOCAL       0.098727
          HEALTH_AND_FITNESS     0.096587
          PERSONALIZATION        0.089958
          AUTO_AND_VEHICLES      0.083011
          BOOKS_AND_REFERENCE    0.077029
          GAME                   0.074898
          COMMUNICATION          0.061920
          PHOTOGRAPHY            0.061334
          MAPS_AND_NAVIGATION    0.059356
          EDUCATION              0.057550
          FOOD_AND_DRINK         0.056617
    Free  COMICS                 0.052068
    Paid  ENTERTAINMENT          0.050045
          SHOPPING               0.047921
    Free  GAME                   0.044792
          SOCIAL                 0.041533
    Paid  SOCIAL                 0.040333
          LIFESTYLE              0.040218
          LIBRARIES_AND_DEMO     0.040000
                                   ...   
    Free  MAPS_AND_NAVIGATION    0.035221
          PERSONALIZATION        0.034821
          WEATHER                0.033747
          SPORTS                 0.032138
          SHOPPING               0.031815
          FAMILY                 0.031809
          MEDICAL                0.030903
          PARENTING              0.030185
          FOOD_AND_DRINK         0.029856
          TOOLS                  0.028648
          FINANCE                0.027768
          COMMUNICATION          0.025888
          DATING                 0.025703
          LIFESTYLE              0.023446
          PHOTOGRAPHY            0.022645
          AUTO_AND_VEHICLES      0.021844
          HOUSE_AND_HOME         0.019852
          HEALTH_AND_FITNESS     0.018640
          VIDEO_PLAYERS          0.017182
          LIBRARIES_AND_DEMO     0.017111
          ENTERTAINMENT          0.016443
          BEAUTY                 0.014569
          BUSINESS               0.014155
          ART_AND_DESIGN         0.012395
          EVENTS                 0.010081
          BOOKS_AND_REFERENCE    0.010036
          NEWS_AND_MAGAZINES     0.009763
          PRODUCTIVITY           0.009569
          TRAVEL_AND_LOCAL       0.009259
    Paid  EVENTS                 0.000000
    Length: 63, dtype: float64
  • Paid apps have a higher review-to-install ratio than free apps
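
The ratio above divides the group mean of Reviews by the group mean of Installs, which is the same as total reviews divided by total installs per group. That is not the same as averaging each app's own ratio. This is a minimal sketch on made-up numbers showing the difference; all values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Type': ['Paid', 'Paid', 'Free', 'Free'],
    'Reviews': [10, 100, 5, 50],
    'Installs': [100, 10000, 1000, 1000],
})

# Approach used in this post: ratio of group means (equals ratio of group sums).
means = toy.groupby('Type')[['Reviews', 'Installs']].mean()
ratio_of_means = means['Reviews'] / means['Installs']

# Alternative: compute each app's ratio first, then average per group.
per_app = toy['Reviews'] / toy['Installs']
mean_of_ratios = per_app.groupby(toy['Type']).mean()

print(ratio_of_means)
print(mean_of_ratios)
```

On the real data the per-app variant would need care with apps that have 0 installs, since the division is undefined there.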

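13. Data analysis # Correlation (corr)

    # numeric_only=True restricts the calculation to the numeric columns
    print(df.corr(numeric_only=True))
                Rating   Reviews      Size  Installs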
    Rating    1.000000  0.054337  0.052751  0.039245
    Reviews   0.054337  1.000000  0.080578  0.625164
    Size      0.052751  0.080578  1.000000  0.050675
    Installs  0.039245  0.625164  0.050675  1.000000
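  • Reviews and Installs are strongly correlated (0.63). Every other pair is below 0.1 and can be treated as uncorrelated (as a rule of thumb, above 0.5 counts as correlated and above 0.3 as weakly correlated)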
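
By default `corr` reports the Pearson coefficient: the covariance of the two columns divided by the product of their standard deviations. This is a minimal sketch on made-up numbers that computes it by hand with NumPy and compares the result with pandas.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Reviews': [10, 200, 3000, 45000, 600000],
    'Installs': [1000, 10000, 100000, 1000000, 50000000],
})

x = toy['Reviews'].to_numpy(dtype=float)
y = toy['Installs'].to_numpy(dtype=float)

# Pearson r: sum of products of deviations over the root of the
# product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy['Reviews'].corr(toy['Installs'])
print(r_manual, r_pandas)
```

Pearson correlation measures linear association only, so heavily skewed columns such as Installs can be worth checking with a rank-based method as well.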


posted on 2021-07-07 10:18  夜的独白