PUBG Dataset Analysis Report
The dataset consists of two parts.
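Dataset 1:
agg data: 15 fields
- date: time of the match
- game_size: number of teams
- match_id: match ID
- match_mode: perspective mode (first-person or third-person)
- party_size: party mode (solo, duo, four-player squad)
- player_assists: number of assists
- player_dbno: number of knockdowns
- player_dist_ride: distance travelled by vehicle
- player_dist_walk: distance travelled on foot
- player_dmg: damage dealt
- player_kills: number of kills
- player_name: player name
- player_survive_time: player survival time
- team_id: team ID
- team_placement: team placement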
Dataset 2:
kill data: 12 fields
- killed_by: cause of death
- killer_name: killer's name
- killer_placement: killer's placement
- killer_position_x: killer's x coordinate
- killer_position_y: killer's y coordinate
- map: map
- match_id: match ID
- time: survival time
- victim_name: victim's name
- victim_placement: victim's placement
- victim_position_x: victim's x coordinate
- victim_position_y: victim's y coordinate
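We are free to explore the relationships in the data and look for interesting patterns.
Goals:
- Distribution of kill counts
- Single-variable analysis: kills vs. placement, party size vs. winning
- Multi-variable analysis
- Hot spots for dying right after landing (survival time under 180 s)
- The most effective weapons
- Distribution of combat distances
- The most effective weapons at each combat distance
- Train an SVM to predict whether a player wins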
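Distribution of kill counts
Partial code
Load the data and handle missing values
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    agg = './pubg/agg0.csv'
    kill = './pubg/kill0.csv'
    agg_df = pd.read_csv(agg)    # shape (13849287, 15)
    kill_df = pd.read_csv(kill)  # shape (13426348, 12)
    # Work on the first 1,000,000 rows; .copy() so later column assignments do not touch a view
    agg_sub_df = agg_df[:1000000].copy()
    kill_sub_df = kill_df[:1000000].copy()
    # Drop rows with missing values
    agg_sub_deleted = agg_sub_df.dropna(axis=0)
    kill_sub_deleted = kill_sub_df.dropna(axis=0)
    agg_sub_deleted['player_kills'].describe()  # most players have 0 or 1 kills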
count 998580.000000
mean 0.887166
std 1.555946
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 62.000000
Name: player_kills, dtype: float64
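`describe()` only gives summary statistics. To look at the distribution itself, here is a minimal sketch; the helper name `kill_distribution` and the cap of 8 are illustrative choices, not part of the original notebook, and the toy series stands in for `agg_sub_deleted['player_kills']`.

```python
import pandas as pd

def kill_distribution(kills: pd.Series, cap: int = 8) -> pd.Series:
    """Share of players per kill count; counts above `cap` are pooled into `cap`."""
    capped = kills.clip(upper=cap).astype(int)
    return capped.value_counts(normalize=True).sort_index()

# Toy data standing in for agg_sub_deleted['player_kills']
toy_kills = pd.Series([0, 0, 0, 1, 1, 2, 12, 0])
dist = kill_distribution(toy_kills)
print(dist)
```

Calling `dist.plot.bar()` would then draw the shares as a bar chart.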
Single-variable analysis: kills vs. placement, party size vs. winning
    # Kills vs. placement: overall, players with more than 10 kills are likely to finish in the top 5; there is an anomaly at around 40 kills.
    data = agg_sub_deleted[['player_kills', 'team_placement']]
    sns.set(style="darkgrid")
    g = sns.relplot(x="player_kills", y="team_placement", height=4, linewidth=2, aspect=1.3, kind="line", data=data)
    g.fig.autofmt_xdate()  # rotate the x-axis tick labels

    agg_sub_1 = agg_sub_deleted[agg_sub_deleted['team_placement'] == 1.0]  # rows that finished first
    party_size_data = agg_sub_1.groupby('party_size')  # winners, grouped by party size (1, 2, 4)
    old = agg_sub_deleted.groupby('party_size')        # all players, grouped by party size

    # Win rate for each party size; four-player squads have the highest win rate: 4%
    num = []
    for i in (1, 2, 4):
        size = old.get_group(i)
        size2 = party_size_data.get_group(i)
        num.append(round(size2.shape[0] / size.shape[0], 2))
    name = ['solo', 'duo', 'squad']
    f, ax = plt.subplots(figsize=(7, 3))
    plt.bar(range(len(num)), num, color=sns.color_palette("cubehelix", 3), tick_label=name)
    plt.title('first-place rate by party_size')
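The same win rates can be computed without the explicit `get_group` loop. This is a rough equivalent, assuming the `party_size` and `team_placement` columns described above; the function name `win_rate_by_party_size` and the toy frame are made up for illustration.

```python
import pandas as pd

def win_rate_by_party_size(df: pd.DataFrame) -> pd.Series:
    """Fraction of player rows with team_placement == 1, per party_size."""
    is_winner = df['team_placement'].eq(1)
    return is_winner.groupby(df['party_size']).mean()

# Toy data standing in for agg_sub_deleted
toy = pd.DataFrame({
    'party_size':     [1, 1, 1, 1, 2, 2, 4, 4],
    'team_placement': [1, 5, 9, 30, 1, 2, 1, 1],
})
rates = win_rate_by_party_size(toy)
print(rates)
```

Note that this is a per-row rate: in a squad match four rows share each first place, which is one reason larger parties show a higher share.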

Multi-variable analysis
    # Correlation analysis; date, match_id, match_mode, party_size, player_name and team_id are not needed
    not_use = ['date', 'match_id', 'match_mode', 'party_size', 'player_name', 'team_id']
    all_columns = agg_sub_deleted.columns.values
    agg_sub_corr = agg_sub_deleted[[column for column in all_columns if column not in not_use]]  # keep columns not listed in not_use
    corr_data = agg_sub_corr.corr(method='spearman')
    plt.subplots(figsize=(9, 9))
    sns.heatmap(corr_data, annot=True, vmax=1, square=True, cmap="Blues")
    plt.show()
Looking at the team_placement column, team placement has a strong negative correlation with survival time, and a fairly strong negative correlation with vehicle distance and walking distance as well (a smaller placement number is better, hence the negative sign).
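The code that follows uses a `champion` column that is not created in any excerpt shown here. A minimal sketch of how it could be built, assuming it is a 0/1 flag for `team_placement == 1` (an assumption consistent with how the column is used as the SVM label later, but not confirmed by the source); `add_champion_flag` is a hypothetical helper name.

```python
import pandas as pd

def add_champion_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with champion = 1 where team_placement == 1, else 0 (assumed definition)."""
    out = df.copy()
    out['champion'] = out['team_placement'].eq(1).astype(int)
    return out

# Toy data standing in for agg_sub_deleted
toy = pd.DataFrame({'team_placement': [1.0, 2.0, 27.0, 1.0]})
flagged = add_champion_flag(toy)
print(flagged['champion'].tolist())
```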

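A closer look at win rate vs. vehicle travel distance
    # Note: the 'champion' column (the win label) is created in a part of the code not shown here
    df_ride = agg_sub_deleted.loc[agg_sub_deleted['player_dist_ride'] < 10000, ['player_dist_ride', 'champion']].copy()
    labels = ["0k-1k", "1k-2k", "2k-3k", "3k-4k", "4k-5k", "5k-6k", "6k-7k", "7k-8k"]

    # pd.cut splits the observed range into 8 equal-width bins; with values up to
    # 10000 each bin is roughly 1.25k wide, so the labels above are approximate
    df_ride['drive'] = pd.cut(df_ride['player_dist_ride'], 8, labels=labels)
    df_ride.groupby('drive').champion.mean().plot.bar(rot=30, figsize=(10, 6), color=sns.color_palette("cubehelix", 8))
    plt.xlabel("drive distance")
    plt.ylabel("proportion of champions")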

Hot spots for dying right after landing (survival time under 180 s)
Partial code
    from scipy.ndimage import gaussian_filter
    import matplotlib.cm as cm
    from matplotlib.colors import Normalize

    # Erangel deaths within the first 180 s that have a valid victim position
    kill_sub_df_die_map1 = kill_sub_df.loc[(kill_sub_df['map'] == 'ERANGEL') & (kill_sub_df['time'] < 180) & (kill_sub_df['victim_position_x'] > 0), :].dropna()

    # heatmap() and plot_data_ev come from a part of the code not shown here
    bg = plt.imread('erangel.jpg')
    hmap, extent = heatmap(plot_data_ev[:, 0], plot_data_ev[:, 1], 1.5, bins=200)
    # Opacity and colour are both derived from the smoothed density
    alphas = np.clip(Normalize(0, hmap.max() / 100, clip=True)(hmap) * 1.5, 0.0, 1.)
    colors = Normalize(hmap.max() / 100, hmap.max() / 20, clip=True)(hmap)
    colors = cm.bwr(colors)
    colors[..., -1] = alphas

    fig, ax = plt.subplots(figsize=(8, 8))
    ax.set_xlim(0, 4096)
    ax.set_ylim(0, 4096)
    ax.imshow(bg)
    ax.imshow(colors, extent=extent, origin='lower', cmap=cm.cool, alpha=1)
    plt.gca().invert_yaxis()
    plt.title('erangel killed people distribution')
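The excerpt calls `heatmap(...)` and reads `plot_data_ev` without showing either. One plausible reconstruction, offered only as a sketch: a 2D histogram smoothed with a Gaussian filter, with coordinates rescaled to image pixels. The coordinate range of 0 to 800000 and the 4096-pixel image width are assumptions (the latter inferred from the axis limits above), and the random points stand in for real victim positions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def heatmap(x, y, sigma, bins=100):
    """2D histogram of (x, y) smoothed with a Gaussian filter; returns (grid, extent)."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    smoothed = gaussian_filter(counts, sigma=sigma)
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    # histogram2d puts x along axis 0; transpose so that rows correspond to y for imshow
    return smoothed.T, extent

# Assumption: raw coordinates span 0..800000 and the map image is 4096 px wide
SCALE = 4096 / 800000
rng = np.random.default_rng(0)
toy_xy = rng.uniform(0, 800000, size=(500, 2))  # stand-in for victim positions
plot_data_ev = toy_xy * SCALE
hmap, extent = heatmap(plot_data_ev[:, 0], plot_data_ev[:, 1], 1.5, bins=200)
print(hmap.shape, extent)
```

With the real data, `toy_xy` would be replaced by the `victim_position_x` and `victim_position_y` columns of `kill_sub_df_die_map1`.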

The most effective weapons
Partial code
    gun_category = kill_sub_df['killed_by'].unique()
    MIRAMAR = kill_sub_df[kill_sub_df['map'] == 'MIRAMAR']
    ERANGEL = kill_sub_df[kill_sub_df['map'] == 'ERANGEL']
    kill_sub_df_gun_M = MIRAMAR.groupby('killed_by')
    kill_sub_df_gun_E = ERANGEL.groupby('killed_by')

    # Count kills per weapon on each map, keeping only weapons with more than 100 kills
    M_data = {}
    E_data = {}
    for name in gun_category:
        try:
            sub = kill_sub_df_gun_M.get_group(name)
            if sub.shape[0] > 100:
                M_data[name] = sub.shape[0]
        except KeyError:  # this weapon never appears on Miramar
            continue

    for name in gun_category:
        try:
            sub = kill_sub_df_gun_E.get_group(name)
            if sub.shape[0] > 100:
                E_data[name] = sub.shape[0]
        except KeyError:  # this weapon never appears on Erangel
            continue
The counts show that on both Erangel and Miramar, M416, SCAR-L and M16A4 are the top three weapons by number of kills.
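The per-map ranking can also be read straight off `value_counts`, which already sorts in descending order. A small sketch under the same 100-kill threshold; `top_weapons_by_map` is a hypothetical helper and the toy frame stands in for `kill_sub_df`.

```python
import pandas as pd

def top_weapons_by_map(kills: pd.DataFrame, n: int = 3, min_kills: int = 100) -> dict:
    """Top-n killed_by values per map, ignoring values with min_kills kills or fewer."""
    result = {}
    for map_name, group in kills.groupby('map'):
        counts = group['killed_by'].value_counts()  # sorted by count, descending
        counts = counts[counts > min_kills]
        result[map_name] = counts.head(n).to_dict()
    return result

# Toy data standing in for kill_sub_df
toy = pd.DataFrame({
    'map': ['ERANGEL'] * 6 + ['MIRAMAR'] * 4,
    'killed_by': ['M416', 'M416', 'M416', 'AKM', 'AKM', 'Punch',
                  'SCAR-L', 'SCAR-L', 'SCAR-L', 'M416'],
})
top = top_weapons_by_map(toy, n=2, min_kills=1)
print(top)
```

Because `killed_by` records the cause of death, the ranking may include entries that are not guns, so it is worth checking the result before reading it as a weapon chart.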

Distribution of combat distances
Partial code
    # Example: Euclidean distance between one killer position and one victim position
    vec1 = np.array([93091.37, 722236.4])
    vec2 = np.array([92238.68, 723375.1])
    dist = np.linalg.norm(vec1 - vec2)

    vec1 = kill_sub_df[['killer_position_x', 'killer_position_y']].values
    vec2 = kill_sub_df[['victim_position_x', 'victim_position_y']].values

    # Row-wise distance; floor-divide by 100 so that kill_dist is in the metres
    # that the 50 / 300 / 800 thresholds below assume
    dist = np.linalg.norm(vec1 - vec2, axis=1) // 100

    # Add a kill-distance column
    kill_sub_df['kill_dist'] = dist
    short_dis = kill_sub_df[kill_sub_df['kill_dist'] <= 50]
    mid_dis = kill_sub_df[(kill_sub_df['kill_dist'] > 50) & (kill_sub_df['kill_dist'] <= 300)]  # combine conditions with ( ) & ( )
    long_dis = kill_sub_df[(kill_sub_df['kill_dist'] > 300) & (kill_sub_df['kill_dist'] < 800)]
    super_dis = kill_sub_df[kill_sub_df['kill_dist'] >= 800]
The vast majority of players are killed at close range, within 50 m.
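To put a number on "the vast majority", the share of kills in each distance band can be computed with `pd.cut`. A sketch using the same 50 / 300 / 800 m thresholds; the band names and the helper `band_shares` are illustrative, and because `pd.cut` bins are right-closed a distance of exactly 800 m lands in the long band here rather than the super band.

```python
import numpy as np
import pandas as pd

BANDS = [0, 50, 300, 800, np.inf]
LABELS = ['short', 'mid', 'long', 'super']

def band_shares(kill_dist_m: pd.Series) -> pd.Series:
    """Share of kills per distance band; distances are in metres."""
    bands = pd.cut(kill_dist_m, bins=BANDS, labels=LABELS, include_lowest=True)
    return bands.value_counts(normalize=True).sort_index()

# Toy data standing in for kill_sub_df['kill_dist']
toy_dist = pd.Series([0, 10, 49, 50, 120, 300, 450, 900, 5, 30])
shares = band_shares(toy_dist)
print(shares)
```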

The most effective weapons at each combat distance
Partial code
    shor_gun = short_dis.groupby('killed_by')
    mid_gun = mid_dis.groupby('killed_by')
    long_gun = long_dis.groupby('killed_by')
    super_gun = super_dis.groupby('killed_by')

    def count_gun(df, category):
        """Count kills per weapon in a grouped frame, keeping weapons with more than 100 kills."""
        data = {}
        for name in category:
            try:
                sub = df.get_group(name)
                if sub.shape[0] > 100:
                    data[name] = sub.shape[0]
            except KeyError:  # this weapon does not appear in this distance band
                continue
        data = sorted(data.items(), key=lambda x: x[1], reverse=True)
        # The slice [1:4] returns the entries ranked 2nd to 4th (it skips the top entry)
        return data[1:4]

    shor_data = count_gun(shor_gun, gun_category)
    mid_data = count_gun(mid_gun, gun_category)
    long_data = count_gun(long_gun, gun_category)
    super_data = count_gun(super_gun, gun_category)
At short and medium range, M416, SCAR-L and AKM perform best.
At long range, Kar98k, M416 and Mini 14 perform best.
The data for fights beyond 800 m may be unreliable.
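The four separate group-bys can be folded into one pass if each kill carries a band label. This is only a sketch: the `band` column, the `kills` column and the helper `top_weapons_per_band` are hypothetical names, and unlike `count_gun` above it returns the top n entries starting from rank 1 and applies no minimum-kill threshold.

```python
import pandas as pd

def top_weapons_per_band(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Top-n killed_by values for each distance band, with their kill counts."""
    counts = (df.groupby(['band', 'killed_by']).size()
                .rename('kills')
                .reset_index())
    counts = counts.sort_values(['band', 'kills'], ascending=[True, False])
    return counts.groupby('band').head(n)

# Toy data: each row is one kill with a precomputed distance band
toy = pd.DataFrame({
    'band': ['short'] * 5 + ['long'] * 3,
    'killed_by': ['M416', 'M416', 'M416', 'AKM', 'AKM',
                  'Kar98k', 'Kar98k', 'M416'],
})
top = top_weapons_per_band(toy, n=1)
print(top)
```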

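Train an SVM to predict whether a player wins
Partial code
    from sklearn.utils import shuffle
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Columns to drop
    all_colum = agg_sub_deleted.columns
    no_use = ['match_id', 'match_mode', 'player_name', 'team_id', 'team_placement', 'date']
    agg_svm = agg_sub_deleted[[colum for colum in all_colum if colum not in no_use]]

    # agg_negative_sub and agg_positive are built in a part of the code not shown here
    df_dataset = pd.concat([agg_negative_sub, agg_positive], axis=0)  # stack the two frames vertically
    df_dataset = shuffle(df_dataset)
    y = df_dataset['champion'].values
    X = df_dataset.drop(columns=['champion']).values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)
    print('mean accuracy:', clf.score(X_test, y_test))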
mean accuracy: 0.940930317181151
Manual check
    row = agg_negative.iloc[0]
    test = row.iloc[:-1].values  # drop the label column (the last one)
    clf.predict([test]) == row.iloc[-1]
array([ True])
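`agg_positive`, `agg_negative` and `agg_negative_sub` are not defined in the excerpt. One plausible reading, stated here as an assumption rather than the author's actual code: winners are rare, so the non-winners are down-sampled to the same number of rows before training. `balanced_sample` is a hypothetical helper illustrating that idea on toy data.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, label: str = 'champion', seed: int = 0) -> pd.DataFrame:
    """Down-sample the negative class to the size of the positive class, then shuffle.

    Assumes there are at least as many negatives as positives.
    """
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0].sample(n=len(positives), random_state=seed)
    combined = pd.concat([negatives, positives], axis=0)
    return combined.sample(frac=1, random_state=seed)  # shuffle the rows

# Toy data: 2 winners among 10 rows
toy = pd.DataFrame({
    'player_kills': [0, 1, 0, 2, 7, 0, 1, 9, 0, 3],
    'champion':     [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})
balanced = balanced_sample(toy)
print(balanced['champion'].value_counts().to_dict())
```

On a balanced set like this, accuracy is easier to interpret than on the raw data, where always predicting "no win" would already score highly.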
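Summary
- Most players have 0 or 1 kills; the mean is 0.88 kills
- Players with more than 7 kills are very likely to finish in the top 5
- When parachuting in, avoid the hot spots marked on the map
- Team placement has a strong negative correlation with survival time, and a fairly strong negative correlation with vehicle distance and walking distance as well (a smaller placement number is better, hence the negative sign)
- On both Erangel and Miramar, M416, SCAR-L and M16A4 are the top three weapons by number of kills
- The vast majority of players are killed at close range, within 50 m
- At short and medium range, M416, SCAR-L and AKM perform best
- At long range, Kar98k, M416 and Mini 14 perform best
- With suitable features, the SVM reaches an accuracy of 0.94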