[DW · Wisdom Ocean (fishing activity analysis) check-in] task03_Feature Engineering (reproducing the feature engineering of top solutions: binning features, grid features, statistical features, embedding features)

Open-source repository on GitHub: https://github.com/datawhalechina/team-learning

Learning goals

  1. Learn the basic concepts of feature engineering

  2. Study how the topline code constructs its features, and build meaningful features yourself

  3. Complete the corresponding check-in task

Contents

  1. Overview of feature engineering

  2. Competition-specific feature engineering

    • Domain features, built from prior knowledge of the task
  3. Binning features

    • Binning features for v, x and y
    • Grid regions built from the x and y bins
  4. DataFrame features

    • count values
    • shift offsets
    • statistical features
  5. Embedding features

    • Word vectors built with Word2vec
    • Topic distributions extracted with NMF
  6. Summary

Overview of feature engineering

Feature engineering can roughly be divided into three parts: feature construction, feature extraction and feature selection.

  • Feature construction
    • Exploratory data analysis
    • Numerical features
    • Categorical features
    • Time features
    • Text features
  • Feature extraction and feature selection (a small selection sketch follows below)
    • Simplify the model
    • Improve performance
    • Improve generalization / reduce the risk of overfitting
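The notebook below focuses on feature construction; for the extraction/selection side, one common approach is to rank the engineered columns by tree-model importance and keep only the top ones. The sketch below is an illustration only (not part of the original notebook); feature_cols, label_col and keep are placeholder names.

import lightgbm as lgb
import pandas as pd

def select_by_importance(train_df, feature_cols, label_col='label', keep=100):
    """Fit a quick LightGBM model and keep the `keep` most important features (illustration only)."""
    model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1)
    model.fit(train_df[feature_cols], train_df[label_col])
    imp = pd.Series(model.feature_importances_, index=feature_cols)
    return imp.sort_values(ascending=False).head(keep).index.tolist()
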
import gc
import multiprocessing as mp
import os
import pickle
import time
import warnings
from collections import Counter
from copy import deepcopy
from datetime import datetime
from functools import partial
from glob import glob

import geopandas as gpd
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.models import FastText, Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pyproj import Proj
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

os.environ['PYTHONHASHSEED'] = '0'
warnings.filterwarnings('ignore')

# Avoid appending to the DataFrame row by row; collect rows in a list first for speed
def get_data(file_path, max_lines=2000):
    paths = os.listdir(file_path)
    tmp = []
    for t in tqdm(range(len(paths))):
        if len(tmp) > max_lines:break

        p = paths[t]
        with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:
            next(f)
            for line in f.readlines():
                tmp.append(line.strip().split(','))
                if len(tmp) > max_lines:break

    tmp_df = pd.DataFrame(tmp)
    tmp_df.columns = ['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type']
    return tmp_df

TRAIN_PATH = "E:/competition-data/017_wisdomOcean/hy_round1_train_20200102/"

# Number of rows to sample
max_lines = 2000
df = get_data(TRAIN_PATH,max_lines=max_lines)
  0%|                                                                               | 6/7000 [00:00<00:02, 2999.86it/s]
# Basic preprocessing
label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label'}

df.rename(columns = name_dict, inplace = True)
df['label'] = df['label'].map(label_dict1)
cols = ['x','y','v']
for col in cols:
    df[col] = df[col].astype('float')
df['dir'] = df['dir'].astype('int')
df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour
df['month'] = df['time'].dt.month
df['weekday'] = df['time'].dt.weekday
df.head()

id x y v dir time label date hour month weekday
0 0 6.152038e+06 5.124873e+06 2.59 102 1900-11-10 11:58:19 0 1900-11-10 11 11 5
1 0 6.151230e+06 5.125218e+06 2.70 113 1900-11-10 11:48:19 0 1900-11-10 11 11 5
2 0 6.150421e+06 5.125563e+06 2.70 116 1900-11-10 11:38:19 0 1900-11-10 11 11 5
3 0 6.149612e+06 5.125907e+06 3.29 95 1900-11-10 11:28:19 0 1900-11-10 11 11 5
4 0 6.148803e+06 5.126252e+06 3.18 108 1900-11-10 11:18:19 0 1900-11-10 11 11 5

Data description:

- id: vessel ID, integer
- x: x coordinate of the recorded position, float
- y: y coordinate of the recorded position, float
- v: recorded speed, float
- dir: recorded heading, integer
- time: timestamp, text
- label: the label to be predicted, integer

Competition-specific feature engineering

Distance from each point's (x, y) coordinates to the reference point (6165599, 5202660)

df['x_dis_diff'] = (df['x'] - 6165599).abs()
df['y_dis_diff'] = (df['y'] - 5202660).abs()
df['base_dis_diff'] = ((df['x_dis_diff']**2)+(df['y_dis_diff']**2))**0.5    
del df['x_dis_diff'],df['y_dis_diff'] 
df['base_dis_diff'].head()
0    78959.780945
1    78763.845006
2    78577.185266
3    78399.867568
4    78231.955018
Name: base_dis_diff, dtype: float64

Split the hour into daytime and nighttime: hours strictly between 5 and 20 (i.e. 6-19) are marked as daytime (1), the rest as nighttime (0)

df['day_nig'] = 0
df.loc[(df['hour'] > 5) & (df['hour'] < 20),'day_nig'] = 1
df['day_nig'].head()
0    1
1    1
2    1
3    1
4    1
Name: day_nig, dtype: int64

Derive the quarter from the month

# Quarter
df['quarter'] = 0
df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1
df.loc[(df['month'].isin([4, 5, 6, ])), 'quarter'] = 2
df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3
df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4
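A one-line alternative, just as a side note (not used in the original code): pandas already exposes the quarter through the datetime accessor.

df['quarter'] = df['time'].dt.quarter  # equivalent built-in accessor
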

Dynamic features: speed levels, speed change, heading change, x/y similarity, and a 16-way split of the heading

temp = df.copy()
temp.rename(columns={'id':'ship', 'dir':'d'},inplace=True)

# Assign each speed a level
def v_cut(v):
    if v < 0.1:
        return 0
    elif v < 0.5:
        return 1
    elif v < 1:
        return 2
    elif v < 2.5:
        return 3
    elif v < 5:
        return 4
    elif v < 10:
        return 5
    elif v < 20:
        return 5
    else:
        return 6
# Count how many records each ship has in each speed level
def get_v_fea(df):
    df['v_cut'] = df['v'].apply(lambda x: v_cut(x))
    tmp = df.groupby(['ship', 'v_cut'], as_index=False)['v_cut'].agg({'v_cut_count': 'count'})
    # Build a pivot table with one column per speed level
    tmp = tmp.pivot(index='ship', columns='v_cut', values='v_cut_count')

    new_col_nm = ['v_cut_' + str(col) for col in tmp.columns.tolist()]
    tmp.columns = new_col_nm
    tmp = tmp.reset_index()  # turn the index back into a regular column

    return tmp

c1 = get_v_fea(temp)

Split the heading into 16 equal sectors

def add_direction(df):
    df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)
    return df
def get_d_cut_count_fea(df):
    df = add_direction(df)
    tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})
    tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')
    new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]
    tmp.columns = new_col_nm
    tmp = tmp.reset_index()
    return tmp

c2 = get_d_cut_count_fea(temp)

Count zero-speed records and compute statistics over the non-zero speeds

def get_v0_fea(df):
    # Count the records with v == 0 and compute statistics over the records with v != 0
    df_zero_count = df.query("v==0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
        {'num_zero_v': 'count'})
    df_not_zero_agg = df.query("v!=0")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(
        {'v_max_drop_0': 'max',
         'v_min_drop_0': 'min',
         'v_mean_drop_0': 'mean',
         'v_std_drop_0': 'std',
         'v_median_drop_0': 'median',
         'v_skew_drop_0': 'skew'})
    tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')

    return tmp

c3 = get_v0_fea(temp)

Percentile features

def get_percentiles_fea(df_raw):
    key = ['x', 'y', 'v', 'd']
    temp = df_raw[['ship']].drop_duplicates('ship')
    for i in range(len(key)):
        # Add the median and a range of percentiles for x, y, v, d
        tmp_dscb = df_raw.groupby('ship')[key[i]].describe(
            percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])
        raw_col_nm = tmp_dscb.columns.tolist()
        new_col_nm = [key[i] + '_' + col for col in raw_col_nm]
        tmp_dscb.columns = new_col_nm
        tmp_dscb = tmp_dscb.reset_index()
        # Drop the redundant summary statistics
        tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',
                                  f'{key[i]}_min', f'{key[i]}_max'], axis=1)

        temp = temp.merge(tmp_dscb, on='ship', how='left')
    return temp

c4 = get_percentiles_fea(temp)

Compute the deltas between consecutive points (turn-rate features)

def get_d_change_rate_fea(df):
    import math
    import time
    temp = df.copy()
    # Sort by ship and time
    temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
    # Use shift to line each record up with its neighbours; note what .shift(-1) / .shift(1) mean (a short example follows the c5 output below)
    temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
    temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
    temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
    # Fill the NaN values introduced by shift: shift(-1) has nothing to compare against
    # at the end of each ship's trajectory, so those positions come back as NaN
    temp['ynext'] = temp['ynext'].fillna(method='ffill')
    temp['xnext'] = temp['xnext'].fillna(method='ffill')
    # (ynext - y) / (xnext - x) is the slope towards the next point; arctan converts it to an angle in degrees
    temp['angle_next'] = (temp['ynext'] - temp['y']) / (temp['xnext'] - temp['x'])
    temp['angle_next'] = np.arctan(temp['angle_next']) / math.pi * 180
    temp['angle_next_next'] = temp['angle_next'].shift(-1)
    temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
    temp['timediff'] = temp['timediff'].fillna(method='ffill')
    temp['hc_xy'] = abs(temp['angle_next_next'] - temp['angle_next'])
    # For heading changes above 180 degrees, use 360 minus the value so we always take the smaller angle
    temp.loc[temp['hc_xy'] > 180, 'hc_xy'] = (360 - temp.loc[temp['hc_xy'] > 180, 'hc_xy'])
    temp['hc_xy_s'] = temp.apply(lambda x: x['hc_xy'] / x['timediff'].total_seconds(), axis=1)

    temp['d_next'] = temp.groupby('ship')['d'].shift(-1)
    temp['hc_d'] = abs(temp['d_next'] - temp['d'])
    temp.loc[temp['hc_d'] > 180, 'hc_d'] = 360 - temp.loc[temp['hc_d'] > 180, 'hc_d']
    temp['hc_d_s'] = temp.apply(lambda x: x['hc_d'] / x['timediff'].total_seconds(), axis=1)

    temp1 = temp[['ship', 'hc_xy_s', 'hc_d_s']]
    xy_d_rate = temp1.groupby('ship')['hc_xy_s'].agg([('hc_xy_s_max', 'max')])
    xy_d_rate = xy_d_rate.reset_index()
    d_d_rate = temp1.groupby('ship')['hc_d_s'].agg([('hc_d_s_max', 'max')])
    d_d_rate = d_d_rate.reset_index()

    tmp = xy_d_rate.merge(d_d_rate, on='ship', how='left')
    return tmp

c5 = get_d_change_rate_fea(temp)
c5
ship hc_xy_s_max hc_d_s_max
0 0 0.183673 0.188020
1 1 0.293241 0.340426
2 10 0.223041 0.341176
3 100 0.282311 0.300000
4 1000 0.969970 0.717172
5 1001 0.078424 0.270903
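A short aside on .shift(), since the comment in get_d_change_rate_fea asks the reader to study it (toy data, not competition data): within each group, shift(1) aligns every row with the previous row of the same ship and shift(-1) with the next row, leaving NaN where no neighbour exists.

s = pd.DataFrame({'ship': [0, 0, 0, 1, 1], 'x': [1.0, 2.0, 4.0, 10.0, 13.0]})
s['x_prev'] = s.groupby('ship')['x'].shift(1)    # previous record of the same ship (NaN for the first)
s['x_next'] = s.groupby('ship')['x'].shift(-1)   # next record of the same ship (NaN for the last)
print(s)
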
f1 = temp.merge(c1, on='ship',how='left')
f1 = f1.merge(c2, on='ship',how='left')
f1 = f1.merge(c3, on='ship',how='left')
f1 = f1.merge(c4, on='ship',how='left')
f1 = f1.merge(c5, on='ship',how='left')
f1
ship x y v d time label date hour month ... d_12.5% d_25% d_37.5% d_50% d_62.5% d_75% d_87.5% d_95% hc_xy_s_max hc_d_s_max
0 0 6.152038e+06 5.124873e+06 2.59 102 1900-11-10 11:58:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
1 0 6.151230e+06 5.125218e+06 2.70 113 1900-11-10 11:48:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
2 0 6.150421e+06 5.125563e+06 2.70 116 1900-11-10 11:38:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
3 0 6.149612e+06 5.125907e+06 3.29 95 1900-11-10 11:28:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
4 0 6.148803e+06 5.126252e+06 3.18 108 1900-11-10 11:18:19 0 1900-11-10 11 11 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.183673 0.188020
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1996 1001 6.246323e+06 5.241154e+06 0.11 0 1900-11-17 09:43:41 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
1997 1001 6.246323e+06 5.241154e+06 0.22 10 1900-11-17 09:34:10 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
1998 1001 6.246323e+06 5.241154e+06 0.11 0 1900-11-17 09:23:39 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
1999 1001 6.246323e+06 5.241154e+06 0.11 287 1900-11-17 09:13:40 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903
2000 1001 6.246323e+06 5.241154e+06 0.32 271 1900-11-17 09:04:02 0 1900-11-17 9 11 ... 0.0 0.0 10.0 144.0 204.0 271.0 279.0 292.4 0.078424 0.270903

2001 rows × 83 columns

Binning features

Binning features for v, x and y, plus count statistics over the bins

pre_cols = df.columns

df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop') # quantile-bin the speed into 200 bins
df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique())))) # integer-encode the bins
for f in ['x', 'y']:
    df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop') # quantile-bin x and y into 1000 bins
    df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique())))) # integer-encode
    df[f + '_bin2'] = df[f] // 10000 # coarse bin: integer-divide by 10000
    df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts()) # record count of each fine bin
    df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts()) # record count of each coarse bin
    df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique') # distinct ships per fine bin
    df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique') # distinct ships per coarse bin
for i in [1, 2]:
    # Cross the x and y bins into one combined category, encode it, and map each category to its count
    df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')
    df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(
        dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))
    )
    df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())
for stat in ['max', 'min']:
    # Offset of each point from the max/min y within its x bin, and from the max/min x within its y bin
    df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)
    df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
v_bin x_bin1 x_bin2 x_bin1_count x_bin2_count x_bin1_id_nunique x_bin2_id_nunique y_bin1 y_bin2 y_bin1_count ... y_bin1_id_nunique y_bin2_id_nunique x_y_bin1 x_bin1_y_bin1_count x_y_bin2 x_bin2_y_bin2_count x_y_max y_x_max x_y_min y_x_min
0 0.0 0 615.0 116 8 2 2 0 512.0 2 ... 2 1 0 1 0 3 -115954.675157 0.000000 0.000000 49790.106760
1 0.0 1 615.0 2 8 2 2 1 512.0 2 ... 1 1 1 1 0 3 0.000000 0.000000 53070.048324 808.872353
2 0.0 2 615.0 2 8 2 2 1 512.0 2 ... 1 1 2 1 0 3 0.000000 -808.872353 54707.512092 0.000000
3 1.0 3 614.0 2 77 2 2 2 512.0 2 ... 1 1 3 1 1 8 0.000000 0.000000 52951.293120 808.787673
4 2.0 4 614.0 2 77 2 2 2 512.0 2 ... 1 1 4 1 1 8 0.000000 -808.787673 55461.653028 0.000000

5 rows × 21 columns
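A note on the binning calls above (toy values, made up for illustration): pd.qcut builds equal-frequency (quantile) bins, so every bin holds roughly the same number of records, while pd.cut builds equal-width bins, which outliers can stretch badly.

vals = pd.Series([0, 1, 2, 3, 4, 100])
print(pd.cut(vals, 3).value_counts())    # equal-width bins: the outlier stretches the ranges
print(pd.qcut(vals, 3).value_counts())   # equal-frequency bins: about two values per bin
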

Bin x and y and build (grid) regions

def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,
                y_min=1623579.449434373, y_max=4689471.1780792,
                row_bins=4380, col_bins=3136):

    # Establish bins on x direction and y direction
    x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)
    y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)

    # Determine which bin each x coordinate belongs to
    traj.sort_values(by='x', inplace=True)
    x_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, col_bins + 1):
        low, high = x_bins[i-1], x_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numerical stability.
            if (traj["x"].iloc[j] <= high) & (traj["x"].iloc[j] > low - 0.001):
                x_res[j] = i
                j += 1
            else:
                break
    traj["x_grid"] = x_res
    traj["x_grid"] = traj["x_grid"].astype(int)
    traj["x_grid"] = traj["x_grid"].apply(str)

    # Determine which bin each y coordinate belongs to
    traj.sort_values(by='y', inplace=True)
    y_res = np.zeros((len(traj), ))
    j = 0
    for i in range(1, row_bins + 1):
        low, high = y_bins[i-1], y_bins[i]
        while( j < len(traj)):
            # low - 0.001 for numerical stability.
            if (traj["y"].iloc[j] <= high) & (traj["y"].iloc[j] > low - 0.001):
                y_res[j] = i
                j += 1
            else:
                break
    traj["y_grid"] = y_res
    traj["y_grid"] = traj["y_grid"].astype(int)
    traj["y_grid"] = traj["y_grid"].apply(str)

    # Determine which bin each coordinate belongs to.
    traj["no_bin"] = [i + "_" + j for i, j in zip(
        traj["x_grid"].values.tolist(), traj["y_grid"].values.tolist())]
    traj.sort_values(by='time', inplace=True)
    return traj

bin_size = 800
col_bins = int((14226964.881853 - 12031967.16239096) / bin_size)
row_bins = int((4689471.1780792 - 1623579.449434373) / bin_size)
pre_cols = df.columns
# Adds the features x_grid, y_grid, no_bin.
# Note: traj_to_bin is called with its default x/y range and bin counts, so the col_bins/row_bins
# computed above are not actually passed in; since the sampled data lies outside the default range,
# every point ends up in grid '0_0' in the output below.
df = traj_to_bin(df)

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols]
x_grid y_grid no_bin
1606 0 0 0_0
1605 0 0 0_0
1604 0 0 0_0
1603 0 0 0_0
1602 0 0 0_0
... ... ... ...
1988 0 0 0_0
1987 0 0 0_0
1986 0 0 0_0
1985 0 0 0_0
1984 0 0 0_0

2001 rows × 3 columns
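As a side note (not part of the original code): the two nested loops above assign bins by walking the sorted coordinates; for points inside the [min, max] range, roughly the same assignment can be obtained in a vectorized way with np.digitize, using the same x_min/x_max/y_min/y_max/col_bins/row_bins parameters as traj_to_bin. Out-of-range points are handled differently (the loop version leaves them at 0), so this is a sketch rather than a drop-in replacement.

x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)
y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)
traj['x_grid'] = np.digitize(traj['x'], x_bins, right=True).astype(str)  # index of the bin-edge interval
traj['y_grid'] = np.digitize(traj['y'], y_bins, right=True).astype(str)
traj['no_bin'] = traj['x_grid'] + '_' + traj['y_grid']
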

DataFrame features

count values

def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the visit frequency of each bin."""
    visit_count_df = traj_data_df.groupby(["no_bin"]).count().reset_index()
    visit_count_df = visit_count_df[["no_bin", "x"]]
    visit_count_df.rename({"x":"visit_count"}, axis=1, inplace=True)
    return visit_count_df

def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):
    """Find and save the unique boat visit count of each bin."""
    unique_boat_count_df = traj_data_df.groupby(["no_bin"])["id"].nunique().reset_index()
    unique_boat_count_df.rename({"id":"visit_boat_count"}, axis=1, inplace=True)

    unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,
                                         on="no_bin", how="left")
    return unique_boat_count_df

traj_df = df[["id","x", "y",'time',"no_bin"]]
bin_to_coord_df = traj_df.groupby(["no_bin"]).median().reset_index()
bin_to_coord_df
no_bin x y
0 0_0 6.124951e+06 5.130672e+06
pre_cols = df.columns

# DataFrame tmp for finding POIs
visit_count_df = find_save_visit_count_table(
    traj_df, bin_to_coord_df)
unique_boat_count_df = find_save_unique_visit_count_table(
    traj_df, bin_to_coord_df)

# Features: 'visit_count', 'visit_boat_count'
df = df.merge(visit_count_df,on='no_bin',how='left')
df = df.merge(unique_boat_count_df,on='no_bin',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
visit_count visit_boat_count
0 2001 6
1 2001 6
2 2001 6
3 2001 6
4 2001 6

shift offset features

  • Shift the x, y coordinates in time by 1 and -1 steps
    • Treat the previous point, the next point and the previous-to-next span as a triangle and compute the three distances
    • Quantile-bin the distance moved since the previous record into 50 bins
    • Integer-encode those bins
pre_cols = df.columns

g = df.groupby('id')
for f in ['x', 'y']:
    # Shift x, y in time: previous (+1), next (-1), and previous-to-next
    df[f + '_prev_diff'] = df[f] - g[f].shift(1)
    df[f + '_next_diff'] = df[f] - g[f].shift(-1)
    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
    ## Distances: to the previous point, to the next point, and from previous to next
df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))
df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))
df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))
df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop') # quantile-bin the previous-step distance into 50 bins
df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(
    dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))
) # integer-encode the bins

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
x_prev_diff x_next_diff x_prev_next_diff y_prev_diff y_next_diff y_prev_next_diff dist_move_prev dist_move_next dist_move_prev_next dist_move_prev_bin
0 NaN -911.903731 NaN NaN 455.919062 NaN NaN 1019.524696 NaN NaN
1 911.903731 -911.965576 -1823.869307 -455.919062 455.831205 911.750267 1019.524696 1019.540730 2039.065423 1.0
2 911.965576 -918.791508 -1830.757085 -455.831205 20.360332 476.191538 1019.540730 919.017072 1891.673831 1.0
3 918.791508 -597.354368 -1516.145877 -20.360332 993.131365 1013.491697 919.017072 1158.940097 1823.695078 2.0
4 597.354368 -910.468269 -1507.822637 -993.131365 564.435006 1557.566370 1158.940097 1071.232628 2167.842730 3.0

Statistical features

Basic usage of statistical features

Supplement:

Grouped aggregation with agg is extremely important, so a few examples follow; for details see:
http://joyfulpandas.datawhale.club/Content/ch4.html

  • Pay attention to when {} and [] are used

The standard grouping pattern:

df.groupby(grouping_key)[columns].operation

First group the data:

gb = df.groupby(['School', 'Grade'])

  • [a] Apply several functions

gb.agg(['method'])  # e.g. the name of a built-in aggregation

e.g. gb.agg(['sum'])

  • [b] Apply specific aggregations to specific columns

gb.agg({'column': 'method'})

e.g. gb.agg({'Height': ['mean', 'max'], 'Weight': 'count'})

  • [c] Use a custom function

gb.agg(function_or_lambda)

e.g. gb.agg(lambda x: x.mean() - x.min())

  • [d] Rename the aggregation results

gb.agg([
('new_name', method)  # method can be a built-in name or a custom function
])

e.g. gb.agg([('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')])

  • [e] Specific aggregations per column + renaming
    gb.agg({'col1': [('range', lambda x: x.max() - x.min()), ('my_sum', 'sum')]})

Also note that when a single aggregation is applied to one or more columns, renaming still requires the square brackets; otherwise pandas cannot tell whether the string is a new name or a mistyped built-in aggregation name.

  • The code below mainly uses two forms:

one is df.groupby('id').agg({'column': 'method'}), the other is df.groupby('id')['column'].agg(list_of_named_aggs). A short runnable illustration of these patterns follows.
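As a quick self-contained illustration of patterns [b] and [d] (a toy DataFrame made up for this note, not competition data):

import pandas as pd

toy = pd.DataFrame({'ship': [0, 0, 1, 1, 1],
                    'v':    [0.5, 2.0, 1.0, 3.0, 5.0],
                    'x':    [1.0, 2.0, 3.0, 4.0, 5.0]})
gb = toy.groupby('ship')

# [b] specific aggregations per column
print(gb.agg({'v': ['mean', 'max'], 'x': 'count'}))

# [d] per-column aggregation with renamed results
print(gb['v'].agg([('v_range', lambda s: s.max() - s.min()), ('v_sum', 'sum')]))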

pre_cols = df.columns

def start(x):
    try:
        return x[0]
    except:
        return None

def end(x):
    try:
        return x[-1]
    except:
        return None


def mode(x):
    try:
        return pd.Series(x).value_counts().index[0]
    except:
        return None

for f in ['dist_move_prev_bin', 'v_bin']:
    # Join each ship's distance-bin / speed-bin codes into a comma-separated 'sentence'
    df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))

# A batch of basic statistics: each column gets its own list of aggregations
g = df.groupby('id').agg({
    'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
    'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
    'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
    'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
    'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
    'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
    'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
    'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
}).reset_index()
g.columns = ['_'.join(col).strip() for col in g.columns] # flatten the MultiIndex column names
g.rename(columns={'id_': 'id'}, inplace=True) # rename 'id_' back to 'id'
cols = [f for f in g.keys() if f != 'id'] # list of feature column names
cols
['id_count',
 'x_bin1_mode',
 'y_bin1_mode',
 'x_bin2_mode',
 'y_bin2_mode',
 'x_y_bin1_mode',
 'x_mean',
 'x_max',
 'x_min',
 'x_std',
 'x_ptp',
 'x_start',
 'x_end',
 'y_mean',
 'y_max',
 'y_min',
 'y_std',
 'y_ptp',
 'y_start',
 'y_end',
 'v_mean',
 'v_max',
 'v_min',
 'v_std',
 'v_ptp',
 'dir_mean',
 'x_bin1_count_mean',
 'y_bin1_count_mean',
 'y_bin1_count_max',
 'y_bin1_count_min',
 'x_bin2_count_mean',
 'x_bin2_count_max',
 'x_bin2_count_min',
 'y_bin2_count_mean',
 'y_bin2_count_max',
 'y_bin2_count_min',
 'x_bin1_y_bin1_count_mean',
 'x_bin1_y_bin1_count_max',
 'x_bin1_y_bin1_count_min',
 'dist_move_prev_mean',
 'dist_move_prev_max',
 'dist_move_prev_std',
 'dist_move_prev_min',
 'dist_move_prev_sum',
 'x_y_min_mean',
 'x_y_min_min',
 'y_x_min_mean',
 'y_x_min_min',
 'x_y_max_mean',
 'x_y_max_min',
 'y_x_max_mean',
 'y_x_max_min']
df = df.merge(g,on='id',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
dist_move_prev_bin_sen v_bin_sen id_count x_bin1_mode y_bin1_mode x_bin2_mode y_bin2_mode x_y_bin1_mode x_mean x_max ... dist_move_prev_min dist_move_prev_sum x_y_min_mean x_y_min_min y_x_min_mean y_x_min_min x_y_max_mean x_y_max_min y_x_max_mean y_x_max_min
0 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
1 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
2 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
3 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374
4 nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5.... 19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0... 411 145 88 611.0 508.0 252 6.123711e+06 6.151439e+06 ... 0.0 381420.840554 2458.92664 0.0 4603.814472 0.0 -5075.500661 -57432.286364 -3493.862248 -32066.348374

5 rows × 54 columns

Statistics after splitting the data

TODO: the group_feature function is broken, so this part does not run as-is (see the KeyError below)!

def group_feature(df, key, target, aggs, flag):
    """Build the aggregation methods and the renamed column names via a dict."""
    print('group_feature:  ' , (key, target, aggs, flag))
    agg_dict = {}
    agg_list = []
    for ag in aggs:
        # agg_dict['{}_{}_{}'.format(target,ag,flag)] = ag
        # agg_dict['{}_{}_{}'.format(target,ag,flag)] = [('{}_{}_{}'.format(target,ag,flag)+'_'+ag, ag)]
        agg_dict['{}_{}_{}'.format(target, ag, flag)] = [(ag, ag)]
        # agg_list.append(('{}_{}_{}'.format(target,ag,flag)+'__'+ag, ag))
    print(agg_dict)
    t = df.groupby(key)[target].agg(aggs).reset_index()
    return t # TODO: agg_dict is built but never applied, so the columns of t lack the target/flag prefix -- fix pending (see the note after the traceback below)
    # return df

def extract_feature(df, train, flag):
    '''
    Statistical features.
    Pay attention to how group_feature is used and what it produces.
    '''
    if (flag == 'on_night') or (flag == 'on_day'): 
        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        # return train
    
    
    if flag == "0":
        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')  
    elif flag == "1":
        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left')
        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)
        train = pd.merge(train, t, on='ship', how='left') 
        # .nunique().to_dict() turns the per-ship unique-value counts into a dict
        # to_dict() together with map() is a convenient way to build mapped statistical features,
        # e.g. conversion rates in CTR-style classification problems
        # Question: how would you build a conversion-rate feature for train + test from the (0, 1) labels
        # given in the training set, noting that some ids appear in both train and test?
        hour_nunique = df.groupby('ship')['speed'].nunique().to_dict()
        train['speed_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)   
        hour_nunique = df.groupby('ship')['direction'].nunique().to_dict()
        train['direction_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)  

    t = group_feature(df, 'ship','x',['max','min','mean','median','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship','y',['max','min','mean','median','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')
    t = group_feature(df, 'ship','base_dis_diff',['max','min','mean','std','skew'],flag)
    train = pd.merge(train, t, on='ship', how='left')

       
    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]
    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]
    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)]==0, 0.001, train['x_max_x_min_{}'.format(flag)])
    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)]

    mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()
    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])

    return train



data  = df.copy()
data.rename(columns={
    'id':'ship',
    'v':'speed',
    'dir':'direction'
},inplace=True)
# Keep one row per ship (deduplicate)
data_label = data.drop_duplicates(['ship'],keep = 'first')

data_1 = data[data['speed']==0]
data_2 = data[data['speed']!=0]
data_label = extract_feature(data_1, data_label,"0")
data_label = extract_feature(data_2, data_label,"1")

data_1 = data[data['day_nig'] == 0]
data_2 = data[data['day_nig'] == 1]
data_label = extract_feature(data_1, data_label,"on_night")
data_label = extract_feature(data_2, data_label,"on_day")
data_label.rename(columns={'ship':'id','speed':'v','direction':'dir'},inplace=True)
data_label
group_feature:   ('ship', 'direction', ['max', 'median', 'mean', 'std', 'skew'], '0')
{'direction_max_0': [('max', 'max')], 'direction_median_0': [('median', 'median')], 'direction_mean_0': [('mean', 'mean')], 'direction_std_0': [('std', 'std')], 'direction_skew_0': [('skew', 'skew')]}
group_feature:   ('ship', 'x', ['max', 'min', 'mean', 'median', 'std', 'skew'], '0')
{'x_max_0': [('max', 'max')], 'x_min_0': [('min', 'min')], 'x_mean_0': [('mean', 'mean')], 'x_median_0': [('median', 'median')], 'x_std_0': [('std', 'std')], 'x_skew_0': [('skew', 'skew')]}
group_feature:   ('ship', 'y', ['max', 'min', 'mean', 'median', 'std', 'skew'], '0')
{'y_max_0': [('max', 'max')], 'y_min_0': [('min', 'min')], 'y_mean_0': [('mean', 'mean')], 'y_median_0': [('median', 'median')], 'y_std_0': [('std', 'std')], 'y_skew_0': [('skew', 'skew')]}
group_feature:   ('ship', 'base_dis_diff', ['max', 'min', 'mean', 'std', 'skew'], '0')
{'base_dis_diff_max_0': [('max', 'max')], 'base_dis_diff_min_0': [('min', 'min')], 'base_dis_diff_mean_0': [('mean', 'mean')], 'base_dis_diff_std_0': [('std', 'std')], 'base_dis_diff_skew_0': [('skew', 'skew')]}



---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

D:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._get_loc_duplicates()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()


pandas\_libs\index.pyx in pandas._libs.index.IndexEngine._unpack_bool_indexer()


KeyError: 'y_median_0'


The above exception was the direct cause of the following exception:


KeyError                                  Traceback (most recent call last)

<ipython-input-47-2b795bbafc35> in <module>
     75 data_1 = data[data['speed']==0]
     76 data_2 = data[data['speed']!=0]
---> 77 data_label = extract_feature(data_1, data_label,"0")
     78 data_label = extract_feature(data_2, data_label,"1")
     79 


<ipython-input-47-2b795bbafc35> in extract_feature(df, train, flag)
     58     mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()
     59     train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)
---> 60     train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])
     61 
     62     return train


D:\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]


D:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:


KeyError: 'y_median_0'
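The KeyError happens because group_feature aggregates with agg(aggs) but never applies the renamed columns prepared in agg_dict, so the result has columns named simply 'max', 'median', ... without the target and flag suffixes that extract_feature expects (e.g. 'y_median_0'). A minimal repair (my sketch, not from the original notebook) is to rename the columns right after aggregating:

def group_feature(df, key, target, aggs, flag):
    """Aggregate `target` by `key` and name the columns '<target>_<agg>_<flag>'."""
    t = df.groupby(key)[target].agg(aggs).reset_index()
    t.columns = [key] + ['{}_{}_{}'.format(target, ag, flag) for ag in aggs]
    return t

With this version the extract_feature calls above should run through and produce the per-flag x/y/speed/direction statistics.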
new_cols = [i for i in data_label.columns if i not in df.columns]
df = df.merge(data_label[new_cols+['id']],on='id',how='left')

df[new_cols].head()

Concrete usage of the statistical features

temp = df.copy()
temp.rename(columns={'id':'ship','dir':'d'},inplace=True)

def coefficient_of_variation(x):
    x = x.values
    if np.mean(x) == 0:
        return 0
    return np.std(x) / np.mean(x)

def max_2(x):
    x = list(x.values)
    x.sort(reverse=True)
    return x[1]

def max_3(x):
    x = list(x.values)
    x.sort(reverse=True)
    return x[2]

def diff_abs_mean(x):  # mean absolute value of the first-order differences
    return np.mean(np.abs(np.diff(x)))

f1 = pd.DataFrame()
for col in ['x', 'y', 'v', 'd']:
    features = temp.groupby('ship', as_index=False)[col].agg({
        '{}_min'.format(col): 'min',
        '{}_max'.format(col): 'max',
        '{}_mean'.format(col): 'mean',
        '{}_median'.format(col): 'median',
        '{}_std'.format(col): 'std',
        '{}_skew'.format(col): 'skew',
        '{}_sum'.format(col): 'sum',
        '{}_diff_abs_mean'.format(col): diff_abs_mean,
        '{}_mode'.format(col): lambda x: x.value_counts().index[0],
        '{}_coefficient_of_variation'.format(col): coefficient_of_variation,
        '{}_max2'.format(col): max_2,
        '{}_max3'.format(col): max_3
    })
    if f1.shape[0] == 0:
        f1 = features
    else:
        f1 = f1.merge(features, on='ship', how='left')

f1['x_max_x_min'] = f1['x_max'] - f1['x_min']
f1['y_max_y_min'] = f1['y_max'] - f1['y_min']
f1['y_max_x_min'] = f1['y_max'] - f1['x_min']
f1['x_max_y_min'] = f1['x_max'] - f1['y_min']
# Slope (y range over x range)
f1['slope'] = f1['y_max_y_min'] / np.where(f1['x_max_x_min'] == 0, 0.001, f1['x_max_x_min'])
# Bounding-box area and the diagonal distance between the extreme points
f1['area'] = f1['x_max_x_min'] * f1['y_max_y_min']
f1['dis_max_min'] = (f1['x_max_x_min'] ** 2 + f1['y_max_y_min'] ** 2) ** 0.5

# Distance of the mean (x, y) point from the origin
f1['dis_mean'] = (f1['x_mean'] ** 2 + f1['y_mean'] ** 2) ** 0.5
f1['area_d_dis_max_min'] = f1['area'] / f1['dis_max_min']

# 'Acceleration': per-second displacement between consecutive records
temp.sort_values(['ship', 'time'], ascending=True, inplace=True)
temp['ynext'] = temp.groupby('ship')['y'].shift(-1)
temp['xnext'] = temp.groupby('ship')['x'].shift(-1)
temp['ynext'] = temp['ynext'].fillna(method='ffill')
temp['xnext'] = temp['xnext'].fillna(method='ffill')
temp['timenext'] = temp.groupby('ship')['time'].shift(-1)
temp['timediff'] = np.abs(temp['timenext'] - temp['time'])
temp['a_y'] = temp.apply(lambda x: (x['ynext'] - x['y']) / x['timediff'].total_seconds(), axis=1)
temp['a_x'] = temp.apply(lambda x: (x['xnext'] - x['x']) / x['timediff'].total_seconds(), axis=1)
for col in ['a_y', 'a_x']:
    f2 = temp.groupby('ship', as_index=False)[col].agg({
        '{}_max'.format(col): 'max',
        '{}_mean'.format(col): 'mean',
        '{}_min'.format(col): 'min',
        '{}_median'.format(col): 'median',
        '{}_std'.format(col): 'std'})
    f1 = f1.merge(f2, on='ship', how='left')

# Curvature-like ratio: two-segment path length over the straight-line distance
temp['y_pre'] = temp.groupby('ship')['y'].shift(1)
temp['x_pre'] = temp.groupby('ship')['x'].shift(1)
temp['y_pre'] = temp['y_pre'].fillna(method='bfill')
temp['x_pre'] = temp['x_pre'].fillna(method='bfill')
temp['d_pre'] = ((temp['x'] - temp['x_pre']) ** 2 + (temp['y'] - temp['y_pre']) ** 2) ** 0.5
temp['d_next'] = ((temp['xnext'] - temp['x']) ** 2 + (temp['ynext'] - temp['y']) ** 2) ** 0.5
temp['d_pre_next'] = ((temp['xnext'] - temp['x_pre']) ** 2 + (temp['ynext'] - temp['y_pre']) ** 2) ** 0.5
temp['curvature'] = (temp['d_pre'] + temp['d_next']) / temp['d_pre_next']

f2 = temp.groupby('ship', as_index=False)['curvature'].agg({
    'curvature_max': 'max',
    'curvature_mean': 'mean',
    'curvature_min': 'min',
    'curvature_median': 'median',
    'curvature_std': 'std'})
f1 = f1.merge(f2, on='ship', how='left')
f1

Embedding features

A brief note on word2vec

  • Question!
    Why, in data-mining competitions, do we use word2vec or NMF (there are many methods, but these two are common) to build "word embedding features"?

Answer: to climb the leaderboard!

The score gain is just the visible effect; behind it lies a view of the data as a whole. The statistical and domain features above also consider the data globally, but they tend to ignore the relations between data points. Take everyone's age as an example: statistics such as the mean or the extremes, or domain features such as standard weight = weight / age, are all hand-crafted from human understanding. If we instead treat these values as words, and treat the collection of them over all the data (or over one group) as a document, we can pick up additional regularities, i.e. new features.

My own take: use the relations between features and feature groups as new features (Word2vec is essentially reconstructing the linguistic context of words).

Where word2vec is used

So far, word embeddings have been used for feature generation, document clustering, text classification and other natural language processing tasks, for example:

Finding similar words: word embeddings can be used to find words close to a given word.

Building groups of related words: clustering words so that related words end up together;

Features for text classification: words cannot be fed directly into a machine learning model, so we first project them into a vector space and then train models on those vectors;

Document clustering.

The tasks above are text-related, but embedding models have by now been extended to many other domains. Typical examples:

On Weibo, represent each user as a word, build an embedding per user, and compute similarities between users to find the most closely related people;

In recommendation, embed each item from users' purchase histories, compute item-to-item similarities, and recommend accordingly;

In this Tianchi maritime problem, embedding the ships that appear at the same coordinates yields a vector per ship, which reveals the ships that frequently operate in particular areas;

In short, word embeddings are a huge help for finding relations between objects, and embedding techniques now show up in almost every data competition. Let us look at the most widely used model, Word2Vec.

What does Word2Vec do?

Word2vec represents words in a vector space: words with similar meanings appear close together, while dissimilar words lie far apart. This is also called a semantic relationship.

Neural networks do not understand text, only numbers; word embeddings provide a way to turn text into numeric vectors.

Word2vec reconstructs the linguistic context of words. What is linguistic context? In everyday life, when we communicate by speaking or writing, the other person tries to work out the intent of the sentence. For example, for "what is the temperature in India", the context is that the user wants to know the temperature in India.

In short, the meaning of a sentence is anchored in its context: the words and sentences surrounding spoken or written language help determine its meaning. Word2vec learns the vector representation of a word from its context.

  • Reference

[NLP] Understanding the essence of Word2vec word vectors: https://zhuanlan.zhihu.com/p/26306795

Constructing word vectors with Word2vec

def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,
                        iters=40, min_count=3, window_size=25,
                        seed=9012, num_runs=5, word_feat="no_bin"):
    """CBOW embedding for trajectory data."""
    boat_id = traj_data_corpus['id'].unique()
    sentences, embedding_df_list, embedding_model_list = [], [], []
    for i in boat_id:
        traj = traj_data_corpus[traj_data_corpus['id']==i]
        sentences.append(traj[word_feat].values.tolist())

    print("\n@Start CBOW word embedding at {}".format(datetime.now()))
    print("-------------------------------------------")
    for i in tqdm(range(num_runs)):
        model = Word2Vec(sentences, size=embedding_size,
                                  min_count=min_count,
                                  workers=mp.cpu_count(),
                                  window=window_size,
                                  seed=seed, iter=iters, sg=0)

        # Sentence vector
        embedding_vec = []
        for ind, seq in enumerate(sentences):
            seq_vec, word_count = 0, 0
            for word in seq:
                if word not in model:
                    continue
                else:
                    seq_vec += model[word]
                    word_count += 1
            if word_count == 0:
                embedding_vec.append(embedding_size * [0])
            else:
                embedding_vec.append(seq_vec / word_count)
        embedding_vec = np.array(embedding_vec)
        embedding_cbow_df = pd.DataFrame(embedding_vec, 
            columns=["embedding_cbow_{}_{}".format(word_feat, i) for i in range(embedding_size)])
        embedding_cbow_df["id"] = boat_id
        embedding_df_list.append(embedding_cbow_df)
        embedding_model_list.append(model)
    print("-------------------------------------------")
    print("@End CBOW word embedding at {}".format(datetime.now()))
    return embedding_df_list, embedding_model_list
embedding_size=70
iters=70
min_count=3
window_size=25
num_runs=1

df_list, model_list = traj_cbow_embedding(df,
                                          embedding_size=embedding_size,
                                          iters=iters, min_count=min_count,
                                          window_size=window_size,
                                          seed=9012,
                                          num_runs=num_runs,
                                          word_feat="no_bin")

train_embedding_df_list = [d.reset_index(drop=True) for d in df_list]
fea = train_embedding_df_list[0]
fea = pd.DataFrame(fea)
pre_cols = df.columns
df = df.merge(fea,on='id',how='left')


new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()
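A small compatibility note (mine, not from the original notebook): the Word2Vec call in traj_cbow_embedding uses the gensim 3.x argument names. On gensim 4.x the constructor renamed size to vector_size and iter to epochs, and word lookups go through model.wv, so the equivalent call would look roughly like this:

model = Word2Vec(sentences, vector_size=embedding_size, min_count=min_count,
                 workers=mp.cpu_count(), window=window_size,
                 seed=seed, epochs=iters, sg=0)
# and use `word in model.wv` / `model.wv[word]` instead of `word in model` / `model[word]`
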
boat_id = df['id'].unique()
total_embedding = pd.DataFrame(boat_id, columns=["id"])
traj_data = df[['v','dir','id']].rename(columns = {'v':'speed','dir':'direction'})

# Step 1: Construct the words
traj_data_corpus = []
traj_data["speed_str"]     = traj_data["speed"].apply(lambda x: str(int(x*100)))
traj_data["direction_str"] = traj_data["direction"].apply(str)
traj_data["speed_dir_str"] = traj_data["speed_str"] + "_" + traj_data["direction_str"]
traj_data_corpus = traj_data[["id", "speed_str",
                                  "direction_str", "speed_dir_str"]]
print("\n@Round 2 speed embedding:")
df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                          embedding_size=10,
                                          iters=40, min_count=3,
                                          window_size=25, seed=9102,
                                          num_runs=1, word_feat="speed_str")
speed_embedding = df_list[0].reset_index(drop=True)
total_embedding = pd.merge(total_embedding, speed_embedding,
                           on="id", how="left")


print("\n@Round 2 direction embedding:")
df_list, model_list = traj_cbow_embedding(traj_data_corpus,
                                          embedding_size=12,
                                          iters=70, min_count=3,
                                          window_size=25, seed=9102,
                                          num_runs=1, word_feat="speed_dir_str")
speed_dir_embedding = df_list[0].reset_index(drop=True)
total_embedding = pd.merge(total_embedding, speed_dir_embedding,
                           on="id", how="left")
pre_cols = df.columns
df = df.merge(total_embedding,on='id',how='left')

new_cols = [i for i in df.columns if i not in pre_cols]
df[new_cols].head()

Extracting topic distributions with NMF

class nmf_list(object):
    def __init__(self,data,by_name,to_list,nmf_n,top_n):
        self.data = data
        self.by_name = by_name
        self.to_list = to_list
        self.nmf_n = nmf_n
        self.top_n = top_n

    def run(self,tf_n):
        df_all = self.data.groupby(self.by_name)[self.to_list].apply(lambda x :'|'.join(x)).reset_index()
        self.data =df_all.copy()

        print('build word_fre')
        # Build word frequencies
        def word_fre(x):
            word_dict = []
            x = x.split('|')
            docs = []
            for doc in x:
                doc = doc.split()
                docs.append(doc)
                word_dict.extend(doc)
            word_dict = Counter(word_dict)
            new_word_dict = {}
            for key,value in word_dict.items():
                new_word_dict[key] = [value,0]
            del word_dict  
            del x
            for doc in docs:
                doc = Counter(doc)
                for word in doc.keys():
                    new_word_dict[word][1] += 1
            return new_word_dict 
        self.data['word_fre'] = self.data[self.to_list].apply(word_fre)

        print('build top_' + str(self.top_n))
        # Keep the top_n most frequent words
        def top_100(word_dict):
            return sorted(word_dict.items(),key = lambda x:(x[1][1],x[1][0]),reverse = True)[:self.top_n]
        self.data['top_'+str(self.top_n)] = self.data['word_fre'].apply(top_100)
        def top_100_word(word_list):
            words = []
            for i in word_list:
                i = list(i)
                words.append(i[0])
            return words 
        self.data['top_'+str(self.top_n)+'_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)
        # print('shape of top_' + str(self.top_n) + '_word')
        print(self.data.shape)

        word_list = []
        for i in self.data['top_'+str(self.top_n)+'_word'].values:
            word_list.extend(i)
        word_list = Counter(word_list)
        word_list = sorted(word_list.items(),key = lambda x:x[1],reverse = True)
        user_fre = []
        for i in word_list:
            i = list(i)
            user_fre.append(i[1]/self.data[self.by_name].nunique())
        stop_words = []
        for i,j in zip(word_list,user_fre):
            if j>0.5:
                i = list(i)
                stop_words.append(i[0])

        print('start title_feature')
        # Treat the merged tag list as one sentence for text processing
        self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))
        self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])
        self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))

        print('start NMF')
        # Transform the documents with TF-IDF
        tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n,tf_n))
        tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)
        # Use NMF to extract the topic distribution of the text
        text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)


        # Organize the result and return it
        name = [str(tf_n) + self.to_list + '_' +str(x) for x in range(1,self.nmf_n+1)]
        tag_list = pd.DataFrame(text_nmf)
        print(tag_list.shape)
        tag_list.columns = name
        tag_list[self.by_name] = self.data[self.by_name]
        column_name = [self.by_name] + name
        tag_list = tag_list[column_name]
        return tag_list
data = df.copy()
data.rename(columns={'v':'speed','id':'ship'},inplace=True)
for j in range(1,4):
    print('********* {} *******'.format(j))
    for i in ['speed','x','y']:
        data[i + '_str'] = data[i].astype(str)
        nmf = nmf_list(data,'ship',i + '_str',8,2)
        nmf_a = nmf.run(j)
        nmf_a.rename(columns={'ship':'id'},inplace=True)
        data_label = data_label.merge(nmf_a,on = 'id',how = 'left')
new_cols = [i for i in data_label.columns if i not in df.columns]
df = df.merge(data_label[new_cols+['id']],on='id',how='left')

df[new_cols].head()
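Stripped of the word-frequency bookkeeping, the core of the class is two steps: TF-IDF over each ship's 'document' of discretized values, then NMF on the TF-IDF matrix to get a topic-weight vector per ship. A minimal sketch with toy sentences (made up for illustration, not competition data):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['10 10 250 250 300',     # toy per-ship 'sentences' of discretized values
        '500 500 450 10 10',
        '250 250 250 300 300']
tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)
topics = NMF(n_components=2).fit_transform(tfidf)   # one row of topic weights per ship
print(topics.shape)                                 # (3, 2)
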

Summary and reflections

  • Competition-specific features: how do you build features that actually help for this task?

      Hint: use EDA and the literature on the task to find and build domain features with real-world meaning

  • Binning features: almost every topline solution constructs binning features. Why are they so important and effective, and under what circumstances do they work well? (Why does this competition need binning features?)

      Hint: the principles behind binning

  • DataFrame features: the built-in methods of a pandas DataFrame can generate a large number of statistical features. Suggestion: maintain your own set of statistical-feature functions for tabular data

      Hint: DataWhale's Joyful Pandas

  • Embedding features: a score-boosting trick. Why does turning a sequence into an NLP-style sentence or document and vectorizing it work so well? How should the parameters be tuned to build good word vectors for a given dataset?

      Hint: study Word2vec
    

Appendix

Sources

1 Team: Pursuing the Past Youth
Link:
https://github.com/juzstu/TianChi_HaiYang

2 Team: liu123的航空母舰队
Link:
https://github.com/MichaelYin1994/tianchi-trajectory-data-mining

3 Team: 天才海神号
Link:
https://github.com/fengdu78/tianchi_haiyang?spm=5176.12282029.0.0.5b97301792pLch

4 Team: 大白
Link:
https://github.com/Ai-Light/2020-zhihuihaiyang

5 Team: 抗毒救灾
Link:
https://github.com/wudejian789/2020DCIC_A_Rank7_B_Rank12

6 Team: 蜗牛坐车里团队
Link:
https://tianchi.aliyun.com/notebook-ai/detail?postId=114808

7 Team: 用欧气驱散疫情
Link:
https://github.com/tudoulei/2020-Digital-China-Innovation-Competition

Data

The data used is hy_round1_train_20200102 (preliminary-round data)

How to run

The detailed, cleaned-up code for each team is in ipynb/*.ipynb;
the numbering matches the list above

Outputs

The output files are in result/*.csv

Partial explanations

Recommended learning resources

Hands-on: topline code from well-known competitions, e.g. open-source solutions on Kaggle and Tianchi

Books:

+ 《阿里云天池大赛赛题解析》 (Alibaba Cloud Tianchi Competition Problem Analysis)

   [I also have study notes on this book (https://blog.csdn.net/qq_44574333/article/details/109611764)]

+ 《美团机器学习实战》 (Meituan Machine Learning in Practice)

Tutorials:

+ Joyful Pandas, highly recommended! Fundamental and efficient
http://joyfulpandas.datawhale.club/
posted @ 2021-04-18 21:27 山枫叶纷飞