[IJCAI-2018] Search Ads: Imbalanced Data

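I'm not good at competitions, at feature engineering, or at tuning hyperparameters, and I have no server for parallel runs. Everyone's baseline beats my model. I'm writing this post mainly to share how I understand the data and the rough framework I've been thinking in, hoping it gives you some small inspiration or help.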

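Someone like me, with no experience, no track record, and no teammates, whose feature engineering stops at dummy variables, whose dimensionality reduction stops at PCA, whose models are LR and SVM, whose tuning is CV, and whose ensembling is averaging, exists in every competition mainly to enlarge the denominator. So when I saw people sharing their baselines on the forum I was thrilled, and when I saw them building all kinds of magical features that actually improved the model's logloss, I was truly impressed. Having neither a brilliant mind nor enough patience for detail, I went with "borrowing": I copied a baseline from the forum and ran it on my machine. Wow, a logloss of 0.084 on the first try, only 0.004 behind the 0.080 at the top of the ranking. I was about to storm the leaderboard! Fired up, I dug through models and hammered out code, feeling like I had my own background music and the whole world belonged to me. But then, when I wanted to see which samples were predicted as 1, I was stunned: no one! Across my 18,000 test samples, the model did not predict a single 1. Not one!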

And that is the problem. You clever people probably knew about the class imbalance in CTR data all along, but dull as I am, I had not noticed it!

So much for the venting.

How should imbalanced data, such as the binary prediction in this CTR task, be handled?

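The positive and negative classes are severely imbalanced, at roughly 50 negatives per positive. If we predict directly on this data, recall on the smaller class will be extremely low.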

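This is because traditional learning methods aim to maximize overall classification accuracy and treat every sample equally, so the classifier ends up accurate on the majority class and poor on the minority class. Take the CTR example with a 50:1 ratio of negatives to positives: even if the algorithm predicts the majority class for every sample, accuracy still reaches about 98% (50/51). Traditional learning algorithms are therefore quite limited on imbalanced datasets. Their predictions favor the majority: the minority class is small and each of its samples is weighted the same as any other, so the cost of missing the minority is tiny.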
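The 50/51 arithmetic can be checked with a few lines of plain Python. This is only a sketch on made-up labels (5000 negatives, 100 positives), not the competition data, and the helper `accuracy_and_recall` is a name introduced here for illustration.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall of the positive class) for 0/1 labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    positives = sum(1 for t in y_true if t == 1)
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    accuracy = correct / len(y_true)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


# Hypothetical toy labels: 50 negatives for every positive.
y_true = [0] * 5000 + [1] * 100
# A "classifier" that always predicts the majority class.
y_all_negative = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_all_negative)
print("accuracy = %.4f, recall = %.4f" % (accuracy, recall))
# accuracy = 0.9804, recall = 0.0000
```

Accuracy looks excellent while not a single positive is found, which is exactly the situation described above.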

Solutions fall into two main groups.

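The first works at the data level, mainly through sampling: since the samples are imbalanced, resample them by some strategy so that the data become more balanced. Resampling methods include over-sampling, under-sampling, and combinations of the two. Over-sampling increases the number of minority samples; under-sampling decreases the number of majority samples.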
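A minimal sketch of the two basic resampling strategies, using only the standard library. The function names and the toy data are assumptions made for illustration; it assumes binary 0/1 labels with both classes present. Libraries such as imbalanced-learn offer more elaborate samplers. Resampling is normally applied to the training split only, so that evaluation still sees the original class ratio.

```python
import random


def _split_indices(y):
    """Return (minority_indices, majority_indices) for binary 0/1 labels."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(X, y, seed=0):
    """Drop majority rows at random until both classes have equal size."""
    minority, majority = _split_indices(y)
    rng = random.Random(seed)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def random_oversample(X, y, seed=0):
    """Duplicate minority rows at random until both classes have equal size."""
    minority, majority = _split_indices(y)
    rng = random.Random(seed)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    kept = sorted(minority + majority + extra)
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data: 10 positives and 500 negatives (50:1).
X = [[i] for i in range(510)]
y = [1] * 10 + [0] * 500

X_under, y_under = random_undersample(X, y)
X_over, y_over = random_oversample(X, y)
print(len(y_under), sum(y_under))  # 20 10
print(len(y_over), sum(y_over))    # 1000 500
```

Under-sampling throws away most of the majority rows, while over-sampling keeps everything but repeats minority rows, so the two trade information loss against a risk of overfitting the duplicated samples.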

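The second works at the algorithm level: take the different costs of different misclassifications into account when optimizing, so that the algorithm also performs well on imbalanced data. In practice this means rewriting the cost function to assign a large cost to misclassifying minority labels.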
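One way to picture a cost-sensitive loss is a weighted binary cross-entropy. The sketch below is an illustration only: `weighted_log_loss` is a hypothetical helper, and the weights 1 for positives and 0.02 for negatives mirror the sample weights used in the script further down.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0, neg_weight=1.0,
                      eps=1e-15):
    """Weighted mean binary cross-entropy; weights scale each class."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        w = pos_weight if t == 1 else neg_weight
        total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Hypothetical toy case: 50 negatives, 1 positive, and a model that
# always outputs a low probability of 0.02.
y_true = [0] * 50 + [1]
p_const = [0.02] * len(y_true)

plain = weighted_log_loss(y_true, p_const)
weighted = weighted_log_loss(y_true, p_const,
                             pos_weight=1.0, neg_weight=0.02)
print("plain = %.4f, weighted = %.4f" % (plain, weighted))
```

With equal weights the single missed positive is diluted by the 50 easy negatives; after down-weighting the negatives, the positive carries half of the total weight, so the same predictions are penalized far more heavily.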

PS: The attached Python code, reproduced below, compares the models by logloss and AUC. It is runnable and does not hit a memory error.

# -*- coding: utf-8 -*-
"""
Created on Wed Apr 4 10:53:58 2018
@author : HaiyanJiang
@email : jianghaiyan.cn@gmail.com

What does this script do?
Some ideas for improving the accuracy of imbalanced data classification.

Data characteristics:
    imbalanced data.
The models:
    model_baseline : lgb
    model_baseline2 : another lgb
    model_baseline3 : bagging

Other notes:
Besides the basic features, the features include each user's click counts
within the current hour and the current day, plus the current hour.
    'context_day', 'context_hour',
    'user_query_day', 'user_query_hour', 'user_query_day_hour',
    non_feat = [
        'instance_id', 'user_id', 'context_id', 'item_category_list',
        'item_property_list', 'predict_category_property',
        'context_timestamp', 'TagTime', 'context_day'
    ]
"""

import itertools
import time

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# scipy.interp is deprecated and missing from newer SciPy releases;
# numpy.interp is imported under the same name instead.
from numpy import interp
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import auc, confusion_matrix, log_loss, roc_curve


def read_bigcsv(filename, **kw):
    """Read a large CSV in chunks to limit peak memory, then concatenate."""
    chunk_size = 100000
    chunks = []
    with open(filename) as rf:
        reader = pd.read_csv(rf, iterator=True, **kw)
        while True:
            try:
                chunks.append(reader.get_chunk(chunk_size))
            except StopIteration:
                print("Iteration is stopped.")
                break
    df = pd.concat(chunks, axis=0, join='outer', ignore_index=True)
    return df


def timestamp2datetime(value):
    """Convert a Unix timestamp to 'YYYY-mm-dd HH:MM:SS' in local time."""
    value = time.localtime(value)
    dt = time.strftime('%Y-%m-%d %H:%M:%S', value)
    return dt


'''
from matplotlib import pyplot as plt
tt = data['context_timestamp']
plt.plot(tt)
# The plot shows that the timestamps are not sorted; some rows are out of
# order. For an online model, the data must be sorted by time first.
# aa = data[data['user_id']==24779788309075]
aa = data_train[data_train.duplicated(subset=None, keep='first')]
bb = data_train[data_train.duplicated(subset=None, keep='last')]
cc = data_train[data_train.duplicated(subset=None, keep=False)]

a2 = pd.DataFrame(train_id)[pd.DataFrame(train_id).duplicated(keep=False)]
b2 = train_id[train_id.duplicated(keep='last')]
c2 = train_id[train_id.duplicated(keep=False)]

c2 = data_train[data_train.duplicated(subset=None, keep=False)]

Verified: 'instance_id' contains duplicates.
a3 = Xdata[Xdata['instance_id']==1037061371711078396]
'''


def convert_timestamp(data):
    '''
    1. convert timestamp to datetime.
    2. no sort, no reindex.
    data.duplicated(subset=None, keep='first')
    TagTime from-to is ('2018-09-18 00:00:01', '2018-09-24 23:59:47')
    'user_query_day', 'user_query_day_hour', 'hour',
    np.corrcoef(data['user_query_day'], data['user_query_hour'])
    np.corrcoef(data['user_query_hour'], data['user_query_day_hour'])
    np.corrcoef(data['user_query_day'], data['user_query_day_hour'])
    '''
    data['TagTime'] = data['context_timestamp'].apply(timestamp2datetime)
    # data['TagTime'][0], data['TagTime'][len(data) - 1]
    # x = data['TagTime'][len(data) - 1]
    # Day of month and hour of day, sliced from 'YYYY-mm-dd HH:MM:SS'.
    data['context_day'] = data['TagTime'].apply(lambda x: int(x[8:10]))
    data['context_hour'] = data['TagTime'].apply(lambda x: int(x[11:13]))
    # Number of rows per user per day.
    query_day = data.groupby(['user_id', 'context_day']).size(
        ).reset_index().rename(columns={0: 'user_query_day'})
    data = pd.merge(data, query_day, 'left', on=['user_id', 'context_day'])
    # Number of rows per user per hour of day (summed over all days).
    query_hour = data.groupby(['user_id', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_hour'})
    data = pd.merge(data, query_hour, 'left', on=['user_id', 'context_hour'])
    # Number of rows per user within each (day, hour).
    query_day_hour = data.groupby(
        by=['user_id', 'context_day', 'context_hour']).size(
        ).reset_index().rename(columns={0: 'user_query_day_hour'})
    data = pd.merge(data, query_day_hour, 'left',
                    on=['user_id', 'context_day', 'context_hour'])
    return data


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting 'normalize=True'.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')


def data_baseline():
    filename = '../round1_ijcai_18_data/round1_ijcai_18_train_20180301.txt'
    data = read_bigcsv(filename, sep=' ')
    # data = pd.read_csv(filename, sep=' ')
    data.drop_duplicates(inplace=True)
    data.reset_index(drop=True, inplace=True)  # very important
    data = convert_timestamp(data)
    train = data.loc[data['context_day'] < 24]  # days 18, 19, 20, 21, 22, 23
    test = data.loc[data['context_day'] == 24]  # use day 24 for validation for now
    features = [
        'item_id', 'item_brand_id', 'item_city_id', 'item_price_level',
        'item_sales_level', 'item_collected_level', 'item_pv_level',
        'user_gender_id', 'user_age_level', 'user_occupation_id',
        'user_star_level', 'context_page_id', 'shop_id',
        'shop_review_num_level', 'shop_review_positive_rate',
        'shop_star_level', 'shop_score_service',
        'shop_score_delivery', 'shop_score_description',
        'user_query_day', 'user_query_day_hour', 'context_hour',
    ]
    x_train = train[features]
    x_test = test[features]
    y_train = train['is_trade']
    y_test = test['is_trade']
    return x_train, x_test, y_train, y_test
# x_train, x_test, y_train, y_test = data_baseline()


def model_baseline(x_train, y_train, x_test, y_test):
    cat_names = [
        'item_price_level',
        'item_sales_level',
        'item_collected_level',
        'item_pv_level',
        'user_gender_id',
        'user_age_level',
        'user_occupation_id',
        'user_star_level',
        'context_page_id',
        'shop_review_num_level',
        'shop_star_level',
    ]
    print("begin train...")
    kw_lgb = dict(num_leaves=63, max_depth=7, n_estimators=80, random_state=6)
    # Unweighted LightGBM baseline.
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train, categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    # Round the scores to two decimals before scoring.
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    # print(loss_val)  # 0.0848226750637
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('fig1')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1], title='Confusion matrix base1')
    # Same model, with sample weights set according to the labels.
    clf = lgb.LGBMClassifier(**kw_lgb)
    clf.fit(x_train, y_train,
            sample_weight=[1 if y == 1 else 0.02 for y in y_train],
            categorical_feature=cat_names)
    prob = clf.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    name = 'base_lgb_weighted'
    plt.figure('fig1')  # select the ROC figure
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    y_pred = clf.predict(x_test)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix base1 (weighted)')
    plt.figure('fig1')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline3(x_train, y_train, x_test, y_test):
    bagging = BaggingClassifier(random_state=0)
    balanced_bagging = BalancedBaggingClassifier(random_state=0)
    bagging.fit(x_train, y_train)
    balanced_bagging.fit(x_train, y_train)
    # Plain bagging.
    prob = bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    fig = plt.figure('Bagging')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    y_pred_bagging = bagging.predict(x_test)
    cm_bagging = confusion_matrix(y_test, y_pred_bagging)
    cm1 = plt.figure()
    plot_confusion_matrix(cm_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BaggingClassifier')
    # Balanced bagging.
    prob = balanced_bagging.predict_proba(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('Bagging')  # select the ROC figure
    name = 'base_Balanced_Bagging'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    y_pred_balanced_bagging = balanced_bagging.predict(x_test)
    cm_balanced_bagging = confusion_matrix(y_test, y_pred_balanced_bagging)
    cm2 = plt.figure()
    plot_confusion_matrix(cm_balanced_bagging,
                          classes=[0, 1],
                          title='Confusion matrix of BalancedBagging')
    plt.figure('Bagging')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


def model_baseline2(x_train, y_train, x_test, y_test):
    params = {
        'task': 'train',
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'num_class': 2,
        'verbose': 0,
        # 'multi_logloss' is the logloss metric for the multiclass objective.
        'metric': 'multi_logloss',
        'max_bin': 255,
        'max_depth': 7,
        'learning_rate': 0.3,
        'nthread': 4,
        'num_leaves': 63,
        'feature_fraction': 0.8,
    }
    # The original params set both 'n_estimators': 85 and
    # 'num_boost_round': 160, which are aliases of the same setting;
    # only the latter is kept and passed to lgb.train explicitly.
    num_boost_round = 160
    lgb_train = lgb.Dataset(x_train, label=y_train)
    lgb_eval = lgb.Dataset(x_test, label=y_test, reference=lgb_train)
    print("begin train...")
    bst = lgb.train(params, lgb_train, num_boost_round=num_boost_round,
                    valid_sets=[lgb_eval])
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    x_auc = auc(fpr, tpr)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    fig = plt.figure('weighted')
    ax = fig.add_subplot(1, 1, 1)
    name = 'base_lgb'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    cm1 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix base model')
    # Same model, with sample weights set according to the labels.
    lgb_train = lgb.Dataset(
        x_train, label=y_train,
        weight=[1 if y == 1 else 0.02 for y in y_train])
    lgb_eval = lgb.Dataset(
        x_test, label=y_test, reference=lgb_train,
        weight=[1 if y == 1 else 0.02 for y in y_test])
    bst = lgb.train(params, lgb_train, num_boost_round=num_boost_round,
                    valid_sets=[lgb_eval])
    prob = bst.predict(x_test)[:, 1]
    predict_score = [float('%.2f' % x) for x in prob]
    loss_val = log_loss(y_test, predict_score)
    y_pred = [1 if x > 0.5 else 0 for x in predict_score]
    fpr, tpr, thresholds = roc_curve(y_test, predict_score)
    mean_fpr = np.linspace(0, 1, 100)
    mean_tpr = interp(mean_fpr, fpr, tpr)
    x_auc = auc(fpr, tpr)
    plt.figure('weighted')  # select the ROC figure
    name = 'base_lgb_weighted'
    plt.plot(mean_fpr, mean_tpr, linestyle='--',
             label='{} (area = {:.2f}, logloss = {:.2f})'.format(
                 name, x_auc, loss_val), lw=2)
    cm2 = plt.figure()
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm, classes=[0, 1],
                          title='Confusion matrix base model (weighted)')
    plt.figure('weighted')  # select the ROC figure
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='Luck')
    # make nice plotting
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.get_xaxis().tick_bottom()
    ax.get_yaxis().tick_left()
    ax.spines['left'].set_position(('outward', 10))
    ax.spines['bottom'].set_position(('outward', 10))
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.legend(loc="lower right")
    plt.show()
    return cm1, cm2, fig


'''
1. logloss vs. AUC
The baseline logloss of 0.0819 is indeed very small, but the confusion
matrix shows y_pred is all 0: the model favors the majority class.
Adding weights seems to help slightly.

AUC is only 0.64~0.67.
Such a low AUC looks wrong at first. Why is it so low?
Because the labels are extremely imbalanced: 1s make up only about 2%
of the data (50:1).
AUC is a friendlier measure of classification performance on imbalanced
data; selecting features by AUC may give better results.
This is only a rough direction for improvement.

2. Handling imbalanced data:
    1. Resampling, over- or under-:
       over-sampling increases the number of minority samples,
       under-sampling decreases the number of majority samples.
    2. Reweight the loss function by giving a large loss to
       misclassifying the minority labels.
'''


if __name__ == "__main__":
    x_train, x_test, y_train, y_test = data_baseline()
    cm11, cm12, fig1 = model_baseline(x_train, y_train, x_test, y_test)
    cm21, cm22, fig2 = model_baseline2(x_train, y_train, x_test, y_test)
    cm31, cm32, fig3 = model_baseline3(x_train, y_train, x_test, y_test)

    fig1.savefig('./base_lgb_weighted.jpg', format='jpg')
    cm11.savefig('./Confusion matrix1.jpg', format='jpg')
    cm12.savefig('./Confusion matrix2.jpg', format='jpg')
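
The notes in the script contrast logloss with AUC. The toy example below, which is not part of the original script, shows why a small logloss alone says little on imbalanced labels: a model that outputs the base rate for every sample already gets a logloss under 0.1, yet its AUC is 0.5, no better than chance. The helpers `binary_log_loss` and `roc_auc` are hypothetical names written with the standard library only, and the labels are made up (2% positives).

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy with probabilities clipped away from 0/1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative.

    Ties count as one half. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels with 2% positives, and a model that knows nothing
# except the base rate.
y_true = [0] * 980 + [1] * 20
p_base_rate = [0.02] * len(y_true)

loss = binary_log_loss(y_true, p_base_rate)
area = roc_auc(y_true, p_base_rate)
print("logloss = %.4f, AUC = %.2f" % (loss, area))
```

Logloss rewards matching the overall positive rate, while AUC only rewards ranking positives above negatives, so the two metrics answer different questions about the same predictions.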

Posted 2018-04-09 09:23 by change_world
