Handling Class Imbalance

  1. resampling
    • over-sampling
      • random over-sampling
      • generate synthetic examples: SMOTE (Synthetic Minority Over-sampling Technique), a nearest-neighbors approach
    • under-sampling
      • random under-sampling
      • Tomek links
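
The two less obvious resampling ideas above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not a reference implementation: the function names `smote_sketch` and `tomek_links`, the default `k=2`, and the use of Euclidean distance are assumptions made for the example. The first function interpolates between a minority point and one of its nearest minority neighbors; the second finds pairs of mutual nearest neighbors that carry different labels.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are each other's nearest
    neighbor and have different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]
```

When Tomek links are used for under-sampling, a common choice is to drop the majority-class member of each returned pair, which cleans up the class boundary rather than balancing the counts.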

         

  2. model-level methods
    • use a class-balanced loss (see 类别不平衡损失函数.pdf)
      • weighted cross-entropy
      • Focal Loss
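      • CB Loss (class-balanced loss): the derivation can be skipped. It simply multiplies the loss by a weighting factor (1−β)/(1−β^{n_i}), also called the class-balanced term, where the hyperparameter β∈(0,1) and n_i is the number of samples in class i. The effect is to discount the marginal benefit that classes with many samples gain in the loss merely from having more samples, as shown in the figure.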

         


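        The class-balanced term (1−β)/(1−β^{n_y}), where n_y is the number of samples in the ground-truth class y, does not depend on the model or the loss. In a sense it is independent of the loss function L and of the predicted class probabilities p, so it can be applied to a variety of loss functions.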
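
To make the weighting concrete, here is a small sketch in standard-library Python. The names `class_balanced_weights` and `focal_loss`, the default values of `beta` and `gamma`, and the per-sample formulation are assumptions chosen for illustration. The first function computes the term (1−β)/(1−β^{n_i}) for each class; the second shows how such a weight plugs into a focal loss for a single sample, where `gamma=0` reduces it to weighted cross-entropy.

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Class-balanced term (1 - beta) / (1 - beta**n_i) for each class count n_i."""
    return [(1 - beta) / (1 - beta ** n) for n in counts]

def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Loss for one sample, given the predicted probability of its true class:
    -weight * (1 - p)**gamma * log(p).
    With gamma = 0 this is weighted cross-entropy."""
    return -weight * (1 - p_true) ** gamma * math.log(p_true)

# Usage: a class with 1000 samples versus a class with 10 samples
weights = class_balanced_weights([1000, 10], beta=0.99)
loss_rare = focal_loss(0.3, gamma=2.0, weight=weights[1])
```

As β approaches 0 every class gets a weight close to 1 (no re-weighting), and as β approaches 1 the weights approach inverse class frequency, so β interpolates between the two.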

         

    • select appropriate algorithms
      • tree-based models
      • Logistic regression: adjust the probability threshold 
    • combine multiple algorithms
      • under-sampling + ensemble 

         

      • under-sampling + class-balanced loss
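
One way to read "under-sampling + ensemble" together with the threshold adjustment mentioned above is sketched below in standard-library Python. The helper names `balanced_subsets` and `ensemble_predict` are hypothetical, and the sketch assumes binary labels with the minority class labeled 1; the model that would be trained on each subset is left out and represented only by its predicted probabilities.

```python
import random

def balanced_subsets(labels, n_subsets, minority=1, seed=0):
    """Build index lists that keep every minority sample and add an equally
    sized random draw (without replacement) from the majority class.
    Assumes the minority class really is the smaller one."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return [minority_idx + rng.sample(majority_idx, len(minority_idx))
            for _ in range(n_subsets)]

def ensemble_predict(prob_lists, threshold=0.5):
    """Average the positive-class probabilities of several models
    (one list per model), then apply a decision threshold."""
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    return [int(p >= threshold) for p in averaged]
```

Each subset would train one model, so the ensemble as a whole still sees most of the majority data even though every individual model trains on a balanced set. Lowering `threshold` below 0.5 trades precision for recall on the minority class.
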
  3. evaluation metrics
    • Precision, recall, F1
    • Precision-Recall curve
    • AUC of the ROC curve
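
The metrics listed above can be computed directly from labels and scores. The following standard-library Python sketch is for illustration only: the function names are assumptions, labels are assumed to be binary with 1 as the positive (minority) class, and the AUC is computed through its ranking interpretation rather than by integrating the curve.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def pr_curve(y_true, scores):
    """(threshold, precision, recall) for every distinct score used as a
    threshold, from the highest threshold to the lowest."""
    points = []
    for thr in sorted(set(scores), reverse=True):
        y_pred = [int(s >= thr) for s in scores]
        precision, recall, _ = precision_recall_f1(y_true, y_pred)
        points.append((thr, precision, recall))
    return points

def roc_auc(y_true, scores):
    """AUC of the ROC curve as the probability that a random positive is
    scored above a random negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Precision and recall ignore true negatives, which is why they and the Precision-Recall curve tend to be more informative than accuracy when negatives vastly outnumber positives.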
posted @ 2024-12-13 11:02  singyoutosleep