AutoGluon Tutorial

1 Tabular Prediction

1.1 Loading Data: TabularDataset

The TabularDataset class behaves like pandas.DataFrame: a TabularDataset shares all the attributes and methods of a pandas DataFrame.

  • Read CSV data
from autogluon.tabular import TabularDataset

path = 'train.csv'
train_data = TabularDataset(path)

1.2 The Predictor: TabularPredictor

1.2.1 Initialization

autogluon.tabular.TabularPredictor()

Main parameters:

  • label(str): Name of the column that contains the target variable to predict.

  • problem_type(str), default = None

    • {'binary', 'multiclass', 'regression', 'quantile'}
  • eval_metric: AutoGluon tunes factors such as hyperparameters, early-stopping, ensemble-weights, etc. in order to improve this metric on validation data.

    • None:

      • 'accuracy' for binary and multiclass classification

      • 'root_mean_squared_error' for regression

      • 'pinball_loss' for quantile

    • classification problem:

    • regression problem:

  • path(str), default = None: directory where the models are saved

    • None: models are saved under AutogluonModels/ag-[TIMESTAMP]
  • verbosity(int), default = 2

    • Higher levels correspond to more detailed print statements

    • set verbosity = 0 to suppress warnings

  • sample_weight(str), default = None

    • Name of the column that holds the sample weights; that column is not used as a feature

    • Special values: {'auto_weight', 'balance_weight'}

      • 'auto_weight': automatically choose a weighting strategy based on the data

      • 'balance_weight': equally weight classes in classification, no effect in regression

  • presets(list or str or dict), default = ['medium_quality']: preset configurations; note that presets is an argument of .fit(), not of the constructor

    • presets=['good_quality', 'optimize_for_deployment']: to get good quality with minimal disk usage

    • presets='best_quality': to get the most accurate overall predictor (regardless of its efficiency)
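
A minimal sketch of how the constructor arguments above might be assembled. The label column name 'class', the metric choice, and the output directory 'my_models' are hypothetical, and build_predictor_kwargs and make_predictor are illustrative helpers, not part of AutoGluon. Arguments left at None are dropped so that AutoGluon's own defaults apply.

```python
def build_predictor_kwargs(label, problem_type=None, eval_metric=None,
                           path=None, verbosity=2, sample_weight=None):
    """Collect TabularPredictor constructor arguments, omitting unset ones."""
    kwargs = {
        'label': label,
        'problem_type': problem_type,
        'eval_metric': eval_metric,
        'path': path,
        'verbosity': verbosity,
        'sample_weight': sample_weight,
    }
    return {name: value for name, value in kwargs.items() if value is not None}


def make_predictor(**kwargs):
    """Create the predictor; AutoGluon is imported lazily inside this function."""
    from autogluon.tabular import TabularPredictor
    return TabularPredictor(**kwargs)


# Hypothetical settings for a multiclass task scored by accuracy.
kwargs = build_predictor_kwargs(label='class', problem_type='multiclass',
                                eval_metric='accuracy', path='my_models')
print(kwargs)
# predictor = make_predictor(**kwargs)  # requires autogluon to be installed
```

Keeping the arguments in a plain dict makes it easy to log or reuse the exact configuration a predictor was created with.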

1.3 Model Training

.fit(train_data, tuning_data=None,...)

  • train_data

  • tuning_data: validation set used for tuning (e.g. early stopping and hyperparameter tuning).

    • tuning_data = None: AutoGluon automatically holds out a random subset of train_data as the validation set
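
To make the automatic split concrete, here is a rough sketch of a random holdout split in plain Python. It only illustrates the idea and is not AutoGluon's internal procedure; the function name random_holdout, the 0.2 fraction, and the seed are assumptions made for the example.

```python
import random


def random_holdout(rows, holdout_frac=0.2, seed=0):
    """Randomly split rows into (train, validation) lists."""
    if not 0.0 < holdout_frac < 1.0:
        raise ValueError('holdout_frac must be strictly between 0 and 1')
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Keep at least one row for validation.
    n_val = max(1, int(round(len(rows) * holdout_frac)))
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    val = [row for i, row in enumerate(rows) if i in val_idx]
    return train, val


rows = list(range(10))
train_rows, val_rows = random_holdout(rows, holdout_frac=0.2)
print(len(train_rows), len(val_rows))
```

When a separate validation set is available, passing it as tuning_data takes the place of this automatic split.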

1.3.1 Advanced .fit() Parameters

(1) hyperparameters

hyperparameters(str or dict), default = 'default'

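  • Selects which model types to train and sets the hyperparameters of each

Example: setting model hyperparameters

# Default configuration: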
hyperparameters = {
  'NN_TORCH': {}, 
  'GBM': [
    {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, 
    {},
    'GBMLarge',  # built-in configuration
  ], 
  'CAT': {},  
  'XGB': {},  
  'FASTAI': {}, 
  'RF': [
    {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
  ], 
  'XT': [
    {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
  ], 
  'KNN': [
    {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
    {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
  ],
}

To train each model type only once (for a multiclass problem):

hyperparameters = {
  'KNN': {'weights': 'distance'},
  'RF':  {'criterion': 'entropy'},
  'XT':  {'criterion': 'entropy'},
  'GBM': {},
  'XGB': {},
  'CAT': {},
  'NN_TORCH': {},
}
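
A hyperparameters value may be a single dict, or a list whose entries are dicts or names of built-in configurations, and an entry can be restricted to certain problem types through ag_args. The sketch below counts how many configurations per model key apply to a given problem type. count_configs is a hypothetical helper written for this note, and the dict is a shortened version of the default configuration shown earlier.

```python
def count_configs(hyperparameters, problem_type):
    """Count the configurations per model key that apply to problem_type."""
    counts = {}
    for model_key, spec in hyperparameters.items():
        # A bare dict (or string) is treated as a one-element list.
        entries = spec if isinstance(spec, list) else [spec]
        applicable = 0
        for entry in entries:
            allowed = None
            if isinstance(entry, dict):
                allowed = entry.get('ag_args', {}).get('problem_types')
            # No restriction means the entry applies to every problem type.
            if allowed is None or problem_type in allowed:
                applicable += 1
        counts[model_key] = applicable
    return counts


hyperparameters = {
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'RF': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'KNN': [
        {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
        {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
    ],
}

multiclass_counts = count_configs(hyperparameters, 'multiclass')
regression_counts = count_configs(hyperparameters, 'regression')
print(multiclass_counts)
print(regression_counts)
```

Replacing a list with a single dict, as in the multiclass example above, reduces that key's count to one.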

(2) Bagging and Stacking Parameters

auto_stack(bool, default = False):

  • auto_stack=True can improve predictive accuracy, at the cost of much longer training time

  • Setting num_bag_folds and num_stack_levels arguments will override auto_stack

num_bag_folds(int, default = None)

  • Number of k folds used for bagging of models

  • recommended values: 5-10

num_bag_sets(int, default = None)

  • Number of repeats of k-fold bagging to perform (values must be >= 1)

  • Total number of models trained during bagging = num_bag_folds * num_bag_sets

  • Defaults to 1 if time_limit is not specified, otherwise 20 (always disabled if num_bag_folds is not specified)

num_stack_levels(int, default = None)

  • Number of stacking levels to use in stack ensemble

  • set num_stack_levels = 0 to disable stack ensembling

  • recommended values: 1-3

holdout_frac(float, default = None):

  • Fraction of train_data to hold out as tuning data for optimizing hyperparameters

use_bag_holdout(bool, default = False)

  • If use_bag_holdout = True, a holdout_frac portion of the data is held-out from model bagging
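
The model count given for num_bag_sets (num_bag_folds * num_bag_sets) can be checked with a small sketch of repeated k-fold splitting. This is plain Python for illustration, not AutoGluon's internal bagging code; repeated_kfold and its seed argument are hypothetical.

```python
import random


def repeated_kfold(n_rows, num_bag_folds, num_bag_sets=1, seed=0):
    """Return a list of (train_indices, val_indices), one pair per trained model."""
    rng = random.Random(seed)
    splits = []
    for _ in range(num_bag_sets):
        # Each repeat reshuffles the rows before cutting them into folds.
        indices = list(range(n_rows))
        rng.shuffle(indices)
        folds = [indices[i::num_bag_folds] for i in range(num_bag_folds)]
        for k, val_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            splits.append((train_idx, val_idx))
    return splits


splits = repeated_kfold(n_rows=10, num_bag_folds=5, num_bag_sets=2)
print(len(splits))  # one entry per model fitted during bagging
```

Within one repeat every row appears in exactly one validation fold, so each row receives an out-of-fold prediction from a model that did not train on it.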

1.4 Model Summary

(1) Summaries: .fit_summary() and .leaderboard()

.fit_summary(verbosity=3, show_plot=False)

.leaderboard(data=None, extra_info=False, extra_metrics=None, only_pareto_frontier=False, silent=False)

(2) Plotting the ensemble structure: .plot_ensemble_model()

.plot_ensemble_model()

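  • Requires the Graphviz software, plus the graphviz and pygraphviz packages in the Python environment

    • For the steps to install pygraphviz on Windows, see site

In the current AutoGluon version (v0.5.2), the .plot_ensemble_model() method runs correctly on Linux but raises an error on Windows, because of a forward-slash versus backslash problem in each node's path attribute; see issue #1065. Workaround: rewrite the function: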

import os

import networkx as nx  # nx_agraph.to_agraph also requires pygraphviz


def plot_ensemble_model(predictor, prune_unused_nodes=False):
    # Work on a copy so the trainer's own model graph is left untouched
    G = predictor._trainer.model_graph.copy()
    if prune_unused_nodes:
        nodes_without_outedge = [node for node, degree in dict(G.degree()).items() if degree < 1]
    else:
        nodes_without_outedge = []
    # Models without a validation score cannot be labelled, so drop them
    nodes_no_val_score = [node for node in G if G.nodes[node]['val_score'] is None]
    G.remove_nodes_from(nodes_without_outedge)
    G.remove_nodes_from(nodes_no_val_score)
    best_model_node = predictor.get_model_best()
    A = nx.nx_agraph.to_agraph(G)
    A.graph_attr.update(rankdir='BT')
    A.node_attr.update(fontsize=10)
    A.node_attr.update(shape='rectangle')
    for node in A.iternodes():
        node.attr['label'] = f"{node.name}\nVal score: {float(node.attr['val_score']):.4f}"

        # Replace backslashes in the path attribute with forward slashes (the Windows fix)
        node.attr['path'] = node.attr['path'].replace('\\', '/')

        if node.name == best_model_node:
            # Highlight the best model
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ff9900'
            node.attr['shape'] = 'box3d'
        elif nx.has_path(G, node.name, best_model_node):
            # Highlight the models that feed into the best model
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ffcc00'
    model_image_fname = os.path.join(predictor.path, 'ensemble_model.png')
    A.draw(model_image_fname, format='png', prog='dot')
    return G
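
The slash conversion itself can be tried without AutoGluon. In this small sketch the model path is hypothetical; it compares the plain string replacement with pathlib's PureWindowsPath, which interprets Windows-style paths on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical model path as it might be stored on Windows.
windows_path = 'AutogluonModels\\ag-demo\\models\\LightGBM'

# Plain string replacement of backslashes with forward slashes.
replaced = windows_path.replace('\\', '/')

# pathlib parses the Windows separators and emits forward slashes.
posix_style = PureWindowsPath(windows_path).as_posix()

print(replaced)
print(posix_style)
```

Both lines print the same forward-slash path, which contains no backslashes.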

1.5 Prediction

(1) .predict() and .predict_proba()

.predict(data, model=None, as_pandas=True, transform_features=True)

.predict_proba(data, model=None, as_pandas=True, as_multiclass=True, transform_features=True)

Main parameters:

  • data(str or TabularDataset or pd.DataFrame)

    • str: file path
  • model(str): name of the model to predict with; None uses the model with the best validation score

    • Valid model names are listed by calling predictor.get_model_names()
  • as_pandas(bool), default = True

    • as_pandas=True: return a pandas object (pd.Series for .predict()); as_pandas=False: return an np.ndarray
  • as_multiclass(bool), default = True
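
A small sketch of how the model argument might be validated before predicting. predict_with_checked_model is a hypothetical wrapper, not an AutoGluon function; it relies only on the predictor's get_model_names() and predict() methods described above.

```python
def predict_with_checked_model(predictor, data, model=None):
    """Predict with a named model after confirming that the name exists."""
    if model is not None:
        valid_names = predictor.get_model_names()
        if model not in valid_names:
            raise ValueError(f"Unknown model {model!r}; valid models: {valid_names}")
    # model=None is passed through unchanged so the predictor picks its default model.
    return predictor.predict(data, model=model)


# Usage with a trained predictor (requires autogluon; file and model names are hypothetical):
# y_pred = predict_with_checked_model(predictor, 'test.csv', model='LightGBM')
```

Checking the name first gives a clearer error message than a failure deep inside the prediction call.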

1.6 Evaluation

(1) .evaluate() and .evaluate_predictions()

.evaluate(data, model=None, silent=False, auxiliary_metrics=True, detailed_report=False)

.evaluate_predictions(y_true, y_pred, sample_weight=None, ...)

Main parameters:

  • silent(bool), default = False

    • If False, performance results are printed.
  • auxiliary_metrics(bool), default = True

    • Should we compute other (problem_type specific) metrics in addition to the default metric
  • detailed_report(bool), default = False

    • Whether to compute more detailed versions of the auxiliary_metrics (requires auxiliary_metrics = True)
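
To compare models on the same data, .evaluate() can be called once per model name. The helper below is a hypothetical sketch built only on get_model_names() and the evaluate() arguments listed above; it assumes evaluate() returns its metrics as a value that can be stored per model.

```python
def evaluate_all_models(predictor, data):
    """Return {model_name: evaluate() result} for every model in the predictor."""
    results = {}
    for name in predictor.get_model_names():
        # silent=True suppresses the per-model printout.
        results[name] = predictor.evaluate(data, model=name, silent=True)
    return results


# Usage with a trained predictor (requires autogluon; test_data is hypothetical):
# scores = evaluate_all_models(predictor, test_data)
```

Calling .leaderboard(data) gives a similar per-model comparison in a single call.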

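1.7 Miscellaneous

.load(save_path): load a saved predictor

Code example

from autogluon.tabular import TabularPredictor

# load() is called on the class itself, so no prior initialization is needed;
# model_path is the directory the predictor was saved to
predictor = TabularPredictor.load(model_path)

.get_model_names(): return the names of all models in the predictor

.delete_models(models_to_keep=None, models_to_delete=None, allow_delete_cascade=False, delete_from_disk=True, dry_run=True): delete models. With the default dry_run=True nothing is deleted; the call only reports what would be removed.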
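
Because deletion is irreversible, a two-step pattern is a reasonable habit: preview with dry_run=True, then repeat with dry_run=False. prune_to_best is a hypothetical helper for this note; it assumes models_to_keep='best' is accepted as shorthand for the best model.

```python
def prune_to_best(predictor, confirm=False):
    """Preview (default) or perform deletion of everything except the best model."""
    # dry_run stays True unless the caller explicitly confirms the deletion.
    return predictor.delete_models(models_to_keep='best', dry_run=not confirm)


# Usage with a trained predictor (requires autogluon):
# prune_to_best(predictor)                # preview only
# prune_to_best(predictor, confirm=True)  # actually delete
```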


posted @ 2022-09-09 12:09  veager