AutoGluon Tutorial

1 Tabular Prediction

1.1 Loading data: the `TabularDataset` class

The TabularDataset class behaves like pandas.DataFrame: a TabularDataset shares all the attributes and methods of a pandas DataFrame.

Reading CSV data:

from autogluon.tabular import TabularDataset

path = 'train.csv'
train_data = TabularDataset(path)
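Because a TabularDataset behaves like a DataFrame, the usual inspection attributes should apply to it. The following is a minimal sketch, assuming autogluon is installed; the helper name `load_and_describe` is hypothetical and not part of AutoGluon.

```python
def load_and_describe(path):
    """Load a CSV file as a TabularDataset and report its shape and columns."""
    # Imported lazily so that defining this helper does not require autogluon.
    from autogluon.tabular import TabularDataset

    data = TabularDataset(path)
    # DataFrame-style attributes are available on the loaded dataset.
    return data.shape, list(data.columns)
```

Calling `load_and_describe('train.csv')` would return the (rows, columns) shape and the column names, which is a quick way to confirm that the label column is present.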

1.2 The predictor: the `TabularPredictor` class

1.2.1 Initialization

The autogluon.tabular.TabularPredictor() class

Main parameters:

label(str): Name of the column that contains the target variable to predict.
problem_type(str), default = None
- {'binary', 'multiclass', 'regression', 'quantile'}
eval_metric: AutoGluon tunes factors such as hyperparameters, early-stopping, ensemble-weights, etc. in order to improve this metric on validation data.
- None (default): the metric is chosen from the problem type:
  - 'accuracy' for binary and multiclass classification
  - 'root_mean_squared_error' for regression
  - 'pinball_loss' for quantile
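- classification problem: e.g. 'accuracy', 'balanced_accuracy', 'f1', 'roc_auc', 'log_loss'
- regression problem: e.g. 'root_mean_squared_error', 'mean_squared_error', 'mean_absolute_error', 'r2'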
path(str), default = None: directory where the models are saved
- None: AutogluonModels/ag-[TIMESTAMP]
verbosity(int), default = 2
- Higher levels correspond to more detailed print statements
- set verbosity = 0 to suppress warnings
sample_weight(str), default = None
- Name of the column that holds the sample weights; that column is not used as a feature
- Other accepted values: {'auto_weight', 'balance_weight'}
  - 'auto_weight': automatically choose a weighting strategy based on the data
  - 'balance_weight': equally weight classes in classification, no effect in regression
presets(list or str or dict), default = ['medium_quality'] (note: this is an argument of .fit(), not of the constructor)
- presets=['good_quality', 'optimize_for_deployment']: to get good quality with minimal disk usage
- presets='best_quality': to get the most accurate overall predictor (regardless of its efficiency)
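The parameters above can be collected into keyword arguments before the predictor is created. This is a sketch only: the helper names `make_predictor_kwargs` and `build_predictor` and the label column name 'target' are hypothetical, and `build_predictor` assumes autogluon is installed.

```python
def make_predictor_kwargs(label, problem_type=None, eval_metric=None, path=None):
    """Collect TabularPredictor constructor arguments, dropping unset ones."""
    kwargs = {
        "label": label,                # column to predict
        "problem_type": problem_type,  # None lets AutoGluon infer it from the label
        "eval_metric": eval_metric,    # None picks the default for the problem type
        "path": path,                  # None saves under AutogluonModels/ag-[TIMESTAMP]
    }
    return {key: value for key, value in kwargs.items() if value is not None}


def build_predictor(**kwargs):
    """Create the predictor; imported lazily so the sketch loads without autogluon."""
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(**kwargs)


# Hypothetical label column 'target' for a binary problem scored by ROC AUC.
kwargs = make_predictor_kwargs("target", problem_type="binary", eval_metric="roc_auc")
```

With autogluon available, `build_predictor(**kwargs)` would then return an untrained predictor ready for `.fit()`.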

1.3 Model training

.fit(train_data, tuning_data=None,...)

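train_data: the training set (TabularDataset, pd.DataFrame, or a file path)
tuning_data: validation set used for tuning (e.g. early stopping and hyperparameter tuning).
- When tuning_data = None, part of train_data is randomly held out as the validation set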
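To make the automatic split concrete, here is a plain-Python sketch of a random holdout. It only illustrates the idea; it is not AutoGluon's actual splitting logic, and the function name `random_holdout` and the fraction 0.2 are assumptions chosen for the example.

```python
import random


def random_holdout(n_rows, holdout_frac=0.2, seed=0):
    """Return (train_indices, validation_indices) from a random shuffle."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Hold out at least one row so the validation set is never empty.
    n_val = max(1, int(round(n_rows * holdout_frac)))
    return sorted(indices[n_val:]), sorted(indices[:n_val])


train_idx, val_idx = random_holdout(10, holdout_frac=0.2)
```

Passing an explicit tuning_data to `.fit()` replaces this kind of internal split with a validation set you control.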

1.3.1 Advanced `.fit()` parameters

(1) `hyperparameters`

hyperparameters(str or dict), default = 'default'

Sets the hyperparameters of the individual models.

Example: setting model hyperparameters

# Default configuration:
hyperparameters = {
  'NN_TORCH': {}, 
  'GBM': [
    {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, 
    {},
    'GBMLarge',  # built-in configuration
  ], 
  'CAT': {},  
  'XGB': {},  
  'FASTAI': {}, 
  'RF': [
    {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
  ], 
  'XT': [
    {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, 
    {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
  ], 
  'KNN': [
    {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
    {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
  ],
}

Using each model only once, for multiclass classification:

hyperparameters = {
  'KNN': {'weights': 'distance'},
  'RF':  {'criterion': 'entropy'},
  'XT':  {'criterion': 'entropy'},
  'GBM': {},
  'XGB': {},
  'CAT': {},
  'NN_TORCH': {} }
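In the default configuration, each entry's `ag_args` can restrict it to certain problem types. The sketch below shows one way to read that restriction; it is an illustration in plain Python, not AutoGluon's internal selection code, and the helper name `configs_for` is hypothetical.

```python
def configs_for(problem_type, configs):
    """Keep the configs whose ag_args allow the given problem type."""
    selected = []
    for config in configs:
        allowed = config.get("ag_args", {}).get("problem_types")
        # A config with no 'problem_types' entry applies to every problem type.
        if allowed is None or problem_type in allowed:
            selected.append(config)
    return selected


# The three 'RF' entries from the default configuration above.
rf_configs = [
    {"criterion": "gini", "ag_args": {"name_suffix": "Gini", "problem_types": ["binary", "multiclass"]}},
    {"criterion": "entropy", "ag_args": {"name_suffix": "Entr", "problem_types": ["binary", "multiclass"]}},
    {"criterion": "squared_error", "ag_args": {"name_suffix": "MSE", "problem_types": ["regression"]}},
]

regression_names = [c["ag_args"]["name_suffix"] for c in configs_for("regression", rf_configs)]
binary_names = [c["ag_args"]["name_suffix"] for c in configs_for("binary", rf_configs)]
```

Read this way, a regression task uses only the 'MSE' forest, while a binary task uses the 'Gini' and 'Entr' forests.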

(2) Bagging and stacking parameters

auto_stack(bool, default = False):

auto_stack=True can improve predictive accuracy, but it substantially increases training time
Setting the num_bag_folds and num_stack_levels arguments overrides auto_stack

num_bag_folds(int, default = None)

Number of k folds used for bagging of models
Recommended values: 5-10

num_bag_sets(int, default = None)

Number of repeats of k-fold bagging to perform (values must be >= 1)
Total number of models trained during bagging = num_bag_folds * num_bag_sets
Defaults to 1 if time_limit is not specified, otherwise 20 (always disabled if num_bag_folds is not specified)

num_stack_levels(int, default = None)

Number of stacking levels to use in stack ensemble
Set num_stack_levels = 0 to disable stack ensembling
Recommended values: 1-3

holdout_frac(float, default = None):

Fraction of train_data to holdout as tuning data for optimizing hyperparameters

use_bag_holdout(bool, default = False)

If use_bag_holdout = True, a holdout_frac portion of the data is held-out from model bagging
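The arithmetic behind the bagging parameters can be checked by hand. The sketch below splits row indices into k folds and applies the formula num_bag_folds * num_bag_sets quoted above; the fold layout is a simplification for illustration, not AutoGluon's own fold assignment, and the helper names are hypothetical.

```python
def kfold_indices(n_rows, k):
    """Split range(n_rows) into k contiguous validation folds."""
    # The first (n_rows % k) folds get one extra row so all rows are covered.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def n_bagged_models(num_bag_folds, num_bag_sets=1):
    """Models trained during bagging = num_bag_folds * num_bag_sets."""
    return num_bag_folds * num_bag_sets


folds = kfold_indices(10, 5)
# Keyword arguments that could be passed on to .fit(); the values are examples.
fit_kwargs = {"num_bag_folds": 5, "num_bag_sets": 2, "num_stack_levels": 1}
total = n_bagged_models(fit_kwargs["num_bag_folds"], fit_kwargs["num_bag_sets"])
```

With 5 folds repeated 2 times, bagging trains 10 fold models, which is why larger settings raise training time quickly.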

1.4 Model summary

(1) Summaries: `.fit_summary()` and `.leaderboard()`

.fit_summary(verbosity=3, show_plot=False)

.leaderboard(data=None, extra_info=False, extra_metrics=None, only_pareto_frontier=False, silent=False)

(2) Plotting the model structure: `.plot_ensemble_model()`

.plot_ensemble_model()

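This requires the Graphviz software, plus the graphviz and pygraphviz libraries in the Python environment.
- On Windows, see site for the steps to install pygraphviz

In the current AutoGluon version (v0.5.2), the .plot_ensemble_model() method runs correctly on Linux but raises an error on Windows, because of forward slashes versus backslashes in the path attribute of each node; see issue #1065. Workaround: rewrite the function: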

import os

import networkx as nx


def plot_ensemble_model(predictor, prune_unused_nodes=False):
    # Work on a copy so the predictor's own model graph is left untouched.
    G = predictor._trainer.model_graph.copy()
    if prune_unused_nodes:
        nodes_without_outedge = [node for node, degree in dict(G.degree()).items() if degree < 1]
    else:
        nodes_without_outedge = []
    # Models without a validation score cannot be labelled, so drop them.
    nodes_no_val_score = [node for node in G if G.nodes[node]['val_score'] is None]
    G.remove_nodes_from(nodes_without_outedge)
    G.remove_nodes_from(nodes_no_val_score)
    best_model_node = predictor.get_model_best()
    A = nx.nx_agraph.to_agraph(G)
    A.graph_attr.update(rankdir='BT')
    A.node_attr.update(fontsize=10)
    A.node_attr.update(shape='rectangle')
    for node in A.iternodes():
        node.attr['label'] = f"{node.name}\nVal score: {float(node.attr['val_score']):.4f}"

        # Replace backslashes in the node's path so Graphviz accepts it on Windows.
        node.attr['path'] = node.attr['path'].replace('\\', '/')

        if node.name == best_model_node:
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ff9900'
            node.attr['shape'] = 'box3d'
        elif nx.has_path(G, node.name, best_model_node):
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ffcc00'
    # Save the rendered graph next to the predictor's other artifacts.
    model_image_fname = os.path.join(predictor.path, 'ensemble_model.png')
    A.draw(model_image_fname, format='png', prog='dot')
    return G
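The key step in the workaround is the backslash substitution applied to each node's path. Shown in isolation, it looks like the sketch below; the sample path 'models\KNeighborsUnif\' is a made-up value for illustration, and the helper name `to_forward_slashes` is hypothetical.

```python
def to_forward_slashes(path):
    """Replace Windows-style backslashes with forward slashes."""
    return path.replace("\\", "/")


# A hypothetical Windows-style model path, written with escaped backslashes.
windows_path = "models\\KNeighborsUnif\\"
fixed_path = to_forward_slashes(windows_path)
```

Paths that already use forward slashes pass through unchanged, so the same function should be safe to run on Linux as well.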

1.5 Prediction

(1) `.predict()` and `.predict_proba()`

.predict(data, model=None, as_pandas=True, transform_features=True)

.predict_proba(data, model=None, as_pandas=True, as_multiclass=True, transform_features=True)

Main parameters:

data(str or TabularDataset or pd.DataFrame)
- str: file path
model(str): the model to use for prediction
- Valid models are listed in this predictor by calling predictor.get_model_names()
as_pandas(bool), default = True
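- as_pandas=True returns a pd.Series (a pd.DataFrame for multi-column predict_proba output); as_pandas=False returns an np.ndarray
as_multiclass(bool), default = True
- For predict_proba: if True, one probability column is returned per class; if False, binary problems return only the positive-class probability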
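The model argument makes it possible to compare the predictions of individual models. A minimal sketch, assuming `predictor` is a fitted TabularPredictor from the AutoGluon version discussed here (where `get_model_names()` is available); the helper name `predict_with_each_model` is hypothetical.

```python
def predict_with_each_model(predictor, data):
    """Return {model_name: predictions} for every model in the predictor."""
    results = {}
    for name in predictor.get_model_names():
        # model=name selects one specific model instead of the default best one.
        results[name] = predictor.predict(data, model=name)
    return results
```

Comparing the entries of the returned dictionary shows where the individual models disagree with each other.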

1.6 Evaluation

(1) `.evaluate()` and `.evaluate_predictions()`

.evaluate(data, model=None, silent=False, auxiliary_metrics=True, detailed_report=False)

.evaluate_predictions(y_true, y_pred, sample_weight=None, ...)

Main parameters:

silent(bool), default = False
- If False, performance results are printed.
auxiliary_metrics(bool), default = True
- Should we compute other (problem_type specific) metrics in addition to the default metric
detailed_report(bool), default = False
- Should we compute more detailed versions of the auxiliary_metrics? (requires auxiliary_metrics = True)
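For reference, the three default metrics named in section 1.2.1 can be computed by hand. These are textbook definitions written as a sketch; AutoGluon's own implementations may differ in details such as averaging, and it reports error metrics in a higher-is-better form, so the sign of a value like RMSE may be flipped in its output.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for a single quantile level."""
    losses = [
        max(quantile * (t - p), (quantile - 1) * (t - p))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(losses) / len(losses)
```

The pinball loss is asymmetric: at quantile 0.9, predicting 2 units too low costs 1.8, while predicting 2 units too high costs only 0.2.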

1.7 Miscellaneous

.load(save_path): load a saved predictor

Example:

# No need to construct the predictor first
predictor = TabularPredictor.load(model_path)

.get_model_names(): get the names of all models

.delete_models(models_to_keep=None, models_to_delete=None, allow_delete_cascade=False, delete_from_disk=True, dry_run=True): delete models
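Since delete_models() defaults to dry_run=True, a deletion can be previewed before anything is removed. A minimal sketch, assuming `predictor` is a fitted or loaded TabularPredictor; the helper name `preview_pruning` is hypothetical.

```python
def preview_pruning(predictor):
    """List the current models, then simulate keeping only the best one."""
    names = predictor.get_model_names()
    # dry_run=True only reports what would be deleted; nothing is removed.
    predictor.delete_models(models_to_keep="best", dry_run=True)
    return names
```

After checking the preview, repeating the delete_models() call with dry_run=False would perform the actual deletion.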

References

AutoGluon Tutorial, Predicting Columns in a Table - Quick Start, site

AutoGluon Tutorial, Predicting Columns in a Table - In Depth, site

AutoGluon API Tutorial, AutoGluon Predictors, site

Erickson et al., AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data, arXiv, site

Posted 2022-09-09 12:09 by veager