AutoGluon Tutorial
1 Tabular Prediction
1.1 Loading data: the TabularDataset class
The dataset class TabularDataset behaves like pandas.DataFrame: "TabularDataset shares all the same attributes and methods of a pandas Dataframe."
- Read CSV data
from autogluon.tabular import TabularDataset
path = 'train.csv'
train_data = TabularDataset(path)
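Because TabularDataset behaves like a DataFrame, the usual pandas inspection calls apply right after loading. A minimal sketch, assuming AutoGluon is installed; the helper names write_demo_csv and load_and_inspect, the temporary file, and the column names are made up for illustration and are not part of AutoGluon:

```python
import csv
import os
import tempfile


def write_demo_csv(path):
    """Write a tiny CSV file (hypothetical columns) so there is something to load."""
    rows = [
        {"age": 25, "income": 40000, "class": "no"},
        {"age": 47, "income": 82000, "class": "yes"},
        {"age": 33, "income": 56000, "class": "no"},
    ]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["age", "income", "class"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def load_and_inspect(path):
    """Load a CSV into a TabularDataset and use ordinary DataFrame methods on it."""
    # Imported lazily so write_demo_csv works even without AutoGluon installed.
    from autogluon.tabular import TabularDataset

    data = TabularDataset(path)
    print(data.head())               # first rows, exactly as with pandas
    print(data["class"].describe())  # summary of the (assumed) label column
    return data


demo_path = os.path.join(tempfile.mkdtemp(), "train.csv")
n_rows = write_demo_csv(demo_path)
# train_data = load_and_inspect(demo_path)  # uncomment when AutoGluon is available
```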
1.2 The predictor: the TabularPredictor class
1.2.1 Initialization
The autogluon.tabular.TabularPredictor() class
Main parameters:
- label (str): Name of the column that contains the target variable to predict.
- problem_type (str), default = None: one of {'binary', 'multiclass', 'regression', 'quantile'}. If None, AutoGluon infers the problem type from the label values.
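- eval_metric: AutoGluon tunes factors such as hyperparameters, early-stopping, ensemble-weights, etc. in order to improve this metric on validation data.
  - If None, the metric is chosen from the problem type:
    - 'accuracy' for binary and multiclass classification
    - 'root_mean_squared_error' for regression
    - 'pinball_loss' for quantile
  - Other options for classification problems include 'balanced_accuracy', 'f1', 'roc_auc' and 'log_loss'.
  - Other options for regression problems include 'mean_squared_error', 'mean_absolute_error' and 'r2'.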
- path (str), default = None: directory where the models are saved. If None, a timestamped folder AutogluonModels/ag-[TIMESTAMP] is used.
- verbosity (int), default = 2
  - Higher levels correspond to more detailed print statements.
  - Set verbosity = 0 to suppress warnings.
- sample_weight (str), default = None
  - Name of the column that holds the sample weights; that column is not used as a feature.
  - Other accepted values: {'auto_weight', 'balance_weight'}
    - 'auto_weight': automatically choose a weighting strategy based on the data
    - 'balance_weight': equally weight classes in classification, no effect in regression
- presets (list or str or dict), default = ['medium_quality']. Note that presets is passed to .fit() rather than to the constructor.
  - presets=['good_quality', 'optimize_for_deployment']: to get good quality with minimal disk usage
  - presets='best_quality': to get the most accurate overall predictor (regardless of its efficiency)
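The constructor arguments above can be put together as follows. This is only a sketch: predictor_kwargs and make_predictor are hypothetical helpers, the label column "class" is an assumption, and the metric table simply restates the defaults listed above rather than reading them from AutoGluon.

```python
# Default metric per problem type when eval_metric is None (restating the list above).
DEFAULT_EVAL_METRIC = {
    "binary": "accuracy",
    "multiclass": "accuracy",
    "regression": "root_mean_squared_error",
    "quantile": "pinball_loss",
}


def predictor_kwargs(label, problem_type=None, eval_metric=None,
                     path=None, verbosity=2, sample_weight=None):
    """Collect constructor arguments, dropping those left at None so AutoGluon's defaults apply."""
    kwargs = {
        "label": label,
        "problem_type": problem_type,
        "eval_metric": eval_metric,
        "path": path,
        "verbosity": verbosity,
        "sample_weight": sample_weight,
    }
    return {key: value for key, value in kwargs.items() if value is not None}


def make_predictor(**kwargs):
    """Build the predictor; AutoGluon is imported lazily, only when this is called."""
    from autogluon.tabular import TabularPredictor

    return TabularPredictor(**kwargs)


# Binary classification tuned for ROC AUC, with warnings suppressed.
binary_kwargs = predictor_kwargs("class", problem_type="binary",
                                 eval_metric="roc_auc", verbosity=0)
# predictor = make_predictor(**binary_kwargs)  # requires AutoGluon
```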
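1.3 Model training
.fit(train_data, tuning_data=None, ...)
- train_data: the training set.
- tuning_data: validation set used for tuning (e.g. early stopping and hyperparameter tuning).
  - When tuning_data = None, AutoGluon randomly holds out part of train_data as the validation set.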
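If you would rather control the validation split yourself, you can split the data and pass the held-out part as tuning_data. The sketch below assumes the data is a TabularDataset or DataFrame (so that .iloc works); split_indices and fit_with_tuning_data are hypothetical helpers, and the random split shown here is a plain illustration, not AutoGluon's own splitting logic.

```python
import random


def split_indices(n_rows, holdout_frac=0.2, seed=0):
    """Randomly split row positions 0..n_rows-1 into (train, tuning) lists."""
    if not 0 < holdout_frac < 1:
        raise ValueError("holdout_frac must be strictly between 0 and 1")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_tuning = max(1, int(round(n_rows * holdout_frac)))
    return sorted(indices[n_tuning:]), sorted(indices[:n_tuning])


def fit_with_tuning_data(data, label, holdout_frac=0.2):
    """Fit a predictor with an explicit validation set instead of the automatic one."""
    from autogluon.tabular import TabularPredictor  # lazy import

    train_idx, tuning_idx = split_indices(len(data), holdout_frac)
    predictor = TabularPredictor(label=label)
    predictor.fit(data.iloc[train_idx], tuning_data=data.iloc[tuning_idx])
    return predictor


train_idx, tuning_idx = split_indices(10, holdout_frac=0.2)
```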
1.3.1 Advanced .fit() parameters
(1) hyperparameters
hyperparameters (str or dict), default = 'default'
- Sets which models are trained and the parameters of each model.
Example code: setting model parameters
# Default hyperparameters:
hyperparameters = {
    'NN_TORCH': {},
    'GBM': [
        {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}},
        {},
        'GBMLarge',  # built-in configuration
    ],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'XT': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'KNN': [
        {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
        {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
    ],
}
Using each model type only once (here for multiclass classification):
hyperparameters = {
    'KNN': {'weights': 'distance'},
    'RF': {'criterion': 'entropy'},
    'XT': {'criterion': 'entropy'},
    'GBM': {},
    'XGB': {},
    'CAT': {},
    'NN_TORCH': {},
}
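A hyperparameters dict like the ones above is passed straight to .fit(). The sketch below narrows the "one of each" configuration to the tree-based models before training; restrict_models and fit_with_hyperparameters are hypothetical helpers, not AutoGluon functions.

```python
def restrict_models(hyperparameters, keep):
    """Return a copy of a hyperparameters dict containing only the model keys in `keep`."""
    missing = [key for key in keep if key not in hyperparameters]
    if missing:
        raise KeyError(f"unknown model keys: {missing}")
    return {key: hyperparameters[key] for key in keep}


one_of_each = {
    'KNN': {'weights': 'distance'},
    'RF': {'criterion': 'entropy'},
    'XT': {'criterion': 'entropy'},
    'GBM': {},
    'XGB': {},
    'CAT': {},
    'NN_TORCH': {},
}

# Train only random forest, extra trees and LightGBM.
tree_only = restrict_models(one_of_each, ['RF', 'XT', 'GBM'])


def fit_with_hyperparameters(train_data, label, hyperparameters, time_limit=None):
    """Fit a predictor restricted to the models named in `hyperparameters`."""
    from autogluon.tabular import TabularPredictor  # lazy import

    predictor = TabularPredictor(label=label)
    predictor.fit(train_data, hyperparameters=hyperparameters, time_limit=time_limit)
    return predictor
```

Restricting the model set this way is mostly useful for quick experiments, since fewer model families are trained.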
(2) Parameters related to bagging and stacking
auto_stack (bool, default = False):
- auto_stack=True can improve predictive accuracy, but training takes much longer.
- Setting the num_bag_folds and num_stack_levels arguments will override auto_stack.
num_bag_folds (int, default = None)
- Number of folds k used for bagging of models.
- Recommended values: 5-10.
num_bag_sets (int, default = None)
- Number of repeats of k-fold bagging to perform (values must be >= 1).
- Total number of models trained during bagging = num_bag_folds * num_bag_sets.
- Defaults to 1 if time_limit is not specified, otherwise 20 (always disabled if num_bag_folds is not specified).
num_stack_levels (int, default = None)
- Number of stacking levels to use in the stack ensemble.
- Set num_stack_levels = 0 to disable stack ensembling.
- Recommended values: 1-3.
holdout_frac (float, default = None):
- Fraction of train_data to hold out as tuning data for optimizing hyperparameters.
use_bag_holdout (bool, default = False)
- If use_bag_holdout = True, a holdout_frac portion of the data is held out from model bagging.
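The default rule for num_bag_sets and the model-count formula quoted above can be written out as a small calculation. This only restates the documented rule; it is not AutoGluon's internal code, and the function names are made up. The count is per base model type, and with a time_limit the value 20 should be read as an upper bound on the repeats.

```python
def resolve_num_bag_sets(num_bag_folds=None, num_bag_sets=None, time_limit=None):
    """Apply the documented default: 1 without a time limit, otherwise 20; None if bagging is off."""
    if num_bag_folds is None:
        return None  # bagging disabled, so num_bag_sets has no effect
    if num_bag_sets is None:
        return 1 if time_limit is None else 20
    if num_bag_sets < 1:
        raise ValueError("num_bag_sets must be >= 1")
    return num_bag_sets


def total_bagged_models(num_bag_folds, num_bag_sets):
    """Fold models trained per base model: num_bag_folds * num_bag_sets."""
    return num_bag_folds * num_bag_sets


# Arguments one might pass as predictor.fit(train_data, **bagging_kwargs).
bagging_kwargs = {"num_bag_folds": 5, "num_bag_sets": 2, "num_stack_levels": 1}
n_fold_models = total_bagged_models(bagging_kwargs["num_bag_folds"],
                                    bagging_kwargs["num_bag_sets"])
```

With 5 folds repeated twice, each base model is therefore trained 10 times, which is why bagging raises training time so noticeably.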
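1.4 Model summary
(1) Summaries: .fit_summary() and .leaderboard()
.fit_summary(verbosity=3, show_plot=False)
(2) Plotting the model structure: .plot_ensemble_model()
In the AutoGluon version current at the time of writing (v0.5.2), .plot_ensemble_model() runs fine on Linux but raises an error on Windows, because the path attribute of the graph nodes contains backslashes instead of forward slashes; see issue #1065. Workaround: rewrite the function.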
import os

import networkx as nx  # drawing also needs pygraphviz and Graphviz's `dot`


def plot_ensemble_model(predictor, prune_unused_nodes=False):
    # Work on a copy so the trainer's own model graph is left untouched.
    G = predictor._trainer.model_graph.copy()
    if prune_unused_nodes:
        nodes_without_outedge = [node for node, degree in dict(G.degree()).items() if degree < 1]
    else:
        nodes_without_outedge = []
    nodes_no_val_score = [node for node in G if G.nodes[node]['val_score'] is None]
    G.remove_nodes_from(nodes_without_outedge)
    G.remove_nodes_from(nodes_no_val_score)
    root_node = [n for n, d in G.out_degree() if d == 0]
    best_model_node = predictor.get_model_best()
    A = nx.nx_agraph.to_agraph(G)
    A.graph_attr.update(rankdir='BT')
    A.node_attr.update(fontsize=10)
    A.node_attr.update(shape='rectangle')
    for node in A.iternodes():
        node.attr['label'] = f"{node.name}\nVal score: {float(node.attr['val_score']):.4f}"
        # Windows fix: replace backslashes in the node's path attribute
        node.attr['path'] = node.attr['path'].replace('\\', '/')
        if node.name == best_model_node:
            # Highlight the best model
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ff9900'
            node.attr['shape'] = 'box3d'
        elif nx.has_path(G, node.name, best_model_node):
            # Highlight the models that feed into the best model
            node.attr['style'] = 'filled'
            node.attr['fillcolor'] = '#ffcc00'
    model_image_fname = os.path.join(predictor.path, 'ensemble_model.png')
    A.draw(model_image_fname, format='png', prog='dot')
    return G
1.5 Prediction
(1) .predict() and .predict_proba()
.predict(data, model=None, as_pandas=True, transform_features=True)
.predict_proba(data, model=None, as_pandas=True, as_multiclass=True, transform_features=True)
Main parameters:
- data (str or TabularDataset or pd.DataFrame)
  - str: file path
- model (str): the model to use. If None, the model with the best validation score is used.
  - Valid models are listed in this predictor by calling predictor.get_model_names()
- as_pandas (bool), default = True
  - as_pandas=True: return pd.Series; otherwise (False), return np.ndarray
- as_multiclass (bool), default = True
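The model argument makes it easy to compare what the individual models predict. A minimal sketch, assuming a fitted predictor from AutoGluon v0.5.x (the version these notes use, where get_model_names() is available); both helper functions are hypothetical, and positive_label stands for whichever class label you treat as positive.

```python
def predict_with_every_model(predictor, data):
    """Return {model_name: predictions} for each model the predictor lists."""
    return {
        name: predictor.predict(data, model=name)
        for name in predictor.get_model_names()
    }


def positive_class_scores(predictor, data, positive_label):
    """Probability of one class, taken from the per-class columns of predict_proba."""
    # With as_pandas=True and as_multiclass=True the result has one column per class label.
    proba = predictor.predict_proba(data, as_pandas=True, as_multiclass=True)
    return proba[positive_label]
```

Usage would look like predictions = predict_with_every_model(predictor, test_data), where predictor was fitted or loaded beforehand and test_data is a table without the label column.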
1.6 Evaluation
(1) .evaluate() and .evaluate_predictions()
.evaluate(data, model=None, silent=False, auxiliary_metrics=True, detailed_report=False)
.evaluate_predictions(y_true, y_pred, sample_weight=None, ...)
Main parameters:
- silent (bool), default = False
  - If False, performance results are printed.
- auxiliary_metrics (bool), default = True
  - Whether to compute other (problem_type specific) metrics in addition to the default metric.
- detailed_report (bool), default = False
  - Whether to compute more detailed versions of the auxiliary_metrics (requires auxiliary_metrics = True).
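Combining .evaluate() with the model argument gives a per-model comparison on held-out data. The sketch below assumes a fitted v0.5.x predictor, where .evaluate() returns a dict mapping metric names to values; compare_models is a hypothetical helper, and test_data must contain the label column.

```python
def compare_models(predictor, test_data):
    """Evaluate every model on test_data and return {model_name: metrics dict}."""
    scores = {}
    for name in predictor.get_model_names():
        # silent=True suppresses the printed report; the metrics are still returned.
        scores[name] = predictor.evaluate(
            test_data, model=name, silent=True, auxiliary_metrics=True
        )
    return scores
```

For a ready-made table of the same kind, .leaderboard() called with the test data is the built-in alternative.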
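1.7 Miscellaneous
.load(save_path): load a saved predictor
Code example:
from autogluon.tabular import TabularPredictor
# No need to construct a predictor first
predictor = TabularPredictor.load(model_path)
.get_model_names(): get the names of all models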
