ML.NET sample notes 1

1 ML.NET samples overview

https://github.com/feiyun0112/machinelearning-samples.zh-cn/tree/master
https://gitee.com/mirrors_feiyun0112/machinelearning-samples.zh-cn
The official ML.NET samples are grouped into categories by scenario and by machine learning problem/task, as listed below:

Binary classification

  • Sentiment analysis: C#, F#
  • Spam detection: C#, F#
  • Credit card fraud detection (Binary Classification): C#, F#
  • Heart disease prediction: C#

Multiclass classification

  • GitHub Issues classification: C#, F#
  • Iris flower classification: C#, F#
  • Handwritten digit recognition: C#

Recommendation

  • Product recommendation: C#
  • Movie recommendation (Matrix Factorization): C#
  • Movie recommendation (Field Aware Factorization Machines): C#

Regression

  • Price prediction: C#, F#
  • Sales forecasting: C#
  • Demand prediction: C#, F#

Time series forecasting

  • Sales forecasting: C#

Anomaly detection

  • Sales spike detection: C#, C#
  • Power anomaly detection: C#
  • Credit card fraud detection (Anomaly Detection): C#

Clustering

  • Customer segmentation: C#, F#
  • Iris flower clustering: C#, F#

Ranking

  • Ranking search engine results: C#

Computer vision

  • Image classification training (High-Level API): C#, F#
  • Image classification prediction (Pretrained TensorFlow model scoring): C#, F#, C#
  • Image classification training (TensorFlow Featurizer Estimator): C#, F#
  • Object detection (ONNX model scoring): C#, C#

Cross-cutting scenarios

  • Scalable model on Web API: C#
  • Scalable model on Razor web app: C#
  • Scalable model on Azure Functions: C#
  • Scalable model on Blazor web app: C#
  • Large datasets: C#
  • Loading data with DatabaseLoader: C#
  • Loading data with LoadFromEnumerable: C#
  • Model explainability: C#
  • Export to ONNX: C#

2 ML.NET sample notes


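The project directory is organized by how ML.NET is used, into folders such as CLI, modelbuilder and csharp. These correspond to automated learning from the command line, learning through the GUI, and learning through the API, respectively.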

Visual Studio installs missing components automatically.

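To avoid errors caused by overly long Windows paths, change the following setting:
Local Group Policy Editor
Computer Configuration / Administrative Templates / System / Filesystem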

3 Binary classification

4 Sentiment analysis

5 Training data

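This problem is about predicting whether a customer's comment carries positive or negative sentiment. We use the small wikipedia-detox-datasets (one dataset for training and one for evaluating the model's accuracy). These datasets have already been processed by humans, and each comment has been assigned a sentiment label:

  • 0 - positive
  • 1 - negative

Using these datasets, we build a model that analyzes a string at prediction time and predicts a sentiment value of 0 or 1.
The training data looks like this: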
Label rev_id comment year logged_in ns sample split
0 666674821.0 " He is a Rapist!!!!! Please edit the article to include this important fact. Thank You. — Preceding unsigned comment added by • " 2015 True article blocked train
0 24297552.0 The other two films Hitch and Magnolia are also directly related to the community in question, and may be of interest to those who see those films. So why not link to them? 2005 False article random train

6 Data structure

public class SentimentIssue
{
    [LoadColumn(0)]
    public bool Label { get; set; }
    [LoadColumn(2)]
    public string Text { get; set; }
}

Only columns 0 and 2 of the data are selected.

        IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentIssue>(DataPath, hasHeader: true);
        TrainTestData trainTestSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
        IDataView trainingData = trainTestSplit.TrainSet;
        IDataView testData = trainTestSplit.TestSet;

The data is loaded using the input type defined above and then split 80/20 (testFraction: 0.2) into training and test sets.
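
To make the 80/20 split concrete, here is a minimal sketch in plain C#. It is not how ML.NET implements TrainTestSplit; the SplitSketch class, its Split method, and the fixed seed are hypothetical names chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the rows with a seeded Random, then holds out testFraction of them.
    public static (List<T> Train, List<T> Test) Split<T>(IEnumerable<T> rows, double testFraction, int seed = 1)
    {
        var rng = new Random(seed);
        var shuffled = rows.OrderBy(_ => rng.Next()).ToList();
        int testCount = (int)Math.Round(shuffled.Count * testFraction);
        var test = shuffled.Take(testCount).ToList();
        var train = shuffled.Skip(testCount).ToList();
        return (train, test);
    }

    public static void Main()
    {
        var (train, test) = Split(Enumerable.Range(0, 10), testFraction: 0.2);
        Console.WriteLine($"train rows: {train.Count}, test rows: {test.Count}");
    }
}
```

With 10 rows and a test fraction of 0.2, 8 rows go to training and 2 are held out. Every row lands in exactly one of the two sets.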

7 Training

        // STEP 2: Common data process configuration with pipeline data transformations          
        var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentIssue.Text));
        // STEP 3: Set the training algorithm, then create and config the modelBuilder                            
        var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features");
        var trainingPipeline = dataProcessPipeline.Append(trainer);
        // STEP 4: Train the model fitting to the DataSet
        ITransformer trainedModel = trainingPipeline.Fit(trainingData);

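The dataProcessPipeline variable is a data processing pipeline created by calling mlContext.Transforms.Text.FeaturizeText. Two parameters are passed here, outputColumnName and inputColumnName.
outputColumnName is the name of the output feature column, set to "Features" here.
inputColumnName is the name of the input text column. Here it is nameof(SentimentIssue.Text), a C# expression that yields the member's name as a string, in this case "Text".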
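
As a small illustration of the nameof expression described above, the sketch below reuses the SentimentIssue shape without the ML.NET attributes. NameofDemo and InputColumn are hypothetical names introduced only for this example.

```csharp
using System;

public class SentimentIssue
{
    public bool Label { get; set; }
    public string Text { get; set; }
}

public static class NameofDemo
{
    // nameof is resolved at compile time and yields the member name as a string.
    public static string InputColumn() => nameof(SentimentIssue.Text);

    public static void Main()
    {
        Console.WriteLine(InputColumn()); // prints "Text"
    }
}
```

Using nameof instead of the literal "Text" means that renaming the property is caught by the compiler rather than silently breaking the column name.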

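The SdcaLogisticRegression method creates a binary classification trainer, with the label column set to "Label" and the feature column set to "Features". This trainer fits a logistic regression model using the SDCA (stochastic dual coordinate ascent) algorithm.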

8 Evaluating the model

        // STEP 5: Evaluate the model and show accuracy stats
        var predictions = trainedModel.Transform(testData);
        var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label", scoreColumnName: "Score");

        ConsoleHelper.PrintBinaryClassificationMetrics(trainer.ToString(), metrics);
        // STEP 6: Save/persist the trained model to a .ZIP file
        mlContext.Model.Save(trainedModel, trainingData.Schema, ModelPath);
        Console.WriteLine("The model is saved to {0}", ModelPath);

The trained model is evaluated on the test data and then saved.

9 Project run results

10 Spam detection

11 Training data

Nearly 6,000 messages, each classified as "spam" or "ham" (not spam).
The downloaded data looks like this:
ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham Ok lar... Joking wif u oni...
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham U dun say so early hor... U c already then say...

12 Data structure

class SpamInput
{
    [LoadColumn(0)]
    public string Label { get; set; }
    [LoadColumn(1)]
    public string Message { get; set; }

}
This maps to the two columns in the data, both of type string.

13 Training

        // Create the estimator which converts the text label to boolean, featurizes the text, and adds a linear trainer.
        // Data process configuration with pipeline data transformations 
        var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")
                                  .Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", new Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.Options
                                  {
                                      WordFeatureExtractor = new Microsoft.ML.Transforms.Text.WordBagEstimator.Options { NgramLength = 2, UseAllLengths = true },
                                      CharFeatureExtractor = new Microsoft.ML.Transforms.Text.WordBagEstimator.Options { NgramLength = 3, UseAllLengths = false },
                                      Norm = Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.NormFunction.L2,
                                  }, "Message"))
                                  .Append(mlContext.Transforms.CopyColumns("Features", "FeaturesText"))
                                  .AppendCacheCheckpoint(mlContext);

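Machine learning algorithms work on numbers, so text must first be converted into numeric form. The pipeline does the following:
MapValueToKey maps each distinct value in the "Label" column to a numeric key.
FeaturizeText vectorizes the "Message" column, using WordBagEstimator options to turn the text into a feature vector. The WordBagEstimator options used are NgramLength (the n-gram length) and UseAllLengths (whether to use n-grams of every length up to NgramLength, or only NgramLength itself). Norm, an option of TextFeaturizingEstimator itself, sets how the feature vector is normalized.
CopyColumns copies the "FeaturesText" column to the "Features" column.
AppendCacheCheckpoint adds a cache checkpoint to the pipeline. [Data processing is time-consuming during model training. A cache checkpoint is an optimization that caches the processed data so that later training iterations do not repeat the same processing. This can noticeably shorten training time for trainers that make multiple passes over the data.]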
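
To show what NgramLength and UseAllLengths control, here is a minimal sketch in plain C#. It only enumerates n-grams; ML.NET's featurizer also normalizes and tokenizes the text and turns n-gram counts into a numeric vector, none of which is reproduced here. NgramSketch and Ngrams are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NgramSketch
{
    // Returns n-grams over the tokens. With useAllLengths = true, every length
    // from 1 up to ngramLength is emitted; otherwise only ngramLength itself.
    public static List<string> Ngrams(IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths)
    {
        var result = new List<string>();
        int minLength = useAllLengths ? 1 : ngramLength;
        for (int n = minLength; n <= ngramLength; n++)
            for (int i = 0; i + n <= tokens.Count; i++)
                result.Add(string.Join("|", tokens.Skip(i).Take(n)));
        return result;
    }

    public static void Main()
    {
        string message = "free entry now";
        var words = message.Split(' ');
        var chars = message.Select(c => c.ToString()).ToArray();

        // Word unigrams and bigrams (NgramLength = 2, UseAllLengths = true).
        Console.WriteLine(string.Join(", ", Ngrams(words, 2, true)));
        // Character trigrams only (NgramLength = 3, UseAllLengths = false).
        Console.WriteLine(string.Join(", ", Ngrams(chars, 3, false)));
    }
}
```

For the three words above, the word setting yields three unigrams followed by two bigrams, which mirrors the WordFeatureExtractor configuration in the pipeline.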
        // Set the training algorithm
        var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(mlContext.BinaryClassification.Trainers.AveragedPerceptron(labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"), labelColumnName: "Label")
                                  .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
        var trainingPipeLine = dataProcessPipeline.Append(trainer);
        // Evaluate the model using cross-validation.
        // Cross-validation splits our dataset into 'folds', trains a model on some folds and
        // evaluates it on the remaining fold. We are using 5 folds so we get back 5 sets of scores.
        // Let's compute the average AUC, which should be between 0.5 and 1 (higher is better).
        Console.WriteLine("=============== Cross-validating to get model's accuracy metrics ===============");
        var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(data: data, estimator: trainingPipeLine, numberOfFolds: 5);
        ConsoleHelper.PrintMulticlassClassificationFoldsAverageMetrics(trainer.ToString(), crossValidationResults);

        // Now let's train a model on the full dataset to help us get better results
        var model = trainingPipeLine.Fit(data);
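  1. Set the training algorithm: OneVersusAll is a common multiclass strategy that trains one binary classifier per class, treating that class as positive and all other classes as negative. The underlying trainer is AveragedPerceptron, a commonly used binary classifier. Finally, Append adds a MapKeyToValue transform that converts the predicted label key back to its original value.
  2. Append the trainer to the data processing pipeline: dataProcessPipeline.Append(trainer) ensures that the data is preprocessed first and then used for training.
  3. Cross-validate: MulticlassClassification.CrossValidate splits the dataset into the configured number of folds (5 here), trains on some folds, and evaluates on the remaining one. This yields one set of metrics per fold, which are then averaged. The code comment mentions average AUC, but the multiclass evaluator reports metrics such as micro-accuracy, macro-accuracy, and log-loss rather than AUC.
  4. Train on the full dataset: trainingPipeLine.Fit(data) trains the model on the whole dataset to get better results.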
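
The fold mechanics in step 3 can be sketched in plain C# as follows. This is an illustration under simplifying assumptions, not ML.NET's CrossValidate: rows are assigned to folds by index without shuffling, and KFoldSketch and Folds are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KFoldSketch
{
    // Assigns row i to fold i % numberOfFolds, then yields one (train, test)
    // pair of row indices per fold. Real implementations usually shuffle first.
    public static IEnumerable<(List<int> Train, List<int> Test)> Folds(int rowCount, int numberOfFolds)
    {
        for (int fold = 0; fold < numberOfFolds; fold++)
        {
            var test = Enumerable.Range(0, rowCount).Where(i => i % numberOfFolds == fold).ToList();
            var train = Enumerable.Range(0, rowCount).Where(i => i % numberOfFolds != fold).ToList();
            yield return (train, test);
        }
    }

    public static void Main()
    {
        foreach (var (train, test) in Folds(rowCount: 10, numberOfFolds: 5))
            Console.WriteLine($"train on {train.Count} rows, evaluate on {test.Count} rows");
    }
}
```

Each row is used for evaluation exactly once and for training in the other four folds, which is why the averaged metrics are a less noisy estimate than a single train/test split.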

14 Project run results

15 Recommendation

16 Product recommendation

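Add the ProductRecommender.csproj project to the solution manually.

This project recommends related products based on purchase history.

17 Training data

The data below has been anonymized: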
ProductID ProductID_Copurchased
0 1
0 2
0 3
0 4
0 5
1 0
1 2
1 4
1 5
1 15

18 Data structure

    public class ProductEntry
    {
        [KeyType(count : 262111)]
        public uint ProductID { get; set; }
        [KeyType(count : 262111)]
        public uint CoPurchaseProductID { get; set; }
    }

The training data contains 262,111 distinct products, hence the key count of 262111.

19 Training

        //STEP 2: Read the trained data using TextLoader by defining the schema for reading the product co-purchase dataset
        //        Do remember to replace amazon0302.txt with dataset from https://snap.stanford.edu/data/amazon0302.html
        var traindata = mlContext.Data.LoadFromTextFile(path:TrainingDataLocation,
                                                  columns: new[]
                                                            {
                                                                new TextLoader.Column("Label", DataKind.Single, 0),
                                                                new TextLoader.Column(name:nameof(ProductEntry.ProductID), dataKind:DataKind.UInt32, source: new [] { new TextLoader.Range(0) }, keyCount: new KeyCount(262111)), 
                                                                new TextLoader.Column(name:nameof(ProductEntry.CoPurchaseProductID), dataKind:DataKind.UInt32, source: new [] { new TextLoader.Range(1) }, keyCount: new KeyCount(262111))
                                                            },
                                                  hasHeader: true,
                                                  separatorChar: '\t');
        //STEP 3: Your data is already encoded so all you need to do is specify options for MatrixFactorizationTrainer with a few extra hyperparameters
        //        LossFunction, Alpha, Lambda and a few others like K and C as shown below and call the trainer.
        MatrixFactorizationTrainer.Options options = new MatrixFactorizationTrainer.Options();
        options.MatrixColumnIndexColumnName = nameof(ProductEntry.ProductID);
        options.MatrixRowIndexColumnName = nameof(ProductEntry.CoPurchaseProductID);
        options.LabelColumnName= "Label";
        options.LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass;
        options.Alpha = 0.01;
        options.Lambda = 0.025;
        // For better results use the following parameters
        //options.K = 100;
        //options.C = 0.00001;
        //Step 4: Call the MatrixFactorization trainer by passing options.
        var est = mlContext.Recommendation().Trainers.MatrixFactorization(options);
        
        //STEP 5: Train the model fitting to the DataSet
        //Please add Amazon0302.txt dataset from https://snap.stanford.edu/data/amazon0302.html to Data folder if FileNotFoundException is thrown.
        ITransformer model = est.Fit(traindata);

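The matrix factorization trainer is used to train recommender systems. It works on user behavior data and learns the latent relationships between user interests and items. It applies to many recommendation scenarios, such as movie or product recommendation.
The trainer represents the behavior data as a matrix and factorizes it to uncover those latent relationships. Training typically uses an optimization algorithm such as stochastic gradient descent (SGD), with regularization to prevent overfitting.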
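
The idea can be sketched with a toy SGD loop in plain C#. This is plain squared-loss matrix factorization with L2 regularization, not ML.NET's trainer and not the one-class loss (SquareLossOneClass) configured in the sample. The class name, toy ratings, rank, learning rate, and lambda are all hypothetical values chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class MatrixFactorizationSketch
{
    public static double Dot(double[] a, double[] b)
    {
        double s = 0;
        for (int k = 0; k < a.Length; k++) s += a[k] * b[k];
        return s;
    }

    // Root-mean-square error of predicted scores over the observed entries.
    public static double Rmse(IReadOnlyList<(int Row, int Col, double Value)> entries, double[][] p, double[][] q)
    {
        double sum = 0;
        foreach (var (r, c, v) in entries)
        {
            double err = v - Dot(p[r], q[c]);
            sum += err * err;
        }
        return Math.Sqrt(sum / entries.Count);
    }

    // Small random starting values for the latent factor vectors.
    public static double[][] RandomFactors(int count, int rank, Random rng)
    {
        var f = new double[count][];
        for (int i = 0; i < count; i++)
        {
            f[i] = new double[rank];
            for (int k = 0; k < rank; k++) f[i][k] = 0.1 * rng.NextDouble();
        }
        return f;
    }

    // One SGD pass: for each observed entry, move both factor vectors along the
    // negative gradient of the squared error plus an L2 penalty (lambda).
    public static void SgdEpoch(IReadOnlyList<(int Row, int Col, double Value)> entries,
                                double[][] p, double[][] q, double learningRate, double lambda)
    {
        foreach (var (r, c, v) in entries)
        {
            double err = v - Dot(p[r], q[c]);
            for (int k = 0; k < p[r].Length; k++)
            {
                double pk = p[r][k], qk = q[c][k];
                p[r][k] += learningRate * (err * qk - lambda * pk);
                q[c][k] += learningRate * (err * pk - lambda * qk);
            }
        }
    }

    public static void Main()
    {
        // Hypothetical toy ratings: (user, item, rating).
        var entries = new List<(int Row, int Col, double Value)>
        {
            (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)
        };
        var rng = new Random(0);
        var p = RandomFactors(3, 2, rng);
        var q = RandomFactors(3, 2, rng);

        Console.WriteLine($"RMSE before training: {Rmse(entries, p, q):F3}");
        for (int epoch = 0; epoch < 200; epoch++) SgdEpoch(entries, p, q, 0.05, 0.01);
        Console.WriteLine($"RMSE after training:  {Rmse(entries, p, q):F3}");
    }
}
```

A predicted score for any (row, column) pair, including pairs never observed, is the dot product of the two learned factor vectors. That is what makes the factorization usable for recommendation.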

20 Project run results

I modified the original project code slightly to score a few more product pairs:
        var pes = new ProductEntry[]
        {
            new ProductEntry() {
                ProductID = 3,
                CoPurchaseProductID = 63
            },
            new ProductEntry() {
                ProductID = 3,
                CoPurchaseProductID = 20
            },
            new ProductEntry() {
                ProductID = 262108,
                CoPurchaseProductID = 262109
            },
            new ProductEntry() {
                ProductID = 262108,
                CoPurchaseProductID = 3
            }
        };
        var predictionengine = mlContext.Model.CreatePredictionEngine<ProductEntry, Copurchase_prediction>(model);
        foreach (var p in pes)
        {
            var prediction = predictionengine.Predict(p);
            Console.WriteLine($"For ProductID = {p.ProductID} and CoPurchaseProductID = {p.CoPurchaseProductID}");
            Console.WriteLine(" the predicted score is " + Math.Round(prediction.Score, 1));
        }

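As the output shows, this algorithm's predictions are noticeably off.
In machine learning, results can differ greatly from one algorithm to another.

21 Movie recommendation - matrix factorization

22 Training data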

Movie ratings, titles, genres, and related information.
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance

recommendation-ratings-train.csv
userId,movieId,rating,timestamp
1,1,4,964982703
1,3,4,964981247
1,6,4,964982224
1,47,5,964983815

23 Data structure

public class MovieRating
{
    [LoadColumn(0)]
    public float userId;
    [LoadColumn(1)]
    public float movieId;
    [LoadColumn(2)]
    public float Label;

}

24 Training

        //STEP 3: Transform your data by encoding the two features userId and movieID. These encoded features will be provided as input
        //        to our MatrixFactorizationTrainer.
        var dataProcessingPipeline = mlcontext.Transforms.Conversion.MapValueToKey(outputColumnName: "userIdEncoded", inputColumnName: nameof(MovieRating.userId))
                       .Append(mlcontext.Transforms.Conversion.MapValueToKey(outputColumnName: "movieIdEncoded", inputColumnName: nameof(MovieRating.movieId)));
        
        //Specify the options for MatrixFactorization trainer            
        MatrixFactorizationTrainer.Options options = new MatrixFactorizationTrainer.Options();
        options.MatrixColumnIndexColumnName = "userIdEncoded";
        options.MatrixRowIndexColumnName = "movieIdEncoded";
        options.LabelColumnName = "Label";
        options.NumberOfIterations = 20;
        options.ApproximationRank = 100;
        //STEP 4: Create the training pipeline 
        var trainingPipeLine = dataProcessingPipeline.Append(mlcontext.Recommendation().Trainers.MatrixFactorization(options));
        //STEP 5: Train the model fitting to the DataSet
        Console.WriteLine("=============== Training the model ===============");
        ITransformer model = trainingPipeLine.Fit(trainingDataView);

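25 Project run results

26 Movie recommendation - field-aware factorization machine

This sample keeps training and consumption in separate projects:
samples\csharp\end-to-end-apps\Recommendation-MovieRecommender\MovieRecommender_Model is the model training project
samples\csharp\end-to-end-apps\Recommendation-MovieRecommender\MovieRecommender is the ASP.NET Core application

27 Training data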

userId,movieId,rating,timestamp
1,1,4,964982703
1,3,4,964981247
1,6,4,964982224
1,47,5,964983815
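The raw data is processed by the function below, which converts the ratings to binary labels and splits the data 9:1 into training and test sets: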
/*
 * FieldAwareFactorizationMachine, the learner used in this example, requires the problem to be set up as a binary classification problem.
 * The DataPrep method performs two tasks:
 * 1. It goes through all the ratings and replaces ratings > 3 with 1, suggesting a movie is recommended, and ratings < 3 with 0, suggesting
 *    a movie is not recommended.
 * 2. It also splits ratings.csv into rating-train.csv and ratings-test.csv, used for model training and testing respectively.
 */
public static void DataPrep()
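
The body of DataPrep is not reproduced in these notes. As a rough sketch of the two tasks its comment describes, the plain C# below assumes that ratings above 3 become true and everything else becomes false (the comment leaves a rating of exactly 3 unspecified), and that the 9:1 split follows file order. RatingPrepSketch, ToLabel, and Prepare are hypothetical names, not the sample's code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class RatingPrepSketch
{
    // Assumption: a rating above 3 means "recommended"; anything else does not.
    public static bool ToLabel(double rating) => rating > 3;

    // Rewrites "userId,movieId,rating,timestamp" lines as "userId,movieId,label"
    // and splits them 9:1 in file order (no shuffling) into train and test sets.
    public static (List<string> Train, List<string> Test) Prepare(IEnumerable<string> csvLines)
    {
        var rows = csvLines
            .Skip(1) // skip the header row
            .Select(line => line.Split(','))
            .Select(f => $"{f[0]},{f[1]},{ToLabel(double.Parse(f[2], CultureInfo.InvariantCulture))}")
            .ToList();
        int trainCount = (int)(rows.Count * 0.9);
        return (rows.Take(trainCount).ToList(), rows.Skip(trainCount).ToList());
    }

    public static void Main()
    {
        var lines = new[]
        {
            "userId,movieId,rating,timestamp",
            "1,1,4,964982703",
            "1,3,4,964981247",
            "1,6,4,964982224",
            "1,47,5,964983815"
        };
        var (train, test) = Prepare(lines);
        Console.WriteLine($"train rows: {train.Count}, test rows: {test.Count}");
        Console.WriteLine(train[0]);
    }
}
```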

28 Data structure

public class MovieRating
{
    [LoadColumn(0)]
    public string userId;
    [LoadColumn(1)]
    public string movieId;
    [LoadColumn(2)]
    public bool Label;
}

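The columns are the same as in the matrix factorization sample, except that userId and movieId are strings and Label is a bool.

29 Training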

     // ML.NET doesn't cache data set by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
     // expensive featurization and disk operations. When the considered data can fit into memory, a solution is to cache the data in memory. Caching is especially
     // helpful when working with iterative algorithms which needs many data passes. Since SDCA is the case, we cache. Inserting a
     // cache step in a pipeline is also possible, please see the construction of pipeline below.
     trainingDataView = mlContext.Data.Cache(trainingDataView);
     Console.WriteLine("=============== Transform Data And Preview ===============", color);
     Console.WriteLine();
     //STEP 4: Transform your data by encoding the two features userId and movieID.
     //        These encoded features will be provided as input to FieldAwareFactorizationMachine learner
     var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "userIdFeaturized", inputColumnName: nameof(MovieRating.userId))
                                   .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "movieIdFeaturized", inputColumnName: nameof(MovieRating.movieId))
                                   .Append(mlContext.Transforms.Concatenate("Features", "userIdFeaturized", "movieIdFeaturized"))); 
     Common.ConsoleHelper.PeekDataViewInConsole(mlContext, trainingDataView, dataProcessPipeline, 10);
        
     // STEP 5: Train the model fitting to the DataSet
     Console.WriteLine("=============== Training the model ===============", color);
     Console.WriteLine();
     var trainingPipeLine = dataProcessPipeline.Append(mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));
     var model = trainingPipeLine.Fit(trainingDataView);


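Running this project trains the model and saves it to a file.

30 Consuming the model

Run the MovieRecommender project.

The initial page only displays static data. Selecting a profile such as Ankit shows the following:

Clicking Recommend shows the following: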

        MovieRatingPrediction prediction = null;
        foreach (var movie in _movieService.GetTrendingMovies)
        {
            // Call the Rating Prediction for each movie prediction
            prediction = _model.Predict(new MovieRating
            {
                userId = id.ToString(),
                movieId = movie.MovieID.ToString()
            });
            // Normalize the prediction scores for the "ratings" b/w 0 - 100
            float normalizedscore = Sigmoid(prediction.Score);
            // Add the score for recommendation of each movie in the trending movie list
            ratings.Add((movie.MovieID, normalizedscore));
        }
A recommendation score is computed for each of the currently trending movies.
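
The Sigmoid helper called above is not shown in these notes. A minimal sketch of such a normalization, assuming it is the logistic function scaled to the 0-100 range that the code comment mentions, could look like this. ScoreNormalizationSketch is a hypothetical name.

```csharp
using System;

public static class ScoreNormalizationSketch
{
    // Logistic function scaled to the 0-100 range.
    // Assumption: the sample's Sigmoid helper behaves like this; its body is not shown here.
    public static float Sigmoid(float x) => (float)(100 / (1 + Math.Exp(-x)));

    public static void Main()
    {
        foreach (float raw in new[] { -2f, 0f, 2f })
            Console.WriteLine($"raw score {raw} -> normalized {Sigmoid(raw):F1}");
    }
}
```

The raw model score is unbounded, so squashing it this way gives every movie a comparable value: a raw score of 0 maps to 50, and larger raw scores approach 100.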

posted @ 2023-12-15 18:25  2012