Microsoft ML sentiment analysis prints incorrect predictions?



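I built a text analysis model using C# and the Microsoft ML library. With the dataset Microsoft provides, the model predicts some review strings well: for "Batteries not included" it prints Negative, and for "No batteries" it also prints Negative. However, when I tested inputs such as "Not bad" and "This is really bad", it printed Positive for both, which is incorrect.

I implemented this from the tutorial in Microsoft's sentiment analysis documentation. The training dataset, yelp_labelled.txt, is very small (60 KB). It contains sample sentences, each labelled 0 (Negative) or 1 (Positive). Where can I find a larger dataset text file that I can use to improve the accuracy of my model?

The code I am using is below: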
using AnalysisSentiment;
using Microsoft.ML;
using Microsoft.ML.Data;
using static Microsoft.ML.DataOperationsCatalog;

// Create a field to hold the data file path
string _dataPath = "yelp_labelled.txt";

// Initialize the context
MLContext mlContext = new MLContext();
TrainTestData splitDataView = LoadData(mlContext);
ITransformer model = BuildAndTrainModel(mlContext, splitDataView.TrainSet);
Evaluate(mlContext, model, splitDataView.TestSet);
UseModelWithSingleItem(mlContext, model);

TrainTestData LoadData(MLContext mlContext)
{
    IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentData>(_dataPath, hasHeader: false);
    TrainTestData splitDataView = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
    return splitDataView;
}

ITransformer BuildAndTrainModel(MLContext mlContext, IDataView splitTrainSet)
{
    var estimator = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentData.SentimentText))
        .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features"));
    Console.WriteLine("=============== Create and Train the Model ===============");
    var model = estimator.Fit(splitTrainSet);
    Console.WriteLine("=============== End of training ===============");
    Console.WriteLine();
    return model;
}

void Evaluate(MLContext mlContext, ITransformer model, IDataView splitTestSet)
{
    Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
    IDataView predictions = model.Transform(splitTestSet);
    CalibratedBinaryClassificationMetrics metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
    Console.WriteLine();
    Console.WriteLine("Model quality metrics evaluation");
    Console.WriteLine("--------------------------------");
    Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
    Console.WriteLine($"Auc: {metrics.AreaUnderRocCurve:P2}");
    Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
    Console.WriteLine("=============== End of model evaluation ===============");
}

void UseModelWithSingleItem(MLContext mlContext, ITransformer model)
{
    PredictionEngine<SentimentData, SentimentPrediction> predictionFunction = mlContext.Model.CreatePredictionEngine<SentimentData, SentimentPrediction>(model);
    SentimentData sampleStatement = new SentimentData
    {
        SentimentText = "not bad"
    };
    var resultPrediction = predictionFunction.Predict(sampleStatement);
    Console.WriteLine();
    Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
    Console.WriteLine();
    Console.WriteLine($"Sentiment: {resultPrediction.SentimentText} | Prediction: {(Convert.ToBoolean(resultPrediction.Prediction) ? "Positive" : "Negative")} | Probability: {resultPrediction.Probability} ");
    Console.WriteLine("=============== End of Predictions ===============");
    Console.WriteLine();
}
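  • Transfer learning: since your dataset is small, the best approach is to pretrain on a larger sentiment dataset, such as the IMDB movie reviews, and then fine-tune on your own data.
  • However, the model you are using is a simple logistic regression, which does not support pretraining and fine-tuning, so you would have to change the underlying ML model to a deep learning model.
  • Add more similar data: if you cannot change the underlying logistic regression model, you can try adding the IMDB dataset to your own dataset, training from scratch, and checking whether your model's test performance improves. It may work because IMDB is also a two-class (positive/negative) dataset and looks very similar to yours.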
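
To make the "add more similar data" option concrete, here is a minimal sketch that trains the same pipeline on several files at once. It rests on assumptions that are not in the question: the extra files imdb_labelled.txt and amazon_cells_labelled.txt are assumed to sit next to yelp_labelled.txt and to use the same tab-separated "sentence, then 0 or 1" layout, and the SentimentData class below is a stand-in for the one in your AnalysisSentiment namespace. The full IMDB movie review dataset ships in a different layout and would need converting to this format first.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Assumption: these files exist locally and all share the
// "sentence<TAB>0|1" layout of yelp_labelled.txt.
string[] dataPaths =
{
    "yelp_labelled.txt",
    "imdb_labelled.txt",
    "amazon_cells_labelled.txt"
};

var mlContext = new MLContext(seed: 0);

// A TextLoader can read several files with the same schema into one IDataView.
var loader = mlContext.Data.CreateTextLoader<SentimentData>(separatorChar: '\t', hasHeader: false);
IDataView allData = loader.Load(dataPaths);

// Same split, featurizer and trainer as in the question, only with more rows.
var split = mlContext.Data.TrainTestSplit(allData, testFraction: 0.2);

var pipeline = mlContext.Transforms.Text
    .FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentData.SentimentText))
    .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: "Label", featureColumnName: "Features"));

ITransformer model = pipeline.Fit(split.TrainSet);

// Compare this figure with the Yelp-only run to see whether the extra data helped.
var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet), "Label");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");

// Stand-in for the question's SentimentData class (assumed shape).
public class SentimentData
{
    [LoadColumn(0)]
    public string SentimentText = string.Empty;

    [LoadColumn(1), ColumnName("Label")]
    public bool Sentiment;
}
```

Whether this actually fixes inputs like "not bad" is something only the test metrics and a few manual predictions can tell you. More rows give the model more examples of negation, but the trainer is still the same logistic regression.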
