Choosing an sklearn pipeline to classify user text data

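I am developing a machine learning application in Python (using the sklearn module) and am currently trying to decide on a model for performing inference. A brief description of the problem: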

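Given many instances of user data, I am trying to classify them into categories based on relative keyword inclusion. The problem is supervised, so I have many examples of data that have already been categorized. (Each piece of data is roughly 2 to 12 words long.)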

I am currently trying to decide between two potential models:

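1. CountVectorizer + multinomial Naive Bayes. Use sklearn's CountVectorizer to obtain keyword counts from the training data, then classify the data with Naive Bayes using sklearn's MultinomialNB model.
2. tf-idf term weighting on the keyword counts + standard Naive Bayes. Obtain a keyword count matrix for the training data with CountVectorizer, transform it to tf-idf weights with sklearn's TfidfTransformer, and then feed the result into a standard Naive Bayes model.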
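For reference, the two options can be sketched roughly as follows. This is only a minimal sketch: the toy `texts` and `labels` are made up for illustration, and MultinomialNB is assumed as the "standard" Naive Bayes model in the second option, since the question does not name a specific class.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real pre-classified examples.
texts = ['buy widget', 'buy widgets', 'widget cost', 'widget price',
         'widget install', 'installing widgets']
labels = ['Buy', 'Buy', 'Cost', 'Cost', 'Install', 'Install']

# Option 1: raw keyword counts fed straight into multinomial Naive Bayes.
count_nb = Pipeline([
    ('counts', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# Option 2: the same counts re-weighted by tf-idf before the classifier.
# MultinomialNB is an assumption here; the question only says "standard".
tfidf_nb = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])

count_nb.fit(texts, labels)
tfidf_nb.fit(texts, labels)

new_texts = ['buy a widget', 'widget price list']
count_pred = count_nb.predict(new_texts)
tfidf_pred = tfidf_nb.predict(new_texts)
print(count_pred, tfidf_pred)
```

The only structural difference between the two pipelines is the extra TfidfTransformer step, which makes them easy to compare side by side.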

I have read the documentation for the classes used in both approaches, and both seem to address my problem well.

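Is there any obvious reason why tf-idf weighting with a standard Naive Bayes model might outperform multinomial Naive Bayes for this type of problem? Are there any glaring issues with either approach?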

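In this setup, Naive Bayes and MultinomialNB are the same algorithm. The difference you get comes from the tf-idf transformation, which penalizes words that occur in many documents of the corpus.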
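To make the penalty concrete, here is a small sketch that fits a TfidfVectorizer with default settings on a made-up four-document corpus and looks at the learned idf weights. The corpus is hypothetical; the point is only that a word appearing in every document ("widget") ends up with a lower idf than a word appearing in one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: 'widget' occurs in every document, the other words once.
docs = ['buy widget', 'widget cost', 'widget install', 'vinyl widget']

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Map each term to its learned inverse document frequency.
idf = {term: vectorizer.idf_[index]
       for term, index in vectorizer.vocabulary_.items()}

# Print terms from lowest idf (most common) to highest (rarest).
for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(term, round(weight, 3))
```

With short keyword phrases like the ones in the question, a term shared by nearly every example carries little information about the class, so down-weighting it is usually what you want.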

My advice: use tf-idf and tune the sublinear_tf, binary, and norm parameters of TfidfVectorizer for the features.
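As a rough illustration of that advice, the sketch below searches over those three TfidfVectorizer parameters. The toy data, the choice of MultinomialNB as the classifier, and the candidate values are assumptions made for the example, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; every class has at least two examples so cv=2 works.
texts = ['buy widget', 'buy widgets', 'widget cost', 'widgets cost',
         'widget price', 'widgets price', 'widget install', 'widgets install',
         'installing widget', 'installing widgets']
labels = ['Buy', 'Buy', 'Cost', 'Cost', 'Cost', 'Cost',
          'Install', 'Install', 'Install', 'Install']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# The three feature parameters named in the answer.
param_grid = {
    'tfidf__sublinear_tf': [True, False],  # use 1 + log(tf) instead of raw tf
    'tfidf__binary': [True, False],        # clip term counts to 0/1 before weighting
    'tfidf__norm': ['l1', 'l2', None],     # row normalization, or none at all
}

search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=2)
search.fit(texts, labels)
print(search.best_params_)
```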

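Also try the different kinds of classifiers available in scikit-learn. I suspect they will give you better results if you properly tune the regularization type (penalty, either l1 or l2) and the regularization parameter (alpha).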

If you tune them properly, I suspect you can get better results using SGDClassifier with "log" loss (logistic regression) or "hinge" loss (SVM).
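A minimal sketch of that suggestion, assuming made-up toy data and an arbitrary small grid. It uses the "hinge" loss because that name is stable across versions; as far as I know, the logistic option is spelled "log" in older scikit-learn releases and "log_loss" in recent ones, so check the version you have installed before adding it to the grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; every class has at least two examples so cv=2 works.
texts = ['buy widget', 'buy widgets', 'widget cost', 'widgets cost',
         'widget price', 'widgets price', 'widget install', 'widgets install',
         'installing widget', 'installing widgets']
labels = ['Buy', 'Buy', 'Cost', 'Cost', 'Cost', 'Cost',
          'Install', 'Install', 'Install', 'Install']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    # Hinge loss gives a linear SVM trained with stochastic gradient descent.
    ('sgd', SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3,
                          random_state=0)),
])

# Regularization type and strength, as suggested in the answer.
param_grid = {
    'sgd__penalty': ['l1', 'l2'],
    'sgd__alpha': [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(pipeline, param_grid, scoring='accuracy', cv=2)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)

predictions = search.predict(['widget pricing', 'buy a widget'])
print(predictions)
```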

People usually tune the parameters with the GridSearchCV class in scikit-learn.

I agree with David's comment. You will want to train different models and see which one works best.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.pipeline import Pipeline
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed.
from sklearn.model_selection import GridSearchCV
from pprint import pprint
df = pd.DataFrame({'Keyword': ['buy widget', 'buy widgets', 'fiberglass widget',
                               'fiberglass widgets', 'how much are widget',
                               'how much are widgets', 'installing widget',
                               'installing widgets', 'vinyl widget', 'vinyl widgets',
                               'widget cost', 'widget estimate', 'widget install',
                               'widget installation', 'widget price', 'widget pricing',
                               'widgets cost', 'widgets estimate', 'widgets install',
                               'widgets installation', 'widgets price', 'widgets pricing',
                               'wood widget', 'wood widgets'],
                   'Label': ['Buy', 'Buy', 'Fiberglass', 'Fiberglass', 'Cost', 'Cost',
                             'Install', 'Install', 'Vinyl', 'Vinyl', 'Cost', 'Estimate',
                             'Install', 'Install', 'Cost', 'Cost', 'Cost', 'Estimate',
                             'Install', 'Install', 'Cost', 'Cost', 'Wood', 'Wood']},
                  columns=['Label', 'Keyword'])
X = df['Keyword']
y = df['Label']
##pipeline = Pipeline(steps=[
##  ('cvect', CountVectorizer()),
##  ('mnb', MultinomialNB())
##  ])
pipeline = Pipeline(steps=[
  ('tfidf', TfidfVectorizer()),
  ('bnb', BernoulliNB())
  ])
parameters = {'tfidf__ngram_range': [(1,1), (1,2)],
              'tfidf__stop_words': [None, 'english'],
              'tfidf__use_idf': [True, False],
              'bnb__alpha': [0.0, 0.5, 1.0],
              'bnb__binarize': [None, 0.2, 0.5, 0.7, 1.0],
              'bnb__fit_prior': [True, False]}
grid = GridSearchCV(pipeline, parameters, scoring='accuracy', cv=2, verbose=1)
grid.fit(X, y)
print('Best score:', grid.best_score_)
# pprint() prints directly and returns None, so call it on its own line.
print('Best parameters:')
pprint(grid.best_params_, indent=2)
# Here's how to predict (uncomment)
#pred = grid.predict(['buy wood widget', 'how much is a widget'])
#print(pred)
