Text feature input format for classification algorithms in scikit-learn

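I'm starting to use scikit-learn for some NLP work. I've already used some of the classifiers in NLTK, and now I want to try the ones implemented in scikit-learn.

My data are basically sentences, and I extract features from some of the words in those sentences to do a classification task. Most of my features are nominal: the part of speech (POS) of the word, the word to the left, the POS of the word to the left, the word to the right, the POS of the word to the right, the syntactic relation path from one word to another, etc.

When I ran experiments with the NLTK classifiers (Decision Tree, Naive Bayes), the feature set was just a dictionary mapping each feature to its nominal value. For example: [{"postag": "noun", "wleft": "house", "path": "VPNPNP", ...}, ...]. I only had to pass this to the classifiers and they did their job.

This is part of the code used: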

def train_classifier(self):
        if self.reader is None:
            raise ValueError("No reader was provided for accessing training instances.")
        # Get the argument candidates
        argcands = self.get_argcands(self.reader)
        # Extract the necessary features from the argument candidates
        training_argcands = []
        for argcand in argcands:
            if argcand["info"]["label"] == "NULL":
                training_argcands.append( (self.extract_features(argcand), "NULL") )
            else:
                training_argcands.append( (self.extract_features(argcand), "ARG") )
        # Train the appropriate supervised model
        self.classifier = DecisionTreeClassifier.train(training_argcands)
        return

Here is an example of one of the extracted feature sets:

[({'phrase': u'np', 'punct_right': 'NULL', 'phrase_left-sibling': 'NULL', 'subcat': 'fcl=np np vp np pu', 'pred_lemma': u'revelar', 'phrase_right-sibling': u'np', 'partial_path': 'vp fcl', 'first_word-postag': 'Bras\xc3\xadlia PROP', 'last_word-postag': 'Bras\xc3\xadlia PROP', 'phrase_parent': u'fcl', 'pred_context_right': u'um', 'pred_form': u'revela', 'punct_left': 'NULL', 'path': 'vp\xc2\xa1fcl!np', 'position': 0, 'pred_context_left_postag': u'ADV', 'voice': 0, 'pred_context_right_postag': u'ART', 'pred_context_left': u'hoje'}, 'NULL')]

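As I mentioned before, most of the features are nominal (they have string values).

Now I want to try out the classifiers in the scikit-learn package. As I understand it, this type of feature set is not acceptable to the algorithms implemented in sklearn, since all feature values must be numeric and must be in an array or matrix. So I transformed the "original" feature sets using the DictVectorizer class. However, when I pass the transformed vectors to a classifier, I get the following errors: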

# With DecisionTreeClass
Traceback (most recent call last): 
.....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 458, in fit
    X = np.asarray(X, dtype=DTYPE, order='F')
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number

# With GaussianNB
Traceback (most recent call last):
....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 156, in fit
    n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack

I get these errors when I use just DictVectorizer(). However, if I use DictVectorizer(sparse=False), I get an error even before the code reaches the training step:

Traceback (most recent call last):
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 123, in fit_transform
    return self.transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 212, in transform
    Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.

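Given this error, it is clear that a sparse representation has to be used.

So the question is: how can I transform my nominal features so that I can use the classification algorithms provided by scikit-learn?

Thanks in advance for any help you can give me.

UPDATE

As suggested in an answer below, I tried to use the NLTK wrapper for scikit-learn. I only changed the line of code that creates the classifier: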

self.classifier = SklearnClassifier(DecisionTreeClassifier())

Then, when I call the "train" method, I get the following:

File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 100, in train
    X = self._convert(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 109, in _convert
    return self._featuresets_to_coo(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 126, in _featuresets_to_coo
    values.append(self._dtype(v))
ValueError: could not convert string to float: np

So apparently the wrapper cannot build the sparse matrix because the features are nominal. That brings me back to the original problem.

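"ValueError: array is too big." is quite explicit: a dense array of shape (n_samples, n_features) cannot be allocated in memory. Storing that many zeros in one contiguous block of memory is useless (and, in your case, impossible). Use a sparse data structure instead, as described in the DictVectorizer documentation.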
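To make the size argument concrete, here is a minimal sketch that compares the number of cells a dense array would need with the number of values the sparse matrix actually stores. The toy samples and the feature names "word" and "postag" are made up for illustration and are not the asker's data.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy data: 1000 samples, each with a unique "word" value
# and one of two "postag" values.
samples = [
    {"word": "w%d" % i, "postag": "NOUN" if i % 2 else "VERB"}
    for i in range(1000)
]

vec = DictVectorizer()          # sparse=True is the default
X = vec.fit_transform(samples)  # returns a scipy.sparse matrix

n_samples, n_features = X.shape
dense_cells = n_samples * n_features  # cells a dense array would allocate
stored_values = X.nnz                 # values the sparse matrix stores

print("shape:", X.shape)
print("dense cells:", dense_cells)
print("stored values:", stored_values)
```

Every distinct string value becomes its own column, so the column count grows with the vocabulary while each row keeps only a handful of non-zero entries. That is why the dense allocation blows up on real text data and the sparse one does not.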

Also, if you prefer the NLTK API, you can use its scikit-learn integration instead of using the scikit-learn DictVectorizer directly:

http://nltk.org/_modules/nltk/classify/scikitlearn.html

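Have a look at the end of the file.

The problem with the NLTK wrapper for scikit-learn is that it actually expects dicts that map feature names to numeric values, so it won't solve the problem in this case. scikit-learn's DictVectorizer is more sophisticated: when it encounters a string feature value, it performs a "one-of-K" encoding. Here is how to use it: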

>>> data = [({'first_word-postag': 'Bras\xc3\xadlia PROP',
   'last_word-postag': 'Bras\xc3\xadlia PROP',
   'partial_path': 'vp fcl',
   'path': 'vp\xc2\xa1fcl!np',
   'phrase': u'np',
   'phrase_left-sibling': 'NULL',
   'phrase_parent': u'fcl',
   'phrase_right-sibling': u'np',
   'position': 0,
   'pred_context_left': u'hoje',
   'pred_context_left_postag': u'ADV',
   'pred_context_right': u'um',
   'pred_context_right_postag': u'ART',
   'pred_form': u'revela',
   'pred_lemma': u'revelar',
   'punct_left': 'NULL',
   'punct_right': 'NULL',
   'subcat': 'fcl=np np vp np pu',
   'voice': 0},
  'NULL')]

Split this list into two lists, one containing the samples and the other the corresponding labels:

>>> samples, labels = zip(*data)

Fit a DictVectorizer on the samples (you may optionally pass the labels as a separate argument, but they will be ignored):

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer()
>>> X = v.fit_transform(samples)
>>> X
<1x19 sparse matrix of type '<type 'numpy.float64'>'
    with 19 stored elements in COOrdinate format>
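To see what the one-of-K encoding produces, the following small sketch inspects the fitted vocabulary. The two samples are a cut-down, hypothetical version of the feature dicts above. Each string-valued feature becomes one "name=value" column per distinct value, while a numeric feature such as "position" keeps a single column holding its value.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical samples mixing string-valued and numeric features.
samples = [
    {"phrase": "np", "pred_lemma": "revelar", "position": 0},
    {"phrase": "vp", "pred_lemma": "revelar", "position": 1},
]

vec = DictVectorizer()
X = vec.fit_transform(samples)

# vocabulary_ maps each generated column name to its column index.
vocab = vec.vocabulary_
for name in sorted(vocab, key=vocab.get):
    print(vocab[name], name)

# Safe to densify here only because the example is tiny.
dense = X.toarray()
print(dense)
```

The same fitted vectorizer has to be reused with transform() on any later data so that the columns line up with the ones seen at training time.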

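You should then be able to pass X to any scikit-learn classifier that accepts sparse input. As @ogrisel already pointed out, GaussianNB does not. For NLP tasks, you will want to use MultinomialNB or BernoulliNB, since they are designed specifically for discrete data.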
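Putting the pieces together, here is a rough end-to-end sketch under a few assumptions: the four training instances are invented, BernoulliNB is picked simply because the encoded features are binary indicators, and default hyperparameters are used throughout.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical (features, label) pairs in the NLTK-style format.
train = [
    ({"phrase": "np", "pred_context_left_postag": "ADV"}, "NULL"),
    ({"phrase": "np", "pred_context_left_postag": "ART"}, "ARG"),
    ({"phrase": "vp", "pred_context_left_postag": "ADV"}, "NULL"),
    ({"phrase": "pp", "pred_context_left_postag": "ART"}, "ARG"),
]
samples, labels = zip(*train)

# Learn the one-of-K columns from the training samples (sparse output).
vec = DictVectorizer()
X_train = vec.fit_transform(samples)

# BernoulliNB accepts scipy sparse matrices directly.
clf = BernoulliNB()
clf.fit(X_train, list(labels))

# Encode unseen data with transform(), not fit_transform(), so the
# columns match the ones learned during training.
X_new = vec.transform([{"phrase": "np", "pred_context_left_postag": "ART"}])
prediction = clf.predict(X_new)
print(prediction)
```

Feature values that were never seen during fitting are silently dropped by transform(), so an unseen word simply contributes no active column.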
