How to feed different inputs into an sklearn Pipeline



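I am using a Pipeline from sklearn to classify text.

In this example, the Pipeline steps are a FeatureUnion, which wraps a TfidfVectorizer together with some custom features, followed by a classifier. I then fit the training data and predict: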

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)
features.append(('ngram', countVecWord))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.

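The code above works fine, but there is a twist. I want to run part-of-speech (POS) tagging on the text and use a different vectorizer on the tagged text.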

X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X) 
# X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)
# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)
# new POS Vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features= 2000)
features.append(('ngram', countVecWord))
features.append(('pos_ngram', countVecPOS))
all_features = FeatureUnion(features)
# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)
pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])
# how do I fit both X and X_tagged here
# how can the different vectorizers get either X or X_tagged?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
# etc.

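How do I fit this kind of data properly? How can the two vectorizers tell the raw text from the POS text? What are my options?

I also have custom features, some of which take the raw text and others the POS text.

EDIT: added MeasureFeatures()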

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np
class MeasureFeatures(BaseEstimator):
    def __init__(self):
        pass
    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])
    def fit(self, documents, y=None):
        return self
    def transform(self, x_dataset):

        X_type_token = list()
        X_count_nouns = list()
        for sentence in x_dataset:
            # takes raw text and calculates type token ratio
            X_type_token.append(type_token_ratio(sentence))
            # takes pos tag text and counts number of noun pos tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))
        X = np.array([X_type_token, X_count_nouns]).T
        print(X)
        print(X.shape)
        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)
        return self.scalar.transform(X)

So this feature transformer needs to receive either the tagged text (for the count_nouns() function) or the raw text (for type_token_ratio()).

I think you need a FeatureUnion over two transformers (a TfidfTransformer and a POSTransformer). Of course, you will have to define POSTransformer yourself.
Maybe this article will help you.
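POSTransformer is not an sklearn class, so it has to be written by hand. Below is a minimal sketch of what it might look like, assuming the tagger is a plain function that maps one sentence to a string of tags. The names `toy_tag` and `TOY_LEXICON` are hypothetical stand-ins for a real tagger such as the question's do_tagging(), which is not shown in the post.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy lexicon standing in for a real POS tagger.
TOY_LEXICON = {"i": "PP", "am": "AUX", "a": "DET", "an": "DET",
               "sentence": "NN", "example": "NN", "another": "DET"}


def toy_tag(sentence):
    """Map each whitespace token to a tag; unknown words become 'UNK'."""
    return " ".join(TOY_LEXICON.get(tok.lower(), "UNK") for tok in sentence.split())


class POSTransformer(BaseEstimator, TransformerMixin):
    """Turn raw sentences into space-separated POS-tag strings."""

    def __init__(self, tagger=toy_tag):
        self.tagger = tagger

    def fit(self, X, y=None):
        return self  # stateless: there is nothing to learn from the data

    def transform(self, X):
        return [self.tagger(sentence) for sentence in X]


X = ["I am a sentence", "an example"]
tagged = POSTransformer().transform(X)
# tagged == ['PP AUX DET NN', 'DET NN']

# Chained with a vectorizer, the tagging happens inside the pipeline,
# so the caller only ever passes raw text.
pos_tfidf = Pipeline([
    ("pos", POSTransformer()),
    ("tfidf", TfidfVectorizer(ngram_range=(1, 4), max_features=2000)),
])
X_pos = pos_tfidf.fit_transform(X)
```

Because the transformer tags the text itself, the same raw X can be handed to every branch of a FeatureUnion, and only the POS branch sees tagged text.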

Your pipeline might then look something like this.

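from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC
# POSTransformer and MeasureFeatures are user-defined transformers
pipeline = Pipeline([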
  ('features', FeatureUnion([
    ('ngram_tf_idf', Pipeline([
      ('counts_ngram', CountVectorizer()),
      ('tf_idf_ngram', TfidfTransformer())
    ])),
    ('pos_tf_idf', Pipeline([
      ('pos', POSTransformer()),          
      ('counts_pos', CountVectorizer()),
      ('tf_idf_pos', TfidfTransformer())
    ])),
    ('measure_features', MeasureFeatures())
  ])),
  ('classifier', LinearSVC())
])

This assumes that MeasureFeatures and POSTransformer are transformers that conform to the sklearn API.
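Another option, if the tags are already computed outside the pipeline (as with X_tagged in the question), is to keep both views as columns of one table and let each vectorizer pick its own column. The sketch below swaps FeatureUnion for sklearn's ColumnTransformer and assumes pandas is available. The column names "text" and "pos" are made up for illustration, and the tags are written by hand rather than produced by a real tagger.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Raw text and its pre-computed tags side by side (tags typed in by hand here).
train = pd.DataFrame({
    "text": ["I am a sentence", "an example"],
    "pos": ["PP AUX DET NN", "DET NN"],
})
y = [1, 2]
dev = pd.DataFrame({"text": ["another sentence"], "pos": ["DET NN"]})

features = ColumnTransformer([
    # A scalar column name hands the vectorizer a 1-D column of strings.
    ("ngram", TfidfVectorizer(ngram_range=(1, 3), max_features=4000), "text"),
    ("pos_ngram", TfidfVectorizer(ngram_range=(1, 4), max_features=2000), "pos"),
])

pipeline = Pipeline([
    ("all", features),
    ("clf", LinearSVC(tol=1e-4, C=0.1)),
])
pipeline.fit(train, y)
y_pred = pipeline.predict(dev)
```

A custom transformer such as MeasureFeatures could be added as a third entry with `["text", "pos"]` as its column selection. It would then receive a two-column DataFrame and could send the raw text to type_token_ratio() and the tags to count_nouns().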
