Sklearn: FeatureUnion with dictionary and text data

I have a DataFrame like this:

     text_data                worker_dicts                  outcomes
0    "Some string"           {"Sector":"Finance",             0
                              "State":"NJ"}
1    "Another string"        {"Sector":"Programming",         1
                              "State":"NY"}

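It has both text information and a column that holds a dictionary. (The real worker_dicts has many more fields.) I'm interested in the binary outcomes column.

What I initially tried was to combine text_data and worker_dicts by crudely concatenating the two columns and then running Multinomial NB on the result: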

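    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # cast both columns to str so the dictionaries can be concatenated with the text
    df['stacked_features'] = df['text_data'].astype(str) + '_' + df['worker_dicts'].astype(str)
    stacked_features = np.array(df['stacked_features'])
    outcomes = np.array(df['outcomes'])
    text_clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
                         ('clf', MultinomialNB())])
    text_clf = text_clf.fit(stacked_features, outcomes)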

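But my accuracy is very poor, and I think that fitting two separate models would make better use of the data than fitting one model on both types of features (as I am doing with the stacking).

How would I go about using FeatureUnion? worker_dicts is a bit odd because it is a dictionary, so I am quite confused about how to parse it.

If your dictionary entries are categorical, as they appear to be in your example, then I would create separate columns from the dictionary entries before doing any other processing.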

new_features = pd.DataFrame(df['worker_dicts'].values.tolist())

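new_features will then be its own DataFrame with columns Sector and State, and you can one-hot encode those as needed, in addition to TF-IDF or other feature extraction on the text_data column. To use that in a pipeline you would need to create a new transformer class, so I would suggest applying the dictionary parsing and the TF-IDF separately, then stacking the results, and then adding the one-hot encoding to your pipeline, since that lets you specify which columns the transformer is applied to. (As the categories you want to encode are strings, you may want to use the LabelBinarizer class rather than the OneHotEncoder class for the encoding transformation.)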
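The "parse, vectorize separately, then stack" idea can be sketched roughly as follows. This is only an illustration on a two-row toy frame shaped like the one in the question; `pd.get_dummies` stands in for the one-hot encoding step, and the variable names `categorical`, `text_matrix` and `stacked` are made up for this example.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data mirroring the question's layout (values are illustrative).
df = pd.DataFrame({
    "text_data": ["Some string", "Another string"],
    "worker_dicts": [
        {"Sector": "Finance", "State": "NJ"},
        {"Sector": "Programming", "State": "NY"},
    ],
    "outcomes": [0, 1],
})

# Step 1: expand each dictionary into its own set of columns.
new_features = pd.DataFrame(df["worker_dicts"].values.tolist(), index=df.index)

# Step 2: one-hot encode the string categories (get_dummies used for brevity).
categorical = pd.get_dummies(new_features, columns=["Sector", "State"], dtype=float)

# Step 3: compute TF-IDF features for the text column on its own.
text_matrix = TfidfVectorizer().fit_transform(df["text_data"])

# Step 4: stack the sparse text features next to the encoded categories.
stacked = hstack([text_matrix, csr_matrix(categorical.to_numpy())]).tocsr()
print(stacked.shape)
```

The resulting `stacked` matrix can then be passed to any estimator that accepts sparse input.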

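If you would like to use just TF-IDF on each of the columns individually within a pipeline, you would need a nested Pipeline and FeatureUnion setup to extract the columns.

And if you have your one-hot encoded features in DataFrames X1 and X2 and your text features in X3, you could do something like the following to create a pipeline. (There are many other options; this is just one way.)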
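A minimal sketch of such a nested setup, assuming a second text column named `notes` exists (it is hypothetical and not part of the question's data). The helpers `column_selector` and `tfidf_for` are likewise made up for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def column_selector(name):
    """Return a transformer that picks a single column out of a DataFrame."""
    return FunctionTransformer(lambda X: X[name], validate=False)

def tfidf_for(name):
    """Inner pipeline: select one column, then fit a TF-IDF vocabulary on it."""
    return Pipeline([
        ("select", column_selector(name)),
        ("tfidf", TfidfVectorizer()),
    ])

# Each text column gets its own vectorizer; FeatureUnion concatenates the outputs.
text_union = FeatureUnion([
    ("text_data", tfidf_for("text_data")),
    ("notes", tfidf_for("notes")),  # 'notes' is a hypothetical extra column
])

X = pd.DataFrame({
    "text_data": ["some string", "another string"],
    "notes": ["first note", "second note"],
})
features = text_union.fit_transform(X)
print(features.shape)
```

Keeping one vectorizer per column means each column has its own vocabulary and IDF weights instead of sharing them.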

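import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# X1, X2: one-hot encoded features; X3: DataFrame holding the 'text_data' column
X = pd.concat([X1, X2, X3], axis=1)

def select_text_data(X):
    return X['text_data']

def select_remaining_data(X):
    return X.drop('text_data', axis=1)

# pipeline that selects the text column and computes its TF-IDF features
text_pipeline = Pipeline([
    ('column_selection', FunctionTransformer(select_text_data, validate=False)),
    ('tfidf', TfidfVectorizer())
])

# combine the text features with the remaining (already encoded) columns, then classify
final_pipeline = Pipeline([
    ('feature-union', FeatureUnion([
        ('text-features', text_pipeline),
        ('other-features', FunctionTransformer(select_remaining_data, validate=False))
    ])),
    ('clf', LogisticRegression())
])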

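(LogisticRegression is used as the final estimator here. MultinomialNB can also serve as the last step of a Pipeline, since the final estimator only needs a fit method; it does require non-negative features, which TF-IDF and one-hot encoded features satisfy.)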
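As an alternative to expanding the dictionaries by hand, scikit-learn's DictVectorizer can one-hot encode string-valued dictionaries directly, so the dictionary column can feed a FeatureUnion branch of its own. This is a swapped-in technique rather than the approach described in this answer, and the toy rows, `select_text`, `select_dicts` and the step names below are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data shaped like the question's frame (values are illustrative).
df = pd.DataFrame({
    "text_data": ["Some string", "Another string", "More text", "Other text"],
    "worker_dicts": [
        {"Sector": "Finance", "State": "NJ"},
        {"Sector": "Programming", "State": "NY"},
        {"Sector": "Finance", "State": "NY"},
        {"Sector": "Programming", "State": "NJ"},
    ],
    "outcomes": [0, 1, 0, 1],
})

def select_text(X):
    return X["text_data"]

def select_dicts(X):
    # DictVectorizer expects an iterable of dicts, so hand it a plain list.
    return X["worker_dicts"].tolist()

features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(select_text, validate=False)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("dicts", Pipeline([
        ("select", FunctionTransformer(select_dicts, validate=False)),
        ("vectorize", DictVectorizer()),  # string values become one-hot columns
    ])),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])

X = df[["text_data", "worker_dicts"]]
model.fit(X, df["outcomes"])
predictions = model.predict(X)
```

Because both branches live inside one Pipeline, the vocabulary and the category mapping are learned only from the data passed to fit, which keeps cross-validation straightforward.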
