How to get the top matching feature names from TfidfTransformer in scikit-learn



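Below is a code snippet that scores test documents with TF-IDF in scikit-learn.

How can I get the top 5 vocabulary elements, together with their scores, for each row of x_test_tfidf?

I know that count_vect.get_feature_names gives me the word corresponding to each column, but I don't know how to (1) get the 5 largest columns for each row (something like this?), and (2) map the feature names onto those columns (perhaps by setting the index?).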

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
df = pd.DataFrame({'text': [
    'this is sentence one, about one thing',
    'this is sentence two, about another thing',
    'this is sentence three, about a third thing',
    'this is sentence four, about a fourth thing']})
train, test = train_test_split(df, test_size=0.5, random_state=42)
# Transform words (unigrams and bigrams) via tfidf
# See https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
# See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vect = CountVectorizer(ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()
x_train_counts = count_vect.fit_transform(train['text'])
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
# Get the test matrix using the trained tf-idf numbers
x_test_counts = count_vect.transform(test['text'])
x_test_tfidf = tfidf_transformer.transform(x_test_counts)
# Produce tfidf scores for query_text
query_text = 'what about another thing'
query_text_df = pd.DataFrame({'text': [query_text]})
query_text_counts = count_vect.transform(query_text_df['text'])
query_text_tfidf = tfidf_transformer.transform(query_text_counts)
# Produce scores that match test set with query_text
scores = x_test_tfidf * query_text_tfidf.T
print(scores)

The desired result is:

[[('about', 0.6), ('another', 0.6), ('thing', 0.4)],
[('about', 0.6), ('thing', 0.4)]]

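because those are the words in the two test rows that match query_text.

EDIT: Here is a partial answer, but it lacks the "top 5" part and the output looks messy.

To get a tidy top-5 result, perhaps it should be in "long" form, i.e. one row per cell.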

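# Element-wise product keeps only the features that also occur in query_text
result = pd.DataFrame(
    data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
    # get_feature_names() was removed in scikit-learn 1.2;
    # on versions before 1.0 use get_feature_names() instead
    columns=count_vect.get_feature_names_out())
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None):
    print(result)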

Output:

about  about one  about third   is  is sentence  one  one about  
0  0.267261        0.0          0.0  0.0          0.0  0.0        0.0   
1  0.316228        0.0          0.0  0.0          0.0  0.0        0.0   
one thing  sentence  sentence one  sentence three     thing  third  
0        0.0       0.0           0.0             0.0  0.267261    0.0   
1        0.0       0.0           0.0             0.0  0.316228    0.0   
third thing  this  this is  three  three about  
0          0.0   0.0      0.0    0.0          0.0  
1          0.0   0.0      0.0    0.0          0.0  
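For the "long" form mentioned above, here is a minimal sketch, assuming the scores are already in a wide DataFrame with one row per test document and one column per feature (such as the `result` frame from the partial answer). The helper name `to_long_top_n` and the column names `doc`, `feature`, and `score` are made up for illustration, and the demo frame uses invented numbers rather than real TF-IDF output.

```python
import pandas as pd


def to_long_top_n(scores, n=5):
    """Turn a wide score frame into long form, keeping the top n per row.

    scores: DataFrame with one row per document and one column per feature.
    Returns a DataFrame with columns doc, feature, score.
    """
    # One output row per (document, feature) cell
    long = (scores.rename_axis("doc")
                  .reset_index()
                  .melt(id_vars="doc", var_name="feature", value_name="score"))
    # Drop features that did not match at all
    long = long[long["score"] > 0]
    # Highest score first within each document
    long = long.sort_values(["doc", "score"], ascending=[True, False])
    # head(n) keeps the first n rows of each group in the sorted order
    return long.groupby("doc").head(n).reset_index(drop=True)


# Hypothetical stand-in for the wide frame; values are illustrative only
demo_scores = pd.DataFrame(
    [[0.0, 0.5, 0.2],
     [0.3, 0.0, 0.0]],
    columns=["about", "another", "thing"])
demo_long = to_long_top_n(demo_scores, n=5)
```

With the frames from this question, the call would be `to_long_top_n(result, n=5)`. Ties are not ordered in any particular way here, so features with equal scores may come out in either order.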

EDIT 2: I found the rest of the answer here and have written it up as an answer.

This worked for me.

# Produce top words between search text and each test set text
# See also https://stackoverflow.com/a/40434047/34935
# get_feature_names() was removed in scikit-learn 1.2;
# on versions before 1.0 use get_feature_names() instead
tmp = pd.DataFrame(data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
                   columns=count_vect.get_feature_names_out())
# For each row, pair every feature name with its score, highest score first
tmp = tmp.apply(lambda row: sorted(zip(tmp.columns, row),
                                   key=lambda cv: -cv[1]), axis=1)
nlargest = 5
vals = []
for key, val in zip(tmp.index, tmp.values.tolist()):
    # Keep only the nlargest best-scoring (feature, score) tuples
    val_tuples = val[:nlargest]
    vals.append('%d|%s' % (key, ', '.join(
        [str(tup) for tup in val_tuples])))
test['top_keywords'] = vals
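The approach above calls toarray(), which builds a dense matrix with one column per vocabulary entry. As a possible alternative for larger vocabularies, here is a sketch that stays sparse and returns a nested list of (feature, score) tuples, similar in shape to the desired result in the question. The helper name `top_n_per_row` is hypothetical, it skips zero scores (which the code above keeps), and the demo matrix uses invented values.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_n_per_row(matrix, feature_names, n=5):
    """For each row, return up to n (feature, score) pairs, highest score first.

    Only entries with a score above zero are returned.
    """
    matrix = csr_matrix(matrix)  # make sure per-row access is cheap
    names = np.asarray(feature_names)
    result = []
    for i in range(matrix.shape[0]):
        row = matrix.getrow(i)
        # row.indices holds the column ids of the stored entries,
        # row.data the matching scores
        order = np.argsort(-row.data, kind="stable")[:n]
        result.append([(str(names[row.indices[j]]), float(row.data[j]))
                       for j in order if row.data[j] > 0])
    return result


# Hypothetical stand-in for x_test_tfidf.multiply(query_text_tfidf)
demo_matrix = csr_matrix(np.array([[0.0, 0.5, 0.2, 0.0],
                                   [0.3, 0.0, 0.0, 0.0]]))
demo_top = top_n_per_row(demo_matrix, ["a", "b", "c", "d"], n=2)
```

With the objects from this question, the call would presumably be `top_n_per_row(x_test_tfidf.multiply(query_text_tfidf), count_vect.get_feature_names_out(), n=5)`, assuming scikit-learn 1.0 or later for `get_feature_names_out()`.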
