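Below is a code snippet that scores test documents with TF-IDF in scikit-learn.
How can I get the top 5 vocabulary terms, together with their scores, for each row of x_test_tfidf?
I know that count_vect.get_feature_names_out() (get_feature_names() in older scikit-learn releases)
returns the word corresponding to each column, but I don't know how to (1) get the 5 largest columns in each row (something like this?), and (2) map the feature names onto those columns (perhaps by setting an index?).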
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
df = pd.DataFrame({'text': [
    'this is sentence one, about one thing',
    'this is sentence two, about another thing',
    'this is sentence three, about a third thing',
    'this is sentence four, about a fourth thing']})
train, test = train_test_split(df, test_size=0.5, random_state=42)
# Transform words (unigrams and bigrams) via tfidf
# See https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
# See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
count_vect = CountVectorizer(ngram_range=(1, 2))
tfidf_transformer = TfidfTransformer()
x_train_counts = count_vect.fit_transform(train['text'])
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
# Get the test matrix using the trained tf-idf numbers
x_test_counts = count_vect.transform(test['text'])
x_test_tfidf = tfidf_transformer.transform(x_test_counts)
# Produce tfidf scores for query_text
query_text = 'what about another thing'
query_text_df = pd.DataFrame({'text': [query_text]})
query_text_counts = count_vect.transform(query_text_df['text'])
query_text_tfidf = tfidf_transformer.transform(query_text_counts)
# Produce scores that match test set with query_text
scores = x_test_tfidf * query_text_tfidf.T
print(scores)
The desired result is:
[[('about', 0.6), ('another', 0.6), ('thing', 0.4)],
[('about', 0.6), ('thing', 0.4)]]
because both test rows contain words that match query_text.
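The two sub-questions (find the largest columns per row, then map column positions back to feature names) can be sketched with NumPy alone. This is a minimal sketch, not a definitive solution: `top_terms_per_row` is a hypothetical helper name, and the small `scores` array and `feature_names` list are made-up stand-ins for `x_test_tfidf.multiply(query_text_tfidf).toarray()` and the vectorizer's feature names.

```python
import numpy as np


def top_terms_per_row(scores, feature_names, n=5):
    """Return, for each row, up to n (term, score) pairs with the largest non-zero scores."""
    scores = np.asarray(scores, dtype=float)
    names = np.asarray(feature_names)
    # argsort sorts ascending, so negate the scores to rank the largest first
    order = np.argsort(-scores, axis=1, kind="stable")[:, :n]
    result = []
    for row, cols in zip(scores, order):
        # Map column positions back to terms and drop terms that did not match at all
        result.append([(str(names[c]), float(row[c])) for c in cols if row[c] > 0])
    return result


# Hypothetical stand-in data (not the real TF-IDF values from the question)
feature_names = ["about", "is", "one", "thing"]
scores = np.array([[0.6, 0.0, 0.0, 0.4],
                   [0.1, 0.0, 0.3, 0.2]])

top = top_terms_per_row(scores, feature_names, n=2)
print(top)
```

Filtering out zero scores is a design choice made here so that rows with fewer than n matching terms produce shorter lists, as in the desired result above.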
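EDIT: Here is a partial answer, but it lacks the "top 5" part and the output looks messy.
Perhaps, to get a clean top-5 result, the output should be in "long" form, i.e. one row per cell.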
# Element-wise product of each test row with the query row, one column per feature
result = pd.DataFrame(
    data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
    columns=count_vect.get_feature_names_out())
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None):
    print(result)
Output:
      about  about one  about third   is  is sentence  one  one about
0  0.267261        0.0          0.0  0.0          0.0  0.0        0.0
1  0.316228        0.0          0.0  0.0          0.0  0.0        0.0

   one thing  sentence  sentence one  sentence three     thing  third
0        0.0       0.0           0.0             0.0  0.267261    0.0
1        0.0       0.0           0.0             0.0  0.316228    0.0

   third thing  this  this is  three  three about
0          0.0   0.0      0.0    0.0          0.0
1          0.0   0.0      0.0    0.0          0.0
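The "long" form mentioned above (one row per cell) might look something like the following. This is only a sketch: the small `result` frame is a hypothetical stand-in for the wide frame printed above, reusing the two non-zero columns from that output, and the column names `row`, `term` and `score` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the wide `result` frame (three of its columns)
result = pd.DataFrame({
    "about": [0.267261, 0.316228],
    "is": [0.0, 0.0],
    "thing": [0.267261, 0.316228],
})

nlargest = 5

# Reshape wide -> long: one (row, term, score) record per cell
long_form = (
    result.rename_axis("row")
    .reset_index()
    .melt(id_vars="row", var_name="term", value_name="score")
)

# Keep only matching terms, then take the highest-scoring terms within each row
long_form = long_form[long_form["score"] > 0]
top = (
    long_form.sort_values(["row", "score"], ascending=[True, False])
    .groupby("row")
    .head(nlargest)
    .reset_index(drop=True)
)
print(top)
```

In this layout each test row contributes at most `nlargest` records, which tends to be easier to read than the wrapped wide output.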
Edit 2: I found the rest of the answer here and have written it up as an answer.
This worked for me.
# Produce top words between search text and each test set text
# See also https://stackoverflow.com/a/40434047/34935
tmp = pd.DataFrame(data=x_test_tfidf.multiply(query_text_tfidf).toarray(),
                   columns=count_vect.get_feature_names_out())
# For each row, pair every feature name with its score and sort by descending score
tmp = tmp.apply(lambda row: sorted(zip(tmp.columns, row),
                                   key=lambda cv: -cv[1]), axis=1)
nlargest = 5
vals = []
for key, val in zip(tmp.index, tmp.values.tolist()):
    # Keep only the nlargest highest-scoring (term, score) tuples for this row
    val_tuples = val[:nlargest]
    # Format as "<row number>|(term, score), (term, score), ..."
    vals.append('%d|%s' % (key, ', '.join(
        [str(tup) for tup in val_tuples])))
test['top_keywords'] = vals
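The code above converts the product to a dense array with toarray(), which may become expensive for a large vocabulary. As a possible alternative, here is a sketch that reads the top terms straight from the sparse rows. `top_terms_sparse` is a hypothetical helper, and the tiny `product` matrix and `feature_names` list are made-up stand-ins; with the objects from the question one would presumably pass `x_test_tfidf.multiply(query_text_tfidf)` and `count_vect.get_feature_names_out()` instead.

```python
import numpy as np
from scipy.sparse import csr_matrix


def top_terms_sparse(matrix, feature_names, n=5):
    """For each row of a sparse matrix, return up to n (term, score) pairs, largest first."""
    matrix = csr_matrix(matrix)  # CSR format gives cheap access to each row's stored entries
    names = np.asarray(feature_names)
    result = []
    for i in range(matrix.shape[0]):
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        cols = matrix.indices[start:end]  # column positions of the stored entries in row i
        vals = matrix.data[start:end]     # the corresponding scores
        keep = vals > 0                   # ignore explicitly stored zeros
        cols, vals = cols[keep], vals[keep]
        order = np.argsort(-vals, kind="stable")[:n]
        result.append([(str(names[c]), float(v))
                       for c, v in zip(cols[order], vals[order])])
    return result


# Hypothetical stand-in data (not the real TF-IDF values from the question)
feature_names = ["about", "is", "one", "thing"]
product = csr_matrix(np.array([[0.6, 0.0, 0.0, 0.4],
                               [0.0, 0.0, 0.0, 0.0]]))

top = top_terms_sparse(product, feature_names, n=5)
print(top)
```

Only the stored non-zero entries of each row are sorted, so a row with no matching terms simply yields an empty list.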