Completing the final step of a search engine in Python (pandas)

I have a dictionary that essentially stores every word in a large DataFrame (many rows and 12 columns). The dictionary looks like this:

vocabulary = {'hello':[3,1998,876,3888], 'beautiful':[677, 4, 56],......}

where each value is the list of DataFrame rows in which the word appears.

What I want to do is take a string (the query) as input,

query = 'a beautiful house with big windows'

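and return only certain columns (call them A, B, C, D) of the DataFrame, restricted to the rows that contain all the words in the input sentence. I have already preprocessed the data for both the vocabulary and the input query (stemming, stop words, punctuation removal, ...). Can anyone help? Thanks.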

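If I understand correctly, you want to look up each word of the query sentence, find which rows it appears in (from the vocabulary dict), and return the rows common to all words in the query. If so, here is a solution (I have simplified your example):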

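vocabulary = {'hello':[3,1998,876,3888], 'beautiful':[677, 4, 56, 3, 876]}
query = 'hello beautiful'
# Unique words in the query
words = set(query.split())
# Row list for each word (raises KeyError if a word is not in vocabulary)
rows = [vocabulary[w] for w in words]
# Intersect the row lists one at a time
common_rows = rows[0]
for r in rows[1:]:
    common_rows = list(set(common_rows) & set(r))
print(common_rows)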

[3, 876]
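
The loop above raises a KeyError when a query word is missing from vocabulary. Here is a minimal sketch of one way to handle that, assuming a missing word should mean that no row matches. The function name rows_matching_all is hypothetical and not part of the original answer:

```python
def rows_matching_all(query, vocabulary):
    """Return the sorted row labels that contain every word in `query`."""
    words = set(query.split())
    if not words:
        return []  # assumption: an empty query matches nothing
    row_sets = []
    for w in words:
        if w not in vocabulary:
            return []  # assumption: an unknown word means no row can match
        row_sets.append(set(vocabulary[w]))
    # Intersect all row sets in one call and sort for a stable order
    return sorted(set.intersection(*row_sets))


vocabulary = {'hello': [3, 1998, 876, 3888], 'beautiful': [677, 4, 56, 3, 876]}
print(rows_matching_all('hello beautiful', vocabulary))  # [3, 876]
print(rows_matching_all('hello house', vocabulary))      # []
```

Sorting the result keeps the selected rows in a predictable order, which a plain list(set(...)) does not guarantee.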

To select those rows from the DataFrame, you only need to do the following:

df.loc[common_rows, ["A", "B", "C", "D"]]
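
Note that df.loc selects by index label, so this works when the lists in vocabulary hold index labels. If they hold 0-based row positions instead, df.iloc is probably the better fit. A small sketch with a toy DataFrame (the data and the extra column E are invented for illustration only):

```python
import pandas as pd

# Toy data, invented for illustration; the index holds the row labels
df = pd.DataFrame(
    {
        "A": ["a0", "a1", "a2", "a3"],
        "B": ["b0", "b1", "b2", "b3"],
        "C": ["c0", "c1", "c2", "c3"],
        "D": ["d0", "d1", "d2", "d3"],
        "E": ["e0", "e1", "e2", "e3"],
    },
    index=[3, 56, 876, 1998],
)

common_rows = [3, 876]

# .loc looks rows up by index label and raises KeyError for unknown labels
result = df.loc[common_rows, ["A", "B", "C", "D"]]
print(result)

# If the vocabulary stored 0-based positions instead of labels,
# .iloc would select the same two rows by position
positions = [0, 2]
result_by_position = df.iloc[positions][["A", "B", "C", "D"]]
```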
