How to search a column in a pandas DataFrame and build lists of the rows where a word appears



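I have a DataFrame with a column of sentences. I want to write a function that searches every sentence (each row of the sentences column) for the words in this list: search_words = ['cat', 'dog', 'pet']

It should then produce new lists of sentences for each word: for example, a list of the sentences that contain cat and a list of those that do not, and likewise for every other word in search_words.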

Use str.extractall to find the rows that match any word in search_words:

# build a regex that matches any of the search words as a whole word
pat = fr"\b({'|'.join(search_words)})\b"
# one row per (sentence, matched word) pair
out = df.join(df['sentences'].str.extractall(pat)
                             .droplevel(1).squeeze()
                             .rename('words'))
At this point, the output looks like this:
>>> out
             sentences words
0               my cat   cat
1               my dog   dog
2               my pet   pet
3  your cat and my dog   cat
3  your cat and my dog   dog
4  your dog and my pet   dog
4  your dog and my pet   pet
5  your pet and my cat   pet
5  your pet and my cat   cat
>>> print(pat)
\b(cat|dog|pet)\b
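The pattern above joins the search words verbatim, so a word containing a regex metacharacter would be read as regex syntax rather than literal text. A minimal sketch of escaping each word with re.escape before joining; the sample data and the search word 'pet.com' are hypothetical and not part of the question:

```python
import re
import pandas as pd

# Hypothetical sample data, not taken from the question.
df = pd.DataFrame({"sentences": ["visit pet.com today", "the petscom typo", "my cat"]})
search_words = ["cat", "pet.com"]

# Unescaped: '.' matches any character, so 'petscom' is matched as well.
raw_pat = fr"\b({'|'.join(search_words)})\b"
raw_matches = df["sentences"].str.extractall(raw_pat)[0].tolist()

# Escaped: every search word is matched literally.
safe_pat = fr"\b({'|'.join(re.escape(w) for w in search_words)})\b"
safe_matches = df["sentences"].str.extractall(safe_pat)[0].tolist()

print(raw_matches)
print(safe_matches)
```

For plain alphabetic words such as cat, dog and pet, escaping changes nothing, so it is safe to apply unconditionally.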

Now use pd.crosstab on the two columns:

out = pd.crosstab(out['sentences'], out['words']).astype(bool)

Output:

>>> out
words                  cat    dog    pet
sentences
my cat                True  False  False
my dog               False   True  False
my pet               False  False   True
your cat and my dog   True   True  False
your dog and my pet  False   True   True
your pet and my cat   True  False   True
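One thing to watch with the crosstab step: a sentence that matches none of the search words has no entry in the words column, so it does not appear in the table at all, and a word that never matches gets no column. A possible sketch of restoring both with reindex, assuming hypothetical sample data that includes a non-matching sentence ('my parrot'):

```python
import pandas as pd

# Hypothetical sample data; 'my parrot' matches none of the search words.
df = pd.DataFrame({"sentences": ["my cat", "my dog", "your cat and my dog", "my parrot"]})
search_words = ["cat", "dog", "pet"]

pat = fr"\b({'|'.join(search_words)})\b"
# One entry per match, indexed by the original row label.
words = df["sentences"].str.extractall(pat)[0].droplevel(1).rename("words")
out = df.join(words)

# Cross-tabulate only the rows that actually matched something.
matched = out.dropna(subset=["words"])
table = pd.crosstab(matched["sentences"], matched["words"]).astype(bool)

# Add back the sentences and words that never matched, filled with False.
table = table.reindex(index=df["sentences"].unique(), columns=search_words, fill_value=False)
print(table)
```

With this, the "no match" lists also include sentences that contain none of the search words.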

Now you can build any list you need:

# sentences that contain 'cat'
>>> out.loc[out['cat']].index.tolist()
['my cat', 'your cat and my dog', 'your pet and my cat']
# sentences that do not contain 'dog'
>>> out.loc[~out['dog']].index.tolist()
['my cat', 'my pet', 'your pet and my cat']
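To get both lists for every search word in one go, the same lookups can be wrapped in a dictionary comprehension. A minimal sketch, assuming the boolean table has the shape shown in the crosstab output (it is rebuilt by hand here so the example runs on its own); the keys "with" and "without" are illustrative names:

```python
import pandas as pd

# Boolean table copied by hand from the crosstab output shown earlier.
out = pd.DataFrame(
    {
        "cat": [True, False, False, True, False, True],
        "dog": [False, True, False, True, True, False],
        "pet": [False, False, True, False, True, True],
    },
    index=pd.Index(
        ["my cat", "my dog", "my pet",
         "your cat and my dog", "your dog and my pet", "your pet and my cat"],
        name="sentences",
    ),
)

# For each word: the sentences that contain it and the ones that do not.
lists = {
    word: {
        "with": out.loc[out[word]].index.tolist(),
        "without": out.loc[~out[word]].index.tolist(),
    }
    for word in out.columns
}

print(lists["cat"]["with"])
print(lists["dog"]["without"])
```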
Alternatively, add one boolean column per search word with str.contains:

import pandas as pd

df = pd.DataFrame({'sentences': ['I have a cat', 'I have a dog', 'I have a pet', 'I have a parrot']})
search_words = ['cat', 'dog', 'pet']

def search_sentences(df, search_words):
    # add one boolean column per word: True where the sentence contains it
    for word in search_words:
        df[word] = df['sentences'].str.contains(word)
    return df

search_sentences(df, search_words)
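Note that str.contains with a bare word matches substrings and is case-sensitive by default, so 'pet' is found inside 'carpet' but not in 'Pet'. A possible variant, assuming whole-word, case-insensitive matching is what is wanted; the function name flag_words and the sample sentences are hypothetical. It also returns a new DataFrame instead of modifying the one passed in:

```python
import re
import pandas as pd

def flag_words(df, search_words, column="sentences"):
    """Return a copy of df with one boolean column per search word."""
    flags = {
        # \b anchors the word; re.escape keeps the word literal.
        word: df[column].str.contains(rf"\b{re.escape(word)}\b", case=False, na=False)
        for word in search_words
    }
    return df.assign(**flags)

# Hypothetical sample data.
demo = pd.DataFrame({"sentences": ["I have a cat", "I bought a carpet", "My Pet is asleep"]})

plain = demo["sentences"].str.contains("pet").tolist()   # substring, case-sensitive
flagged = flag_words(demo, ["cat", "pet"])

print(plain)
print(flagged)
```

The lists for a word then follow from the flag column, for example flagged.loc[flagged['pet'], 'sentences'].tolist().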
