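I have a dataframe with a column of sentences. I want to write a function that searches every sentence (each row of the sentences column) for the words in this list: search_words = ['cat', 'dog', 'pet']
It should then produce new lists of sentences for each word, e.g. a list of sentences that contain cat, a list of sentences that do not contain cat, and so on for the other words in search_words.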
Use str.extractall to find the rows that match search_words:
# create a regex of words to search
pat = fr"\b({'|'.join(search_words)})\b"
out = df.join(df['sentences'].str.extractall(pat)
                             .droplevel(1).squeeze()
                             .rename('words'))
At this point, your output looks like this:
>>> out
             sentences  words
0               my cat    cat
1               my dog    dog
2               my pet    pet
3  your cat and my dog    cat
3  your cat and my dog    dog
4  your dog and my pet    dog
4  your dog and my pet    pet
5  your pet and my cat    pet
5  your pet and my cat    cat
>>> pat
'\b(cat|dog|pet)\b'
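To reproduce the intermediate frame end to end, here is a minimal self-contained sketch. The sample df is reconstructed from the output shown above. Escaping each word with re.escape and selecting column 0 instead of calling squeeze() are my own variations, not part of the answer as posted.

```python
import re
import pandas as pd

# Sample data reconstructed from the printed output (an assumption)
df = pd.DataFrame({'sentences': ['my cat', 'my dog', 'my pet',
                                 'your cat and my dog',
                                 'your dog and my pet',
                                 'your pet and my cat']})
search_words = ['cat', 'dog', 'pet']

# Whole-word alternation; re.escape guards against regex metacharacters
pat = r"\b(" + "|".join(re.escape(w) for w in search_words) + r")\b"

# extractall returns one row per match, indexed by (original row, match number).
# Column 0 is the single capture group; dropping level 1 leaves the row index.
matches = df['sentences'].str.extractall(pat)[0].droplevel(1).rename('words')

# Joining on the index repeats a sentence once per matched word
out = df.join(matches)
print(out)
```

Taking column 0 explicitly always yields a Series, whereas squeeze() collapses to a scalar when there is exactly one match in the whole frame.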
Now use pd.crosstab between the two columns:
out = pd.crosstab(out['sentences'], out['words']).astype(bool)
Output:
>>> out
words                  cat    dog    pet
sentences
my cat                True  False  False
my dog               False   True  False
my pet               False  False   True
your cat and my dog   True   True  False
your dog and my pet  False   True   True
your pet and my cat   True  False   True
Now you can create any list:
# match 'cat'
>>> out.loc[out['cat']].index.tolist()
['my cat', 'your cat and my dog', 'your pet and my cat']

# no match 'dog'
>>> out.loc[~out['dog']].index.tolist()
['my cat', 'my pet', 'your pet and my cat']
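Since the question asks for a function, here is one possible sketch that returns both lists for every word in one go. The name split_by_word and the 'with'/'without' keys are hypothetical, and it uses str.contains with a word-boundary regex rather than the crosstab route, so treat it as an alternative rather than the method above.

```python
import re
import pandas as pd

def split_by_word(df, search_words, col='sentences'):
    """Return {word: {'with': [...], 'without': [...]}} for each search word."""
    result = {}
    for word in search_words:
        # \b...\b matches the word only as a whole word; na=False treats missing text as no match
        mask = df[col].str.contains(rf"\b{re.escape(word)}\b", na=False)
        result[word] = {
            'with': df.loc[mask, col].tolist(),
            'without': df.loc[~mask, col].tolist(),
        }
    return result

# Hypothetical demo data, including a sentence that matches none of the words
demo = pd.DataFrame({'sentences': ['my cat', 'my dog', 'my parrot',
                                   'your cat and my dog']})
lists = split_by_word(demo, ['cat', 'dog'])
print(lists['cat']['with'])     # ['my cat', 'your cat and my dog']
print(lists['cat']['without'])  # ['my dog', 'my parrot']
```

One difference worth knowing: a sentence that matches none of the words ('my parrot' here) still shows up in every 'without' list. With the crosstab approach, as far as I can tell, such a sentence drops out of the table entirely because its words value is NaN.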
import pandas as pd

df = pd.DataFrame({'sentences': ['I have a cat', 'I have a dog', 'I have a pet', 'I have a parrot']})
search_words = ['cat', 'dog', 'pet']

def search_sentences(df, search_words):
    # add one boolean column per search word (modifies df in place)
    for word in search_words:
        df[word] = df['sentences'].str.contains(word)
    return df

search_sentences(df, search_words)
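One caveat with str.contains(word): it matches substrings, so 'cat' would also hit a sentence containing 'caterpillar'. The sketch below shows the difference and how a boolean mask turns into the two lists the question asks for. The 'caterpillar' sentence is my own example, not from the question.

```python
import pandas as pd

# Hypothetical data with a substring trap for 'cat'
df = pd.DataFrame({'sentences': ['I have a cat',
                                 'I have a caterpillar',
                                 'I have a dog']})

# Plain substring search: 'caterpillar' counts as a hit
plain = df['sentences'].str.contains('cat')

# Whole-word search: \b anchors the match at word boundaries
whole = df['sentences'].str.contains(r'\bcat\b')

# Boolean masks select the matching and non-matching sentences
with_cat = df.loc[whole, 'sentences'].tolist()
without_cat = df.loc[~whole, 'sentences'].tolist()

print(plain.tolist())  # [True, True, False]
print(whole.tolist())  # [True, False, False]
print(with_cat)        # ['I have a cat']
```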