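I know how to count the words in a pandas column that match a predefined word list and assign the count to another column (similar to the post here). But I would like to know whether there is a method or function that assigns the counts column-wise, with one column per matched word.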
index | text
1 | "I have a pen and ipod, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
long_list = ['pen', 'pineapple', 'ipod']
index | text | pen | pineapple | ipod |
1 | "I have a pen and ipod, but I lost it today." | 1 | 0 | 1 |
2 | "I have pineapple and pen, but I lost it today." | 1 | 1 | 0 |
Here is a concise solution using extract and named capture groups:
regex = '|'.join(map(lambda i: f'(?P<{i}>{i})', long_list))
df.join(df['text'].str.extract(regex).notnull().astype(int))
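Output (note that extract only captures the first match in each row, so at most one column is set per row):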
index text pen pineapple ipod
1 I have a pen and ipod, but I lost it today. 1 0 0
2 I have pineapple and pen, but I lost it today. 0 1 0
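For anyone who wants to reproduce this, here is a minimal self-contained sketch of the extract approach. The DataFrame is rebuilt by hand from the question's sample data, and the names flags and result are only illustrative. It also makes the first-match limitation visible.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout: an "index" and a "text" column)
df = pd.DataFrame({
    "index": [1, 2],
    "text": [
        "I have a pen and ipod, but I lost it today.",
        "I have pineapple and pen, but I lost it today.",
    ],
})
long_list = ["pen", "pineapple", "ipod"]

# One named capture group per word: (?P<pen>pen)|(?P<pineapple>pineapple)|...
regex = "|".join(f"(?P<{w}>{w})" for w in long_list)

# extract keeps only the FIRST match per row, so only one group is filled per row
flags = df["text"].str.extract(regex).notnull().astype(int)
result = df.join(flags)
print(result)
```

Because only the first match is kept, row 1 flags pen but not ipod. The extractall variant further down in this answer is the one that sees every match.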
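If the words contain characters that are invalid in a group name, you can also use unnamed capture groups (they will be numbered 0/1/2/3) and then rename the columns: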
long_list = ['pen', 'pineapple', 'ipod', 'cheese cake']
regex = '|'.join(map(lambda x: f'({x})', long_list))
df.join(df['text'].str.extract(regex)
.notnull().astype(int)
.rename(columns=dict(enumerate(long_list)))
)
Output:
index text pen pineapple ipod cheese cake
1 I have a pen ... 1 0 0 0
2 I have pineap... 0 1 0 0
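A possible hardening of the unnamed-group version, as a sketch only: if a word contains regex metacharacters, it can be passed through re.escape before it is put in the pattern. The word "c++" and the two sample sentences below are made up for illustration and are not from the question.

```python
import re
import pandas as pd

# Hypothetical data: "c++" contains regex metacharacters, "cheese cake" a space
df = pd.DataFrame({"text": ["I have a pen and cheese cake.", "I like c++ and ipod."]})
long_list = ["pen", "ipod", "cheese cake", "c++"]

# Unnamed groups are numbered 0, 1, 2, ... in the extract output
regex = "|".join(f"({re.escape(w)})" for w in long_list)

flags = (
    df["text"].str.extract(regex)
    .notnull().astype(int)
    .rename(columns=dict(enumerate(long_list)))  # 0 -> "pen", 1 -> "ipod", ...
)
print(df.join(flags))
```

As before, extract only reports the first match in each row.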
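How it works
extract creates one column per capture group, using the group name as the column name, with the matched string in the cell or NaN otherwise. notnull + astype(int) then converts that output to integers.
Note on the regex
NB. The regex has the form '(?P<pen>pen)|(?P<pineapple>pineapple)|(?P<ipod>ipod)'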
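To make the two steps concrete, the following sketch prints the generated pattern, the raw frame returned by extract, and the frame after notnull + astype(int). The sample DataFrame is rebuilt from the question, and raw and flags are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "I have a pen and ipod, but I lost it today.",
    "I have pineapple and pen, but I lost it today.",
]})
long_list = ["pen", "pineapple", "ipod"]
regex = "|".join(f"(?P<{w}>{w})" for w in long_list)

raw = df["text"].str.extract(regex)   # matched string or NaN, one column per group
flags = raw.notnull().astype(int)     # True/False converted to 1/0

print(regex)
print(raw)
print(flags)
```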
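To ensure that only whole words match (i.e. "pen" should not match inside "pencil"), let's add word boundaries (\b):
regex = '|'.join(map(lambda i: fr'(?P<{i}>\b{i}\b)', long_list))
which gives: '(?P<pen>\bpen\b)|(?P<pineapple>\bpineapple\b)|(?P<ipod>\bipod\b)'
If the words contain spaces (or other characters that are invalid in a Python variable name), those characters should be replaced or removed in the group name:
regex = '|'.join(map(lambda i: fr'(?P<{i.replace(" ", "_")}>\b{i}\b)', long_list))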
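A small check of the word-boundary pattern, as a sketch on made-up sentences (the "pencil" and "cheese cake" rows are assumptions for illustration): "pencil" should no longer count as "pen", and the space in "cheese cake" is replaced only in the group name, not in the pattern itself.

```python
import pandas as pd

# Hypothetical sentences chosen to exercise the word boundaries
df = pd.DataFrame({"text": ["I have a pencil and a cheese cake.", "I have a pen."]})
long_list = ["pen", "cheese cake"]

# Group names may not contain spaces, so only the name is sanitised
regex = "|".join(fr"(?P<{w.replace(' ', '_')}>\b{w}\b)" for w in long_list)

flags = df["text"].str.extract(regex).notnull().astype(int)
print(regex)
print(df.join(flags))
```

The resulting columns carry the sanitised names (cheese_cake here). They can be renamed back afterwards if the original spelling is needed.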
Variant: counting occurrences
df.join(df['text']
.str.extractall(regex)
.notnull().astype(int)
.groupby(level=0).sum()
)
Output (I modified the input so that "pen" appears twice in the first row):
index text pen pineapple ipod
1 I have a pen and another pen an ipod, but I lo... 2 0 1
2 I have pineapple and pen, but I lost it today. 1 1 0
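One thing to be aware of with extractall: rows without any match do not appear in its result, so after the join they would hold NaN rather than 0. A minimal sketch of one way to handle that, assuming a third made-up row with no matches, is to reindex the counts on the original index before joining.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "I have a pen and another pen an ipod, but I lost it today.",
    "I have pineapple and pen, but I lost it today.",
    "Nothing to see here.",  # hypothetical row with no match at all
]})
long_list = ["pen", "pineapple", "ipod"]
regex = "|".join(fr"(?P<{w}>\b{w}\b)" for w in long_list)

counts = (
    df["text"].str.extractall(regex)   # one row per match, MultiIndex (row, match)
    .notnull().astype(int)
    .groupby(level=0).sum()            # add up the matches per original row
    .reindex(df.index, fill_value=0)   # rows without matches get 0 instead of NaN
)
result = df.join(counts)
print(result)
```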
Try using pd.get_dummies with str.findall:
>>> df.join(pd.get_dummies(df['text'].str.findall(f'({"|".join(long_list)})').explode()).groupby(level=0).sum())
index text ipod pen pineapple
0 1 I have a pen and ipod, but I lost it today. 1 1 0
1 2 I have pineapple and pen, but I lost it today. 0 1 1
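The one-liner above can be unpacked step by step, which may make it easier to follow. This is a sketch on the question's sample data; the intermediate names (matches, exploded, dummies, counts) are illustrative, and dtype=int is passed so the indicator columns are integers regardless of the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "index": [1, 2],
    "text": [
        "I have a pen and ipod, but I lost it today.",
        "I have pineapple and pen, but I lost it today.",
    ],
})
long_list = ["pen", "pineapple", "ipod"]

matches = df["text"].str.findall("|".join(long_list))  # list of matched words per row
exploded = matches.explode()                           # one row per match, index repeated
dummies = pd.get_dummies(exploded, dtype=int)          # one indicator column per word
counts = dummies.groupby(level=0).sum()                # collapse back to one row per text
result = df.join(counts)
print(result)
```

Note that get_dummies orders the new columns alphabetically, not in the order of long_list.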
No for loop is needed.
You can try using str.contains:
for i in long_list:
    # 1 if the word occurs in the text (substring match), 0 otherwise
    df[i] = df['text'].str.contains(i).astype(int)
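If actual counts are wanted rather than a 0/1 flag, the same loop can be written with str.count in place of str.contains. This is a swapped-in alternative and not part of the answer above. The sketch uses the question's sample data and adds word boundaries and re.escape as assumptions.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "index": [1, 2],
    "text": [
        "I have a pen and ipod, but I lost it today.",
        "I have pineapple and pen, but I lost it today.",
    ],
})
long_list = ["pen", "pineapple", "ipod"]

for w in long_list:
    # str.count returns the number of regex matches in each row
    df[w] = df["text"].str.count(fr"\b{re.escape(w)}\b")
print(df)
```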