Extract all phrases from a pandas DataFrame based on multiple words in a list

I have a list, L:

L=['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

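I have a pandas DataFrame, df:

Text
the objects are both before and after the person
the object is behind the person
the object in right is next to top left hand side of person

Try: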

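df["Extracted_Value"] = (
    df.Text.apply(
        # Keep listed words, blank out the rest, and join with "|"
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    # Two or more pipes mean unlisted words sat in between: separate with "_"
    .replace(r"\|{2,}", "_", regex=True)
    # A single pipe means the listed words were adjacent: restore the space
    .str.replace("|", " ", regex=False)
)
print(df)

Prints: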

                                                          Text           Extracted_Value
0             the objects are both before and after the person              before_after
1                              the object is behind the person                    behind
2  the object in right is next to top left hand side of person  right_top left hand side

Edit: adapting @Wiktor's answer for pandas:

pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"
df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern).groupby(level=0).agg("_".join)
)
print(df)

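You need to use

pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

The regex looks like

\b(?:top|left|behind|before|right|after|hand|side)(?:\s+(?:top|left|behind|before|right|after|hand|side))*\b

See the regex demo.

Details:

\b - a word boundary
(?:{'|'.join(L)}) - one of the words in L
(?:\s+(?:{'|'.join(L)}))* - zero or more repetitions of one or more whitespace characters followed by one of the words in L
\b - a word boundary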
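
The breakdown above can be checked with the standard `re` module alone. This is only a minimal sketch: wrapping each word in `re.escape` is an added precaution that is not part of the original answer, in case the list ever contains regex metacharacters, and the `alternation` variable name is purely illustrative.

```python
import re

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

# Escape each word defensively (an assumption, not in the original answer).
# These words contain no metacharacters, so the pattern is unchanged.
alternation = "|".join(re.escape(w) for w in L)
pattern = rf"\b(?:{alternation})(?:\s+(?:{alternation}))*\b"

text = "the object in right is next to top left hand side of person"

# No capturing groups are used, so findall returns the full matches.
matches = re.findall(pattern, text)
print(matches)  # ['right', 'top left hand side']
```

Each element is one run of consecutive listed words, which is why joining the matches with "_" yields right_top left hand side.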

Python demo:

import pandas as pd
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text':["the objects are both before and after the person","the object is behind the person", "the object in right is next to top left hand side of person"]})
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"

Output:

>>> df['Text'].str.findall(pattern).str.join("_").replace({"": None})
0                before_after
1                      behind
2    right_top left hand side
Name: Text, dtype: object

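This works for me. It simply checks each word of each row's text against the list. Note that every matched word is joined with an underscore, so adjacent words are not kept together as one phrase.

import numpy as np
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame(
    ['the objects are both before and after the person',
     'the object is behind the person',
     'the object in right is next to top left hand side of person'], columns=['Text'])
# Keep only the words found in L, join them with "_", and turn empty results into NaN
df['Extracted_Value'] = (
    df['Text'].str.split()
    .apply(lambda x: '_'.join([m for m in x if m in L]))
    .replace('', np.nan)
)

My output is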

Text    Extracted_Value
0   the objects are both before and after the person    before_after
1   the object is behind the person                     behind
2   the object in right is next to top left hand s...   right_top_left_hand_side
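
If runs of adjacent listed words should stay together as phrases (as in right_top left hand side) and a regex is not wanted, one possible sketch groups consecutive tokens with `itertools.groupby`. The helper name `extract_phrases` and the `words` set are hypothetical, introduced only for illustration.

```python
from itertools import groupby

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
words = set(L)  # set membership avoids scanning the list for every token

def extract_phrases(text):
    """Join each run of consecutive listed words with spaces, then join runs with '_'."""
    runs = [
        " ".join(group)
        for is_listed, group in groupby(text.split(), key=words.__contains__)
        if is_listed
    ]
    # Return None when nothing matched, so pandas shows a missing value
    return "_".join(runs) if runs else None

texts = [
    "the objects are both before and after the person",
    "the object is behind the person",
    "the object in right is next to top left hand side of person",
]
results = [extract_phrases(t) for t in texts]
print(results)  # ['before_after', 'behind', 'right_top left hand side']
```

With the DataFrame above, this could be applied as df['Extracted_Value'] = df['Text'].apply(extract_phrases).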
