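I have a list, L:
L=['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
I have a pandas DataFrame, df:
Text |
---|
the objects are both before and after the person |
the object is behind the person |
the object in right is next to top left hand side of person |
Try: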
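df["Extracted_Value"] = (
    df.Text.apply(
        # keep words that are in L, replace every other word with an empty string
        lambda x: "|".join(w if w in L else "" for w in x.split()).strip("|")
    )
    # two or more pipes mean at least one non-matching word sat in between
    .replace(r"\|{2,}", "_", regex=True)
    # a single pipe separates two adjacent matching words
    .str.replace("|", " ", regex=False)
)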
print(df)
Prints:
Text Extracted_Value
0 the objects are both before and after the person before_after
1 the object is behind the person behind
2 the object in right is next to top left hand side of person right_top left hand side
EDIT: Adapting @Wiktor's answer to pandas:
pattern = fr"\b((?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*)\b"
df["Extracted_Value"] = (
    df["Text"].str.extractall(pattern).groupby(level=0).agg("_".join)
)
print(df)
You need to use
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"
The regex looks like
\b(?:top|left|behind|before|right|after|hand|side)(?:\s+(?:top|left|behind|before|right|after|hand|side))*\b
See the regex demo.
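It will match
\b - a word boundary
(?:{'|'.join(L)}) - one of the words from L
(?:\s+(?:{'|'.join(L)}))* - zero or more repetitions of one or more whitespace characters followed by one of the words from the L list
\b - a word boundary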
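To see how those pieces behave outside pandas, here is a minimal sketch that runs the same pattern with the standard re module on the third sentence from the question. The names alternation and matches are only illustrative, and the re.escape call is an added precaution that the words in L do not strictly require.

```python
import re

# Word list from the question.
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']

# Build the alternation once; re.escape is a precaution (an assumption,
# not part of the original pattern) in case a word contains regex metacharacters.
alternation = "|".join(map(re.escape, L))

# A word from L, optionally followed by more words from L separated by whitespace.
pattern = re.compile(rf"\b(?:{alternation})(?:\s+(?:{alternation}))*\b")

text = "the object in right is next to top left hand side of person"

# Each match is one run of adjacent words from L.
matches = pattern.findall(text)
print(matches)            # ['right', 'top left hand side']
print("_".join(matches))  # right_top left hand side
```

Because the pattern has no capturing group, findall returns the whole match for each run, which is what makes the final join with "_" possible.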
Python demo:
import pandas as pd
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame({'Text':["the objects are both before and after the person","the object is behind the person", "the object in right is next to top left hand side of person"]})
pattern = fr"\b(?:{'|'.join(L)})(?:\s+(?:{'|'.join(L)}))*\b"
Output:
>>> df['Text'].str.findall(pattern).str.join("_").replace({"": None})
0 before_after
1 behind
2 right_top left hand side
Name: Text, dtype: object
This worked for me. It simply compares every word of the phrase in each row against the items in the list.
import numpy as np
import pandas as pd

L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
df = pd.DataFrame(
    ['the objects are both before and after the person',
     'the object is behind the person',
     'the object in right is next to top left hand side of person'], columns=['Text'])
# keep only the words found in L, join them with "_", and turn empty results into NaN
df['Extracted_Value'] = df['Text'].str.split().apply(lambda x: '_'.join([m for m in x if m in L])).replace('', np.nan)
My output is
Text Extracted_Value
0 the objects are both before and after the person before_after
1 the object is behind the person behind
2 the object in right is next to top left hand s... right_top_left_hand_side
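Note that this output joins every matched word with an underscore (right_top_left_hand_side), while the other answers keep adjacent words together (right_top left hand side). If the second form is wanted without a regex, one possible sketch is to group consecutive words with itertools.groupby. The helper name extract and the variables keywords, texts and results are hypothetical, and the sketch assumes words are separated by whitespace with no attached punctuation.

```python
from itertools import groupby

# Word list from the question, turned into a set for fast membership tests.
L = ['top', 'left', 'behind', 'before', 'right', 'after', 'hand', 'side']
keywords = set(L)


def extract(text):
    """Join adjacent keywords with spaces and separate the runs with '_'.

    Returns None when the text contains no keyword at all.
    """
    runs = [
        " ".join(group)
        # groupby splits the word sequence into alternating runs of
        # keyword / non-keyword words; only the keyword runs are kept.
        for is_keyword, group in groupby(text.split(), key=keywords.__contains__)
        if is_keyword
    ]
    return "_".join(runs) or None


texts = [
    "the objects are both before and after the person",
    "the object is behind the person",
    "the object in right is next to top left hand side of person",
]
results = [extract(t) for t in texts]
print(results)
# ['before_after', 'behind', 'right_top left hand side']
```

With pandas this could be applied as df['Text'].apply(extract), at the cost of being a plain Python loop rather than a vectorised string method.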