Including word boundaries in string modification to be more specific



Background

The following is a minor variation on the modification that skips empty lists and continues with the function

import pandas as pd
Names =    [list(['ann']),
list([]),
list(['elisabeth', 'lis']),
list(['his','he']),
list([])]
df = pd.DataFrame({'Text' : ['ann had an anniversery today', 
'nothing here', 
'I like elisabeth and lis 5 lists ',
'one day he and his cheated',
'same here'
], 
'P_ID': [1,2,3, 4,5], 
'P_Name' : Names
})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text                P_ID  P_Name
0   ann had an anniversery today        1   [ann]
1   nothing here                        2   []
2   I like elisabeth and lis 5 lists    3   [elisabeth, lis]
3   one day he and his cheated          4   [his, he]
4   same here                           5   []

The code below works

m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True) 

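and does the following:

1) uses the names in P_Name to block the corresponding text in the Text column by replacing it with **BLOCK**

2) produces a new column New

as shown below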

Text  P_ID P_Name  New
0                     **BLOCK** had an **BLOCK**iversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 **BLOCK**ts
3                     one day **BLOCK** and **BLOCK** c**BLOCK**ated
4                     NaN

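Problem

However, this code works a little "too well".

Using ['his','he'] from P_Name to block Text:

Example: one day he and his cheated becomes one day **BLOCK** and **BLOCK** c**BLOCK**ated

Desired: one day he and his cheated becomes one day **BLOCK** and **BLOCK** cheated

In this example, I would like cheated to stay as cheated rather than become c**BLOCK**ated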
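The difference between the two results comes down to substring matching versus whole-word matching. A minimal sketch using only the standard `re` module (the variable names `plain` and `bounded` are illustrative, not from the original code):

```python
import re

text = "one day he and his cheated"
names = ["his", "he"]

# Without word boundaries, 'he' also matches inside 'cheated'.
plain = re.sub("|".join(names), "**BLOCK**", text)

# With \b on both sides of each name, only whole words match.
bounded = re.sub("|".join(rf"\b{n}\b" for n in names), "**BLOCK**", text)

print(plain)    # one day **BLOCK** and **BLOCK** c**BLOCK**ated
print(bounded)  # one day **BLOCK** and **BLOCK** cheated
```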

Desired Output

Text P_ID P_Name  New
0                     **BLOCK** had an anniversery today
1                     NaN
2                     I like **BLOCK** and **BLOCK** 5 lists
3                     one day **BLOCK** and **BLOCK** cheated
4                     NaN

Question

How do I achieve my desired output?

You need to add word boundaries to each string in the lists of df.loc[m].P_Name, as follows:

s = df.loc[m].P_Name.map(lambda x: [r'\b'+item+r'\b' for item in x])
Out[71]:
0                   [\bann\b]
2    [\belisabeth\b, \blis\b]
3           [\bhis\b, \bhe\b]
Name: P_Name, dtype: object
df.loc[m, 'Text'].replace(s, '**BLOCK**',regex=True)
Out[72]:
0       **BLOCK** had an anniversery today
2    I like **BLOCK** and **BLOCK** 5 lists
3    one day **BLOCK** and **BLOCK** cheated
Name: Text, dtype: object
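If passing a Series of pattern lists to `replace` does not behave the same way in your pandas version, the same word-boundary idea can be applied row by row with the standard `re` module. This is only a sketch under that assumption: `block_names` is a hypothetical helper, not part of pandas, and `re.escape` is added on the assumption that names should be matched literally.

```python
import re
import pandas as pd

def block_names(text, names, token="**BLOCK**"):
    """Hypothetical helper: replace whole-word occurrences of names in text."""
    if not names:
        return float("nan")  # mirror the NaN rows for empty name lists
    # One alternation wrapped in \b...\b, with each name escaped literally.
    pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    return re.sub(pattern, token, text)

df = pd.DataFrame({
    "Text": ["ann had an anniversery today",
             "nothing here",
             "I like elisabeth and lis 5 lists ",
             "one day he and his cheated",
             "same here"],
    "P_ID": [1, 2, 3, 4, 5],
    "P_Name": [["ann"], [], ["elisabeth", "lis"], ["his", "he"], []],
})

# Each row is matched only against its own list of names.
df["New"] = [block_names(t, n) for t, n in zip(df["Text"], df["P_Name"])]
```

Because each row gets its own pattern, a name listed for one row cannot block text in another row.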

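Sometimes a for loop is good practice

# split each text into words, replace the words that exactly match a name, then rejoin
df['New'] = [pd.Series(x).replace(dict.fromkeys(y, '**BLOCK**')).str.cat(sep=' ') for x, y in zip(df.Text.str.split(), df.P_Name)]
# keep the result only where P_Name is non-empty; other rows become NaN
df['New'] = df['New'].where(df.P_Name.astype(bool))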
df
Text  ...                                  New
0       ann had an anniversery today  ...     **BLOCK** had an anniversery today
1                       nothing here  ...                                  NaN
2  I like elisabeth and lis 5 lists   ...   I like **BLOCK** and **BLOCK** 5 lists
3         one day he and his cheated  ...  one day **BLOCK** and **BLOCK** cheated
4                          same here  ...                                  NaN
[5 rows x 4 columns]
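One difference between the two approaches is worth keeping in mind: the loop compares whole whitespace-separated tokens, so a name followed by punctuation is not treated as a match, whereas `\b` treats punctuation as a word boundary. The sketch below mimics the token comparison in plain Python rather than pandas, and the sample sentence is made up for illustration:

```python
import re

names = ["his", "he"]
text = "he, and his. cheated"  # illustrative input, not from the question

# Token-based matching: a token is replaced only if it equals a name exactly,
# so 'he,' and 'his.' are left untouched.
by_token = " ".join("**BLOCK**" if w in names else w for w in text.split())

# Regex with word boundaries: the comma and period count as boundaries.
pattern = r"\b(?:" + "|".join(map(re.escape, names)) + r")\b"
by_regex = re.sub(pattern, "**BLOCK**", text)

print(by_token)  # he, and his. cheated
print(by_regex)  # **BLOCK**, and **BLOCK**. cheated
```

Splitting and rejoining on a single space also collapses repeated or trailing whitespace in the original text.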
