如何在 Python 中的模式匹配时从文本中获取单词大小写



我有一个包含两列 Stg 和 Txt 的数据框。任务是检查每个 Txt 行的 Stg 列中的所有单词,并将匹配的单词输出到新列中,同时保持单词大小写与 Txt 中的单词大小写相同。

Example Code:
from pandas import DataFrame
new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
}
df = DataFrame(new,columns= ['Stg','Txt'])
my_list = df["Stg"].tolist()
import re
def words_in_string(word_list, a_string):
word_set = set(word_list)
pattern = r'b({0})b'.format('|'.join(word_list))
for found_word in re.finditer(pattern, a_string):
word = found_word.group(0)
if word in word_set:
word_set.discard(word)
yield word
if not word_set:
raise StopIteration 
df['new'] = ''
for i,values in enumerate(df['Txt']):
a=[]
b = []
for word in words_in_string(my_list, values):
a=word
b.append(a)
df['new'][i] = b
exit

上面的代码从 Stg 列返回大小写。有没有办法从 Txt 获得案例。我还想检查整个字符串,而不是像文本"双向"那样的子字符串,当前代码返回单词 way。

Current Output:
Stg            Txt                                   new
0   way           An early term                           []
1   Early         two-way allowed                         [way, allowed]
2   phone         New Phone feature that allowed          [allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

Expected Output:
Stg            Txt                                   new
0   way           An early term                           [early]
1   Early         two-way allowed                         [allowed]
2   phone         New Phone feature that allowed          [Phone, allowed]
3   allowed       amazing universe                        []
4   type          new day                                 []
5   brand name    the brand name is stage                 [brand name]

你应该使用带有负面回溯的Series.str.findall

import pandas as pd
import re
new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
}
df = pd.DataFrame(new,columns= ['Stg','Txt'])
pattern = "|".join(f"w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print (df)
#
Stg                             Txt               new
0         way                   An early term           [early]
1       Early                 two-way allowed         [allowed]
2       phone  New Phone feature that allowed  [Phone, allowed]
3     allowed                amazing universe                []
4        type                         new day                []
5  brand name         the brand name is stage      [brand name]

相关内容

  • 没有找到相关文章

最新更新