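I have a dataframe with two columns, Stg and Txt. The task is to check every term in the Stg column against each Txt row and write the matching terms to a new column, keeping the case of each term exactly as it appears in Txt.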
Example Code:
import re
from pandas import DataFrame

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
       'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
       }
df = DataFrame(new, columns=['Stg','Txt'])
my_list = df["Stg"].tolist()

def words_in_string(word_list, a_string):
    word_set = set(word_list)
    # \b marks a word boundary; the match is case-sensitive
    pattern = r'\b({0})\b'.format('|'.join(word_list))
    for found_word in re.finditer(pattern, a_string):
        word = found_word.group(0)
        if word in word_set:
            word_set.discard(word)
            yield word
        if not word_set:
            # raising StopIteration inside a generator is a RuntimeError since Python 3.7
            return

# collect the matches for each Txt row, then assign the column in one step
results = []
for values in df['Txt']:
    results.append(list(words_in_string(my_list, values)))
df['new'] = results
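The code above only finds terms whose case matches the Stg column. Is there a way to match regardless of case and return the case used in Txt? I also want to match whole words only, not substrings: for the text "two-way", the current code returns the word way.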
Current Output:
   Stg         Txt                             new
0  way         An early term                   []
1  Early       two-way allowed                 [way, allowed]
2  phone       New Phone feature that allowed  [allowed]
3  allowed     amazing universe                []
4  type        new day                         []
5  brand name  the brand name is stage         [brand name]
Expected Output:
   Stg         Txt                             new
0  way         An early term                   [early]
1  Early       two-way allowed                 [allowed]
2  phone       New Phone feature that allowed  [Phone, allowed]
3  allowed     amazing universe                []
4  type        new day                         []
5  brand name  the brand name is stage         [brand name]
You should use Series.str.findall with a negative lookbehind:
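import pandas as pd
import re

new = {'Stg': ['way','Early','phone','allowed','type','brand name'],
       'Txt': ['An early term','two-way allowed','New Phone feature that allowed','amazing universe','new day','the brand name is stage']
       }
df = pd.DataFrame(new, columns=['Stg','Txt'])

# The negative lookbehind rejects a term that directly follows a letter or one of
# - ; : , / | (so "way" in "two-way" is skipped); \b ends the match at a word boundary.
# A raw f-string keeps \w and \b as regex escapes.
pattern = "|".join(rf"\w*(?<![A-Za-z-;:,/|]){i}\b" for i in new["Stg"])
# IGNORECASE matches in any case; findall returns the text as written in Txt
df["new"] = df["Txt"].str.findall(pattern, flags=re.IGNORECASE)
print(df)
# Output: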
   Stg         Txt                             new
0  way         An early term                   [early]
1  Early       two-way allowed                 [allowed]
2  phone       New Phone feature that allowed  [Phone, allowed]
3  allowed     amazing universe                []
4  type        new day                         []
5  brand name  the brand name is stage         [brand name]
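
If the Stg terms might contain regex metacharacters, or if you want the same rule on both sides of the term, a variant along these lines may be worth trying. It is only a sketch under a few assumptions: build_pattern is a hypothetical helper name, the terms are escaped with re.escape, and "whole word" is taken to mean "not touching a word character or a hyphen", which is a guess at the intent behind the two-way example rather than something the question spells out.

```python
import re

def build_pattern(terms):
    """Compile a case-insensitive pattern matching any term as a whole token.

    Hypothetical helper for illustration; not part of pandas or re.
    """
    # Longest terms first, so a phrase wins over a shorter term it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" in a term literal.
    alternation = "|".join(re.escape(t) for t in ordered)
    # The lookarounds reject matches that touch a word character or a hyphen
    # (an assumed definition of "whole word"), so "way" is not found inside
    # "two-way" or "always".
    return re.compile(rf"(?<![\w-])(?:{alternation})(?![\w-])", re.IGNORECASE)

stg = ['way', 'Early', 'phone', 'allowed', 'type', 'brand name']
txt = ['An early term', 'two-way allowed', 'New Phone feature that allowed',
       'amazing universe', 'new day', 'the brand name is stage']

matcher = build_pattern(stg)
# findall returns the matched text as it is written in each Txt string
found = [matcher.findall(t) for t in txt]
print(found)
# [['early'], ['allowed'], ['Phone', 'allowed'], [], [], ['brand name']]
```

With the dataframe from the question, the compiled pattern could be applied as df["new"] = df["Txt"].apply(matcher.findall). Since the alternation is wrapped in a non-capturing group, findall returns the full matches rather than group contents.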