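I am trying to match strings across two DataFrame columns. category_df['column_text_to_find'] contains the strings to look for in a column of another DataFrame, df2['column_text_to_search']. The new column df2['matched text'] should contain whichever value from category_df['column_text_to_find'] was found in df2['column_text_to_search']. My expected result is: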
['column_text_to_search'] ['column_text_to_find'] ['matched text']
'SP * GRAPHICSDIRECT.ascdadv' 'GRAPHICSDIRECT' 'GRAPHICSDIRECT'
'99 CENTS ONLY #777#' '99 CENTS ONLY' '99 CENTS ONLY'
'PAYPAL *BESTBUY COM #3422#' 'BESTBUY' 'BESTBUY'
Unfortunately, my code returns an error.
CSV import:
import re
import pandas as pd
for f in all_files:
    df = pd.read_csv(f, sep=',', header=[3])
df2 = df
Strip whitespace:
df2['column_text_to_search']=df2['column_text_to_search'].str.strip()
Search and match text:
ch = category_df['column_text_to_find']
pat = r'\b({0})\b'.format('|'.join(ch))
df2['matched text'] = df2['column_text_to_search'].str.findall(pat, flags=re.IGNORECASE).map("_".join)
df2.head()
Error:
TypeError: sequence item 0: expected str instance, tuple found
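One possible cause, which the question does not confirm: re.findall returns plain strings when the pattern has a single capture group, but tuples as soon as it has more than one. That happens if any value in column_text_to_find itself contains parentheses. The sketch below uses a made-up term, PAY(PAL), and a hypothetical helper join_matches purely to reproduce the message.

```python
import re

def join_matches(pattern, text):
    """Join every match of `pattern` in `text` with underscores."""
    return "_".join(re.findall(pattern, text, flags=re.IGNORECASE))

text = "PAYPAL *BESTBUY COM"

# One capture group: findall returns a list of strings, so join works.
one_group = r"\b(BESTBUY|PAYPAL)\b"
joined = join_matches(one_group, text)

# A term that itself contains parentheses adds a second group:
# findall now returns a list of tuples, and join raises TypeError.
two_groups = r"\b(BESTBUY|PAY(PAL))\b"
try:
    join_matches(two_groups, text)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(joined)
print(error_message)
```

Escaping each term with re.escape before joining keeps the pattern at exactly one group, which avoids this failure mode.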
You can use
pattern = r'(?i)\b({0})\b'.format("|".join(df["column_text_to_find"].to_list()))
df["column_text_to_search"].str.findall(pattern).str.join('_')
Or, if the "words" to find can contain special characters anywhere in the string:
pattern = r'(?i)(?!\B\w)({0})(?<!\w\B)'.format("|".join( sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) ))
df["column_text_to_search"].str.findall(pattern).str.join('_')
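Note the use of
(?i) - enables case-insensitive search
\b...\b - word boundaries give whole-word matching for natural-language words (if the "words" can contain special characters in arbitrary positions, you cannot rely on word boundaries)
(?!\B\w) / (?<!\w\B) - dynamic adaptive word boundaries, which require a word boundary only where the adjacent character of the word to find is a word character
"|".join(df["column_text_to_find"].to_list()) - builds an alternation pattern from the values in the column_text_to_find column
sorted(map(re.escape, df["column_text_to_find"].to_list()), key=len, reverse=True) - sorts the words to find by length in descending order and escapes them for use in a regex
.findall(pattern) - finds all occurrences of the pattern
.str.join('_') - joins them with _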
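As a rough check of the second pattern, here is a minimal sketch that rebuilds the three rows from the question as in-memory DataFrames. The sample frames are assumptions standing in for the real CSV data, and terms is just an illustrative name.

```python
import re
import pandas as pd

# Stand-ins for the real data, built from the rows shown in the question.
category_df = pd.DataFrame(
    {"column_text_to_find": ["GRAPHICSDIRECT", "99 CENTS ONLY", "BESTBUY"]}
)
df2 = pd.DataFrame(
    {
        "column_text_to_search": [
            "SP * GRAPHICSDIRECT.ascdadv",
            "99 CENTS ONLY #777#",
            "PAYPAL *BESTBUY COM #3422#",
        ]
    }
)

# Escape each term and put longer terms first so they win in the alternation.
terms = sorted(
    map(re.escape, category_df["column_text_to_find"].to_list()),
    key=len,
    reverse=True,
)
pattern = r"(?i)(?!\B\w)({0})(?<!\w\B)".format("|".join(terms))

# findall yields a list of strings per row; join them into one string.
df2["matched text"] = df2["column_text_to_search"].str.findall(pattern).str.join("_")
print(df2)
```

Rows with no match end up with an empty string, and rows with several matches get them joined by underscores.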
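If I understand your problem statement correctly, this code should solve it:
df2['match_text'] = ''
for j in range(len(df2)):
    for i in range(len(category_df)):
        # the text to find must occur inside the text being searched
        if category_df.column_text_to_find.iloc[i] in df2.column_text_to_search.iloc[j]:
            # write through .loc to avoid chained assignment
            df2.loc[df2.index[j], 'match_text'] = category_df.column_text_to_find.iloc[i]
            break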