Python DataFrame: check whether strings in a DF column contain substrings from a different DF, and return the matching substring value



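Colleagues,

Maybe you can help me with a task that looks simple, but I don't yet have enough experience to solve it.

Suppose we have two DataFrames:

  1. df1 contains substrings
  2. df2 contains longer blocks of text, some of which contain substrings from df1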
import pandas as pd
df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

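Here is what I need:

  1. I need to iterate over the rows to check whether a substring from df1['subst'] occurs anywhere in df2['strng']
  2. If it does occur in df2, I want a new column in df2, ['match_df1'], to hold the substring value from df1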

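The final output in df2 should look like this:

strng                                          match_df1
LEBRON JAMES SCORED 20                         LEBRON JAMES
THREE TIMES DEAD JOHNY WAS HELL OF THE COOK
TRUE IS NOT WHAT YOU THINK                     TRUE IS NOT
FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED      FIVE TIMES FIVE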

As @Chris noted, this answer might work.
Then just filter out the empty strings, like this:

>>> for ind1 in df1.index:
...    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
>>> df1[df1['strng'].str.len() > 0]
    subst                strng
2   FIVE TIMES FIVE      FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED
4   TRUE IS NOT          TRUE IS NOT WHAT YOU THINK
6   LEBRON JAMES         LEBRON JAMES SCORED 20
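
Note that the loop above attaches the matching strings to df1, whereas the question asks for a match_df1 column on df2. A minimal sketch of that direction follows. It assumes that one match per row is enough and that the substrings should be matched literally: str.contains and str.extract treat their pattern as a regular expression by default, hence the re.escape. The names candidates and pattern are just illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE',
                              'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO',
                              'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20',
                              'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK',
                              'TRUE IS NOT WHAT YOU THINK',
                              'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

# Longest substrings first, so a longer candidate is preferred over a
# shorter one that starts at the same position.
candidates = sorted(df1['subst'], key=len, reverse=True)

# One capture group holding all substrings as escaped alternatives.
pattern = '(' + '|'.join(re.escape(s) for s in candidates) + ')'

# With a single group and expand=False, str.extract returns a Series;
# rows without any match get NaN.
df2['match_df1'] = df2['strng'].str.extract(pattern, expand=False)
print(df2)
```

Rows with no match (here the 'THREE TIMES DEAD ...' row, since 'THREE TIME DEAD' is not a substring of it) end up as NaN and can be dropped with df2.dropna(subset=['match_df1']) if they are not wanted.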

Full code:

import pandas as pd
df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
# For each substring in df1, collect every df2 string that contains it
for ind1 in df1.index:
    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
# Keep only the substrings that matched at least one df2 string
print(df1[df1['strng'].str.len() > 0])
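
The same ', '.join idea can also be pointed at df2, so that each text row lists every df1 substring it contains. This is only a sketch: find_matches is a hypothetical helper, and it uses Python's plain `in` test, so no regular-expression handling is involved.

```python
import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE',
                              'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO',
                              'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20',
                              'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK',
                              'TRUE IS NOT WHAT YOU THINK',
                              'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

def find_matches(text, substrings):
    """Return all substrings found in text, joined by ', ' ('' if none)."""
    return ', '.join(s for s in substrings if s in text)

substrings = df1['subst'].tolist()

# One pass over df2; each row is checked against every df1 substring.
df2['match_df1'] = df2['strng'].apply(lambda text: find_matches(text, substrings))
print(df2)
```

Rows without a match get an empty string, which can be filtered the same way as above with df2[df2['match_df1'].str.len() > 0].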
