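Colleagues,
Perhaps you can help me with a seemingly simple task that I don't yet have enough experience to solve.
Say we have two DataFrames:
- df1 contains substrings
- df2 contains longer blocks of text, some of which contain the substrings from df1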
df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
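Here is what I need:
- I need to iterate over the rows to check whether a substring from df1['subst'] occurs anywhere in df2['strng']
- If it does occur in df2, I want a new column in df2, ['match_df1'], to hold the matching substring value from df1
The final output in df2 would look like this: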
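strng | match_df1 |
---|---|
LEBRON JAMES SCORED 20 | LEBRON JAMES |
THREE TIMES DEAD JOHNY WAS HELL OF THE COOK | |
TRUE IS NOT WHAT YOU THINK | TRUE IS NOT |
FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED | FIVE TIMES FIVE |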
As @Chris noted, this answer may work.
Then just filter out the empty strings, like this:
>>> for ind1 in df1.index:
...     df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))
...
>>> df1[df1['strng'].str.len() > 0]
             subst                                      strng
2  FIVE TIMES FIVE  FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED
4      TRUE IS NOT                 TRUE IS NOT WHAT YOU THINK
6     LEBRON JAMES                     LEBRON JAMES SCORED 20
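One caveat worth keeping in mind: `str.contains` interprets its pattern as a regular expression by default. The substrings in this question are plain words, so that makes no difference here, but with punctuation it can produce false matches. The sketch below uses made-up sample strings (not from the question) to show the difference when `regex=False` is passed:

```python
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the regex pitfall
s = pd.Series(['MADE IN U.S.', 'MADE IN USSR'])

# Default (regex=True): each '.' matches any character, so 'USSR' also matches
regex_hits = s.str.contains('U.S.')

# regex=False: the pattern is treated as a literal substring
literal_hits = s.str.contains('U.S.', regex=False)

print(regex_hits.tolist())
print(literal_hits.tolist())
```

If the real substrings may contain characters such as `.`, `(` or `+`, passing `regex=False` in the loop above is probably the safer choice.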
Full code:
import pandas as pd

df1 = {'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']}
df2 = {'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)

# For each substring, collect every df2 text that contains it, joined by ', '
for ind1 in df1.index:
    df1.loc[ind1, 'strng'] = ', '.join(list(df2[df2['strng'].str.contains(df1['subst'][ind1])]['strng']))

# Keep only the substrings that matched at least one text
df1[df1['strng'].str.len() > 0]
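The loop above attaches the matching texts to df1. If the result is wanted in the shape the question asks for, with a `match_df1` column on df2, one possible alternative is to build a single alternation pattern from the substrings and use `str.extract`. This is only a sketch under two assumptions: each text should receive at most one substring (the first one found), and unmatched rows should hold an empty string.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'subst': ['LONDON BRIDGE', 'TRUE GRIT', 'FIVE TIMES FIVE', 'THREE TIME DEAD', 'TRUE IS NOT', 'OH NO', 'LEBRON JAMES']})
df2 = pd.DataFrame({'strng': ['LEBRON JAMES SCORED 20', 'THREE TIMES DEAD JOHNY WAS HELL OF THE COOK', 'TRUE IS NOT WHAT YOU THINK', 'FIVE TIMES FIVE IS NOT WHAT LEBRON SCORED']})

# One capture group listing all substrings; re.escape keeps them literal
pattern = '(' + '|'.join(re.escape(s) for s in df1['subst']) + ')'

# expand=False returns a Series; rows without a match are NaN, replaced by ''
df2['match_df1'] = df2['strng'].str.extract(pattern, expand=False).fillna('')

print(df2)
```

If a text can contain several of the substrings and all of them are needed, `str.findall` with the same pattern would return a list per row instead of a single value.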