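My question is somewhat similar to this one: "How to merge pandas on string contains?", but I need a different output and the problem itself is a bit more complex. I have 2 dataframes similar to the ones below: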
import pandas as pd
df1 = pd.DataFrame({'ref_name':['city-louisville','city-louisville','city-louisville', 'town-lexington','town-lexington','town-lexington'], 'un_name1':['CPU1','CPU2','GPU1','CPU1','CPU2','GPU1'], 'value1':[10,15,28,12,14,14]})
df2 = pd.DataFrame({'ref_name':['louisville','louisville','lexington','lexington'], 'un_name2':['CPU','GPU','CPU','GPU'], 'value2':[25,28,26,14]})
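I need to join on ref_name, and on a substring match within un_name. The real data won't always be this clean, but I think this is a decent small example. In this case, my desired output looks like this: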
ref_name | un_name1 | un_name2 | value1 | value2
---------------------------------------------------------
louisville| CPU1 | CPU | 10 | 25
louisville| CPU2 | CPU | 15 | 25
louisville| GPU1 | GPU | 28 | 28
lexington | CPU1 | CPU | 12 | 26
lexington | CPU2 | CPU | 14 | 26
lexington | GPU1 | GPU | 14 | 14
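Thanks in advance for any help with this!
This is the most general version I could come up with. It compares every row of df2 against every row of df1, so performance may be a problem if the dataframes are large.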
# Rows are df2's index, columns are df1's index; True where the df2 string occurs in the df1 string.
mask1 = df2['ref_name'].apply(lambda value: df1['ref_name'].str.contains(value))
mask2 = df2['un_name2'].apply(lambda value: df1['un_name1'].str.contains(value))
# Keep only the (df2 row, df1 row) pairs where both columns match.
mask = (mask1 & mask2).stack().rename_axis(['index2', 'index1'])
mask = mask[mask].index.to_frame(index=False)
result = (
    mask.merge(df2, left_on='index2', right_index=True)
        .merge(df1, left_on='index1', right_index=True)
)
Result:
index2 index1 ref_name_x un_name2 value2 ref_name_y un_name1 value1
0 0 louisville CPU 25 city-louisville CPU1 10
0 1 louisville CPU 25 city-louisville CPU2 15
1 2 louisville GPU 28 city-louisville GPU1 28
2 3 lexington CPU 26 town-lexington CPU1 12
2 4 lexington CPU 26 town-lexington CPU2 14
3 5 lexington GPU 14 town-lexington GPU1 14
Trimming/renaming the columns is left as an exercise for the OP.
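For readers who want to see the column trimming worked through, here is a minimal, self-contained sketch. Note that it is not the answer's mask approach: it swaps in a cross join (`how='cross'`, which as far as I know needs pandas 1.2 or newer) followed by a plain Python substring test. The names `ref_name_long`, `pairs`, and `keep` are made up for this illustration, and it has the same caveat about large dataframes, since it also builds every pair of rows.

```python
import pandas as pd

df1 = pd.DataFrame({
    'ref_name': ['city-louisville'] * 3 + ['town-lexington'] * 3,
    'un_name1': ['CPU1', 'CPU2', 'GPU1', 'CPU1', 'CPU2', 'GPU1'],
    'value1': [10, 15, 28, 12, 14, 14],
})
df2 = pd.DataFrame({
    'ref_name': ['louisville', 'louisville', 'lexington', 'lexington'],
    'un_name2': ['CPU', 'GPU', 'CPU', 'GPU'],
    'value2': [25, 28, 26, 14],
})

# Rename df1's key first (hypothetical name) so the two ref_name columns
# do not collide, then pair every df1 row with every df2 row.
pairs = df1.rename(columns={'ref_name': 'ref_name_long'}).merge(df2, how='cross')

# Keep a pair only when both df2 strings occur inside the df1 strings.
# Python's `in` is a literal substring test, so no regex escaping is needed.
keep = [
    short_ref in long_ref and short_un in long_un
    for long_ref, short_ref, long_un, short_un in zip(
        pairs['ref_name_long'], pairs['ref_name'],
        pairs['un_name1'], pairs['un_name2'],
    )
]

# Trim to the columns from the desired output and renumber the rows.
columns = ['ref_name', 'un_name1', 'un_name2', 'value1', 'value2']
result = pairs.loc[keep, columns].reset_index(drop=True)
print(result)
```

The literal `in` test is a deliberate choice here: `str.contains` treats its argument as a regular expression by default, which may matter if the real names contain characters such as `.` or `+`.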