Replace list elements in a DataFrame column with the matching column value from another DataFrame



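I have a pandas DataFrame df1. I have another pandas DataFrame df2 with a fruits column that holds lists. Wherever an element of those lists appears in a list in df1's duplicates column, I want to replace it with the value from df1's name column in the same row.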

df1


name          duplicates                         

0 a.apple      ['b.apple', 'c.apple']              
1 t.orange     ['arr.orange', 'pg.orange']         
2 ts.grape     ['a.grape' , 'test.grape']          
3 u.berryCool  ['X.berryCool', 'cool.berryCool']   

df2


people     fruits                                           
0  jack       ['b.apple', 'c.apple', 'pp.tomato', 'ao.banana' ]
1  mary       ['arr.orange', 'b.apple', 'X.berryCool', 'op.mango']
2  andy       ['cool.berryCool' , 'test.grape', 'yu.papaya']   
3  lawrence   ['jc.orange', 'c.apple']                

Expected output

people      fruits                                                        

0 jack       ['a.apple', 'a.apple', 'pp.tomato', 'ao.banana' ]     
1 mary       ['t.orange', 'a.apple', 'u.berryCool', 'op.mango']
2 andy       ['u.berryCool' , 'ts.grape', 'yu.papaya']             
3 lawrence   ['t.orange' , 'a.apple']        

How can I do this efficiently? Any suggestions are welcome.

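First flatten the lists in the duplicates column into a dictionary that maps each duplicate to its name, then look up every value with dict.get, which returns the value unchanged when there is no match: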

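# Flatten: map every duplicate to the name in the same row of df1
d = {x: a for a, b in zip(df1['name'], df1['duplicates']) for x in b}
# Look up each fruit; d.get(y, y) keeps y when it has no match
df2['fruits'] = [[d.get(y, y) for y in x] for x in df2['fruits']]
print(df2)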
people                                      fruits
0      jack    [a.apple, a.apple, pp.tomato, ao.banana]
1      mary  [t.orange, a.apple, u.berryCool, op.mango]
2      andy          [u.berryCool, ts.grape, yu.papaya]
3  lawrence                        [jc.orange, a.apple]
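For anyone who wants to run the dictionary approach end to end, here is a minimal self-contained sketch. It rebuilds df1 and df2 from the question, assuming the cells hold real Python lists rather than strings. The helper names build_lookup and replace_in_lists are hypothetical and only used for illustration.

```python
import pandas as pd

# Sample data copied from the question (assumed to be real lists, not strings).
df1 = pd.DataFrame({
    "name": ["a.apple", "t.orange", "ts.grape", "u.berryCool"],
    "duplicates": [
        ["b.apple", "c.apple"],
        ["arr.orange", "pg.orange"],
        ["a.grape", "test.grape"],
        ["X.berryCool", "cool.berryCool"],
    ],
})
df2 = pd.DataFrame({
    "people": ["jack", "mary", "andy", "lawrence"],
    "fruits": [
        ["b.apple", "c.apple", "pp.tomato", "ao.banana"],
        ["arr.orange", "b.apple", "X.berryCool", "op.mango"],
        ["cool.berryCool", "test.grape", "yu.papaya"],
        ["jc.orange", "c.apple"],
    ],
})


def build_lookup(names, duplicate_lists):
    """Hypothetical helper: map every duplicate to the name of its row."""
    return {dup: name
            for name, dups in zip(names, duplicate_lists)
            for dup in dups}


def replace_in_lists(list_column, lookup):
    """Hypothetical helper: swap known duplicates, keep everything else."""
    return [[lookup.get(item, item) for item in items]
            for items in list_column]


lookup = build_lookup(df1["name"], df1["duplicates"])
df2["fruits"] = replace_in_lists(df2["fruits"], lookup)
print(df2)
```

Note that 'jc.orange' stays as it is, because it does not appear in any list in df1's duplicates column.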

Performance on a DataFrame with 4k rows (results depend on the data, so it is best to test on your real data):

df2 = pd.concat([df2] * 1000, ignore_index=True)

In [135]: %%timeit
...: MAPPING = df1.explode('duplicates').set_index('duplicates')['name']
...: df2['fruits1'] = (df2.explode('fruits')['fruits'].replace(MAPPING).groupby(level=0).agg(list))
...: 
128 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [136]: %%timeit
...: d = {x: a for a, b in zip(df1['name'], df1['duplicates']) for x in b}
...: 
...: df2['fruits2'] = [[d.get(y,y) for y in x] for x in df2['fruits']]
...: 
5.27 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
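The %%timeit cells above need IPython. A rough way to repeat the comparison in a plain Python script is sketched below with the standard timeit module. The function names with_dict and with_explode are hypothetical wrappers around the two approaches, the data is the sample from the question repeated 1000 times, and the timings you get will differ by machine and pandas version.

```python
import timeit

import pandas as pd

df1 = pd.DataFrame({
    "name": ["a.apple", "t.orange", "ts.grape", "u.berryCool"],
    "duplicates": [
        ["b.apple", "c.apple"],
        ["arr.orange", "pg.orange"],
        ["a.grape", "test.grape"],
        ["X.berryCool", "cool.berryCool"],
    ],
})
small = pd.DataFrame({
    "people": ["jack", "mary", "andy", "lawrence"],
    "fruits": [
        ["b.apple", "c.apple", "pp.tomato", "ao.banana"],
        ["arr.orange", "b.apple", "X.berryCool", "op.mango"],
        ["cool.berryCool", "test.grape", "yu.papaya"],
        ["jc.orange", "c.apple"],
    ],
})
# Repeat the 4 sample rows 1000 times to get a 4k-row frame.
df2 = pd.concat([small] * 1000, ignore_index=True)


def with_dict(df1, df2):
    """Dictionary plus nested list comprehension."""
    d = {x: a for a, b in zip(df1["name"], df1["duplicates"]) for x in b}
    return [[d.get(y, y) for y in x] for x in df2["fruits"]]


def with_explode(df1, df2):
    """explode / replace / groupby, one list per original row."""
    mapping = df1.explode("duplicates").set_index("duplicates")["name"]
    return (df2.explode("fruits")["fruits"].replace(mapping)
            .groupby(level=0).agg(list))


# Sanity check: both approaches should agree before they are timed.
assert with_dict(df1, df2) == with_explode(df1, df2).tolist()

runs = 5
for label, func in [("dict", with_dict), ("explode", with_explode)]:
    seconds = timeit.timeit(lambda: func(df1, df2), number=runs)
    print(f"{label}: {seconds / runs * 1000:.2f} ms per call")
```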

You can create a mapping Series that works like a dictionary, indexed by duplicate with the name as its value:

MAPPING = df1.explode('duplicates').set_index('duplicates')['name']
df2['fruits'] = (df2.explode('fruits')['fruits'].replace(MAPPING)
                 .groupby(level=0).agg(list))
print(df2)
# Output
people                                      fruits
0      jack    [a.apple, a.apple, pp.tomato, ao.banana]
1      mary  [t.orange, a.apple, u.berryCool, op.mango]
2      andy          [u.berryCool, ts.grape, yu.papaya]
3  lawrence                        [jc.orange, a.apple]
