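I have a requirement where I need to join each item of a list stored in one column with the value of another column, as shown below.
Input:
     ColumnA ColumnB
0  [x, y, z]  Failed
1     [x, y]  Passed
Output:
     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]
Please let me know how to achieve this using Python and pandas.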
Using only pandas, you can explode the lists, concatenate the strings, and aggregate back into lists:
df['ColumnC'] = (df.explode('ColumnA')
                   .assign(ColumnC=lambda d: d['ColumnA']+':'+d['ColumnB'])
                   .groupby(level=0)['ColumnC'].apply(list)
                )
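To see why `groupby(level=0)` restores one list per original row, it may help to look at the intermediate results. The following is a minimal sketch on the sample data from the question; the variable names `exploded` and `joined` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                   "ColumnB": ["Failed", "Passed"]})

# explode produces one row per list item and repeats the original index label
exploded = df.explode("ColumnA")
print(exploded.index.tolist())   # [0, 0, 0, 1, 1]

# element-wise string concatenation on the exploded rows
joined = exploded["ColumnA"] + ":" + exploded["ColumnB"]

# grouping on the repeated index (level=0) collects the items back per original row
df["ColumnC"] = joined.groupby(level=0).apply(list)
print(df["ColumnC"].tolist())
```

Because the grouped result is indexed by the original row labels, assigning it back to `df` aligns each list with its source row.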
A faster solution is to use itertools.product:
from itertools import product
df['ColumnC'] = df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
Output:
     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]
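What the lambda does for a single row can be sketched in isolation. The names `items`, `status`, `pairs`, and `labels` below are placeholders for illustration and are not part of the answer.

```python
from itertools import product

items, status = ["x", "y", "z"], "Failed"   # one row's ColumnA and ColumnB

# wrapping status in a list makes product pair every item with that single value
pairs = list(product(items, [status]))
print(pairs)      # [('x', 'Failed'), ('y', 'Failed'), ('z', 'Failed')]

# ':'.join turns each (item, status) tuple into "item:status"
labels = list(map(":".join, pairs))
print(labels)     # ['x:Failed', 'y:Failed', 'z:Failed']
```

Note that the brackets around `status` matter: passing the bare string would make `product` iterate over its individual characters.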
If performance is important, use a nested list comprehension with f-strings:
import pandas as pd

df = pd.DataFrame({"ColumnA": [list('xyz'), list('xy')],
                   "ColumnB": ['Failed', 'Passed']})
# outer loop walks the rows, inner loop walks the items of each list
df['ColumnC'] = [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
print(df)
     ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]
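As a variation on the comprehension above, the two columns can also be paired with `zip` instead of going through `to_numpy()`. This is only a sketch and was not part of the timings in this answer, so no claim is made about its relative speed.

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                   "ColumnB": ["Failed", "Passed"]})

# zip walks both columns in lockstep, yielding (list_of_items, status) per row
df["ColumnC"] = [[f"{item}:{status}" for item in items]
                 for items, status in zip(df["ColumnA"], df["ColumnB"])]
print(df)
```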
Another, slower option is to use apply:
df['ColumnC'] = df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
The slowest is explode combined with groupby.
# test with 20k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [29]: %timeit (df.explode('ColumnA').assign(ColumnC=lambda d: d['ColumnA']+':'+d['ColumnB']).groupby(level=0)['ColumnC'].apply(list))
600 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [34]: %timeit df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
268 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [30]: %timeit [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
36.4 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [31]: %timeit df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
363 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
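The `%timeit` lines above only run inside IPython. A rough way to repeat the comparison as a plain script is sketched below with the standard `timeit` module; the function names are made up for this sketch, and the numbers it prints will depend on the machine and the pandas version.

```python
import timeit
from itertools import product

import pandas as pd

base = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                     "ColumnB": ["Failed", "Passed"]})
df = pd.concat([base] * 10000, ignore_index=True)   # 20k rows, as in the test above

def with_explode():
    return (df.explode("ColumnA")
              .assign(ColumnC=lambda d: d["ColumnA"] + ":" + d["ColumnB"])
              .groupby(level=0)["ColumnC"].apply(list))

def with_product():
    return df.apply(lambda r: list(map(":".join, product(r["ColumnA"], [r["ColumnB"]]))), axis=1)

def with_comprehension():
    return [[f"{z}:{y}" for z in x] for x, y in df[["ColumnA", "ColumnB"]].to_numpy()]

def with_apply():
    return df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x["ColumnA"]], axis=1)

candidates = [with_explode, with_product, with_comprehension, with_apply]

# all four should build the same lists before their timings are compared
expected = with_comprehension()
results_match = all(list(fn()) == expected for fn in candidates)
print("results match:", results_match)

for fn in candidates:
    # best of three single runs, reported in milliseconds
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```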