Concatenate list items with another column's value



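I have a requirement where I need to concatenate each item of a list stored in one column with the value of another column, as shown below.

Input: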

ColumnA ColumnB
0  [x, y, z]  Failed
1     [x, y]  Passed

Output:

ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]

Please let me know how to achieve this using Python and pandas.

Using pandas only, you can explode, concatenate, and aggregate back into lists:

df['ColumnC'] = (df.explode('ColumnA')
.assign(ColumnC=lambda d: d['ColumnA']+':'+d['ColumnB'])
.groupby(level=0)['ColumnC'].apply(list)
)
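To make the chained expression above easier to follow, here is a minimal sketch that runs the same three steps one at a time. The sample frame is the one from the question; the intermediate names (`exploded`, `joined`, `column_c`) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                   "ColumnB": ["Failed", "Passed"]})

# Step 1: one row per list element; the original index labels are repeated.
exploded = df.explode("ColumnA")

# Step 2: element-wise string concatenation on the exploded rows.
joined = exploded["ColumnA"] + ":" + exploded["ColumnB"]

# Step 3: group on the original index and collect the strings back into lists.
column_c = joined.groupby(level=0).apply(list)

# Assignment aligns on the index, so each list lands on its source row.
df["ColumnC"] = column_c
print(df)
```

Because `explode` keeps the original index, `groupby(level=0)` is enough to reassemble the rows; this relies on the index being unique.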

A faster solution is to use itertools.product:

from itertools import product
df['ColumnC'] = df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)

Output:

ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]
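As a rough illustration of what the itertools.product expression does for a single row, the sketch below applies it to plain Python values. The variable names `items` and `status` are made up for the example and stand in for one ColumnA list and its ColumnB value.

```python
from itertools import product

items = ["x", "y", "z"]   # one ColumnA value (hypothetical stand-in)
status = "Failed"         # the matching ColumnB value

# product pairs every item with the single-element list [status].
pairs = list(product(items, [status]))
print(pairs)   # [('x', 'Failed'), ('y', 'Failed'), ('z', 'Failed')]

# ':'.join turns each (item, status) tuple into "item:status".
result = list(map(":".join, pairs))
print(result)  # ['x:Failed', 'y:Failed', 'z:Failed']
```

Wrapping the scalar in a list matters: passing the bare string to `product` would iterate over its characters instead.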

If performance is important, use a nested list comprehension with f-strings:

import pandas as pd

df = pd.DataFrame({"ColumnA": [list('xyz'), list('xy')],
                   "ColumnB": ['Failed', 'Passed']})

df['ColumnC'] = [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
print(df)
ColumnA ColumnB                         ColumnC
0  [x, y, z]  Failed  [x:Failed, y:Failed, z:Failed]
1     [x, y]  Passed            [x:Passed, y:Passed]
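The same comprehension can also be written with `zip` over the two columns and wrapped in a small helper for reuse. This is only a sketch: the function name `tag_items` and its parameters are hypothetical, and this variant was not part of the timings in this answer.

```python
import pandas as pd

def tag_items(df, items_col="ColumnA", value_col="ColumnB"):
    """Return one list per row, pairing each item with that row's value."""
    return [[f"{item}:{value}" for item in items]
            for items, value in zip(df[items_col], df[value_col])]

df = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                   "ColumnB": ["Failed", "Passed"]})

df["ColumnC"] = tag_items(df)
print(df)
```

Iterating the two Series directly avoids building the intermediate 2-D object array that `to_numpy()` creates; whether that is measurably faster would need to be timed on your own data.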

Another, slower idea is to use apply:

df['ColumnC'] = df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)

The slowest is explode + groupby:


#test for 20k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [29]: %timeit (df.explode('ColumnA').assign(ColumnC=lambda d: d['ColumnA']+':'+d['ColumnB']).groupby(level=0)['ColumnC'].apply(list))
600 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [34]: %timeit df.apply(lambda r: list(map(':'.join, product(r['ColumnA'], [r['ColumnB']]))), axis=1)
268 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [30]: %timeit [[f'{z}:{y}' for z in x] for x, y in df[['ColumnA', 'ColumnB']].to_numpy()]
36.4 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [31]: %timeit df.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x['ColumnA']], axis=1)
363 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
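The `%timeit` lines above need IPython. Outside of it, a comparable check can be sketched with the standard-library `timeit` module. The function names below are made up for the example, the frame is kept at 2k rows so the script finishes quickly, and the numbers will differ by machine and pandas version.

```python
import timeit
from itertools import product

import pandas as pd

base = pd.DataFrame({"ColumnA": [list("xyz"), list("xy")],
                     "ColumnB": ["Failed", "Passed"]})
df = pd.concat([base] * 1000, ignore_index=True)  # 2k rows; raise to scale up

def explode_groupby(d):
    return (d.explode("ColumnA")
             .assign(ColumnC=lambda t: t["ColumnA"] + ":" + t["ColumnB"])
             .groupby(level=0)["ColumnC"].apply(list))

def apply_product(d):
    return d.apply(lambda r: list(map(":".join, product(r["ColumnA"], [r["ColumnB"]]))), axis=1)

def list_comp(d):
    return [[f"{z}:{y}" for z in x] for x, y in d[["ColumnA", "ColumnB"]].to_numpy()]

def apply_fstring(d):
    return d.apply(lambda x: [f'{z}:{x["ColumnB"]}' for z in x["ColumnA"]], axis=1)

candidates = [explode_groupby, apply_product, list_comp, apply_fstring]
reference = list_comp(df)

for func in candidates:
    # Check that every approach yields the same lists before timing it.
    assert list(func(df)) == reference
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__:16s} {seconds * 1000:8.1f} ms per call")
```

Checking equality first guards against comparing the speed of approaches that silently produce different results.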
