拆分混合数据帧的列



有一个数据帧df:

import pandas as pd
import numpy as np
df=pd.DataFrame(np.array([('x', 'y')] + [('y', 'x')] + 
list([0, np.nan]*2)), columns=['Col'])
df

如何将df拆分为两列,如下所示?:

Col1 Col2
0   x   y
1   y   x
2   0   0
3   NaN NaN
4   0   0
5   NaN NaN

使用list comprehension将标量转换为元组:

df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']], 
columns=['Col1','Col2'])
print (df1)
Col1 Col2
0    x    y
1    y    x
2    0    0
3  NaN  NaN
4    0    0
5  NaN  NaN

更通用的解决方案:

lens = int(df['Col'].str.len().max())
df1 = pd.DataFrame([x if isinstance(x, tuple) else [x] * lens for x in df['Col']])

另一种解决方案,在大数据中速度较慢:

df1 = df['Col'].apply(pd.Series).ffill(axis=1)

性能

df = pd.concat([df] * 1000, ignore_index=True)
In [51]: %%timeit
...: df1 = pd.DataFrame([x if isinstance(x, tuple) else (x,x) for x in df['Col']], 
...:                    columns=['Col1','Col2'])
...: 
2.42 ms ± 45.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %%timeit
...: df['Col'].apply(pd.Series).ffill(axis=1)
...: 
1 s ± 23.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#coldspeed solution
In [53]: %%timeit
...: v = pd.to_numeric(df.Col, errors='coerce')
...: pd.DataFrame({
...:     'Col1': v.fillna(df.Col.str[0]), 
...:     'Col2': v.fillna(df.Col.str[-1])})
...: 
15.8 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

一个不错、简洁的解决方案是使用pd.to_numeric将非数字数据转换为NaN,然后使用fillna

v = pd.to_numeric(df.Col, errors='coerce')
pd.DataFrame({
'Col1': v.fillna(df.Col.str[0]), 
'Col2': v.fillna(df.Col.str[-1])})
Col1 Col2
0    x    y
1    y    x
2    0    0
3  NaN  NaN
4    0    0
5  NaN  NaN

解决方案,适用于多个可能的列:

pd.DataFrame({
f'Col{i+1}': v.fillna(df.Col.str[i]) 
for i in range(int(df.Col.str.len().max()))})
Col1 Col2
0    x    y
1    y    x
2    0    0
3  NaN  NaN
4    0    0
5  NaN  NaN

最新更新