Index compatibility between a DataFrame and the MultiIndex result of an apply on groups

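We have to apply an algorithm to a column of a DataFrame. The data has to be grouped by a key, and the result should form a new column in the DataFrame. Since this is a common use case, we wonder whether we have chosen the right approach.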

The following code reflects our approach to the problem in a simplified manner.

import numpy as np
import pandas as pd
np.random.seed(42)
N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)
df = pd.DataFrame(data)

This generates a DataFrame like the following.

    key         x
0     0  0.969585
1     1  0.775133
2     1  0.939499
3     1  0.894827
4     1  0.597900
..  ...       ...
95   53  0.036887
96   54  0.609564
97   55  0.502679
98   56  0.051479
99   56  0.278646

Applying an example function to the groups of the DataFrame.

def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index  # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const)  # perform some pandas incompatible magic
    return pd.Series(y, index=index)  # reconstruct index

g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))

Assigning the result to a new column with df['y'] = y_per_g raises a TypeError.

TypeError: incompatible index of inserted column with frame index
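The mismatch can be made visible by comparing the index of the frame with the index of the grouped result. The following is only a minimal sketch on a small, hypothetical frame; `add_const` is an illustrative stand-in for the per-group function and is not part of the original code.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same layout as the one in the question.
df = pd.DataFrame({'key': [0, 1, 1, 2, 2, 2], 'x': np.linspace(0.0, 1.0, 6)})

def add_const(group, const=5):
    # Hypothetical stand-in: works on a NumPy array, then restores the index.
    y = group['x'].to_numpy() + const
    return pd.Series(y, index=group.index)

y_per_g = df.groupby('key').apply(add_const)

print(df.index.nlevels)       # levels in the frame's index
print(y_per_g.index.nlevels)  # levels in the grouped result's index
print(y_per_g.index.names)    # the group key is prepended as an extra level
```

The grouped result carries the group key as an additional outer index level, while the frame has a plain single-level index, so the two cannot be aligned directly.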

Therefore, a compatible MultiIndex has to be introduced first.

df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)

This produces the expected result.

       key         x    y
index                    
0        0  0.969585  6.9
1        1  0.775133  6.0
2        1  0.939499  6.1
3        1  0.894827  6.4
4        1  0.597900  6.6
...    ...       ...  ...
95      53  0.036887  6.0
96      54  0.609564  6.0
97      55  0.502679  6.5
98      56  0.051479  6.0
99      56  0.278646  6.1

Now we wonder whether there is a more straightforward way of dealing with the index, and whether our approach is a sensible one in general.

Use Series.droplevel to remove the first level of the MultiIndex so that the result has the same index as df; the assignment then works:

g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print(df)
    key         x    y
0     0  0.969585  6.9
1     1  0.775133  6.0
2     1  0.939499  6.1
3     1  0.894827  6.4
4     1  0.597900  6.6
..  ...       ...  ...
95   53  0.036887  6.0
96   54  0.609564  6.0
97   55  0.502679  6.5
98   56  0.051479  6.0
99   56  0.278646  6.1
[100 rows x 3 columns]
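Two other ways of keeping the group key out of the result index may also be worth considering. The sketch below is not from the original answer: it assumes the same data as in the question, uses a shortened `magic`, and writes to the hypothetical columns `y1` and `y2`. It relies on the `group_keys` parameter of `DataFrame.groupby` and on `transform` applied to the selected column.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
N = 100
df = pd.DataFrame({'key': np.random.randint(0, 2, N).cumsum(),
                   'x': np.random.rand(N)})

def magic(x, const):
    # NumPy-only computation, as in the question
    return (x + np.random.rand(len(x)) + float(const)).round(1)

# Variant 1: group_keys=False so the key is not added as an index level.
y1 = df.groupby('key', group_keys=False).apply(
    lambda g: pd.Series(magic(g['x'].to_numpy(), 5), index=g.index)
)
df['y1'] = y1

# Variant 2: transform on the selected column; the result is aligned with df.index.
df['y2'] = df.groupby('key')['x'].transform(
    lambda s: pd.Series(magic(s.to_numpy(), 5), index=s.index)
)

print(df.head())
```

When the function returns exactly one value per input row, `transform` is usually the more direct fit, since its result is indexed like the original frame and needs no index manipulation afterwards.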
