Updating a pandas DataFrame based on the next value



I am working with a pandas DataFrame and applying a trend to the earliest data points to back-fill missing historical data as well as possible. I know that iterating over a pandas DataFrame is considered bad practice, but I haven't found an alternative, because each new value depends on the next one. If anyone knows a better way to do this, that would be great!

Example df:

Week no  Data  Trend
0        1   0.0    1.5
1        2   0.0    1.5
2        3   0.0    1.0
3        4   0.0    0.5
4        5  10.0    0.6
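
For reproducibility, the example frame above can be rebuilt with something like this (a minimal sketch; the column names 'Week no', 'Data' and 'Trend' are taken from the printout):

import pandas as pd

df = pd.DataFrame({'Week no': [1, 2, 3, 4, 5],
                   'Data': [0.0, 0.0, 0.0, 0.0, 10.0],
                   'Trend': [1.5, 1.5, 1.0, 0.5, 0.6]})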

The code I am currently using:

import math

for wk in range(len(df)-1, 0, -1):
    if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
            and not math.isnan(df.loc[wk, 'Trend'])):
        df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
                                * df.loc[wk, 'Trend'])

Result:

Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6
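
Working the backward pass through by hand shows how the single known value propagates: each earlier value is the later value multiplied by the later row's trend, and once a zero is filled it becomes non-zero, so the next iteration can fill the row before it in turn.

# Backward fill, starting from the only known value (week 5, Data = 10.0);
# each step uses the trend of the later row: Data[i-1] = Data[i] * Trend[i]
wk4 = 10.0 * 0.6   # 6.0 -> week 4
wk3 = wk4 * 0.5    # 3.0 -> week 3
wk2 = wk3 * 1.0    # 3.0 -> week 2
wk1 = wk2 * 1.5    # 4.5 -> week 1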

The recursive calculation cannot be vectorized, so use numba to improve performance:

import numpy as np
from numba import jit

@jit(nopython=True)
def f(a, b):
    # Walk backwards and fill each zero with the next value scaled by its trend
    for i in range(a.shape[0]-1, 0, -1):
        if (a[i] != 0) and (a[i-1] == 0) and not np.isnan(b[i]):
            a[i-1] = a[i] * b[i]
    return a
df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
print (df)
Week no  Data  Trend
0        1   4.5    1.5
1        2   3.0    1.5
2        3   3.0    1.0
3        4   6.0    0.5
4        5  10.0    0.6
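
One caveat that is not part of the original solution: if 'Data' happened to be stored as an integer column, the products written back into the array would be truncated, so it may be safer to request float arrays explicitly before calling the jitted function.

# Hypothetical guard: force float64 arrays so the NaN check and the
# products behave as expected (only needed if 'Data' is not already float)
data = df['Data'].to_numpy(dtype='float64')
trend = df['Trend'].to_numpy(dtype='float64')
df['Data'] = f(data, trend)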

First, a test with no missing values, with data like in the sample:

df = pd.concat([df] * 40, ignore_index=True)
print (df)
Week  no  Data  Trend
0       0   1   4.5    1.5
1       1   2   3.0    1.5
2       2   3   3.0    1.0
3       3   4   6.0    0.5
4       4   5  10.0    0.6
..    ...  ..   ...    ...
195     0   1   4.5    1.5
196     1   2   3.0    1.5
197     2   3   3.0    1.0
198     3   4   6.0    0.5
199     4   5  10.0    0.6
[200 rows x 4 columns]
In [114]: %%timeit
...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
...: 
...: 
121 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)

In [115]: %%timeit
...: for wk in range(len(df)-1, 0, -1):
...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
...:                 and not math.isnan(df.loc[wk, 'Trend'])):
...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
...:                                           *df.loc[wk, 'Trend'])
...:                                           
3.3 ms ± 414 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I also tested with 2 * 40 missing values, and the performance is similar:

print (df)
Week  no  Data  Trend
0     0   1   0.0    NaN
1     1   2   0.0    NaN
2     2   3   0.0    1.0
3     3   4   0.0    0.5
4     4   5  10.0    0.6

df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)

In [117]: %%timeit
...: df['Data'] = f(df['Data'].to_numpy(), df['Trend'].to_numpy())
...: 
119 µs ± 480 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

df = pd.concat([df] * 40, ignore_index=True)
print (df.shape)
(200, 4)

In [121]: %%timeit
...: for wk in range(len(df)-1, 0, -1):
...:         if (df.loc[wk, 'Data'] != 0 and df.loc[wk-1, 'Data'] == 0
...:                 and not math.isnan(df.loc[wk, 'Trend'])):
...:             df.loc[wk-1, 'Data'] = (df.loc[wk, 'Data']
...:                                           *df.loc[wk, 'Trend'])
...:                                           
3.12 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
