Normalize a pandas DataFrame column using a formula and daily averages



I have a pandas DataFrame, for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
            'time' : pd.date_range('2017-07-18 00:00:00', '2017-07-21 00:00:00', freq='3H'),
            'val1' : np.random.random(25)*300,
            'val2' : np.random.random(25)*30})
df.set_index('time', inplace=True)

and a DataFrame of real values, one per day:

real_values = pd.DataFrame({
    'day' : [18, 19, 20],
    'values' : [500, 600, 700]})

I want to normalize the values in column val1 using the following formula:

new_value = old_value*real_value_that_day/daily_average

That is, each value is multiplied by the ratio of that day's real value to that day's average.
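As a quick sanity check on the formula, here is a minimal sketch for a single day. The three readings and the real value of 500 are made-up numbers for illustration only. One consequence worth noting is that, after scaling, the day's mean equals that day's real value.

```python
import numpy as np

# Hypothetical readings for one day and that day's real value (illustrative only)
old_values = np.array([100.0, 200.0, 300.0])
real_value_that_day = 500.0

# Daily average of the raw readings: (100 + 200 + 300) / 3 = 200
daily_average = old_values.mean()

# new_value = old_value * real_value_that_day / daily_average
new_values = old_values * real_value_that_day / daily_average

print(new_values)         # [250. 500. 750.]
print(new_values.mean())  # 500.0
```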

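I tried using .map, but I couldn't work the index.day condition into the DataFrame. I also tried groupby(df.index.day), but I don't know how to get from there to the final result.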

Thank you very much.

I think you need:

import numpy as np
import pandas as pd

np.random.seed(45)
df = pd.DataFrame({
            'time' : pd.date_range('2017-07-18 00:00:00', '2017-07-21 00:00:00', freq='3H'),
            'val1' : np.random.random(25)*300,
            'val2' : np.random.random(25)*30})
df.set_index('time', inplace=True)
real_values = pd.DataFrame({
    'day' : [18, 19, 20],
    'values' : [500, 600, 700]})

# map real_values by day to a Series with the same length and index as df
a = pd.Series(df.index.day, index=df.index).map(real_values.set_index('day')['values'])
print (a.head())
time
2017-07-18 00:00:00    500.0
2017-07-18 03:00:00    500.0
2017-07-18 06:00:00    500.0
2017-07-18 09:00:00    500.0
2017-07-18 12:00:00    500.0
Name: time, dtype: float64

# multiply the original by Series a, then divide by the daily average computed with transform
df1 = df.mul(a, 0).div(df.groupby(df.index.day).transform('mean'))
print (df1)
                            val1         val2
time                                         
2017-07-18 00:00:00  1307.171491   403.372865
2017-07-18 03:00:00   726.330473   851.356196
2017-07-18 06:00:00   371.987469    77.497641
2017-07-18 09:00:00   102.153227   959.768694
2017-07-18 12:00:00   587.453074   233.817177
2017-07-18 15:00:00   624.907891   734.391568
2017-07-18 18:00:00    64.131282   114.951326
2017-07-18 21:00:00   215.865093   624.844533
2017-07-19 00:00:00   120.686108   542.744066
2017-07-19 03:00:00   653.014193  1116.500860
2017-07-19 06:00:00   891.148297   333.591495
2017-07-19 09:00:00   676.652432   610.715673
2017-07-19 12:00:00  1031.182496   743.728715
2017-07-19 15:00:00   489.559748   336.152862
2017-07-19 18:00:00   643.545466   147.084368
2017-07-19 21:00:00   294.211260   969.481959
2017-07-20 00:00:00  1474.421809   404.910284
2017-07-20 03:00:00  1016.785621  1078.311435
2017-07-20 06:00:00   665.498098   589.809072
2017-07-20 09:00:00   437.622829   122.931391
2017-07-20 12:00:00   769.989526  1158.555013
2017-07-20 15:00:00   169.891633   968.620184
2017-07-20 18:00:00   342.854461   159.225353
2017-07-20 21:00:00   722.936022  1117.637269
2017-07-21 00:00:00          NaN          NaN

Details:

print (df.groupby(df.index.day).transform('mean'))
                           val1       val2
time                                      
2017-07-18 00:00:00  113.490638  14.427688
2017-07-18 03:00:00  113.490638  14.427688
2017-07-18 06:00:00  113.490638  14.427688
2017-07-18 09:00:00  113.490638  14.427688
2017-07-18 12:00:00  113.490638  14.427688
2017-07-18 15:00:00  113.490638  14.427688
2017-07-18 18:00:00  113.490638  14.427688
2017-07-18 21:00:00  113.490638  14.427688
2017-07-19 00:00:00  172.937287  13.491194
2017-07-19 03:00:00  172.937287  13.491194
2017-07-19 06:00:00  172.937287  13.491194
2017-07-19 09:00:00  172.937287  13.491194
2017-07-19 12:00:00  172.937287  13.491194
2017-07-19 15:00:00  172.937287  13.491194
2017-07-19 18:00:00  172.937287  13.491194
2017-07-19 21:00:00  172.937287  13.491194
2017-07-20 00:00:00  139.010896  16.081470
2017-07-20 03:00:00  139.010896  16.081470
2017-07-20 06:00:00  139.010896  16.081470
2017-07-20 09:00:00  139.010896  16.081470
2017-07-20 12:00:00  139.010896  16.081470
2017-07-20 15:00:00  139.010896  16.081470
2017-07-20 18:00:00  139.010896  16.081470
2017-07-20 21:00:00  139.010896  16.081470
2017-07-21 00:00:00   72.827447   2.008148
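The question asks about val1 only, while the code above scales every column. The following is a minimal sketch of the same map-and-transform idea restricted to val1, using a small hand-written frame instead of random data so the result can be checked by eye. The column name val1_norm and the sample readings are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small deterministic frame: two readings per day on 18-20 July, one on 21 July
idx = pd.to_datetime([
    '2017-07-18 00:00', '2017-07-18 12:00',
    '2017-07-19 00:00', '2017-07-19 12:00',
    '2017-07-20 00:00', '2017-07-20 12:00',
    '2017-07-21 00:00',
])
df = pd.DataFrame({'val1': [10.0, 30.0, 50.0, 150.0, 7.0, 7.0, 5.0]}, index=idx)
real_values = pd.DataFrame({'day': [18, 19, 20], 'values': [500, 600, 700]})

# Real value for each row, looked up by day of month (NaN for day 21)
days = pd.Series(df.index.day, index=df.index)
target = days.map(real_values.set_index('day')['values'])

# Daily average of val1, broadcast back to every row of that day
daily_mean = df.groupby(df.index.day)['val1'].transform('mean')

# 'val1_norm' is a hypothetical column name; the original val1 is left untouched
df['val1_norm'] = df['val1'] * target / daily_mean

# After scaling, each day's mean equals its real value: 500, 600, 700, NaN for day 21
check = df.groupby(df.index.day)['val1_norm'].mean()
print(check)
```

Writing to a new column rather than overwriting val1 keeps the raw readings available for comparison.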

This should do it:

import pandas as pd
import numpy as np
df = pd.DataFrame({
            'time' : pd.date_range('2017-07-18 00:00:00', '2017-07-21 00:00:00', freq='3H'),
            'val1' : np.random.random(25)*300,
            'val2' : np.random.random(25)*30})
real_values = pd.DataFrame({
    'day' : [18, 19, 20],
    'values' : [500, 600, 700]})
df['day'] = df['time'].dt.day
df = df.merge(real_values, how='left', on='day')
df['mean'] = df.groupby('day')['val1'].transform('mean')
df['val1'] = df['val1'] * df['values'] / df['mean']
df.set_index('time', inplace=True)

You are simply missing a mapping for the 21st, which is why the row for that day comes out as NaN.
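If it helps, here is a minimal sketch of one way to spot days that have no entry in real_values and to decide what happens to them. Keeping the original reading for unmapped days is only an assumed policy, not something the question specifies. The sample readings and the column name val1_norm are likewise made up.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2017-07-20 00:00', '2017-07-20 12:00', '2017-07-21 00:00'])
df = pd.DataFrame({'val1': [100.0, 300.0, 40.0]}, index=idx)
real_values = pd.DataFrame({'day': [18, 19, 20], 'values': [500, 600, 700]})

# Real value per row, looked up by day of month; unmapped days become NaN
days = pd.Series(df.index.day, index=df.index)
target = days.map(real_values.set_index('day')['values'])

# Days that occur in the data but are absent from real_values
missing_days = sorted(days[target.isna()].unique().tolist())
print(missing_days)  # [21]

daily_mean = df.groupby(df.index.day)['val1'].transform('mean')
scaled = df['val1'] * target / daily_mean

# Assumed policy: keep the original reading where no real value is available
df['val1_norm'] = scaled.fillna(df['val1'])
print(df)
```

Note that fillna also replaces any NaN produced for other reasons, such as a daily average of zero, so checking missing_days first makes the cause explicit.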
