复杂变换熊猫



假设我有以下数据集

table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]
df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])

现在,我想找到每个月的最后一个值,将其按月移动到下一个月的值,最后取这些值的累积乘积。对上述数据执行此过程应该会得到(执行每个步骤):

  1. 查找每个月的最后一个条目并按月移动它们将产生

                Date       Id   Value  Temp
             0 2015-01-01   1    0.5    NaN
             1 2015-01-27   1    0.5    NaN
             2 2015-01-31   1    0.5    NaN
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    2.0
             8 2015-03-17   1    3.0    2.0
             9 2015-03-31   1    3.0    2.0
    
  2. 1填充NaN,取累积积,去掉temp,得到

                 Date      Id   Value  Result
             0 2015-01-01   1    0.5     1
             1 2015-01-27   1    0.5     1
             2 2015-01-31   1    0.5     1
             3 2015-02-01   1    2.0    0.5
             4 2015-02-03   1    2.0    0.5
             5 2015-02-15   1    2.0    0.5
             6 2015-02-28   1    2.0    0.5
             7 2015-03-01   1    3.0    1.0
             8 2015-03-17   1    3.0    1.0
             9 2015-03-31   1    3.0    1.0
    

我希望这足够清楚。如果有人想知道我为什么要这么做是因为我有mtd数据,需要重新采样。谢谢,Tingis。

编辑每个月的条目数是"随机的",因为它们可以与月一样长,也可以更短(业务数据…)

下面的代码没有假设每个月只有两行。我们的想法是首先进行分组计算,然后使用.reindex()填充一些NaN,并使用向后填充填充这些NaN,因为我们已经获得了每个月最后一个条目的值。

# your data
# ==================================
import pandas as pd
import datetime
table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]
df = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
# better to set Date column to index
df = df.set_index('Date')
print(df)
            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-31   1    3.0
# processing
# =================================================
# get last entry from each month
df_temp = df.groupby(lambda idx: idx.month).tail(1)
# do the cumprod, reindex to have the same index as original df, backward fill
df['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(df.index).fillna(method='bfill')
print(df)
            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-31   1    3.0     1.0

后续问题:

# your data
# ==================================
import pandas as pd
import datetime
table = [[datetime.datetime(2015, 1, 1), 1, 0.5],
         [datetime.datetime(2015, 1, 27), 1, 0.5],
         [datetime.datetime(2015, 1, 31), 1, 0.5],
         [datetime.datetime(2015, 2, 1), 1, 2],
         [datetime.datetime(2015, 2, 3), 1, 2],
         [datetime.datetime(2015, 2, 15), 1, 2], 
         [datetime.datetime(2015, 2, 28), 1, 2],
         [datetime.datetime(2015, 3, 1), 1, 3],
         [datetime.datetime(2015, 3, 17), 1, 3],
         [datetime.datetime(2015, 3, 31), 1, 3]]
df1 = pd.DataFrame(table, columns=['Date', 'Id', 'Value'])
df2 = df1.copy()
df2.Id = 2
df = df1.append(df2)
# better to set Date column to index
df = df.set_index('Date')
print(df)

            Id  Value
Date                 
2015-01-01   1    0.5
2015-01-27   1    0.5
2015-01-31   1    0.5
2015-02-01   1    2.0
2015-02-03   1    2.0
2015-02-15   1    2.0
2015-02-28   1    2.0
2015-03-01   1    3.0
2015-03-17   1    3.0
2015-03-31   1    3.0
2015-01-01   2    0.5
2015-01-27   2    0.5
2015-01-31   2    0.5
2015-02-01   2    2.0
2015-02-03   2    2.0
2015-02-15   2    2.0
2015-02-28   2    2.0
2015-03-01   2    3.0
2015-03-17   2    3.0
2015-03-31   2    3.0
def my_func(group):
    # get last entry from each month
    df_temp = group.groupby(lambda idx: idx.month).tail(1)
    # do the cumprod, reindex to have the same index as original df
    group['Result'] = df_temp['Value'].shift(1).fillna(1).cumprod().reindex(group.index).fillna(method='bfill')
    return group
df.groupby('Id').apply(my_func)

            Id  Value  Result
Date                         
2015-01-01   1    0.5     1.0
2015-01-27   1    0.5     1.0
2015-01-31   1    0.5     1.0
2015-02-01   1    2.0     0.5
2015-02-03   1    2.0     0.5
2015-02-15   1    2.0     0.5
2015-02-28   1    2.0     0.5
2015-03-01   1    3.0     1.0
2015-03-17   1    3.0     1.0
2015-03-31   1    3.0     1.0
2015-01-01   2    0.5     1.0
2015-01-27   2    0.5     1.0
2015-01-31   2    0.5     1.0
2015-02-01   2    2.0     0.5
2015-02-03   2    2.0     0.5
2015-02-15   2    2.0     0.5
2015-02-28   2    2.0     0.5
2015-03-01   2    3.0     1.0
2015-03-17   2    3.0     1.0
2015-03-31   2    3.0     1.0

最新更新