计算数据帧中固定数量的迭代连续行的总和和平均值



我有一个数据帧如下:

|  ID  |    Date    |    Value     |
------------------------------------
|  A   | 01-01-2020 |    0.4854    |
|  A   | 02-01-2020 |    0.4856    |
|  A   | 03-01-2020 |    0.3982    |
---
|  A   | 29-12-2020 |    0.2139    |
|  A   | 30-12-2020 |    0.6290    |
|  A   | 31-12-2020 |    1.3921    |
---
|  B   | 01-01-2020 |    2.198     |
|  B   | 02-01-2020 |    1.4856    |
|  B   | 03-01-2020 |    2.3982    |
---

对于给定的ID,我需要找到";值";对于所有14天的时间段,然后返回总和和平均值以及开始日期和结束日期。假设2020年1月1日至2020年1日是14天的时间段;值";为3.27;值";则2020年2月1日至2020年1月15日是另一个14天的时间段;值";为3.34;值";是0.4456。同样,我需要找到所有可能的连续14天周期的总和和平均值。14天的期限必须是连续的。

我的输出应该是:

|  ID  | Start Date |  End Date  |   Sum   |  Average  |
--------------------------------------------------------
|  A   | 01-01-2020 | 14-01-2020 | 3.2685  |   0.4239  |
|  A   | 02-01-2020 | 15-01-2020 | 3.3371  |   0.4456  |
|  A   | 03-01-2020 | 16-01-2020 | 3.1982  |   0.3987  |
---
|  B   | 01-01-2020 | 14-01-2020 | 4.2685  |   0.6321  |
|  B   | 02-01-2020 | 15-01-2020 | 5.3371  |   0.7892  |
|  B   | 03-01-2020 | 16-01-2020 | 4.1982  |   0.6210  |

我的方法是为数据帧添加一个新的结束日期列。我使用iterows()逐行提取数据帧中的行,以计算总和和平均值。

import pandas as pd
import numpy as np
import random
import datetime
date_rng = pd.date_range('2020-01-01', '2020-01-31', freq='1D')
date_rng = date_rng.append(date_rng)
# value = np.random.uniform(0, 5, 62)
value = np.random.randint(0, 5, (62,))
Id = ['A']*31+['B']*31
df = pd.DataFrame({'ID':Id,'date':date_rng,'Value':value+value[::-1]})
df['date'] = pd.to_datetime(df['date'])
df['End Date'] = df['date']+ datetime.timedelta(days=13)
df.columns = ['ID', 'Start Date', 'Value', 'End Date']
import itertools
for idx,row in df.iterrows():
start = row['Start Date']
end = row['End Date']
i = row['ID']
d = df[(df['ID'] == i) & (df['Start Date'] >= df.loc[idx,'Start Date']) & (df['Start Date'] <= df.loc[idx,'End Date'])]['Value']
df.loc[idx,'sum'] = d.sum()
df.loc[idx,'mean'] = d.mean()
df.head(15)
ID  Start Date  Value   End Date    sum mean
0   A   2020-01-01  6   2020-01-14  56.0    4.000000
1   A   2020-01-02  7   2020-01-15  53.0    3.785714
2   A   2020-01-03  1   2020-01-16  50.0    3.571429
3   A   2020-01-04  1   2020-01-17  55.0    3.928571
4   A   2020-01-05  0   2020-01-18  57.0    4.071429
5   A   2020-01-06  5   2020-01-19  60.0    4.285714
6   A   2020-01-07  3   2020-01-20  61.0    4.357143
7   A   2020-01-08  8   2020-01-21  61.0    4.357143
8   A   2020-01-09  4   2020-01-22  55.0    3.928571
9   A   2020-01-10  5   2020-01-23  54.0    3.857143
10  A   2020-01-11  6   2020-01-24  53.0    3.785714
11  A   2020-01-12  6   2020-01-25  52.0    3.714286
12  A   2020-01-13  0   2020-01-26  50.0    3.571429
13  A   2020-01-14  4   2020-01-27  51.0    3.642857
14  A   2020-01-15  3   2020-01-28  51.0    3.642857

我有一个类似于@r-初学者的方法,但这个方法使用transform并调用函数来计算和值和平均值。

import pandas as pd
date1 = '2011-05-03'
df = pd.DataFrame()
df['start_date'] = pd.date_range(date1, periods=100,freq='D')
df['end_date'] = df['start_date'] + pd.to_timedelta(13, unit='D')
df['score'] = range(1, 1+len(df))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
def sum_scores(d):
return df[(df['start_date'] <= d) &
(df['end_date'] >= d)]['score'].sum()
def mean_scores(d):
return df[(df['start_date'] <= d) &
(df['end_date'] >= d)]['score'].mean()
df['sum'] = df['end_date'].transform(sum_scores)
df['mean'] = df['end_date'].transform(mean_scores)
print (df)

结果是:

start_date   end_date  score  sum  mean
0  2011-05-03 2011-05-16      1  105   7.5
1  2011-05-04 2011-05-17      2  119   8.5
2  2011-05-05 2011-05-18      3  133   9.5
3  2011-05-06 2011-05-19      4  147  10.5
4  2011-05-07 2011-05-20      5  161  11.5
5  2011-05-08 2011-05-21      6  175  12.5
6  2011-05-09 2011-05-22      7  189  13.5
7  2011-05-10 2011-05-23      8  203  14.5
8  2011-05-11 2011-05-24      9  217  15.5
9  2011-05-12 2011-05-25     10  231  16.5
10 2011-05-13 2011-05-26     11  245  17.5
11 2011-05-14 2011-05-27     12  259  18.5
12 2011-05-15 2011-05-28     13  273  19.5
13 2011-05-16 2011-05-29     14  287  20.5
14 2011-05-17 2011-05-30     15  301  21.5
15 2011-05-18 2011-05-31     16  315  22.5
16 2011-05-19 2011-06-01     17  329  23.5
17 2011-05-20 2011-06-02     18  343  24.5
18 2011-05-21 2011-06-03     19  357  25.5
19 2011-05-22 2011-06-04     20  371  26.5

您可以按ID进行分组,以按每个ID进行细分。

最新更新