我有一个数据帧如下:
| ID | Date | Value |
------------------------------------
| A | 01-01-2020 | 0.4854 |
| A | 02-01-2020 | 0.4856 |
| A | 03-01-2020 | 0.3982 |
---
| A | 29-12-2020 | 0.2139 |
| A | 30-12-2020 | 0.6290 |
| A | 31-12-2020 | 1.3921 |
---
| B | 01-01-2020 | 2.198 |
| B | 02-01-2020 | 1.4856 |
| B | 03-01-2020 | 2.3982 |
---
对于给定的ID,我需要找到";值";对于所有14天的时间段,然后返回总和和平均值以及开始日期和结束日期。假设2020年1月1日至2020年1日是14天的时间段;值";为3.27;值";则2020年2月1日至2020年1月15日是另一个14天的时间段;值";为3.34;值";是0.4456。同样,我需要找到所有可能的连续14天周期的总和和平均值。14天的期限必须是连续的。
我的输出应该是:
| ID | Start Date | End Date | Sum | Average |
--------------------------------------------------------
| A | 01-01-2020 | 14-01-2020 | 3.2685 | 0.4239 |
| A | 02-01-2020 | 15-01-2020 | 3.3371 | 0.4456 |
| A | 03-01-2020 | 16-01-2020 | 3.1982 | 0.3987 |
---
| B | 01-01-2020 | 14-01-2020 | 4.2685 | 0.6321 |
| B | 02-01-2020 | 15-01-2020 | 5.3371 | 0.7892 |
| B | 03-01-2020 | 16-01-2020 | 4.1982 | 0.6210 |
我的方法是为数据帧添加一个新的结束日期列。我使用iterows()
逐行提取数据帧中的行,以计算总和和平均值。
import pandas as pd
import numpy as np
import random
import datetime
date_rng = pd.date_range('2020-01-01', '2020-01-31', freq='1D')
date_rng = date_rng.append(date_rng)
# value = np.random.uniform(0, 5, 62)
value = np.random.randint(0, 5, (62,))
Id = ['A']*31+['B']*31
df = pd.DataFrame({'ID':Id,'date':date_rng,'Value':value+value[::-1]})
df['date'] = pd.to_datetime(df['date'])
df['End Date'] = df['date']+ datetime.timedelta(days=13)
df.columns = ['ID', 'Start Date', 'Value', 'End Date']
import itertools
for idx,row in df.iterrows():
start = row['Start Date']
end = row['End Date']
i = row['ID']
d = df[(df['ID'] == i) & (df['Start Date'] >= df.loc[idx,'Start Date']) & (df['Start Date'] <= df.loc[idx,'End Date'])]['Value']
df.loc[idx,'sum'] = d.sum()
df.loc[idx,'mean'] = d.mean()
df.head(15)
ID Start Date Value End Date sum mean
0 A 2020-01-01 6 2020-01-14 56.0 4.000000
1 A 2020-01-02 7 2020-01-15 53.0 3.785714
2 A 2020-01-03 1 2020-01-16 50.0 3.571429
3 A 2020-01-04 1 2020-01-17 55.0 3.928571
4 A 2020-01-05 0 2020-01-18 57.0 4.071429
5 A 2020-01-06 5 2020-01-19 60.0 4.285714
6 A 2020-01-07 3 2020-01-20 61.0 4.357143
7 A 2020-01-08 8 2020-01-21 61.0 4.357143
8 A 2020-01-09 4 2020-01-22 55.0 3.928571
9 A 2020-01-10 5 2020-01-23 54.0 3.857143
10 A 2020-01-11 6 2020-01-24 53.0 3.785714
11 A 2020-01-12 6 2020-01-25 52.0 3.714286
12 A 2020-01-13 0 2020-01-26 50.0 3.571429
13 A 2020-01-14 4 2020-01-27 51.0 3.642857
14 A 2020-01-15 3 2020-01-28 51.0 3.642857
我有一个类似于@r-初学者的方法,但这个方法使用transform
并调用函数来计算和值和平均值。
import pandas as pd
date1 = '2011-05-03'
df = pd.DataFrame()
df['start_date'] = pd.date_range(date1, periods=100,freq='D')
df['end_date'] = df['start_date'] + pd.to_timedelta(13, unit='D')
df['score'] = range(1, 1+len(df))
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
def sum_scores(d):
return df[(df['start_date'] <= d) &
(df['end_date'] >= d)]['score'].sum()
def mean_scores(d):
return df[(df['start_date'] <= d) &
(df['end_date'] >= d)]['score'].mean()
df['sum'] = df['end_date'].transform(sum_scores)
df['mean'] = df['end_date'].transform(mean_scores)
print (df)
结果是:
start_date end_date score sum mean
0 2011-05-03 2011-05-16 1 105 7.5
1 2011-05-04 2011-05-17 2 119 8.5
2 2011-05-05 2011-05-18 3 133 9.5
3 2011-05-06 2011-05-19 4 147 10.5
4 2011-05-07 2011-05-20 5 161 11.5
5 2011-05-08 2011-05-21 6 175 12.5
6 2011-05-09 2011-05-22 7 189 13.5
7 2011-05-10 2011-05-23 8 203 14.5
8 2011-05-11 2011-05-24 9 217 15.5
9 2011-05-12 2011-05-25 10 231 16.5
10 2011-05-13 2011-05-26 11 245 17.5
11 2011-05-14 2011-05-27 12 259 18.5
12 2011-05-15 2011-05-28 13 273 19.5
13 2011-05-16 2011-05-29 14 287 20.5
14 2011-05-17 2011-05-30 15 301 21.5
15 2011-05-18 2011-05-31 16 315 22.5
16 2011-05-19 2011-06-01 17 329 23.5
17 2011-05-20 2011-06-02 18 343 24.5
18 2011-05-21 2011-06-03 19 357 25.5
19 2011-05-22 2011-06-04 20 371 26.5
您可以按ID
进行分组,以按每个ID进行细分。