我有一个数据框架,里面有2014-2018年客户的ID和他们的费用。我想要的是每个ID的费用平均值,但在计算平均值时,只能考虑某个日期之前的年份(因此"日期"列指定了可以考虑哪些列作为平均值(。
示例:对于索引0(ID:12(,日期为"2016-03-08",则平均值应取自列"y_2014"one_answers"y_2015",因此对于此索引,平均值为111.0。如果日期太早(例如,在本例中为2014年或更早(,则应返回NaN(见索引6和9(。
初始数据帧:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID
0 100.0 122.0 324 632 NaN 2016-03-08 12
1 120.0 159.0 54 452 541.0 2015-04-09 96
2 NaN 164.0 687 165 245.0 2016-02-15 20
3 180.0 421.0 512 184 953.0 2018-05-01 73
4 110.0 654.0 913 173 103.0 2017-08-04 84
5 130.0 NaN 754 124 207.0 2016-07-03 26
6 170.0 256.0 843 97 806.0 2013-02-04 87
7 140.0 754.0 95 101 541.0 2016-06-08 64
8 80.0 985.0 184 84 90.0 2019-03-05 11
9 96.0 65.0 127 130 421.0 2014-05-14 34
所需输出:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
尝试过的代码:->我仍在处理它,因为我真的不知道如何开始,到目前为止我只上传了数据帧,可能必须使用"datetime"包才能获得所需的数据帧?
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
print(df)
由于您的命名约定,为了进行比较,需要从列名中提取年份。然后你可以屏蔽数据并取平均值:
# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract('(d+)')[0].astype(int)
# the years from Date
years = pd.to_datetime(df.Date).dt.year.values
df['mean'] = data.where(data_years<years[:,None]).mean(1)
输出:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.00
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.00
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.00
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.00
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.00
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447.00
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.60
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
还有一个答案:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
#Subset from original df to calculate mean
subset = df.loc[:,['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
#an expense value is only available for the calculation of the mean when that year has passed, therefore 2015-01-01 is chosen for the 'y_2014' column in the subset etc. to check with the 'Date'-column
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']
s = subset.columns[0:].values < df.Date.values[:,None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset.iloc[:,0:]*t).mean(1)
print(df)
#Additionally: (gives the sum of expenses before a certain date in the 'Date'-column
df['sum'] = (subset.iloc[:,0:]*t).sum(1)
print(df)