我有下面的dataframe
:
import pandas as pd
data = pd.DataFrame({
'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
'2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31',
'2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
})
df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd')
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days
df['diff'] = df.groupby('ID')['Invoice_Date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
df
我需要使days
列在30处停止,并开始计算新列。为了使生活变得困难,应按ID
分组计算days
。我以前使用过:
df.loc[0, 'days'] = df.loc[0, 'diff']
for i in range(1, len(df)):
df.loc[i, 'days'] = df.loc[i-1, 'days'] + df.loc[i, 'diff']
print(df)
因此输出:
ID Invoice_Date Payment_Term Payment_Date Due_Date Delay diff days
0 27459 2020-06-26 7 2020-07-05 2020-07-03 2 0.0 0.0
1 27459 2020-06-29 8 2020-07-05 2020-07-07 -2 3.0 3.0
2 27459 2020-06-30 3 2020-07-03 2020-07-03 0 1.0 4.0
3 27459 2020-07-14 6 2020-07-21 2020-07-20 1 14.0 18.0
4 27459 2020-07-25 4 2020-07-31 2020-07-29 2 11.0 29.0
5 27459 2020-07-30 7 2020-08-15 2020-08-06 9 5.0 34.0
6 27459 2020-08-02 8 2020-08-22 2020-08-10 12 3.0 37.0
7 48002 2020-05-13 5 2020-06-16 2020-05-18 29 0.0 37.0
8 48002 2020-06-20 3 2020-06-23 2020-06-23 0 38.0 75.0
9 48002 2020-06-28 6 2020-07-05 2020-07-04 1 8.0 83.0
我需要的结果是:
ID Invoice_Date Payment_Term Payment_Date Due_Date Delay diff days
0 27459 2020-06-26 7 2020-07-05 2020-07-03 2 0.0 0.0
1 27459 2020-06-29 8 2020-07-05 2020-07-07 -2 3.0 3.0
2 27459 2020-06-30 3 2020-07-03 2020-07-03 0 1.0 4.0
3 27459 2020-07-14 6 2020-07-21 2020-07-20 1 14.0 18.0
4 27459 2020-07-25 4 2020-07-31 2020-07-29 2 11.0 29.0
5 27459 2020-07-30 7 2020-08-15 2020-08-06 9 5.0 0.0
6 27459 2020-08-02 8 2020-08-22 2020-08-10 12 3.0 8.0
7 48002 2020-05-13 5 2020-06-16 2020-05-18 29 0.0 0.0
8 48002 2020-06-20 3 2020-06-23 2020-06-23 0 38.0 0.0
9 48002 2020-06-28 6 2020-07-05 2020-07-04 1 8.0 0.0
差异是第4行中停止的天数的总和,因为当它添加到下一行时,它将大于30。因此,它在第5行开始新的计算。
它仍然无法捕捉到您想要的确切输出,因为即使没有总和(例如索引8(,一些相应的diff值也大于30,因此它们将保留在数据上。
def func(x):
x = x.values
values = [x[0]]
for i in range(1, len(x)):
value = values[i-1] + x[i]
values.append(value if value < 30 else x[i])
return values
df['days'] = df.groupby("ID")["diff"].transform(func)
print(df[['ID', 'diff', 'days']])
输出:
ID diff days
0 27459 0.0 0.0
1 27459 3.0 3.0
2 27459 1.0 4.0
3 27459 14.0 18.0
4 27459 11.0 29.0
5 27459 5.0 5.0
6 27459 3.0 8.0
7 48002 0.0 0.0
8 48002 38.0 38.0
9 48002 8.0 8.0
编辑:
更改功能定义应该有效:
def func(x):
x = x.values
values = [x[0]]
for i in range(1, len(x)):
value = values[i-1] + x[i]
if value < 30:
values.append(value)
elif x[i] >= 30:
values.append(x[i-1])
else:
values.append(x[i])
return values
df['days'] = df.groupby("ID")["diff"].transform(func)
print(df[['ID', 'diff', 'days']])
ID diff days
0 27459 0.0 0.0
1 27459 3.0 3.0
2 27459 1.0 4.0
3 27459 14.0 18.0
4 27459 11.0 29.0
5 27459 5.0 5.0
6 27459 3.0 8.0
7 48002 0.0 0.0
8 48002 38.0 0.0
9 48002 8.0 8.0