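I'm trying to add a column to an existing pandas DataFrame, but some of the resulting values don't make sense (or at least are in a format I don't recognize). I want the end date of each stage to be derived from the start date of the next stage. If the process has already finished, the end date should equal the start date, and if the process is still running, the end date should be today.
My code is: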
import pandas as pd
import numpy as np
from datetime import timedelta, date
my_list = [[1, '5-6-2019', 1, 4], [1, '15-7-2019', 2, 4], [1, '20-10-2019', 3, 4], [1, '26-10-2019', 4, 4], [2, '5-6-2019', 1, 5], [2, '15-7-2019', 2, 5], [2, '20-10-2019', 3, 5], [2, '26-10-2019', 5, 5], [3, '3-6-2020', 1, 1], [4, '6-6-2020', 1, 1], [5, '7-6-2020', 1, 3], [5, '7-7-2020', 2, 3], [5, '31-7-2020', 3, 3], [6, '10-7-2020', 1, 2], [6, '13-6-2020', 2, 2], [7, '18-6-2020', 1, 2], [7, '21-7-2020', 2, 2]]
df = pd.DataFrame(my_list, columns=['ID', 'Start_date', 'Stage_ID', 'Max_stage'])
df['Start_date'] = pd.to_datetime(df['Start_date'])
today = date.today()
conditions = [
(df['Stage_ID'] == 1) & (df['Max_stage'] > 1),
(df['Stage_ID'] == 2) & (df['Max_stage'] > 2),
(df['Stage_ID'] == 3) & (df['Max_stage'] > 3),
(df['Stage_ID'] == 4) | (df['Stage_ID'] == 5)]
solutions = [
(df['Start_date'] - timedelta(days=1)).shift(-1),
(df['Start_date'] - timedelta(days=1)).shift(-1),
(df['Start_date'] - timedelta(days=1)).shift(-1),
df['Start_date']]
df['End_date'] = np.select(conditions, solutions, default=today)
df
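So, if the stage is 1, 2, or 3 and it is not the last stage, the end date is the start date of the next stage minus one day (hence the timedelta(days=1) and shift(-1)). If the stage is 4 or 5, the process has finished, so the end date of that stage equals its start date. In all other cases the process is still ongoing, so the end date is today (hence default=today).
The result of my code is: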
ID Start_date Stage_ID Max_stage End_date
0 1 2019-05-06 1 4 1563062400000000000
1 1 2019-07-15 2 4 1571443200000000000
2 1 2019-10-20 3 4 1571961600000000000
3 1 2019-10-26 4 4 1572048000000000000
4 2 2019-05-06 1 5 1563062400000000000
5 2 2019-07-15 2 5 1571443200000000000
6 2 2019-10-20 3 5 1571961600000000000
7 2 2019-10-26 5 5 1572048000000000000
8 3 2020-03-06 1 1 2020-09-30
9 4 2020-06-06 1 1 2020-09-30
10 5 2020-07-06 1 3 1593993600000000000
11 5 2020-07-07 2 3 1596067200000000000
12 5 2020-07-31 3 3 2020-09-30
13 6 2020-10-07 1 2 1591920000000000000
14 6 2020-06-13 2 2 2020-09-30
15 7 2020-06-18 1 2 1595203200000000000
16 7 2020-07-21 2 2 2020-09-30
The end dates I expect are:
ID Start_date Stage_ID Max_stage End_date
0 1 2019-05-06 1 4 2019-07-14 (= start date of next row minus 1 day)
1 1 2019-07-15 2 4 2019-10-19 (= start date of next row minus 1 day)
2 1 2019-10-20 3 4 2019-10-25 (= start date of next row minus 1 day)
3 1 2019-10-26 4 4 2019-10-26 (stage is 4, so start date=end date)
4 2 2019-05-06 1 5 2019-07-14 (= start date of next row minus 1 day)
5 2 2019-07-15 2 5 2019-10-19 (= start date of next row minus 1 day)
6 2 2019-10-20 3 5 2019-10-25 (= start date of next row minus 1 day)
7 2 2019-10-26 5 5 2019-10-26 (stage is 5, so start date=end date)
8 3 2020-03-06 1 1 2020-09-30 (no condition is met, so default = today)
9 4 2020-06-06 1 1 2020-09-30 (no condition is met, so default = today)
10 5 2020-07-06 1 3 2020-07-06 (= start date of next row minus 1 day)
11 5 2020-07-07 2 3 2020-07-30 (= start date of next row minus 1 day)
12 5 2020-07-31 3 3 2020-09-30 (no condition is met, so default = today)
13 6 2020-10-07 1 2 2020-06-12 (= start date of next row minus 1 day)
14 6 2020-06-13 2 2 2020-09-30 (no condition is met, so default = today)
15 7 2020-06-18 1 2 2020-07-20 (= start date of next row minus 1 day)
16 7 2020-07-21 2 2 2020-09-30 (no condition is met, so default = today)
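So only the "today" dates are correct. The other "dates" have a strange result/format. Is this some kind of absolute number that I can convert to a date (a bit like in Excel)? Or is my code wrong? Please advise. Thanks.
I did the following experiment: I replaced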
df['Start_date'] = pd.to_datetime(df['Start_date'])
with
df['Start_date'] = df['Start_date'].apply(lambda s:
date(int(s.split('-')[2]),
int(s.split('-')[1]),
int(s.split('-')[0])))
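so that only date objects are actually handled. The output looks better, but I'm not sure whether it is the expected result: mine looks a bit different from yours.
Result: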
ID Start_date Stage_ID Max_stage End_date
0 1 2019-06-05 1 4 2019-07-14
1 1 2019-07-15 2 4 2019-10-19
2 1 2019-10-20 3 4 2019-10-25
3 1 2019-10-26 4 4 2019-10-26
4 2 2019-06-05 1 5 2019-07-14
5 2 2019-07-15 2 5 2019-10-19
6 2 2019-10-20 3 5 2019-10-25
7 2 2019-10-26 5 5 2019-10-26
8 3 2020-06-03 1 1 2020-09-30
9 4 2020-06-06 1 1 2020-09-30
10 5 2020-06-07 1 3 2020-07-06
11 5 2020-07-07 2 3 2020-07-30
12 5 2020-07-31 3 3 2020-09-30
13 6 2020-07-10 1 2 2020-06-12
14 6 2020-06-13 2 2 2020-09-30
15 7 2020-06-18 1 2 2020-07-20
16 7 2020-07-21 2 2 2020-09-30
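As an alternative to falling back on plain date objects, the same rules can be expressed with pandas operations only, so the column stays a datetime column throughout. This is a minimal sketch, not the original poster's method: it uses a reduced four-row sample of the data, replaces np.select with Series.mask, and fixes "today" to 2020-09-30 (the date shown in the post) so the output is reproducible.

```python
import pandas as pd

# Reduced sample of the question's data (assumption: four rows are enough to show each rule).
df = pd.DataFrame(
    [[1, '5-6-2019', 1, 4],
     [1, '15-7-2019', 2, 4],
     [1, '26-10-2019', 4, 4],
     [3, '3-6-2020', 1, 1]],
    columns=['ID', 'Start_date', 'Stage_ID', 'Max_stage'],
)
df['Start_date'] = pd.to_datetime(df['Start_date'], format='%d-%m-%Y')

# Fixed here for reproducibility; real code could use pd.Timestamp.today().normalize().
today = pd.Timestamp('2020-09-30')

# Start date of the following row, minus one day (NaT for the last row).
next_start_minus_one = df['Start_date'].shift(-1) - pd.Timedelta(days=1)

# Stage 1-3 that is not the last stage, and stage 4/5 (process finished).
not_last = df['Stage_ID'].isin([1, 2, 3]) & (df['Stage_ID'] < df['Max_stage'])
finished = df['Stage_ID'].isin([4, 5])

# Start from the default ("today") and overwrite where a rule applies.
end = pd.Series(today, index=df.index)
end = end.mask(finished, df['Start_date'])
end = end.mask(not_last, next_start_minus_one)
df['End_date'] = end

print(df)
```

Because every operand is a pandas timestamp, End_date keeps a datetime dtype and no integer values appear. Like the original code, the shift is not grouped by ID.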
EDIT: There is also something odd in the code: 5-6-2019 becomes 2019-05-06 (which seems wrong), while 21-7-2020 becomes 2020-07-21. You can fix this by using an explicit format string: pd.to_datetime(df['Start_date'], format='%d-%m-%Y').
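A small illustration of the explicit format string, using three of the date strings from the question. With format='%d-%m-%Y' every value is read as day-month-year, so nothing depends on how pandas would guess the order.

```python
import pandas as pd

# Day-first strings taken from the question's data.
raw = pd.Series(['5-6-2019', '15-7-2019', '21-7-2020'])

# The explicit format forces day-month-year for every value.
parsed = pd.to_datetime(raw, format='%d-%m-%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2019-06-05', '2019-07-15', '2020-07-21']
```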
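EDIT 2: If you take one of the odd numbers, 1595203200000000000, treat it as nanoseconds since the Unix epoch, drop the last nine digits to get seconds, and look at:
print(date.fromtimestamp(1595203200))
you get 2020-07-20, which seems to be the expected result. So the correct result is there, just in the wrong format.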
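The same conversion can be sketched directly in pandas, without trimming digits by hand. pd.to_datetime accepts an integer together with unit='ns' and interprets it as nanoseconds since the Unix epoch in UTC, whereas date.fromtimestamp uses the local time zone. The variable names below are illustrative only.

```python
import pandas as pd

# One of the odd End_date values from the question's output.
raw_value = 1595203200000000000

# Interpret the integer as nanoseconds since the Unix epoch.
ts = pd.to_datetime(raw_value, unit='ns')
print(ts.date())
# 2020-07-20

# The same works for a whole column of such integers.
raw_series = pd.Series([1563062400000000000, 1595203200000000000])
converted = pd.to_datetime(raw_series, unit='ns')
print(converted.dt.strftime('%Y-%m-%d').tolist())
# ['2019-07-14', '2020-07-20']
```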