Unknown results in a date column of a Pandas df



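I am trying to add a column to an existing Pandas dataframe, but some of the values make no sense (or are at least in an unknown format). I want the end date of a stage to be derived from the start date of the next stage. If the process has already ended, I want the end date to equal the start date; if the process is still ongoing, I want the end date to be "today".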

My code is:

import pandas as pd
import numpy as np
from datetime import timedelta, date
my_list = [[1, '5-6-2019', 1, 4], [1, '15-7-2019', 2, 4], [1, '20-10-2019', 3, 4], [1, '26-10-2019', 4, 4], [2, '5-6-2019', 1, 5], [2, '15-7-2019', 2, 5], [2, '20-10-2019', 3, 5], [2, '26-10-2019', 5, 5], [3, '3-6-2020', 1, 1], [4, '6-6-2020', 1, 1], [5, '7-6-2020', 1, 3], [5, '7-7-2020', 2, 3], [5, '31-7-2020', 3, 3], [6, '10-7-2020', 1, 2], [6, '13-6-2020', 2, 2], [7, '18-6-2020', 1, 2], [7, '21-7-2020', 2, 2]]
df = pd.DataFrame(my_list, columns=['ID', 'Start_date', 'Stage_ID', 'Max_stage'])
df['Start_date'] = pd.to_datetime(df['Start_date'])
today = date.today()

conditions = [
(df['Stage_ID'] == 1) & (df['Max_stage'] > 1),
(df['Stage_ID'] == 2) & (df['Max_stage'] > 2),
(df['Stage_ID'] == 3) & (df['Max_stage'] > 3),
(df['Stage_ID'] == 4) | (df['Stage_ID'] == 5)]
solutions = [
(df['Start_date'] - timedelta(days=1)).shift(-1),
(df['Start_date'] - timedelta(days=1)).shift(-1),
(df['Start_date'] - timedelta(days=1)).shift(-1),
df['Start_date']]   
df['End_date'] = np.select(conditions, solutions, default=today)
df

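So if the stage is 1, 2, or 3 and it is not the last stage, the end date is the start date of the next stage minus one day (hence the minus timedelta(days=1) and the shift(-1)). If the stage is 4 or 5, the process has ended, so the end date of that stage equals its start date. In all other cases the process has not ended yet and runs up to and including today's date (hence the default of today).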

The result of my code is:

    ID  Start_date  Stage_ID    Max_stage   End_date
0   1   2019-05-06  1           4           1563062400000000000  
1   1   2019-07-15  2           4           1571443200000000000
2   1   2019-10-20  3           4           1571961600000000000
3   1   2019-10-26  4           4           1572048000000000000
4   2   2019-05-06  1           5           1563062400000000000
5   2   2019-07-15  2           5           1571443200000000000
6   2   2019-10-20  3           5           1571961600000000000
7   2   2019-10-26  5           5           1572048000000000000
8   3   2020-03-06  1           1           2020-09-30
9   4   2020-06-06  1           1           2020-09-30
10  5   2020-07-06  1           3           1593993600000000000
11  5   2020-07-07  2           3           1596067200000000000
12  5   2020-07-31  3           3           2020-09-30
13  6   2020-10-07  1           2           1591920000000000000
14  6   2020-06-13  2           2           2020-09-30
15  7   2020-06-18  1           2           1595203200000000000
16  7   2020-07-21  2           2           2020-09-30

The end dates I expect are:

    ID  Start_date  Stage_ID    Max_stage   End_date
0   1   2019-05-06  1           4           2019-07-14 (= start date of next row minus 1 day)
1   1   2019-07-15  2           4           2019-10-19 (= start date of next row minus 1 day)
2   1   2019-10-20  3           4           2019-10-25 (= start date of next row minus 1 day)
3   1   2019-10-26  4           4           2019-10-26 (stage is 4, so start date=end date)
4   2   2019-05-06  1           5           2019-07-14 (= start date of next row minus 1 day)
5   2   2019-07-15  2           5           2019-10-19 (= start date of next row minus 1 day)
6   2   2019-10-20  3           5           2019-10-25 (= start date of next row minus 1 day)
7   2   2019-10-26  5           5           2019-10-26 (stage is 5, so start date=end date)
8   3   2020-03-06  1           1           2020-09-30 (no condition is met, so default = today)
9   4   2020-06-06  1           1           2020-09-30 (no condition is met, so default = today)
10  5   2020-07-06  1           3           2020-07-06 (= start date of next row minus 1 day)
11  5   2020-07-07  2           3           2020-07-30 (= start date of next row minus 1 day)
12  5   2020-07-31  3           3           2020-09-30 (no condition is met, so default = today)
13  6   2020-10-07  1           2           2020-06-12 (= start date of next row minus 1 day)
14  6   2020-06-13  2           2           2020-09-30 (no condition is met, so default = today)
15  7   2020-06-18  1           2           2020-07-20 (= start date of next row minus 1 day)
16  7   2020-07-21  2           2           2020-09-30 (no condition is met, so default = today)

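So only the "today" dates are correct. The other "dates" come out in a strange format. Is this some kind of absolute number that I can convert to a date (a bit like in Excel)? Or is my code wrong? Please advise. Thanks.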

I did the following experiment: I replaced

df['Start_date'] = pd.to_datetime(df['Start_date'])

with

df['Start_date'] = df['Start_date'].apply(lambda s:
date(int(s.split('-')[2]),
int(s.split('-')[1]),
int(s.split('-')[0])))

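so that in effect only date objects are handled. The output looks better, but I am not sure whether it produces the expected result. Mine looks slightly different from yours.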

Result:

ID  Start_date  Stage_ID  Max_stage    End_date
0    1  2019-06-05         1          4  2019-07-14
1    1  2019-07-15         2          4  2019-10-19
2    1  2019-10-20         3          4  2019-10-25
3    1  2019-10-26         4          4  2019-10-26
4    2  2019-06-05         1          5  2019-07-14
5    2  2019-07-15         2          5  2019-10-19
6    2  2019-10-20         3          5  2019-10-25
7    2  2019-10-26         5          5  2019-10-26
8    3  2020-06-03         1          1  2020-09-30
9    4  2020-06-06         1          1  2020-09-30
10   5  2020-06-07         1          3  2020-07-06
11   5  2020-07-07         2          3  2020-07-30
12   5  2020-07-31         3          3  2020-09-30
13   6  2020-07-10         1          2  2020-06-12
14   6  2020-06-13         2          2  2020-09-30
15   7  2020-06-18         1          2  2020-07-20
16   7  2020-07-21         2          2  2020-09-30
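As an alternative to converting everything to date objects, here is a minimal sketch that keeps pandas datetimes. It rests on an assumption about the cause: np.select is given datetime64 columns as choices but a plain Python date as the default, so the result falls back to an object array and the nanosecond datetimes show up as integers. Passing a default of type np.datetime64 keeps everything in one dtype. The row subset, the fixed `today`, and the merged first condition are choices made for this illustration and are not part of the original code.

```python
import numpy as np
import pandas as pd
from datetime import date

# Subset of the original rows, enough to hit every branch.
rows = [[1, '5-6-2019', 1, 4], [1, '15-7-2019', 2, 4], [1, '20-10-2019', 3, 4],
        [1, '26-10-2019', 4, 4], [3, '3-6-2020', 1, 1], [5, '7-6-2020', 1, 3],
        [5, '7-7-2020', 2, 3], [5, '31-7-2020', 3, 3]]
df = pd.DataFrame(rows, columns=['ID', 'Start_date', 'Stage_ID', 'Max_stage'])
df['Start_date'] = pd.to_datetime(df['Start_date'], format='%d-%m-%Y')

# Assumption: a fixed "today" so the output is reproducible;
# in real use this would be np.datetime64(date.today()).
today = np.datetime64(date(2020, 9, 30))

# Start date of the next row minus one day.
next_start_minus_1 = (df['Start_date'] - pd.Timedelta(days=1)).shift(-1)

conditions = [
    # Stage 1, 2 or 3 and not the last stage (the first three original conditions merged).
    (df['Stage_ID'].isin([1, 2, 3]) & (df['Stage_ID'] < df['Max_stage'])).to_numpy(),
    # Stage 4 or 5: the process has ended.
    df['Stage_ID'].isin([4, 5]).to_numpy(),
]
choices = [next_start_minus_1.to_numpy(), df['Start_date'].to_numpy()]

# All choices and the default are datetime64, so the result stays datetime64.
df['End_date'] = np.select(conditions, choices, default=today)
print(df)
```

With this change the End_date column should print as ordinary dates rather than large integers.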

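EDIT: There is also something odd in the code: 5-6-2019 becomes 2019-05-06 (which seems wrong), while 21-7-2020 becomes 2020-07-21. Without a format, ambiguous strings appear to be read month-first, and day-first is used only when the first number cannot be a month. You can fix this by passing an explicit format string: pd.to_datetime(df['Start_date'], format='%d-%m-%Y').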
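A small sketch of the explicit format string, using the two dates mentioned in the edit. The variable names `raw` and `parsed` are only for illustration.

```python
import pandas as pd

raw = pd.Series(['5-6-2019', '21-7-2020'])

# %d-%m-%Y forces day-month-year for every value,
# so 5-6-2019 is read as 5 June 2019, not 6 May 2019.
parsed = pd.to_datetime(raw, format='%d-%m-%Y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2019-06-05', '2021-07-20'] would be wrong; the actual output is shown below
# ['2019-06-05', '2020-07-21']
```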

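EDIT 2: If you take the odd number 1595203200000000000, drop the last nine digits (the value is in nanoseconds, while date.fromtimestamp expects seconds), and look at: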

print(date.fromtimestamp(1595203200))

you get 2020-07-20, which seems to be the expected result. So the correct results are there, just in the wrong format.
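The same check can be sketched with pandas directly, which avoids stripping digits by hand. This assumes the integers are nanoseconds since the Unix epoch, as the check above suggests; the two sample values are taken from the End_date column in the question. Note that date.fromtimestamp uses the local time zone, whereas pd.to_datetime treats the value as a time-zone-naive (UTC-based) timestamp, so the two can differ by a day on some machines.

```python
import pandas as pd

# Two of the odd End_date values from the question's output.
raw_ns = pd.Series([1563062400000000000, 1595203200000000000])

# Interpret the integers as nanoseconds since 1970-01-01.
as_dates = pd.to_datetime(raw_ns, unit='ns')

print(as_dates.dt.strftime('%Y-%m-%d').tolist())
# ['2019-07-14', '2020-07-20']
```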
