潘达斯那一年有几个月的人在那里



我们考虑这个表示工人合同的数据帧。我想列出一个工人在某一年工作了多少个月。

df = pd.DataFrame{'id': {0: 19019,
1: 17160, 2: 21593, 3: 3146, 4: 21593, 5: 3146, 6: 22737, 7: 25311, 8: 25740,
9: 3289, 10: 26312, 11: 28028, 12: 17017, 13: 27742, 14: 26884,
15: 31174, 16: 31889, 17: 33319, 18: 35178, 19: 35464},
'start_date': {0: Timestamp('2016-06-01 00:00:00'),
1: Timestamp('2016-09-01 00:00:00'), 2: Timestamp('2016-11-01 00:00:00'),     
3: Timestamp('2017-01-01 00:00:00'), 4: Timestamp('2017-03-01 00:00:00'),    
5: Timestamp('2017-08-01 00:00:00'), 6: Timestamp('2018-09-01 00:00:00'),      
7: Timestamp('2018-09-01 00:00:00'),8: Timestamp('2018-10-01 00:00:00'),      
9: Timestamp('1999-11-01 00:00:00'),10: Timestamp('2018-10-01 00:00:00'),      
11: Timestamp('2019-01-01 00:00:00'),12: Timestamp('2009-11-01 00:00:00'),      
13: Timestamp('2019-09-01 00:00:00'),14: Timestamp('2020-03-01 00:00:00'),      
15: Timestamp('2020-03-01 00:00:00'),16: Timestamp('2020-04-14 00:00:00'),      
17: Timestamp('2020-10-01 00:00:00'),18: Timestamp('2021-03-01 00:00:00'),      
19: Timestamp('2021-03-08 00:00:00')},
'end_date': {0: Timestamp('2017-01-31 00:00:00'),
1: Timestamp('2018-07-31 00:00:00'),2: Timestamp('2017-02-28 00:00:00'),      
3: Timestamp('2017-07-31 00:00:00'),4: Timestamp('2017-12-31 00:00:00'),      
5: Timestamp('2017-12-31 00:00:00'),6: Timestamp('2021-12-31 00:00:00'),      
7: Timestamp('2019-08-16 00:00:00'),8: Timestamp('2019-11-30 00:00:00'),      
9: Timestamp('2022-12-31 00:00:00'),10: Timestamp('2020-09-30 00:00:00'),      
11: Timestamp('2021-02-28 00:00:00'),12: Timestamp('2022-10-31 00:00:00'),      
13: Timestamp('2022-02-28 00:00:00'),14: Timestamp('2022-02-28 00:00:00'),      
15: Timestamp('2022-02-28 00:00:00'),16: Timestamp('2021-06-30 00:00:00'),      
17: Timestamp('2022-09-30 00:00:00'),18: Timestamp('2022-02-28 00:00:00'),      
19: Timestamp('2022-03-07 00:00:00')}})

因此,如果我们考虑2020年:

year = 2020
after = df.index[df.start_date.dt.year >= year] # Started late in that year
before = df.index[df.end_date.dt.year <= year] # Left early in that year
df['after'] = df.iloc[after].start_date.dt.month
df['before'] = df.iloc[before].end_date.dt.month
df = df.fillna(0)
df['months'] = 12
df['months'] -= df['after']
df[df.before > 0]['months'] -= 12 - df['before']
df = df.drop(['before', 'after'], axis=1)
dm = df[(df.start_date.dt.year <= year) & (df.end_date.dt.year >= year)]
dm

我得到了2020年工作n个月的员工名单:

id      start_date  end_date    months
13  22737   2018-09-01  2021-12-31  12.0
16  3289    1999-11-01  2022-12-31  12.0
17  26312   2018-10-01  2020-09-30  12.0
18  28028   2019-01-01  2021-02-28  12.0
19  17017   2009-11-01  2022-10-31  12.0
20  27742   2019-09-01  2022-02-28  12.0
21  26884   2020-03-01  2022-02-28  9.0
22  31174   2020-03-01  2022-02-28  9.0
23  31889   2020-04-14  2021-06-30  8.0
24  33319   2020-10-01  2022-09-30  2.0

是否有更好的pandaish方法来实现同样的效果?

(请随意重命名这个问题,我相信它的名字很糟糕(

您可以定义两个日期范围:第一个是2020中具有月频率的目标期间,第二个是每行的start_dateend_date的月频率期间。然后通过np.intersect1d()找到它们的共同交集月份,并通过共同月份数组的长度找到匹配月份的数量:

rng2020 = pd.date_range(start='2020-01-01', end='2020-12-31', freq='M')
df['months'] = df.apply(lambda x: len(np.intersect1d(pd.date_range(start=x['start_date'], end=x['end_date'], freq='M'), rng2020)), axis=1)
df.loc[df['months'] !=0]

结果:

id start_date   end_date  months
6   22737 2018-09-01 2021-12-31      12
9    3289 1999-11-01 2022-12-31      12
10  26312 2018-10-01 2020-09-30       9
11  28028 2019-01-01 2021-02-28      12
12  17017 2009-11-01 2022-10-31      12
13  27742 2019-09-01 2022-02-28      12
14  26884 2020-03-01 2022-02-28      10
15  31174 2020-03-01 2022-02-28      10
16  31889 2020-04-14 2021-06-30       9
17  33319 2020-10-01 2022-09-30       3

这是一种方法:

start = '1/1/2020'
end = '12/31/2020'
s = (df['end_date'].clip(upper = pd.to_datetime(end)) - 
df['start_date'].clip(lower = pd.to_datetime(start))).floordiv(pd.to_timedelta(30,'d'))
df = df.assign(months = s.where(s.gt(0)))

通过np.select:单向

year = 2020
condlist = [
(df.start_date.dt.year < year) & (df.end_date.dt.year > year),
(df.start_date.dt.year == year) & (df.end_date.dt.year == year),
df.start_date.dt.year == year,
df.end_date.dt.year == year,
]
choicelist = [
12,
df.end_date.dt.month - df.start_date.dt.month,
12 - df.start_date.dt.month,
df.end_date.dt.month,
]
df['work_hours'] = np.select(condlist, choicelist)

注意:如果需要,请删除work_hours为0的行。

最新更新