我们考虑这个表示工人合同的数据帧。我想列出一个工人在某一年工作了多少个月。
df = pd.DataFrame{'id': {0: 19019,
1: 17160, 2: 21593, 3: 3146, 4: 21593, 5: 3146, 6: 22737, 7: 25311, 8: 25740,
9: 3289, 10: 26312, 11: 28028, 12: 17017, 13: 27742, 14: 26884,
15: 31174, 16: 31889, 17: 33319, 18: 35178, 19: 35464},
'start_date': {0: Timestamp('2016-06-01 00:00:00'),
1: Timestamp('2016-09-01 00:00:00'), 2: Timestamp('2016-11-01 00:00:00'),
3: Timestamp('2017-01-01 00:00:00'), 4: Timestamp('2017-03-01 00:00:00'),
5: Timestamp('2017-08-01 00:00:00'), 6: Timestamp('2018-09-01 00:00:00'),
7: Timestamp('2018-09-01 00:00:00'),8: Timestamp('2018-10-01 00:00:00'),
9: Timestamp('1999-11-01 00:00:00'),10: Timestamp('2018-10-01 00:00:00'),
11: Timestamp('2019-01-01 00:00:00'),12: Timestamp('2009-11-01 00:00:00'),
13: Timestamp('2019-09-01 00:00:00'),14: Timestamp('2020-03-01 00:00:00'),
15: Timestamp('2020-03-01 00:00:00'),16: Timestamp('2020-04-14 00:00:00'),
17: Timestamp('2020-10-01 00:00:00'),18: Timestamp('2021-03-01 00:00:00'),
19: Timestamp('2021-03-08 00:00:00')},
'end_date': {0: Timestamp('2017-01-31 00:00:00'),
1: Timestamp('2018-07-31 00:00:00'),2: Timestamp('2017-02-28 00:00:00'),
3: Timestamp('2017-07-31 00:00:00'),4: Timestamp('2017-12-31 00:00:00'),
5: Timestamp('2017-12-31 00:00:00'),6: Timestamp('2021-12-31 00:00:00'),
7: Timestamp('2019-08-16 00:00:00'),8: Timestamp('2019-11-30 00:00:00'),
9: Timestamp('2022-12-31 00:00:00'),10: Timestamp('2020-09-30 00:00:00'),
11: Timestamp('2021-02-28 00:00:00'),12: Timestamp('2022-10-31 00:00:00'),
13: Timestamp('2022-02-28 00:00:00'),14: Timestamp('2022-02-28 00:00:00'),
15: Timestamp('2022-02-28 00:00:00'),16: Timestamp('2021-06-30 00:00:00'),
17: Timestamp('2022-09-30 00:00:00'),18: Timestamp('2022-02-28 00:00:00'),
19: Timestamp('2022-03-07 00:00:00')}})
因此,如果我们考虑2020年:
year = 2020
after = df.index[df.start_date.dt.year >= year] # Started late in that year
before = df.index[df.end_date.dt.year <= year] # Left early in that year
df['after'] = df.iloc[after].start_date.dt.month
df['before'] = df.iloc[before].end_date.dt.month
df = df.fillna(0)
df['months'] = 12
df['months'] -= df['after']
df[df.before > 0]['months'] -= 12 - df['before']
df = df.drop(['before', 'after'], axis=1)
dm = df[(df.start_date.dt.year <= year) & (df.end_date.dt.year >= year)]
dm
我得到了2020年工作n个月的员工名单:
id start_date end_date months
13 22737 2018-09-01 2021-12-31 12.0
16 3289 1999-11-01 2022-12-31 12.0
17 26312 2018-10-01 2020-09-30 12.0
18 28028 2019-01-01 2021-02-28 12.0
19 17017 2009-11-01 2022-10-31 12.0
20 27742 2019-09-01 2022-02-28 12.0
21 26884 2020-03-01 2022-02-28 9.0
22 31174 2020-03-01 2022-02-28 9.0
23 31889 2020-04-14 2021-06-30 8.0
24 33319 2020-10-01 2022-09-30 2.0
是否有更好的pandaish方法来实现同样的效果?
(请随意重命名这个问题,我相信它的名字很糟糕(
您可以定义两个日期范围:第一个是2020
中具有月频率的目标期间,第二个是每行的start_date
到end_date
的月频率期间。然后通过np.intersect1d()
找到它们的共同交集月份,并通过共同月份数组的长度找到匹配月份的数量:
rng2020 = pd.date_range(start='2020-01-01', end='2020-12-31', freq='M')
df['months'] = df.apply(lambda x: len(np.intersect1d(pd.date_range(start=x['start_date'], end=x['end_date'], freq='M'), rng2020)), axis=1)
df.loc[df['months'] !=0]
结果:
id start_date end_date months
6 22737 2018-09-01 2021-12-31 12
9 3289 1999-11-01 2022-12-31 12
10 26312 2018-10-01 2020-09-30 9
11 28028 2019-01-01 2021-02-28 12
12 17017 2009-11-01 2022-10-31 12
13 27742 2019-09-01 2022-02-28 12
14 26884 2020-03-01 2022-02-28 10
15 31174 2020-03-01 2022-02-28 10
16 31889 2020-04-14 2021-06-30 9
17 33319 2020-10-01 2022-09-30 3
这是一种方法:
start = '1/1/2020'
end = '12/31/2020'
s = (df['end_date'].clip(upper = pd.to_datetime(end)) -
df['start_date'].clip(lower = pd.to_datetime(start))).floordiv(pd.to_timedelta(30,'d'))
df = df.assign(months = s.where(s.gt(0)))
通过np.select
:单向
year = 2020
condlist = [
(df.start_date.dt.year < year) & (df.end_date.dt.year > year),
(df.start_date.dt.year == year) & (df.end_date.dt.year == year),
df.start_date.dt.year == year,
df.end_date.dt.year == year,
]
choicelist = [
12,
df.end_date.dt.month - df.start_date.dt.month,
12 - df.start_date.dt.month,
df.end_date.dt.month,
]
df['work_hours'] = np.select(condlist, choicelist)
注意:如果需要,请删除work_hours为0的行。