Suppose I have the following data:
ID Leave Type Start Date End Date
1 Sick 2022-01-01 2022-01-01
1 Holiday 2023-03-28
2 Holiday 2023-01-01 2023-01-02
3 Work 2023-01-01 2023-01-01
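I need a way to check whether Start Date and End Date have the same value. If they do not, it should work out how many days End Date is ahead of Start Date and create one row for each of those days, adding 1 day each time, so that Start Date and End Date always match within a row. If End Date is blank, rows should be created up to 2023-03-30.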
ID Leave Type Start Date End Date
1 Sick 2022-01-01 2022-01-01
1 Holiday 2023-03-28 2023-03-28
1 Holiday 2023-03-29 2023-03-29
1 Holiday 2023-03-30 2023-03-30
1 Holiday 2023-03-31 2023-03-31
2 Holiday 2023-01-01 2023-01-01
2 Holiday 2023-01-02 2023-01-02
3 Work 2023-01-01 2023-01-01
Thanks!
import pandas as pd
from pandas.tseries.offsets import MonthEnd
df = pd.DataFrame({'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01'],
})
# Convert columns 'Start Date' and 'End Date' to datetime (blank strings become NaT)
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime, errors='coerce')
# Fill NaT values with the last day of the month
df['End Date'] = df['End Date'].fillna(df['Start Date'] + MonthEnd(0))
# Replace 'End Date' values with the list of dates from start to end
df['End Date'] = [pd.date_range(s, e, freq='D').tolist() for s, e in zip(df['Start Date'], df['End Date'])]
# Explode the list
df = df.explode('End Date')
df['Start Date'] = df['End Date']
print(df)
Result:
Leave Type Start Date End Date
0 Sick 2022-01-01 2022-01-01
1 Holiday 2023-03-28 2023-03-28
1 Holiday 2023-03-29 2023-03-29
1 Holiday 2023-03-30 2023-03-30
1 Holiday 2023-03-31 2023-03-31
2 Holiday 2023-01-01 2023-01-01
2 Holiday 2023-01-02 2023-01-02
3 Work 2023-01-01 2023-01-01
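Note that `MonthEnd(0)` fills the blank end date with 2023-03-31, which matches the expected output in the question but not the 2023-03-30 cutoff stated in its text. Below is a minimal sketch of the same `date_range`/`explode` idea with a fixed cutoff and the ID column kept. The function name `expand_leave` and its `cutoff` parameter are hypothetical, introduced only for illustration, and treating a blank end date as "until the cutoff" is an assumption taken from the question's wording.

```python
import pandas as pd

def expand_leave(df, cutoff="2023-03-30"):
    """Return one row per leave day; blank end dates run until `cutoff` (assumed)."""
    out = df.copy()
    # Parse both date columns; blank or invalid strings become NaT
    for col in ["Start Date", "End Date"]:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # Assumption: a missing end date means "until the cutoff date"
    out["End Date"] = out["End Date"].fillna(pd.Timestamp(cutoff))
    # Build the list of days covered by each row, then explode to one row per day
    out["Start Date"] = [
        pd.date_range(s, e, freq="D").tolist()
        for s, e in zip(out["Start Date"], out["End Date"])
    ]
    out = out.explode("Start Date", ignore_index=True)
    # explode leaves an object column, so convert back to datetime
    out["Start Date"] = pd.to_datetime(out["Start Date"])
    out["End Date"] = out["Start Date"]
    return out

df = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Leave Type": ["Sick", "Holiday", "Holiday", "Work"],
    "Start Date": ["2022-01-01", "2023-03-28", "2023-01-01", "2023-01-01"],
    "End Date": ["2022-01-01", "", "2023-01-02", "2023-01-01"],
})
result = expand_leave(df)
print(result)
```

With the sample data this should give seven rows, the open-ended Holiday row ending on 2023-03-30. Passing `cutoff="2023-03-31"` would reproduce the eight-row output shown in the question instead.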
Assuming you pasted an extra row (row 5, dated 2023-03-31) into the expected output by mistake, you can also try this:
import pandas as pd
from datetime import timedelta, datetime
# create the dataframe
df = pd.DataFrame({'ID': [1, 1, 2, 3],
'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01']})
# convert date columns to datetime format
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# fill in blank end dates with a maximum date value
df['End Date'] = df['End Date'].fillna(datetime(2023, 3, 30))
# create an empty list to store new rows
create_rows = []
# loop through each row in the dataframe
for index, row in df.iterrows():
    # if the start and end dates are not the same, add rows for each day in between
    if row['Start Date'] != row['End Date']:
        delta = row['End Date'] - row['Start Date']
        for i in range(delta.days + 1):
            date = row['Start Date'] + timedelta(days=i)
            create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': date, 'End Date': date})
    # if the start and end dates are the same, append the original row
    else:
        create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': row['Start Date'], 'End Date': row['End Date']})
# create a new dataframe with the original rows and the new rows
output_df = pd.DataFrame(create_rows)
# sort the dataframe by ID and Start Date
output_df = output_df.sort_values(['ID', 'Start Date'])
# reset the index
output_df = output_df.reset_index(drop=True)
print(output_df)
You can use:
# ensure datetime and fill NA with default date
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime)
df['End Date'] = df['End Date'].fillna('2023-03-30')
# repeat index and create output
idx = df.index.repeat(df['End Date'].sub(df['Start Date']).dt.days.add(1))
out = df.loc[idx]
# increment days
out['Start Date'] += pd.to_timedelta(out.groupby(level=0).cumcount().to_numpy(), unit='D')
out['End Date'] = out['Start Date']
Output:
ID Leave Type Start Date End Date
0 1 Sick 2022-01-01 2022-01-01
1 1 Holiday 2023-03-28 2023-03-28
1 1 Holiday 2023-03-29 2023-03-29
1 1 Holiday 2023-03-30 2023-03-30
2 2 Holiday 2023-01-01 2023-01-01
2 2 Holiday 2023-01-02 2023-01-02
3 3 Work 2023-01-01 2023-01-01
Reproducible input:
df = pd.DataFrame({'ID': [1, 1, 2, 3],
'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})
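Whichever approach is used, the result can be sanity-checked against the input. Here is a minimal, self-contained sketch that reuses the index-repeat idea and then verifies two things: every output row has matching Start Date and End Date, and each input row produced as many rows as the days it covers. The names `expected_days`, `same_day` and `counts_match` are hypothetical, and the 2023-03-30 cutoff is the one assumed from the question.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})

# ensure datetime and fill missing end dates with the assumed cutoff
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime)
df['End Date'] = df['End Date'].fillna(pd.Timestamp('2023-03-30'))

# number of calendar days each input row covers, both ends included
expected_days = df['End Date'].sub(df['Start Date']).dt.days.add(1)

# expand by repeating each index label once per covered day
out = df.loc[df.index.repeat(expected_days)].copy()
offsets = pd.to_timedelta(out.groupby(level=0).cumcount().to_numpy(), unit='D')
out['Start Date'] = out['Start Date'] + offsets
out['End Date'] = out['Start Date']

# check 1: every output row is a single day
same_day = out['Start Date'].eq(out['End Date']).all()
# check 2: each input row produced the expected number of output rows
counts_match = (out.groupby(level=0).size().to_numpy() == expected_days.to_numpy()).all()
print(same_day, counts_match)
```

Both checks should print as true for the sample data. The `.copy()` after `df.loc[...]` is there so the later column assignments act on an independent frame rather than on a possible view of `df`.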