Creating additional rows in a pandas DataFrame from date columns



Suppose I have the following data:

ID  Leave Type  Start Date    End Date
1   Sick        2022-01-01    2022-01-01
1   Holiday     2023-03-28  
2   Holiday     2023-01-01    2023-01-02
3   Work        2023-01-01    2023-01-01

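I need a way to check whether Start Date and End Date hold the same value. If they differ, it should count how many days End Date is ahead of Start Date and create one row for each of those days, adding 1 day per row, so that Start Date and End Date always match within a row. If End Date is empty, rows should be created up to 2023-03-30. The expected output is: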

ID  Leave Type  Start Date    End Date
1   Sick        2022-01-01    2022-01-01
1   Holiday     2023-03-28    2023-03-28
1   Holiday     2023-03-29    2023-03-29
1   Holiday     2023-03-30    2023-03-30
1   Holiday     2023-03-31    2023-03-31
2   Holiday     2023-01-01    2023-01-01
2   Holiday     2023-01-02    2023-01-02
3   Work        2023-01-01    2023-01-01

Thanks!

import pandas as pd
from pandas.tseries.offsets import MonthEnd
df = pd.DataFrame({'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01'],
})
# Convert the 'Start Date' and 'End Date' columns to datetime; blank strings become NaT
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime, errors='coerce')
# Fill NaT values with the last day of the month
df['End Date'] = df['End Date'].fillna(df['Start Date'] + MonthEnd(0))
# Replace 'End Date' values with lists of daily dates from start to end
df['End Date'] = [pd.date_range(s, e, freq='D').tolist() for s, e in zip(df['Start Date'], df['End Date'])]
# Explode the list
df = df.explode('End Date')
df['Start Date'] = df['End Date']
print(df)

Result:

Leave Type Start Date   End Date
0       Sick 2022-01-01 2022-01-01
1    Holiday 2023-03-28 2023-03-28
1    Holiday 2023-03-29 2023-03-29
1    Holiday 2023-03-30 2023-03-30
1    Holiday 2023-03-31 2023-03-31
2    Holiday 2023-01-01 2023-01-01
2    Holiday 2023-01-02 2023-01-02
3       Work 2023-01-01 2023-01-01
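
The snippet above fills blank end dates with the month end (2023-03-31) and leaves out the ID column. As a rough sketch of the same explode idea adapted to the 2023-03-30 cutoff stated in the question, something like the following might work. The variable name `cutoff` and the conversion back to datetime after `explode` are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data from the question, including the ID column
df = pd.DataFrame({
    'ID': [1, 1, 2, 3],
    'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
    'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
    'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01'],
})

# Assumption: open-ended leave stops at the cutoff named in the question
cutoff = pd.Timestamp('2023-03-30')

df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date']).fillna(cutoff)

# Build one list of daily timestamps per row
df['Start Date'] = pd.Series(
    [pd.date_range(s, e, freq='D').tolist()
     for s, e in zip(df['Start Date'], df['End Date'])],
    index=df.index,
)

# One output row per list element
out = df.explode('Start Date', ignore_index=True)

# explode() leaves an object column, so convert it back to datetime64
out['Start Date'] = pd.to_datetime(out['Start Date'])
out['End Date'] = out['Start Date']
print(out)
```

With the cutoff at 2023-03-30 this yields seven rows rather than the eight shown above, because the 2023-03-31 row is no longer generated.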

Assuming you pasted an extra row (row 5, dated 2023-03-31) into your expected output by mistake, you could also try this:

import pandas as pd
from datetime import timedelta, datetime
# create the dataframe
df = pd.DataFrame({'ID': [1, 1, 2, 3], 
'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'], 
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'], 
'End Date': ['2022-01-01', '', '2023-01-02', '2023-01-01']})
# convert date columns to datetime format
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# fill in blank end dates with a maximum date value
df['End Date'] = df['End Date'].fillna(datetime(2023, 3, 30))
# create an empty list to store new rows
create_rows = []
# loop through each row in the dataframe
for index, row in df.iterrows():

    # if the start and end dates are not the same, add rows for each day in between
    if row['Start Date'] != row['End Date']:
        delta = row['End Date'] - row['Start Date']
        for i in range(delta.days + 1):
            date = row['Start Date'] + timedelta(days=i)
            create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': date, 'End Date': date})

    # if the start and end dates are the same, append the original row
    else:
        create_rows.append({'ID': row['ID'], 'Leave Type': row['Leave Type'], 'Start Date': row['Start Date'], 'End Date': row['End Date']})

# create a new dataframe with the original rows and the new rows
output_df = pd.DataFrame(create_rows)
# sort the dataframe by ID and Start Date
output_df = output_df.sort_values(['ID', 'Start Date'])
# reset the index
output_df = output_df.reset_index(drop=True)
print(output_df)
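
Because the loop relies on the inclusive count `delta.days + 1`, it can be worth checking that an expanded frame really has one single-day row for every day in the original ranges. The helper below is only a sketch for that purpose; the name `check_daily_expansion` and the small sample frames are hypothetical and not part of the answer above.

```python
import pandas as pd

def check_daily_expansion(original, expanded):
    """Return True if `expanded` has one single-day row per day covered by `original`."""
    # Every expanded row should start and end on the same day
    single_day = (expanded['Start Date'] == expanded['End Date']).all()
    # Inclusive day count per original row, e.g. 03-28 to 03-30 covers 3 days
    expected_rows = ((original['End Date'] - original['Start Date']).dt.days + 1).sum()
    return bool(single_day and len(expanded) == expected_rows)

# Two date ranges: three days and two days
original = pd.DataFrame({
    'Start Date': pd.to_datetime(['2023-03-28', '2023-01-01']),
    'End Date': pd.to_datetime(['2023-03-30', '2023-01-02']),
})

# The daily expansion those ranges should produce
expanded = pd.DataFrame({
    'Start Date': pd.to_datetime(['2023-03-28', '2023-03-29', '2023-03-30',
                                  '2023-01-01', '2023-01-02']),
})
expanded['End Date'] = expanded['Start Date']

print(check_daily_expansion(original, expanded))  # True
```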

You can use:

# ensure datetime and fill NA with default date
df[['Start Date', 'End Date']] = df[['Start Date', 'End Date']].apply(pd.to_datetime)
df['End Date'] = df['End Date'].fillna(pd.Timestamp('2023-03-30'))
# repeat index and create output (copy so the assignments below do not hit a view)
idx = df.index.repeat(df['End Date'].sub(df['Start Date']).dt.days.add(1))
out = df.loc[idx].copy()
# increment days by each row's position within its original row
out['Start Date'] += pd.to_timedelta(out.groupby(level=0).cumcount().to_numpy(), unit='D')
out['End Date'] = out['Start Date']

Output:

ID Leave Type Start Date   End Date
0   1       Sick 2022-01-01 2022-01-01
1   1    Holiday 2023-03-28 2023-03-28
1   1    Holiday 2023-03-29 2023-03-29
1   1    Holiday 2023-03-30 2023-03-30
2   2    Holiday 2023-01-01 2023-01-01
2   2    Holiday 2023-01-02 2023-01-02
3   3       Work 2023-01-01 2023-01-01

Reproducible input:

df = pd.DataFrame({'ID': [1, 1, 2, 3],
'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})
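
If the cutoff date needs to vary, the index-repeat approach can be wrapped in a function. This is a minimal sketch under that assumption; the function name `expand_to_daily` and its `default_end` parameter are hypothetical names introduced here.

```python
import pandas as pd

def expand_to_daily(df, default_end='2023-03-30'):
    """Return one row per day between 'Start Date' and 'End Date', inclusive."""
    out = df.copy()
    out['Start Date'] = pd.to_datetime(out['Start Date'])
    # Missing end dates fall back to default_end
    out['End Date'] = pd.to_datetime(out['End Date']).fillna(pd.Timestamp(default_end))
    # Number of days each row covers, counting both endpoints
    n_days = (out['End Date'] - out['Start Date']).dt.days + 1
    out = out.loc[out.index.repeat(n_days)].copy()
    # Position of each repeated row within its original row: 0, 1, 2, ...
    offset = out.groupby(level=0).cumcount().to_numpy()
    out['Start Date'] = out['Start Date'] + pd.to_timedelta(offset, unit='D')
    out['End Date'] = out['Start Date']
    return out.reset_index(drop=True)

df = pd.DataFrame({'ID': [1, 1, 2, 3],
                   'Leave Type': ['Sick', 'Holiday', 'Holiday', 'Work'],
                   'Start Date': ['2022-01-01', '2023-03-28', '2023-01-01', '2023-01-01'],
                   'End Date': ['2022-01-01', None, '2023-01-02', '2023-01-01']})

result = expand_to_daily(df)
print(result)
```

Note that `Index.repeat` raises an error for negative counts, so a row whose End Date falls before its Start Date would need to be handled before calling a function like this.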
