什么是有效的熊猫方法来重新索引此轮班时间表?



我有一个熊猫数据帧,它代表一整年的轮班时间表,给定如下:


January 2019    1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
Shift A         1   1   1   0   0   0   2   2   0   0   1   1   1   1   0   2   2   0   0   0   0   0   0   0   2   2   2   0   1   1   1
Shift B         0   2   2   0   0   0   0   0   0   0   2   2   2   0   1   1   1   0   0   0   2   2   0   0   1   1   1   1   0   2   2
Shift C         0   0   0   2   2   2   0   1   1   1   0   0   0   2   2   0   0   1   1   1   1   0   2   2   0   0   0   0   0   0   0
Shift D         2   0   0   1   1   1   1   0   2   2   0   0   0   0   0   0   0   2   2   2   0   1   1   1   0   0   0   2   2   0   0
February 2019   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  nan nan nan
Shift A         0   0   0   2   2   0   0   1   1   1   1   0   2   2   0   0   0   0   0   0   0   2   2   2   0   1   1   1   nan nan nan
Shift B         0   0   0   0   0   0   0   2   2   2   0   1   1   1   0   0   0   2   2   0   0   1   1   1   1   0   2   2   nan nan nan
Shift C         2   2   2   0   1   1   1   0   0   0   2   2   0   0   1   1   1   1   0   2   2   0   0   0   0   0   0   0   nan nan nan
Shift D         1   1   1   1   0   2   2   0   0   0   0   0   0   0   2   2   2   0   1   1   1   0   0   0   2   2   0   0   nan nan nan

其中 1 代表白班 (06:00 - 18:00(,2 代表夜班 (18:00 - 06:00(,0 可以忽略。只有一个班次团队将在给定的时间段内工作。

我需要数据格式,其中数据由 DateTime 戳与当前工作班次进行索引,例如:

DateTime Shift
0 2019-01-01 06:00:00 A
1 2019-01-01 18:00:00 D
2 2019-01-02 06:00:00 A
3 2019-01-02 18:00:00 B
4 2019-01-03 06:00:00 A
5 2019-01-03 18:00:00 B
.
.
.

什么是最有效的Pandas方法来重新索引数据来实现这一点,即避免for循环?

使用:

#get first column by position
first = df.iloc[:, 0]
#convert column to datetimes with missing values for no datetimes values
dates = pd.to_datetime(first, errors='coerce')
#mask for data row
mask = dates.isna()
#forward filling missing values and replace first NaNs by first column name
df.index = dates.ffill().fillna(pd.to_datetime(first.name))
#filter out rows with datetimes in first column, add first column to index
df = df[mask.values].set_index(first.name, append=True)
#convert columns names to timedeltas in days, first is 0 days
df.columns = pd.to_timedelta(df.columns.astype(int) - 1, unit='D')
#dictionary for map 1, 2 values
mapp = {1: pd.Timedelta('06:00:00'), 2:pd.Timedelta('18:00:00')}
#remove 0 rows with convert to NaN by mask and reshape by stack
#map by dict and convert MultiIndex to columns
df = (df.mask(df == 0)
.stack()
.map(mapp)
.rename_axis(('Datetime','Shift', 'day'))
.reset_index(name='td')
)
#add days to hours and add to Datetime
df['Datetime'] += (df.pop('td') + df.pop('day'))
#sorting ans create default index
df = df.sort_values(['Datetime','Shift']).reset_index(drop=True)

print (df)
Datetime    Shift
0   2019-01-01 06:00:00  Shift A
1   2019-01-01 18:00:00  Shift D
2   2019-01-02 06:00:00  Shift A
3   2019-01-02 18:00:00  Shift B
4   2019-01-03 06:00:00  Shift A
..                  ...      ...
113 2019-02-26 18:00:00  Shift D
114 2019-02-27 06:00:00  Shift A
115 2019-02-27 18:00:00  Shift B
116 2019-02-28 06:00:00  Shift A
117 2019-02-28 18:00:00  Shift B
[118 rows x 2 columns]

你可以使用numpy来做到这一点,如下所示

生成虚拟数据

df = pd.DataFrame([['January 2019', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0], ['Shift A', 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0], ['Shift B', 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0], ['Shift C', 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], ['Shift D', 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0], ['February 2019', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, np.nan, np.nan, np.nan], ['Shift A', 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, np.nan, np.nan, np.nan], ['Shift B', 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, np.nan, np.nan, np.nan], ['Shift C', 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, np.nan, np.nan], ['Shift D', 1.0, 1.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, np.nan, np.nan, np.nan]], columns=('days', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31'))
df

溶液

import pandas as pd
import numpy as np
## some constants
TEAMS = np.array("A,B,C,D".split(","))
TEAM_COUNT=len(TEAMS)
MAP = {1: pd.Timedelta('06:00:00'), 2:pd.Timedelta('18:00:00')}
## first transpose the table and get numpy array rows except heading
arr = df.copy().T.values[1:, :].astype(np.float)
## break array in chunk for each month and create long array
arr = np.c_[[arr[:,sec+1:sec+(TEAM_COUNT+1)] 
for sec in np.arange(0,arr.shape[1], (TEAM_COUNT+1))]]
## remove data with nan and flatten the array to get one row for each team
arr = arr[~np.isnan(arr)]
## get date range from start to end dates and repeate that for team size
date_index = pd.date_range("jan-1-2019", "feb-28-2019").repeat(TEAM_COUNT)
## create dataframe 
df2 = pd.DataFrame(arr, index=date_index, columns=["shift"])
## set team column for each row
df2["team"] = np.tile(TEAMS, len(df2)//len(TEAMS))
## map shift to time delta
df2 = df2[df2["shift"]!=0]
df2["shift"] = df2["shift"].map(MAP)
df2

结果


shift   team
2019-01-01  06:00:00    A
2019-01-01  18:00:00    D
2019-01-02  06:00:00    A
2019-01-02  18:00:00    B
2019-01-03  06:00:00    A
... ... ...
2019-02-26  18:00:00    D
2019-02-27  06:00:00    A
2019-02-27  18:00:00    B
2019-02-28  06:00:00    A
2019-02-28  18:00:00    B

最新更新