Pandas: how to create an algorithm that improves my result and creates new columns

It's a bit complicated. I have this DataFrame:

ID           TimeandDate        Date       Time
10   2020-08-07 07:40:09  2022-08-07   07:40:09
10   2020-08-07 08:50:00  2022-08-07   08:50:00
10   2020-08-07 12:40:09  2022-08-07   12:40:09
10   2020-08-08 07:40:09  2022-08-08   07:40:09
10   2020-08-08 17:40:09  2022-08-08   17:40:09
12   2020-08-07 08:03:09  2022-08-07   08:03:09
12   2020-08-07 10:40:09  2022-08-07   10:40:09
12   2020-08-07 14:40:09  2022-08-07   14:40:09
12   2020-08-07 16:40:09  2022-08-07   16:40:09
13   2020-08-07 09:22:45  2022-08-07   09:22:45
13   2020-08-07 17:57:06  2022-08-07   17:57:06

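I want to create a new DataFrame with two columns. The first is df["Check-in"]. As you can see, my data has no indicator showing when an ID checked in, so I assume that the first timestamp for each ID is a check-in and the next row is the check-out, which goes into df["Check-out"]. If a check-in has no check-out time, it must be registered as the check-out of the previous check-in on the same day.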

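I tried the code below, but I'm afraid it isn't good enough, because it only looks at the first and last timestamps. Imagine ID=13 comes in at 07:40:09 and checks out at 08:40:09, then returns later that day at 19:20:00 and leaves 10 minutes later at 19:30:00. Done this way, it would show that he worked about 12 hours.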

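import pandas as pd

group = df.groupby(['ID', 'Date'])

def TimeDifference(df):
    # `in` is a reserved word in Python, so use descriptive names instead
    check_in = df['TimeandDate'].min()
    check_out = df['TimeandDate'].max()
    # Last minus first timestamp of the day for this ID
    df2 = pd.DataFrame([check_out - check_in], columns=['TimeDiff'])
    return df2

group.apply(TimeDifference)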
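
To make the problem concrete, here is a minimal sketch of the ID=13 scenario described above. The date is an assumption (the scenario only gives times), and `stamps`, `span` and `worked` are illustrative names. It compares the first-to-last span with the sum of the individual in/out pairs.

```python
import pandas as pd

# Hypothetical timestamps for the ID=13 scenario; the date itself is assumed.
stamps = pd.to_datetime([
    "2020-08-07 07:40:09",  # check-in
    "2020-08-07 08:40:09",  # check-out
    "2020-08-07 19:20:00",  # check-in
    "2020-08-07 19:30:00",  # check-out
])

# What min()/max() per day measures: the span from first to last timestamp.
span = stamps.max() - stamps.min()

# Sum of the pairs: positions 0, 2, ... are check-ins, positions 1, 3, ... are check-outs.
worked = (stamps[1::2] - stamps[0::2]).sum()

print(span)    # 0 days 11:49:51
print(worked)  # 0 days 01:10:00
```

The span is close to 12 hours, while the time actually spent between check-ins and check-outs is 1 hour 10 minutes.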

Desired result:

ID         Date   Check-in    Check-out
10   2020-08-07   07:40:09     12:40:09
10   2020-08-08   07:40:09     17:40:09
12   2020-08-07   08:03:09     10:40:09
12   2020-08-07   14:40:09     16:40:09 
13   2020-08-07   09:22:45     17:57:06

Thanks!

If I understand correctly, you can do the following:

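import pandas as pd

# Parse the timestamps and use them as the index
df["TimeandDate"] = pd.to_datetime(df["TimeandDate"])
df.set_index("TimeandDate", inplace=True)

# Earliest and latest value per ID and calendar day (to_markdown() needs the tabulate package)
print(df.groupby([df["ID"], df.index.year, df.index.month, df.index.day]).agg(["min", "max"]).to_markdown())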

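Output:

                  ('Date', 'min')  ('Date', 'max')  ('Time', 'min')  ('Time', 'max')
(10, 2020, 8, 7)       2022-08-07       2022-08-07         07:40:09         12:40:09
(10, 2020, 8, 8)       2022-08-08       2022-08-08         07:40:09         17:40:09
(12, 2020, 8, 7)       2022-08-07       2022-08-07         08:03:09         16:40:09
(13, 2020, 8, 7)       2022-08-07       2022-08-07         09:22:45         17:57:06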

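This approach is verbose and not fast, but it may solve the problem for now.

I first assign a pair label (a suffix) to the rows of each ID/Date group, then check whether there is a check-in without a check-out (if the number of rows is odd, a check-out is missing).

The output matches the one you want.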

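import numpy as np
import pandas as pd

# Assumes df is sorted by ID, Date and Time, so that new_col lines up with the rows
new_col = []
for i in df.ID.unique():
    for d in df.Date.unique():
        p = df.loc[(df.ID == i) & (df.Date == d)]
        # Pair labels 1, 1, 2, 2, ... truncated to the group length
        suffix = sorted(list(range(1, len(p) + 1)) * 2)[:len(p)]
        if len(suffix) % 2 != 0 and len(suffix) > 1:
            # Odd count: drop the second-to-last row and attach the last row to the previous pair
            suffix[-2] = np.nan
            suffix[-1] -= 1
        new_col.extend(suffix)
df['new'] = new_col

# Rows labelled NaN are dropped; min and max of each pair are its check-in and check-out
df.dropna().groupby(['ID', 'Date', 'new'], as_index=False).agg({'Time': ['min', 'max']}).drop('new', axis=1, level=0)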
Output:
   ID        Date      Time
                        min       max
0  10  2022-08-07  07:40:09  12:40:09
1  10  2022-08-08  07:40:09  17:40:09
2  12  2022-08-07  08:03:09  10:40:09
3  12  2022-08-07  14:40:09  16:40:09
4  13  2022-08-07  09:22:45  17:57:06
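
Since the loop above visits every ID/Date combination, a loop-free variant of the same pairing idea may be worth trying. The following is only a sketch, not the answerer's method: it rebuilds the sample data from the question, and the helper names (`pos`, `size`, `pair`, `kind`) are made up for illustration. It relies on `groupby().cumcount()` to number the rows inside each group.

```python
import pandas as pd

# Sample data copied from the question (ID, Date and Time columns only).
df = pd.DataFrame({
    "ID":   [10, 10, 10, 10, 10, 12, 12, 12, 12, 13, 13],
    "Date": ["2022-08-07"] * 3 + ["2022-08-08"] * 2 + ["2022-08-07"] * 6,
    "Time": ["07:40:09", "08:50:00", "12:40:09", "07:40:09", "17:40:09",
             "08:03:09", "10:40:09", "14:40:09", "16:40:09",
             "09:22:45", "17:57:06"],
})

df = df.sort_values(["ID", "Date", "Time"]).reset_index(drop=True)
grp = df.groupby(["ID", "Date"])
pos = grp.cumcount()                    # 0-based position inside each ID/Date group
size = grp["Time"].transform("size")    # number of rows in that group

# In an odd-sized group with more than one row, drop the second-to-last row,
# so the last row becomes the check-out of the previous check-in.
to_drop = (size % 2 == 1) & (size > 1) & (pos == size - 2)
kept = df[~to_drop].copy()

# Re-number the remaining rows: (0, 1) is the first pair, (2, 3) the second, ...
kept_pos = kept.groupby(["ID", "Date"]).cumcount()
kept["pair"] = kept_pos // 2
kept["kind"] = kept_pos.mod(2).map({0: "Check-in", 1: "Check-out"})

# One row per pair, with check-in and check-out side by side.
result = (kept.pivot(index=["ID", "Date", "pair"], columns="kind", values="Time")
              .reset_index()
              .drop(columns="pair"))
result.columns.name = None
print(result)
```

Passing a list as `index` to `pivot` needs a reasonably recent pandas version. A group with a single row would end up with a check-in and an empty check-out.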

Trying a different approach:

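import numpy as np
import pandas as pd

df = df[['ID', 'Date', 'Time']]

def check(x):
    # For an odd number of rows, drop the second-to-last one so the last row
    # closes the previous pair (assumes every group has at least two rows)
    x = x.reset_index(drop=True)
    if len(x) % 2 != 0:
        x = x.drop(len(x) - 2)
    # Return a plain list so agg() stores one list of times per group
    return x.tolist()

# One list of times per ID/Date, then back to one row per time
g = df.groupby(['ID', 'Date'], as_index=False).agg(check).explode('Time').reset_index(drop=True)

# Every group now has an even length, so even rows are check-ins and odd rows are check-outs
g['in'] = np.where(g.index % 2 == 0, g['Time'], np.nan)
g['out'] = np.where(g.index % 2 != 0, g['Time'], np.nan)

out = g.groupby(['ID', 'Date'], as_index=False).agg(list)
out['in'] = out['in'].apply(lambda x: [i for i in x if str(i) != "nan"])
out['out'] = out['out'].apply(lambda x: [i for i in x if str(i) != "nan"])
out[['ID', 'Date', 'in', 'out']].explode(['in', 'out']).reset_index(drop=True)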

Output:

   ID        Date        in       out
0  10  2022-08-07  07:40:09  12:40:09
1  10  2022-08-08  07:40:09  17:40:09
2  12  2022-08-07  08:03:09  10:40:09
3  12  2022-08-07  14:40:09  16:40:09
4  13  2022-08-07  09:22:45  17:57:06
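
Once the pairs are available, the time worked that the question was after can be derived from them. This is a minimal sketch that re-types the pairs from the output above; `pairs`, `worked` and `totals` are illustrative names, and it assumes the times are HH:MM:SS strings.

```python
import pandas as pd

# Check-in/check-out pairs as listed in the output above.
pairs = pd.DataFrame({
    "ID":   [10, 10, 12, 12, 13],
    "Date": ["2022-08-07", "2022-08-08", "2022-08-07", "2022-08-07", "2022-08-07"],
    "in":   ["07:40:09", "07:40:09", "08:03:09", "14:40:09", "09:22:45"],
    "out":  ["12:40:09", "17:40:09", "10:40:09", "16:40:09", "17:57:06"],
})

# Convert the HH:MM:SS strings to Timedelta so they can be subtracted.
pairs["worked"] = pd.to_timedelta(pairs["out"]) - pd.to_timedelta(pairs["in"])

# Total per ID and day; the two visits of ID 12 are added together.
totals = pairs.groupby(["ID", "Date"], as_index=False)["worked"].sum()
print(totals)
```

For ID 12 this gives 4 hours 37 minutes, whereas a first-to-last difference would report 8 hours 37 minutes.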
