查找具有两个类别的所有ID并计算时间差



我有一个pandas数据帧,看起来像这样:

日期
ID 类别
1 A
1 A
1 B 1/2/22上午10:14:12
2 A 1/1/22上午10:14:12
3 A 1/2/22上午10:14:12
3 B 1/1/22上午10:14:12
3 B

这里有一种方法可以实现

# convert the date to datetime (unless its already datetime)
df['Date'] = pd.to_datetime(df['Date'])
# sort and drop duplicates, keeping latest
df=df.sort_values(['ID', 'Category','Date']).drop_duplicates(subset=['ID','Category','Date'], keep='last')
# pivot to put 'A' and 'B' as two columns
df2=df.pivot(index='ID', columns='Category', values='Date').reset_index()
# eliminate the rows where either of the two column (dates) are null
df2.dropna(inplace=True)

# this to get rid of spaces from Category, unless its already stripped of whitespaces characters
df2.columns = [col.strip() for col in df2.columns]

# calculate the difference
df2['time_difference']= df2['A'].sub(df2['B']).dt.total_seconds()/3600
df2

ID  A   B   time_difference
0   1   2022-01-01 10:14:12     2022-01-02 10:14:12     -24.0
2   3   2022-01-02 10:14:12     2022-01-01 10:14:12     24.0

使用.groupby.diff.assign

df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["ID", "Date"], ascending=[1, 1])
df = (
df
.assign(DeltaHours=df.groupby(["ID"]).Date.diff().dt.days * 24)
.dropna()
.reset_index(drop=True)[["ID", "DeltaHours"]]
)
print(df)
ID  DeltaHours
1   1  24.0
3   3  24.0

使用pivot:

out = (df
.assign(Date=pd.to_datetime(df['Date']))
.pivot_table(index='ID', columns='Category', values='Date', aggfunc='max')
.pipe(lambda d: d['A'].sub(d['B']).abs().dt.total_seconds().div(3600))
.dropna()
.reset_index(name='Time Difference')
)

输出:

ID  Time Difference
0   1             24.0
1   3             24.0

另一种可能的解决方案,基于pandas.DataFrame.groupby:

df['Date'] = pd.to_datetime(df['Date'])
g = df.groupby('ID')
(g.max()['Date'].sub(g.min()['Date']).dt.total_seconds().div(3600)
[g['ID'].count().ne(1)].reset_index())

输出:

ID  time_diff
0   1       24.0
1   3       24.0

最新更新