我有一个pandas数据帧,看起来像这样:
ID | 类别 | 日期||
---|---|---|---|
1 | A | ||
1 | A | ||
1 | B | 1/2/22上午10:14:12 | |
2 | A | 1/1/22上午10:14:12 | |
3 | A | 1/2/22上午10:14:12 | |
3 | B | 1/1/22上午10:14:12 | |
3 | B |
这里有一种方法可以实现
# convert the date to datetime (unless its already datetime)
df['Date'] = pd.to_datetime(df['Date'])
# sort and drop duplicates, keeping latest
df=df.sort_values(['ID', 'Category','Date']).drop_duplicates(subset=['ID','Category','Date'], keep='last')
# pivot to put 'A' and 'B' as two columns
df2=df.pivot(index='ID', columns='Category', values='Date').reset_index()
# eliminate the rows where either of the two column (dates) are null
df2.dropna(inplace=True)
# this to get rid of spaces from Category, unless its already stripped of whitespaces characters
df2.columns = [col.strip() for col in df2.columns]
# calculate the difference
df2['time_difference']= df2['A'].sub(df2['B']).dt.total_seconds()/3600
df2
ID A B time_difference
0 1 2022-01-01 10:14:12 2022-01-02 10:14:12 -24.0
2 3 2022-01-02 10:14:12 2022-01-01 10:14:12 24.0
使用.groupby.diff.assign
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(["ID", "Date"], ascending=[1, 1])
df = (
df
.assign(DeltaHours=df.groupby(["ID"]).Date.diff().dt.days * 24)
.dropna()
.reset_index(drop=True)[["ID", "DeltaHours"]]
)
print(df)
ID DeltaHours
1 1 24.0
3 3 24.0
使用pivot
:
out = (df
.assign(Date=pd.to_datetime(df['Date']))
.pivot_table(index='ID', columns='Category', values='Date', aggfunc='max')
.pipe(lambda d: d['A'].sub(d['B']).abs().dt.total_seconds().div(3600))
.dropna()
.reset_index(name='Time Difference')
)
输出:
ID Time Difference
0 1 24.0
1 3 24.0
另一种可能的解决方案,基于pandas.DataFrame.groupby
:
df['Date'] = pd.to_datetime(df['Date'])
g = df.groupby('ID')
(g.max()['Date'].sub(g.min()['Date']).dt.total_seconds().div(3600)
[g['ID'].count().ne(1)].reset_index())
输出:
ID time_diff
0 1 24.0
1 3 24.0