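I have a dataframe with several timestamps for each user, and I want to calculate the duration between consecutive timestamps. I import the CSV files with the following code: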
import pandas as pd
import glob
path = r'C:Users...Desktop'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, encoding='ISO-8859-1')
    li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
df.head()
ID timestamp
1828765 31-05-2021 22:27:03
1828765 31-05-2021 22:27:12
1828765 31-05-2021 22:27:13
1828765 31-05-2021 22:27:34
2056557 21-07-2021 10:27:12
2056557 21-07-2021 10:27:20
2056557 21-07-2021 10:27:22
I would like to get something like this:
ID timestamp duration(s)
1828765 31-05-2021 22:27:03 NAN
1828765 31-05-2021 22:27:12 9
1828765 31-05-2021 22:27:13 1
1828765 31-05-2021 22:27:34 21
2056557 21-07-2021 10:27:12 NAN
2056557 21-07-2021 10:27:20 8
2056557 21-07-2021 10:27:22 2
I tried this code, but it doesn't work for me:
import datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], format = "%d-%m-%Y %H:%M:%S")
df['time_diff'] = 0
for i in range(df.shape[0] - 1):
    df['time_diff'][i+1] = (datetime.datetime.min + (df['timestamp'][i+1] - df['timestamp'][i])).time()
An operation applied separately to each group of values is a GroupBy operation in pandas.
pandas natively supports arithmetic on timestamps, so subtracting one timestamp from another gives the correct duration between them.
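As a small illustration of that timestamp arithmetic, here is a minimal sketch using two of the values from the question. The variable names `start`, `end`, and `delta` are only illustrative.

```python
import pandas as pd

# Two timestamps taken from the sample data in the question
start = pd.Timestamp("2021-05-31 22:27:03")
end = pd.Timestamp("2021-05-31 22:27:12")

# Subtracting two Timestamps yields a pandas Timedelta
delta = end - start
print(delta)                  # 0 days 00:00:09
print(delta.total_seconds())  # 9.0
```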
The timestamp column has already been successfully converted to datetime64[ns]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
Now we can use GroupBy.diff to take the difference between consecutive rows within each ID:
df['duration'] = df.groupby('ID')['timestamp'].diff()
df
ID timestamp duration
0 1828765 2021-05-31 22:27:03 NaT
1 1828765 2021-05-31 22:27:12 0 days 00:00:09
2 1828765 2021-05-31 22:27:13 0 days 00:00:01
3 1828765 2021-05-31 22:27:34 0 days 00:00:21
4 2056557 2021-07-21 10:27:12 NaT
5 2056557 2021-07-21 10:27:20 0 days 00:00:08
6 2056557 2021-07-21 10:27:22 0 days 00:00:02
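Note that the diff is taken between consecutive rows in the order they appear within each group. The sample data is already chronological, but files concatenated with glob may not be. A minimal sketch, assuming the rows can arrive out of order (the shuffled data below is made up for illustration), sorts by ID and timestamp first:

```python
import pandas as pd

# Hypothetical data: IDs from the question, with rows deliberately out of order
df = pd.DataFrame({
    'ID': [1828765, 2056557, 1828765, 2056557],
    'timestamp': ['31-05-2021 22:27:12', '21-07-2021 10:27:20',
                  '31-05-2021 22:27:03', '21-07-2021 10:27:12'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")

# Sort so that each ID's rows are chronological before taking differences
df = df.sort_values(['ID', 'timestamp'], ignore_index=True)
df['duration'] = df.groupby('ID')['timestamp'].diff()
print(df)
```

Without the sort, the differences for this data would come out negative, because the later timestamp of each ID appears first.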
If we want the duration in seconds, we can use Series.dt.total_seconds:
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
df
ID timestamp duration (s)
0 1828765 2021-05-31 22:27:03 NaN
1 1828765 2021-05-31 22:27:12 9.0
2 1828765 2021-05-31 22:27:13 1.0
3 1828765 2021-05-31 22:27:34 21.0
4 2056557 2021-07-21 10:27:12 NaN
5 2056557 2021-07-21 10:27:20 8.0
6 2056557 2021-07-21 10:27:22 2.0
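The desired output in the question shows whole numbers (9, 1, 21) rather than floats. One possible way to get that, sketched here under the assumption that sub-second precision is not needed, is to round and cast to the nullable `Int64` dtype, which can hold missing values alongside integers. The shortened data and the intermediate name `seconds` are illustrative only.

```python
import pandas as pd

# Shortened version of the sample data from the question
df = pd.DataFrame({
    'ID': [1828765, 1828765, 2056557, 2056557],
    'timestamp': pd.to_datetime(
        ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
         '21-07-2021 10:27:12', '21-07-2021 10:27:20'],
        format="%d-%m-%Y %H:%M:%S"),
})

seconds = df.groupby('ID')['timestamp'].diff().dt.total_seconds()

# Round to whole seconds, then cast to nullable Int64 so the first row
# of each group stays missing (<NA>) instead of forcing a float column
df['duration (s)'] = seconds.round().astype('Int64')
print(df)
```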
Complete working example:
import pandas as pd
df = pd.DataFrame({
    'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
    'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
                  '31-05-2021 22:27:13', '31-05-2021 22:27:34',
                  '21-07-2021 10:27:12', '21-07-2021 10:27:20',
                  '21-07-2021 10:27:22']
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
print(df)