正在创建事务数据序列



我有一个类似的数据帧

ID  time    Item    Status  Combined
1   4/29/20 20:32   A   OK  A_OK
1   4/29/20 20:32   A   OK  A_OK
1   4/29/20 20:32   A   OK  A_OK
1   4/29/20 20:32   A   OK  A_OK
1   4/29/20 20:32   A   FAIL    A_FAIL
1   4/29/20 20:32   A   FAIL    A_FAIL
1   4/29/20 20:34   B   OK  B_OK
1   4/29/20 20:53   A   OK  A_OK
1   4/29/20 20:53   A   OK  A_OK
1   4/29/20 20:58   C   OK  C_OK
2   5/30/20 22:32   A   OK  A_OK
2   5/30/20 22:32   A   OK  A_OK
2   5/30/20 22:32   A   OK  A_OK
2   5/30/20 22:32   A   FAIL    A_FAIL
2   5/30/20 22:32   B   OK  B_OK
2   5/30/20 22:32   B   OK  B_OK
2   4/29/20 20:53   A   OK  A_OK
2   4/29/20 20:53   C   FAIL    C_FAIL
2   4/29/20 20:53   C   FAIL    C_FAIL
2   4/29/20 20:58   D   OK  D_OK

每个唯一的项目都在Combined col.中

想要获得每个唯一的ID,一系列项目如下:

1 [[A_OK], [A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK], [A_FAIL,B_OK], [A_OK, C_FAIL], [D_OK]] 

如果不是上面的,下面的文件也可以作为一个txt文件,其中每行代表一个ID,-1表示项目集的结束,-2表示该ID 的行的结束

A_OK -1 A_FAIL B_OK -1 A_OK -1 C_OK -1 -2
A_OK -1 A_FAIL B_OK -1 A_OK C_FAIL -1 D_OK -1 -2

其中,如果Combined中的项目在2分钟的时间范围内,则为其相同的项目集(相同的子列表(,或者被视为相同id的另一个项目集。

不一定能理解你真正想要的东西,但你可以尝试使用apply(list(。例如:

import pandas as pd
df = pd.DataFrame( {'ID':['1','1','1','2','2','2'], 'Combined':['A_OK','A_FAIL','B_OK','A_FAIL','B_FAIL','C_OK']})
df
Out[3]: 
ID Combined
0  1     A_OK
1  1   A_FAIL
2  1     B_OK
3  2   A_FAIL
4  2   B_FAIL
5  2     C_OK
df.groupby('ID')['Combined'].apply(list)
Out[4]: 
ID
1      [A_OK, A_FAIL, B_OK]
2    [A_FAIL, B_FAIL, C_OK]
Name: Combined, dtype: object

好的,所以我理解你想要这个,而不是你之前写的:

1 [[A_OK,A_FAIL, B_OK], [A_OK], [C_OK]]
2 [[A_OK,A_FAIL,B_OK], [A_OK, C_FAIL], [D_OK]]

为此:

import pandas as pd 
from datetime import datetime
"""Create the df"""
fmt = '%d-%m-%Y %H:%M'
d1 = datetime.strptime('17-07-2020 20:32', fmt)
d2 = datetime.strptime('17-07-2020 20:34', fmt)
d3 = datetime.strptime('17-07-2020 20:53', fmt)
d4 = datetime.strptime('17-07-2020 20:58', fmt)
df = pd.DataFrame( {'ID':['1','1','1','1','1','1','1','1','1','1','2','2','2','2','2','2','2','2','2','2'],'Date':[d1,d1,d1,d1,d1,d1,d2,d3,d3,d4,d1,d1,d1,d1,d1,d1,d2,d3,d3,d4],'Combined':['A_OK','A_OK','A_OK','A_OK','A_FAIL','A_FAIL','B_OK','A_OK','A_OK','C_OK','A_OK','A_OK','A_OK','A_OK','A_FAIL','A_FAIL','B_OK','A_OK','A_OK','C_OK']})
print(df)
ID                Date Combined
0   1 2020-07-17 20:32:00     A_OK
1   1 2020-07-17 20:32:00     A_OK
2   1 2020-07-17 20:32:00     A_OK
3   1 2020-07-17 20:32:00     A_OK
4   1 2020-07-17 20:32:00   A_FAIL
5   1 2020-07-17 20:32:00   A_FAIL
6   1 2020-07-17 20:34:00     B_OK
7   1 2020-07-17 20:53:00     A_OK
8   1 2020-07-17 20:53:00     A_OK
9   1 2020-07-17 20:58:00     C_OK
10  2 2020-07-17 20:32:00     A_OK
11  2 2020-07-17 20:32:00     A_OK
12  2 2020-07-17 20:32:00     A_OK
13  2 2020-07-17 20:32:00     A_OK
14  2 2020-07-17 20:32:00   A_FAIL
15  2 2020-07-17 20:32:00   A_FAIL
16  2 2020-07-17 20:34:00     B_OK
17  2 2020-07-17 20:53:00     A_OK
18  2 2020-07-17 20:53:00     A_OK
19  2 2020-07-17 20:58:00     C_OK
"""Create periods"""
df['Date'] = pd.to_datetime(df['Date'])
diffs = df['Date'] - df['Date'].shift()
laps = diffs > pd.Timedelta('2 min')
periods = laps.cumsum().apply(lambda x: 'period_{}'.format(x+1))
df['2min_period'] = periods
print(df)
ID                Date Combined 2min_period
0   1 2020-07-17 20:32:00     A_OK    period_1
1   1 2020-07-17 20:32:00     A_OK    period_1
2   1 2020-07-17 20:32:00     A_OK    period_1
3   1 2020-07-17 20:32:00     A_OK    period_1
4   1 2020-07-17 20:32:00   A_FAIL    period_1
5   1 2020-07-17 20:32:00   A_FAIL    period_1
6   1 2020-07-17 20:34:00     B_OK    period_1
7   1 2020-07-17 20:53:00     A_OK    period_2
8   1 2020-07-17 20:53:00     A_OK    period_2
9   1 2020-07-17 20:58:00     C_OK    period_3
10  2 2020-07-17 20:32:00     A_OK    period_3
11  2 2020-07-17 20:32:00     A_OK    period_3
12  2 2020-07-17 20:32:00     A_OK    period_3
13  2 2020-07-17 20:32:00     A_OK    period_3
14  2 2020-07-17 20:32:00   A_FAIL    period_3
15  2 2020-07-17 20:32:00   A_FAIL    period_3
16  2 2020-07-17 20:34:00     B_OK    period_3
17  2 2020-07-17 20:53:00     A_OK    period_4
18  2 2020-07-17 20:53:00     A_OK    period_4
19  2 2020-07-17 20:58:00     C_OK    period_5
"""finally the groupby"""
groupped=df.groupby(['ID','2min_period'])['Combined'].apply(list).reset_index()
groupped['Combined']=groupped.Combined.map(pd.unique)
groupped=groupped.groupby(['ID'])['Combined'].apply(list).reset_index()
print(groupped)
ID                                Combined
0  1  [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]
1  2  [[A_OK, A_FAIL, B_OK], [A_OK], [C_OK]]

最新更新