具有最后一个和先前值的熊猫复杂条件



有一个带有订单号的表(熊猫数据帧(

>>> ord = pd.DataFrame([[241147,'01.01.2016'], [241148,'01.01.2016']], columns=['order_id','created'])
>>> ord
    order_id    created
0   241147  01.01.2016
1   241148  01.01.2016

订单状态有更改历史记录

>>> ord_status_history['ord_id','osh_id','osh_created','osh_status_id','osh_status_reason']
    ord_id  osh_id  osh_id_created  osh_status_id   osh_status_reason
0   241147  124632  01.01.2016  1   None
1   241147  124682  02.01.2016  2   None
2   241147  124719  03.01.2016  10   None
7   241148  124633  01.01.2016  1   None
8   241148  126181  06.01.2016  5   Test_reason

我想将有关订单的最后一个订单状态和订单倒数第二个状态的信息添加到表中(订单由字段"osh_created"确定(。

order_id    created Last_status_id  Last_status_date    Prev_status_id  Prev_status_date    reason
0   241147  01.01.2016  10  03.01.2016  9   02.01.2016  NaN
1   241148  01.01.2016  5   06.01.2016  1   01.01.2016  Test Reason

但是我不明白如何使用np.where或loc条件。因为对于一个订单ord_status_history几行,但我只需要为每个订单选择一个。

我试着这样做(但它非常糟糕(:

for i in range(ord_stat['order_id'].count()-1):
    if (ord_stat.loc[i,'order_id']==ord_stat.loc[i+1,'order_id']):
        if (ord_stat.loc[i,'osh_id_created']<=ord_stat.loc[i+1,'osh_id_created']):
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
            else: 
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i+1,'osh_id_created']
        else:
            if (ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=='NAN'):
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord_stat.loc[i+1,'osh_id_created']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']
            else: 
                ord.loc[ord_stat.loc[i,'order_id'],'Prev_status_date']=ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']
                ord.loc[ord_stat.loc[i,'order_id'],'Last_status_date']=ord_stat.loc[i,'osh_id_created']

阅读有关 nlargest 的信息,但我不明白如果我将"osh_created"与 nmax 一起使用,我该如何接受status_id

ord_stat.groupby('order_id')['osh_id_created'].nlargest(2)

假设我们有以下数据帧:

In [291]: ord
Out[291]:
   order_id    created
0    241147 2016-01-01
1    241148 2016-01-01
In [292]: hst
Out[292]:
   ord_id  osh_id osh_id_created  osh_status_id osh_status_reason
0  241147  124632     2016-01-01              1              None
1  241147  124682     2016-02-01              2              None
2  241147  124719     2016-03-01             10              None
7  241148  124633     2016-01-01              1              None
8  241148  126181     2016-06-01              5       Test_reason

我们可以将其聚合如下:

In [293]: funcs = {
     ...:     'osh_status_id':{
     ...:         'Last_status_id':'last',
     ...:         'Prev_status_id':lambda x: x.shift().iloc[-1]
     ...:     },
     ...:     'osh_id_created':{
     ...:         'Last_status_date':'last',
     ...:         'Prev_status_date':lambda x: x.shift().iloc[-1]
     ...:     }
     ...: }
     ...:
In [294]: x = (hst.sort_values('osh_id_created')
     ...:         .groupby('ord_id')['osh_status_id','osh_id_created']
     ...:         .agg(funcs)
     ...: )
     ...:

导致

In [295]: x
Out[295]:
        Last_status_id  Prev_status_id Last_status_date Prev_status_date
ord_id
241147              10               2       2016-03-01       2016-02-01
241148               5               1       2016-06-01       2016-01-01

现在我们可以将其合并回原始ord DF:

In [296]: ord.set_index('order_id').join(x).reset_index()
Out[296]:
   order_id    created  Last_status_id  Prev_status_id Last_status_date Prev_status_date
0    241147 2016-01-01              10               2       2016-03-01       2016-02-01
1    241148 2016-01-01               5               1       2016-06-01       2016-01-01

或使用merge()方法:

In [297]: pd.merge(ord, x, left_on='order_id', right_index=True)
Out[297]:
   order_id    created  Last_status_id  Prev_status_id Last_status_date Prev_status_date
0    241147 2016-01-01              10               2       2016-03-01       2016-02-01
1    241148 2016-01-01               5               1       2016-06-01       2016-01-01

最新更新