Using a Pandas DataFrame to find the differences (deltas) between all columns across a multi-level index



Take the following as an example:

import datetime
import pandas as pd
data = [
{"date": datetime.date(2020, 1, 1), "product": "product_1", "price": 90, "quantity": 100, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_1", "price": 90, "quantity": 80, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 3), "product": "product_1", "price": 90, "quantity": 80, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 3), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_1", "price": 80, "quantity": 70, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 4), "product": "product_2", "price": 70, "quantity": 80, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
]
df = pd.DataFrame(data)
df = df.set_index(["date", "product"])

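What I want to do is produce a diff/delta on every non-index column, outputting only the columns that changed and skipping days on which nothing changed. I also need to identify missed days, and then output all columns for that date/product on the next day that does have data.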

An example output would be:

{"date": datetime.date(2020, 1, 1), "product": "product_1", "price": 90, "quantity": 100, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_1", "price": None, "quantity": 80, "code": None},
{"date": datetime.date(2020, 1, 3), "product": "product_2", "price": None, "quantity": None, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_1", "price": 80, "quantity": 70, "code": None},
{"date": datetime.date(2020, 1, 4), "product": "product_2", "price": 70, "quantity": None, "code": None},
{"date": datetime.date(2020, 1, 4), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},

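To clarify the above:

For product_1: the date 2020-01-03 is missing because none of the columns (price, quantity) changed, and for the date 2020-01-02 the price is None because the price did not change.

For product_2: the date 2020-01-01 is present because it is the first data point, the code changed on 2020-01-03, and the price was updated on 2020-01-04.

For product_3: the date 2020-01-01 is present because it is the first data point, and the date 2020-01-04 is present in full because we have no data for the previous date (2020-01-03).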

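I have tried iterating over the DataFrame, but I am worried that this will scale poorly. I feel that something like pct_change would work, but from what I have read it does not work with a MultiIndex.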
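As a hedged aside on the MultiIndex concern: grouping on the `product` index level should let `diff` and `pct_change` run per product. The sketch below assumes only the numeric columns are present and uses a reduced copy of the sample data. The names `numeric`, `deltas` and `ratios` are illustrative, not from the post.

```python
import pandas as pd

# Reduced copy of the sample data, numeric columns only.
rows = [
    ("2020-01-01", "product_1", 90, 100),
    ("2020-01-01", "product_2", 80, 80),
    ("2020-01-02", "product_1", 90, 80),
    ("2020-01-02", "product_2", 80, 80),
    ("2020-01-04", "product_1", 80, 70),
    ("2020-01-04", "product_2", 70, 80),
]
numeric = pd.DataFrame(rows, columns=["date", "product", "price", "quantity"])
numeric["date"] = pd.to_datetime(numeric["date"])
numeric = numeric.set_index(["date", "product"])

# Group on the "product" level so each product is compared only with itself.
by_product = numeric.groupby(level="product")
deltas = by_product.diff()        # absolute change since the previous row
ratios = by_product.pct_change()  # relative change since the previous row
```

Both results keep the original (date, product) index, and the first row of each product is NaN. This does not cover str columns or missed days.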

Also, as the answers show, diff will work on int columns, but ideally I need it to work on str columns as well.
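One way to cover str columns is to compare each row with the previous row of the same product via `shift` and `!=` instead of `diff`. The following is a minimal sketch under stated assumptions, not a definitive implementation: `changed_values` is a hypothetical helper, a "missed day" is taken to mean a gap of more than one calendar day, and the input is assumed to contain no missing values.

```python
import pandas as pd

def changed_values(df):
    """Hypothetical helper: keep only the cells that changed per product."""
    flat = df.reset_index()
    flat["date"] = pd.to_datetime(flat["date"])
    flat = flat.sort_values(["product", "date"]).reset_index(drop=True)
    value_cols = [c for c in flat.columns if c not in ("date", "product")]

    grouped = flat.groupby("product")
    prev = grouped[value_cols].shift()  # previous row of the same product
    gap = grouped["date"].diff()        # time since that previous row

    # Rows to show in full: first data point, or the previous day is missing
    # (assumption: a gap of more than one calendar day).
    full_row = gap.isna() | (gap > pd.Timedelta(days=1))

    # != works for strings as well as numbers, unlike diff().
    changed = flat[value_cols].ne(prev)
    changed.loc[full_row, :] = True

    out = flat[value_cols].where(changed)  # unchanged cells become NaN
    out.insert(0, "product", flat["product"])
    out.insert(0, "date", flat["date"])
    out = out[changed.any(axis=1)]         # drop rows where nothing changed
    return out.sort_values(["date", "product"]).set_index(["date", "product"])

# Same values as the sample data in the question.
rows = [
    ("2020-01-01", "product_1", 90, 100, "code-1.1"),
    ("2020-01-01", "product_2", 80, 80, "code-2.1"),
    ("2020-01-01", "product_3", 80, 80, "code-3.1"),
    ("2020-01-02", "product_1", 90, 80, "code-1.1"),
    ("2020-01-02", "product_2", 80, 80, "code-2.1"),
    ("2020-01-02", "product_3", 80, 80, "code-3.1"),
    ("2020-01-03", "product_1", 90, 80, "code-1.1"),
    ("2020-01-03", "product_2", 80, 80, "code-2.2"),
    ("2020-01-04", "product_1", 80, 70, "code-1.1"),
    ("2020-01-04", "product_2", 70, 80, "code-2.2"),
    ("2020-01-04", "product_3", 80, 80, "code-3.1"),
]
df = pd.DataFrame(rows, columns=["date", "product", "price", "quantity", "code"])
df = df.set_index(["date", "product"])

result = changed_values(df)
```

Unchanged cells come back as NaN rather than None, so the int columns are shown as floats. Everything is done with grouped, vectorized operations, which avoids iterating over rows.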

For a single product

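# df is assumed to hold one product's rows and only the numeric columns,
# since diff() does not support str columns such as "code"
diffs = df.diff()
# blank out the cells whose value did not change
df[diffs == 0] = None
# drop the rows in which nothing changed
df[~(diffs == 0).all(axis=1)]
                      price  quantity
date       product
2020-01-01 product_1   90.0     100.0
2020-01-02 product_1    NaN      80.0
2020-01-04 product_1   80.0      70.0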

For multiple products

def show_diffs(df):
    # row-to-row differences within one product (numeric columns only)
    diffs = df.diff()
    # blank out the cells whose value did not change
    df[diffs == 0] = None
    # drop the rows in which nothing changed
    return df[~(diffs == 0).all(axis=1)]

df.groupby('product').apply(show_diffs).sort_values('date')
                                price  quantity
product   date       product
product_1 2020-01-01 product_1   90.0     100.0
product_2 2020-01-01 product_2   80.0      80.0
product_1 2020-01-02 product_1    NaN      80.0
          2020-01-04 product_1   80.0      70.0
product_2 2020-01-04 product_2   70.0       NaN
