Take the following as an example:
import datetime
import pandas as pd
data = [
{"date": datetime.date(2020, 1, 1), "product": "product_1", "price": 90, "quantity": 100, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_1", "price": 90, "quantity": 80, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 3), "product": "product_1", "price": 90, "quantity": 80, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 3), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_1", "price": 80, "quantity": 70, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 4), "product": "product_2", "price": 70, "quantity": 80, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
]
df = pd.DataFrame(data)
df = df.set_index(["date", "product"])
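What I want to do is produce a diff/delta on each non-index column, outputting only the columns that changed and omitting days on which nothing changed, but also detecting missed days and then outputting all columns for that date/product on the next day.
An example output would be: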
{"date": datetime.date(2020, 1, 1), "product": "product_1", "price": 90, "quantity": 100, "code": "code-1.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_2", "price": 80, "quantity": 80, "code": "code-2.1"},
{"date": datetime.date(2020, 1, 1), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
{"date": datetime.date(2020, 1, 2), "product": "product_1", "price": None, "quantity": 80, "code": None},
{"date": datetime.date(2020, 1, 3), "product": "product_2", "price": None, "quantity": None, "code": "code-2.2"},
{"date": datetime.date(2020, 1, 4), "product": "product_1", "price": 80, "quantity": 70, "code": None},
{"date": datetime.date(2020, 1, 4), "product": "product_2", "price": 70, "quantity": None, "code": None},
{"date": datetime.date(2020, 1, 4), "product": "product_3", "price": 80, "quantity": 80, "code": "code-3.1"},
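To clarify the above:
For product_1: the date 2020-01-03 is missing because nothing changed in the columns (price, quantity), and for the date 2020-01-02 the price is None because the price did not change.
For product_2: the date 2020-01-01 is present because it is the first data point, the code changed on 2020-01-03, and the price was updated on 2020-01-04.
For product_3: the date 2020-01-01 is present because it is the first data point, and the date 2020-01-04 is present because we have no data at all for the previous date (2020-01-03).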
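I have tried iterating over the DataFrame, but I am worried that this will scale poorly. I feel that something like pct_change would work, but from what I have read it does not work with a MultiIndex.
Also, as the answers point out, diff works on int columns, but ideally I need it to work on str columns too.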
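For a single product:
# Assumes df holds one product and only the numeric columns (price, quantity)
diffs = df.diff()
df[diffs == 0] = None  # blank out values that did not change
df[~(diffs == 0).all(axis=1)]  # drop rows where nothing changed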
                      price  quantity
date       product
2020-01-01 product_1   90.0     100.0
2020-01-02 product_1    NaN      80.0
2020-01-04 product_1   80.0      70.0
For multiple products:
def show_diffs(df):
    # Same logic as above, applied to one product's rows at a time
    diffs = df.diff()
    df[diffs == 0] = None
    return df[~(diffs == 0).all(axis=1)]

df.groupby('product').apply(show_diffs).sort_values('date')
                                price  quantity
product   date       product
product_1 2020-01-01 product_1   90.0     100.0
product_2 2020-01-01 product_2   80.0      80.0
product_1 2020-01-02 product_1    NaN      80.0
          2020-01-04 product_1   80.0      70.0
product_2 2020-01-04 product_2   70.0       NaN
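The diff-based approach above only covers numeric columns and does not treat missed days specially. One possible way to cover str columns and gaps as well is to compare each row with the previous row of the same product using shift instead of diff. This is only a sketch under a few assumptions: the input has no NaN values, each product is expected to appear once per day, and the names sparse_changes, max_gap and sample are made up for illustration.

```python
import datetime

import pandas as pd


def sparse_changes(df, max_gap=pd.Timedelta(days=1)):
    """Keep only values that differ from the previous row of the same product.

    Assumes df is indexed by (date, product) and contains no NaN values.
    max_gap is a hypothetical parameter: the expected spacing between rows.
    """
    df = df.sort_index()
    # Previous observation of the same product (all NaN for its first row).
    prev = df.groupby(level="product").shift()
    # Element-wise comparison works for numeric and string columns alike.
    changed = df.ne(prev)

    # Flag rows whose previous observation is more than max_gap away.
    dates = pd.Series(
        pd.to_datetime(df.index.get_level_values("date")).to_numpy(),
        index=df.index,
    )
    prev_dates = dates.groupby(level="product").shift()
    after_gap = (dates - prev_dates) > max_gap
    changed.loc[after_gap] = True  # emit every column after a missed day

    # Blank out unchanged cells, then drop rows where nothing changed.
    return df.where(changed)[changed.any(axis=1)]


# Same rows as the question's data, written as (day, product, price, quantity, code).
records = [
    (1, "product_1", 90, 100, "code-1.1"),
    (1, "product_2", 80, 80, "code-2.1"),
    (1, "product_3", 80, 80, "code-3.1"),
    (2, "product_1", 90, 80, "code-1.1"),
    (2, "product_2", 80, 80, "code-2.1"),
    (2, "product_3", 80, 80, "code-3.1"),
    (3, "product_1", 90, 80, "code-1.1"),
    (3, "product_2", 80, 80, "code-2.2"),
    (4, "product_1", 80, 70, "code-1.1"),
    (4, "product_2", 70, 80, "code-2.2"),
    (4, "product_3", 80, 80, "code-3.1"),
]
sample = pd.DataFrame(
    [(datetime.date(2020, 1, day), *rest) for day, *rest in records],
    columns=["date", "product", "price", "quantity", "code"],
).set_index(["date", "product"])

result = sparse_changes(sample)
print(result)
```

On the question's sample data this should keep the eight rows listed in the desired output, with NaN rather than None marking the unchanged values. The first row of every product is kept in full because its shifted predecessor is all NaN, which is why the sketch assumes the input itself has no NaN values.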