我有两个按月计算的保险索赔数据透视表,供应商和另一个来源,一个数据透视表有2个月的数据(假设这是"旧的");表),另一个有相同的2个月+ 1个月的数据(称之为"新";表)。我想做的是把它们并排放在一起,并添加两个额外的列,以月为单位减去新减去旧的差,然后计算变化的百分比。这两个表基于两个数据帧,df和hdf。
表是这样的(旧表,新表只包括202003):
PAID
SOURCE IP OP PR
CLAIM_VENDOR PYYMM
VENDOR 1 202001 NaN 0.00 NaN
202002 NaN 0.00 3150.00
VENDOR 2 202001 NaN NaN 1197812.00
202002 NaN NaN 1837917.00
VENDOR 3 202001 9005941.61 8987012.39 16890979.44
202002 5158238.02 7796918.67 16888689.98
Old = pd.pivot_table(df,
values = ["PAID"],
index = ["CLAIM_VENDOR","PYYMM"],
columns = ["SOURCE"],
aggfunc=np.sum
)
New = pd.pivot_table(hdf,
values = ["PAID"],
index = ["CLAIM_VENDOR","PYYMM"],
columns = ["SOURCE"],
aggfunc=np.sum
)
所以最终的目标是让它们并排,然后添加6个额外的列,取每个"源"的差异。按月计算,然后是百分比差异。对于这个项目,即使在相同的月份,预计new也会有不同的值。
对不起,如果我做了不好的stackoverflow格式,以前没有使用过这个。
import pandas as pd
# Create old table
old_df_index = pd.MultiIndex(
[["VENDOR 1", "VENDOR 2", "VENDOR 3"], ["202001", "202002"]],
[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=["CLAIM_VENDOR", "PYYMM"]
)
old_df_index.name = "SOURCE"
old_df = pd.DataFrame(
{
"IP": [None, None, None, None, 9005941.61, 5158238.02],
"OP": [0, 0, None, None, 8987012.39, 7796918.67],
"PR": [None, 3150, 1197812, 1837917, 16890979.44, 16888689.98]
},
index=old_df_index,
columns=pd.Index(["IP", "OP", "PR"], name="PAID")
)
# Create new table
new_df_index = pd.MultiIndex.from_product(
[["VENDOR 1", "VENDOR 2", "VENDOR 3"], ["202001", '202002', "202003"]],
names=["CLAIM_VENDOR", "PYYMM"]
)
new_df_index.name = "SOURCE"
new_df = pd.DataFrame(
{
"IP": [None, None, None, None, None, 8675309*1.5, 9005941.61*1.7, 5158238.02*1.9, 2020567*2.1],
"OP": [0, 0, 0, None, None, None, 8987012.39*1.25, 7796918.67*1.45, 125964816*1.65],
"PR": [None, 3150*0.75, 2210*0.7, 1197812*0.9, 1837917*0.99, 1818956*0.6, 16890979.44*1.2, 16888689.98*1.1, 74747687*3.1]
},
index=new_df_index,
columns=pd.Index(["IP", "OP", "PR"], name="PAID")
)
# We want the index to include CLAIM_VENDOR, PYYMM and PAID for the join below
old_df_for_join = (
old_df.melt(ignore_index=False)
.reset_index()
.set_index(["CLAIM_VENDOR", "PYYMM", "PAID"])
)
new_df_for_join = (
new_df.melt(ignore_index=False)
.reset_index()
.set_index(["CLAIM_VENDOR", "PYYMM", "PAID"])
)
# Use .join to combine based on the index
final_table = (
new_df_for_join.join(
old_df_for_join,
lsuffix="new",
rsuffix="old"
)
.assign(
difference=lambda DF: DF.valuenew - DF.valueold,
pct_diff = lambda DF: DF.difference / DF.valueold
)
.unstack(level="PAID")
)
print(final_table)
结果
valuenew valueold difference pct_diff
PAID IP OP PR IP OP PR IP OP PR IP OP PR
CLAIM_VENDOR PYYMM
VENDOR 1 202001 NaN 0.000000e+00 NaN NaN 0.00 NaN NaN 0.000000e+00 NaN NaN NaN NaN
202002 NaN 0.000000e+00 2.362500e+03 NaN 0.00 3150.00 NaN 0.000000e+00 -787.500 NaN NaN -0.25
202003 NaN 0.000000e+00 1.547000e+03 NaN NaN NaN NaN NaN NaN NaN NaN NaN
VENDOR 2 202001 NaN NaN 1.078031e+06 NaN NaN 1197812.00 NaN NaN -119781.200 NaN NaN -0.10
202002 NaN NaN 1.819538e+06 NaN NaN 1837917.00 NaN NaN -18379.170 NaN NaN -0.01
202003 1.301296e+07 NaN 1.091374e+06 NaN NaN NaN NaN NaN NaN NaN NaN NaN
VENDOR 3 202001 1.531010e+07 1.123377e+07 2.026918e+07 9005941.61 8987012.39 16890979.44 6304159.127 2.246753e+06 3378195.888 0.7 0.25 0.20
202002 9.800652e+06 1.130553e+07 1.857756e+07 5158238.02 7796918.67 16888689.98 4642414.218 3.508613e+06 1688868.998 0.9 0.45 0.10
202003 4.243191e+06 2.078419e+08 2.317178e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN