连接复杂的熊猫表



我正在尝试将来自统计模型 GLM 的结果数据帧连接到一个数据帧,该数据帧旨在在迭代模型时保存单变量数据和模型结果。 我无法弄清楚如何在语法上连接两个数据集。

我已经查阅了下面找到的熊猫文档,但没有运气:

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging

这很困难,因为与包含每个唯一变量的每个唯一级别的最终表相比,模型的输出

。使用以下代码查看数据外观的示例:

import pandas as pd
df = {'variable': ['CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
,'channel_model','channel_model','channel_model']
, 'level': [0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
,'value': [460955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
,36388470.44,31805316.37]}
final_table = pd.DataFrame(df)

df2 = {'variable': ['intercept','C(channel_model)[T.EA]','C(channel_model)[T.IA]', 'CLded_model']
, 'coefficient': [-2.36E-14,-0.091195797,-0.244225888, 0.00174356]}
model_results = pd.DataFrame(df2)

运行此值后,您可以看到,对于分类变量,与final_table相比,该值被封装在几层中。 数值(如CLded_model)需要与其关联的一个系数连接。

这有很多,我不知道从哪里开始。

更新:以下代码生成所需的结果:

d3 = {'variable': ['intercept', 'CLded_model','CLded_model','CLded_model','CLded_model','CLded_model','CLded_model'
,'CLded_model','channel_model','channel_model','channel_model']
, 'level': [None, 0,100,200,250,500,750,1000, 'DIR', 'EA', 'IA']
,'value': [None, 60955.7793,955735.0532,586308.4028,12216916.67,48401773.87,1477842.472,14587994.92,10493740.36
,36388470.44,31805316.37]
, 'coefficient': [ -2.36E-14, 0.00174356,  0.00174356,  0.00174356,  0.00174356,  0.00174356 ,0.00174356
, 0.00174356,None, -0.091195797,-0.244225888, ]}
desired_result = pd.DataFrame(d3)

首先,你必须清理 df2:

df2['variable'] = df2['variable'].str.replace("C(","")
.str.replace(")[T.", "-")
.str.strip("]")
df2

variable          coefficient
0   intercept           -2.360000e-14
1   channel_model-EA    -9.119580e-02
2   channel_model-IA    -2.442259e-01
3   CLded_model          1.743560e-03

因为您想合并级别列上的一些 df1 而其他则不合并,我们需要稍微更改 df1 以匹配 df2:

df1.loc[df1['variable'] == 'channel_model', 'variable'] = "channel_model-"+df1.loc[df1['variable'] == 'channel_model', 'level']
df1
#snippet of what changed
variable         level     value
6   CLded_model        1000   1.458799e+07
7   channel_model-DIR   DIR   1.049374e+07
8   channel_model-EA    EA    3.638847e+07
9   channel_model-IA    IA    3.180532e+07

然后我们合并它们:

df4 = df1.merge(df2, how = 'outer', left_on =['variable'], right_on = ['variable'])

我们得到您的结果(变量名称的微小变化除外)

最新更新