Pandas: replace all NaN values with data from a different DataFrame



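I'm new to pandas, and I'm stuck on replacing NaN values with medians taken from a different DataFrame. The median DataFrame has a different shape, because I had to group the original df to compute the medians.

My main DataFrame, df1, looks like this: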

     permno  yyyymm  BookLeverage  Cash       RoE  ShareIss1Y  ShareIss5Y        SP        date  industry_id  STreversal      Price        Size        ret
541   10006  197101     -1.907577   NaN  0.114616    0.000000    0.051689  1.197606  1971-01-29           37   -4.383562  -3.863358  -12.496377   0.043836
542   10006  197102     -1.907577   NaN  0.114616    0.000000    0.051689  1.220021  1971-02-26           37    0.577428  -3.844814  -12.477833  -0.005774
543   10006  197103     -1.907577   NaN  0.114616    0.000000    0.051689  1.118353  1971-03-31           37   -9.090909  -3.931826  -12.564844   0.090909
544   10006  197104     -1.907577   NaN  0.114616    0.000000    0.051689       NaN  1971-04-30           37  -16.176471  -4.081766  -12.714785   0.161765
545   10006  197105     -1.907577   NaN  0.114616    0.000000    0.051689  1.025366  1971-05-28           37    5.105485  -4.018633  -12.651651  -0.051055

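I then created a new DataFrame, df2, by grouping the previous df on the yyyymm and industry_id columns and taking the median of each time-industry panel.

The median DataFrame df2 looks like this: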
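For reference, a median frame of this shape can presumably be produced with a single groupby. The following is only a sketch: the miniature `df1` and its values are made up for illustration, and `numeric_only=True` is an assumption used to skip non-numeric columns such as `date`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df1; the values are made up for illustration.
df1 = pd.DataFrame({
    "yyyymm": [197101, 197101, 197101, 197102],
    "industry_id": [37, 37, 37, 37],
    "SP": [1.0, 3.0, np.nan, 2.0],
    "Cash": [np.nan, np.nan, np.nan, 0.5],
})

# One row per (yyyymm, industry_id) pair, holding each column's median.
# median() skips NaN by default; a group with no valid values stays NaN.
df2 = df1.groupby(["yyyymm", "industry_id"]).median(numeric_only=True)
print(df2)
```

The result is indexed by a two-level MultiIndex, which is why a single cell is addressed with a tuple such as `df2.loc[(197101, 37), "SP"]`.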

                     permno  BookLeverage  Cash       RoE  ShareIss1Y  \
yyyymm industry_id
197101 01           40957.5     -2.451327   NaN  0.015212   -0.306936
       10           19254.0     -1.300565   NaN  0.123353   -0.002747
       12           33081.5     -2.102402   NaN -0.001043   -0.255756
       13           26470.0     -2.028418   NaN  0.116907   -0.005262
       14           17830.0     -1.266574   NaN  0.110059   -0.000193
...                     ...           ...   ...       ...         ...
202112 80           78633.0     -3.037694   NaN  0.195342         NaN
       82           52123.0     -3.093551   NaN  0.017580         NaN
       83           13739.0     -2.802522   NaN  0.021025         NaN
       87           78667.5     -3.103168   NaN  0.104524         NaN
       97           91547.0     -3.054443   NaN  0.162610         NaN

                    ShareIss5Y        SP  STreversal     Price       Size  \
yyyymm industry_id
197101 01            -7.591944  5.439985   -9.998244 -2.684046 -11.483201
       10            -1.432833  0.517484   -4.504504 -3.367296 -11.826440
       12           -20.622667  2.264890  -22.648810 -2.873900 -11.501783
       13            -0.257821  0.752112   -5.429864 -3.607534 -12.362360
       14            -0.223948  0.636665  -16.075773 -2.729726 -11.386150
...                        ...       ...         ...       ...        ...
202112 80                  NaN       NaN  -10.960198 -4.539740 -16.024733
       82                  NaN       NaN   -1.664319 -2.740474 -13.882130
       83                  NaN       NaN   -2.383083 -4.835329 -15.843560
       87                  NaN       NaN   -5.109321 -4.585741 -15.844537
       97                  NaN       NaN   -1.535659 -4.487512 -16.339328

                         ret
yyyymm industry_id
197101 01           0.099982
       10           0.045045
       12           0.226488
       13           0.054299
       14           0.160758
...                      ...
202112 80           0.109602
       82           0.016643
       83           0.023831
       87           0.051093
       97           0.015357

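What I want to do now is fill the NaN values in df1 with the corresponding values from df2. For example, the SP column in row 544 should get the df2 value for yyyymm 197104 and industry_id 37.

I tried iterating over all rows and, within each row, over all columns to replace the NaN values, but this broke my DataFrame: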

def fill_nan_with_median(row):
    date = int(row['yyyymm'])
    industry = row['industry_id']

    for label, column in row.items():
        if column == np.nan:
            median = df_median.loc[(date, industry), label]
            df_1.loc[index, label] = median

for index, row in df_1.iterrows():
    fill_nan_with_median(row)

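This answer uses a table-lookup approach: for each NaN in the SP column, it looks up the median SP value in df2. It also assumes that yyyymm and industry_id are strings rather than numbers.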

df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
541    1.197606
542    1.220021
543    1.118353
544    0.636665
545    1.025366

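Note that non-NaN values are detected by the odd-looking x['SP']==x['SP'] test, which relies on the fact that NaN != NaN.

Your df1 was used together with this df2, which I created: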
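The self-comparison trick can be checked in isolation. The short sketch below (variable names are illustrative only) also shows why the test `column == np.nan` in the question's loop never fires, and that `pd.isna` is the more explicit way to express the same check.

```python
import numpy as np
import pandas as pd

nan = np.nan

print(nan == nan)      # False: NaN never compares equal, not even to itself
print(nan != nan)      # True: this is what x['SP'] == x['SP'] relies on
print(pd.isna(nan))    # True: the explicit, more readable test

# The same check applied to a whole column at once.
s = pd.Series([1.197606, np.nan, 1.025366])
print(s.isna().tolist())  # [False, True, False]
```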

                          SP
yyyymm industry_id
197104 01           5.439985
       10           0.517484
       12           2.264890
       13           0.752112
       37           0.636665

After that, all you need to do is assign the result back to df1:

df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))
     permno  yyyymm  BookLeverage  Cash       RoE  ShareIss1Y  ShareIss5Y  \
541   10006  197101     -1.907577   NaN  0.114616         0.0    0.051689
542   10006  197102     -1.907577   NaN  0.114616         0.0    0.051689
543   10006  197103     -1.907577   NaN  0.114616         0.0    0.051689
544   10006  197104     -1.907577   NaN  0.114616         0.0    0.051689
545   10006  197105     -1.907577   NaN  0.114616         0.0    0.051689

           SP        date industry_id  STreversal     Price       Size  \
541  1.197606  1971-01-29          37   -4.383562 -3.863358 -12.496377
542  1.220021  1971-02-26          37    0.577428 -3.844814 -12.477833
543  1.118353  1971-03-31          37   -9.090909 -3.931826 -12.564844
544  0.636665  1971-04-30          37  -16.176471 -4.081766 -12.714785
545  1.025366  1971-05-28          37    5.105485 -4.018633 -12.651651

          ret
541  0.043836
542 -0.005774
543  0.090909
544  0.161765
545 -0.051055

Or:

df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
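The row-wise `apply` handles one column at a time and raises a `KeyError` if a (yyyymm, industry_id) pair is missing from df2. As a possible alternative, here is a vectorized sketch of the same lookup that covers every column shared by the two frames. The miniature frames and the names `keys`, `medians` and `df1_filled` are made up for illustration, and the key columns are assumed to have the same dtype in both frames.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frames; the values are made up for illustration.
df1 = pd.DataFrame(
    {
        "yyyymm": [197103, 197104, 197105],
        "industry_id": [37, 37, 37],
        "SP": [1.118353, np.nan, 1.025366],
        "Cash": [np.nan, np.nan, np.nan],
    },
    index=[543, 544, 545],
)
df2 = pd.DataFrame(
    {"SP": [0.7, 0.636665], "Cash": [0.1, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [(197103, 37), (197104, 37)], names=["yyyymm", "industry_id"]
    ),
)

# Build one lookup key per df1 row, then pull the matching df2 rows.
# reindex yields NaN for keys missing from df2 (here 197105) instead of raising.
keys = pd.MultiIndex.from_frame(df1[["yyyymm", "industry_id"]])
medians = df2.reindex(keys)
medians.index = df1.index  # line the medians up with df1's row labels

# fillna with a DataFrame fills only NaN cells, matching on row label and column name.
df1_filled = df1.fillna(medians)
print(df1_filled)
```

Cells whose median is itself NaN in df2 (such as Cash here) simply stay NaN, and existing values in df1 are never overwritten.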

This was written without data to test against, so you may need to adjust a few things (hopefully not):

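# Median SP per (yyyymm, industry_id) group, kept as a regular column named "median"
df_grouped_median = (
    df1.groupby(['yyyymm', 'industry_id'], as_index=False)['SP'].median()
    .rename(columns={'SP': 'median'})
)
# Attach each row's group median (note: merge resets the index), then fill the NaNs
df1 = df1.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
df1['SP'] = df1['SP'].fillna(df1['median'])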
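Since the question asks about all NaN values rather than only SP, the same idea can be extended to several columns without a merge by using `groupby(...).transform`. This is a sketch under assumptions: the miniature `df1`, its values, and the `value_cols` list are hypothetical and would need to be adapted to the real column set.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df1; the values are made up for illustration.
df1 = pd.DataFrame({
    "yyyymm": [197104, 197104, 197104, 197105],
    "industry_id": [37, 37, 37, 37],
    "SP": [1.0, 3.0, np.nan, np.nan],
    "RoE": [0.2, np.nan, 0.4, 0.5],
})

value_cols = ["SP", "RoE"]  # columns whose NaNs should be filled

# transform returns a frame shaped like df1[value_cols], with each row holding
# the median of its own (yyyymm, industry_id) group.
group_medians = df1.groupby(["yyyymm", "industry_id"])[value_cols].transform("median")

# Only NaN cells are replaced; groups with no valid values stay NaN.
df1[value_cols] = df1[value_cols].fillna(group_medians)
print(df1)
```

Because `transform` keeps the original index, this variant leaves df1's row labels intact and needs no helper column.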
