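I'm new to pandas and I'm stuck on replacing NaN values with medians taken from a different DataFrame. The median DataFrame has a different shape because I had to group the original df to compute the medians.
My main DataFrame, df1, looks like this: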
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y SP date industry_id STreversal Price Size ret
541 10006 197101 -1.907577 NaN 0.114616 0.000000 0.051689 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377 0.043836
542 10006 197102 -1.907577 NaN 0.114616 0.000000 0.051689 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833 -0.005774
543 10006 197103 -1.907577 NaN 0.114616 0.000000 0.051689 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844 0.090909
544 10006 197104 -1.907577 NaN 0.114616 0.000000 0.051689 NaN 1971-04-30 37 -16.176471 -4.081766 -12.714785 0.161765
545 10006 197105 -1.907577 NaN 0.114616 0.000000 0.051689 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651 -0.051055
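I then created a new DataFrame, df2, by grouping df1 on the yyyymm and industry_id columns and taking the median of each time-industry panel.
The median DataFrame df2 looks like this: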
permno BookLeverage Cash RoE ShareIss1Y
yyyymm industry_id
197101 01 40957.5 -2.451327 NaN 0.015212 -0.306936
10 19254.0 -1.300565 NaN 0.123353 -0.002747
12 33081.5 -2.102402 NaN -0.001043 -0.255756
13 26470.0 -2.028418 NaN 0.116907 -0.005262
14 17830.0 -1.266574 NaN 0.110059 -0.000193
... ... ... ... ... ...
202112 80 78633.0 -3.037694 NaN 0.195342 NaN
82 52123.0 -3.093551 NaN 0.017580 NaN
83 13739.0 -2.802522 NaN 0.021025 NaN
87 78667.5 -3.103168 NaN 0.104524 NaN
97 91547.0 -3.054443 NaN 0.162610 NaN
ShareIss5Y SP STreversal Price Size
yyyymm industry_id
197101 01 -7.591944 5.439985 -9.998244 -2.684046 -11.483201
10 -1.432833 0.517484 -4.504504 -3.367296 -11.826440
12 -20.622667 2.264890 -22.648810 -2.873900 -11.501783
13 -0.257821 0.752112 -5.429864 -3.607534 -12.362360
14 -0.223948 0.636665 -16.075773 -2.729726 -11.386150
... ... ... ... ... ...
202112 80 NaN NaN -10.960198 -4.539740 -16.024733
82 NaN NaN -1.664319 -2.740474 -13.882130
83 NaN NaN -2.383083 -4.835329 -15.843560
87 NaN NaN -5.109321 -4.585741 -15.844537
97 NaN NaN -1.535659 -4.487512 -16.339328
ret
yyyymm industry_id
197101 01 0.099982
10 0.045045
12 0.226488
13 0.054299
14 0.160758
... ...
202112 80 0.109602
82 0.016643
83 0.023831
87 0.051093
97 0.015357
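What I want to do now is fill the NaN values in df1 with the corresponding values from df2. For example, the SP column in row 544 should get the df2 value for yyyymm 197104 and industry_id 37.
I tried to loop over every row, and every column within it, and replace the NaN values, but this broke my DataFrame: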
def fill_nan_with_median(row):
    date = int(row['yyyymm'])
    industry = row['industry_id']
    for label, column in row.items():
        if column == np.nan:
            median = df_median.loc[(date, industry), label]
            df_1.loc[index, label] = median

for index, row in df_1.iterrows():
    fill_nan_with_median(row)
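This answer takes a table-lookup approach. For each NaN in the SP column, it looks up the median SP value in df2. It also assumes that yyyymm and industry_id are strings rather than numbers.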
df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
541 1.197606
542 1.220021
543 1.118353
544 0.636665
545 1.025366
Note that non-NaN values are detected by the odd-looking x['SP']==x['SP'] test, which exploits the fact that NaN != NaN.
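As a small illustration of the point above, here is a sketch of how NaN behaves under comparison. It also suggests why a test such as `column == np.nan` can never be true; `pd.isna` is shown as a more readable alternative, which is a suggestion here rather than part of the original answer.

```python
import numpy as np
import pandas as pd

value = np.nan

# Comparing anything to NaN with == is always False,
# so a check like `column == np.nan` never fires.
eq_check = (value == np.nan)

# NaN is the only float that is not equal to itself,
# which is what the x['SP'] == x['SP'] trick relies on.
self_check = (value != value)

# pd.isna expresses the same intent directly.
isna_check = pd.isna(value)

print(eq_check, self_check, isna_check)
```

In the lambda above, `pd.isna(x['SP'])` could replace the self-comparison (with the two branches swapped) without changing the result.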
This uses your df1 together with a df2 that I created:
SP
yyyymm industry_id
197104 01 5.439985
10 0.517484
12 2.264890
13 0.752112
37 0.636665
After that, all you need to do is assign the result back to df1:
df1.assign(SP=df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1))
permno yyyymm BookLeverage Cash RoE ShareIss1Y ShareIss5Y
541 10006 197101 -1.907577 NaN 0.114616 0.0 0.051689
542 10006 197102 -1.907577 NaN 0.114616 0.0 0.051689
543 10006 197103 -1.907577 NaN 0.114616 0.0 0.051689
544 10006 197104 -1.907577 NaN 0.114616 0.0 0.051689
545 10006 197105 -1.907577 NaN 0.114616 0.0 0.051689
SP date industry_id STreversal Price Size
541 1.197606 1971-01-29 37 -4.383562 -3.863358 -12.496377
542 1.220021 1971-02-26 37 0.577428 -3.844814 -12.477833
543 1.118353 1971-03-31 37 -9.090909 -3.931826 -12.564844
544 0.636665 1971-04-30 37 -16.176471 -4.081766 -12.714785
545 1.025366 1971-05-28 37 5.105485 -4.018633 -12.651651
ret
541 0.043836
542 -0.005774
543 0.090909
544 0.161765
545 -0.051055
Or:
df1['SP'] = df1.apply(lambda x: x['SP'] if x['SP']==x['SP'] else df2.at[(x['yyyymm'],x['industry_id']),'SP'] , axis=1)
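The question asks for every column, not only SP. One possible way to extend the lookup idea without a row-wise `apply` is to align df2 to the rows of df1 and let `fillna` do the rest. The sketch below uses small made-up frames (the values, the `Cash` figures and the names `keys`, `aligned`, `value_cols` and `filled` are all hypothetical), and it assumes the key columns in df1 have the same dtype as the index levels of df2.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the poster's frames.
df1 = pd.DataFrame(
    {
        "yyyymm": [197103, 197104, 197104],
        "industry_id": [37, 37, 10],
        "SP": [1.118353, np.nan, 0.5],
        "Cash": [np.nan, 0.2, np.nan],
    },
    index=[543, 544, 600],
)
df2 = pd.DataFrame(
    {"SP": [1.1, 0.636665, 0.517484], "Cash": [0.3, 0.25, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [(197103, 37), (197104, 37), (197104, 10)],
        names=["yyyymm", "industry_id"],
    ),
)

# Look up one row of medians per row of df1, in df1's row order.
keys = pd.MultiIndex.from_frame(df1[["yyyymm", "industry_id"]])
aligned = df2.reindex(keys)
aligned.index = df1.index  # relabel so fillna aligns row by row

# Fill only the value columns; existing non-NaN values are kept.
value_cols = ["SP", "Cash"]
filled = df1.copy()
filled[value_cols] = df1[value_cols].fillna(aligned[value_cols])
print(filled)
```

Where the group median is itself NaN (as for `Cash` in the last toy row), the value stays NaN.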
This was all written without data to test against, so you may need to change a few things (hopefully not):
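# Median SP per (yyyymm, industry_id) group, as a flat frame with a 'median' column
df_grouped_median = df1.groupby(['yyyymm', 'industry_id'], as_index=False).SP.median().rename(
    columns={"SP": "median"})
# Attach each row's group median (note: merge resets the row index)
df1 = df1.merge(df_grouped_median, on=['yyyymm', 'industry_id'], how='left')
# fillna returns a new Series, so assign it back
df1['SP'] = df1['SP'].fillna(df1['median'])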
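A variation on the same idea, sketched here as an alternative rather than as what the answer above does, is to skip the intermediate median frame and use `groupby(...).transform('median')`, which returns a Series aligned to the original rows. The toy data below is made up for illustration, and `group_median` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows in one group, one row in another.
df1 = pd.DataFrame(
    {
        "yyyymm": [197104, 197104, 197104, 197105],
        "industry_id": [37, 37, 37, 37],
        "SP": [1.0, np.nan, 3.0, np.nan],
    }
)

# Per-row group median; NaNs are skipped when computing each median.
group_median = df1.groupby(["yyyymm", "industry_id"])["SP"].transform("median")

# Replace only the missing values, keeping the original index and columns.
df1["SP"] = df1["SP"].fillna(group_median)
print(df1)
```

Compared with the merge version, this keeps the original row index and adds no helper column. A group whose values are all NaN has no median, so its rows stay NaN.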