Winsorizing data by column in pandas with NaN



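I want to winsorize several columns of data in a pandas DataFrame. Each column has some NaN, which affects the winsorization, so they need to be removed. The only way I know how to do this is to remove them for all of the data, rather than removing them column by column.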

MWE:

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan

Here is the baseline distribution:

df.quantile([0, 0.01, 0.5, 0.99, 1])

Output:

               one           two      three          four
0.00  2.336618e-10  2.294259e-07   0.002437      2.305353
0.01  9.862626e-07  9.742568e-04   0.975807   1003.814520
0.50  4.975859e-05  4.981049e-02  50.290946  50374.548980
0.99  9.897463e-05  9.898590e-02  98.978263  98991.438985
1.00  9.999983e-05  9.999966e-02  99.996793  99999.437779

This is how I am winsorizing:

def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])
wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

This gives:

Out[356]:
           one       two      three          four
0.00  0.000001  0.001060   1.536882   1003.820149
0.01  0.000001  0.001060   1.536882   1003.820149
0.25  0.000025  0.024975  25.200378  25099.994780
0.50  0.000050  0.049810  50.290946  50374.548980
0.75  0.000075  0.074842  74.794537  75217.343920
0.99  0.000099  0.098986  98.978263  98991.436957
1.00  0.000100  0.100000  99.996793  98991.436957

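Column four is correct because it has no NaN, but the other columns are not: the 99th percentile and the max should be the same. The observation counts are identical for both DataFrames: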

In [357]: df.count()
Out[357]:
one       90700
two       91600
three     63500
four     100000
dtype: int64
In [358]: wins.count()
Out[358]:
one       90700
two       91600
three     63500
four     100000
dtype: int64
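
One plausible reading of the numbers above, as a minimal sketch: it assumes winsorize ranks values by sorting the whole array, NaN included (an assumption about its internals, not something shown in this post). NumPy sorts NaN to the end, so the top 1% of slots would hold NaN rather than real values, and the 1% cut-off would be computed from the full length rather than from the non-missing count. The array `values` is made up for illustration.

```python
import numpy as np

# Hypothetical toy column: five real values and two NaN.
values = np.array([5.0, np.nan, 1.0, 3.0, np.nan, 2.0, 4.0])

# np.sort places NaN after every real number.
ordered = np.sort(values)

n_total = values.size                      # length including NaN
n_valid = int(np.sum(~np.isnan(values)))   # non-missing observations only

# The "largest" positions in sorted order are NaN, not the real maximum.
top_two = ordered[-2:]
print(ordered, n_total, n_valid, top_two)
```

If that assumption holds, it would explain why the real maximum of columns one, two and three is left untouched while the lower bound is cut too high.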

This is how I can "solve" it, but at the cost of losing a lot of my data:

wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

Out[360]:
               one       two      three          four
0.00  9.686203e-07  0.000928   0.965702   1005.209503
0.01  9.686203e-07  0.000928   0.965702   1005.209503
0.25  2.486052e-05  0.024829  25.204032  25210.837443
0.50  4.980946e-05  0.049894  50.299004  50622.227179
0.75  7.492750e-05  0.075059  74.837900  75299.906415
0.99  9.895563e-05  0.099014  98.972310  99014.311761
1.00  9.895563e-05  0.099014  98.972310  99014.311761
In [361]: wins2.count()
Out[361]:
one      51700
two      51700
three    51700
four     51700
dtype: int64
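
To see why the row filter is so costly, here is a small sketch on a made-up frame (`demo` and its values are hypothetical): a row survives `notnull().all(axis=1)` only if every column is present, so NaN in different rows of different columns add up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each column is missing a value in a different row.
demo = pd.DataFrame({
    "one": [1.0, np.nan, 3.0, 4.0],
    "two": [np.nan, 2.0, 3.0, 4.0],
    "three": [1.0, 2.0, 3.0, np.nan],
})

# Keep only rows where every column is non-null (the filter used for wins2).
complete = demo.loc[demo.notnull().all(axis=1)]

# Non-missing observations available in each column on its own.
per_column = demo.count()

print(len(complete), per_column.tolist())
```

Each column has three usable values here, but only one row is complete in all three.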

How can I winsorize the non-NaN data by column while maintaining the shape of the data (i.e. without removing rows)?

As often happens, simply creating the MWE helped clarify. I need to use clip() in combination with quantile(), as below:

df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])

Output:

               one       two      three          four
0.00  9.862626e-07  0.000974   0.975807   1003.814520
0.01  9.862666e-07  0.000974   0.975816   1003.820092
0.25  2.485043e-05  0.024975  25.200378  25099.994780
0.50  4.975859e-05  0.049810  50.290946  50374.548980
0.75  7.486737e-05  0.074842  74.794537  75217.343920
0.99  9.897462e-05  0.098986  98.978245  98991.436977
1.00  9.897463e-05  0.098986  98.978263  98991.438985
In [384]: df2.count()
Out[384]:
one       90700
two       91600
three     63500
four     100000
dtype: int64

These numbers differ from those above because I have kept all of the non-missing (non-NaN) data in each column.
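
Clipping at an interpolated quantile is close to, but not exactly, rank-based winsorization. If the minimum and the lower quantile (and the maximum and the upper quantile) must coincide exactly, one option is to winsorize only the non-NaN values of each column and write them back in place. The sketch below uses NumPy only; `winsorize_non_null` and `demo` are hypothetical names, and the rule (replace the k smallest values with the next one up, and likewise at the top, with k = int(limit * n)) is my assumption of how a rank-based winsorization is usually defined.

```python
import numpy as np
import pandas as pd


def winsorize_non_null(s, limit=0.01):
    """Rank-based winsorization of the non-NaN values of one column.

    NaN positions are left untouched, so the column keeps its shape.
    """
    values = s.to_numpy(dtype=float, copy=True)
    mask = ~np.isnan(values)
    n = int(mask.sum())   # count non-missing observations only
    k = int(limit * n)    # number of values to replace in each tail
    if k > 0:
        ordered = np.sort(values[mask])
        # Pull each tail in to the nearest value that is kept.
        values[mask] = np.clip(values[mask], ordered[k], ordered[n - k - 1])
    return pd.Series(values, index=s.index, name=s.name)


# Hypothetical toy data: five real values and one NaN per column.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
    "b": [np.nan, -50.0, 1.0, 2.0, 3.0, 4.0],
})

# With limit=0.25 and 5 valid values, int(1.25) = 1 value per tail is replaced.
wins = demo.apply(winsorize_non_null, limit=0.25)
print(wins)
```

On the question's data this would be called as df.apply(winsorize_non_null, limit=0.01); row counts and NaN positions stay as they were.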
