Python fillna does not fill all missing values



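I am working with a dataset in which one column has missing data. I want to fill the gaps in proportion to the values already observed, and I think I have that part covered. However, after running fillna, the value counts show that the new column still contains missing values (Out[95] still shows NaN).

Any ideas?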

valuecounts = df['parentocclist'].value_counts(dropna=False)
valuecountsinsample = df['parentocclist'].value_counts(normalize=True)
df['parentocclist2']=df['parentocclist'].fillna(pd.Series(np.random.choice([0.0,1.0,2.0],p=[0.656,0.268,0.076],size=len(df)))) # assign values with the in-sample probabilities
valuecountsnew = df['parentocclist2'].value_counts(dropna=False)
valuecounts
Out[93]: 
0.0    3559
NaN    2162
1.0    1456
2.0     411
Name: parentocclist, dtype: int64
valuecountsinsample
Out[94]: 
0.0    0.655916
1.0    0.268338
2.0    0.075746
Name: parentocclist, dtype: float64
valuecountsnew
Out[95]: 
0.0    4372
1.0    1838
NaN     854
2.0     524
Name: parentocclist2, dtype: int64

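The problem lies in how fillna is being used. When you pass a dict/Series/DataFrame, the fill values are matched to the column by index label, so you need to make explicit which indices should be filled. For example: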

import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame(np.random.choice([np.nan, 0, 1, 2, 3], 100000, replace=True),
                  columns=['sample_column'])
df.sample_column.value_counts(dropna=False)
# 2.0    20142
# 3.0    20140
# 0.0    19979
# 1.0    19978
# NaN    19761

Now generate a Series of replacement values and set its index to the positions of the NaN values in df:

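# index labels of the rows where the column is NaN
nan_index = df.index.values[df.sample_column.isnull()]
# one random draw per missing row, labelled with that row's index
serie_na = pd.Series(np.random.choice([0.0, 1.0, 2.0],
                                      p=[0.656, 0.268, 0.076],
                                      size=len(nan_index)),
                     index=nan_index)
df.sample_column.fillna(serie_na).value_counts(dropna=False)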
# 0.0    32974
# 1.0    25248
# 2.0    21638
# 3.0    20140
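
The same idea can be wrapped in a small helper. This is only a sketch: the function name `fill_na_by_proportion` and the toy data are made up for illustration, and it assumes the question's df has a non-default index (for example, left over after filtering rows). That would explain why only some NaN values were filled, since a Series built without an index gets labels 0..n-1. Here the probabilities are taken from the observed values rather than hard-coded.

```python
import numpy as np
import pandas as pd


def fill_na_by_proportion(s: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill NaN in `s` by sampling from the proportions of its observed values.

    Hypothetical helper, not part of pandas.
    """
    # value_counts drops NaN by default, so these are in-sample proportions
    proportions = s.value_counts(normalize=True)
    # index labels of the missing rows
    missing = s.index[s.isna()]
    draws = rng.choice(proportions.index.to_numpy(),
                       size=len(missing),
                       p=proportions.to_numpy())
    # label the draws with the missing rows' index so fillna can align them
    return s.fillna(pd.Series(draws, index=missing))


rng = np.random.default_rng(0)

# Toy column whose index does not start at 0 (an assumption about the question's data)
s = pd.Series([0.0, np.nan, 1.0, np.nan, 2.0, np.nan, 0.0, 0.0],
              index=range(10, 18))

# A Series with the default index 0..7 shares no labels with 10..17,
# so fillna finds nothing to align and the NaN values remain
misaligned = s.fillna(pd.Series(rng.choice([0.0, 1.0, 2.0], size=len(s))))
print(misaligned.isna().sum())

# With the draws indexed by the missing rows, every NaN is filled
filled = fill_na_by_proportion(s, rng)
print(filled.isna().sum())
```

If the index is not the cause in your case, passing `index=df.index` to the replacement Series (or filling from a plain NumPy array via `df.loc[mask, col] = draws`) are alternative ways to avoid relying on label alignment.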
