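I'm working with a dataset that has a column with missing data. I want to fill the gaps with values drawn according to the proportions of the observed data, and I think I have that part covered. However, after running fillna, the value counts show that the new column still has missing values (Out[95] still shows NaN).
Any ideas?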
valuecounts = df['parentocclist'].value_counts(dropna=False)
valuecountsinsample = df['parentocclist'].value_counts(normalize=True)
df['parentocclist2']=df['parentocclist'].fillna(pd.Series(np.random.choice([0.0,1.0,2.0],p=[0.656,0.268,0.076],size=len(df)))) # fill with values drawn with the in-sample proportions as probabilities
valuecountsnew = df['parentocclist2'].value_counts(dropna=False)
valuecounts
Out[93]:
0.0 3559
NaN 2162
1.0 1456
2.0 411
Name: parentocclist, dtype: int64
valuecountsinsample
Out[94]:
0.0 0.655916
1.0 0.268338
2.0 0.075746
Name: parentocclist, dtype: float64
valuecountsnew
Out[95]:
0.0 4372
1.0 1838
NaN 854
2.0 524
Name: parentocclist2, dtype: int64
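The problem is in how fillna is called. When you pass a dict/Series/DataFrame, the replacement values are matched to the column by index label, so you need to say which index labels should be filled. Rows whose labels do not appear in the replacement object stay NaN. For example: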
import numpy as np
import pandas as pd

np.random.seed(10)
df = pd.DataFrame(np.random.choice([np.nan, 0, 1, 2, 3], 100000, replace=True),
                  columns=['sample_column'])
df.sample_column.value_counts(dropna=False)
# 2.0 20142
# 3.0 20140
# 0.0 19979
# 1.0 19978
# NaN 19761
Now generate a Series of values to replace the NaNs, and set its index to the positions of the NaN values in df:
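# Index labels of the rows where the column is NaN
nan_index = df.index.values[df.sample_column.isnull()]
# Replacement values, indexed by those same labels so fillna can align them
serie_na = pd.Series(np.random.choice([0.0, 1.0, 2.0],
                                      p=[0.656, 0.268, 0.076],
                                      size=len(nan_index)),
                     index=nan_index)
df.sample_column.fillna(serie_na).value_counts(dropna=False)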
# 0.0 32974
# 1.0 25248
# 2.0 21638
# 3.0 20140
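
Applied to the column from the question, the same idea could look roughly like the sketch below. The helper name fill_na_by_proportion and the toy data are assumptions made for illustration. The question does not show df's index, so the index starting at 500 is only one plausible way to reproduce the leftover NaNs. The sketch also assumes the column has at least one observed value to sample from.

```python
import numpy as np
import pandas as pd


def fill_na_by_proportion(col, rng=None):
    """Fill NaNs in `col` by sampling observed values with their in-sample proportions."""
    rng = np.random.default_rng() if rng is None else rng
    # value_counts(normalize=True) drops NaN by default, so these are the
    # proportions among the observed values only.
    probs = col.value_counts(normalize=True)
    # Labels of the rows that need a value.
    missing = col.index[col.isna()]
    # Index the replacements by those labels so fillna can align them.
    fill = pd.Series(
        rng.choice(probs.index.to_numpy(), p=probs.to_numpy(), size=len(missing)),
        index=missing,
    )
    return col.fillna(fill)


# Hypothetical toy data whose index does NOT start at 0.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"parentocclist": rng.choice([np.nan, 0.0, 1.0, 2.0], size=1000)},
    index=np.arange(500, 1500),
)

# Approach from the question: the replacement Series is labelled 0..999, so
# only the rows labelled 500..999 overlap with df and can be filled.
naive = df["parentocclist"].fillna(pd.Series(rng.choice([0.0, 1.0, 2.0], size=len(df))))

# Replacement Series labelled with exactly the missing rows.
fixed = fill_na_by_proportion(df["parentocclist"], rng)

print("NaNs left (question's approach):", naive.isna().sum())
print("NaNs left (index-aligned fill): ", fixed.isna().sum())
```

Computing the probabilities from value_counts(normalize=True) rather than hard-coding them keeps the fill consistent if the data changes. Passing a seeded generator makes the result reproducible.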