Conditional imputation with the mean of non-missing column values, using only the pandas toolbox

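This question is specifically about pandas' own functionality. Some solutions already exist (pandas DataFrame: replace nan values with average of columns), but they rely on self-written functions.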

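SPSS has a function MEAN.n that gives you the mean of a list of numbers only when at least n elements of that list are valid (not pandas.NA). With that function you can impute missing values only if a minimum number of items are valid.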

Does pandas have a function for this?

Example

Values: [1, 2, 3, 4, NA]. The mean of the valid values is 2.5. The resulting list should be [1, 2, 3, 4, 2.5].

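Assume the rule is that, in a 5-item list, at least 3 items must be valid for imputation to take place. Otherwise the missing values stay NA.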

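Values: [1, 2, NA, NA, NA]. The mean of the valid values is 1.5, but that does not matter. The resulting list should remain unchanged, [1, 2, NA, NA, NA], because imputation is not allowed.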

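Assuming you want to work with pandas, you can define a custom wrapper (using only pandas functions) that calls fillna with the mean only if a minimum number of items are not NA: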

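import pandas as pd
from pandas import NA

s1 = pd.Series([1, 2, 3, 4, NA])
s2 = pd.Series([1, 2, NA, NA, NA])

def fillna_mean(s, N=4):
    # Leave s unchanged when fewer than N values are valid;
    # otherwise fill the missing values with the mean of the valid ones.
    return s if s.notna().sum() < N else s.fillna(s.mean())
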
fillna_mean(s1)
# 0    1.0
# 1    2.0
# 2    3.0
# 3    4.0
# 4    2.5
# dtype: float64
fillna_mean(s2)
# 0       1
# 1       2
# 2    <NA>
# 3    <NA>
# 4    <NA>
# dtype: object
fillna_mean(s2, N=2)
# 0    1.0
# 1    2.0
# 2    1.5
# 3    1.5
# 4    1.5
# dtype: float64
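
Since the question title mentions columns, the same idea can be extended to a whole DataFrame. The following is only a minimal sketch under stated assumptions: the function name `fillna_mean_columns` and the parameter `min_valid` are hypothetical, and the data uses float NaN instead of pandas.NA to keep the dtypes simple. It relies on `DataFrame.fillna` accepting a Series that maps column names to fill values.

```python
import numpy as np
import pandas as pd

def fillna_mean_columns(df, min_valid=3):
    """Fill NaNs with the column mean, but only in columns that have
    at least `min_valid` valid values (hypothetical helper)."""
    valid_counts = df.notna().sum()   # number of valid values per column
    means = df.mean()                 # column means; NaNs are skipped by default
    # Keep fill values only for the columns that meet the threshold
    fill_values = means[valid_counts >= min_valid]
    # Columns missing from fill_values are left untouched by fillna
    return df.fillna(fill_values)

df = pd.DataFrame({
    "a": [1, 2, 3, 4, np.nan],            # 4 valid values: imputed
    "b": [1, 2, np.nan, np.nan, np.nan],  # 2 valid values: left as-is
})
result = fillna_mean_columns(df, min_valid=3)
```

Here column `a` gets its NaN replaced by 2.5, while column `b` keeps its three NaNs because it falls below the threshold.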

Let's try a list comprehension, although it will be a messy one.

Option 1

You can use pd.Series and numpy (imported as pd and np)

s= [x if np.isnan(lst).sum()>=3 else pd.Series(lst).mean(skipna=True) if x is np.nan else x for x in lst]

Option 2, using numpy only

s=[x if np.isnan(lst).sum()>=3 else np.mean([x for x in lst if str(x) != 'nan']) if x is np.nan else x for x in lst]

Case 1

lst=[1, 2, 3, 4, np.nan]

Result

[1, 2, 3, 4, 2.5]

Case 2

lst=[1, 2, np.nan, np.nan, np.nan]

Result

[1, 2, nan, nan, nan]

If you want the result as a pd.Series, simply

pd.Series(s, name='lst')

How it works

s = [
    x if np.isnan(lst).sum() >= 3  # keep element x as-is if the list has 3 or more NaNs
    else pd.Series(lst).mean(skipna=True) if x is np.nan else x  # otherwise replace each NaN with the mean of the non-NaN elements
    for x in lst  # for every element in lst
]
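
For readers who find the one-liner hard to follow, the same logic can be written as a small function. This is a sketch, not part of the original answer: the name `impute_mean` and the parameter `max_missing` are hypothetical, and it uses `np.isnan` instead of the `x is np.nan` identity check, which only matches the `np.nan` object itself and not NaN values produced elsewhere.

```python
import numpy as np

def impute_mean(values, max_missing=2):
    """Replace NaNs with the mean of the valid values, unless more than
    `max_missing` values are NaN (hypothetical helper)."""
    arr = np.asarray(values, dtype=float)
    missing = np.isnan(arr)
    if missing.sum() > max_missing:
        # Too many missing values: return the data unchanged
        return arr.tolist()
    # np.nanmean ignores NaNs when computing the mean
    return np.where(missing, np.nanmean(arr), arr).tolist()

case1 = impute_mean([1, 2, 3, 4, np.nan])
case2 = impute_mean([1, 2, np.nan, np.nan, np.nan])
```

With `max_missing=2`, a list with 3 or more NaNs is returned unchanged, which matches the `>= 3` condition in the comprehension above. Note that the values come back as floats.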
