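How can I find the columns where a single category (including NaN) accounts for at least 80% of the data, across all categorical variables in a Python Pandas DataFrame?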



I have a Pandas DataFrame in Python like the one below:

COL1  COL2  COL3
ABC   11    NaN
NaN   10    NaN
ABC   11    NaN
ABC   11    NaN
DDD   12    NaN
ABC   NaN   GAME

If you want a one-liner, you can apply a lambda function:

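import pandas as pd

# Note: missing values are written here as the string 'NaN'
df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
                   'COL2': [11, 10, 11, 11, 12, 'NaN'],
                   'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})
# Percentage of rows taken by the most frequent value in each column
df.apply(lambda x: (100*x.value_counts(dropna=False).iloc[0]/x.shape[0])).to_frame('% One Value')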
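
The one-liner above reports a percentage per column but does not yet pick out the columns that reach the 80% threshold. A minimal sketch of that last step follows, assuming the missing values are real `np.nan` rather than the string 'NaN'. The names `top_share` and `dominant_cols` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL1": ["ABC", np.nan, "ABC", "ABC", "DDD", "ABC"],
    "COL2": [11, 10, 11, 11, 12, np.nan],
    "COL3": [np.nan, np.nan, np.nan, np.nan, np.nan, "GAME"],
})

# Share of the most frequent value per column; dropna=False counts NaN as a value
top_share = df.apply(
    lambda s: s.value_counts(dropna=False, normalize=True).iloc[0]
)

# Keep the columns whose most frequent value covers at least 80% of the rows
dominant_cols = top_share[top_share >= 0.8].index.tolist()
print(dominant_cols)
```

This checks every column. To restrict it to categorical variables, `df` could first be narrowed with `select_dtypes`.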

Although @Jan's answer works, I would like to propose a pandas implementation that avoids the for loop:

# Select the categorical (object-dtype) columns
categories = df.select_dtypes(include='object')
# Fill the NaN values with a placeholder
# (assuming the NaNs are np.nan and not the string representation 'NaN')
categories = categories.fillna("NO DATA")
# Describe the DataFrame to get the count of the most frequent value ("freq")
description = categories.describe().transpose()
# Keep only the columns whose most frequent value covers at least 80% of the rows
my_list = description[description["freq"] >= description["count"]*0.8].index.values

Output (in my_list):

array(['COL3'], dtype=object)
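
The assumption noted in the comments (real `np.nan` versus the string 'NaN') matters because `fillna` only touches real missing values. The sketch below illustrates the difference on a single column. The variable names are hypothetical, and `Series.mask` is just one way to convert the text marker.

```python
import pandas as pd

s = pd.Series(["NaN", "NaN", "NaN", "NaN", "NaN", "GAME"])

# The string 'NaN' is ordinary text, so pandas does not see it as missing
n_missing_before = int(s.isna().sum())

# Replace the text marker with a real missing value (mask inserts NaN where True)
cleaned = s.mask(s == "NaN")
n_missing_after = int(cleaned.isna().sum())

# Now fillna has something to fill, and the placeholder is counted as a category
top_count = int(cleaned.fillna("NO DATA").value_counts().iloc[0])
print(n_missing_before, n_missing_after, top_count)
```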
