I have a Pandas DataFrame in Python like the one below:
COL1  COL2  COL3
ABC   11    NaN
NaN   10    NaN
ABC   11    NaN
ABC   11    NaN
DDD   12    NaN
ABC   NaN   GAME
If you want a one-liner, you can apply a lambda function:
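import pandas as pd

# Note: 'NaN' here is a plain string, not np.nan
df = pd.DataFrame({'COL1': ['ABC', 'NaN', 'ABC', 'ABC', 'DDD', 'ABC'],
                   'COL2': [11, 10, 11, 11, 12, 'NaN'],
                   'COL3': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'GAME']})
# Percentage of rows taken by the most frequent value in each column
df.apply(lambda x: (100*x.value_counts(dropna=False).iloc[0]/x.shape[0])).to_frame('% One Value')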
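To make it easier to see what the one-liner computes for each column, here is a minimal sketch with the steps spelled out. It assumes the missing values are real `np.nan` rather than the string 'NaN', and the helper name `top_value_share` is hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: missing values are real np.nan, not the string 'NaN'
df = pd.DataFrame({
    'COL1': ['ABC', np.nan, 'ABC', 'ABC', 'DDD', 'ABC'],
    'COL2': [11, 10, 11, 11, 12, np.nan],
    'COL3': [np.nan, np.nan, np.nan, np.nan, np.nan, 'GAME'],
})

def top_value_share(col: pd.Series) -> float:
    """Hypothetical helper: percentage of rows held by the most frequent value."""
    # value_counts sorts by count in descending order; dropna=False counts NaN too
    counts = col.value_counts(dropna=False)
    return 100 * counts.iloc[0] / len(col)

# apply() calls the helper once per column and returns one number per column
result = df.apply(top_value_share).to_frame('% One Value')
print(result)
```

With this data, COL3 comes out highest because NaN fills 5 of its 6 rows.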
While @Jan's answer works, I would like to propose a pandas implementation that avoids the for loop:
# Select the categorical (object-dtype) columns
categories = df.select_dtypes(include='object')
# Fill the NaN values with a placeholder so they are counted as a value
# (assuming the NaNs are np.nan and not the string representation of NaN)
categories = categories.fillna("NO DATA")
# Describe the df to get each column's row count and top-value frequency
description = categories.describe().transpose()
# Keep only the columns whose most frequent value covers more than 80% of the rows
my_list = description[description["freq"] > description["count"]*0.8].index.values
Output (in my_list):
array(['COL3'], dtype=object)
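Note that `select_dtypes(include='object')` skips numeric columns, so a column like COL2 is left out when its missing value is a real `np.nan`. A possible variant that checks every column regardless of dtype is sketched below. It assumes real `np.nan` values and reuses the 0.8 threshold from the code above; the variable names `threshold` and `shares` are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: missing values are real np.nan, not the string 'NaN'
df = pd.DataFrame({
    'COL1': ['ABC', np.nan, 'ABC', 'ABC', 'DDD', 'ABC'],
    'COL2': [11, 10, 11, 11, 12, np.nan],
    'COL3': [np.nan, np.nan, np.nan, np.nan, np.nan, 'GAME'],
})

threshold = 0.8  # same cut-off as in the describe()-based version

# normalize=True returns fractions instead of counts; dropna=False keeps NaN
# as a value, so iloc[0] is the share of the most frequent value (NaN included)
shares = df.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Boolean mask over the per-column shares, then collect the column names
my_list = shares[shares > threshold].index.tolist()
print(my_list)  # ['COL3']
```

This still calls a Python function once per column through `apply`, so it is not fully vectorized, but it avoids the placeholder fill and works for numeric columns as well.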