我需要一种方法来完成以下操作:
- 我希望能够循环通过
dtype == object
的数据帧中的每一列 - 当我循环浏览每个所需的列时,我获得了以下摘要统计信息:
- 属性的名称
- 每个属性具有的唯一值的数量
- 每个属性最频繁出现的值
- 最频繁出现的值的出现次数
- 最频繁出现的值占值总数的百分比
假设我有以下数据:
import pandas as pd
data = {"Sex":["M", "M", "F", "F", "F", "F", "M", "F", "F", "M"],
"Product": ["X", "Y", "Y", "Z","Y", "Y", "Y", "X", "Z", "Y"],
"Answer":["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
"Numeric":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Binary":[True, True, False, False, False, True, False, True, True, False]}
df = pd.DataFrame(data)
我需要一些方法来完成以下操作(我知道代码不正确——希望我的伪代码能传达我的意思(
for column in df:
if df[column].dtype != object:
continue
else:
#pseudo-code begins
attributes = df[column].name #attribute names
unique_values = df[column].unique() #number of unique values
most_frequent = df[column].mode() #most frequently occurring
occurrences = df[column][most_frequent].nunique() #occurrences of the mode
proportions = df[column][most_frequent].value_counts(normalize = True) #to get
#proportions
然后我需要一些方法将所有这些信息制成表格,形成某种汇总统计表
summaryStats = tabluate([attributes, unique_values, most_frequent, occrrences, proportions])
结果输出应该类似于
| Attribute | Unique Values | Mode | Occurrences | % of Total |
|----------- --------------- ------ ------------- ------------|
| Sex F, M, F 6 60% |
| Product X, Y, Z Y 7 70% |
| ... ... ... ... ... |
#...and so on and so forth for the other attributes
我本质上是在为离散数据创建一个汇总表。
任何帮助都将不胜感激:(
试试这个:
data = []
# Loop through columns of dtype `object`
for col in df.select_dtypes("object"):
# Do a count of each unique value in the column
freq = df[col].value_counts()
data.append({
"Attribute": col,
# The unique values are the index of the frequency count
"Unique Values": ", ".join(freq.index.sort_values()),
# The frequency dataframe is sorted by default, with the most-frequent
# value appearing first
"Mode": freq.index[0],
"Occurrence": freq[0],
"% of Total": freq[0] / freq.sum() * 100
})
summary = pd.DataFrame(data)
创建一个包含列和索引的空数据帧。"属性"列将立即填充一个列表。创建一个列表aaa,将其值替换为loc(行索引在左边,列名在右边(。你可以在这里阅读loc(显式索引(。
产生"Unique Values"的行使用",".join将列表连接为字符串。在"Occurrences"到isin中,我们得到了一个布尔掩码,用它我们得到了所有唯一的值和计数。在"%of Total"中,我们提取第一个元素,因为数据经过排序并乘以100。此外,你应该有6在产品发生,当然,和%的总数将是不同的。
df1 = pd.DataFrame(index=[0, 1], columns=['Attribute', 'Unique Values', 'Mode', 'Occurrences', '% of Total'])
df1['Attribute'] = ['Sex', 'Product']
aaa = ['Sex', 'Product']
for i in range(len(aaa)):
df1.loc[i, 'Unique Values'] = ','.join(df[aaa[i]].unique())
mod = df[aaa[i]].mode()
df1.loc[i, 'Mode'] = mod[0]
df1.loc[i, 'Occurrences'] = df.loc[df[aaa[i]].isin(mod), aaa[i]].count()
df1.loc[i, '% of Total'] = df[aaa[i]].value_counts(normalize=True)[0] * 100
输出
Attribute Unique Values Mode Occurrences % of Total
0 Sex M,F F 6 60.0
1 Product X,Y,Z Y 6 60.0