如何在数据帧列中循环以获取统计信息



我需要一种方法来完成以下操作:

  • 我希望能够循环通过dtype == object的数据帧中的每一列
  • 当我循环浏览每个所需的列时,我获得了以下摘要统计信息:
    • 属性的名称
    • 每个属性具有的唯一值的数量
    • 每个属性最频繁出现的值
    • 最频繁出现的值的出现次数
    • 最频繁出现的值占值总数的百分比

假设我有以下数据:

import pandas as pd
data = {"Sex":["M", "M", "F", "F", "F", "F", "M", "F", "F",  "M"],
"Product": ["X", "Y", "Y", "Z","Y", "Y", "Y", "X", "Z", "Y"],
"Answer":["Yes", "Yes", "Yes", "No",  "No", "Yes", "Yes", "No",  "Yes", "Yes"],
"Numeric":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Binary":[True, True, False, False, False, True, False, True, True, False]}
df = pd.DataFrame(data)

我需要一些方法来完成以下操作(我知道代码不正确——希望我的伪代码能传达我的意思(

for column in df:
if df[column].dtype != object:
continue
else:
#pseudo-code begins
attributes = df[column].name #attribute names
unique_values = df[column].unique() #number of unique values
most_frequent = df[column].mode() #most frequently occurring
occurrences = df[column][most_frequent].nunique() #occurrences of the mode
proportions = df[column][most_frequent].value_counts(normalize = True) #to get 
#proportions

然后我需要一些方法将所有这些信息制成表格,形成某种汇总统计表

summaryStats = tabluate([attributes, unique_values, most_frequent, occrrences, proportions])

结果输出应该类似于

| Attribute | Unique Values | Mode | Occurrences | % of Total |
|----------- --------------- ------ ------------- ------------|
| Sex          F, M,          F         6             60%     |
| Product      X, Y, Z        Y         7             70%     |
| ...            ...          ...      ...           ...      |
#...and so on and so forth for the other attributes

我本质上是在为离散数据创建一个汇总表。

任何帮助都将不胜感激:(

试试这个:

data = []
# Loop through columns of dtype `object`
for col in df.select_dtypes("object"):
# Do a count of each unique value in the column
freq = df[col].value_counts()
data.append({
"Attribute": col,
# The unique values are the index of the frequency count
"Unique Values": ", ".join(freq.index.sort_values()),
# The frequency dataframe is sorted by default, with the most-frequent
# value appearing first
"Mode": freq.index[0],
"Occurrence": freq[0],
"% of Total": freq[0] / freq.sum() * 100
})
summary = pd.DataFrame(data)

创建一个包含列和索引的空数据帧。"属性"列将立即填充一个列表。创建一个列表aaa,将其值替换为loc(行索引在左边,列名在右边(。你可以在这里阅读loc(显式索引(。

产生"Unique Values"的行使用",".join将列表连接为字符串。在"Occurrences"到isin中,我们得到了一个布尔掩码,用它我们得到了所有唯一的值和计数。在"%of Total"中,我们提取第一个元素,因为数据经过排序并乘以100。此外,你应该有6在产品发生,当然,和%的总数将是不同的。

df1 = pd.DataFrame(index=[0, 1], columns=['Attribute', 'Unique Values', 'Mode', 'Occurrences', '% of Total'])
df1['Attribute'] = ['Sex', 'Product']
aaa = ['Sex', 'Product']
for i in range(len(aaa)):
df1.loc[i, 'Unique Values'] = ','.join(df[aaa[i]].unique())
mod = df[aaa[i]].mode()
df1.loc[i, 'Mode'] = mod[0]
df1.loc[i, 'Occurrences'] = df.loc[df[aaa[i]].isin(mod), aaa[i]].count()
df1.loc[i, '% of Total'] = df[aaa[i]].value_counts(normalize=True)[0] * 100

输出

Attribute Unique Values Mode Occurrences % of Total
0       Sex           M,F    F           6       60.0
1   Product         X,Y,Z    Y           6       60.0

最新更新