r - Creating an aggregated dataset for analysis in Python



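I'm an R user learning Python, and I'm trying to create an aggregated dataset in Python the way I would in R or SQL. However, Python behaves differently from what I expected, and I'm not sure how to create the dataset in the format I need for my work.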

R

library(dplyr)
# Create sample data
team <- c("Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees")
pos <- c("Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher")
age <- c(24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31)
baseball_example <- data.frame(team, pos, age)

average_age_by_team_position <- baseball_example %>% group_by(team, pos) %>% summarise(mean_age = mean(age))
print(average_age_by_team_position)

The output looks like this:

  team    pos         mean_age
1 Red Sox Not Pitcher     28
2 Red Sox Pitcher         30.7
3 Yankees Not Pitcher     30.7
4 Yankees Pitcher         26

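When I try to do this in Python, the grouped columns look different. This means I can't use the output as the basis for further analysis, or export it as a CSV file.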

Python

import pandas as pd
baseball_example = {"team": ["Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Red Sox", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees", "Yankees"],
"pos": ["Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher", "Pitcher", "Pitcher", "Pitcher", "Not Pitcher", "Not Pitcher", "Not Pitcher"],
"age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31]}

baseball_example = pd.DataFrame(baseball_example)
average_age_by_team_position = baseball_example.groupby(['team', 'pos']).agg("mean")
print(average_age_by_team_position)
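                           age
team    pos
Red Sox Not Pitcher  28.000000
        Pitcher      30.666667
Yankees Not Pitcher  30.666667
        Pitcher      26.000000

Can anyone suggest how to write a version of the Python code whose output looks like the R output?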

Thanks! :)

Tony

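After posting my question, I did some more research and found the answer. This appears to come down to how pandas uses indexes: after groupby, the grouping columns become the index of the result rather than ordinary columns.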

It can be solved by resetting the index, as follows:

average_age_by_team_position = average_age_by_team_position.reset_index()
print(average_age_by_team_position)

I found this on the following site: https://jamesrledoux.com/code/group-by-aggregate-pandas
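For anyone who wants to see the whole thing end to end, here is a minimal sketch of the index behaviour and two ways to get a flat, R-style table. It rebuilds the same sample data; the column name `mean_age` is borrowed from the R code above, and the variable names `grouped`, `via_reset`, `flat` and `csv_text` are only illustrative. As far as I can tell, passing `as_index=False` to `groupby` gives the same result as calling `reset_index()` afterwards, so treat the second option as an alternative rather than the recommended way.

```python
import pandas as pd

# Same sample data as in the question, built more compactly.
baseball_example = pd.DataFrame({
    "team": ["Red Sox"] * 6 + ["Yankees"] * 6,
    "pos": (["Pitcher"] * 3 + ["Not Pitcher"] * 3) * 2,
    "age": [24, 28, 40, 22, 29, 33, 31, 26, 21, 36, 25, 31],
})

# Default groupby: the grouping keys end up in a MultiIndex, not in columns.
grouped = baseball_example.groupby(["team", "pos"]).agg(mean_age=("age", "mean"))
print(list(grouped.index.names))  # names of the index levels
print(list(grouped.columns))      # only the aggregated column is a real column

# Option 1: move the index levels back into ordinary columns afterwards.
via_reset = grouped.reset_index()

# Option 2: ask groupby to keep the keys as columns from the start.
flat = baseball_example.groupby(["team", "pos"], as_index=False).agg(
    mean_age=("age", "mean")
)
print(flat)

# index=False leaves the row labels out of the CSV text.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Either flat table can then be filtered, merged or exported like any other DataFrame. Passing a file path to `to_csv` writes the file instead of returning a string.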
