I have a large dataset with many columns, but I only need two of them: parent education level and gender.
   parent_edu         gender     n
   <chr>              <chr>  <int>
 1 associate's degree female   116
 2 associate's degree male     106
 3 bachelor's degree  female    63
 4 bachelor's degree  male      55
 5 high school        female    94
 6 high school        male     102
 7 master's degree    female    36
 8 master's degree    male      23
 9 some college       female   118
10 some college       male     108
11 some high school   female    91
12 some high school   male      88
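To get this far, I used the count function to generate a new column n, which counts how many female students and how many male students have parents at each education level: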
library(dplyr)
student1 %>%
  count(parent_edu, gender)
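For readers without access to student1, here is a minimal sketch of what count() does on raw, one-row-per-student data. The toy tibble and its values are invented for illustration and are not taken from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for student1: one row per student (values invented)
toy <- tibble(
  parent_edu = c("high school", "high school", "high school",
                 "some college", "some college"),
  gender     = c("female", "male", "male", "female", "female")
)

# count() groups by the listed columns and stores each group's size in a new column n
toy_counts <- toy %>%
  count(parent_edu, gender)

toy_counts
```

Each distinct parent_edu/gender combination that occurs in the data becomes one row, which is the shape shown in the table above.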
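The last step is to add a final column giving the share of each gender within each education level. For example, for "some college" it is 52% female and 48% male, and for "high school" it is 48% female and 52% male. So far I have been trying, without success, to use the mutate function like this: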
student1 %>%
  count(parent_edu, gender) %>%
  mutate(percentage =
Can anyone tell me what expression I should put there, or whether I need to add another function to the pipe? The final result should look like this:
parent_edu         gender     n percentage
<chr>              <chr>  <int>      <dbl>
associate's degree female   116       0.52
associate's degree male     106       0.48
bachelor's degree  female    63       0.53
bachelor's degree  male      55       0.47
high school        female    94       0.48
high school        male     102       0.52
master's degree    female    36       0.61
master's degree    male      23       0.39
some college       female   118       0.52
some college       male     108       0.48
Here is the dput output:
df <- structure(list(parent_edu = c("associate's degree", "associate's degree",
"bachelor's degree", "bachelor's degree", "high school", "high school",
"master's degree", "master's degree", "some college", "some college"
), gender = c("female", "male", "female", "male", "female", "male",
"female", "male", "female", "male"), n = c(116, 106, 63, 55,
94, 102, 36, 23, 118, 108)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Solution:
library(dplyr)

df <- df %>%
  group_by(parent_edu) %>%                       # group by parent education level
  mutate(total = sum(n)) %>%                     # total count within each group
  mutate(percentage = n / total) %>%             # share of each gender (a proportion between 0 and 1)
  mutate(percentage = round(percentage, 2)) %>%  # round to match the example in the question
  select(-total)                                 # drop the helper column
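The helper column total is not strictly required, because sum(n) inside a grouped mutate() is already evaluated per group. The following is a more compact sketch of the same idea, not a replacement for the answer above. The name counts is a hypothetical stand-in that rebuilds the dput data so the snippet runs on its own, and the final ungroup() is an addition that the original answer does not include.

```r
library(dplyr)

# Same counts as in the dput output, rebuilt here so the snippet is self-contained
counts <- tibble(
  parent_edu = rep(c("associate's degree", "bachelor's degree", "high school",
                     "master's degree", "some college"), each = 2),
  gender     = rep(c("female", "male"), times = 5),
  n          = c(116, 106, 63, 55, 94, 102, 36, 23, 118, 108)
)

result <- counts %>%
  group_by(parent_edu) %>%
  mutate(percentage = round(n / sum(n), 2)) %>%  # sum(n) is computed within each parent_edu group
  ungroup()                                      # assumption: later steps should not stay grouped

result
```

Leaving the data grouped is harmless for printing, but removing the grouping avoids surprises if further summaries are run on the result.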
Applied to the original data, the final answer is:
student1 %>%
  count(parent_edu, gender) %>%
  group_by(parent_edu) %>%                       # grouping by parent education
  mutate(total = sum(n)) %>%                     # total within groups
  mutate(percentage = (n/total)) %>%             # calculating percentage
  mutate(percentage = round(percentage, 2)) %>%  # rounding to match your example
  select(-total)                                 # dropping the total column
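As a quick spot check, one group can be worked by hand. This sketch uses the two "master's degree" counts from the table in the question; the variable names are only illustrative.

```r
# Counts for "master's degree" taken from the table in the question
n_female <- 36
n_male   <- 23

# Share of each gender = its count divided by the total for that education level
share_female <- round(n_female / (n_female + n_male), 2)
share_male   <- round(n_male   / (n_female + n_male), 2)

c(female = share_female, male = share_male)
```

These values should agree with the percentage column for the "master's degree" rows in the expected output.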