r - dplyr: generate a new column giving the percentage of rows for each value of a boolean column



I have a large data set with many columns, but only two of them matter here: parents' education level and gender.

   parent_edu         gender     n
   <chr>              <chr>  <int>
 1 associate's degree female   116
 2 associate's degree male     106
 3 bachelor's degree  female    63
 4 bachelor's degree  male      55
 5 high school        female    94
 6 high school        male     102
 7 master's degree    female    36
 8 master's degree    male      23
 9 some college       female   118
10 some college       male     108
11 some high school   female    91
12 some high school   male      88

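From here, I need to use the count function to generate a new column n that counts how many female students and how many male students have parents at each education level.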

student1 %>%
  count(parent_edu, gender)
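
To make the counting step concrete, here is a minimal sketch on made-up data. The name student_toy and its five rows are hypothetical stand-ins for student1, not the real data set; count() returns one row per distinct (parent_edu, gender) pair, with the number of matching rows in n.

```r
library(dplyr)

# Hypothetical toy data standing in for student1 (not the real data set)
student_toy <- tibble(
  parent_edu = c("high school", "high school", "high school",
                 "some college", "some college"),
  gender     = c("female", "male", "male", "female", "female")
)

# One row per (parent_edu, gender) combination; n holds the row count
counts <- student_toy %>%
  count(parent_edu, gender)

counts
```

Combinations that never occur in the data (here, "some college" with "male") simply do not appear in the result.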

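The last step is to get a final column that gives the share of each gender within each education level. For example, "some college" would have 52% female and 48% male, and "high school" perhaps 47% female and 53% male. So far I have been using mutate like this, without success: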

student1 %>%
count(parent_edu, gender) %>%
mutate(percentage = 

Can anyone tell me what expression to put there, or whether I should add another function to the pipe? The final result should look like this:

parent_edu         gender      n      percentage
<chr>              <chr>      <int>    <dbl>
associate's degree  female    116      0.52
associate's degree  male      106      0.48
bachelor's degree   female    63       0.53
bachelor's degree   male      55       0.47
high school         female    94       0.48
high school         male      102      0.52
master's degree     female    36       0.61
master's degree     male      23       0.39
some college        female    118      0.52
some college        male      108      0.48

Including dput:

df <- structure(list(parent_edu = c("associate's degree", "associate's degree", 
"bachelor's degree", "bachelor's degree", "high school", "high school", 
"master's degree", "master's degree", "some college", "some college"
), gender = c("female", "male", "female", "male", "female", "male", 
"female", "male", "female", "male"), n = c(116, 106, 63, 55, 
94, 102, 36, 23, 118, 108)), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame")) 

Solution:

library(dplyr)

df <- df %>%
  group_by(parent_edu) %>%                          # group by parent education
  mutate(total = sum(n)) %>%                        # total count within each group
  mutate(percentage = n / total) %>%                # share of each gender within the group
  mutate(percentage = round(percentage, 2)) %>%     # round to match your example
  select(-total)                                    # drop the helper total column
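
The same idea can probably be written more compactly. After group_by(), sum(n) inside mutate() is evaluated per group, so the helper total column is not strictly needed. The sketch below is one possible variant, not the answer's original code; df_small is a hypothetical name holding the first four rows of the dput data, and the trailing ungroup() is an addition so that later steps are not silently grouped.

```r
library(dplyr)

# Hypothetical subset: the first four rows of the dput data above
df_small <- tibble(
  parent_edu = c("associate's degree", "associate's degree",
                 "bachelor's degree", "bachelor's degree"),
  gender     = c("female", "male", "female", "male"),
  n          = c(116, 106, 63, 55)
)

result <- df_small %>%
  group_by(parent_edu) %>%
  mutate(percentage = round(n / sum(n), 2)) %>%  # sum(n) is the per-group total
  ungroup()                                      # drop the grouping afterwards

result$percentage
# 0.52 0.48 0.53 0.47
```

These values agree with the first four rows of the expected output in the question.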

The final answer is:

student1 %>%
  count(parent_edu, gender) %>%
  group_by(parent_edu) %>%                          # grouping by parent education
  mutate(total = sum(n)) %>%                        # total within groups
  mutate(percentage = (n/total)) %>%                # calculating percentage
  mutate(percentage = round(percentage, 2)) %>%     # rounding to match your example
  select(-total)                                    # dropping the total column
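
As a quick sanity check, the unrounded shares within each education level should add up to 1. The sketch below assumes the counts from the dput data; the names df_check and share_sum are hypothetical and only used for illustration.

```r
library(dplyr)

# Counts copied from the dput data above
df_check <- tibble(
  parent_edu = rep(c("associate's degree", "bachelor's degree", "high school",
                     "master's degree", "some college"), each = 2),
  gender     = rep(c("female", "male"), times = 5),
  n          = c(116, 106, 63, 55, 94, 102, 36, 23, 118, 108)
)

# Sum the unrounded within-group shares for each education level
check <- df_check %>%
  group_by(parent_edu) %>%
  summarise(share_sum = sum(n / sum(n)))

# Every group should sum to 1 (up to floating-point tolerance)
stopifnot(isTRUE(all.equal(check$share_sum, rep(1, 5))))
```

The check uses unrounded values on purpose: after round(percentage, 2), the shares in a group can in general be off from 1 by a hundredth.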
