R - How to use group_by with mean and sum in dplyr?



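I have a dataset like the one below, with 100 rows. I want to compute the mean student score by year and school, so I would get one value for ISD 1 in 2019, one for ISD 1 in 2020, and so on. I am using this code to do it.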

df <- df %>%
group_by(Year, `School Name`) %>% 
summarise(across(everything(), .f = list(mean = mean), na.rm = TRUE))

But I also need the n behind each mean: I want to know how many students each mean is based on. And how should the NAs be handled?

Year  School Name  Student Score  Student Score
2019  ISD 1        1              NA
2020  ISD 4        4              2
2020  ISD 3        NA             3
2018  ISD 1        4              NA
2019  ISD 4        2              5
2020  ISD 4        3              2
2019  ISD 3        NA             1
2018  ISD1         2              4

If I understand correctly, this may help:

# Libraries (tidyverse attaches dplyr and tibble, which is all this answer uses)
library(tidyverse)
#Data
df <-
tibble::tribble(
~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
2019L,      "ISD 1",             1L,             NA,
2020L,      "ISD 4",             4L,             2L,
2020L,      "ISD 3",             NA,             3L,
2018L,      "ISD 1",             4L,             NA,
2019L,      "ISD 4",             2L,             5L,
2020L,      "ISD 4",             3L,             2L,
2019L,      "ISD 3",             NA,             1L,
2018L,      "ISD 1",             2L,             4L
)
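# How to: group by year and school, then count rows and average each score column
df %>%
  group_by(Year, School.Name) %>%
  summarise(
    n = n(),  # number of rows in the group (rows with NA scores included)
    across(.cols = contains(".Score"), .fns = function(x) mean(x, na.rm = TRUE))
  )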
# A tibble: 6 x 5
# Groups:   Year [3]
   Year School.Name     n Student.Score1 Student.Score2
  <int> <chr>       <int>          <dbl>          <dbl>
1  2018 ISD 1           2            3                4
2  2019 ISD 1           1            1              NaN
3  2019 ISD 3           1          NaN                1
4  2019 ISD 4           1            2                5
5  2020 ISD 3           1          NaN                3
6  2020 ISD 4           2            3.5              2
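
Note that n() above counts rows per group, not the number of non-missing scores, and that a group whose scores are all NA yields NaN. A minimal sketch, assuming the goal is a mean plus the count of non-NA values for each score column: the data is the same tribble as above, and the `_mean` and `_n` suffixes come from the default naming in across(), not from the original answer.

```r
library(dplyr)

# Same example data as in the answer above
df <- tibble::tribble(
  ~Year, ~School.Name, ~Student.Score1, ~Student.Score2,
  2019L, "ISD 1", 1L, NA,
  2020L, "ISD 4", 4L, 2L,
  2020L, "ISD 3", NA, 3L,
  2018L, "ISD 1", 4L, NA,
  2019L, "ISD 4", 2L, 5L,
  2020L, "ISD 4", 3L, 2L,
  2019L, "ISD 3", NA, 1L,
  2018L, "ISD 1", 2L, 4L
)

res <- df %>%
  group_by(Year, School.Name) %>%
  summarise(
    across(
      contains(".Score"),
      list(
        mean = ~ mean(.x, na.rm = TRUE),  # mean over non-missing scores
        n    = ~ sum(!is.na(.x))          # how many scores went into that mean
      )
    ),
    .groups = "drop"
  )

res
```

Each score column then gets a pair of columns such as Student.Score1_mean and Student.Score1_n; where the count is 0, the NaN mean can be read as "no data".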

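I'm guessing the Student Score columns represent individual students, whose scores should be pooled with those of other students from the same school and the same year. If that is the case, the data should probably be reshaped to long format first, like this: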

library(dplyr); library(tidyr)
df %>%
  # reshape, keeping Year and School.Name as keys
  pivot_longer(-c(Year, School.Name)) %>%
  group_by(Year, School.Name) %>%
  # drop missing scores so that n() counts only actual scores
  filter(!is.na(value)) %>%
  summarise(mean = mean(value),
            n = n(), .groups = "drop")

Result

   Year School.Name  mean     n
  <int> <chr>       <dbl> <int>
1  2018 ISD 1        4        1
2  2018 ISD1         3        2
3  2019 ISD 1        1        1
4  2019 ISD 3        1        1
5  2019 ISD 4        3.5      2
6  2020 ISD 3        3        1
7  2020 ISD 4        2.75     4

(Note that I used the data as is, but I suspect "ISD 1" and "ISD1" should be the same school, in which case you may want to do some data cleaning first.)
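
One possible cleaning step, as a sketch only: it assumes every school name has the form "ISD" followed by a number, and that the only inconsistency is a missing space. The `schools` data frame and the regular expression are illustrative, not part of the original data.

```r
library(dplyr)

# Illustrative data with the inconsistent spelling seen in the question
schools <- data.frame(
  School.Name = c("ISD 1", "ISD1", "ISD 4"),
  stringsAsFactors = FALSE
)

# Normalise "ISD<number>" and "ISD <number>" to a single "ISD <number>" form
cleaned <- schools %>%
  mutate(School.Name = sub("^ISD[[:space:]]*([0-9]+)$", "ISD \\1", School.Name))

cleaned$School.Name
```

Running this before group_by() would merge the "ISD1" rows into the "ISD 1" group; names that do not match the pattern are left unchanged.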

Starting data: (note that the non-uniquely named Student Score columns are renamed by the data.frame function so that they have unique names, i.e. Student.Score and Student.Score.1)

df <- data.frame(
stringsAsFactors = FALSE,
Year = c(2019L, 2020L, 2020L, 2018L, 2019L, 2020L, 2019L, 2018L),
`School Name` = c("ISD 1","ISD 4","ISD 3",
"ISD 1","ISD 4","ISD 4","ISD 3","ISD1"),
`Student Score` = c(1L, 4L, NA, 4L, 2L, 3L, NA, 2L),
`Student Score` = c(NA, 2L, 3L, NA, 5L, 2L, 1L, 4L)
) 
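
The renaming mentioned above comes from the default check.names = TRUE in data.frame(), which makes names syntactically valid and unique. A small sketch with made-up one-row data frames (`dup` and `raw` are illustrative names):

```r
# Default check.names = TRUE: spaces become dots and duplicates get a numeric suffix
dup <- data.frame(
  `School Name` = "ISD 1",
  `Student Score` = 1L,
  `Student Score` = 2L
)
names(dup)

# check.names = FALSE keeps the names as written, so backticks are needed later
raw <- data.frame(
  `School Name` = "ISD 1",
  `Student Score` = 1L,
  check.names = FALSE
)
names(raw)
```

This is why some snippets in this thread refer to School.Name while others use `School Name` in backticks: which form works depends on how the data frame was created.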

Pivoting to long format may be a good approach.

df %>%
  # stack all score columns into a single `value` column
  pivot_longer(cols = c(-Year, -`School Name`)) %>%
  group_by(Year, `School Name`) %>%
  summarise(mean = mean(value, na.rm = TRUE))


# A tibble: 6 x 3
# Groups:   Year [3]
Year `School Name`  mean
<int> <chr>         <dbl>
1  2018 ISD 1          3.33
2  2019 ISD 1          1   
3  2019 ISD 3          1   
4  2019 ISD 4          3.5 
5  2020 ISD 3          3   
6  2020 ISD 4          2.75
