r - dplyr: piping multiple datasets into summarize()



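I am using dplyr to build tables. I want to run the same summarize command on multiple datasets. I know that in ggplot2 you can swap the dataset and rerun the plot, which is cool.

Here is what I would like to avoid: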

table_1 <- 
group_by(df_1, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)
table_2 <- 
group_by(df_2, boro) %>%
  summarize(n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)

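Basically, is there a way to turn the summarize command into a function or something similar, so that I can feed df_1 and df_2 into it?

If you know all the variable names in advance, and they are the same in every dataset you want to look at, you can do the following: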

myfunc <- function(df) {
  df %>% 
  group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}
myfunc(mtcars)
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429

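Then use it with different datasets (that have the same structure and variable names). If you need more flexibility, i.e. you do not know all the variables in advance and want to be able to pass them to the function as inputs, take a look at the dplyr vignette on non-standard evaluation.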
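
As a rough sketch of "use it with different datasets", the function can be applied to a list of data frames with lapply(). The two datasets below are stand-ins made by splitting mtcars (an assumption for illustration; they are not the asker's df_1 and df_2), and the names `datasets`, `tables` and `combined` are hypothetical.

```r
library(dplyr)

# Same summary function as above: group by cyl, then count and average hp.
myfunc <- function(df) {
  df %>%
    group_by(cyl) %>%
    summarize(n = n(),
              mean_hp = mean(hp))
}

# Two stand-in datasets with identical columns (hypothetical split of mtcars).
datasets <- list(
  automatic = filter(mtcars, am == 0),
  manual    = filter(mtcars, am == 1)
)

# One summary table per dataset, kept in a named list.
tables <- lapply(datasets, myfunc)

# Optionally stack them, recording the source dataset in a "dataset" column.
combined <- bind_rows(tables, .id = "dataset")
print(combined)
```

Keeping the datasets in a named list means that adding a third one only requires one more list entry, not another copy of the summary code.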

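Here is just a small example of how to use "standard evaluation" inside a function to get more flexibility. Suppose you want to let the user of the function specify which column the data should be grouped by. You could do the following: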

myfunc <- function(df, grp) {
      df %>% 
      group_by_(grp) %>%        # notice that I use "group_by_" instead of "group_by"
        summarize(n = n(),
                  mean_hp = mean(hp))
}
And then use it:
myfunc(mtcars, "gear")
#Source: local data frame [3 x 3]
#
#  gear  n  mean_hp
#1    3 15 176.1333
#2    4 12  89.5000
#3    5  5 195.6000
myfunc(mtcars, "cyl")
#Source: local data frame [3 x 3]
#
#  cyl  n   mean_hp
#1   4 11  82.63636
#2   6  7 122.28571
#3   8 14 209.21429
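
In recent dplyr releases the underscore verbs such as group_by_() are deprecated. A minimal sketch of the same idea using across(all_of(...)), assuming dplyr 1.0.0 or later, might look like this. The name `myfunc_grp` is made up for this example so that it does not clash with the function above.

```r
library(dplyr)

# grp is a character string naming the grouping column.
# across(all_of(grp)) selects that column by name inside group_by().
myfunc_grp <- function(df, grp) {
  df %>%
    group_by(across(all_of(grp))) %>%
    summarize(n = n(),
              mean_hp = mean(hp),
              .groups = "drop")
}

by_gear <- myfunc_grp(mtcars, "gear")
by_cyl  <- myfunc_grp(mtcars, "cyl")
print(by_gear)
print(by_cyl)
```

Because all_of() accepts a character vector, the same function should also handle several grouping columns, e.g. myfunc_grp(mtcars, c("cyl", "gear")).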

The %>% operator simply passes the tbl object as the first argument to the next function, and summarize just expects a tbl. So you can define

mysummary <- function(.data) {
  summarize(.data, n_units = n(),
            mean_rent = mean(rent_numeric, na.rm = TRUE),
            sd_rent = sd(rent_numeric, na.rm = TRUE),
            median_rent = median(rent_numeric, na.rm = TRUE),
            mean_bedrooms = mean(bedrooms_numeric, na.rm = TRUE),
            sd_bedrooms = sd(bedrooms_numeric, na.rm = TRUE),
            mean_sqft = mean(sqft, na.rm = TRUE),
            sd_sqft = sd(sqft, na.rm = TRUE),
            n_broker = sum(ob=="broker"),
            pr_broker = n_broker/n_units)
}

and then call

table_1 <- group_by(df_1, boro) %>% mysummary
table_2 <- group_by(df_2, boro) %>% mysummary

With an actual working example:

mysummary <- function(.data) {
  summarize(.data, 
      ave.mpg=mean(mpg),
      ave.hp=mean(hp)
  )
}
mtcars %>% group_by(cyl) %>% mysummary
mtcars %>% group_by(gear) %>% mysummary
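
To make the point about %>% concrete, here is a small check (a sketch, not part of the original answer) that the piped call and the ordinary nested call build the same table. The variable names `piped` and `nested` are only for illustration.

```r
library(dplyr)

mysummary <- function(.data) {
  summarize(.data,
            ave.mpg = mean(mpg),
            ave.hp = mean(hp))
}

# Piped form: the grouped tbl is passed as the first argument of mysummary().
piped <- mtcars %>% group_by(cyl) %>% mysummary()

# Equivalent nested form without the pipe.
nested <- mysummary(group_by(mtcars, cyl))

# Compare the two results.
print(all.equal(piped, nested))
```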
