R - dplyr on a grouped data.frame: using mutate to calculate a variable when the value refers to another row (in the same group)



I'm trying to learn dplyr, but I'm still having trouble.

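Here is a small part of my data frame (the real one has hundreds of species in the "sp" column, not just the two I copied here, and several more rows per species):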

> sp.df <- structure(list(sp = c("Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa"), scenario = c("pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85"), year = c("totalPAT", "totalPAT", "totalPAT", "2010", "2010", "2010", "2015", "2015", "2015", "totalPAT", "totalPAT", "totalPAT", "2010", "2010", "2010", "2015", "2015", "2015"), area = c(27393.5432893358, 26302.7931114686, 23767.0566182264, 1132.11815818819, 1409.95821237362, 1367.22415806142, 1132.11815818819, 1431.32621046934, 1452.69684644667, 276.54858281478, 0, 0, 234.014708239003, 0, 0, 234.014708239003, 0, 0), area.period = c(NA, NA, NA, 0, 0, 0, 0, 21.3679980957127, 85.4726883852542, NA, NA, NA, 0, 0, 0, 0, 0, 0), group = c("anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf")), .Names = c("sp", "scenario", "year", "area", "area.period", "group"), row.names = c(1L, 2L, 3L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 97L, 98L, 99L, 100L, 101L, 102L), class = "data.frame")
> sp.df
                           sp scenario     year       area area.period group
1   Adelophryne adiastola      pre totalPAT 27393.5433          NA   anf
2   Adelophryne adiastola    rcp45 totalPAT 26302.7931          NA   anf
3   Adelophryne adiastola    rcp85 totalPAT 23767.0566          NA   anf
46  Adelophryne adiastola      pre     2010  1132.1182     0.00000   anf
47  Adelophryne adiastola    rcp45     2010  1409.9582     0.00000   anf
48  Adelophryne adiastola    rcp85     2010  1367.2242     0.00000   anf
49  Adelophryne adiastola      pre     2015  1132.1182     0.00000   anf
50  Adelophryne adiastola    rcp45     2015  1431.3262    21.36800   anf
51  Adelophryne adiastola    rcp85     2015  1452.6968    85.47269   anf
52  Adelophryne gutturosa      pre totalPAT   276.5486          NA   anf
53  Adelophryne gutturosa    rcp45 totalPAT     0.0000          NA   anf
54  Adelophryne gutturosa    rcp85 totalPAT     0.0000          NA   anf
97  Adelophryne gutturosa      pre     2010   234.0147     0.00000   anf
98  Adelophryne gutturosa    rcp45     2010     0.0000     0.00000   anf
99  Adelophryne gutturosa    rcp85     2010     0.0000     0.00000   anf
100 Adelophryne gutturosa      pre     2015   234.0147     0.00000   anf
101 Adelophryne gutturosa    rcp45     2015     0.0000     0.00000   anf
102 Adelophryne gutturosa    rcp85     2015     0.0000     0.00000   anf

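What I want to do is, for each species, create a new column that holds, in every row of that species, the result of multiplying 0.2 by the area value of that species' first row (which can be identified by year == "totalPAT" & scenario == "pre"). This is something I would normally do with a for loop, as in the next example, which illustrates the result I want: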

> for (sp in sp.df$sp){
+     sp.df$goal[sp.df$sp == sp] <- 0.2*sp.df$area[sp.df$sp == sp & sp.df$year =="totalPAT" & sp.df$scenario =="pre"]
+ }
> sp.df
                           sp scenario     year       area area.period group       goal
1   Adelophryne adiastola      pre totalPAT 27393.5433          NA   anf 5478.70866
2   Adelophryne adiastola    rcp45 totalPAT 26302.7931          NA   anf 5478.70866
3   Adelophryne adiastola    rcp85 totalPAT 23767.0566          NA   anf 5478.70866
46  Adelophryne adiastola      pre     2010  1132.1182     0.00000   anf 5478.70866
47  Adelophryne adiastola    rcp45     2010  1409.9582     0.00000   anf 5478.70866
48  Adelophryne adiastola    rcp85     2010  1367.2242     0.00000   anf 5478.70866
49  Adelophryne adiastola      pre     2015  1132.1182     0.00000   anf 5478.70866
50  Adelophryne adiastola    rcp45     2015  1431.3262    21.36800   anf 5478.70866
51  Adelophryne adiastola    rcp85     2015  1452.6968    85.47269   anf 5478.70866
52  Adelophryne gutturosa      pre totalPAT   276.5486          NA   anf   55.30972
53  Adelophryne gutturosa    rcp45 totalPAT     0.0000          NA   anf   55.30972
54  Adelophryne gutturosa    rcp85 totalPAT     0.0000          NA   anf   55.30972
97  Adelophryne gutturosa      pre     2010   234.0147     0.00000   anf   55.30972
98  Adelophryne gutturosa    rcp45     2010     0.0000     0.00000   anf   55.30972
99  Adelophryne gutturosa    rcp85     2010     0.0000     0.00000   anf   55.30972
100 Adelophryne gutturosa      pre     2015   234.0147     0.00000   anf   55.30972
101 Adelophryne gutturosa    rcp45     2015     0.0000     0.00000   anf   55.30972
102 Adelophryne gutturosa    rcp85     2015     0.0000     0.00000   anf   55.30972

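But with tables this long, the loop takes a lot of time. I started learning dplyr and found that group_by is really useful for this kind of thing, but I still need to figure out how to do these more complex operations. I was thinking of: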

sp.df %>% 
  group_by(sp) %>% 
  mutate(goal = 0.2*filter(year == "totalPAT"))

But I get:

Error: no applicable method for 'filter_' applied to an object of class "logical"

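Maybe I'm just using weird syntax. I only need the goal in every row of each species, so that later I can compare the values in the area column against this goal. If you can help, thank you very much!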

If you can be sure that your data.frame is ordered correctly, try:

sp.df %>% group_by(sp) %>% mutate(goal = 0.2 * first(area))

This uses dplyr::first to pick the first value in each group.
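Because first() just takes whichever row comes first in each group, the result depends on row order. The sketch below uses a small hypothetical data frame (`toy`, not the asker's data) in which the reference row of species "b" is not first, and shows one way to sort it to the top before calling first(). It assumes the reference row is the one with year == "totalPAT" and scenario == "pre", as in the question.

```r
library(dplyr)

# Hypothetical toy data: for species "b" the totalPAT/pre row is NOT first
toy <- data.frame(
  sp       = c("a", "a", "b", "b"),
  scenario = c("pre", "pre", "pre", "pre"),
  year     = c("totalPAT", "2010", "2010", "totalPAT"),
  area     = c(100, 10, 5, 50),
  stringsAsFactors = FALSE
)

# first() takes whatever row happens to come first in each group,
# so species "b" gets its goal from the 2010 row here
naive <- toy %>%
  group_by(sp) %>%
  mutate(goal = 0.2 * first(area)) %>%
  ungroup()

# Sorting the reference row to the top of each species makes first() safe:
# the logical key is FALSE for the reference row, and FALSE sorts before TRUE
sorted <- toy %>%
  arrange(sp, !(year == "totalPAT" & scenario == "pre")) %>%
  group_by(sp) %>%
  mutate(goal = 0.2 * first(area)) %>%
  ungroup()
```

In `naive`, species "b" ends up with a goal based on area 5; in `sorted` it is based on area 50, the totalPAT/pre row.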

Another option is:

sp.df %>% group_by(sp) %>% mutate(goal = 0.2 * area[year == "totalPAT"][1])

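This picks out the area values where year == "totalPAT" and then takes the first of them within each group. If the "pre" row is not guaranteed to come first among the totalPAT rows, add scenario == "pre" to the condition.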
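The attempt in the question failed because, inside mutate(), filter() was handed the logical vector year == "totalPAT" rather than a data frame. Columns are plain vectors inside mutate(), so they can be subset with `[`. A minimal sketch on a hypothetical data frame (`toy`, not the asker's data), using both conditions so that row order does not matter:

```r
library(dplyr)

# Hypothetical toy data: species "a" has its reference row in second position,
# species "b" has no totalPAT/pre row at all
toy <- data.frame(
  sp       = c("a", "a", "a", "b", "b"),
  scenario = c("rcp45", "pre", "pre", "pre", "rcp45"),
  year     = c("totalPAT", "totalPAT", "2010", "2010", "2010"),
  area     = c(90, 100, 10, 5, 0),
  stringsAsFactors = FALSE
)

with_goal <- toy %>%
  group_by(sp) %>%
  # Subset the area vector of the current group with a logical condition;
  # [1] keeps a single value, which mutate() recycles over the group
  mutate(goal = 0.2 * area[year == "totalPAT" & scenario == "pre"][1]) %>%
  ungroup()
```

The trailing `[1]` also means that a species without a matching row gets NA as its goal instead of raising an error, which may or may not be what you want.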

Edit: some benchmarks:

library(dplyr)

# Left join: compute the goal once per species, then join it back by sp
lj <- function(x) x %>%
  left_join(x %>% filter(year == "totalPAT", scenario == "pre") %>%
              mutate(goal = 0.2 * area) %>%
              select(sp, goal), by = "sp")
# arrange + first(): sort the totalPAT/pre rows to the top (FALSE sorts before TRUE)
fir <- function(x) x %>%
  arrange(!(year == "totalPAT" & scenario == "pre")) %>%
  group_by(sp) %>% mutate(goal = 0.2 * first(area))
# Bracket subsetting, recycled explicitly to the group size with rep()
reptimes <- function(x) x %>%
  group_by(sp) %>%
  mutate(goal = (0.2 * area[year == "totalPAT" & scenario == "pre"]) %>%
           rep(times = n()))
# Bracket subsetting, keeping only the first match
bracket <- function(x) x %>% group_by(sp) %>%
  mutate(goal = 0.2 * area[year == "totalPAT" & scenario == "pre"][1])
microbenchmark::microbenchmark(fir(sp.df), bracket(sp.df),
  lj(sp.df), reptimes(sp.df), times = 1000)
Unit: microseconds
        expr      min       lq     mean   median       uq      max neval
     fir(sp.df) 1480.502 1543.401 1755.633 1575.064 1651.837 20137.41  1000
 bracket(sp.df)  941.361  982.226 1178.849 1002.561 1045.390 19536.25  1000
      lj(sp.df) 1776.140 1856.851 2172.697 1906.072 1995.426 59463.52  1000
reptimes(sp.df) 1120.376 1168.362 1327.877 1191.406 1242.211 17540.17  1000

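The moral seems to be: don't use left_join or this silly arrange if you are going to run this many times. In this benchmark the bracket version was the fastest (median about 1.0 ms), followed by reptimes (about 1.2 ms), fir (about 1.6 ms) and lj (about 1.9 ms).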
