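I'm trying to learn dplyr, but I'm still running into problems.
Here is a small part of my data frame (the real one has hundreds of species ("sp"), not just the two I copied here, and several more rows per species):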
> sp.df <- structure(list(sp = c("Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne adiastola", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa", "Adelophryne gutturosa"), scenario = c("pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85", "pre", "rcp45", "rcp85"), year = c("totalPAT", "totalPAT", "totalPAT", "2010", "2010", "2010", "2015", "2015", "2015", "totalPAT", "totalPAT", "totalPAT", "2010", "2010", "2010", "2015", "2015", "2015"), area = c(27393.5432893358, 26302.7931114686, 23767.0566182264, 1132.11815818819, 1409.95821237362, 1367.22415806142, 1132.11815818819, 1431.32621046934, 1452.69684644667, 276.54858281478, 0, 0, 234.014708239003, 0, 0, 234.014708239003, 0, 0), area.period = c(NA, NA, NA, 0, 0, 0, 0, 21.3679980957127, 85.4726883852542, NA, NA, NA, 0, 0, 0, 0, 0, 0), group = c("anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf", "anf")), .Names = c("sp", "scenario", "year", "area", "area.period", "group"), row.names = c(1L, 2L, 3L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 97L, 98L, 99L, 100L, 101L, 102L), class = "data.frame")
> sp.df
sp scenario year area area.period group
1 Adelophryne adiastola pre totalPAT 27393.5433 NA anf
2 Adelophryne adiastola rcp45 totalPAT 26302.7931 NA anf
3 Adelophryne adiastola rcp85 totalPAT 23767.0566 NA anf
46 Adelophryne adiastola pre 2010 1132.1182 0.00000 anf
47 Adelophryne adiastola rcp45 2010 1409.9582 0.00000 anf
48 Adelophryne adiastola rcp85 2010 1367.2242 0.00000 anf
49 Adelophryne adiastola pre 2015 1132.1182 0.00000 anf
50 Adelophryne adiastola rcp45 2015 1431.3262 21.36800 anf
51 Adelophryne adiastola rcp85 2015 1452.6968 85.47269 anf
52 Adelophryne gutturosa pre totalPAT 276.5486 NA anf
53 Adelophryne gutturosa rcp45 totalPAT 0.0000 NA anf
54 Adelophryne gutturosa rcp85 totalPAT 0.0000 NA anf
97 Adelophryne gutturosa pre 2010 234.0147 0.00000 anf
98 Adelophryne gutturosa rcp45 2010 0.0000 0.00000 anf
99 Adelophryne gutturosa rcp85 2010 0.0000 0.00000 anf
100 Adelophryne gutturosa pre 2015 234.0147 0.00000 anf
101 Adelophryne gutturosa rcp45 2015 0.0000 0.00000 anf
102 Adelophryne gutturosa rcp85 2015 0.0000 0.00000 anf
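What I want to do is, for each species, create a new column that holds the same value in every row of that species: 0.2 times the area value from that species' first row (which can be identified by year == "totalPAT" and scenario == "pre"). This is something I would normally do with a for loop, as in the next example, which illustrates the result I want.
It should look like this: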
> for (sp in sp.df$sp){
+ sp.df$goal[sp.df$sp == sp] <- 0.2*sp.df$area[sp.df$sp == sp & sp.df$year =="totalPAT" & sp.df$scenario =="pre"]
}
> sp.df
sp scenario year area area.period group goal
1 Adelophryne adiastola pre totalPAT 27393.5433 NA anf 5478.70866
2 Adelophryne adiastola rcp45 totalPAT 26302.7931 NA anf 5478.70866
3 Adelophryne adiastola rcp85 totalPAT 23767.0566 NA anf 5478.70866
46 Adelophryne adiastola pre 2010 1132.1182 0.00000 anf 5478.70866
47 Adelophryne adiastola rcp45 2010 1409.9582 0.00000 anf 5478.70866
48 Adelophryne adiastola rcp85 2010 1367.2242 0.00000 anf 5478.70866
49 Adelophryne adiastola pre 2015 1132.1182 0.00000 anf 5478.70866
50 Adelophryne adiastola rcp45 2015 1431.3262 21.36800 anf 5478.70866
51 Adelophryne adiastola rcp85 2015 1452.6968 85.47269 anf 5478.70866
52 Adelophryne gutturosa pre totalPAT 276.5486 NA anf 55.30972
53 Adelophryne gutturosa rcp45 totalPAT 0.0000 NA anf 55.30972
54 Adelophryne gutturosa rcp85 totalPAT 0.0000 NA anf 55.30972
97 Adelophryne gutturosa pre 2010 234.0147 0.00000 anf 55.30972
98 Adelophryne gutturosa rcp45 2010 0.0000 0.00000 anf 55.30972
99 Adelophryne gutturosa rcp85 2010 0.0000 0.00000 anf 55.30972
100 Adelophryne gutturosa pre 2015 234.0147 0.00000 anf 55.30972
101 Adelophryne gutturosa rcp45 2015 0.0000 0.00000 anf 55.30972
102 Adelophryne gutturosa rcp85 2015 0.0000 0.00000 anf 55.30972
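But with tables this long, that takes a lot of time. I've started learning dplyr and found that group_by is really useful for this kind of thing... but I still need to figure out how to do these more complex operations. I was thinking of something like: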
sp.df %>%
group_by(sp) %>%
mutate(goal = 0.2*filter(year == "totalPAT"))
But that gives:
Error: no applicable method for 'filter_' applied to an object of class "logical"
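Maybe I'm just going about this in a strange way... All I need is the goal in every row of each species, so that I can later compare the values in the area column against it. Any help would be much appreciated!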
If you can be sure that your data.frame is sorted so that the reference row comes first within each species, try:
sp.df %>% group_by(sp) %>% mutate(goal = 0.2 * first(area))
This uses dplyr::first to pick the first value within each group.
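Because first() depends only on row order, it helps to see what happens when the reference row is not already on top. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not the asker's data): it sorts so that the (totalPAT, pre) row leads each species, then applies first().

```r
library(dplyr)

# Hypothetical toy data: for species "b" the reference row is NOT first.
toy <- data.frame(
  sp       = c("a", "a", "b", "b"),
  scenario = c("pre", "rcp45", "rcp45", "pre"),
  year     = c("totalPAT", "totalPAT", "totalPAT", "totalPAT"),
  area     = c(100, 90, 40, 50),
  stringsAsFactors = FALSE
)

# FALSE sorts before TRUE, so negating the condition puts the
# (totalPAT, pre) row at the top of each species.
res_first <- toy %>%
  arrange(sp, !(year == "totalPAT" & scenario == "pre")) %>%
  group_by(sp) %>%
  mutate(goal = 0.2 * first(area)) %>%
  ungroup()

res_first
```

Without the arrange() step, species "b" would get 0.2 * 40 instead of 0.2 * 50, which is why this variant is only safe when the ordering is guaranteed.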
Another option is:
sp.df %>% group_by(sp) %>% mutate(goal = 0.2 * area[year == "totalPAT"][1])
This picks out the area values where year == "totalPAT" and then takes the first of those (per group).
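One side effect of the trailing [1] is worth spelling out: indexing an empty vector with [1] yields NA rather than a zero-length result. A small sketch on hypothetical data (`toy` is invented for illustration), where one species has no "totalPAT" row at all:

```r
library(dplyr)

# Hypothetical toy data: species "b" has no totalPAT row.
toy <- data.frame(
  sp   = c("a", "a", "b", "b"),
  year = c("totalPAT", "2010", "2010", "2015"),
  area = c(100, 10, 5, 6),
  stringsAsFactors = FALSE
)

res_bracket <- toy %>%
  group_by(sp) %>%
  # area[year == "totalPAT"] is empty for "b"; [1] turns that into NA,
  # which mutate() can recycle across the whole group.
  mutate(goal = 0.2 * area[year == "totalPAT"][1]) %>%
  ungroup()

res_bracket
```

Species without a matching row therefore end up with an NA goal instead of stopping the pipeline. Dropping the [1] would leave a zero-length value for that group, which, as far as I know, mutate() rejects with a size error.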
Edit: some benchmarks:
lj <- function(x) x %>%
left_join(x %>% filter( year =='totalPAT', scenario =='pre') %>%
mutate(goal = 0.2*area) %>%
select(sp, goal), by='sp')
# Sort key is FALSE only for the (totalPAT, pre) row, so that row comes first
fir <- function(x) x %>% arrange(!(year == "totalPAT" & scenario == "pre")) %>%
group_by(sp) %>% mutate(goal = 0.2 * first(area))
reptimes <- function(x) x %>%
group_by(sp) %>%
mutate(goal = (0.2 * area[year == "totalPAT" & scenario =='pre']) %>%
rep(times=n()))
bracket <- function(x) x %>% group_by(sp) %>%
mutate(goal = 0.2 * area[year == "totalPAT" & scenario == "pre"][1])
microbenchmark::microbenchmark(fir(sp.df), bracket(sp.df),
lj(sp.df), reptimes(sp.df), times = 1000)
Unit: microseconds
expr min lq mean median uq max neval
fir(sp.df) 1480.502 1543.401 1755.633 1575.064 1651.837 20137.41 1000
bracket(sp.df) 941.361 982.226 1178.849 1002.561 1045.390 19536.25 1000
lj(sp.df) 1776.140 1856.851 2172.697 1906.072 1995.426 59463.52 1000
reptimes(sp.df) 1120.376 1168.362 1327.877 1191.406 1242.211 17540.17 1000
The moral seems to be: don't use left_join or this silly arrange if you're going to do this many times.
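A timing comparison only means something if the variants return the same values. Here is a rough sketch of that check for the join and bracket variants, on a hypothetical data frame (`toy`, `by_join`, `by_bracket` and `same_goal` are names invented for this illustration):

```r
library(dplyr)

# Hypothetical toy data with one (totalPAT, pre) row per species.
toy <- data.frame(
  sp       = c("a", "a", "a", "b", "b", "b"),
  scenario = c("pre", "rcp45", "pre", "pre", "rcp45", "pre"),
  year     = c("totalPAT", "totalPAT", "2010", "totalPAT", "totalPAT", "2010"),
  area     = c(100, 90, 10, 50, 0, 5),
  stringsAsFactors = FALSE
)

# Join variant: compute one goal per species, then join it back on.
by_join <- toy %>%
  left_join(
    toy %>%
      filter(year == "totalPAT", scenario == "pre") %>%
      mutate(goal = 0.2 * area) %>%
      select(sp, goal),
    by = "sp"
  )

# Bracket variant: subset inside a grouped mutate().
by_bracket <- toy %>%
  group_by(sp) %>%
  mutate(goal = 0.2 * area[year == "totalPAT" & scenario == "pre"][1]) %>%
  ungroup()

# TRUE when both approaches produce the same goal column.
same_goal <- isTRUE(all.equal(by_join$goal, by_bracket$goal))
same_goal
```

The two can diverge if a species has more than one (totalPAT, pre) row: the join would then duplicate rows, while the bracket version silently keeps the first match.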