R - dplyr: create a column that computes a percentage from two columns



I have a CSV file that looks like this:

Year, Answer, Total
2017, Yes, 100
2017, No, 10
2017, Yes, 100
2018, No, 40
2018, Yes, 200

I am trying to create a column that computes the ratio of "No" to "Yes" within a given year. So it would look like this:

Year, Answer, Total, Ratio
2017, Yes, 100, 1
2017, No, 10, 0.05
2017, Yes, 100, 1
2018, No, 40, 0.2 
2018, Yes, 200, 1

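I am using R and dplyr. I think I would have to create a column holding the total of "Yes" for the given year (the value would be repeated across rows), then use an ifelse statement to create another column in which "Yes" rows are 1 and "No" rows are the total of "No" divided by the total of "Yes". Is there a more efficient way to do this? Thanks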
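The two-step approach described in the question can be sketched as follows. This is only one possible reading of it; the names `survey`, `YesTotal`, and `NoTotal` are made up for illustration.

```r
library(dplyr)

# Sample data copied from the question
survey <- data.frame(
  Year   = c(2017, 2017, 2017, 2018, 2018),
  Answer = c("Yes", "No", "Yes", "No", "Yes"),
  Total  = c(100, 10, 100, 40, 200),
  stringsAsFactors = FALSE
)

result <- survey %>%
  group_by(Year) %>%
  mutate(
    # Per-year totals, repeated on every row of that year
    YesTotal = sum(Total[Answer == "Yes"]),
    NoTotal  = sum(Total[Answer == "No"]),
    # "Yes" rows get 1, "No" rows get total No / total Yes
    Ratio = if_else(Answer == "No", NoTotal / YesTotal, 1)
  ) %>%
  ungroup()

result$Ratio
```

With the sample data this keeps all five original rows and should give the ratios 1, 0.05, 1, 0.2, 1 shown in the desired output.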

How about this?

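library(dplyr)

# sample data from the question
xdf <- data.frame(
  stringsAsFactors = FALSE,
  Year = c(2017, 2017, 2017, 2018, 2018),
  Answer = c("Yes", "No", "Yes", "No", "Yes"),
  Total = c(100, 10, 100, 40, 200)
)

xdf %>%
  # one row per Year/Answer pair, with the totals summed
  group_by(Year, Answer) %>%
  summarise(Total = sum(Total)) %>%
  # within each Year, "No" sorts before "Yes", so lead(Total) is the "Yes" total
  mutate(share = if_else(Answer == "No", Total / lead(Total), 1))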
#> # A tibble: 4 x 4
#> # Groups:   Year [2]
#>    Year Answer Total share
#>   <dbl> <chr>  <dbl> <dbl>
#> 1  2017 No        10  0.05
#> 2  2017 Yes      200  1   
#> 3  2018 No        40  0.2 
#> 4  2018 Yes      200  1
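The `lead(Total)` call above depends on the "No" row coming directly before the "Yes" row within each year. A minimal variant that looks up the "Yes" total by value instead, assuming dplyr 1.0.0 or later for the `.groups` argument (the name `shares` is made up for illustration):

```r
library(dplyr)

# Sample data copied from the answer above
xdf <- data.frame(
  stringsAsFactors = FALSE,
  Year = c(2017, 2017, 2017, 2018, 2018),
  Answer = c("Yes", "No", "Yes", "No", "Yes"),
  Total = c(100, 10, 100, 40, 200)
)

shares <- xdf %>%
  group_by(Year, Answer) %>%
  # keep the result grouped by Year only
  summarise(Total = sum(Total), .groups = "drop_last") %>%
  # divide by the year's "Yes" total, wherever that row sits in the group
  mutate(share = if_else(Answer == "No", Total / sum(Total[Answer == "Yes"]), 1)) %>%
  ungroup()

shares
```

A year with no "Yes" rows would have a "Yes" total of 0, so its "No" share would come out as `Inf` here rather than `NA`.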

Here is an approach using a custom function

# function calculating the ratio for row k
# (reads the data frame `df` from the calling environment)
f1 <- function(k) {
  # rows in the same year as row k that answered "yes"
  ind.yes <- intersect(which(df$year == df$year[k]),
                       which(df$answer == "yes"))
  # rows in the same year as row k that answered "no"
  ind.no <- intersect(which(df$year == df$year[k]),
                      which(df$answer == "no"))
  total.yes <- sum(df$total[ind.yes])
  total.no <- sum(df$total[ind.no])
  ratio.no.yes <- total.no / total.yes
  return(ratio.no.yes)
}
# vapplying function f1 over every row
ratios <- vapply(seq_len(nrow(df)), f1, numeric(1))
# binding the data
df$ratios <- ratios

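Here is the result (using a dummy data frame). Note that every row of a year, including the "yes" rows, receives that year's no/yes ratio.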

df <- data.frame(
  year = sample(2015:2018, 10, replace = TRUE),
  answer = sample(c("yes", "no"), 10, replace = TRUE),
  total = sample(10:200, 10, replace = TRUE),
  stringsAsFactors = FALSE)
ratios <- vapply(seq_len(nrow(df)), f1, numeric(1))
df$ratios <- ratios
# printing
> df
year answer total     ratios
1  2015    yes    76 0.08294931
2  2017    yes    43 2.55263158
3  2018    yes    63 0.00000000
4  2016    yes    61 0.83606557
5  2015     no    18 0.08294931
6  2017     no   142 2.55263158
7  2017    yes    33 2.55263158
8  2015    yes   141 0.08294931
9  2016     no    51 0.83606557
10 2017     no    52 2.55263158
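The same per-year ratio can probably be computed without a row-by-row function. Below is a minimal base R sketch using `ave()`; the small fixed data frame and the names `yes_total` and `no_total` are made up for illustration and are not the random data printed above.

```r
# Small fixed data frame (illustrative values only)
df <- data.frame(
  year   = c(2015, 2015, 2016, 2016, 2017),
  answer = c("yes", "no", "yes", "no", "yes"),
  total  = c(100, 25, 50, 10, 80),
  stringsAsFactors = FALSE
)

# Per-year sums of the "yes" and "no" totals, repeated on every row of the year.
# Multiplying by the logical comparison zeroes out the rows of the other answer.
yes_total <- ave(df$total * (df$answer == "yes"), df$year, FUN = sum)
no_total  <- ave(df$total * (df$answer == "no"),  df$year, FUN = sum)

df$ratios <- no_total / yes_total
df
```

As with `f1`, every row of a year gets the same value, and a year without any "no" rows (2017 in this sketch) gets a ratio of 0.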

I don't think efficiency matters here. You can turn it into a one-liner, though it is harder to read:

DF %>%
  group_by(Year) %>%
  mutate(v = (Total / sum(Total[Answer == "Yes"]))^(Answer == "No"))

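x^cond relies on x^FALSE = x^0 = 1 to assign the desired value 1 whenever Answer != "No"; when Answer == "No", x^TRUE = x^1 = x leaves the ratio unchanged.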
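To see the exponent trick in isolation, here is a small sketch. The vectors `ratio` and `is_no` are made up for illustration; the values correspond to Total divided by the year's "Yes" total for the five rows in the question.

```r
# Total / (year's "Yes" total) for each of the five rows in the question
ratio <- c(0.5, 0.05, 0.5, 0.2, 1)
# TRUE where Answer == "No"
is_no <- c(FALSE, TRUE, FALSE, TRUE, FALSE)

# The logical exponent is coerced to 0 or 1:
# x^FALSE is x^0 = 1, and x^TRUE is x^1 = x
picked <- ratio^is_no
picked
```

One thing to keep in mind with this trick: in R, `NA^0` also evaluates to 1, so a missing ratio on a "Yes" row would silently become 1.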
