Aggregating binary columns into one column in R takes a very long time



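I am running a piece of code, discussed in an earlier question, on a large data set of 1.5 million rows, and it takes several hours to finish. My data looks like this: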

ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No

I want to add one column listing all the cities an ID has visited, and another column with the number of cities that ID has visited, like this:

ID    London   Paris   Rome    All Cities      Count of Cities travelled
1       Yes     No      Yes    London, Rome                2
2       No      No      Yes     Rome                       1
3       No      Yes     Yes    Paris, Rome                 2
4       No      Yes     No     Paris                       1

This is the code I am running. It works fine on a sample of 100 rows:

cities <- c('London', 'Paris', 'Rome')
df %>%
  rowwise() %>%
  mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
         `Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))

Is there any way to improve this code, or to shorten the run time?

Thanks!

Here is a tidyverse approach that does not use rowwise(), which is very slow.

library(tidyverse)
cities <- c('London', 'Paris', 'Rome')
df <- read.table(header = TRUE, text = "ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No")
df %>%
  # for each city column, keep the city name where the value is "Yes", NA otherwise
  mutate(across(all_of(cities), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>%
  # paste the helper columns together, skipping the NAs
  unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = TRUE) %>%
  # number of commas + 1 (note: a row with no "Yes" at all would also get 1)
  mutate(`Count of Cities travelled` = str_count(`All Cities`, ",") + 1)
#>   ID London Paris Rome   All Cities Count of Cities travelled
#> 1  1    Yes    No  Yes London, Rome                         2
#> 2  2     No    No  Yes         Rome                         1
#> 3  3     No   Yes  Yes  Paris, Rome                         2
#> 4  4     No   Yes   No        Paris                         1

A possible solution in base R:

# x is one row as a character vector; x[-1] drops the ID
df$Cities <- apply(df, 1, function(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, function(x) sum(x[-1] == "Yes"))
df
#>   ID London Paris Rome       Cities N
#> 1  1    Yes    No  Yes London, Rome 2
#> 2  2     No    No  Yes         Rome 1
#> 3  3     No   Yes  Yes  Paris, Rome 2
#> 4  4     No   Yes   No        Paris 1
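
Note that apply() still calls the function once per row, so on 1.5 million rows it may remain slow. Below is a minimal sketch of a vectorised variant, assuming the same sample data as in the question. It compares the whole block of city columns at once, gets the count with rowSums(), and builds the city string by looping over the three columns rather than over the rows. The names visited and labels are only illustrative.

```r
cities <- c("London", "Paris", "Rome")
df <- read.table(header = TRUE, text = "ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No")

# logical matrix: TRUE where the ID visited the city
visited <- df[cities] == "Yes"

# count of cities per row, computed in one vectorised call
df$N <- rowSums(visited)

# build the label one column at a time (3 iterations, not 1.5 million)
labels <- character(nrow(df))
for (city in cities) {
  hit <- visited[, city]
  # start the label with the city, or append it after ", " if a label already exists
  labels[hit] <- ifelse(labels[hit] == "", city,
                        paste(labels[hit], city, sep = ", "))
}
df$Cities <- labels
df
```

A row with no "Yes" at all ends up with an empty string in Cities and 0 in N.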

Using dplyr with rowwise:

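library(dplyr)
library(stringr)  # for str_c()
# assumes df still has only the original ID column plus the three city columns
df %>%
  rowwise() %>%
  mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
         N = sum(c_across(2:4) == "Yes")) %>%
  ungroup()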
#> # A tibble: 4 × 6
#>      ID London Paris Rome  Cities           N
#>   <int> <chr>  <chr> <chr> <chr>        <int>
#> 1     1 Yes    No    Yes   London, Rome     2
#> 2     2 No     No    Yes   Rome             1
#> 3     3 No     Yes   Yes   Paris, Rome      2
#> 4     4 No     Yes   No    Paris            1
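
To check which variant is faster on your own machine before running on the full 1.5 million rows, something like the following could be used. It is only a sketch: the data are randomly generated, and the size n and the names big, n_apply and n_vec are made up for illustration. No timings are claimed here, since they depend on the machine.

```r
set.seed(1)
n <- 1e5  # hypothetical sample size, increase as needed
cities <- c("London", "Paris", "Rome")

# synthetic data with the same shape as the question's data
big <- data.frame(ID = seq_len(n))
for (city in cities) {
  big[[city]] <- sample(c("Yes", "No"), n, replace = TRUE)
}

# row-by-row count with apply()
system.time(n_apply <- apply(big[cities], 1, function(x) sum(x == "Yes")))

# vectorised count with rowSums()
system.time(n_vec <- rowSums(big[cities] == "Yes"))

# both approaches should agree
stopifnot(all(n_apply == n_vec))
```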
