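I'm running a piece of code that was discussed earlier. On a large dataset of 1.5 million rows, it took several hours to finish. My data looks like this: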
ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No
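I want to add one column listing all the cities an ID has travelled to, and another column with the number of cities that ID has travelled to, like this: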
ID London Paris Rome All Cities Count of Cities travelled
1 Yes No Yes London, Rome 2
2 No No Yes Rome 1
3 No Yes Yes Paris, Rome 2
4 No Yes No Paris 1
This is the code I'm running. It works fine on a sample of 100 rows:
cities <- c('London', 'Paris', 'Rome')
df %>%
rowwise %>%
mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
`Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))
Is there any way to improve this code or shorten its run time?
Thanks!
Here is a tidyverse approach that avoids rowwise(), which is very slow:
library(tidyverse)
cities <- c('London', 'Paris', 'Rome')
df <- read.table(header = T, text = "ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No")
df %>%
mutate(across(all_of(cities), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>%
unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = T) %>%
mutate(`Count of Cities travelled` = rowSums(across(all_of(cities)) == "Yes"))  # counts the flags directly, so a row with no cities gives 0
ID London Paris Rome All Cities Count of Cities travelled
1 1 Yes No Yes London, Rome 2
2 2 No No Yes Rome 1
3 3 No Yes Yes Paris, Rome 2
4 4 No Yes No Paris 1
A possible solution in base R (the `\(x)` lambda shorthand needs R 4.1 or later):
df$Cities <- apply(df, 1, \(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, \(x) sum(x[-1] == "Yes"))
df
#> ID London Paris Rome Cities N
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1
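Since apply() still calls an R function once per row, it may not help much at 1.5 million rows. Below is a minimal sketch of a column-wise alternative, not benchmarked on data of that size. It loops over the few city columns and leaves the per-row work to vectorized functions. The names `visited` and `out` are only illustrative, and the sample data frame is rebuilt inline so the snippet runs on its own.

```r
# Sample data, same as in the question
df <- data.frame(
  ID     = 1:4,
  London = c("Yes", "No", "No", "No"),
  Paris  = c("No", "No", "Yes", "Yes"),
  Rome   = c("Yes", "Yes", "Yes", "No")
)
cities <- c("London", "Paris", "Rome")

# Logical matrix: TRUE where the ID has visited the city
visited <- df[cities] == "Yes"

# Count per row in a single vectorized call
df$N <- rowSums(visited)

# Build the comma-separated list one column at a time;
# each step operates on all rows at once
out <- character(nrow(df))
for (city in cities) {
  hit <- visited[, city]
  sep <- ifelse(hit & nzchar(out), ", ", "")  # separator only if something precedes
  out <- paste0(out, sep, ifelse(hit, city, ""))
}
df$Cities <- out
df
```

The loop runs once per city, not once per row, so the cost grows with the number of city columns while the row dimension is handled by paste0() and ifelse(). Rows with no "Yes" end up with an empty string and a count of 0.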
Using dplyr and rowwise:
library(dplyr)
library(stringr)  # provides str_c()
df %>%
rowwise %>%
mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
N = sum(c_across(2:4) == "Yes")) %>%
ungroup
#> # A tibble: 4 × 6
#> ID London Paris Rome Cities N
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1