假设我有以下data.frame
:
dt = tibble::tibble(
id_0 = rep(123, 6),
name_0 = rep("A", 6),
id_1 = c(rep(321, 3), rep(322, 3)),
name_1 = c(rep("B", 3), rep("C", 3)),
p_1 = c(rep(0.7, 3), rep(0.3, 3)),
id_2 = c(NA, 323:326, NA),
name_2 = c(NA, "D", "E", "J", "G", NA),
p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
id_3 = c(NA, NA, 323, NA, NA, NA),
na_3 = c(NA, NA, "H", NA, NA, NA),
p_3 = c(NA, NA, 1, NA, NA, NA),
)
它看起来像这样:
# A tibble: 6 x 11
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr>
1 123 A 321 B 0.7 NA NA NA NA NA
2 123 A 321 B 0.7 323 D 0.8 NA NA
3 123 A 321 B 0.7 324 E 0.2 323 H
4 123 A 322 C 0.3 325 J 0.9 NA NA
5 123 A 322 C 0.3 326 G 0.1 NA NA
6 123 A 322 C 0.3 NA NA NA NA NA
我需要取所有p_*
列的逐行乘积。在这种情况下,它是Product = p_1 * p_2 * p_3
,但通常它可以是p_1到p_*的任何乘积(这个data.frame
随情况而变化,我指的是Product = product(p_1, p_2, ..., p_n)
)。注意p_*
总是大于0且小于等于1 (p_>0,P_ <= 1)因此,我需要完成的任务有两件事:Product
1)必须省略NAs
, 2)对于data.frame
中存在的任何数量的p_*
都是通用的。
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 J 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
一行(也许可以改进为更多元的形式):
> dt$Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T)
> dt
# A tibble: 6 x 12
id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
<dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 123 A 321 B 0.7 NA NA NA NA NA NA 0.7
2 123 A 321 B 0.7 323 D 0.8 NA NA NA 0.56
3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.14
4 123 A 322 C 0.3 325 F 0.9 NA NA NA 0.27
5 123 A 322 C 0.3 326 G 0.1 NA NA NA 0.03
6 123 A 322 C 0.3 NA NA NA NA NA NA 0.3
可以这样写:使用magrittr
%<>%
:
dt %<>% mutate(Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T))
我建议进行重塑操作:
library(dplyr)
library(tidyr) # pivot_longer
# preserve a row-wise "id"
dt <- mutate(dt, rn = row_number())
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
# # A tibble: 24 x 6
# rn num id name p na
# <int> <chr> <dbl> <chr> <dbl> <chr>
# 1 1 0 123 A NA <NA>
# 2 1 1 321 B 0.7 <NA>
# 3 1 2 NA <NA> NA <NA>
# 4 1 3 NA <NA> NA <NA>
# 5 2 0 123 A NA <NA>
# 6 2 1 321 B 0.7 <NA>
# 7 2 2 323 D 0.8 <NA>
# 8 2 3 NA <NA> NA <NA>
# 9 3 0 123 A NA <NA>
# 10 3 1 321 B 0.7 <NA>
# # ... with 14 more rows
有了这个,我们可以很容易地group_by
,计算乘积,…
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE))
# # A tibble: 6 x 2
# rn Product
# <int> <dbl>
# 1 1 0.7
# 2 2 0.560
# 3 3 0.140
# 4 4 0.27
# 5 5 0.03
# 6 6 0.3
…然后加入到dt
中。
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE)) %>%
left_join(dt, ., by = "rn") %>%
select(-rn)
# # A tibble: 6 x 12
# id_0 name_0 id_1 name_1 p_1 id_2 name_2 p_2 id_3 na_3 p_3 Product
# <dbl> <chr> <dbl> <chr> <dbl> <int> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
# 1 123 A 321 B 0.7 NA <NA> NA NA <NA> NA 0.7
# 2 123 A 321 B 0.7 323 D 0.8 NA <NA> NA 0.560
# 3 123 A 321 B 0.7 324 E 0.2 323 H 1 0.140
# 4 123 A 322 C 0.3 325 F 0.9 NA <NA> NA 0.27
# 5 123 A 322 C 0.3 326 G 0.1 NA <NA> NA 0.03
# 6 123 A 322 C 0.3 NA <NA> NA NA <NA> NA 0.3
(旁注:根据您对"任意数量的p_*
">的评论,将您的数据保持在较长的格式(不在pivot_longer
)以进行更多处理可能是有意义的。)