r语言 - 如何逐行计算匹配模式并满足某些条件的特定列的列积?



假设我有以下data.frame:

dt = tibble::tibble(
id_0 = rep(123, 6),
name_0 = rep("A", 6),
id_1 = c(rep(321, 3), rep(322, 3)),
name_1 = c(rep("B", 3), rep("C", 3)),
p_1 = c(rep(0.7, 3), rep(0.3, 3)),
id_2 = c(NA, 323:326, NA),
name_2 = c(NA, "D", "E", "J", "G", NA),
p_2 = c(NA, 0.8, 0.2, 0.9, 0.1, NA),
id_3 = c(NA, NA, 323, NA, NA, NA),
na_3 = c(NA, NA, "H", NA, NA, NA),
p_3 = c(NA, NA, 1, NA, NA, NA),
)

它看起来像这样:

# A tibble: 6 x 11
id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3 
<dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr>
1   123 A        321 B        0.7    NA NA      NA      NA NA   
2   123 A        321 B        0.7   323 D        0.8    NA NA   
3   123 A        321 B        0.7   324 E        0.2   323 H    
4   123 A        322 C        0.3   325 J        0.9    NA NA   
5   123 A        322 C        0.3   326 G        0.1    NA NA   
6   123 A        322 C        0.3    NA NA      NA      NA NA 

我需要取所有p_*列的逐行乘积。在这种情况下,它是Product = p_1 * p_2 * p_3,但通常它可以是p_1到p_*的任何乘积(这个data.frame随情况而变化,我指的是Product = product(p_1, p_2, ..., p_n))。注意p_*总是大于0且小于等于1 (p_>0,P_ <= 1)因此,我需要完成的任务有两件事:Product1)必须省略NAs, 2)对于data.frame中存在的任何数量的p_*都是通用的。

理想的输出应该是这样的:
# A tibble: 6 x 12
id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3    p_3 Product
<dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr> <dbl>   <dbl>
1   123 A        321 B        0.7    NA NA      NA      NA NA       NA    0.7 
2   123 A        321 B        0.7   323 D        0.8    NA NA       NA    0.56
3   123 A        321 B        0.7   324 E        0.2   323 H         1    0.14
4   123 A        322 C        0.3   325 J        0.9    NA NA       NA    0.27
5   123 A        322 C        0.3   326 G        0.1    NA NA       NA    0.03
6   123 A        322 C        0.3    NA NA      NA      NA NA       NA    0.3 

一行(也许可以改进为更多元的形式):

> dt$Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T)
> dt
# A tibble: 6 x 12
id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3    p_3  Product
<dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr> <dbl> <dbl>
1   123 A        321 B        0.7    NA NA      NA      NA NA       NA  0.7 
2   123 A        321 B        0.7   323 D        0.8    NA NA       NA  0.56
3   123 A        321 B        0.7   324 E        0.2   323 H         1  0.14
4   123 A        322 C        0.3   325 F        0.9    NA NA       NA  0.27
5   123 A        322 C        0.3   326 G        0.1    NA NA       NA  0.03
6   123 A        322 C        0.3    NA NA      NA      NA NA       NA  0.3

可以这样写:使用magrittr%<>%:

dt %<>% mutate(Product = apply(dt %>% select(starts_with('p_')), 1, prod, na.rm = T))

我建议进行重塑操作:

library(dplyr)
library(tidyr) # pivot_longer
# preserve a row-wise "id"
dt <- mutate(dt, rn = row_number())
dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num"))
# # A tibble: 24 x 6
#       rn num      id name      p na   
#    <int> <chr> <dbl> <chr> <dbl> <chr>
#  1     1 0       123 A      NA   <NA> 
#  2     1 1       321 B       0.7 <NA> 
#  3     1 2        NA <NA>   NA   <NA> 
#  4     1 3        NA <NA>   NA   <NA> 
#  5     2 0       123 A      NA   <NA> 
#  6     2 1       321 B       0.7 <NA> 
#  7     2 2       323 D       0.8 <NA> 
#  8     2 3        NA <NA>   NA   <NA> 
#  9     3 0       123 A      NA   <NA> 
# 10     3 1       321 B       0.7 <NA> 
# # ... with 14 more rows

有了这个,我们可以很容易地group_by,计算乘积,…

dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE))
# # A tibble: 6 x 2
#      rn Product
#   <int>   <dbl>
# 1     1   0.7  
# 2     2   0.560
# 3     3   0.140
# 4     4   0.27 
# 5     5   0.03 
# 6     6   0.3  

…然后加入到dt中。

dt %>%
pivot_longer(-rn, names_pattern = c("(.*)_([0-9])"), names_to = c(".value", "num")) %>%
group_by(rn) %>%
summarize(Product = prod(p, na.rm = TRUE)) %>%
left_join(dt, ., by = "rn") %>%
select(-rn)
# # A tibble: 6 x 12
#    id_0 name_0  id_1 name_1   p_1  id_2 name_2   p_2  id_3 na_3    p_3 Product
#   <dbl> <chr>  <dbl> <chr>  <dbl> <int> <chr>  <dbl> <dbl> <chr> <dbl>   <dbl>
# 1   123 A        321 B        0.7    NA <NA>    NA      NA <NA>     NA   0.7  
# 2   123 A        321 B        0.7   323 D        0.8    NA <NA>     NA   0.560
# 3   123 A        321 B        0.7   324 E        0.2   323 H         1   0.140
# 4   123 A        322 C        0.3   325 F        0.9    NA <NA>     NA   0.27 
# 5   123 A        322 C        0.3   326 G        0.1    NA <NA>     NA   0.03 
# 6   123 A        322 C        0.3    NA <NA>    NA      NA <NA>     NA   0.3  

(旁注:根据您对"任意数量的p_*">的评论,将您的数据保持在较长的格式(不在pivot_longer)以进行更多处理可能是有意义的。)

相关内容

  • 没有找到相关文章

最新更新