Combine redundant row items in R



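I have a dataset with the names of many different plant species (column MTmatch), some of which are repeated. Each of these has a column (ReadSum) with a sum associated with it (as well as a lot of other information). How can I combine/aggregate all the redundant plant species and add up the associated ReadSum for each species, while leaving the non-redundant rows alone?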

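I would like to take a dataset like the one below and transform it so that each sample has a set of combined rows, or at least an additional column showing the sum of the ReadSum column for the combined redundant species. Sorry if this is confusing; I'm not sure how to ask this question.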

I have been using dplyr with group_by() and summarise(), but that seems to summarise the entire column rather than just the new groups.

structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028", 
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135", 
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432, 
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389, 
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677, 
0, 0, 0.01513888, 0, 0.00115466)), row.names = c(NA, -9L), class = "data.frame")
> dput(samp5[1:9])
structure(list(ESVID = c("ESV_000090", "ESV_000682", "ESV_000028", 
"ESV_000030", "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135", 
"ESV_000383"), S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432, 
0, 0), S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389, 
0.001257217, 0), S026406.R1 = c(0.360895503, 0, 0, 0.00814677, 
0, 0, 0.01513888, 0, 0.00115466), S026409.R1 = c(0.221175955, 
0, 0, 0, 0, 0, 0.005146173, 0, 0), S026412.R1 = c(0.026058888, 
0, 0, 0, 0, 0, 0, 0, 0), MAX = c(0.400577608, 0.009933177, 0.124412855, 
0.00814677, 0.009824944, 0.086475106, 0.154850408, 0.015593835, 
0.008340888), ReadSum = c(3.54892343, 0.012059346, 0.203303936, 
0.021075546, 0.009824944, 0.128007863, 0.859687787, 0.068159534, 
0.050266853), SPECIES = c("Abies ", "Abies ", "Acer", "Alnus", 
"Berberis", "Betula ", "Boykinia", "Boykinia", "Boykinia")), row.names = c(NA, 
-9L), class = "data.frame")

Does either of these approaches produce your desired result?

Data:

df <- structure(list(
  ESVID = c("ESV_000090", "ESV_000682", "ESV_000028", "ESV_000030",
            "ESV_000010", "ESV_000182", "ESV_000040", "ESV_000135",
            "ESV_000383"),
  S026401.R1 = c(0.222447727, 0, 0, 0, 0, 0, 0.029074432, 0, 0),
  S026404.R1 = c(0.022583349, 0, 0, 0, 0, 0, 0.016390389, 0.001257217, 0),
  S026406.R1 = c(0.360895503, 0, 0, 0.00814677, 0, 0, 0.01513888, 0,
                 0.00115466),
  S026409.R1 = c(0.221175955, 0, 0, 0, 0, 0, 0.005146173, 0, 0),
  S026412.R1 = c(0.026058888, 0, 0, 0, 0, 0, 0, 0, 0),
  MAX = c(0.400577608, 0.009933177, 0.124412855, 0.00814677, 0.009824944,
          0.086475106, 0.154850408, 0.015593835, 0.008340888),
  ReadSum = c(3.54892343, 0.012059346, 0.203303936, 0.021075546,
              0.009824944, 0.128007863, 0.859687787, 0.068159534,
              0.050266853),
  SPECIES = c("Abies ", "Abies ", "Acer", "Alnus", "Berberis", "Betula ",
              "Boykinia", "Boykinia", "Boykinia")),
  row.names = c(NA, -9L), class = "data.frame")

Create a new column "combined_ReadSum" (column 2), which is the sum of "ReadSum" for each "SPECIES":

library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(combined_ReadSum = sum(ReadSum)) %>%
  left_join(df, by = "SPECIES")
#> # A tibble: 9 × 10
#>   SPECIES  combi…¹ ESVID S0264…² S0264…³ S0264…⁴ S0264…⁵ S0264…⁶     MAX ReadSum
#>   <chr>      <dbl> <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 "Abies " 3.56    ESV_…  0.222  0.0226  0.361   0.221    0.0261 0.401   3.55   
#> 2 "Abies " 3.56    ESV_…  0      0       0       0        0      0.00993 0.0121 
#> 3 "Acer"   0.203   ESV_…  0      0       0       0        0      0.124   0.203  
#> 4 "Alnus"  0.0211  ESV_…  0      0       0.00815 0        0      0.00815 0.0211 
#> 5 "Berber… 0.00982 ESV_…  0      0       0       0        0      0.00982 0.00982
#> 6 "Betula… 0.128   ESV_…  0      0       0       0        0      0.0865  0.128  
#> 7 "Boykin… 0.978   ESV_…  0.0291 0.0164  0.0151  0.00515  0      0.155   0.860  
#> 8 "Boykin… 0.978   ESV_…  0      0.00126 0       0        0      0.0156  0.0682 
#> 9 "Boykin… 0.978   ESV_…  0      0       0.00115 0        0      0.00834 0.0503 
#> # … with abbreviated variable names ¹​combined_ReadSum, ²​S026401.R1,
#> #   ³​S026404.R1, ⁴​S026406.R1, ⁵​S026409.R1, ⁶​S026412.R1
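
If the goal is only to add the per-species total as an extra column, a grouped mutate() may be a shorter route than summarise() followed by left_join(). The following is a minimal sketch on a cut-down copy of the data (only ESVID, ReadSum and SPECIES are kept); the names `small` and `with_total` are made up for illustration and do not appear in the question.

```r
library(dplyr)

# Cut-down copy of the example data (illustrative subset of rows and columns)
small <- data.frame(
  ESVID = c("ESV_000090", "ESV_000682", "ESV_000028",
            "ESV_000040", "ESV_000135", "ESV_000383"),
  ReadSum = c(3.54892343, 0.012059346, 0.203303936,
              0.859687787, 0.068159534, 0.050266853),
  SPECIES = c("Abies ", "Abies ", "Acer",
              "Boykinia", "Boykinia", "Boykinia")
)

with_total <- small %>%
  group_by(SPECIES) %>%
  # sum() is evaluated per group; the total is repeated on every row of the group
  mutate(combined_ReadSum = sum(ReadSum)) %>%
  ungroup()
```

Unlike the join, this keeps the original row order and column order, with combined_ReadSum appended as the last column.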

Alternatively, summarise the columns by summing the values for each unique species:

library(dplyr)
df %>%
  group_by(SPECIES) %>%
  summarise(across(where(is.numeric), sum))
#> # A tibble: 6 × 8
#>   SPECIES    S026401.R1 S026404.R1 S026406.R1 S026409.R1 S0264…¹     MAX ReadSum
#>   <chr>           <dbl>      <dbl>      <dbl>      <dbl>   <dbl>   <dbl>   <dbl>
#> 1 "Abies "       0.222      0.0226    0.361      0.221    0.0261 0.411   3.56   
#> 2 "Acer"         0          0         0          0        0      0.124   0.203  
#> 3 "Alnus"        0          0         0.00815    0        0      0.00815 0.0211 
#> 4 "Berberis"     0          0         0          0        0      0.00982 0.00982
#> 5 "Betula "      0          0         0          0        0      0.0865  0.128  
#> 6 "Boykinia"     0.0291     0.0176    0.0163     0.00515  0      0.179   0.978  
#> # … with abbreviated variable name ¹​S026412.R1
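
Two details of this second approach may be worth checking against your data. It sums every numeric column, including MAX, and some species names carry trailing whitespace ("Abies ", "Betula "). If MAX is meant to hold a per-row maximum (an assumption, since the question does not say), the group maximum may be more meaningful than a sum; and trimming the names guards against "Abies " and "Abies" being treated as separate groups. A sketch on a cut-down copy of the data, where `small` and `per_species` are illustrative names:

```r
library(dplyr)

# Cut-down copy of the example data (illustrative subset of rows and columns)
small <- data.frame(
  S026404.R1 = c(0.022583349, 0, 0.016390389, 0.001257217, 0),
  MAX = c(0.400577608, 0.009933177, 0.154850408, 0.015593835, 0.008340888),
  ReadSum = c(3.54892343, 0.012059346, 0.859687787, 0.068159534, 0.050266853),
  SPECIES = c("Abies ", "Abies ", "Boykinia", "Boykinia", "Boykinia")
)

per_species <- small %>%
  mutate(SPECIES = trimws(SPECIES)) %>%       # drop leading/trailing whitespace
  group_by(SPECIES) %>%
  summarise(
    across(where(is.numeric) & !MAX, sum),    # sum every numeric column except MAX
    MAX = max(MAX)                            # assumption: MAX should stay a maximum
  )
```

If summing MAX is in fact what you want, the original across(where(is.numeric), sum) call is the simpler choice.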

Created on 2022-10-28 by the reprex package (v2.0.1)
