R: combining the values in one column into non-overlapping pairs, grouped by a common variable



Data updated

I have a sample dataset:

Target Start sequence
A      y1    ccc
A      y2    cct
...
B      y1    aaa
B      y4    aat

You can do this by applying combn() to each group.

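library(dplyr)
library(tidyr)

df %>%
  group_by(Target) %>%
  # Every pair of Start values in the group, each stored as a named list element.
  # On dplyr >= 1.1.0, summarise() warns when it returns several rows per group;
  # reframe() is the replacement verb there.
  summarise(Start = combn(Start, 2, function(x)
                      list(setNames(x, c('start', 'end')))),
            # The matching pair of sequences, joined with ", " by toString().
            Sequence = combn(sequence, 2, toString), .groups = 'drop') %>%
  # Spread each named pair into separate start and end columns.
  unnest_wider(Start)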
# Target start end   Sequence
#  <chr>  <chr> <chr> <chr>   
#1 A      y1    y2    ccc, cct
#2 A      y1    y3    ccc, aag
#3 A      y2    y3    cct, aag
#4 B      y1    y4    aaa, aat
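
To see what combn() contributes on its own, here is a minimal sketch on plain vectors. The vectors `starts` and `seqs` are illustrative names that are not in the original answer; their values are copied from group A of the output above.

```r
# Illustrative vectors (hypothetical names), values taken from group A above.
starts <- c("y1", "y2", "y3")
seqs   <- c("ccc", "cct", "aag")

# Without FUN, combn() returns a matrix with one column per pair.
start_pairs <- combn(starts, 2)

# With FUN, each pair is passed to the function; toString() joins it with ", ".
seq_pairs <- combn(seqs, 2, toString)

print(start_pairs)
print(seq_pairs)
```

Each pair is listed once, with the earlier element first, so (y1, y2) appears but (y2, y1) does not.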

Here is another tidyverse approach that does not use combn().

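  1. group_by(Target, Start), so that any sequences sharing the same Target and Start can be collapsed into one row
  2. Drop Start from the grouping (this is what .groups = "drop_last" does)
  3. Convert the Start column to numeric, so that Start values can be compared directly
  4. Create a Start2 column holding the Start values greater than the row's own, then extract the corresponding sequence strings and store them in a sequence2 column
  5. Unnest the data frame on Start2 and sequence2 (because sapply can return several values per row)
  6. group_by(Target, Start, Start2), so that sequence can be pasted together with sequence2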
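library(tidyverse)

df %>%
  group_by(Target, Start) %>%
  # Collapse sequences sharing Target and Start; "drop_last" keeps the grouping by Target.
  summarize(sequence = paste0(sequence, collapse = ","), .groups = "drop_last") %>%
  # Extract the numeric part of Start (the regex backslash is doubled inside an R string).
  mutate(Start_num = as.numeric(str_extract(Start, "\\d+")),
         # x is itself a Start_num value, so compare against x directly.
         Start2 = sapply(Start_num, function(x) Start[which(Start_num > x)]),
         sequence2 = sapply(Start_num, function(x) sequence[which(Start_num > x)])) %>%
  # One row per (Start, Start2) pair; rows with no larger Start are dropped.
  unnest(cols = c(Start2, sequence2)) %>%
  group_by(Target, Start, Start2) %>%
  summarize(sequence = paste0(c(sequence, sequence2), collapse = ","), .groups = "drop")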
# A tibble: 4 × 4
#   Target Start Start2 sequence
#   <chr>  <chr> <chr>  <chr>
# 1 A      y1    y2     ccc,cct
# 2 A      y1    y3     ccc,aag,act
# 3 A      y2    y3     cct,aag,act
# 4 B      y1    y4     aaa,aat
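
For comparison, the same collapse-then-pair idea can be sketched in base R. This is not part of either answer, only a rough equivalent under stated assumptions: the `df` below is reconstructed from the printed output (including a second row for A/y3), `pair_group` is a hypothetical helper name, and Start labels are assumed to sort correctly as strings.

```r
# Sample data: an assumption, reconstructed from the output shown above.
df <- data.frame(
  Target   = c("A", "A", "A", "A", "B", "B"),
  Start    = c("y1", "y2", "y3", "y3", "y1", "y4"),
  sequence = c("ccc", "cct", "aag", "act", "aaa", "aat"),
  stringsAsFactors = FALSE
)

# Collapse rows that share Target and Start into one comma-separated string.
collapsed <- aggregate(sequence ~ Target + Start, data = df,
                       FUN = paste, collapse = ",")
# Sort by Target, then by Start (string order; assumes labels like y1..y9).
collapsed <- collapsed[order(collapsed$Target, collapsed$Start), ]

# Hypothetical helper: build every pair of rows within one Target group.
pair_group <- function(g) {
  if (nrow(g) < 2) return(NULL)        # a single Start value has no pair
  idx <- combn(seq_len(nrow(g)), 2)    # 2 x n_pairs matrix of row indices
  data.frame(
    Target   = g$Target[idx[1, ]],
    Start    = g$Start[idx[1, ]],
    Start2   = g$Start[idx[2, ]],
    sequence = paste(g$sequence[idx[1, ]], g$sequence[idx[2, ]], sep = ","),
    stringsAsFactors = FALSE
  )
}

pairs <- do.call(rbind, lapply(split(collapsed, collapsed$Target), pair_group))
rownames(pairs) <- NULL
print(pairs)
```

Pairing row indices rather than the values themselves keeps Start and sequence aligned without a second combn() call.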
