How can I collapse multiple rows into one row, based on a condition, using dplyr in R?



I'll use an example to illustrate my question.

Sample data:

df <- data.frame(ID = 1:5, Description = c("'foo' is a dog", "'bar' is a dog", "'foo' is a cat", "'foo' is not a cat", "'bar' is a fish"), Category = c("A", "A", "B", "B", "C"))
> df
  ID        Description Category
1  1     'foo' is a dog        A
2  2     'bar' is a dog        A
3  3     'foo' is a cat        B
4  4 'foo' is not a cat        B
5  5    'bar' is a fish        C

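What I want to do is collapse rows within the same category whose descriptions match apart from the quoted name, combining their IDs and quoted names. Expected output: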

  ID  Category Description
1 3   B        ‘foo’ is a cat
2 1,2 A        ‘foo,bar’ is a dog
3 5   C        ‘bar’ is a fish
4 4   B        ‘foo’ is not a cat

I'd like to start with dplyr, but I can't work out the whole approach. Can anyone help?

library(dplyr)

df %>%
  group_by(Category) %>%
  ## Some condition to check whether the text outside the single quotes is the same.
  ## If so, collapse those rows into one; otherwise leave them as they are.
  ## Regex to get the text outside the single quotes:
  ##   gsub("^'.*?'(.*)", "\\1", Description)
  ## Then collapse (placeholder only, this does not yet give the expected output):
  summarise(new_description = paste(Description, collapse = ","))

Here is another way to get that output.

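library(tidyverse)

df %>%
  # Pull out the quoted name, then strip it from the description.
  mutate(value = str_extract(Description, "'\\w+'"),
         Description = trimws(str_remove(Description, value))) %>%
  # Rows sharing the same remaining text and category collapse together.
  group_by(Description, Category) %>%
  summarise(ID = toString(ID),
            value = sprintf("'%s'", toString(gsub("'", "", value)))) %>%
  # Put the combined quoted names back in front of the description.
  unite(Description, value, Description, sep = ' ')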
#  Description         Category ID   
#  <chr>               <chr>    <chr>
#1 'foo' is a cat      B        3    
#2 'foo, bar' is a dog A        1, 2 
#3 'bar' is a fish     C        5    
#4 'foo' is not a cat  B        4    

I figured it out; feel free to suggest a better solution:

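library(dplyr)
library(stringr)

df %>%
  # sec: the text after the quoted name; content: the name inside the quotes.
  mutate(sec = gsub("^'.*?'(.*)", "\\1", Description),
         content = gsub("^'(.*?)'.*", "\\1", Description)) %>%
  group_by(sec, Category) %>%
  summarise(
    ID = str_c(unique(ID), collapse = ","),
    content = str_c(unique(content), collapse = ",")) %>%
  # Re-quote the combined names and append the shared text.
  mutate(Description = str_c(sQuote(content), sec)) %>%
  ungroup() %>%
  dplyr::select(ID, Category, Description)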
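
For anyone unsure what the two gsub() patterns in the snippet above pull out, here is a small base R sketch run on a few of the sample descriptions. The variable names desc, name and rest are only for illustration and are not part of the original code.

```r
# A few descriptions from the sample data (illustrative subset).
desc <- c("'foo' is a dog", "'bar' is a dog", "'foo' is not a cat")

# Capture group 1 is the text inside the first pair of single quotes.
name <- gsub("^'(.*?)'.*", "\\1", desc)

# Capture group 1 is everything after the closing quote (leading space kept).
rest <- gsub("^'.*?'(.*)", "\\1", desc)

print(name)  # "foo" "bar" "foo"
print(rest)  # " is a dog" " is a dog" " is not a cat"
```

Rows whose rest value and Category both match are the ones that end up collapsed, which is why the first two descriptions merge and the third stays on its own.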
