R: How to tidy data with var-val pairs concatenated in one column



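I've already tried to get this solved here on SO (here and here), and certainly got good answers, but I realised those are only partial solutions to what I consider a general problem: data is often organised with one column per variable (clearly the most interesting ones), followed by a final column in which several variable-value pairs are lumped together. I've been struggling to find a general way to turn the variables in that last column into separate columns. Shouldn't tidying data like this be a job for tidyr?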

require(dplyr)
require(stringr)
data <- 
      data.frame(
        shoptype=c("A","B","B"),
        city=c("bah", "bah", "slah"),
        sale=c("type cheese; price 200", "type ham; price 150","type cheese; price 100" )) %>%
      tbl_df()
> data
Source: local data frame [3 x 3]
  shoptype city                   sale
1        A  bah type cheese; price 200
2        B  bah    type ham; price 150
3        B slah type cheese; price 100

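Here we have information about some shops in some cities, with one concatenated column in which the var-val pairs are separated by ";" and each variable is separated from its value by a space. One would like output like this: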

  shoptype city   type price
1        A  bah cheese   200
2        B  bah    ham   150
3        B slah cheese   100
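
To make the cell format concrete, here is a minimal base R sketch that parses a single `sale` value into a named vector. The names `cell`, `pairs`, `parts` and `vals` are made up for illustration, and it assumes every pair is exactly "name value" with a single space:

```r
# Parse one concatenated cell into a named character vector (base R only).
cell  <- "type cheese; price 200"
pairs <- strsplit(cell, ";\\s*")[[1]]          # split the cell into var-val pairs
parts <- strsplit(pairs, " ", fixed = TRUE)    # split each pair into c(var, val)
vals  <- vapply(parts, `[`, character(1), 2)   # second element of each pair = value
names(vals) <- vapply(parts, `[`, character(1), 1)  # first element = variable name
vals
```

Every value comes back as character here, so numeric columns such as `price` would still need converting.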

This is simple when all rows are unique (see the linked SO questions):

require(plyr)
require(dplyr)
require(stringr)
require(tidyr)  
data %>%
  mutate(sale = str_split(as.character(sale), "; ")) %>%
  unnest(sale) %>%
  mutate(sale = str_trim(sale)) %>%
  separate(sale, into = c("var", "val")) %>%
  spread(var, val)

But if we change the shoptype in the second row to "A", we get an error because of the duplicates. Like so:

data2 <- 
  data.frame(
    shoptype=c("A","A","B"),
    city=c("bah", "bah", "slah"),
    sale=c("type cheese; price 200", "type ham; price 150","type cheese; price 100" )) %>%
  tbl_df()
data2 %>%
  mutate(sale = str_split(as.character(sale), "; ")) %>%
  unnest(sale) %>%
  mutate(sale = str_trim(sale)) %>%
  separate(sale, into = c("var", "val")) %>%
  spread(var, val)
Error: Duplicate identifiers for rows (2, 4), (1, 3)
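
The error can be reproduced without tidyr. The sketch below hand-builds the long table that the pipeline produces just before `spread()` (the data frame `long` is typed in by hand here, not computed) and checks which rows share the same shoptype/city/var key:

```r
# Long form of data2 after unnest() + separate(), written out by hand.
long <- data.frame(
  shoptype = c("A", "A", "A", "A", "B", "B"),
  city     = c("bah", "bah", "bah", "bah", "slah", "slah"),
  var      = c("type", "price", "type", "price", "type", "price"),
  val      = c("cheese", "200", "ham", "150", "cheese", "100"),
  stringsAsFactors = FALSE
)

# spread() needs every (shoptype, city, var) combination to identify one value.
key <- long[c("shoptype", "city", "var")]
dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
which(dup)  # rows whose key occurs more than once
```

Rows 1 and 3 share one key and rows 2 and 4 share another, which is what the error message is complaining about.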

I tried to work around this with a unique id (again, see the linked SO answer):

data2 %>%
  mutate(sale = str_split(as.character(sale), "; ")) %>%
  unnest(sale) %>%
  mutate(sale = str_trim(sale),
         v0=rownames(.)) %>%
  separate(sale, into = c("var", "val")) %>%
  spread(var, val)
Source: local data frame [6 x 5]
  shoptype city v0 price   type
1        A  bah  1    NA cheese
2        A  bah  2   200     NA
3        A  bah  3    NA    ham
4        A  bah  4   150     NA
5        B slah  5    NA cheese
6        B slah  6   100     NA

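This gives structurally missing data, and I can't figure out how to collapse it into the output I described above.

I think I'm really missing something that falls within the scope of tidyr (I hope!).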

I don't think there's any need for tidyr::unnest or tidyr::gather. Here's an alternative solution that focuses on stringr::str_replace and tidyr::separate:

library(dplyr)
library(stringr)
library(tidyr)
data2 %>%
  mutate(
    sale = str_replace(sale, "type ", ""),
    sale = str_replace(sale, " price ", "")
    ) %>%
  separate(sale, into = c("type", "price"), sep = ";") 
# Source: local data frame [3 x 4]
#   shoptype city   type price
# 1        A  bah cheese   200
# 2        A  bah    ham   150
# 3        B slah cheese   100

Add a secondary ID before splitting:

data2 %>%
  group_by(shoptype, city) %>%
  mutate(id2 = sequence(n())) %>%
  mutate(sale = str_split(as.character(sale), "; ")) %>%
  unnest(sale) %>%
  mutate(sale = str_trim(sale)) %>%
  separate(sale, into = c("var", "val")) %>%
  spread(var, val)
# Source: local data frame [3 x 5]
# 
#   shoptype city id2 price   type
# 1        A  bah   1   200 cheese
# 2        A  bah   2   150    ham
# 3        B slah   1   100 cheese
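
For what it's worth, the same idea can probably be written with the newer tidyr verbs, where `pivot_wider()` supersedes `spread()`. This is only a sketch: it assumes tidyr 1.0 or later, the column name `row_id` is made up, and it numbers rows globally rather than within each shoptype/city group:

```r
library(dplyr)
library(tidyr)

data2 <- tibble::tibble(
  shoptype = c("A", "A", "B"),
  city     = c("bah", "bah", "slah"),
  sale     = c("type cheese; price 200", "type ham; price 150", "type cheese; price 100")
)

res <- data2 %>%
  mutate(row_id = row_number()) %>%                        # id per original row, added before splitting
  separate_rows(sale, sep = ";\\s*") %>%                   # one var-val pair per row
  separate(sale, into = c("var", "val"), sep = " ") %>%    # split each pair into name and value
  pivot_wider(names_from = var, values_from = val)         # back to one row per original row
res
```

The key point is unchanged: the id has to be created before the pairs are split into rows, so that both pairs from one original row carry the same id.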

If you use some functions from my "splitstackshape" package, the code can be made more compact:

library(splitstackshape)  # getanID(), cSplit()
library(tidyr)            # spread()
as.data.frame(data2) %>%
  getanID(c("shoptype", "city")) %>%
  cSplit("sale", ";", "long") %>%
  cSplit("sale", " ") %>%
  spread(sale_1, sale_2)
#    shoptype city .id price   type
# 1:        A  bah   1   200 cheese
# 2:        A  bah   2   150    ham
# 3:        B slah   1   100 cheese

There are two good answers above, but I think this is a really good case for tidyr's extract:

data2 %>%
  extract(sale, c("type", "price"), "type (.+); price (.+)", convert = TRUE) 
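
To see what the two capture groups in that regex pick out, here is a rough base R equivalent. It is only an illustration of the pattern, not how `extract()` is implemented; `as.integer()` stands in for `convert = TRUE`, and the names `sale`, `m` and `out` are made up:

```r
sale <- c("type cheese; price 200", "type ham; price 150", "type cheese; price 100")

# regexec() returns the full match followed by one entry per capture group.
m <- regmatches(sale, regexec("type (.+); price (.+)", sale))

out <- data.frame(
  type  = vapply(m, `[`, character(1), 2),              # first capture group
  price = as.integer(vapply(m, `[`, character(1), 3)),  # second capture group
  stringsAsFactors = FALSE
)
out
```

Note that this pattern only fits cells that hold exactly a type followed by a price, so it is less general than the split-and-spread approaches above.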
