R 将 BigCartel csv 文件中的列拆分为数据帧或 data.table 中的长格式

Big Cartel有一个将订单导出到csv文件中的选项。但是，该结构对于我需要做的分析不是很好。

这是大型卡特尔 csv 订单下载中的列和行的子集(还有其他列对手头的问题并不重要(。

Number, Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99

项目列可以有多个"行项目"，并以分号 (;) 作为分隔符。每个"订单项"都有五个属性，用竖线 (|( 分隔，即product_name、product_option_name、数量、价格和总计(即行(。有一列，"项目计数"，它给出了"行项目"的数量以及(订单(总价、运费、税费和折扣的列。为了进行分析，我希望以下长格式的数据，其中运费，税费和折扣也被视为"产品项目"。

Number Buyer name line-item    product_option_name quantity price total
1      jim        Plate        Red                 1        9.99  9.99
1      jim        shipping                         1        1.99  1.99
1      jim        tax                              0        0     0
1      jim        discount                         0        0     0
2      bill       Plate        Green               1        9.99  9.99
2      bill       Plate        Blue                1        9.99  9.99
2      bill       shipping                         1        3     3
2      bill       tax                              0        0     0
2      bill       discount                         0        0     0
3      jane       Plate        Red                 1        6.99  6.99
3      jane       Thingy                           1        9.99  9.99
3      jane       Mug          Grey                1        10.99 10.99
3      jane       Cup          Grey                1        9.99  9.99
3      jane       Saucer       Grey                1        9.99  9.99
3      jane       Stopper                          1        9.99  9.99
3      jane       shipping                         1        7     7
3      jane       tax                              0        0     0
3      jane       discount                         0        0     0
4      dale       Plate        Green               1        10.99 10.99
4      dale       shipping                         1        4.99  4.99
4      dale       tax                              0        0     
4      dale       discount                         0        -1.99 -1.99

使用 r：data.table 中的 tstrsplit(( 和 r：splitstackshape 中的 cSplit(( 似乎是解决方案，但我无法正确理解语法。我也尝试了整洁的 dplyr 函数分离/传播等，但我就是无法获得我需要的输出，无论我做什么。

我一直在谷歌搜索和搜索所有 SO 问题 - 有一些解决方案(这个 R：用逗号将变量列拆分为多个(不平衡(列(很接近，但没有一个能让我到达那里，因为大多数假设一个广泛的格式而不是"长"。

这样的东西可能会让你得到你想要的东西。

library(dplyr)
library(tidyr)
library(stringr)
filepath <- # Path to datafile here
df <- read.csv(filepath, stringsAsFactors = FALSE)
cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))
df <- df %>%
separate(col = Items, into = cols, sep = ";", fill = "right") %>%
gather_("column", "details", cols, na.rm = TRUE) %>%
select(-column) %>%
separate(col = details, into = c("product_name", "product_option_name","quantity","price","total"), sep = "\|", fill = "right") %>%
mutate(product_name = sub("^.*\:", "", product_name),
product_option_name = sub("^.*\:", "", product_option_name),
quantity = sub("^.*\:", "", quantity),
price = sub("^.*\:", "", price),
total = sub("^.*\:", "", total)) %>%
gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
line_item = ifelse(line == "product_name", item, sub("^.*\.","", line)),
price = ifelse(line == "product_name", price, item),
price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
quantity = ifelse(line_item %in% c("shipping","discount","tax") & price == "0", 0, quantity),
total = as.numeric(price) * as.numeric(quantity)) %>%
distinct() %>%
select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
arrange(Number)

相关内容

最新更新

热门标签：