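My data comes from a SQL database in tabular form, with multiple rows per transaction. Rather than using only the "product" field, I want to use all the other columns in the data frame as items too.
My data looks like this: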
transID <- c('1','1','2','3')
state <- c('TX','TX','CA','MA')
product <- c('Oranges','Banana','Fish','Cheese')
Month <- c('January','January','Febuary','March')
Place <- c('A','A','B','C')
transactions <- data.frame(transID,state,product,Month,Place)
transactions
transID state product Month Place
1 1 TX Oranges January A
2 1 TX Banana January A
3 2 CA Fish Febuary B
4 3 MA Cheese March C
Ideally, my data would look like this:
1 (TX,Oranges,Banana,January,A)
2 (CA,Fish,Febuary,B)
3 (MA, Cheese, March,C)
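What is the best way to convert this kind of data into transaction format?
I tried the following, but it just pastes rows 1 and 2 together into a single string for the transaction, without removing the repeated state, month and place values: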
library(plyr)  # provides ddply()
transactionData <- ddply(transactions,c("transID"),
function(df1) paste(df1$state,
df1$product,
df1$Month,
df1$Place,
collapse = ","))
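This is a bit awkward because data.frames store character columns as factors (the default in R versions before 4.0.0), so each column has to be converted with as.character() before the values can be combined.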
library("arules")
# make all columns into items
df <- data.frame(
id = transactions$transID,
items = factor(c(as.character(transactions$state),
as.character(transactions$product),
as.character(transactions$Month),
as.character(transactions$Place))))
# remove duplicated state, month and place entries
df <- df[!duplicated(df),]
# this is from the manual page '? transactions'
trans <- as(split(df[,"items"], df[,"id"]), "transactions")
inspect(trans)
items transactionID
[1] {A,Banana,January,Oranges,TX} 1
[2] {B,CA,Febuary,Fish} 2
[3] {C,Cheese,MA,March} 3
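The same long-format idea (stack every column under the transaction ID, drop duplicates, then split by ID) can be written without naming each column by hand. This is only a minimal sketch: `item_cols` and `item_list` are names made up for illustration, and it assumes the ID column is called `transID` as in the question. The final coercion is the same `as(..., "transactions")` call used above and only runs if arules is installed.

```r
# Example data from the question
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C")
)

# Every column except the ID is treated as a source of items (assumption)
item_cols <- setdiff(names(transactions), "transID")

# Long format: one (id, item) pair per cell, duplicates removed
df <- unique(data.frame(
  id    = rep(as.character(transactions$transID), times = length(item_cols)),
  items = unlist(lapply(transactions[item_cols], as.character),
                 use.names = FALSE),
  stringsAsFactors = FALSE
))

# One character vector of items per transaction ID
item_list <- split(df$items, df$id)
# item_list[["1"]] is c("TX", "Oranges", "Banana", "January", "A")

# Optional: coerce to an arules transactions object if the package is available
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(item_list, "transactions")
  inspect(trans)
}
```

Note that identical values in different columns (for example a Place called "TX") would be merged into one item; prefixing each value with its column name would avoid that if it matters for your data.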
I hope this helps.
Here is a base R solution:
stack(tapply(transactions[, -1],
             transactions[, 1, drop = FALSE],
             FUN = function(DF) {
               # flatten all columns of the group, keep each value once
               paste(unique(unlist(DF, use.names = FALSE)), collapse = ',')
             }))[, 2:1]
# ind values
#1 1 TX,Oranges,Banana,January,A
#2 2 CA,Fish,Febuary,B
#3 3 MA,Cheese,March,C
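The main part is the tapply() call: it splits the data by transID, then unlists the rest of the data.frame for each group and keeps only the unique values. This is the output of the tapply() call: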
1 2 3
"TX,Oranges,Banana,January,A" "CA,Fish,Febuary,B" "MA,Cheese,March,C"
stack() and [, 2:1] are purely cosmetic; they turn the result into a tidy data.frame with the columns in a sensible order.
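The split / unlist / unique steps described above can also be spelled out with split() and vapply(), which may be easier to follow or adapt if passing a data.frame to tapply() is not an option in your R version. This is a sketch under the same assumptions as the question's data; `item_cols`, `groups`, `collapsed` and `result` are illustrative names, not part of the original answer.

```r
# Example data from the question
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C")
)

# All columns except the ID contribute items
item_cols <- setdiff(names(transactions), "transID")

# Step 1: split the non-ID columns into one small data.frame per transID
groups <- split(transactions[item_cols], transactions$transID)

# Step 2: per group, flatten the columns, keep unique values, paste together
collapsed <- vapply(groups, function(d) {
  values <- unlist(lapply(d, as.character), use.names = FALSE)
  paste(unique(values), collapse = ",")
}, character(1))

# Step 3: arrange as a two-column data.frame (ID first, items second)
result <- data.frame(
  transID = names(collapsed),
  items   = unname(collapsed),
  stringsAsFactors = FALSE
)
# result$items is c("TX,Oranges,Banana,January,A",
#                   "CA,Fish,Febuary,B",
#                   "MA,Cheese,March,C")
```

Converting each column with as.character() before unlist() keeps the behaviour the same whether the columns are stored as factors or as character vectors.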
How about reshaping it like this?
reshape(transactions,v.names = "product",timevar = "product",idvar = "state", direction = "wide")
transID state Month Place product.Oranges product.Banana product.Fish product.Cheese
1 1 TX January A Oranges Banana <NA> <NA>
3 2 CA Febuary B <NA> <NA> Fish <NA>
4 3 MA March C <NA> <NA> <NA> Cheese