Preparing a data frame of transactions for association rules in R



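My data comes from a SQL database in tabular form, where a single transaction spans multiple rows. Rather than using only the "product" field, I would like to use all the other columns in the data frame as well.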

My data looks like this:

transID <- c('1','1','2','3')
state <- c('TX','TX','CA','MA')
product <- c('Oranges','Banana','Fish','Cheese')
Month <- c('January','January','Febuary','March')
Place <- c('A','A','B','C')
transactions <- data.frame(transID,state,product,Month,Place)
transactions
  transID state product   Month Place
1       1    TX Oranges January     A
2       1    TX  Banana January     A
3       2    CA    Fish Febuary     B
4       3    MA  Cheese   March     C

Ideally, my data would look like this:

1 (TX,Oranges,Banana,January,A)
2 (CA,Fish,Febuary,B)
3 (MA,Cheese,March,C)

What is the best way to convert this kind of data into transaction format?

I tried the following, but it only pastes rows 1 and 2 together in full as a single transaction, so the shared values are repeated:

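library(plyr)  # provides ddply()
# paste() joins the columns row by row, then collapses the rows with ","
transactionData <- ddply(transactions, c("transID"),
                         function(df1) paste(df1$state,
                                             df1$product,
                                             df1$Month,
                                             df1$Place,
                                             collapse = ","))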
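
To see why that attempt repeats values, here is a minimal sketch using only the two rows of transaction 1 (the names `rows1`, `attempt` and `wanted` are made up for illustration). It assumes character columns; `paste()` first joins the columns within each row and only then collapses the rows, whereas the desired result needs the unique values across all columns.

```r
# The two rows that belong to transaction 1 (copied from the question's data)
rows1 <- data.frame(
  state   = c("TX", "TX"),
  product = c("Oranges", "Banana"),
  Month   = c("January", "January"),
  Place   = c("A", "A"),
  stringsAsFactors = FALSE
)

# What the ddply() attempt computes for this group:
# each row is pasted with spaces, then the rows are joined with ","
attempt <- paste(rows1$state, rows1$product, rows1$Month, rows1$Place,
                 collapse = ",")

# One possible alternative: flatten all columns, drop duplicates, then collapse
wanted <- paste(unique(unlist(lapply(rows1, as.character), use.names = FALSE)),
                collapse = ",")

print(attempt)
print(wanted)
```

The same idea can be applied per transaction by splitting the data frame on transID first.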

This is a little awkward because data.frames store the columns as factors.

library("arules")
# make all columns into items (id is recycled once per stacked column)
df <- data.frame(
  id = transactions$transID,
  items = factor(c(as.character(transactions$state),
                   as.character(transactions$product),
                   as.character(transactions$Month),
                   as.character(transactions$Place))))
# remove duplicated state, month and place entries
df <- df[!duplicated(df), ]
# this is from the manual page '? transactions'
trans <- as(split(df[, "items"], df[, "id"]), "transactions")
inspect(trans)

    items                         transactionID
[1] {A,Banana,January,Oranges,TX} 1            
[2] {B,CA,Febuary,Fish}           2            
[3] {C,Cheese,MA,March}           3    
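
The reshaping step above relies on the id column being recycled once for each stacked column. A slightly more explicit sketch of the same long format follows; the names `item_cols`, `long` and `items_by_id` are illustrative and not part of the original answer, and the arules step only runs if the package happens to be installed.

```r
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Columns whose values should become items (an assumption: everything but transID)
item_cols <- c("state", "product", "Month", "Place")

# Long format: one (id, item) pair per cell, with the ids repeated per column
long <- data.frame(
  id   = rep(transactions$transID, times = length(item_cols)),
  item = unlist(lapply(transactions[item_cols], as.character), use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop the pairs that repeat because state, Month and Place are shared
long <- unique(long)

# One character vector of items per transaction
items_by_id <- split(long$item, long$id)
print(items_by_id)

# Optional: coerce to an arules transactions object when arules is available
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(items_by_id, "transactions")
  inspect(trans)
}
```

One caveat of this approach is that values from different columns share a single item namespace, so a state and a place with the same label would be treated as the same item.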

I hope this helps.

Here is a base R solution:

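# Note: passing a data frame as X to tapply() needs a recent R
# (4.3.0 or later, as far as I know)
stack(tapply(transactions[, -1],
             transactions[, 1, drop = FALSE],
             FUN = function(DF) {
               paste(unique(unlist(DF, use.names = FALSE)), collapse = ',')
             }))[, 2:1]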
#  ind                      values
#1   1 TX,Oranges,Banana,January,A
#2   2           CA,Fish,Febuary,B
#3   3           MA,Cheese,March,C

The key part is the tapply() call, which splits by transID, then unlists the rest of the data.frame and keeps only the unique values. This is the output of the tapply() call:

                            1                             2                             3 
"TX,Oranges,Banana,January,A"           "CA,Fish,Febuary,B"           "MA,Cheese,March,C" 

The stack()[, 2:1] part is purely cosmetic, to produce a tidy, ordered data.frame.
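
If tapply() does not accept a data frame in your R version, the same split, unlist, unique and paste steps can be sketched with split() and vapply() instead. This is only an illustrative variant, not the answer's own code; the names `collapsed` and `result` are made up here.

```r
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Split the non-id columns by transID, then collapse the unique values per group
collapsed <- vapply(
  split(transactions[, -1], transactions$transID),
  function(rows) {
    values <- unlist(lapply(rows, as.character), use.names = FALSE)
    paste(unique(values), collapse = ",")
  },
  character(1)
)

# Arrange the named vector as a two-column data frame
result <- data.frame(
  transID = names(collapsed),
  items   = unname(collapsed),
  stringsAsFactors = FALSE
)
print(result)
```

Converting each column with as.character() first keeps the sketch working whether the columns are stored as characters or as factors.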

How about reshaping it like this?

reshape(transactions,v.names = "product",timevar = "product",idvar = "state", direction = "wide")
  transID state   Month Place product.Oranges product.Banana product.Fish product.Cheese
1       1    TX January     A         Oranges         Banana         <NA>           <NA>
3       2    CA Febuary     B            <NA>           <NA>         Fish           <NA>
4       3    MA   March     C            <NA>           <NA>         <NA>         Cheese
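
Note that the call above keys on state, which works here only because each state belongs to exactly one transaction. A hedged variant keyed on transID, followed by one possible way to turn each wide row into an item vector, might look like this (`wide` and `items` are illustrative names, and the sketch assumes state, Month and Place are constant within a transaction):

```r
transactions <- data.frame(
  transID = c("1", "1", "2", "3"),
  state   = c("TX", "TX", "CA", "MA"),
  product = c("Oranges", "Banana", "Fish", "Cheese"),
  Month   = c("January", "January", "Febuary", "March"),
  Place   = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Same reshape as above, but identifying rows by transaction id
wide <- reshape(transactions, v.names = "product", timevar = "product",
                idvar = "transID", direction = "wide")

# Collect the non-missing values of each row (excluding transID) as items
items <- lapply(seq_len(nrow(wide)), function(i) {
  values <- unlist(lapply(wide[i, -1], as.character), use.names = FALSE)
  values[!is.na(values)]
})
names(items) <- wide$transID
print(items)
```

The wide layout creates one column per distinct product, so it may become unwieldy when the product catalogue is large.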
