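Previous question
In this post, I asked how to extract the so-called tidList, which records, for every transaction used to mine the frequent sequences, whether each discovered frequent sequence is present. More specifically: how can a Boolean matrix (indicating the presence or absence of each sequence) be extracted so that its row order matches that of the original transaction dataset?
In the end, this turned out to be easy to do by using the transactionInfo attribute of the tidLists stored in the object of class sequences.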
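New question
This question differs slightly from the previous one: given a set of frequent sequences, how can I "score" new transactions? That is, given an object of class sequences, how do I obtain an object of class tidLists from a new object of class transactions?
To illustrate this, I put together an example using a toy dataset: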
library(arules)
library(arulesSequences)
library(stringr)
#Function used to convert character string into an object of type transactions.
#Source: https://github.com/cran/clickstream/blob/master/R/Clickstream.r.
as.transactions <- function(clickstreamList) {
  transactionID <- unlist(lapply(seq(1, length(clickstreamList), 1), FUN =
    function(x) rep(names(clickstreamList)[x], length(clickstreamList[[x]]))), use.names = F)
  sequenceID <- unlist(lapply(seq(1, length(clickstreamList), 1), FUN =
    function(x) rep(x, length(clickstreamList[[x]]))))
  eventID <- unlist(lapply(clickstreamList, FUN = function(x)
    1:length(x)), use.names = F)
  transactionInfo <- data.frame(transactionID, sequenceID, eventID)
  tr <- as(as.data.frame(unlist(clickstreamList, use.names = F)), "transactions")
  transactionInfo(tr) <- transactionInfo
  itemInfo(tr)$labels <- itemInfo(tr)$levels
  return(tr)
}
#Dataset to mine frequent sequences from
data_mine_freq_seq <- data.frame(id = 1:10,
                                 transaction = c("A B B A",
                                                 "A B C B D C B B B F A",
                                                 "A A B",
                                                 "B A B A",
                                                 "A B B B B",
                                                 "A A A B",
                                                 "A B B A B B",
                                                 "E F F A C B D A B C D E",
                                                 "A B B A B",
                                                 "A B"))
#Convert data to list containing character vectors
data_for_fseq_mining <- str_split(string = data_mine_freq_seq$transaction, pattern = " ")
#Include identifiers as names
names(data_for_fseq_mining) <- data_mine_freq_seq$id
#Convert to object of type transactions
data_for_fseq_mining_trans <- as.transactions(clickstreamList = data_for_fseq_mining)
#Mine frequent sequences with cspade, given some parameters.
sequences <- cspade(data = data_for_fseq_mining_trans,
                    parameter = list(support = 0.10, maxlen = 4, maxgap = 2),
                    control = list(tidList = TRUE, verbose = TRUE))
#Create a data frame that contains all sequences and their support (167 sequences in total).
sequences_df <- cbind(sequence = labels(sequences),
                      support = sequences@quality)
Now I create a new dataset that contains just one transaction:
data_score <- data.frame(id = 11, transaction = "A B B C D A")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
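How can I find out which of the frequent sequences contained in the object "sequences" are present in "data_score_trans"?
Edit: I tried the following line of code: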
supportingTransactions(x = sequences, transactions = data_score_trans)
This produces the expected and desired result:
tidLists in sparse format with
167 items/itemsets (rows) and
1 transactions (columns)
However, when the new transaction contains an element that does not occur in the original dataset, an error is raised:
#Added a 'G' at the end of the transaction. Element 'G' is not an element in
#'data_mine_freq_seq'.
data_score <- data.frame(id = 11, transaction = "A B B C D A G")
#Convert data to list containing character vectors
data_score_list <- str_split(string = data_score$transaction, pattern = " ")
#Include identifier as name
names(data_score_list) <- data_score$id
#Convert to object of type transactions
data_score_trans <- as.transactions(clickstreamList = data_score_list)
#Score 'data_score_trans' using 'sequences' again:
supportingTransactions(x = sequences, transactions = data_score_trans)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
How can this be solved?
I came up with a workaround that relies on regular expressions. I defined the following function:
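One direction that might be worth trying, offered only as an untested sketch: since an item that never occurred in the mined data cannot be part of any frequent sequence, unseen items could be dropped from the new transaction before scoring. The snippet below shows only that filtering step in base R. The names `known_items`, `new_events`, `new_long` and `new_known` are hypothetical, the item set A to F is copied by hand from the toy data, and whether this actually avoids the error in supportingTransactions has not been verified.

```r
# Items that occur in the data the frequent sequences were mined from
# (typed in by hand here; an assumption for this toy example).
known_items <- c("A", "B", "C", "D", "E", "F")

# New transaction containing the unseen item "G".
new_events <- strsplit("A B B C D A G", " ", fixed = TRUE)[[1]]

# Long format with the original event positions, so that the distance
# between the remaining events is preserved after filtering.
new_long <- data.frame(eventID = seq_along(new_events),
                       item = new_events,
                       stringsAsFactors = FALSE)

# Drop every event whose item was never seen during mining.
new_known <- new_long[new_long$item %in% known_items, ]
print(new_known)
```

Note that as.transactions above always numbers events 1, 2, 3, ..., so it would have to be adapted to accept these eventID values if the original positions matter (for example when maxgap is set).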
score_pattern <- function(pattern, events){
  #Extract the elements of the sequence, e.g. "{B}" and "{A}" from "<{B},{A}>".
  #Note: braces must be escaped with a double backslash inside an R string.
  regex_elements <- str_extract_all(string = pattern, pattern = "\\{.*?\\}")
  regex_elements <- str_replace_all(string = unlist(regex_elements),
                                    pattern = "\\{|\\}", replacement = "")
  #Build a regular expression that matches the elements in order.
  expr <- ""
  for(i in seq_along(regex_elements)){
    if(i == 1){
      expr <- paste0(expr, "(^| )", regex_elements[i], collapse = "")
    } else {
      expr <- paste0(expr, "( | .*? )", regex_elements[i], collapse = "")
    }
  }
  expr <- paste0(expr, "( |$)", collapse = "")
  print(expr)
  #1 if the sequence is found in the transaction, 0 otherwise.
  score <- ifelse(test = grepl(pattern = expr, x = events),
                  yes = 1, no = 0)
  return(score)
}
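To illustrate its use, here is an example in which I take a sequence from the object "sequences_df" (column "sequence") and the transaction data from "data_score" (column "transaction"):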
score_pattern(pattern = "<{B},{A}>", events = data_score$transaction)
[1] "(^| )B( | .*? )A( |$)"
[1] 1
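The function returns a numeric vector of zeros and ones indicating whether the sequence is present in each supplied transaction (1 = yes, 0 = no).
Although this is a solution, it only works when there is no restriction on the maximum gap between consecutive elements of a sequence: the generated regular expression has no notion of a "maximum gap". Conclusion: this is only valid when the "maxgap" parameter of the cspade algorithm is not set.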
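A gap-aware check can be sketched in base R by matching on event positions instead of on a regular expression. The functions `parse_sequence_label` and `contains_sequence` below are hypothetical helpers, not part of arulesSequences. The sketch assumes that every element of the sequence holds a single item (as in the toy data), that events are separated by single spaces, and that the gap is the difference between event positions, which corresponds to the eventID values 1, 2, 3, ... assigned by as.transactions. It has not been compared against the output of cspade.

```r
# Split a label such as "<{B},{A}>" into its elements, e.g. c("B", "A").
parse_sequence_label <- function(label) {
  elements <- regmatches(label, gregexpr("\\{[^}]*\\}", label))[[1]]
  gsub("[{}]", "", elements)
}

# TRUE if the sequence occurs in 'events' (one space-separated string) such
# that consecutive matched elements are between 1 and 'maxgap' positions apart.
contains_sequence <- function(label, events, maxgap = Inf) {
  elements <- parse_sequence_label(label)
  tokens <- strsplit(events, " ", fixed = TRUE)[[1]]
  # Positions where a partial match of the sequence can currently end.
  ends <- which(tokens == elements[1])
  for (element in elements[-1]) {
    candidates <- which(tokens == element)
    # Keep a candidate only if some earlier match ends within 'maxgap' before it.
    reachable <- vapply(candidates, function(p) {
      any(p - ends >= 1 & p - ends <= maxgap)
    }, logical(1))
    ends <- candidates[reachable]
    if (length(ends) == 0) return(FALSE)
  }
  length(ends) > 0
}

contains_sequence("<{B},{A}>", "A B B C D A")             # no gap limit
contains_sequence("<{B},{A}>", "A B B C D A", maxgap = 2) # closest B is 3 positions before the last A
contains_sequence("<{B},{C}>", "A B B C D A", maxgap = 2)
```

Without a gap limit the first call agrees with score_pattern above. With maxgap = 2 the sequence <{B},{A}> is no longer found, because the only A that follows a B sits at position 6 and the nearest B is at position 3.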