Create a table of all pairs of values from one column in R, counting unique values

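I have data showing which items customers purchased. A customer can purchase the same item more than once. What I need is a table showing every possible pairwise combination of items, along with the number of unique customers who purchased that combination (the diagonal of the table is simply the number of unique customers who purchased each item).

Here is an example: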

item <- c("h","h","h","j","j")
customer <- c("a","a","b","b","b")
test.data <- data.frame(item,customer)

test.data:

item customer
h    a
h    a
h    b
j    b
j    b

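Desired result: a table with the items as row and column names, where each cell holds the count of unique customers who purchased that pair. So 2 customers purchased item h, 1 customer purchased both item h and item j, and 1 customer purchased item j.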

item   h    j
h      2    1
j      1    1

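I have tried the table function, melt/cast, and so on, but nothing gives me the counts I need in the table. My first step was to use unique() to get rid of duplicate rows.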

Using the data.table and gtools packages, we can build every possible permutation of items for each customer:

library(data.table)
library(gtools)
item <- c("h","h","h","j","j")
customer <- c("a","a","b","b","b")
test.data <- data.table(item,customer)
DT <- unique(test.data) # drop duplicate rows so repeat purchases are counted once
# All ordered pairs (with repetition) of one customer's items
tuples <- function(x){
  data.frame(permutations(length(x), 2, x, repeats.allowed = TRUE, set = FALSE), stringsAsFactors = FALSE)
}
DO <- DT[, tuples(item), by = customer]

This gives:

   customer X1 X2
1:        a  h  h
2:        b  h  h
3:        b  h  j
4:        b  j  h
5:        b  j  j

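That is the list of all unique item pairings for each customer. Following your example, we treat h x j and j x h as distinct pairs. We can now use the table function to get the frequency of each pair: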

table(DO$X1,DO$X2)
    h j
  h 2 1
  j 1 1
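As a cross-check on the counts above, the same matrix can probably be reached without enumerating permutations, by building a customer-by-item incidence table and multiplying it by its own transpose. This is a swapped-in technique, not what the answer above does; the names `pairs`, `incidence`, and `co_counts` are made up for this sketch, which uses base R only.

```r
item <- c("h", "h", "h", "j", "j")
customer <- c("a", "a", "b", "b", "b")
test.data <- data.frame(item, customer)

# Drop repeat purchases so each (customer, item) pair counts once
pairs <- unique(test.data)

# Customer-by-item table: 1 if the customer bought the item, 0 otherwise
incidence <- table(pairs$customer, pairs$item)

# crossprod(m) computes t(m) %*% m, so cell [i, j] counts the customers
# who bought both item i and item j; the diagonal counts buyers per item
co_counts <- crossprod(incidence)
print(co_counts)
```

For the sample data this should give 2 for h with h, 1 for h with j, and 1 for j with j, matching the desired result in the question.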

A base R solution:

n_intersect <- Vectorize( function(x,y) length(intersect(x,y)) )
cs_by_item <- with(test.data, tapply(customer, item, unique))
outer(cs_by_item , cs_by_item , n_intersect)
#   h j
# h 2 1
# j 1 1
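To see why the base R version works, it may help to look at the intermediate object. The sketch below reuses the names from the answer and adds only `result` as a hypothetical variable to hold the output; `stringsAsFactors = FALSE` is set explicitly as an assumption so that the customers stay character vectors on older R versions too.

```r
item <- c("h", "h", "h", "j", "j")
customer <- c("a", "a", "b", "b", "b")
test.data <- data.frame(item, customer, stringsAsFactors = FALSE)

# One element per item, each holding the distinct customers who bought it
cs_by_item <- with(test.data, tapply(customer, item, unique))
print(cs_by_item[["h"]])  # customers who bought h
print(cs_by_item[["j"]])  # customers who bought j

# outer() passes whole vectors of list elements to its function in one call,
# so Vectorize() is needed to apply intersect() to one pair at a time
n_intersect <- Vectorize(function(x, y) length(intersect(x, y)))

# Cell [i, j] is the number of customers common to item i and item j
result <- outer(cs_by_item, cs_by_item, n_intersect)
print(result)
```

Because `unique()` is applied per item, repeat purchases drop out before any counting happens, which is why no separate deduplication step is needed here.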
