R: Computing the frequency of pairwise matching strings between all rows of a matrix



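I have a 5000 x 1000 character matrix in R, where each entry is a color (red, blue, yellow, green, etc.). I want to compute, in a pairwise fashion, how often the colors (strings) match between each pair of rows across all columns. Each of the 1000 columns represents a different iteration of color labels, and there is no limit on the number of distinct labels per column. For example, the first column might have 8 distinct color labels, the second 10, the third 11, and so on. I am not interested in the labels themselves, only in how often a pair of rows matches or does not match in each column.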

For example, my character matrix looks something like this (without the artificial, regularly repeating color pattern):

colors <- sample(c("grey", "green", "blue", "pink", "brown", "purple", "cyan", "red", "yellow"), 8, replace = TRUE)
labels <- matrix(rep_len(colors, 10 * 5), nrow = 10, ncol = 5)  # recycle the 8 sampled colors to fill all 50 cells
labels
      [,1]     [,2]     [,3]     [,4]     [,5]
 [1,] "brown"  "purple" "yellow" "green"  "brown"
 [2,] "grey"   "red"    "brown"  "red"    "grey"
 [3,] "purple" "yellow" "green"  "brown"  "purple"
 [4,] "red"    "brown"  "red"    "grey"   "red"
 [5,] "yellow" "green"  "brown"  "purple" "yellow"
 [6,] "brown"  "red"    "grey"   "red"    "brown"
 [7,] "green"  "brown"  "purple" "yellow" "green"
 [8,] "red"    "grey"   "red"    "brown"  "red"
 [9,] "brown"  "purple" "yellow" "green"  "brown"
[10,] "grey"   "red"    "brown"  "red"    "grey"

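I want to use this to construct a symmetric 5000 x 5000 square matrix of pairwise match frequencies between rows. Each entry [i,j] (and [j,i]) should be the fraction of columns in which row i and row j match. For example, in the toy label matrix above, row 1 matches row 6 in columns 1 and 5 but in no other column, so I want the match frequency (2/5 = 0.4) to be entries [1,6] and [6,1] of the "frequency matrix". The diagonal will be all 1s, since every row always matches itself. The output should look something like this: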

freq.mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0  0.4    0    0    1     0
 [2,]    0    1    0    0  0.2  0.4    0    0    0     1
 [3,]    0    0    1    0    0    0    0  0.2    0     0
 [4,]    0    0    0    1    0    0  0.2  0.6    0     0
 [5,]    0  0.2    0    0    1    0    0    0    0   0.2
 [6,]  0.4  0.4    0    0    0    1    0    0  0.4   0.4
 [7,]    0    0    0  0.2    0    0    1    0    0     0
 [8,]    0    0  0.2  0.6    0    0    0    1    0     0
 [9,]    1    0    0    0    0  0.4    0    0    1     0
[10,]    0    1    0    0  0.2  0.4    0    0    0     1
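
To make the definition concrete, a single entry can be checked by comparing two rows elementwise and taking the mean. This is only a minimal sketch: the toy matrix is rebuilt from the printout above (the name toy_colors is introduced here for illustration) so that it does not depend on the random sample.

```r
# Toy label matrix rebuilt from the printed example; matrix() fills column by column.
toy_colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep_len(toy_colors, 50), nrow = 10, ncol = 5)

# Fraction of columns in which rows 1 and 6 carry the same label.
freq_1_6 <- mean(labels[1, ] == labels[6, ])
freq_1_6
```

Rows 1 and 6 agree in columns 1 and 5 only, so this evaluates to 2/5.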

I tried applying the rowSums function as follows:

freq.mat <- apply(labels, 1, function(x) rowSums(x == labels))
diag(freq.mat) <- 1
freq.mat / 10

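It produced a matrix of the right size, but the entries were not the pairwise row match frequencies I was hoping for. I also tinkered with some nested for loops but did not get far, and that also felt very much "against the spirit" of R programming.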

Can anyone point me in the right direction? Many thanks!

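You are comparing the wrong values. In x == labels, the row x is recycled down the columns of labels (R stores matrices column by column), so its elements do not line up with the rows. Comparing against the transpose puts each row of labels in a column, where the recycling lines up, and colMeans then gives the fraction of matching columns: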

apply(labels, 1, function(x) colMeans(x == t(labels)))
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
 [2,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
 [3,]  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.2  0.0   0.0
 [4,]  0.0  0.0  0.0  1.0  0.0  0.0  0.2  0.6  0.0   0.0
 [5,]  0.0  0.2  0.0  0.0  1.0  0.0  0.0  0.0  0.0   0.2
 [6,]  0.4  0.4  0.0  0.0  0.0  1.0  0.0  0.0  0.4   0.4
 [7,]  0.0  0.0  0.0  0.2  0.0  0.0  1.0  0.0  0.0   0.0
 [8,]  0.0  0.0  0.2  0.6  0.0  0.0  0.0  1.0  0.0   0.0
 [9,]  1.0  0.0  0.0  0.0  0.0  0.4  0.0  0.0  1.0   0.0
[10,]  0.0  1.0  0.0  0.0  0.2  0.4  0.0  0.0  0.0   1.0
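
At the full 5000 x 1000 size, the apply() call builds a 1000 x 5000 logical comparison for each of the 5000 rows. One possible alternative, sketched here under the assumption that the matrix contains no NA values, is to one-hot encode each column and accumulate tcrossprod() of the indicator matrices. The helper name pairwise_match_freq is hypothetical, and the toy matrix is rebuilt from the printout so the block runs on its own.

```r
# Hypothetical helper: fraction of columns in which each pair of rows matches.
# Assumes m is a character matrix without NA values.
pairwise_match_freq <- function(m) {
  n <- nrow(m)
  counts <- matrix(0, n, n)
  for (k in seq_len(ncol(m))) {
    f <- factor(m[, k])
    # n x (number of distinct labels) indicator matrix for column k
    z <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1
    # z %*% t(z) is 1 where two rows share a label in column k, 0 otherwise
    counts <- counts + tcrossprod(z)
  }
  counts / ncol(m)
}

# Toy label matrix rebuilt from the printed example (filled column by column).
toy_colors <- c("brown", "grey", "purple", "red", "yellow", "brown", "green", "red")
labels <- matrix(rep_len(toy_colors, 50), nrow = 10, ncol = 5)

freq <- pairwise_match_freq(labels)
reference <- apply(labels, 1, function(x) colMeans(x == t(labels)))
all.equal(freq, reference)
```

Whether this is actually faster than the apply() version depends on the BLAS in use and on how many distinct labels each column has, so it is worth benchmarking on a subset of rows first.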
