Counting the number of columns in common between every pair of rows of a data frame, to create an all-vs-all matrix



Here is an example of the dataset I have:

Rows represent different materials and columns represent different genes.

  gene1   gene2   gene3
1 PRESENT LOST    PRESENT
2 PRESENT PRESENT LOST
3 LOST    PRESENT PRESENT

I want to find the number of genes shared between materials and create a material-vs-material data frame:

> test_self
  1 2 3
1 2 1 1
2 1 2 1
3 1 1 2

I created this data frame in R with the following code:

gene1 <- c("PRESENT", "PRESENT", "LOST")
gene2 <- c("LOST", "PRESENT", "PRESENT")
gene3 <- c("PRESENT", "LOST", "PRESENT")
test_PAV <- data.frame(gene1, gene2, gene3)

# empty all-vs-all matrix, labelled by material
test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self) <- row.names(test_PAV)
row.names(test_self) <- row.names(test_PAV)

for (rowInQuestion in 1:nrow(test_PAV)) {
  for (rowBeingComparedTo in 1:nrow(test_PAV)) {
    commonGeneCount <- 0
    # count columns where both materials are PRESENT
    for (col in 1:ncol(test_PAV)) {
      if (test_PAV[rowInQuestion, col] == "PRESENT" && test_PAV[rowBeingComparedTo, col] == "PRESENT") {
        commonGeneCount <- commonGeneCount + 1
      }
    }
    test_self[rowInQuestion, rowBeingComparedTo] <- commonGeneCount
  }
}

But I'm well aware that this is a very inefficient solution. For a start, since the upper and lower triangles are symmetric, they don't need to be computed twice.

I found a similar solution here, but there the commonality is within a single row; in my case, for a gene to count as common, the same gene (column) must be "PRESENT" in both materials.
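
To spell out what I mean by "common": if the genes are recoded as an indicator matrix, the count for a pair of materials is simply the number of columns where both are TRUE. A rough sketch of that definition (sharedCount is only for illustration):

# indicator matrix: TRUE where a gene is PRESENT
m <- as.matrix(test_PAV) == "PRESENT"

# number of genes PRESENT in both material i and material j
sharedCount <- function(i, j) sum(m[i, ] & m[j, ])

sharedCount(1, 2)  # should give 1 for the example above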

Thanks.

First, let's keep only the PRESENT genes and build a frozenset of the genes present in each row:

import pandas as pd

# df is the input DataFrame (rows = materials, columns = genes)
s = (
    df.where(df.eq('PRESENT'))
      .stack()
      .reset_index(level=1)
      .groupby(level=0)['level_1'].agg(frozenset)
)
# 1    (gene3, gene1)
# 2    (gene2, gene1)
# 3    (gene3, gene2)
# Name: level_1, dtype: object

Now we can use itertools.product (or the faster combinations) to build the expected data frame from the lengths of the intersections of the gene sets:

from itertools import product

out = (
    pd.Series({(i, j): len(a.intersection(b))
               for (i, a), (j, b) in product(zip(s.index, s), repeat=2)})
      .unstack()
)
#    1  2  3
# 1  2  1  1
# 2  1  2  1
# 3  1  1  2

combinations is faster because only one of A/B and B/A is computed, and A/A is not computed at all:

from itertools import combinations

out = (
    pd.Series({(i, j): len(a.intersection(b))
               for (i, a), (j, b) in combinations(zip(s.index, s), r=2)})
      .unstack()
)
#      2    3
# 1  1.0  1.0
# 2  NaN  1.0

Vectorization will improve speed on larger datasets:

test_PAV <- as.data.frame(matrix(sample(c("PRESENT", "LOST"), 3e4, replace = TRUE), ncol = 150))

fCommonVec <- function(test_PAV) {
  m <- as.matrix(test_PAV) == "PRESENT"
  n <- nrow(m) - 1L
  test_self <- matrix(nrow = nrow(test_PAV), ncol = nrow(test_PAV))
  # all row pairs (i < j) at once: shared-gene counts go into the lower triangle
  test_self[lower.tri(test_self)] <- rowSums(m[rep.int(1:n, n:1), ] & m[sequence(n:1, 2:nrow(m)), ])
  # a material shares all of its own PRESENT genes with itself
  diag(test_self) <- rowSums(m)
  return(test_self)
}
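
As a quick sanity check on the small toy example from the question (recreated below), the result should have the shared counts in the lower triangle and the per-material totals on the diagonal, with the upper triangle left NA:

small_PAV <- data.frame(gene1 = c("PRESENT", "PRESENT", "LOST"),
                        gene2 = c("LOST", "PRESENT", "PRESENT"),
                        gene3 = c("PRESENT", "LOST", "PRESENT"))
fCommonVec(small_PAV)
#>      [,1] [,2] [,3]
#> [1,]    2   NA   NA
#> [2,]    1    2   NA
#> [3,]    1    1    2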

Comparing with a loop version:

fCommonLoop <- function(test_PAV) {
  pavMatrix_transposeBinary <- as.matrix(test_PAV) == "PRESENT"
  pavSelfSimilarity <- matrix(nrow = nrow(pavMatrix_transposeBinary), ncol = nrow(pavMatrix_transposeBinary))
  
  for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)) {
    for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)) {
      pavSelfSimilarity[rowInQuestion, rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion, ] * pavMatrix_transposeBinary[rowBeingComparedTo, ])
    }
  }
  
  return(pavSelfSimilarity)
}
test_self <- fCommonVec(test_PAV)
pavSelfSimilarity <- fCommonLoop(test_PAV)
all.equal(test_self[lower.tri(test_self, diag = TRUE)], pavSelfSimilarity[lower.tri(pavSelfSimilarity, diag = TRUE)])
#> [1] TRUE
microbenchmark::microbenchmark(fCommonVec = fCommonVec(test_PAV),
                               fCommonLoop = fCommonLoop(test_PAV))
#> Unit: milliseconds
#>         expr      min       lq      mean   median       uq      max neval
#>   fCommonVec  25.5148  30.6751  36.76842  32.6326  36.0579  76.0979   100
#>  fCommonLoop 215.4542 226.2288 235.84177 229.6783 235.9802 304.1668   100

UPDATE

For really big datasets, here is a function that iterates over columns to reduce memory usage:

test_PAV <- as.data.frame(matrix(sample(c("PRESENT", "LOST"), 267*113000, replace = TRUE), nrow = 267))

fCommonVec2 <- function(test_PAV) {
  m <- as.matrix(test_PAV) == "PRESENT"
  # index vectors for all row pairs (i < j)
  idx1 <- (nrow(m) - 1L):1
  idx2 <- rep.int(1:(nrow(m) - 1L), idx1)
  idx3 <- sequence(idx1, 2:nrow(m))
  v <- integer(length(idx2))
  
  # accumulate shared-gene counts one gene (column) at a time
  for (i in 1:ncol(m)) {
    v <- v + (m[idx2, i] & m[idx3, i])
  }
  
  test_self <- matrix(nrow = nrow(test_PAV), ncol = nrow(test_PAV))
  test_self[lower.tri(test_self)] <- v
  diag(test_self) <- rowSums(m)
  return(test_self)
}

It took about 40 seconds on my machine and used only a small fraction of my 8 GB of RAM.

system.time(test_self2 <- fCommonVec2(test_PAV))
#>    user  system elapsed 
#>   36.52    0.24   36.76
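
Note that both fCommonVec and fCommonVec2 only fill the lower triangle and the diagonal. If you want the full symmetric, labelled matrix, mirroring afterwards is cheap; something along these lines should do (fillUpper is just a name made up for this sketch):

# copy the lower triangle into the upper one and label rows/columns by material
fillUpper <- function(x, labels = NULL) {
  x[upper.tri(x)] <- t(x)[upper.tri(x)]
  dimnames(x) <- list(labels, labels)
  x
}

test_self_full <- fillUpper(test_self2, labels = rownames(test_PAV))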

In case anyone wants to use it, I also came up with another solution.

Instead of working on a data frame of "PRESENT" and "LOST", I used sed to convert them to 1 and 0:

sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt

Then I used this binary matrix to take the element-wise product of every pair of rows and sum the resulting vector:

for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)) {
  for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)) {
    pavSelfSimilarity[rowInQuestion, rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion, ] * pavMatrix_transposeBinary[rowBeingComparedTo, ])
  }
}
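
Since the matrix is now numeric 0/1, the body of that double loop is just the dot product of two rows, so the whole thing could presumably be collapsed into a single matrix product, something like this (assuming pavMatrix_transposeBinary already holds the 0/1 matrix):

# entry (i, j) = sum over genes of row_i * row_j, i.e. the shared-gene count
pavSelfSimilarity <- tcrossprod(pavMatrix_transposeBinary)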

This for loop with awk should work.

for i in {2..4}; do
  for j in {2..4}; do
    tmp=$(awk '{print $'$i',$'$j'}' input | sed 1d | awk '$1==$2 {print}' | grep -c "PRESENT")
    echo -en "$tmp\t"
  done
  echo -en "\n"
done >> test_self

Output:

2       1       1
1       2       1
1       1       2

You may need to add the headers yourself; the order should be the same.

Thanks!!
