Here is a sample of the dataset I have:
Rows represent different materials and columns represent different genes.
  gene1   gene2   gene3
1 PRESENT LOST    PRESENT
2 PRESENT PRESENT LOST
3 LOST    PRESENT PRESENT
I want to find the number of genes common between materials and create a data frame comparing material against material:
> test_self
1 2 3
1 2 1 1
2 1 2 1
3 1 1 2
I am using R, and I created this data frame with the following code:
gene1 <- c("PRESENT", "PRESENT", "LOST")
gene2 <- c("LOST", "PRESENT", "PRESENT")
gene3 <- c("PRESENT", "LOST", "PRESENT")
test_PAV <- data.frame(gene1, gene2, gene3)

test_self <- data.frame(matrix(ncol = nrow(test_PAV), nrow = nrow(test_PAV)))
colnames(test_self) <- row.names(test_PAV)
row.names(test_self) <- row.names(test_PAV)

for (rowInQuestion in 1:nrow(test_PAV)) {
  for (rowBeingComparedTo in 1:nrow(test_PAV)) {
    commonGeneCount <- 0
    for (col in 1:ncol(test_PAV)) {
      if (test_PAV[rowInQuestion, col] == 'PRESENT' && test_PAV[rowBeingComparedTo, col] == 'PRESENT') {
        commonGeneCount <- commonGeneCount + 1
      }
    }
    test_self[rowInQuestion, rowBeingComparedTo] <- commonGeneCount
  }
}
I am well aware, though, that this is a very inefficient solution. For one thing, since the upper and lower triangles are symmetric, they do not need to be computed twice.
I found a similar solution here, but there the commonality is within a single row; in my case, for a gene to count as common, the same gene (column) must be "PRESENT" in both materials.
Thank you.
First, let's keep only the PRESENT genes and build a frozenset of the existing genes for each row:
s = (
    df.where(df.eq('PRESENT'))
    .stack()
    .reset_index(level=1)
    .groupby(level=0)['level_1'].agg(frozenset)
)
# 1 (gene3, gene1)
# 2 (gene2, gene1)
# 3 (gene3, gene2)
# Name: level_1, dtype: object
Now we can build the expected data frame from the lengths of the intersections of the gene sets, using itertools.product (or the faster combinations):
from itertools import product
out = (
    pd.Series({(i, j): len(a.intersection(b))
               for (i, a), (j, b) in product(zip(s.index, s), repeat=2)})
    .unstack()
)
# 1 2 3
# 1 2 1 1
# 2 1 2 1
# 3 1 1 2
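As a side note (not part of the original answer): since each count is just the dot product of two binary presence vectors, the whole table can also be obtained with a single matrix product. A minimal numpy sketch, assuming the toy data from the question:

```python
import numpy as np
import pandas as pd

# toy data matching the question's test_PAV
df = pd.DataFrame({
    'gene1': ['PRESENT', 'PRESENT', 'LOST'],
    'gene2': ['LOST', 'PRESENT', 'PRESENT'],
    'gene3': ['PRESENT', 'LOST', 'PRESENT'],
}, index=[1, 2, 3])

b = (df.to_numpy() == 'PRESENT').astype(int)  # binary presence matrix
out = pd.DataFrame(b @ b.T, index=df.index, columns=df.index)
```

This reproduces the test_self table above, diagonal included.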
combinations is faster because only one of A/B and B/A is computed, and A/A is skipped as well:
from itertools import combinations
out = (
    pd.Series({(i, j): len(a.intersection(b))
               for (i, a), (j, b) in combinations(zip(s.index, s), r=2)})
    .unstack()
)
# 2 3
# 1 1.0 1.0
# 2 NaN 1.0
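If the full symmetric frame is needed after the combinations version, the triangle can be mirrored and the diagonal filled with the size of each gene set. A sketch, assuming the toy frozensets shown earlier:

```python
import pandas as pd
from itertools import combinations

# toy per-row gene sets matching the example above
s = pd.Series({1: frozenset({'gene1', 'gene3'}),
               2: frozenset({'gene1', 'gene2'}),
               3: frozenset({'gene2', 'gene3'})})

tri = pd.Series({(i, j): len(a & b)
                 for (i, a), (j, b) in combinations(zip(s.index, s), r=2)}).unstack()

full = tri.reindex(index=s.index, columns=s.index)  # square it up
full = full.fillna(full.T)                          # mirror the triangle
for i in s.index:                                   # diagonal = size of each gene set
    full.loc[i, i] = len(s[i])
full = full.astype(int)
```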
Vectorization will improve speed on larger datasets:
test_PAV <- as.data.frame(matrix(sample(c("PRESENT", "LOST"), 3e4, replace = TRUE), ncol = 150))
fCommonVec <- function(test_PAV) {
  m <- as.matrix(test_PAV) == "PRESENT"
  n <- nrow(m) - 1L
  test_self <- matrix(nrow = nrow(test_PAV), ncol = nrow(test_PAV))
  test_self[lower.tri(test_self)] <- rowSums(m[rep.int(1:n, n:1), ] & m[sequence(n:1, 2:nrow(m)), ])
  diag(test_self) <- rowSums(m)
  return(test_self)
}
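The indexing in fCommonVec enumerates every unordered row pair exactly once and then compares all pairs in one vectorized step. The same idea, sketched in numpy on a hypothetical toy matrix (not the answer's data):

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.integers(0, 2, size=(5, 8)).astype(bool)  # toy presence/absence matrix

i, j = np.triu_indices(m.shape[0], k=1)  # each unordered row pair once
common = (m[i] & m[j]).sum(axis=1)       # common-gene count per pair

# sanity check against the full dot-product matrix
full = m.astype(int) @ m.astype(int).T
assert (common == full[i, j]).all()
```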
Comparing against the loop version:
fCommonLoop <- function(test_PAV) {
  pavMatrix_transposeBinary <- as.matrix(test_PAV) == "PRESENT"
  pavSelfSimilarity <- matrix(nrow = nrow(pavMatrix_transposeBinary), ncol = nrow(pavMatrix_transposeBinary))
  for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)) {
    for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)) {
      pavSelfSimilarity[rowInQuestion, rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion, ] * pavMatrix_transposeBinary[rowBeingComparedTo, ])
    }
  }
  return(pavSelfSimilarity)
}
test_self <- fCommonVec(test_PAV)
pavSelfSimilarity <- fCommonLoop(test_PAV)
all.equal(test_self[lower.tri(test_self, diag = TRUE)], pavSelfSimilarity[lower.tri(pavSelfSimilarity, diag = TRUE)])
#> [1] TRUE
microbenchmark::microbenchmark(fCommonVec = fCommonVec(test_PAV),
fCommonLoop = fCommonLoop(test_PAV))
#> Unit: milliseconds
#>         expr      min       lq      mean   median       uq      max neval
#>   fCommonVec  25.5148  30.6751  36.76842  32.6326  36.0579  76.0979   100
#>  fCommonLoop 215.4542 226.2288 235.84177 229.6783 235.9802 304.1668   100
Update
For really large datasets, here is a function that iterates over columns to reduce memory use:
test_PAV <- as.data.frame(matrix(sample(c("PRESENT", "LOST"), 267*113000, replace = TRUE), nrow = 267))
fCommonVec2 <- function(test_PAV) {
  m <- as.matrix(test_PAV) == "PRESENT"
  idx1 <- (nrow(m) - 1L):1
  idx2 <- rep.int(1:(nrow(m) - 1L), idx1)
  idx3 <- sequence(idx1, 2:nrow(m))
  v <- integer(length(idx2))
  for (i in 1:ncol(m)) {
    v <- v + (m[idx2, i] & m[idx3, i])
  }
  test_self <- matrix(nrow = nrow(test_PAV), ncol = nrow(test_PAV))
  test_self[lower.tri(test_self)] <- v
  diag(test_self) <- rowSums(m)
  return(test_self)
}
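The column-at-a-time accumulation in fCommonVec2 avoids materializing a full (pairs x genes) intermediate. The same trick, sketched in numpy on a hypothetical small matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
m = rng.integers(0, 2, size=(6, 10)).astype(bool)  # toy presence/absence matrix

i, j = np.triu_indices(m.shape[0], k=1)  # each unordered row pair once
common = np.zeros(len(i), dtype=int)
for col in range(m.shape[1]):            # one column at a time: only small temporaries
    common += m[i, col] & m[j, col]

# matches the all-at-once version
assert (common == (m[i] & m[j]).sum(axis=1)).all()
```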
On my machine this took about 40 seconds and used only a small fraction of my 8 GB of RAM.
system.time(test_self2 <- fCommonVec2(test_PAV))
#> user system elapsed
#> 36.52 0.24 36.76
I also came up with another solution, in case anyone wants to use it.
Instead of using the data frame with "PRESENT" and "LOST", I used sed to convert them to 1s and 0s.
sed -i '' 's/PRESENT/1/g' pav-matrix.binary.txt
sed -i '' 's/LOST/0/g' pav-matrix.binary.txt
I then used this binary matrix to multiply each pair of rows element-wise and take the sum of the resulting products.
for (rowInQuestion in 1:nrow(pavMatrix_transposeBinary)) {
  for (rowBeingComparedTo in 1:nrow(pavMatrix_transposeBinary)) {
    pavSelfSimilarity[rowInQuestion, rowBeingComparedTo] <- sum(pavMatrix_transposeBinary[rowInQuestion, ] * pavMatrix_transposeBinary[rowBeingComparedTo, ])
  }
}
This for loop with awk should work:
for i in {2..4}; do
  for j in {2..4}; do
    tmp=$(awk '{print $'$i', $'$j'}' input | sed 1d | awk '$1 == $2 {print}' | grep -c "PRESENT")
    echo -en "$tmp\t"
  done
  echo -en "\n"
done >> test_self
Output:
2 1 1
1 2 1
1 1 2
You may have to add the headers yourself; the order should be the same.
Thanks!!