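I am trying to count the number of polymorphic SNPs between pairs of lines, and I am running into problems with the computing resources needed to answer the question. Conceptually, I know this problem can (and should) be answered with parallel processing, but I am struggling to work out how to program it for parallel processing. I have not found a parallel-processing question quite like this one. Thanks in advance for your suggestions.
Basically, I am trying to compare SNPs between pairs of lines: Line 1 against Lines 2, 3, ..., 7; then Line 2 against Lines 3, 4, ..., 7; and so on. That makes n(n-1)/2 comparisons. For each SNP, if the two lines being compared match (both AA, both AB, or both BB), the lines are not polymorphic for that SNP. If either line has "NC" at a SNP, that SNP is dropped from the calculation. So, comparing Line 1 and Line 2: there is 1 matching SNP, 2 "NC" SNPs, and 2 polymorphic SNPs (2 = 5 - (1 + 2)).
I tried using foreach to make the for loop faster, but I must be doing something wrong, because the result took longer to finish.
I also tried writing the code as a function and then calling it, which improved the speed slightly.
Here is a toy data set of 7 lines and 5 SNPs; in reality, the data set is 1000 SNPs and hundreds of lines.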
Line SNP1 SNP2 SNP3 SNP4 SNP5
Line1 AA BB AA NC BB
Line2 AA AA NC NC AA
Line3 BB AB NC BB AA
Line4 NC BB AB NC BB
Line5 AA AA BB AB AA
Line6 NC NC AA AA NC
Line7 BB AA AA NC BB
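The counting rule described above can be sketched for a single pair of lines. This is only an illustration on the toy data: count_pair is a hypothetical helper name, not part of any package, and the two vectors are copied by hand from the Line1 and Line2 rows of the table.

```r
# Genotypes for Line1 and Line2, copied from the toy table above.
line1 <- c(SNP1 = "AA", SNP2 = "BB", SNP3 = "AA", SNP4 = "NC", SNP5 = "BB")
line2 <- c(SNP1 = "AA", SNP2 = "AA", SNP3 = "NC", SNP4 = "NC", SNP5 = "AA")

# count_pair() is a hypothetical helper written for this illustration.
count_pair <- function(a, b) {
  nc   <- a == "NC" | b == "NC"   # no call in either line: SNP is excluded
  same <- !nc & a == b            # both lines called and identical
  c(NC = sum(nc),
    match = sum(same),
    polymorphic = length(a) - sum(nc) - sum(same))
}

counts <- count_pair(line1, line2)
print(counts)
```

For this pair the helper returns 2 NC SNPs, 1 match, and 2 polymorphic SNPs, which agrees with the worked example in the text.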
Here is the code so far, written with the help of a colleague:
#load in the snps
snps <-read.csv("data.csv", header=T, stringsAsFactors = F)
#create all combinations first
#this is a built-in function that will spit out every combination. Just give it the line names twice.
#remove combinations with matching lines
test <- expand.grid(lineA = snps$Line, lineB = snps$Line)
test <- test[which(test$lineA!=test$lineB),]
test <- test[order(test$lineA),]
test <- test[!duplicated(t(apply(test, 1, sort))),]
#create empty columns to be populated
test["NC"] <- NA
test["match"] <- NA
test["polymorphic"] <- NA
#get the total number of snps so we can count polymorphic loci for each combo
snp_total_count <- ncol(snps)-1
for (i in 1:nrow(test))
{
  #get the lines you are going to compare
  lineA <- which(snps$Line==test$lineA[i])
  lineB <- which(snps$Line==test$lineB[i])
  #find the matches not counting NC's
  test$match[i] <- length(which(snps[lineA,]!="NC" & snps[lineA,]==snps[lineB,]))
  #find the NCs/- cases so paired NC's or single NC's. can't tell polymorphic state or not. count all together
  #1st count positions in which both lineA and lineB are NC,
  #then count positions in which only lineA is "NC" (lineA = NC and does not equal LineB) and
  #then count positions in which only lineB is "NC"(lineB = NC and does not equal LineA)
  #then add all 3 values together
  test$NC[i] <- length(which(snps[lineA,]=="NC" & snps[lineA,]==snps[lineB,])) +
    length(which(snps[lineA,]=="NC" & snps[lineA,]!=snps[lineB,])) +
    length(which(snps[lineB,]=="NC" & snps[lineA,]!=snps[lineB,]))
  #calculate # polymorphic SNPs = total - matching - NC snps
  test$polymorphic[i] <- snp_total_count - (test$NC[i]+ test$match[i])
}
To get the matching SNPs, use d[LineX, ] == d[LineY, ], and to get the NC SNPs, use d[LineX, ] == "NC" | d[LineY, ] == "NC". To run this in parallel you can use future, which provides support for foreach parallelization.
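library(doFuture)
registerDoFuture()
# plan(multiprocess) is deprecated in recent releases of future;
# multisession is the replacement that works on all platforms
plan(multisession)
N <- nrow(d)
d$Line <- NULL  # keep only the SNP columns
# the outer loop runs sequentially; the inner loop is sent to the workers
result <- foreach(i = 1:(N - 1), .combine = rbind) %do% {
  foreach(j = (i + 1):N, .combine = rbind) %dopar% {
    data.frame(
      NC = sum(d[i, ] == "NC" | d[j, ] == "NC"),
      MATCH = sum(d[i, ] == d[j, ] & d[i, ] != "NC"),
      I = i, J = j)
  }
}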
Data (d):
structure(list(Line = c("Line1", "Line2", "Line3", "Line4", "Line5",
"Line6", "Line7"), SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC",
"BB"), SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"), SNP3 = c("AA",
"NC", "NC", "AB", "BB", "AA", "AA"), SNP4 = c("NC", "NC", "BB",
"NC", "AB", "AA", "NC"), SNP5 = c("BB", "AA", "AA", "BB", "AA",
"NC", "BB")), .Names = c("Line", "SNP1", "SNP2", "SNP3", "SNP4",
"SNP5"), row.names = c(NA, -7L), class = "data.frame")
Result (result):
NC MATCH I J
1 2 1 1 2
2 2 0 1 3
3 2 2 1 4
4 1 1 1 5
5 4 1 1 6
6 1 2 1 7
7 2 1 2 3
8 3 0 2 4
9 2 3 2 5
10 5 0 2 6
...
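The result above reports NC and MATCH but not the polymorphic count the question asks for. As a loop-free alternative (a different technique from the foreach code above, not the answer's method), the same two counts can be obtained for all pairs at once with matrix cross products, and the polymorphic count derived from them. This is a minimal sketch that rebuilds the toy data; the names g, called, both_called, match_n and res are chosen for illustration only.

```r
# Toy data, same values as the structure(...) shown above.
d <- data.frame(
  Line = paste0("Line", 1:7),
  SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC", "BB"),
  SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"),
  SNP3 = c("AA", "NC", "NC", "AB", "BB", "AA", "AA"),
  SNP4 = c("NC", "NC", "BB", "NC", "AB", "AA", "NC"),
  SNP5 = c("BB", "AA", "AA", "BB", "AA", "NC", "BB"),
  stringsAsFactors = FALSE
)

g <- as.matrix(d[, -1])        # lines x SNPs character matrix
called <- (g != "NC") * 1      # 1 where a genotype was called, 0 for NC

# both_called[i, j]: number of SNPs called in both line i and line j
both_called <- tcrossprod(called)

# match_n[i, j]: number of SNPs where both lines carry the same called genotype
match_n <- matrix(0, nrow(g), nrow(g))
for (gt in setdiff(unique(as.vector(g)), "NC")) {
  ind <- (g == gt) * 1         # indicator matrix for one genotype
  match_n <- match_n + tcrossprod(ind)
}

pairs <- t(combn(nrow(g), 2))  # every index pair with i < j
res <- data.frame(
  I = pairs[, 1], J = pairs[, 2],
  NC = ncol(g) - both_called[pairs],
  MATCH = match_n[pairs]
)
res$polymorphic <- ncol(g) - res$NC - res$MATCH
print(head(res, 3))
```

On the toy data, the pair (1, 2) comes out as NC = 2, MATCH = 1 and 2 polymorphic SNPs, in line with the worked example in the question. Whether this is faster than a parallel loop on the real data set is something to measure rather than assume.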
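If possible, using data.table together with multiple cores through foreach + doMC may improve speed significantly. Below is a simple example; you will need to add the specific conditions for how NC values should be handled. Set cores in registerDoMC to the number of cores you have available.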
library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
dt <- data.table(Line=paste("Line", 1:100, sep=""),
SNP1=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP2=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP3=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
SNP4=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE)
)
Looking at head(dt):
Line SNP1 SNP2 SNP3 SNP4
1: Line1 AC BC AB AB
2: Line2 BC BB AA AC
3: Line3 AB BB AA AC
4: Line4 BC BC AC BC
5: Line5 AB AA BB AA
6: Line6 AB AB CC AC
Then loop over the pairs of rows:
snpCols <- colnames(dt)[2:length(colnames(dt))]
results <- foreach(index.1 = 1:dim(dt)[1], .combine="rbind") %dopar% {
  row1 <- dt[index.1]
  foreach(index.2 = index.1:dim(dt)[1], .combine="rbind") %do% {
    row2 <- dt[index.2]
    # do operations / return final data.table object that has values containing column values you want
    return(data.table("lineX"=row1$Line,
                      "lineY"=row2$Line,
                      "nMatches"=sum(row1[,snpCols, with=FALSE] == row2[,snpCols, with=FALSE])
                      )
    )
  }
}
This produces the object results:
lineX lineY nMatches
1: Line1 Line1 4
2: Line1 Line2 0
3: Line1 Line3 0
4: Line1 Line4 1
5: Line1 Line5 0
---
5046: Line98 Line99 0
5047: Line98 Line100 0
5048: Line99 Line99 4
5049: Line99 Line100 0
5050: Line100 Line100 4
Note that this also compares each line with itself; you can keep or drop those rows afterwards as needed.
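If the self-comparisons are not wanted, one option is to enumerate only the distinct pairs up front and loop over those instead. The sketch below uses base R's combn for that; n_lines and pairs are illustrative names, and the value 100 simply mirrors the size of the example dt.

```r
n_lines <- 100                    # same number of lines as the example dt
pairs <- t(combn(n_lines, 2))     # one row per pair, always with i < j
colnames(pairs) <- c("index.1", "index.2")

print(nrow(pairs))                # n(n-1)/2 pairs, no self-comparisons
print(head(pairs, 3))
```

This gives 4950 pairs for 100 lines instead of the 5050 rows in results, because the 100 self-comparisons are never generated. Each row of pairs can then serve as one unit of work in a loop, parallel or not.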