Parallel processing for counting polymorphic SNPs in pairwise line comparisons



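I am trying to count the number of polymorphic SNPs between pairs of lines, and I am running into problems with the computational resources needed to answer the question. Conceptually, I know this problem can (and should) be solved with parallel processing, but I am struggling to work out how to code it for parallel processing. I have not found another parallel-processing question like this one. Thanks in advance for your suggestions.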

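Basically, I am trying to compare SNPs between pairs of lines: Line 1 to Lines 2, 3...7; then Line 2 to Lines 3, 4...7, and so on. That gives n(n-1)/2 comparisons. For each SNP, if the two lines being compared match (both AA, both AB, or both BB), the lines are not polymorphic for that SNP. If either line has "NC" at a SNP, that SNP is dropped from the count. So, comparing Line 1 and Line 2: there is 1 matching SNP, 2 "NC" SNPs, and 2 polymorphic SNPs (2 = 5 - (1 + 2)).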
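To make the rule concrete, here is a minimal sketch in base R that classifies the five SNPs for Line1 versus Line2, using the genotypes from the toy dataset in this question. The variable names (`is_nc`, `is_match`, `is_poly`, `counts`) are illustrative only and are not part of the original code.

```r
# Genotype calls for Line1 and Line2, copied from the toy dataset
line1 <- c(SNP1 = "AA", SNP2 = "BB", SNP3 = "AA", SNP4 = "NC", SNP5 = "BB")
line2 <- c(SNP1 = "AA", SNP2 = "AA", SNP3 = "NC", SNP4 = "NC", SNP5 = "AA")

# A SNP is dropped from the comparison if either line has no call
is_nc <- line1 == "NC" | line2 == "NC"
# A SNP matches if both lines are called and carry the same genotype
is_match <- !is_nc & line1 == line2
# Every remaining SNP is polymorphic between the two lines
is_poly <- !is_nc & line1 != line2

counts <- c(match = sum(is_match), NC = sum(is_nc), polymorphic = sum(is_poly))
print(counts)

# Number of unordered pairs among n lines: n(n-1)/2
n_lines <- 7
n_pairs <- choose(n_lines, 2)
print(n_pairs)
```

Because the three categories are mutually exclusive and cover every SNP, the polymorphic count can equally be obtained as the total minus the other two, which is the form used in the question.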

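I tried using foreach to make the for loop faster, but I must have done something wrong, because the result took longer to finish.

I also tried writing the code as a function and then calling the function, which improved the speed slightly.

Here is a toy dataset of 7 lines and 5 SNPs; the real dataset has 1000 SNPs and hundreds of lines.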

Line    SNP1    SNP2    SNP3    SNP4    SNP5
Line1   AA  BB  AA  NC  BB
Line2   AA  AA  NC  NC  AA
Line3   BB  AB  NC  BB  AA
Line4   NC  BB  AB  NC  BB
Line5   AA  AA  BB  AB  AA
Line6   NC  NC  AA  AA  NC
Line7   BB  AA  AA  NC  BB

Code so far, written with help from a colleague:

# load the SNPs
snps <- read.csv("data.csv", header = TRUE, stringsAsFactors = FALSE)
# create all combinations first:
# expand.grid() is a built-in function that returns every combination; just give it the line names twice
test <- expand.grid(lineA = snps$Line, lineB = snps$Line)
# remove combinations where a line is paired with itself
test <- test[which(test$lineA != test$lineB), ]
test <- test[order(test$lineA), ]
# drop mirrored duplicates (A-B vs. B-A) so that each pair appears once
test <- test[!duplicated(t(apply(test, 1, sort))), ]
#create empty columns to be populated
test["NC"]          <- NA
test["match"]       <- NA
test["polymorphic"] <- NA
#get the total number of snps so we can count polymorphic loci for each combo
snp_total_count <- ncol(snps)-1
for (i in 1:nrow(test)) {
    # get the lines you are going to compare
    lineA <- which(snps$Line == test$lineA[i])
    lineB <- which(snps$Line == test$lineB[i])
    # find the matches, not counting NCs
    test$match[i] <- length(which(snps[lineA, ] != "NC" & snps[lineA, ] == snps[lineB, ]))
    # find the NC cases (paired or single NCs): polymorphic state cannot be determined, so count them all together
    # 1st: positions in which both lineA and lineB are NC
    # 2nd: positions in which only lineA is "NC" (lineA = NC and does not equal lineB)
    # 3rd: positions in which only lineB is "NC" (lineB = NC and does not equal lineA)
    # then add all 3 values together
    test$NC[i] <- length(which(snps[lineA, ] == "NC" & snps[lineA, ] == snps[lineB, ])) +
        length(which(snps[lineA, ] == "NC" & snps[lineA, ] != snps[lineB, ])) +
        length(which(snps[lineB, ] == "NC" & snps[lineA, ] != snps[lineB, ]))
    # number of polymorphic SNPs = total - matching - NC SNPs
    test$polymorphic[i] <- snp_total_count - (test$NC[i] + test$match[i])
}
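As an aside, the `expand.grid()` step builds all n x n combinations and then throws more than half of them away. One possible shortcut, not part of the original code, is base R's `combn()`, which generates each unordered pair exactly once. The sketch below uses hypothetical line names built with `paste0()` in place of `snps$Line`; the `pairs` data frame is an assumed stand-in for `test`.

```r
# Hypothetical line names standing in for snps$Line
line_names <- paste0("Line", 1:7)

# combn() returns a 2 x choose(n, 2) matrix: one column per unordered pair
pair_matrix <- combn(line_names, 2)

# Reshape into the same lineA/lineB layout used by the loop above
pairs <- data.frame(
    lineA = pair_matrix[1, ],
    lineB = pair_matrix[2, ],
    stringsAsFactors = FALSE
)
print(head(pairs))
```

This avoids the row-wise `apply(..., sort)` de-duplication, which can itself become slow once there are hundreds of lines.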

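To get the matching SNPs, use d[LineX, ] == d[LineY, ] (together with a check that the call is not "NC", as in the code below). To get the NC SNPs, use d[LineX, ] == "NC" | d[LineY, ] == "NC". To run this in parallel you can use the future package, which (through doFuture) provides a backend for foreach parallelization.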

library(doFuture)
registerDoFuture()
# multiprocess is deprecated in recent releases of future; multisession is the portable replacement
plan(multisession)
N <- nrow(d)
d$Line <- NULL
result <- foreach(i = 1:(N - 1), .combine = rbind) %do% {
    foreach(j = (i + 1):N, .combine = rbind) %dopar% {
        data.frame(
            NC = sum(d[i, ] == "NC" | d[j, ] == "NC"),
            MATCH = sum(d[i, ] == d[j, ] & d[i, ] != "NC"),
            I = i, J = j)
    }
}

Data (d):

structure(list(Line = c("Line1", "Line2", "Line3", "Line4", "Line5", 
"Line6", "Line7"), SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC", 
"BB"), SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"), SNP3 = c("AA", 
"NC", "NC", "AB", "BB", "AA", "AA"), SNP4 = c("NC", "NC", "BB", 
"NC", "AB", "AA", "NC"), SNP5 = c("BB", "AA", "AA", "BB", "AA", 
"NC", "BB")), .Names = c("Line", "SNP1", "SNP2", "SNP3", "SNP4", 
"SNP5"), row.names = c(NA, -7L), class = "data.frame")

Result (result):

   NC MATCH I J
1   2     1 1 2
2   2     0 1 3
3   2     2 1 4
4   1     1 1 5
5   4     1 1 6
6   1     2 1 7
7   2     1 2 3
8   3     0 2 4
9   2     3 2 5
10  5     0 2 6
...
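Before reaching for more cores, it may be worth checking whether the loop over pairs is needed at all. The following is a sketch of a different technique from the foreach approach above: it computes the counts for every pair at once with matrix cross-products in base R, and also derives the polymorphic count the question asks for. It assumes the same toy data; the names `geno`, `called`, `both_called`, `match_count`, and `summary_tbl` are made up for illustration.

```r
# Toy data from the question (same values as d above)
d <- data.frame(
    Line = paste0("Line", 1:7),
    SNP1 = c("AA", "AA", "BB", "NC", "AA", "NC", "BB"),
    SNP2 = c("BB", "AA", "AB", "BB", "AA", "NC", "AA"),
    SNP3 = c("AA", "NC", "NC", "AB", "BB", "AA", "AA"),
    SNP4 = c("NC", "NC", "BB", "NC", "AB", "AA", "NC"),
    SNP5 = c("BB", "AA", "AA", "BB", "AA", "NC", "BB"),
    stringsAsFactors = FALSE
)
geno <- as.matrix(d[, -1])          # lines x SNPs character matrix
n_snps <- ncol(geno)

# 0/1 indicator of "has a call"; entry [i, j] of the cross-product is the
# number of SNPs called in both line i and line j
called <- (geno != "NC") * 1
both_called <- tcrossprod(called)

# Sum one cross-product per genotype: entry [i, j] counts SNPs where
# lines i and j carry the same (non-NC) genotype
match_count <- matrix(0, nrow(geno), nrow(geno))
for (g in setdiff(unique(as.vector(geno)), "NC")) {
    ind <- (geno == g) * 1
    match_count <- match_count + tcrossprod(ind)
}

# Keep each unordered pair once (upper triangle), ordered by I then J
idx <- which(upper.tri(match_count), arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), ]
summary_tbl <- data.frame(
    I = unname(idx[, "row"]),
    J = unname(idx[, "col"]),
    NC = n_snps - both_called[idx],
    MATCH = match_count[idx],
    POLYMORPHIC = both_called[idx] - match_count[idx],
    row.names = NULL
)
print(head(summary_tbl, 10))
```

The cost is one lines x lines matrix product per distinct genotype, so with three genotypes and a few hundred lines this should stay small. Whether it beats the parallel loop on the real data is something to measure rather than assume.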

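If possible, using data.table together with multicore foreach + doMC may improve speed significantly. Below is a simple example; you will need to add the specific conditions for handling NC values. Set cores in registerDoMC to the number of cores you have available.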
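If the number of cores is not known in advance, one way to pick a value is `detectCores()` from the base `parallel` package. This is only a sketch: leaving one core free is a common convention rather than a requirement, and `n_workers` is a name made up here. Note also that doMC relies on process forking, so it is not available on Windows.

```r
library(parallel)

# detectCores() can return NA on platforms where the count is unknown
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leave one core for the rest of the system, but never go below one worker
n_workers <- max(1L, n_cores - 1L)
print(n_workers)

# The value would then be passed on as: registerDoMC(cores = n_workers)
```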

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)
dt <- data.table(Line=paste("Line", 1:100, sep=""),
                 SNP1=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                 SNP2=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                 SNP3=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE),
                 SNP4=sample(c("AA", "AB", "AC", "BB", "BC", "CC"), size=100, replace=TRUE)
)

head(dt)

    Line SNP1 SNP2 SNP3 SNP4
1: Line1   AC   BC   AB   AB
2: Line2   BC   BB   AA   AC
3: Line3   AB   BB   AA   AC
4: Line4   BC   BC   AC   BC
5: Line5   AB   AA   BB   AA
6: Line6   AB   AB   CC   AC

And the loop...

snpCols <- colnames(dt)[2:length(colnames(dt))]
results <- foreach(index.1 = 1:dim(dt)[1], .combine="rbind") %dopar% {
    row1 <- dt[index.1]
    foreach(index.2 = index.1:dim(dt)[1], .combine="rbind") %do% {
        row2 <- dt[index.2]
        # do operations / return a final data.table object containing the column values you want
        return(data.table("lineX"=row1$Line,
                          "lineY"=row2$Line,
                          "nMatches"=sum(row1[,snpCols, with=FALSE] == row2[,snpCols, with=FALSE])
                          )
        )
    }
}

This produces the object results:

        lineX   lineY nMatches
   1:   Line1   Line1        4
   2:   Line1   Line2        0
   3:   Line1   Line3        0
   4:   Line1   Line4        1
   5:   Line1   Line5        0
  ---
5046:  Line98  Line99        0
5047:  Line98 Line100        0
5048:  Line99  Line99        4
5049:  Line99 Line100        0
5050: Line100 Line100        4

Note that this also compares each line with itself; from here you can keep or remove those rows as needed.
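Dropping the self-comparisons afterwards can be a one-line filter. The sketch below uses a hypothetical three-row miniature of `results`, built as a plain data frame so that it runs without data.table; `results_demo` and `pairs_only` are illustrative names.

```r
# Hypothetical miniature of the results table, for illustration only
results_demo <- data.frame(
    lineX = c("Line1", "Line1", "Line2"),
    lineY = c("Line1", "Line2", "Line2"),
    nMatches = c(4L, 0L, 4L),
    stringsAsFactors = FALSE
)

# Keep only rows that compare two different lines
pairs_only <- results_demo[results_demo$lineX != results_demo$lineY, ]
print(pairs_only)
```

On the actual data.table object the equivalent filter would be `results[lineX != lineY]`. Starting the inner loop at `index.1 + 1` instead would also skip the self-comparisons, but the last row then needs a guard, because `(n + 1):n` counts downward in R rather than producing an empty sequence.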
