r-在HMM中实现Viterbi算法,改变基因组学标记的发射矩阵



我想寻求帮助,以实现基于SNP基因型数据分配祖先的隐马尔可夫方法。假设我有一个转换矩阵生成如下:

states <- c("A1","A2","A3","A4","A5","A6","A7","A8") # Define the names of the states
A1 <- c(0.9,0.1,0.1,0.1,0.1,0.1,0.1,0.1) # Set the probabilities of switching states, where the previous state was "A1"
A2 <- c(0.1,0.9,0.1,0.1,0.1,0.1,0.1,0.1) # Set the probabilities of switching states, where the previous state was "A2"
A3 <- c(0.1,0.1,0.9,0.1,0.1,0.1,0.1,0.1) # Set the probabilities of switching states,  where the previous state was "A3"
A4 <- c(0.1,0.1,0.1,0.9,0.1,0.1,0.1,0.1) # Set the probabilities of switching states, where the previous state was "A4"
A5 <- c(0.1,0.1,0.1,0.1,0.9,0.1,0.1,0.1) # Set the probabilities of switching states, where the previous state was "A5"
A6 <- c(0.1,0.1,0.1,0.1,0.1,0.9,0.1,0.1) # Set the probabilities of switching states, where the previous state was "A6"
A7 <- c(0.1,0.1,0.1,0.1,0.1,0.1,0.9,0.1) # Set the probabilities of switching states, where the previous state was "A7"
A8 <- c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.9) # Set the probabilities of switching states, where the previous state was "A8"
thetransitionmatrix <- matrix(c(A1,A2,A3,A4,A5,A6,A7,A8), 8, 8, byrow = TRUE) # Create an 8 x 8 matrix
rownames(thetransitionmatrix) <- states
colnames(thetransitionmatrix) <- states
thetransitionmatrix # Print out the transition matrix
    A1  A2  A3  A4  A5  A6  A7  A8
A1 0.9 0.1 0.1 0.1 0.1 0.1 0.1 0.1
A2 0.1 0.9 0.1 0.1 0.1 0.1 0.1 0.1
A3 0.1 0.1 0.9 0.1 0.1 0.1 0.1 0.1
A4 0.1 0.1 0.1 0.9 0.1 0.1 0.1 0.1
A5 0.1 0.1 0.1 0.1 0.9 0.1 0.1 0.1
A6 0.1 0.1 0.1 0.1 0.1 0.9 0.1 0.1
A7 0.1 0.1 0.1 0.1 0.1 0.1 0.9 0.1
A8 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.9

发射矩阵是n个8x4矩阵的列表,其中n等于数据中SNPs/行的数量。例如,给定3个SNPs/行的8个样本(A1-A8)的以下数据:

A1 A2 A3 A4 A5 A6 A7 A8
T T T T T T T C 
T C T T T T T C
A A A G G A A A

列表中的矩阵1将是

   A C G T
A1 0 0 0 1/7
A2 0 0 0 1/7 
A3 0 0 0 1/7
A4 0 0 0 1/7
A5 0 0 0 1/7
A6 0 0 0 1/7
A7 0 0 0 1/7
A8 0 1 0 0

由于第1行中有7个样本具有T,因此每个样本具有1/7的概率。由于只有A8拥有C,因此将C分配给A8的概率为100%。对于第3行,输出应为

   A C G T
A1 1/6 0 0 0
A2 1/6 0 0 0 
A3 1/6 0 0 0
A4 1/2 0 0 0
A5 1/2 0 0 0
A6 1/6 0 0 0
A7 1/6 0 0 0
A8 1/6 0 0 0

使用上述转移矩阵和发射矩阵列表,我希望在任何等位基因序列上实现Viterbi算法。我目前拥有的代码无法为每行使用不同的发射矩阵

viterbi <- function(sequence, transitionmatrix, emissionmatrix)
  # This carries out the Viterbi algorithm.
  # Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen,     page 209
  # ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
  {
 # Get the names of the states in the HMM:
 states <- rownames(theemissionmatrix)
 # Make the Viterbi matrix v:
 v <- makeViterbimat(sequence, transitionmatrix, emissionmatrix)
 # Go through each of the rows of the matrix v (where each row represents
 # a position in the DNA sequence), and find out which column has the
 # maximum value for that row (where each column represents one state of
 # the HMM):
 mostprobablestatepath <- apply(v, 1, function(x) which.max(x))
 # Print out the most probable state path:
 prevnucleotide <- sequence[1]
 prevmostprobablestate <- mostprobablestatepath[1]
 prevmostprobablestatename <- states[prevmostprobablestate]
 startpos <- 1
 for (i in 2:length(sequence))
 {
    nucleotide <- sequence[i]
    mostprobablestate <- mostprobablestatepath[i]
    mostprobablestatename <- states[mostprobablestate]
    if (mostprobablestatename != prevmostprobablestatename)
    {
       print(paste("Positions",startpos,"-",(i-1), "Most probable state = ", prevmostprobablestatename))
       startpos <- i
    }
    prevnucleotide <- nucleotide
    prevmostprobablestatename <- mostprobablestatename
 }
 print(paste("Positions",startpos,"-",i, "Most probable state = ", prevmostprobablestatename))
   }

# the viterbi() function requires a second function makeViterbimat():
makeViterbimat <- function(sequence, transitionmatrix, emissionmatrix)
  # This makes the matrix v using the Viterbi algorithm.
  # Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
  # ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
  {
 # Change the sequence to uppercase
 sequence <- toupper(sequence)
 # Find out how many states are in the HMM
 numstates <- dim(transitionmatrix)[1]
 # Make a matrix with as many rows as positions in the sequence, and as many
 # columns as states in the HMM
 v <- matrix(NA, nrow = length(sequence), ncol = dim(transitionmatrix)[1])
 # Set the values in the first row of matrix v (representing the first position of the sequence) to 0
 v[1, ] <- 0
 # Set the value in the first row of matrix v, first column to 1
 v[1,1] <- 1
 # Fill in the matrix v:
 for (i in 2:length(sequence)) # For each position in the DNA sequence:
 {
    for (l in 1:numstates) # For each of the states of in the HMM:
    {
       # Find the probabilility, if we are in state l, of choosing the nucleotide at position in the sequence
       statelprobnucleotidei <- emissionmatrix[l,sequence[i]]
       # v[(i-1),] gives the values of v for the (i-1)th row of v, ie. the (i-1)th position in the sequence.
       # In v[(i-1),] there are values of v at the (i-1)th row of the sequence for each possible state k.
       # v[(i-1),k] gives the value of v at the (i-1)th row of the sequence for a particular state k.
       # transitionmatrix[l,] gives the values in the lth row of the transition matrix, xx should not be transitionmatrix[,l]?
       # probabilities of changing from a previous state k to a current state l.
       # max(v[(i-1),] * transitionmatrix[l,]) is the maximum probability for the nucleotide observed
       # at the previous position in the sequence in state k, followed by a transition from previous
       # state k to current state l at the current nucleotide position.
       # Set the value in matrix v for row i (nucleotide position i), column l (state l) to be:
       v[i,l] <-  statelprobnucleotidei * max(v[(i-1),] * transitionmatrix[,l])
    }
}
return(v)
  }

是什么阻止您简单地给函数一个预先计算的发射矩阵列表,而不是一个矩阵?

makeViterbimat <- function(sequence, transitionmatrix, emissionmatrixList)
  # This makes the matrix v using the Viterbi algorithm.
  # Adapted from "Applied Statistics for Bioinformatics using R" by Wim P. Krijnen, page 209
  # ( cran.r-project.org/doc/contrib/Krijnen-IntroBioInfStatistics.pdf )
  {
 # Change the sequence to uppercase
 sequence <- toupper(sequence)
 # Find out how many states are in the HMM
 numstates <- dim(transitionmatrix)[1]
 # Make a matrix with as many rows as positions in the sequence, and as many
 # columns as states in the HMM
 v <- matrix(NA, nrow = length(sequence), ncol = dim(transitionmatrix)[1])
 # Set the values in the first row of matrix v (representing the first position of the sequence) to 0
 v[1, ] <- 0
 # Set the value in the first row of matrix v, first column to 1
 v[1,1] <- 1
 # Fill in the matrix v:
 for (i in 2:length(sequence)) # For each position in the DNA sequence:
 {
    emissionmatrix = emissionmatrixList[[i]]
    for (l in 1:numstates) # For each of the states of in the HMM:
    {
       # Find the probabilility, if we are in state l, of choosing the nucleotide at position in the sequence
       statelprobnucleotidei <- emissionmatrix[l,sequence[i]]
       # v[(i-1),] gives the values of v for the (i-1)th row of v, ie. the (i-1)th position in the sequence.
       # In v[(i-1),] there are values of v at the (i-1)th row of the sequence for each possible state k.
       # v[(i-1),k] gives the value of v at the (i-1)th row of the sequence for a particular state k.
       # transitionmatrix[l,] gives the values in the lth row of the transition matrix, xx should not be transitionmatrix[,l]?
       # probabilities of changing from a previous state k to a current state l.
       # max(v[(i-1),] * transitionmatrix[l,]) is the maximum probability for the nucleotide observed
       # at the previous position in the sequence in state k, followed by a transition from previous
       # state k to current state l at the current nucleotide position.
       # Set the value in matrix v for row i (nucleotide position i), column l (state l) to be:
       v[i,l] <-  statelprobnucleotidei * max(v[(i-1),] * transitionmatrix[,l])
    }
}
return(v)
  }

或者你的问题是如何构建这个发射矩阵列表?

最新更新