Simulated maximum likelihood with maxLik in R

I am trying to estimate a model by simulated maximum likelihood via the maxLik package in R. Unfortunately, performance deteriorates badly as the amount of data grows. Could anyone advise on the following:

  • Is there a way to speed up my code? It is already vectorized, so I am somewhat at a loss as to how to improve it further.
  • Is there a way to implement the optimization process via Rcpp in order to speed it up?
  • Is there a smarter way to implement simulated maximum likelihood with a custom likelihood function?

I have already tried doParallel on an AWS instance, but that does not speed the process up significantly.
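
Roughly, that parallel attempt looked like the sketch below (minimal and hypothetical: group_ll() stands in for the per-group likelihood term, and data and b refer to the reproducible example further down):

library(doParallel)

cl <- makeCluster(parallel::detectCores() - 1L)
registerDoParallel(cl)

#split the data by group and evaluate each group's likelihood contribution
#on its own worker; group_ll() is a hypothetical stand-in for the per-group
#term of the likelihood function defined below
ll_by_group <- foreach(d = split(data, data$g), .combine = c,
                       .packages = "data.table") %dopar% group_ll(b, d)

stopCluster(cl)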

I have created a reproducible example and commented the most important parts:

library(data.table)

#create data:
#binary DV (y), 10 IVs (V3 - V12), 50 groups (g), with 100 sequential observations each (id)
set.seed(123)
n <- 5000
p <- 10
x <- matrix(rnorm(n * p), n)
g <- rep(1:(n/100), each = 100)
id <- rep(1:(n/max(g)), max(g))
beta <- runif(p)
xb <- c(x %*% beta)
#named prob rather than p so that p (the number of IVs) is not overwritten
prob <- exp(xb) / (1 + exp(xb))
y <- rbinom(n, 1, prob)
#cbind auto-names the covariate columns V3-V12 (columns 1 and 2 are id and y)
data <- as.data.table(cbind(id, y, x, g))
#find starting values for maxLik via a regular glm
standard <- glm(
  y ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12,
  data = data,
  family = binomial(link = "logit")
)
summary(standard)
#set starting values for maxLik
b <- c(standard$coefficients, sd_V3 = 0.5, sd_V4 = 0.5)
#draw 50 random values per group from a standard normal distribution
draws <- 50
#one row per group and draw (unique(g), not g, so that each group gets
#exactly `draws` rows; otherwise the merge below repeats every observation
#far more often than intended)
rands <- as.data.table(cbind(
  g = rep(unique(g), each = draws),
  matrix(rnorm(length(unique(g)) * draws), length(unique(g)) * draws, 2)
))
colnames(rands) <- c("g", "SD_V3", "SD_V4")
#merge the random draws to each group, so every observation is repeated once per draw
data <- merge(data, rands, by = "g", all = TRUE, allow.cartesian = TRUE)
#the likelihood function: for V3 and V4, a mean (b[2], b[3]) and an SD (b[12], b[13]) are estimated
loglik1 <- function(b){
#the standard deviations should vary only across groups (g), while all other
#parameters vary across all observations, which is why the mean is taken by
#g and id (remember: every observation is a cartesian product with its
#group's random draws)
ll <- data[,.(gll=mean({
  #the same linear predictor (including the b[12]/b[13] draws) must be used
  #in both the p^y and the (1-p)^(1-y) term
  eta <- b[1]+
    (b[2]+b[12]*SD_V3)*V3 +
    (b[3]+b[13]*SD_V4)*V4 +
    b[4]*V5 + b[5]*V6 + b[6]*V7 + b[7]*V8 +
    b[8]*V9 + b[9]*V10 + b[10]*V11 + b[11]*V12
  p <- 1/(1+exp(-eta))
  (p^y)*((1-p)^(1-y))
})),by=.(g,id)]
#one log-likelihood value per (g,id) cell; maxLik accepts a vector of contributions
return(log(ll[,gll]))
}
co <- maxLik::maxControl(gradtol = 1e-04, printLevel = 2)
maxlik <- maxLik::maxLik(loglik1, start = b, method = "BFGS", control = co)
summary(maxlik)
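
Since loglik1 returns one log-likelihood value per (g, id) cell rather than a single sum, maxLik's BHHH method is also an option: it approximates the Hessian from the outer product of the per-observation gradients instead of taking numerical second derivatives. A sketch under the same setup as above (not benchmarked):

#BHHH requires observation-level log-likelihood values, which loglik1 provides
maxlik_bhhh <- maxLik::maxLik(loglik1, start = b, method = "BHHH", control = co)
summary(maxlik_bhhh)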

Many thanks in advance for any advice!

I was able to reduce the optimization time significantly (from hours to minutes) by changing the body of loglik1 <- function(b){...} to:

return(data[,.(g,id,y,logit=1/(1+exp(-(b[1]+
  (b[2]+b[12]*SD_V3)*V3 +
  (b[3]+b[13]*SD_V4)*V4 +
  b[4]*V5 + b[5]*V6 + b[6]*V7 + b[7]*V8 +
  b[8]*V9 + b[9]*V10 + b[10]*V11 + b[11]*V12))))
][,mean(y*log(logit)+(1-y)*log(1-logit)),by=.(g,id)
][,sum(V1)])

However, this only partially solves the problem, because estimation time climbs again as the data size grows. :(

I may just have to live with it, unless someone has an elegant solution?

EDIT: Some insight after a while, in case anybody faces this problem in the future... The reason the script takes so long lies in the maxLik package and the time it spends computing the numerical Hessian matrix. If you do not need it, you can tell maxLik not to compute it. Since I did need it, I decided to compute it via Rcpp instead.
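
For illustration, switching the final Hessian off looks roughly like this (a sketch; numDeriv::hessian() stands in here for the custom Rcpp routine I actually wrote):

#skip the expensive numerical Hessian at the optimum; the estimates are
#unchanged, but summary() can then no longer report standard errors
maxlik_fast <- maxLik::maxLik(loglik1, start = b, method = "BFGS",
                              finalHessian = FALSE, control = co)

#if the Hessian is needed after all, compute it once at the optimum,
#e.g. numerically via numDeriv (or, as I did, via Rcpp)
H <- numDeriv::hessian(function(b) sum(loglik1(b)), coef(maxlik_fast))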
