"smoothing"时间数据 - 可以更高效地完成吗？ - "smoothing" time data

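I have a data frame containing an ID, a start date and an end date. My data is ordered by ID, start, end (in that sequence).

Now I want all rows with the same ID that have overlapping time spans (or whose start date is exactly the day after the end date of another row) to be merged together.

Merging them means that they end up in a single row with the same ID, the min(start date) and the max(end date) (I hope you understand what I mean).

I have written a function for that (it is not fully tested, but it looks fine for the moment). The problem is that, as my data frame has nearly 100,000 observations, the function is very slow.

Can you help me improve my function in terms of efficiency?

Here is the function: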

smoothingEpisodes <- function (theData) {
    theOutput <- data.frame()
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    theOutput <- rbind(theOutput, data.frame("ID" = curId, "START" = curStart, "END" = curEnd))
    theOutput
}

Thanks!

[EDIT]

Test data:

    ID      START        END
1    1 2000-01-01 2000-03-31
2    1 2000-04-01 2000-05-31
3    1 2000-04-15 2000-07-31
4    1 2000-09-01 2000-10-31
5    2 2000-01-15 2000-03-31
6    2 2000-02-01 2000-03-15
7    2 2000-04-01 2000-04-15
8    3 2000-06-01 2000-06-15
9    3 2000-07-01 2000-07-15

(START and END are of class "Date", ID is numeric)

A dput of the data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))

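The first optimisation I would suggest [without thinking too hard about what you are trying to do] is to allocate storage for theOutput. At the moment, you are growing theOutput at each iteration of the loop. In R that is an absolute no-no! It is something you never do, unless you like woefully slow code. R has to copy the object and expand it during each iteration, and that is slow.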

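Looking at the code, we know that theOutput needs at most nrow(theData) rows (the case where no rows get merged) and 3 columns; any unused rows can be dropped afterwards. So create that before the loop starts:

theOutput <- data.frame(matrix(ncol = 3, nrow = nrow(theData)))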

Then fill in this object during the loop, for example:

theOutput[i, ] <- data.frame("ID" = curId, "START" = curStart, "END" = curEnd)

It isn't clear what START and END are. If they are numeric, then working with a matrix rather than a data frame could also improve speed.
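As a rough illustration of the matrix idea, Date columns can be stored as plain numbers (days since 1970-01-01) and converted back at the end. The small data frame d and the matrix m below are made up for this sketch and are not part of the original post.

```r
## Hypothetical three-row example; `d` and `m` are illustrative names only.
d <- data.frame(ID    = c(1L, 1L, 2L),
                START = as.Date(c("2000-01-01", "2000-04-01", "2000-01-15")),
                END   = as.Date(c("2000-03-31", "2000-05-31", "2000-03-31")))

## A matrix holds a single type, so the dates become day counts.
m <- cbind(ID    = d$ID,
           START = as.numeric(d$START),
           END   = as.numeric(d$END))

m[2, "START"]                                   # a plain number of days
as.Date(m[2, "START"], origin = "1970-01-01")   # converted back to a Date
```

Element access on a numeric matrix avoids the data frame method dispatch, which is where a loop like the one in the question tends to lose time.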

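Also, creating a data frame at each iteration is going to be slow. I can't time this without spending a lot of my own time, but you could fill in the bits you want directly, without incurring a data.frame() call during each iteration: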

theOutput[i, "ID"] <- curId
theOutput[i, "START"] <- curStart
theOutput[i, "END"] <- curEnd
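To make the difference concrete, here is a minimal sketch that contrasts growing a data frame with rbind() against filling preallocated vectors. The function names grow_df and prealloc_df and the toy columns are assumptions made for illustration; timings will vary by machine, so none are quoted.

```r
## Hypothetical helpers, not from the original post.
n <- 500L

grow_df <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) {
        ## rbind() copies everything accumulated so far on each iteration
        out <- rbind(out, data.frame(ID = i, VALUE = i * 2))
    }
    out
}

prealloc_df <- function(n) {
    id    <- integer(n)    # storage allocated once, up front
    value <- numeric(n)
    for (i in seq_len(n)) {
        id[i]    <- i
        value[i] <- i * 2
    }
    data.frame(ID = id, VALUE = value)   # build the data frame once
}

system.time(a <- grow_df(n))
system.time(b <- prealloc_df(n))
all.equal(a, b)   # both approaches should give the same result
```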

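However, the best tip I can give you is to profile your code. See where the bottlenecks are and speed those up. Run your function on a smaller subset of the data, one large enough to give you a bit of run time for gathering useful profiling data, but small enough that you don't have to wait ages for the profiling run to complete. To profile in R, use Rprof():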

Rprof(filename = "my_fun_profile.Rprof")
## run your function call here on a subset of the data
Rprof(NULL)

You can look at the output using

summaryRprof("my_fun_profile.Rprof")

Hadley Wickham (@Hadley) has a package that makes this a bit easier. It is called profr. And as Dirk reminds me in the comments, there is also Luke Tierney's proftools package.
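The Rprof() workflow above can be sketched end to end as follows. The workload slow_fun is a made-up stand-in for the real function call, and the profile is written to a temporary file rather than a named one; both are assumptions for this example.

```r
## Hypothetical workload; substitute the real function call on a data subset.
slow_fun <- function(n) {
    out <- data.frame()
    for (i in seq_len(n)) out <- rbind(out, data.frame(x = i))
    out
}

prof_file <- tempfile(fileext = ".Rprof")
Rprof(filename = prof_file)   # start sampling the call stack
res <- slow_fun(2000L)
Rprof(NULL)                   # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)    # time spent in each function's own code
head(prof$by.total)   # time including the functions it calls
```

The by.self table is usually the first place to look: functions at the top of it are where the run time actually goes.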

Edit: As the OP provided some test data, I knocked up something quick to show the speed-up achieved just by following good loop practice:

smoothingEpisodes2 <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    nr <- nrow(theData)
    out1 <- integer(length = nr)
    out2 <- out3 <- numeric(length = nr)
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out1[i-1] <- curId
            out2[i-1] <- curStart
            out3[i-1] <- curEnd
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out1[i] <- curId
    out2[i] <- curStart
    out3[i] <- curEnd
    theOutput <- data.frame(ID = out1,
                            START = as.Date(out2, origin = "1970-01-01"),
                            END = as.Date(out3, origin = "1970-01-01"))
    ## drop empty rows; logical indexing also works when there are none to drop
    theOutput <- theOutput[theOutput$ID != 0, ]
    theOutput
}

Using the test data set provided, stored in the object testData, I get:

> res1 <- smoothingEpisodes(testData)
> system.time(replicate(100, smoothingEpisodes(testData)))
   user  system elapsed 
  1.091   0.000   1.131 
> res2 <- smoothingEpisodes2(testData)
> system.time(replicate(100, smoothingEpisodes2(testData)))
   user  system elapsed 
  0.506   0.004   0.517

That is roughly a 50% reduction in run time. Not dramatic, but simple to achieve just by not growing an object at each iteration.

I did it slightly differently, to avoid having to delete the empty rows at the end:

smoothingEpisodes <- function (theData) {
    curId <- theData[1, "ID"]
    curStart <- theData[1, "START"]
    curEnd <- theData[1, "END"]
    theLength <- nrow(theData)
    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)
    j <- 1
    for(i in 2:nrow(theData)) {
        nextId <- theData[i, "ID"]
        nextStart <- theData[i, "START"]
        nextEnd <- theData[i, "END"]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[j] <- curId
            out.2[j] <- curStart
            out.3[j] <- curEnd
            j <- j + 1
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out.1[j] <- curId
    out.2[j] <- curStart
    out.3[j] <- curEnd
    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
    theOutput
}

A big improvement over my original version!

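Marcel, I thought I'd try to improve your code a little. The version below is about 30x faster (from 3 seconds to 0.1 seconds). The trick is to first extract the three columns into integer and double vectors.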

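As a side note, I try to use [[ where applicable, and try to keep integers as integers by writing j <- j + 1L etc. That makes no difference here, but sometimes coercing between integers and doubles can take quite some time.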
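A small sketch of the integer point: the L suffix keeps a literal as an integer, while adding a plain 1 promotes the result to double. The variable names are illustrative only.

```r
j <- 1L
typeof(j + 1L)   # "integer": both operands are integers
typeof(j + 1)    # "double": the literal 1 is a double, so the sum is coerced

x <- c(10L, 20L, 30L)
x[[2L]]          # [[ extracts exactly one element: 20
```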

smoothingEpisodes3 <- function (theData) {
    theLength <- nrow(theData)
    if (theLength < 2L) return(theData)
    id <- as.integer(theData[["ID"]])
    start <- as.numeric(theData[["START"]])
    end <- as.numeric(theData[["END"]])
    curId <- id[[1L]]
    curStart <- start[[1L]]
    curEnd <- end[[1L]]
    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)
    j <- 1L
    for(i in 2:nrow(theData)) {
        nextId <- id[[i]]
        nextStart <- start[[i]]
        nextEnd <- end[[i]]
        if (curId != nextId | (curEnd + 1) < nextStart) {
            out.1[[j]] <- curId
            out.2[[j]] <- curStart
            out.3[[j]] <- curEnd
            j <- j + 1L
            curId <- nextId
            curStart <- nextStart
            curEnd <- nextEnd
        } else {
            curEnd <- max(curEnd, nextEnd, na.rm = TRUE)
        }
    }
    out.1[[j]] <- curId
    out.2[[j]] <- curStart
    out.3[[j]] <- curEnd
    theOutput <- data.frame(ID = out.1[1:j], START = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))
    theOutput
}

The following code then shows the speed difference. I just took your data and replicated it 1000 times:

x <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), START = structure(c(10957, 
11048, 11062, 11201, 10971, 10988, 11048, 11109, 11139), class = "Date"), 
    END = structure(c(11047, 11108, 11169, 11261, 11047, 11031, 
    11062, 11123, 11153), class = "Date")), .Names = c("ID", 
"START", "END"), class = "data.frame", row.names = c(NA, 9L))
r <- 1000
y <- data.frame(ID=rep(x$ID, r) + rep(1:r, each=nrow(x))-1, START=rep(x$START, r), END=rep(x$END, r))
system.time( a1 <- smoothingEpisodes(y) )   # 2.95 seconds
system.time( a2 <- smoothingEpisodes3(y) )  # 0.10 seconds
all.equal( a1, a2 )
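For comparison, the loop can probably be avoided altogether. The following is only a sketch of a vectorised variant, not code from this thread: the function name smoothingEpisodesVec is made up, and it assumes the rows are sorted by ID, START, END, that there are no NA values, and that END is never earlier than START.

```r
## Hypothetical vectorised sketch; assumes sorted input, no NAs, END >= START.
smoothingEpisodesVec <- function(theData) {
    n <- nrow(theData)
    if (n < 2L) return(theData)
    id    <- theData[["ID"]]
    start <- as.numeric(theData[["START"]])
    end   <- as.numeric(theData[["END"]])
    ## label consecutive runs of the same ID
    idRun <- cumsum(c(TRUE, id[-1L] != id[-n]))
    ## running maximum of END within each run
    runEnd <- ave(end, idRun, FUN = cummax)
    ## a row opens a new episode if the ID changes or the gap exceeds one day
    isNew <- c(TRUE, id[-1L] != id[-n] | start[-1L] > runEnd[-n] + 1)
    episode <- cumsum(isNew)
    data.frame(ID    = id[isNew],
               START = as.Date(start[isNew], origin = "1970-01-01"),
               END   = as.Date(as.vector(tapply(end, episode, max)),
                               origin = "1970-01-01"))
}

## The nine-row test data from the question.
testData <- data.frame(
    ID    = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
    START = as.Date(c("2000-01-01", "2000-04-01", "2000-04-15", "2000-09-01",
                      "2000-01-15", "2000-02-01", "2000-04-01",
                      "2000-06-01", "2000-07-01")),
    END   = as.Date(c("2000-03-31", "2000-05-31", "2000-07-31", "2000-10-31",
                      "2000-03-31", "2000-03-15", "2000-04-15",
                      "2000-06-15", "2000-07-15")))
res <- smoothingEpisodesVec(testData)
print(res)
```

On the test data this yields five episodes (two for ID 1, one for ID 2, two for ID 3). Whether it beats smoothingEpisodes3 on 100,000 rows would need to be timed; that has not been measured here.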

"smoothing"时间数据 - 可以更高效地完成吗？
