R语言 将不规则日期时间(缺少日期时间)与常规日期时间计划匹配



假设datetimes1是不规则间隔的日期时间,datetimes2是定期间隔的日期时间。datetimes1缺少一些日期时间,例如 5:10,如第一个表所示。

我想要的是尝试将datetimes1datetimes2匹配,以便每个datetimes1都接近datetimes2,并且所有datetimes1都在看似正确的行中。

起初,我尝试将datetimes1四舍五入到最接近的 5 分钟,并尝试将它们与datetimes2匹配,但有些日期时间相差 3 分钟,因此它们被四舍五入到不正确的值。

接下来我尝试的是找到哪些datetimes1完全等于datetimes2并将这些datetimes1首先匹配到datetimes2,方法是将tolerance设置为 0,然后开始在每个循环中将tolerance增加 1,并将尚未匹配的datetimes1datetimes2由指定tolerance关闭的匹配。

此方法的问题在于,5:33 和 5:37 两次都比 5:35相差 2 分钟,因此 5:33 首先与 5:35 匹配,然后 5:37 不会包含在表中。 请参阅第二个表,了解我使用给定代码获得的结果。

你知道我该如何解决这个问题吗?

我想看到的:

datetimes1          datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2                <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4 2014-07-24 05:23:00 2014-07-24 05:20:00
5 2014-07-24 05:27:00 2014-07-24 05:25:00
6 2014-07-24 05:33:00 2014-07-24 05:30:00
7 2014-07-24 05:37:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00

但我得到了这个:

datetimes1          datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2                <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4                <NA> 2014-07-24 05:20:00
5 2014-07-24 05:23:00 2014-07-24 05:25:00
6 2014-07-24 05:27:00 2014-07-24 05:30:00
7 2014-07-24 05:33:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00

这是我的代码:

irregulars <- c("2014-07-24 05:05",
"2014-07-24 05:15",
"2014-07-24 05:23",
"2014-07-24 05:27",
"2014-07-24 05:33",
"2014-07-24 05:37",
"2014-07-24 05:41",
"2014-07-24 05:45")
df1 <- data.frame(datetimes <- as.POSIXct(irregulars, "GMT"))
regulars <- c("2014-07-24 05:05",
"2014-07-24 05:10",
"2014-07-24 05:15", 
"2014-07-24 05:20",
"2014-07-24 05:25",
"2014-07-24 05:30",
"2014-07-24 05:35",
"2014-07-24 05:40",
"2014-07-24 05:45")
df2 <- setNames(data.frame(matrix(NA,length(regulars),2)),c("datetimes1","datetimes2"))
df2$datetimes2 <- as.POSIXct(regulars, "GMT")
# Match irregulars to regulars
for(tolerance in c(0:3)) {
for(idx in which(!df1$datetimes %in% df2$datetimes1)) {
dt <- abs(difftime(df2$datetimes2, df1$datetimes[idx], "GMT", "mins"))
dt.min <- min(dt[is.na(df2$datetimes1)])
if (dt.min > tolerance) next
idx2 <- which(dt == dt.min)
df2$datetimes1[idx2] <- df1$datetimes[idx]
}
}
df2$datetimes1 <- as.POSIXct(df2$datetimes1, "GMT", origin = "1970-01-01 00:00:00")

这里有一种方法。假设我们的匹配算法是从一组候选人中找出一个与目标x绝对差最小的时间,条件是差值必须低于容差水平(例如 5 分钟或 300 秒):

closest <- function(x, candidates, tol = 300) {
timediff <- abs(difftime(x, candidates, units = "secs"))
if (all(timediff >= tol)) return(NA)
candidates[which.min(timediff)]
}

我们案例中的候选人是一组"非正规者",我们的目标是"常客"。这里的主要思想是遍历"常规",每当我们从候选集合中找到匹配项时,我们都会将其从候选集合中删除:

candidates <- irregulars
out <- sapply(regulars, function(x) {
matched <- closest(x, candidates, tol = 300)
candidates <<- setdiff(candidates, matched)
matched
})

这是完整的MWE。 首先设置时间向量:

irregulars <- c("2014-07-24 05:05",
"2014-07-24 05:15",
"2014-07-24 05:23",
"2014-07-24 05:27",
"2014-07-24 05:33",
"2014-07-24 05:37",
"2014-07-24 05:41",
"2014-07-24 05:45")
regulars <- c("2014-07-24 05:05",
"2014-07-24 05:10",
"2014-07-24 05:15", 
"2014-07-24 05:20",
"2014-07-24 05:25",
"2014-07-24 05:30",
"2014-07-24 05:35",
"2014-07-24 05:40",
"2014-07-24 05:45")

定义closest函数并迭代:

closest <- function(x, candidates, tol = 600) {
timediff <- abs(difftime(x, candidates, units = "secs"))
if (all(timediff >= tol)) return(NA)
candidates[which.min(timediff)]
}
candidates <- irregulars
out <- sapply(regulars, function(x) {
matched <- closest(x, candidates, tol = 300)
candidates <<- setdiff(candidates, matched)
matched
})

显示输出:

data.frame(datetimes1 = out,
datetimes2 = names(out),
row.names = NULL)
#        datetimes1       datetimes2
# 1 2014-07-24 05:05 2014-07-24 05:05
# 2             <NA> 2014-07-24 05:10
# 3 2014-07-24 05:15 2014-07-24 05:15
# 4 2014-07-24 05:23 2014-07-24 05:20
# 5 2014-07-24 05:27 2014-07-24 05:25
# 6 2014-07-24 05:33 2014-07-24 05:30
# 7 2014-07-24 05:37 2014-07-24 05:35
# 8 2014-07-24 05:41 2014-07-24 05:40
# 9 2014-07-24 05:45 2014-07-24 05:45

假设不规则时间永远不能与常规时间相隔 5 分钟,您可以遍历两个向量,并且仅在差值小于 5 分钟时选择:

i=1
j=1
while(i<=nrow(df2) & j<=nrow(df1))
{
d <-difftime(df2$datetimes2[i], df1$datetimes[j], "GMT",unit="mins")
if (abs(d) < 5) {
df2$datetimes1[i] <-  df1$datetimes[j]
j=j+1
i=i+1
} else if(d>0) j=j+1
else i=i+1

}
df2$datetimes1 <- as.POSIXct(df2$datetimes1, "GMT", origin = "1970-01-01 00:00:00")

> df2
datetimes1          datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2                <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4 2014-07-24 05:23:00 2014-07-24 05:20:00
5 2014-07-24 05:27:00 2014-07-24 05:25:00
6 2014-07-24 05:33:00 2014-07-24 05:30:00
7 2014-07-24 05:37:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00

最新更新