假设datetimes1
是不规则间隔的日期时间,datetimes2
是定期间隔的日期时间。datetimes1
缺少一些日期时间,例如 5:10,如第一个表所示。
我想要的是尝试将datetimes1
与datetimes2
匹配,以便每个datetimes1
都接近datetimes2
,并且所有datetimes1
都在看似正确的行中。
起初,我尝试将datetimes1
四舍五入到最接近的 5 分钟,并尝试将它们与datetimes2
匹配,但有些日期时间相差 3 分钟,因此它们被四舍五入到不正确的值。
接下来我尝试的是找到哪些datetimes1
完全等于datetimes2
并将这些datetimes1
首先匹配到datetimes2
,方法是将tolerance
设置为 0,然后开始在每个循环中将tolerance
增加 1,并将尚未匹配的datetimes1
与datetimes2
由指定tolerance
关闭的匹配。
此方法的问题在于,5:33 和 5:37 两次都比 5:35相差 2 分钟,因此 5:33 首先与 5:35 匹配,然后 5:37 不会包含在表中。 请参阅第二个表,了解我使用给定代码获得的结果。
你知道我该如何解决这个问题吗?
我想看到的:
datetimes1 datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2 <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4 2014-07-24 05:23:00 2014-07-24 05:20:00
5 2014-07-24 05:27:00 2014-07-24 05:25:00
6 2014-07-24 05:33:00 2014-07-24 05:30:00
7 2014-07-24 05:37:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00
但我得到了这个:
datetimes1 datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2 <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4 <NA> 2014-07-24 05:20:00
5 2014-07-24 05:23:00 2014-07-24 05:25:00
6 2014-07-24 05:27:00 2014-07-24 05:30:00
7 2014-07-24 05:33:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00
这是我的代码:
irregulars <- c("2014-07-24 05:05",
"2014-07-24 05:15",
"2014-07-24 05:23",
"2014-07-24 05:27",
"2014-07-24 05:33",
"2014-07-24 05:37",
"2014-07-24 05:41",
"2014-07-24 05:45")
df1 <- data.frame(datetimes <- as.POSIXct(irregulars, "GMT"))
regulars <- c("2014-07-24 05:05",
"2014-07-24 05:10",
"2014-07-24 05:15",
"2014-07-24 05:20",
"2014-07-24 05:25",
"2014-07-24 05:30",
"2014-07-24 05:35",
"2014-07-24 05:40",
"2014-07-24 05:45")
df2 <- setNames(data.frame(matrix(NA,length(regulars),2)),c("datetimes1","datetimes2"))
df2$datetimes2 <- as.POSIXct(regulars, "GMT")
# Match irregulars to regulars
for(tolerance in c(0:3)) {
for(idx in which(!df1$datetimes %in% df2$datetimes1)) {
dt <- abs(difftime(df2$datetimes2, df1$datetimes[idx], "GMT", "mins"))
dt.min <- min(dt[is.na(df2$datetimes1)])
if (dt.min > tolerance) next
idx2 <- which(dt == dt.min)
df2$datetimes1[idx2] <- df1$datetimes[idx]
}
}
df2$datetimes1 <- as.POSIXct(df2$datetimes1, "GMT", origin = "1970-01-01 00:00:00")
这里有一种方法。假设我们的匹配算法是从一组候选人中找出一个与目标x
绝对差最小的时间,条件是差值必须低于容差水平(例如 5 分钟或 300 秒):
closest <- function(x, candidates, tol = 300) {
timediff <- abs(difftime(x, candidates, units = "secs"))
if (all(timediff >= tol)) return(NA)
candidates[which.min(timediff)]
}
我们案例中的候选人是一组"非正规者",我们的目标是"常客"。这里的主要思想是遍历"常规",每当我们从候选集合中找到匹配项时,我们都会将其从候选集合中删除:
candidates <- irregulars
out <- sapply(regulars, function(x) {
matched <- closest(x, candidates, tol = 300)
candidates <<- setdiff(candidates, matched)
matched
})
这是完整的MWE。 首先设置时间向量:
irregulars <- c("2014-07-24 05:05",
"2014-07-24 05:15",
"2014-07-24 05:23",
"2014-07-24 05:27",
"2014-07-24 05:33",
"2014-07-24 05:37",
"2014-07-24 05:41",
"2014-07-24 05:45")
regulars <- c("2014-07-24 05:05",
"2014-07-24 05:10",
"2014-07-24 05:15",
"2014-07-24 05:20",
"2014-07-24 05:25",
"2014-07-24 05:30",
"2014-07-24 05:35",
"2014-07-24 05:40",
"2014-07-24 05:45")
定义closest
函数并迭代:
closest <- function(x, candidates, tol = 600) {
timediff <- abs(difftime(x, candidates, units = "secs"))
if (all(timediff >= tol)) return(NA)
candidates[which.min(timediff)]
}
candidates <- irregulars
out <- sapply(regulars, function(x) {
matched <- closest(x, candidates, tol = 300)
candidates <<- setdiff(candidates, matched)
matched
})
显示输出:
data.frame(datetimes1 = out,
datetimes2 = names(out),
row.names = NULL)
# datetimes1 datetimes2
# 1 2014-07-24 05:05 2014-07-24 05:05
# 2 <NA> 2014-07-24 05:10
# 3 2014-07-24 05:15 2014-07-24 05:15
# 4 2014-07-24 05:23 2014-07-24 05:20
# 5 2014-07-24 05:27 2014-07-24 05:25
# 6 2014-07-24 05:33 2014-07-24 05:30
# 7 2014-07-24 05:37 2014-07-24 05:35
# 8 2014-07-24 05:41 2014-07-24 05:40
# 9 2014-07-24 05:45 2014-07-24 05:45
假设不规则时间永远不能与常规时间相隔 5 分钟,您可以遍历两个向量,并且仅在差值小于 5 分钟时选择:
i=1
j=1
while(i<=nrow(df2) & j<=nrow(df1))
{
d <-difftime(df2$datetimes2[i], df1$datetimes[j], "GMT",unit="mins")
if (abs(d) < 5) {
df2$datetimes1[i] <- df1$datetimes[j]
j=j+1
i=i+1
} else if(d>0) j=j+1
else i=i+1
}
df2$datetimes1 <- as.POSIXct(df2$datetimes1, "GMT", origin = "1970-01-01 00:00:00")
> df2
datetimes1 datetimes2
1 2014-07-24 05:05:00 2014-07-24 05:05:00
2 <NA> 2014-07-24 05:10:00
3 2014-07-24 05:15:00 2014-07-24 05:15:00
4 2014-07-24 05:23:00 2014-07-24 05:20:00
5 2014-07-24 05:27:00 2014-07-24 05:25:00
6 2014-07-24 05:33:00 2014-07-24 05:30:00
7 2014-07-24 05:37:00 2014-07-24 05:35:00
8 2014-07-24 05:41:00 2014-07-24 05:40:00
9 2014-07-24 05:45:00 2014-07-24 05:45:00