How do I select the earlier-dated duplicate in an R data frame?



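I would like to know how to efficiently choose among rows that are identified as duplicates because they share a value in one column. Among each set of duplicated rows, I want to use a date column to identify and keep the row with the earlier time. I solved the problem with the code below, but I am looking for a more efficient base R solution that does not involve a for loop.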

# data required
data <- structure(list(var1 = c("11", "11", 
"12", "12", "13", "13", 
"14", "14", "15", "16"
), EndDate = structure(c(1588792540, 1588942766, 1589118458, 
1589059900, 1588669654, 1588979219, 1588876217, 1588786020, 1588506698, 
1588512011), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
                               10L), class = "data.frame")
duplicate_var1 <- data$var1[duplicated(data$var1)] # values of var1 that occur more than once
data["identifier"] <- NA # create a new, empty column
# loop to mark the rows to keep
for (i in seq_len(nrow(data))) { # for each row
  if (data$var1[i] %in% duplicate_var1) {
    var1_locations <- which(data$var1 == data$var1[i]) # all rows sharing this var1
    var1_location_2 <- setdiff(var1_locations, i) # the other row(s) with this var1
    # all() keeps the condition length-one even if a value occurs more than twice
    if (all(data$EndDate[i] < data$EndDate[var1_location_2])) {
      data[i, "identifier"] <- 1 # earliest row of its group: keep
    }
  } else {
    data[i, "identifier"] <- 1 # not duplicated: always keep
  }
}
# save the reduced data
newdf <- data[!is.na(data["identifier"]), ]

This is not a base R solution, but I am leaving it here in case it is useful to someone in the future.

library(data.table) # load data.table
setDT(data, key = "EndDate") # convert to data.table object and order by EndDate

Because the data is now sorted by EndDate (that is what setting the key does), take the first row of each group that shares the same var1:

data[, .SD[1, ], by = var1] 
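
Since the question asked for base R, the same sort-then-take-first idea can probably be sketched without data.table as well. The snippet below is only an illustration under a few assumptions: it rebuilds the question's example data with `data.frame()` so that it is self-contained, and the names `df`, `sorted`, and `earliest` are made up for this example. It relies on `order()` to sort by EndDate and on `duplicated()`, which flags every occurrence of a value after the first.

```r
# Example data copied from the question (rebuilt so the snippet is self-contained)
df <- data.frame(
  var1 = c("11", "11", "12", "12", "13", "13", "14", "14", "15", "16"),
  EndDate = as.POSIXct(
    c(1588792540, 1588942766, 1589118458, 1589059900, 1588669654,
      1588979219, 1588876217, 1588786020, 1588506698, 1588512011),
    origin = "1970-01-01", tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Sort the rows by EndDate, earliest first
sorted <- df[order(df$EndDate), ]

# duplicated() marks every occurrence after the first, so negating it
# keeps the earliest row for each value of var1
earliest <- sorted[!duplicated(sorted$var1), ]

# Optional: put the result back in var1 order
earliest <- earliest[order(earliest$var1), ]
```

With no explicit loop, this should also cope with values of var1 that occur more than twice, because only the first row per value in the sorted data is kept.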
