I have a dataset of roughly 55 million records spread across many CSVs. Each record contains a datetime, a latitude, and a longitude describing a lightning event and where it occurred. I need to add a field listing, for each event, the time (in seconds) since the most recent prior event within 5 nautical miles of the current one. The code I wrote below technically works, but I had to stop it after it had spent over an hour on the first CSV.
Although I wrote this in R, I'm open to any language or tool that will accomplish my goal without taking days or weeks to process the dataset. I have also tried SQLite, but so far I've only managed to frustrate myself trying to get it to do the math.
Also note that some entries will return NA during this search because no earlier qualifying event exists. That's fine; I want NA in those cases.
The code is followed by a sample of the CSV data; I've made a file containing one day of data available for testing here: https://www.dropbox.com/s/i02jl6bswq0zlel/2019.05.02.TL.csv?dl=1
(For now the code below uses taxicab distance; based on the mean latitude of all recorded events, latitude/longitude offsets of 0.08335 and 0.0965 correspond to roughly 5 nmi. If there is a way to compute the Pythagorean distance quickly, that would be great.)
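For what it's worth, those two half-widths can be derived directly: one minute of latitude is one nautical mile everywhere, while a degree of longitude shrinks by cos(latitude). A small standalone R sketch (my own illustration; the mean latitude of ~30.3°N is an assumption back-derived from the 0.0965 figure, not taken from the data):

```r
# 5 nautical miles expressed as degrees of latitude:
# 1 nmi = 1 minute of arc of latitude = 1/60 degree
lat_halfwidth <- 5 / 60                  # ~0.08333 degrees

# A degree of longitude shrinks with latitude, so divide by cos(lat).
# A hypothetical mean latitude of ~30.3 N recovers roughly the 0.0965
# half-width used in the question.
mean_lat <- 30.3
lon_halfwidth <- lat_halfwidth / cos(mean_lat * pi / 180)

round(lat_halfwidth, 5)  # 0.08333
round(lon_halfwidth, 4)  # 0.0965
```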
Code:
library(dplyr)
#Get list of all relevant CSVs
fList = list.files(pattern = "*.TL.csv")
#Reads in the first CSV; haven't implemented looping through files yet
t1 = read.csv(as.character(fList[1]), colClasses=c(DateTime="character"))
t1 = t1[order(t1$DateTime),] #Sort by DateTime
rownames(t1) = NULL #Reset index
#Used for filtering the current row out of the search
index = c(1:nrow(t1))
#Create empty column
t1$TimeGap = NA
for (i in 2:nrow(t1)) {
  temp = max(filter(t1, (DateTime <= t1$DateTime[i]) &
                        (Latitude >= t1$Latitude[i] - 0.08335) &
                        (Latitude <= t1$Latitude[i] + 0.08335) &
                        (Longitude >= t1$Longitude[i] - 0.0965) &
                        (Longitude <= t1$Longitude[i] + 0.0965) &
                        (index != i))$DateTime)
  t1$TimeGap[i] = as.numeric(difftime(as.POSIXct(t1$DateTime[i], format = "%Y-%m-%d %H:%M:%OS"),
                                      as.POSIXct(temp, format = "%Y-%m-%d %H:%M:%OS"),
                                      units = "secs"))
}
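As a side note on the timestamp handling the loop relies on: `%OS` parses fractional seconds, and `difftime` keeps them. A tiny standalone check using two timestamps from the sample data (illustrative only):

```r
# Sub-second timestamps parse with %OS; set digits.secs so they print.
options(digits.secs = 3)
t_a <- as.POSIXct("2019-05-02 15:26:52.131", format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")
t_b <- as.POSIXct("2019-05-02 15:28:59.356", format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Gap in seconds, fractional part preserved
gap <- as.numeric(difftime(t_b, t_a, units = "secs"))
gap  # ~127.225
```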
Sample CSV data:
"DateTime","Latitude","Longitude"
"2019-05-02 14:43:37.833",26.9517,-81.4851
"2019-05-02 14:43:37.857",26.9674,-81.4758
"2019-05-02 14:43:37.988",26.9802,-81.4698
"2019-05-02 14:45:41.512",27.0024,-81.4612
"2019-05-02 15:22:59.614",27.295,-81.1728
"2019-05-02 15:24:06.284",27.3444,-81.1213
"2019-05-02 15:24:51.607",27.3306,-81.146
"2019-05-02 15:26:52.130",27.5441,-81.1099
"2019-05-02 15:26:52.131",27.3214,-81.1758
"2019-05-02 15:26:52.131",27.3326,-81.1614
"2019-05-02 15:26:52.134",27.5396,-81.0952
"2019-05-02 15:26:52.134",27.5377,-81.1069
"2019-05-02 15:26:52.156",27.517,-81.1147
"2019-05-02 15:26:52.167",27.5377,-81.0962
"2019-05-02 15:28:59.356",27.5156,-81.1152
"2019-05-02 15:28:59.357",27.519,-81.1092
"2019-05-02 15:28:59.359",27.406,-81.174
"2019-05-02 15:28:59.362",27.4081,-81.1489
"2019-05-02 15:28:59.362",27.508,-81.1472
"2019-05-02 15:28:59.364",27.5183,-81.1497
"2019-05-02 15:28:59.417",27.5338,-81.1712
"2019-05-02 15:31:39.021",27.4052,-81.1956
"2019-05-02 15:31:39.027",27.4381,-81.1837
"2019-05-02 15:31:39.027",27.5141,-81.159
"2019-05-02 15:31:39.027",27.417,-81.1631
"2019-05-02 15:31:39.027",27.5439,-81.1326
"2019-05-02 15:31:39.048",27.4809,-81.1691
"2019-05-02 15:31:39.048",27.4666,-81.1561
"2019-05-02 15:31:39.048",27.4666,-81.1561
"2019-05-02 15:31:39.048",27.401,-81.1679
This sounds like a case where the (relatively new) package disk.frame will shine. It is similar to data.table, but designed with very large amounts of data in mind. Without having used it, but based on what I know from data.table, and assuming all of your data lives in file.path(getwd(), 'data'), you could likely achieve your goal with something like the following:
library(disk.frame)
# Setup from the ingesting Data vignette
if (interactive()) {
  setup_disk.frame()
  # highly recommended, however it is put into interactive() for CRAN because
  # changing user options is not allowed on CRAN
  options(future.globals.maxSize = Inf)
} else {
  setup_disk.frame(2)
}
# Import csv data from shared folder
disk.f <- csv_to_disk.frame(list.files(file.path(getwd(), 'data'), full.names = TRUE))
Once imported, we should be able to perform the calculation using standard data.table syntax:
disk.f[, `:=`(LatitudeMax = Latitude + 0.08335,
              LatitudeMin = Latitude - 0.08335,
              LongitudeMax = Longitude + 0.0965,
              LongitudeMin = Longitude - 0.0965,
              rowid = seq_len(.N),
              DateTime = as.POSIXct(DateTime, format = "%Y-%m-%d %H:%M:%OS"))]
# Perform self join.. I hope I got the direction right here.
calc.disk.f <-
  disk.f[disk.f[, .(Latitude.i = Latitude,
                    Longitude.i = Longitude,
                    DateTime.i = DateTime,
                    rowid.i = rowid)],
         on = .(DateTime >= DateTime.i,
                LongitudeMax >= Longitude.i,
                LongitudeMin <= Longitude.i,
                LatitudeMax >= Latitude.i,
                LatitudeMin <= Latitude.i,
                rowid != rowid.i),
         allow.cartesian = TRUE][, .(TimeGap = difftime(DateTime,
                                                        max(DateTime.i),
                                                        units = 'secs')),
                                 by = rowid]
# Merge result into original frame.
disk.f <- calc.disk.f[disk.f, on = "rowid"]
# Clean up calc.disk.f
delete(calc.disk.f)
# Save result or change to a regular data.table/data.frame/tibble
## Not sure if this is in csv, fst or what format.
write_disk.frame(disk.f, outdir = file.path(getwd(), 'data', 'result.df'))
Disclaimer: I haven't been able to test the code above, so it may be riddled with typos and wrong joins (should my aggregation group be rowid or rowid.i? I may have flipped it one way or the other), as well as a self join that is untested in disk.frame (it may require creating a copy of the disk.frame). I may put together a more reproducible example, but more likely this can serve as a starting point and inform the correct solution.
Note that disk.frame similarly has a dplyr-like interface, which could be used in case the data.table self-join syntax does not work.
Update 1:
The data.table syntax of disk.frame seems to be only a partial wrapper around the data.table universe, so the join I suggested does not work as expected. Using foverlaps is my current second approach for fast subsetting.
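As a minimal in-memory sketch of that idea (plain data.table only, my own illustration, untested against disk.frame): foverlaps matches on a single interval pair, so below the latitude box is the interval and the longitude, time, and self-match conditions are filtered afterwards. The three rows are taken from the question's sample, with timestamps truncated to whole seconds for brevity.

```r
library(data.table)

# Three events from the question's sample (timestamps truncated to seconds)
dt <- data.table(
  DateTime  = as.POSIXct(c("2019-05-02 14:43:37",
                           "2019-05-02 14:45:41",
                           "2019-05-02 15:22:59"), tz = "UTC"),
  Latitude  = c(26.9517, 27.0024, 27.2950),
  Longitude = c(-81.4851, -81.4612, -81.1728)
)
dt[, rowid := .I]

# Query table: a latitude interval (box half-width) around each event
query <- dt[, .(rowid, DateTime, Longitude,
                LatMin = Latitude - 0.08335,
                LatMax = Latitude + 0.08335)]

# Candidate table: zero-width "point" intervals; foverlaps needs y keyed
cand <- dt[, .(rowid.c = rowid, DateTime.c = DateTime, Longitude.c = Longitude,
               LatMin = Latitude, LatMax = Latitude)]
setkey(cand, LatMin, LatMax)

# Interval join on latitude, then filter the remaining box conditions
hits <- foverlaps(query, cand, by.x = c("LatMin", "LatMax"), type = "any")
hits <- hits[rowid != rowid.c &
             DateTime.c <= DateTime &
             abs(Longitude.c - Longitude) <= 0.0965]

# Seconds since the most recent qualifying earlier event, per event
gaps <- hits[, .(TimeGap = as.numeric(difftime(DateTime[1], max(DateTime.c),
                                               units = "secs"))), by = rowid]
```

On this toy input only event 2 has a qualifying earlier neighbour (event 1, 124 s before it); events with no match simply drop out of gaps, which matches the "NA is fine" requirement after a left merge back onto the original table.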