加速R中的嵌套循环进行距离比较

我有2个数据框-STORE_LIST_A(50,000行)&STORE_LIST_B(30,000行)。这两个数据名都包含以下3列:STORE_ID纬度,经度,

<表类> STORE_ID 纬度经度 tbody><<tr>ABPS693P8.095134977.5472305ABTS730P8.102462777.5581893

好的，这里有一个尝试:)我运行了几次内存限制，但这可能取决于你的商店有多近。我找到了旧金山所有公司的一个很好的数据集，所以在30k * 50k的组合上运行代码，但我的标准是相隔大约250米。

使用<<p>库/strong>
library(data.table) library(geosphere)
只是为了准备一个类似的数据集
# downloaded all stores in SF # https://towardsdatascience.com/plotting-spatial-data-in-r-a38a405a07f1 dt <- fread("sf_business_dataset.csv") # prepare and convert the data comparable to your format dt <- dt[, .SD, .SDcols = c("Location Id", "Business Location")] dt[, c("LATITUDE", "LONGITUDE") := lapply(tstrsplit(gsub(".*\((.+)\)", "\1", `Business Location`), ","), as.numeric)] dt <- dt[!is.na(LATITUDE) & !is.na(LONGITUDE), .(STORE_ID = `Location Id`, LATITUDE, LONGITUDE)] # randomize, not sure how it was ordered, so we get a randomized A and B list. dt[, rIndex := sample(1:.N, .N, replace = F)] setorder(dt, rIndex) dt[, rIndex := NULL]
让我们玩一玩，我们用n = 10000来创建你的大小，让它变小来测试代码。在此之后，我们有了可以使用的示例数据。

# simulate your sample sizes n <- 10000 STORE_LIST_A <- dt[1:(3*n)] STORE_LIST_B <- dt[(3*n+1):((3+5)*n)]
<<p>解决方案/strong>这看起来有点多余，但大约0.008的小数差等于大约1公里，我们对任何超过这个距离的计算都不感兴趣。注意SF数据，我把它变小了，因为我们会有太多的组合!我决定使用foverlaps作为笛卡尔连接的后过滤，结果是25亿个组合，祝你好运:)

r <- 0.002 # set to 0.008 roughly for 1km
注意，在A中我们设置范围，在B中我们"克隆";开始/结束。我们基本上是将数据拆分，将经纬度分开做，如果其中任何一个距离超过1km，那么A和B之间的距离也会超过1km !
STORE_LIST_A[, LATITUDE_start := LATITUDE - r][, LATITUDE_end := LATITUDE + r] STORE_LIST_A[, LONGITUDE_start := LONGITUDE - r][, LONGITUDE_end := LONGITUDE + r] STORE_LIST_B[, LATITUDE_start := LATITUDE][, LATITUDE_end := LATITUDE] STORE_LIST_B[, LONGITUDE_start := LONGITUDE][, LONGITUDE_end := LONGITUDE] setkey(STORE_LIST_A, LATITUDE_start, LATITUDE_end) setkey(STORE_LIST_B, LATITUDE_start, LATITUDE_end) close_lat <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 62M matches setkey(STORE_LIST_A, LONGITUDE_start, LONGITUDE_end) setkey(STORE_LIST_B, LONGITUDE_start, LONGITUDE_end) close_lon <- foverlaps(STORE_LIST_B, STORE_LIST_A, type = "within", nomatch = 0) # 53M matches
现在我们必须内部连接这两个，因为只有当经纬度之间的距离足够小时，距离才足够小。下面对这些数据的分析结果是"只有"。5.3M匹配，你必须计算距离。

close_enough <- close_lat[close_lon, on = .(STORE_ID, i.STORE_ID), .(STORE_ID, LONGITUDE, LATITUDE, i.STORE_ID, i.LONGITUDE, i.LATITUDE), nomatch = 0]
对实际计算的距离进行一些清理和过滤，注意这仍然很慢!但是你知道我们只需要计算560万箱，而不是1500万箱:)

close_enough[, distance := distHaversine(c(LONGITUDE, LATITUDE), c(i.LONGITUDE, i.LATITUDE)), .(STORE_ID, i.STORE_ID)][distance < 1000]
样本输出(来自较小的测试运行)
STORE_ID LONGITUDE LATITUDE i.STORE_ID i.LONGITUDE i.LATITUDE distance 1: 0375071-01-001 -122.49 37.761 0465561-01-001 -122.49 37.761 51.190 2: 0925437-02-001 -122.49 37.781 1030867-06-151 -122.49 37.780 142.786 3: 0434463-01-001 -122.49 37.780 1030867-06-151 -122.49 37.780 136.233 4: 0460546-01-001 -122.48 37.782 0357002-01-001 -122.48 37.783 174.721 5: 0433580-02-001 -122.48 37.764 0412914-01-001 -122.48 37.763 91.625 --- 500: 1020837-02-151 -122.39 37.762 1005253-08-141 -122.39 37.760 173.533 501: 0436520-02-001 -122.39 37.760 0371484-02-001 -122.39 37.760 0.000 502: 0474822-01-001 -122.39 37.760 0371484-02-001 -122.39 37.760 33.514 503: 1020837-02-151 -122.39 37.762 0371484-02-001 -122.39 37.760 173.533 504: 1022102-03-151 -122.39 37.740 0470578-01-001 -122.39 37.738 238.217

相关内容

最新更新

热门标签：