Trying to improve the efficiency of an R function that processes time series data



I have a billion rows of time and sales data, like this:

datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50

I have a function that returns the most recent price at each one-minute mark:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50

My function first rounds each timestamp up to the next whole minute:

datetime               price 
"2016-05-01 18:01:00"  2060.75
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:04:00"  2061.25
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.00
"2016-05-01 18:05:00"  2061.25
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

then, working from the bottom up, removes the rows with duplicate times:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:07:00"  2061.50

and then adds the missing minutes:

datetime               price 
"2016-05-01 18:01:00"  2060.50
"2016-05-01 18:02:00"  2060.75
"2016-05-01 18:03:00"  2060.75
"2016-05-01 18:04:00"  2061.00
"2016-05-01 18:05:00"  2061.50
"2016-05-01 18:06:00"  2061.50
"2016-05-01 18:07:00"  2061.50

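I have tried many different functions, and this is the fastest approach I could find, but it is still slow. I think there must be a more efficient way to do this that I haven't thought of. Can anyone help?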
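The question does not show the function itself. For reference, the three steps it describes (round up, keep the last row per minute, fill the gaps) could be sketched roughly as below. This is only an illustration under assumptions: the column name `minute_end`, the use of `tz = "UTC"`, and the choice of `nafill()` for the fill step are not from the question.

```r
library(data.table)

# Sample rows from the question; tz = "UTC" is an assumption for reproducibility.
dt <- data.table(
  datetime = as.POSIXct(c("2016-05-01 18:00:02", "2016-05-01 18:00:22",
                          "2016-05-01 18:00:35", "2016-05-01 18:01:59",
                          "2016-05-01 18:03:21", "2016-05-01 18:03:21",
                          "2016-05-01 18:03:42", "2016-05-01 18:04:22",
                          "2016-05-01 18:04:25", "2016-05-01 18:04:44",
                          "2016-05-01 18:06:41"), tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.50, 2060.75, 2061.00, 2061.25,
            2061.00, 2061.00, 2061.25, 2061.50, 2061.50)
)

# Step 1: round each timestamp up to the next whole minute.
secs <- as.numeric(dt$datetime)
dt[, minute_end := as.POSIXct(ceiling(secs / 60) * 60,
                              origin = "1970-01-01", tz = "UTC")]

# Step 2: keep only the last row of each minute (duplicates dropped from the bottom up).
last_per_minute <- dt[!duplicated(minute_end, fromLast = TRUE),
                      .(datetime = minute_end, price)]

# Step 3: add the missing minutes and carry the previous price forward.
all_minutes <- data.table(datetime = seq(min(last_per_minute$datetime),
                                         max(last_per_minute$datetime),
                                         by = "1 min"))
filled <- merge(all_minutes, last_per_minute, by = "datetime", all.x = TRUE)
filled[, price := nafill(price, type = "locf")]
filled
```

On the sample rows this yields the seven-row table shown above, from 18:01:00 to 18:07:00.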

You can do this in two steps using a rolling join from library(data.table).

First, create a data.table of all the "minutes" of interest:

dt_minutes <- data.table(datetime = seq(as.POSIXct("2016-05-01 18:00:00"), 
                                    length.out = 10, 
                                    by = "mins"))
dt_minutes
#                datetime
# 1: 2016-05-01 18:00:00
# 2: 2016-05-01 18:01:00
# 3: 2016-05-01 18:02:00
# 4: 2016-05-01 18:03:00
# 5: 2016-05-01 18:04:00
# 6: 2016-05-01 18:05:00
# 7: 2016-05-01 18:06:00
# 8: 2016-05-01 18:07:00
# 9: 2016-05-01 18:08:00
# 10: 2016-05-01 18:09:00

Then use a rolling join to get the latest price for each minute:

## you'll need to set your data to a data.table
# library(data.table)
# setDT(dt)
dt[dt_minutes, roll = TRUE, on = "datetime"]
#               datetime   price
# 1: 2016-05-01 18:00:00      NA
# 2: 2016-05-01 18:01:00 2060.50
# 3: 2016-05-01 18:02:00 2060.75
# 4: 2016-05-01 18:03:00 2060.75
# 5: 2016-05-01 18:04:00 2061.00
# 6: 2016-05-01 18:05:00 2061.50
# 7: 2016-05-01 18:06:00 2061.50
# 8: 2016-05-01 18:07:00 2061.50
# 9: 2016-05-01 18:08:00 2061.50
# 10: 2016-05-01 18:09:00 2061.50
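With a billion rows you would probably not hard-code the start time and `length.out`. As a sketch, the same two steps can be written so that the minute grid is derived from the data itself. The explicit `tz = "UTC"` is an assumption added here so that both tables are guaranteed to use the same time zone; substitute whatever zone your timestamps are actually in.

```r
library(data.table)

# Same sample rows as in the question; tz = "UTC" is an assumption.
dt <- data.table(
  datetime = as.POSIXct(c("2016-05-01 18:00:02", "2016-05-01 18:00:22",
                          "2016-05-01 18:00:35", "2016-05-01 18:01:59",
                          "2016-05-01 18:03:21", "2016-05-01 18:03:21",
                          "2016-05-01 18:03:42", "2016-05-01 18:04:22",
                          "2016-05-01 18:04:25", "2016-05-01 18:04:44",
                          "2016-05-01 18:06:41"), tz = "UTC"),
  price = c(2060.75, 2060.50, 2060.50, 2060.75, 2061.00, 2061.25,
            2061.00, 2061.00, 2061.25, 2061.50, 2061.50)
)

# Minute grid covering the data: from the whole minute at or before the
# first trade to the whole minute at or after the last trade.
rng <- as.numeric(range(dt$datetime))
dt_minutes <- data.table(datetime = seq(
  as.POSIXct(floor(rng[1] / 60) * 60, origin = "1970-01-01", tz = "UTC"),
  as.POSIXct(ceiling(rng[2] / 60) * 60, origin = "1970-01-01", tz = "UTC"),
  by = "1 min"
))

# roll = TRUE carries the last observed price forward to each grid minute.
res <- dt[dt_minutes, on = "datetime", roll = TRUE]
res
```

The first grid minute (18:00:00) comes back as `NA` because no trade precedes it, just as in the output above; drop that row with `res[!is.na(price)]` if you don't want it.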

Data
library(data.table)
dt <- fread('datetime               price 
"2016-05-01 18:00:02"  2060.75
"2016-05-01 18:00:22"  2060.50
"2016-05-01 18:00:35"  2060.50
"2016-05-01 18:01:59"  2060.75
"2016-05-01 18:03:21"  2061.00
"2016-05-01 18:03:21"  2061.25
"2016-05-01 18:03:42"  2061.00
"2016-05-01 18:04:22"  2061.00
"2016-05-01 18:04:25"  2061.25
"2016-05-01 18:04:44"  2061.50
"2016-05-01 18:06:41"  2061.50', header = T)

There is a good blog post on rolling joins that can get you started.
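As a quick orientation before reading further, here is a toy sketch of what the `roll` argument does. The tables `obs` and `query` and their values are made up for illustration; only the `roll` settings themselves come from data.table.

```r
library(data.table)

# Hypothetical toy data: three observations and three lookup points.
obs   <- data.table(t = c(10, 20, 30), value = c("a", "b", "c"))
query <- data.table(t = c(12, 18, 27))

# roll = TRUE: take the previous observation (last observation carried forward).
prev_obs <- obs[query, on = "t", roll = TRUE]$value

# roll = -Inf: take the next observation (next observation carried backward).
next_obs <- obs[query, on = "t", roll = -Inf]$value

# roll = "nearest": take whichever observation is closest.
near_obs <- obs[query, on = "t", roll = "nearest"]$value
```

For the per-minute prices in this answer, `roll = TRUE` is the setting that matches "the most recent price at each minute".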
