Converting Dukascopy binary data to text (a data frame) in R



I want to download and save tick market data from Dukascopy: https://www.dukascopy.com/swiss/english/marketwatch/historical/

My request to their server succeeds, and the GET request returns binary data:

library(httr)

url <- "https://datafeed.dukascopy.com/datafeed/USA30IDXUSD/2022/07/15/23h_ticks.bi5"
# add_headers() expects headers as named arguments (name = value)
p <- GET(url,
         add_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"))
p
content(p)

I don't know how to convert this binary file into readable data. I found some scripts in Python:
https://www.driftinginrecursion.com/post/dukascopy_opensource_data/
https://github.com/terukusu/download-tick-from-dukascopy/blob/master/download_tick_from_dukascopy.py

For example, the first link uses this function:

import lzma
import os
import struct

import pandas as pd

def bi5_to_csv(date_ts, out_dir, files):
    print('Starting Coversion of All .bi5 Files...')
    sort = sorted(files)
    # Each record: 3 big-endian 32-bit ints followed by 2 big-endian 32-bit floats
    chunk_size = struct.calcsize('>3i2f')
    data = []
    for bi5 in sort:
        try:
            size = os.path.getsize(bi5)
        except (IOError, OSError):
            break
        if size > 0:
            with lzma.open(bi5) as f:
                while True:
                    chunk = f.read(chunk_size)
                    if chunk:
                        data.append(struct.unpack('>3i2f', chunk))
                    else:
                        break
        os.remove(bi5)
    if not data:
        print('All Downloaded Files Where Empty!')
        return 1
    df = pd.DataFrame(data)
    df.columns = ['UTC', 'AskPrice', 'BidPrice', 'AskVolume', 'BidVolume']
    df.AskPrice = df.AskPrice / 100000
    df.BidPrice = df.BidPrice / 100000
    df.UTC = pd.TimedeltaIndex(df.UTC, 'ms')
    df.UTC = df.UTC.astype(str)
    df.UTC = df.UTC.replace(regex=['0 days'], value=[str(date_ts)])
    df.UTC = df.UTC.str[:-3]
    df.to_csv(out_dir + '/daily.csv', index=False)
    print('Finished Converting Files!')
    return 0

The second script uses:

import struct
from datetime import timedelta

def tokenize(buffer):
    token_size = 20
    token_count = int(len(buffer) / token_size)
    tokens = list(map(lambda x: struct.unpack_from('>3I2f', buffer, token_size * x), range(0, token_count)))
    return tokens

def normalize_tick(symbol, day, time, ask, bid, ask_vol, bid_vol):
    date = day + timedelta(milliseconds=time)
    # TODO: make this exhaustive. There may be other pairs besides these.
    if any(map(lambda x: x in symbol.lower(), ['usdrub', 'xagusd', 'xauusd', 'jpy'])):
        point = 1000
    else:
        point = 100000
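
For reference, the symbol check in `normalize_tick` can be ported to R roughly as follows. This is only a sketch: `point_for_symbol` is a hypothetical helper name, and the symbol list is copied from the Python script, whose own TODO says it may be incomplete.

```r
# Rough R port of the divisor selection in normalize_tick().
# `point_for_symbol` is a hypothetical name, not part of any package.
point_for_symbol <- function(symbol) {
  # Substrings taken from the Python script; the list may not be exhaustive.
  small_point <- c("usdrub", "xagusd", "xauusd", "jpy")
  hits <- vapply(
    small_point,
    function(p) grepl(p, tolower(symbol), fixed = TRUE),
    logical(1)
  )
  if (any(hits)) 1000 else 100000
}

point_for_symbol("USDJPY")
point_for_symbol("EURUSD")
```

The raw integer prices would then be divided by the returned value.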

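I don't know how to apply this code in R, or how to convert this binary file into tick data. I also found a C++ implementation: Reading data from Dukascopy tick binary file

My rough attempt:

Reading a little about bi5 files, one learns that they are binary files compressed with LZMA (see: reddit). So the first step is to decompress the file. To do that, I use the lzma command-line tool: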

library(httr)
binF <- "/home/sapi/aaa/23h_ticks.lzma"
url <- "https://datafeed.dukascopy.com/datafeed/USA30IDXUSD/2022/07/15/23h_ticks.bi5"
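# add_headers() expects headers as named arguments; write_disk() saves the response body to binF
p <- GET(url,
         add_headers("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"),
         write_disk(binF, overwrite = TRUE))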
system(paste("lzma -d", binF))

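After decompressing, we have to read the binary file in and convert it into something meaningful. From the reddit discussion above, we learn:

The data is stored in 20-byte rows, and each group of 4 bytes in a row corresponds to one field: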

# TIME is a 32-bit big-endian integer representing the number of milliseconds that have passed since the beginning of this hour.
# ASKP is a 32-bit big-endian integer representing the asking price of the pair, multiplied by 100,000.
# BIDP is a 32-bit big-endian integer representing the bidding price of the pair, multiplied by 100,000.
# ASKV is a 32-bit big-endian floating point number representing the asking volume, divided by 1,000,000.
# BIDV is a 32-bit big-endian floating point number representing the bidding volume, divided by 1,000,000.
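
The layout above can be decoded with base R's `readBin()`, which accepts `endian = "big"` and reads 4-byte floats via `what = "numeric", size = 4`. The following is a minimal sketch, assuming the bytes are already decompressed; `parse_ticks` and its `point` argument are hypothetical names, and the sample record is synthetic (built with `writeBin()` purely to exercise the parser).

```r
# Decode 20-byte big-endian tick records from a raw vector.
# `parse_ticks` and `point` are illustrative names, not from any package.
parse_ticks <- function(bytes, point = 100000) {
  stopifnot(is.raw(bytes), length(bytes) %% 20 == 0)
  n <- length(bytes) %/% 20
  m <- matrix(bytes, nrow = 20)  # one column per 20-byte record

  # Pull the same 4 byte rows out of every record and decode them in one call.
  read_field <- function(rows, what) {
    readBin(as.vector(m[rows, , drop = FALSE]),
            what = what, n = n, size = 4, endian = "big")
  }

  data.frame(
    TIME = read_field(1:4, "integer"),            # ms since start of the hour
    ASKP = read_field(5:8, "integer") / point,    # assumed price divisor
    BIDP = read_field(9:12, "integer") / point,
    ASKV = read_field(13:16, "numeric"),          # 32-bit float
    BIDV = read_field(17:20, "numeric")
  )
}

# Synthetic record for illustration only (volumes are made up).
rec <- c(
  writeBin(246L,      raw(), size = 4, endian = "big"),
  writeBin(33867231L, raw(), size = 4, endian = "big"),
  writeBin(33862497L, raw(), size = 4, endian = "big"),
  writeBin(0.5,       raw(), size = 4, endian = "big"),
  writeBin(0.25,      raw(), size = 4, endian = "big")
)
ticks <- parse_ticks(rec)
```

For a real file, the raw vector could come from `readBin(path, "raw", file.size(path))` on the decompressed file. The right value of `point` depends on the instrument, so the default here is only an assumption.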

Now we have to find a way to convert the big-endian bytes into usable numbers. This Stack Overflow question gave a hint.

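library(dplyr)
binF <- "/home/sapi/aaa/23h_ticks"
con <- file(binF, "rb")

# Each record is 20 bytes: five 4-byte big-endian fields.
# Reversing the bytes (e.g. data[4:1]) puts the least significant byte first,
# so summing 2^(bit position - 1) rebuilds the value as an unsigned integer.
for (i in 0:(file.size(binF)/20-1)) {
  data <- readBin(con = con, "raw", 20)
  TIME <- data[4:1] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
  ASKP <- data[8:5] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
  BIDP <- data[12:9] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
  # ASKV and BIDV are described above as floats but are decoded as integers here,
  # so these two columns show the raw bit patterns rather than volumes.
  ASKV <- data[16:13] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
  BIDV <- data[20:17] %>% rawToBits %>% as.logical %>% which %>% {2^(. - 1)} %>% sum
  print(paste(TIME, ASKP, BIDP, ASKV, BIDV))
}
close(con)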
#> [1] "246 33867231 33862497 946528651 943230116"
#> [1] "296 33867291 33861749 947628162 943230116"
#> [1] "421 33867003 33862299 897988541 943230116"
#> [1] "472 33867781 33862219 947628162 943230116"
#> [1] "2440 33867773 33862779 897988541 943230116"
[...]
#> [1] "3583902 33882501 33877289 947628162 943230116"

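Please note that I don't know whether this data is correct or meaningful. The milliseconds column does look plausible, though: it ends near 3,600,000 ms, which is about 3600 seconds, or one hour.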
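
One way to sanity-check the time column is to turn the millisecond offsets into timestamps, similar to what the Python scripts do. This is a sketch under assumptions: `add_timestamp` is a hypothetical helper, and the `hour_start` value is a placeholder chosen for illustration, so confirm which calendar hour a given URL path actually denotes before relying on it.

```r
# Add a UTC timestamp column from millisecond offsets within the hour.
# `add_timestamp` is a hypothetical helper name.
add_timestamp <- function(ticks, hour_start) {
  stopifnot(inherits(hour_start, "POSIXct"))
  ticks$UTC <- hour_start + ticks$TIME / 1000  # POSIXct arithmetic is in seconds
  ticks
}

# First and last TIME values from the output above.
ticks <- data.frame(TIME = c(246, 3583902))
hour_start <- as.POSIXct("2022-07-15 23:00:00", tz = "UTC")  # placeholder value
out <- add_timestamp(ticks, hour_start)
```

Both resulting timestamps fall inside the same hour, which is consistent with one file per hour.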

Created on 2022-09-05 with reprex v2.0.2
