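I have a raw binary file that is several gigabytes in size, and I am trying to process it in chunks. Before I can start processing the data, I have to remove the header it carries. Because of the raw binary format, string methods such as .find, or checking a chunk for a string, do not work. I would like to strip the header automatically, but its length can vary, and my current approach of looking for the last newline character does not work because the raw binary data itself contains matching bytes.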
Data format:
BEGIN_HEADER\r\n
header of various line count\r\n
HEADER_END\r\n raw data starts here
How I am reading the file:
filename = "binary_filename"
chunksize = 1024
with open(filename, "rb") as f:
    chunk = f.read(chunksize)
    # Iterating over bytes yields integers, so compare against ord('\n')
    for index, byte in enumerate(chunk):
        if byte == ord('\n'):
            print("found one " + str(index))
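Is there a simple way to pick out the HEADER_END\r\n line without sliding a byte array through the file? My current approach:
chunk = f.read(chunksize)
index = 0
not_found = True
# Slide a 12-byte window over the chunk, stopping at the end of the chunk
while not_found and index + 12 <= len(chunk):
    if chunk[index:index+12] == b'HEADER_END\r\n':
        print("found")
        not_found = False
    index += 1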
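A note on the sliding comparison above: in Python 3, data read from a file opened with "rb" is a bytes object, and bytes has its own find method that accepts a bytes pattern. The sketch below is a minimal illustration under that assumption; find_header_end is a hypothetical helper name, and the sample data is made up for the example.

```python
# Hypothetical helper: locate the end of the header inside one chunk of bytes.
HEADER_END = b"HEADER_END\r\n"

def find_header_end(chunk):
    """Return the offset of the first raw-data byte, or -1 if the tag is absent."""
    pos = chunk.find(HEADER_END)   # bytes.find takes a bytes pattern
    if pos == -1:
        return -1
    return pos + len(HEADER_END)

# Made-up sample: a short header followed by raw bytes that contain a newline.
sample = b"BEGIN_HEADER\r\nsome header line\r\nHEADER_END\r\n\x00\x01\n\xff"
offset = find_header_end(sample)
raw = sample[offset:]
```

This only covers a tag that lies entirely inside one chunk; a tag split across two reads would need the tail of the previous chunk kept around and prepended to the next one.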
You can use linecache:
import linecache
# Note: linecache reads the file as text, so this assumes the contents decode cleanly
currentline = 1  # linecache line numbers start at 1
while linecache.getline("file.bin", currentline) != "HEADER_END\n":
    currentline = currentline + 1
# print raw data
currentline = currentline + 1
rawdata = linecache.getline("file.bin", currentline)
currentrawdata = rawdata
# getline returns an empty string once we run past the last line
while currentrawdata:
    currentrawdata = linecache.getline("file.bin", currentline + 1)
    rawdata = rawdata + currentrawdata
    currentline = currentline + 1
print(rawdata)
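Update: We can split the problem into two parts: first strip the header, then read what is left in chunks:
# Binary mode keeps the raw bytes free of decoding and newline translation
with open('test_file.bin', 'rb') as f:
    lines = f.readlines()
currentline = 0
while lines[currentline] != b"HEADER_END\r\n":
    currentline = currentline + 1
# Skip the HEADER_END line itself and keep everything after it
with open('newfile.bin', 'wb') as out:
    out.writelines(lines[currentline + 1:])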
This creates a file (newfile.bin) that contains only the raw data. It can now be read directly in chunks:
chunksize = 1024
with open('newfile.bin', "rb") as f:
    chunk = f.read(chunksize)
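The snippet above reads only the first chunk. A minimal sketch of looping over every chunk follows; read_chunks is a hypothetical helper name, and the temporary file only stands in for newfile.bin so that the example runs on its own.

```python
import os
import tempfile

def read_chunks(path, chunksize=1024):
    """Yield successive chunks of at most chunksize bytes from a binary file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:          # an empty bytes object means end of file
                break
            yield chunk

# Stand-in for newfile.bin: 2500 bytes of made-up raw data.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(250)) * 10)
    demo_path = tmp.name

sizes = [len(c) for c in read_chunks(demo_path)]
os.remove(demo_path)
```

Because the function is a generator, only one chunk is held in memory at a time, which matters for a file of several gigabytes.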
Update 2: It is also possible without an intermediate file:
# Size of each chunk in bytes
chunksize = 20
filename = 'test_file.bin'
endHeaderTag = b"HEADER_END\r\n"
# Find the line that holds HEADER_END (binary mode keeps the bytes untouched)
with open(filename, 'rb') as f:
    lines = f.readlines()
currentline = 0
while lines[currentline] != endHeaderTag:
    currentline = currentline + 1
currentline = currentline + 1
# Now currentline is the index of the first line of raw data
# Join the lines after the header into a single bytes object
header_stripped = b"".join(lines[currentline:])
# Lastly, slice successive chunks and store them in the chunks list;
# the last chunk may be shorter than chunksize
chunks = []
for start in range(0, len(header_stripped), chunksize):
    chunks.append(header_stripped[start:start + chunksize])
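For a file of several gigabytes, readlines() holds everything in memory at once. As an alternative sketch (not the method from the answer above), the header can be skipped line by line with readline() on a file opened in binary mode, after which the remaining bytes are read in fixed-size chunks. iter_raw_chunks is a hypothetical name, the in-memory file is made up for the example, and the sketch assumes the HEADER_END\r\n tag sits on its own line.

```python
import io

END_TAG = b"HEADER_END\r\n"

def iter_raw_chunks(f, chunksize=1024):
    """Skip the header of an open binary file, then yield the raw data in chunks."""
    # readline() on a binary file stops after each b"\n", so header lines end in b"\r\n".
    for line in iter(f.readline, b""):
        if line == END_TAG:
            break
    else:
        # The loop ran to end of file without seeing the tag.
        raise ValueError("HEADER_END tag not found")
    # The file position now sits on the first raw-data byte.
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk

# Made-up in-memory file standing in for the real one (30 bytes of raw data).
demo = io.BytesIO(
    b"BEGIN_HEADER\r\nline 1\r\nline 2\r\nHEADER_END\r\n" + b"\x00\n\xff" * 10
)
chunks = list(iter_raw_chunks(demo, chunksize=8))
```

With a real file, the same function would be called as iter_raw_chunks(open(filename, "rb")). Newline bytes inside the raw data are harmless here, because line-based reading stops as soon as the tag has been seen.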