Python 3: how to split a large text file into smaller files based on line content



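I have a file containing data with the columns

# FULL_ID BJD MAG UNCERT FLAG

and nearly 12,000 lines. The table contains data for 32 objects, each identified by a unique FULL_ID. For example, it might look like this: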

# FULL_ID   BJD        MAG      UNCERT      FLAG
2_543     3215.52    19.78    0.02937     OO
2_543     3215.84    19.42    0.02231     OO
3_522     3215.52    15.43    0.01122     OO
3_522     3222.22    16.12    0.01223     OO

What I want is for the code to run through this file, BigData.dat, and end up with multiple files, e.g. 2_543.dat, 3_522.dat and so on, each containing:

# BJD    MAG    UNCERT    FLAG

for all the lines of BigData.dat that belong to that FULL_ID.

Currently I'm doing this:

with open(path, 'r') as BigFile:
    line = BigFile.readline()
    for line in BigFile:
        fields = line.split(None)
        id = fields[0]
        output = open(id+".dat", 'a')
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()

It does produce the correct output, but the files have no header line: # BJD MAG UNCERT FLAG

How can I make sure this line is at the top of each file?

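Opening a file is an expensive operation, and repeating it for every input line is inefficient. Instead, I would keep a mapping from the FULL_ID values seen so far to file objects. If a FULL_ID is not yet in the mapping, its file has to be opened in "w" mode and the header should be written immediately. This way:

  1. The header is correctly written to each output file
  2. If the script is run more than once, old values in the output files are correctly erased

The code could be: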

with open(path) as bigFile:
    outfiles = {}         # mapping FULL_ID -> output file
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    for line in bigFile:
        row = line.split()
        try:
            output = outfiles[row[0]]
        except KeyError:
            output = open(f'{row[0]}.dat', 'w')
            print(header, file=output)
            outfiles[row[0]] = output
        print(' '.join(row[1:]), file=output)
    for output in outfiles.values():               # close all files before exiting
        output.close()
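
As a hedged variant of the same idea, the sketch below wraps the logic in a function and uses contextlib.ExitStack so that every output file is closed even if an exception interrupts the loop. The function name split_by_id and the out_dir parameter are made up for illustration, and the code assumes the header line starts with a standalone '#' followed by FULL_ID, as in the sample above.

```python
from contextlib import ExitStack
from pathlib import Path


def split_by_id(path, out_dir="."):
    """Split the table at `path` into one <FULL_ID>.dat file per object.

    Hypothetical helper: the name and the out_dir parameter are assumptions.
    Returns the sorted list of FULL_ID values that were written.
    """
    out_dir = Path(out_dir)
    outfiles = {}  # mapping FULL_ID -> open output file
    with ExitStack() as stack:
        big_file = stack.enter_context(open(path))
        # Drop the '#' token and the FULL_ID column name from the input header.
        header = " ".join(["#"] + next(big_file).split()[2:])
        for line in big_file:
            row = line.split()
            if not row:  # skip blank lines
                continue
            full_id = row[0]
            if full_id not in outfiles:
                # First time this ID is seen: truncate the file, write the header.
                outfiles[full_id] = stack.enter_context(
                    open(out_dir / f"{full_id}.dat", "w")
                )
                print(header, file=outfiles[full_id])
            print(" ".join(row[1:]), file=outfiles[full_id])
    # Leaving the with-block closes the input file and every output file.
    return sorted(outfiles)
```

Calling split_by_id('BigData.dat') would then write the per-object files into the current directory. Like the version above, it keeps one file open per FULL_ID until the input is exhausted.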

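The limitation is that you have to keep all the files open until the end of the input file. That should be fine for 32 objects, but it will break for larger numbers, because the operating system limits how many files a process may have open at once. A way around this is to replace the simple dict with a more sophisticated cache that can close an open file when its capacity is exhausted and reopen it (in append mode) if it is needed again.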
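
To get a feel for where that limit lies, the per-process cap on open file descriptors can be inspected with the standard-library resource module. This is only a small sketch: the module exists on Unix-like systems only (not on Windows), and the values it reports depend on the system configuration.

```python
import resource  # standard library, available on Unix-like systems only

# RLIMIT_NOFILE is the maximum number of file descriptors the process may hold.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the one that applies in practice; standard input, output and error already count against it.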


Here is a possible cache implementation:

class FileCache:
    """Caches a number of open files referenced by string ids.
    (by default the id is the file name)"""
    def __init__(self, size, namemapping=None, header=None):
        """Creates a new cache holding at most `size` open files.
        namemapping is a function that gives the filename from an id
        header is an optional header that will be written at creation
        time
        """
        self.size = size
        self.namemapping = (namemapping if namemapping is not None
                            else lambda x: x)
        self.header = header
        self.map = {}             # dict id -> slot number (-1 if closed)
        self.slots = [(None, None)] * size   # list of pairs (id, file object)
        self.curslot = 0          # next slot to be used
    def getFile(self, id):
        """Gets an open file from the cache.
        Returns it directly if it is already open, reopens it in append
        mode if it was closed earlier, and creates it in truncate mode
        (adding it to the cache) if the id is new."""
        try:
            slot = self.map[id]
            if slot != -1:
                return self.slots[slot][1]   # found and active
            mode = 'a'                       # need re-opening
        except KeyError:
            mode = 'w'                       # new id: create file
        slot = self.curslot
        self.curslot = (slot + 1) % self.size
        if self.slots[slot][0] is not None:  # close the previous occupant, if any
            self.slots[slot][1].close()
            self.map[self.slots[slot][0]] = -1
        fd = open(self.namemapping(id), mode)
        # if file is new, write the optional header
        if (mode == 'w') and self.header is not None:
            print(self.header, file=fd)
        self.slots[slot] = (id, fd)
        self.map[id] = slot
        return fd
    def close(self):
        """Closes any cached file."""
        for id, fd in self.slots:
            if fd is not None:    # skip slots that were never used
                fd.close()
                self.map[id] = -1
        self.slots = [(None, None)] * self.size

The code above then becomes:

with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    outfiles = FileCache(10, lambda x: x+'.dat', header) # cache FULL_ID -> file
    for line in bigFile:
        row = line.split()
        output = outfiles.getFile(row[0])
        print(' '.join(row[1:]), file=output)
    outfiles.close()               # close all files before exiting

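You are overwriting the header line in the for loop; keep it in a separate variable. In addition, you can remember whether the header has already been written to each file: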

path = 'big.dat'
header_written = []
with open(path, 'r') as BigFile:
    header = BigFile.readline()  # keep header separately!
    for line in BigFile:
        fields = line.split(None)
        _id = fields[0]
        output = open(_id+".dat", 'a')
        if _id not in header_written:  # check and save the ID to keep track if header was written
            output.write(header)
            header_written.append(_id)
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()

File:

# FULL_ID   BJD        MAG      UNCERT      FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO
