How to read chunks from the middle of a long CSV file (200 GB+) with Python



I have a large csv file that I am reading in chunks. Partway through, the process ran out of memory, so I want to restart from where it left off. I know which chunk I stopped at, but not how to jump directly to it.

Here is what I have tried:

import os
import pandas as pd

# data is the path to the tab-separated txt file
reader = pd.read_csv(data,
                     delimiter="\t",
                     chunksize=1000)

# When my last run broke, the chunk index i was 154, so with a
# chunksize of 1000 it should restart from the 154000th line.
# This time I don't plan to read the whole file at once, so I
# have an end point at 160000.
first = 154 * 1000
last = 160 * 1000
output_path = 'usa_hotspot_data_' + str(first) + '_' + str(last) + '.csv'
print("Output file: ", output_path)
try:
    os.remove(output_path)
except OSError:
    pass

# Read chunks and save to a new csv
for i, chunk in enumerate(reader):
    if first <= i * 1000 <= last:
        # < -- here I do something -- >
        pass
    # Progress bar to keep track
    if i % 1000 == 0:
        print("#", end='')

However, it takes a long time just to reach the i-th chunk I want. How can I skip the chunks already read and jump there directly?

pandas.read_csv

skiprows: line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file.

You can pass skiprows to read_csv, and it will act as an offset.
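A minimal sketch of this approach, using an in-memory tab-separated sample in place of the 200 GB file (the sample data and the restart point `first` are made-up stand-ins; in practice you would pass your file path and `first = 154 * 1000`):

```python
import io
import pandas as pd

# Hypothetical small stand-in for the large tab-separated file:
# 100 data lines, no header row.
sample = "\n".join(f"{n}\t{n * 2}" for n in range(100))
data = io.StringIO(sample)

first = 40  # line to resume from (0-indexed)

# skiprows with a range skips everything before the restart point,
# so the very first chunk yielded already starts at line `first` —
# no time is spent iterating over the earlier chunks.
reader = pd.read_csv(data,
                     delimiter="\t",
                     header=None,
                     skiprows=range(first),
                     chunksize=10)

chunk = next(reader)
print(chunk.iloc[0, 0])  # → 40, the first line after the skipped region
```

Note that if the file has a header row you want to keep, skip `range(1, first + 1)` instead, so row 0 still supplies the column names.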
