Get the top 5 elements from a text file that stores billions of numbers, without storing them in a variable

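I need to find the top 5 numbers in a text file that stores billions of numbers. The numbers are either comma-separated or newline-separated. Because of memory constraints, I cannot store the entire contents of the file in a variable.
I used a generator with a batch size of 5, so each call to next(result_generator) gives me 5 elements from the text file.
On the first call to next(result_generator), I get 5 elements and sort them. I treat these as the top 5.
On the next call to next(result_generator), I get another 5. I combine them with the previous 5, sort the combined list, and pick the top 5 of those 10.
Similarly, I take the next 5 and combine them with the previous 5 to get the top 5, until next(result_generator) returns None.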

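The problem I am facing is that the generator does not work as expected: it does not pick up the next 5 elements. On the second call to next(result_generator), it raises an Exception. I tried the same thing against a database, and it worked fine there, so I suspect something is wrong with the file handling. I am using a random function to generate numbers and writing them to a text file as sample input.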

Code to generate random numbers in a text file:

import random

count = 500
f = open('billion.txt', 'w')
while count > 1:
    a = random.randint(1, 1000)
    f.write(str(a) + "\n")  # one number per line
    count -= 1
f.close()

Code to find the top 5 elements from the text file:

result = []
full_list = []
final_list = []
def result_generator(batchsize=5):
    while True:
        global result
        global full_list
        global final_list
        result = sorted([int(next(myfile).rstrip()) for x in range(batchsize)], reverse=True)
        final_list = sorted(full_list + result, reverse=True)[:5]
        full_list = result.copy()
        # print("result list is : {}".format(final_list))
        if not final_list:
            break
        else:
            yield final_list

with open("billion.txt") as myfile:
    result = result_generator()
    print("datatype is :", type(result))
    print("result is ", next(result))
    for i in range(0, 2):
        try:
            for each in next(result):
                print("Row {} is :".format(each))
        except StopIteration:
            print("stop iteration")
        except Exception:
            print("Some different issue")

For example, given the input:

131, 205, 65, 55, 222, 278, 672, 902, 69, 26

Expected result: [902, 672, 278, 222, 205]
Actual result: [222, 205, 131, 65, 55]
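The post does not show which exception is raised, so the following is only a sketch of the batching scheme under stated assumptions, not a diagnosis. Three things in the code above are known to cause trouble: int() raises ValueError on a line such as "205,65,55"; on Python 3.7 and later, a StopIteration escaping from next(myfile) inside a generator is turned into a RuntimeError; and full_list = result.copy() keeps only the latest batch instead of the running top 5. The helper names iter_numbers and running_top are hypothetical, and the sample lines are made up to match the example input.

```python
import re
from itertools import islice

def iter_numbers(lines):
    """Yield ints from lines whose numbers are separated by commas and/or whitespace."""
    for line in lines:
        for token in re.split(r"[,\s]+", line.strip()):
            if token:  # skip empty tokens from blank lines or trailing commas
                yield int(token)

def running_top(numbers, n=5, batchsize=5):
    """Yield the n largest values seen so far after each batch (hypothetical helper)."""
    numbers = iter(numbers)
    top = []
    while True:
        # islice returns a short or empty batch at end of input instead of raising
        batch = list(islice(numbers, batchsize))
        if not batch:
            return
        # carry the running top n forward, not just the previous batch
        top = sorted(top + batch, reverse=True)[:n]
        yield top

# Assumed sample input mixing newline and comma separators
sample = ["131", "205,65,55", "222", "278", "672", "902,69,26"]
for top in running_top(iter_numbers(sample)):
    print(top)
```

With a real file, an open file object can be passed in place of sample, since iterating over it yields one line at a time.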

Why not use heapq?

Given a file such as file.txt,

you can iterate over the file as usual and do:

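import heapq

data = []  # min-heap holding at most N of the largest numbers seen so far
N = 5
result = []
# Assuming numbers are each on a new line
with open('file.txt', 'r') as f:
    for line in f:
        heapq.heappush(data, int(line.strip()))
        if len(data) > N:
            heapq.heappop(data)  # drop the smallest, keeping the N largest
# Pop in ascending order, then reverse so the largest comes first
while data:
    result.append(heapq.heappop(data))
result.reverse()
print(result)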

[902, 672, 278, 222, 205]

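This uses O(N) memory and O(M log N) time, where M is the number of values in the file (billions, in your case) and N is how many of the largest numbers you want.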
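Since the question says the numbers may be comma-separated as well as newline-separated, here is a minimal sketch of the same bounded-heap idea using the standard library's heapq.nlargest, which consumes any iterable lazily and keeps only n items at a time. The helper names iter_numbers and top_n and the in-memory sample are assumptions made for illustration.

```python
import heapq
import io
import re

def iter_numbers(lines):
    """Yield ints lazily from lines separated by commas and/or whitespace."""
    for line in lines:
        for token in re.split(r"[,\s]+", line.strip()):
            if token:  # ignore empty tokens from blank lines
                yield int(token)

def top_n(lines, n=5):
    """Return the n largest numbers, largest first (hypothetical helper)."""
    return heapq.nlargest(n, iter_numbers(lines))

# Assumed sample standing in for the real file
sample = io.StringIO("131\n205,65,55\n222\n278\n672\n902,69,26\n")
print(top_n(sample))
```

For the real input, top_n can be called on the object returned by open('file.txt'), which is read one line at a time in the same way.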
