Most efficient way to calculate the mean of a large array?

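I have some large experimental data .csv files, ranging from 30 MB to 3 GB in size. I have successfully read them with pandas and done some other calculations on the data. Now I have a very long 1D array, and I need to take its mean.

By default I used statistics.mean(array), but this seems to take a very long time to run.

By testing the individual parts of my code, I know that the line statistics.mean(array) is the one taking so long.

Is there a more efficient way to calculate the mean of a large data set?

Thanks!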

import statistics as stats  # assumed alias; the original import is not shown

def GetMean(ionVelocityArray):
    return stats.mean(ionVelocityArray)

This function has already been running for 2 hours on a 30 MB file.

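It depends on the size of the array. You can loop over it, summing the elements, and then divide by the size of the array at the end: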

def GetMean(ionVelocityArray):
    total = 0
    for value in ionVelocityArray:
        total += value  # accumulate each element, not a count of elements
    return total / len(ionVelocityArray)

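But if it has more than 20k elements, I would sort the array and use the interquartile range to estimate the mean. Alternatively, if there are duplicate values, once the array is sorted you can store it in a dictionary where each key is an element of the list and each value is its count, and use that to compute the mean.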
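The dictionary-of-counts idea above can be sketched roughly as follows. This is only an illustration of that suggestion, not a recommendation: the function name `mean_from_counts` and the sample values are hypothetical, and it assumes a non-empty sequence of hashable numbers. No sorting is needed to build the counts.

```python
from collections import Counter

def mean_from_counts(values):
    """Mean computed from a value -> count mapping (hypothetical helper)."""
    # Count how many times each distinct value occurs
    counts = Counter(values)
    # Weighted sum: each distinct value multiplied by its count
    total = sum(value * count for value, count in counts.items())
    return total / len(values)

# Small made-up sample with repeated values
sample = [2.0, 2.0, 4.0, 4.0, 4.0, 8.0]
print(mean_from_counts(sample))
```

Building the counts still visits every element once, so this is unlikely to beat a plain sum unless the counts are reused for something else.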

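I ran a simple test comparing statistics.mean and numpy.mean on a random 1D array of 1e8 floats, which occupies 800 MB in memory. I am using an ordinary laptop.

Computing the mean with the statistics library took 95 s, while numpy took only 0.11 s. Here is the code: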

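import statistics
from time import perf_counter

import numpy as np

# 1e8 random floats (float64, 8 bytes each)
test_array = np.random.rand(100_000_000)
size_in_memory = test_array.size * test_array.itemsize * 1e-6  # in MB

# Time the standard-library implementation
t0 = perf_counter()
mean_statistics = statistics.mean(test_array)
tf_statistics = perf_counter() - t0

# Time the NumPy implementation
t0 = perf_counter()
mean_numpy = np.mean(test_array)
tf_numpy = perf_counter() - t0

print(f"array size: {size_in_memory:.0f} MB")
print(f"statistics.mean: {tf_statistics:.2f} s, numpy.mean: {tf_numpy:.2f} s")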

If you can, use numpy.
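As a minimal sketch of what that could look like for the function in the question: the name `get_mean` and the sample values below are hypothetical, and the sketch assumes the data fits in memory as a 1D float array. A pandas Series can be passed in as well, since `np.asarray` accepts it.

```python
import statistics

import numpy as np

def get_mean(values):
    """Arithmetic mean of a 1D sequence, computed with NumPy."""
    # np.asarray does not copy when `values` is already a float64 ndarray
    arr = np.asarray(values, dtype=np.float64)
    return float(arr.mean())

# Small made-up sample standing in for the real ion velocity data
sample = [1.5, 2.5, 3.0, 5.0]
print(get_mean(sample))
# Standard-library alternative (Python 3.8+) that works in floats throughout
print(statistics.fmean(sample))
```

If NumPy is not available, `statistics.fmean` is the faster floating-point counterpart of `statistics.mean` in the standard library.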
