Why is statistics.mean() so slow?

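I compared the performance of the mean function from the statistics module with the simple sum(l)/len(l) approach and found that mean is very slow for some reason. I used timeit with the two code snippets below to compare them. Does anyone know what causes such a massive difference in execution speed? I'm using Python 3.5.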

from timeit import repeat
print(min(repeat('mean(l)',
'''from random import randint; from statistics import mean; 
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.043 seconds on my machine.

from timeit import repeat
print(min(repeat('sum(l)/len(l)',
'''from random import randint; from statistics import mean; 
l=[randint(0, 10000) for i in range(10000)]''', repeat=20, number=10)))

The code above executes in about 0.000565 seconds on my machine.

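Python's statistics module is not built for speed, but for precision.

The specification for this module states:

The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this "torture test":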

assert mean([1e30, 1, 3, -1e30]) == 1

returning 0 instead of 1, a purely computational error of 100%.

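Using math.fsum inside mean would make it more accurate with float data, but it also has the side effect of converting any arguments to float even when that is unnecessary. For example, we should expect the mean of a list of Fractions to be a Fraction, not a float.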
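To make the trade-off described in the quote concrete, here is a small sketch comparing the three approaches on the torture-test data. The helper naive_mean is a hypothetical name introduced only for this illustration; it is written as an explicit loop because newer Python versions may use compensated summation inside the built-in sum, which would hide the effect.

```python
import math
import statistics
from fractions import Fraction


def naive_mean(values):
    # Plain left-to-right float accumulation (illustrative helper,
    # not part of the statistics module).
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


data = [1e30, 1, 3, -1e30]
print(naive_mean(data))               # 1 and 3 are absorbed by 1e30
print(math.fsum(data) / len(data))    # fsum tracks the lost low-order bits
print(statistics.mean(data))          # exact rational arithmetic internally

fracs = [Fraction(1, 3), Fraction(2, 3)]
print(type(math.fsum(fracs) / len(fracs)).__name__)  # fsum always yields a float
print(repr(statistics.mean(fracs)))                  # mean keeps the Fraction type
```

On this input math.fsum gets the right answer too, but it cannot preserve the Fraction type, which is the point the quoted text makes.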

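Conversely, if we take a look at the implementation of _sum() in this module, the first lines of the function's docstring seem to confirm that: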

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.
    [...] """

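So yes, the statistics implementation of sum, instead of being a simple one-line call to Python's built-in sum() function, takes about 20 lines by itself, with a nested for loop in its body.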

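This happens because statistics._sum chooses to guarantee maximum precision for all the types of number it could encounter (even if they differ widely from one another), instead of simply emphasizing speed.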
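As a rough illustration of why that guarantee is expensive, here is a minimal sketch of the same idea: convert every value to an exact integer ratio, accumulate numerators per denominator, and only combine them as Fractions at the end. The name exact_mean is hypothetical, and this is a simplification rather than the module's actual code (it does no type tracking, does not handle NaN or infinity, and raises ZeroDivisionError on empty input).

```python
from fractions import Fraction


def exact_mean(data):
    # Map each denominator to the running sum of its numerators,
    # so the inner loop only does integer additions.
    partials = {}
    count = 0
    for x in data:
        f = Fraction(x)  # exact for ints and finite floats
        partials[f.denominator] = partials.get(f.denominator, 0) + f.numerator
        count += 1
    # Combine the per-denominator partial sums exactly.
    total = sum(Fraction(n, d) for d, n in partials.items())
    return float(total / count)


print(exact_mean([1e30, 1, 3, -1e30]))
```

Every element goes through a Fraction conversion and a dictionary update in pure Python, which is far more work per item than the C loop behind the built-in sum.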

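It therefore seems normal that the built-in sum turns out to be a hundred times faster. The price is much lower precision if you happen to call it with exotic numbers.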

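Other options

If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C.

NumPy's mean is not nearly as precise as statistics, but (since 2013) it implements a routine based on pairwise summation, which is better than a naive sum/len (more info in the link).

However...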

import numpy as np
import statistics
np_mean = np.mean([1e30, 1, 3, -1e30])
statistics_mean = statistics.mean([1e30, 1, 3, -1e30])
print('NumPy mean: {}'.format(np_mean))
print('Statistics mean: {}'.format(statistics_mean))
> NumPy mean: 0.0
> Statistics mean: 1.0
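For readers unfamiliar with pairwise summation, here is a minimal pure-Python sketch of the technique. It is not NumPy's routine (which is written in C and organized differently); the names pairwise_sum and naive_sum and the block size of 8 are assumptions chosen for illustration.

```python
import math


def naive_sum(values):
    # Plain left-to-right accumulation, for comparison.
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, block=8):
    # Small runs are summed directly; larger runs are split in half and
    # the two halves are summed recursively, so rounding errors grow
    # roughly with log(n) instead of n.
    n = len(values)
    if n <= block:
        return naive_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum of the stored floats
print(abs(naive_sum(data) - reference))
print(abs(pairwise_sum(data) - reference))
```

Pairwise summation reduces accumulated rounding error on well-behaved data, but it does not rescue inputs with catastrophic cancellation such as [1e30, 1, 3, -1e30], which is consistent with the NumPy result shown above.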

If you care about speed, use numpy/scipy/pandas instead:

In [119]: from random import randint; from statistics import mean; import numpy as np;
In [122]: l=[randint(0, 10000) for i in range(10**6)]
In [123]: mean(l)
Out[123]: 5001.992355
In [124]: %timeit mean(l)
1 loop, best of 3: 2.01 s per loop
In [125]: a = np.array(l)
In [126]: np.mean(a)
Out[126]: 5001.9923550000003
In [127]: %timeit np.mean(a)
100 loops, best of 3: 2.87 ms per loop

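Conclusion: it will be orders of magnitude faster (in my example it was 700 times faster), but possibly not as precise, since NumPy doesn't use the Kahan summation algorithm.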
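For reference, the Kahan summation algorithm mentioned here can be sketched in a few lines. This is the textbook form of the algorithm, not code taken from NumPy or the statistics module, and the function names are hypothetical.

```python
import math


def naive_sum(values):
    # Plain left-to-right accumulation, for comparison.
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = v - compensation            # re-inject the previously lost part
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum of the stored floats
print(abs(naive_sum(data) - reference))
print(abs(kahan_sum(data) - reference))
```

The extra arithmetic per element is the cost of the improved accuracy, which is one reason fast numerical libraries may prefer cheaper schemes.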

I asked the same question a while back, but once I noticed the _sum function that mean calls on line 317 of the source, I understood why:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:
    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:
    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)
    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)

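Compared to just calling the built-in sum, a multitude of operations take place, because, as the docstring says, mean computes a high-precision sum.

You can see that using mean rather than sum can give you different output: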

In [7]: l = [.1, .12312, 2.112, .12131]
In [8]: sum(l) / len(l)
Out[8]: 0.6141074999999999
In [9]: mean(l)
Out[9]: 0.6141075

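Both len() and sum() are Python built-in functions (with limited functionality) that are written in C and, more importantly, are optimized to work fast with certain types or objects (such as list).

You can look at the implementation of the built-in functions here: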

https://hg.python.org/sandbox/python2.7/file/tip/Python/bltinmodule.c

mean() is a high-level function written in Python. Take a look at how it is implemented:

https://hg.python.org/sandbox/python2.7/file/tip/Lib/statistics.py

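You can see that it internally uses another function called _sum(), which performs a few additional checks compared to the built-in functions.

If you want a faster mean function, the statistics module introduced the fmean function in Python 3.8. It converts its data to float before computing the mean.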

(Implementation here)

Quick comparison:

import timeit, statistics
def test_basic_mean(): return sum(range(10000)) / 10000
def test_mean(): return statistics.mean(range(10000))
def test_fmean(): return statistics.fmean(range(10000))
print("basic mean:", min(timeit.repeat(stmt=test_basic_mean, setup="from __main__ import test_basic_mean", repeat=20, number=10)))
print("statistics.mean:", min(timeit.repeat(stmt=test_mean, setup="from __main__ import statistics, test_mean", repeat=20, number=10)))
print("statistics.fmean:", min(timeit.repeat(stmt=test_fmean, setup="from __main__ import statistics, test_fmean", repeat=20, number=10)))

This gives me:

basic mean: 0.0013072469737380743
statistics.mean: 0.025932796066626906
statistics.fmean: 0.001833588001318276
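The speed of fmean comes with the float conversion mentioned above. A small sketch of what that means for the result type, assuming Python 3.8 or later (the variable names are only illustrative):

```python
import statistics
from fractions import Fraction

fracs = [Fraction(1, 3), Fraction(2, 3)]
exact = statistics.mean(fracs)   # keeps the input type
fast = statistics.fmean(fracs)   # converts everything to float first

ints = list(range(10))
print(repr(exact), repr(fast))
print(repr(statistics.mean(ints)), repr(statistics.fmean(ints)))
```

So fmean is a reasonable choice when a float result is acceptable, while mean remains the one to use when the input type (Fraction or Decimal) has to be preserved.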

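According to this post: Calculating the arithmetic mean in Python

It should be "due to a particularly precise implementation of the sum operator in statistics".

The mean function is coded with an internal _sum function, which is supposed to be more precise than normal addition but is much slower (code available here: https://hg.python.org/cpython/file/3.5/Lib/statistics.py).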

The PEP (https://www.python.org/dev/peps/pep-0450/) specifies that, for this module, accuracy is considered more important than speed.
