如何使这个python脚本快速?(与分支预测相关的基准测试来自这里的一篇帖子)

从这里开始——一个分支预测问题，我开始编写Python版本的程序，以检查Python中排序/未排序版本的运行时间。我试着先排序。

这是代码：

import time
from random import *
arraysize = 327
data = []
for  c in range(arraysize):
data.append( randint( 1 , 256 )  ) 

## !!! with this, the next loop runs faster
data.sort()
## test
start_time = time.clock()
sum = 0

for i in range(100000):
for c in range(arraysize):
if data[c] >= 128:
sum += data[c]

print time.clock() - start_time

我不确定我的简单计时方法的准确性，但它似乎已经足够好了。当我设置arraysize = 32768时，我第一次等待了超过20分钟！！超过20分钟！但有了arraysize = 327，我得到了8.141656691s的时间。

如果我的代码中有错误，或者以某种方式使用Numpy/Scipy是否会加快速度，请纠正我。谢谢

我从@mgilson的答案开始，并对其进行了一些修改。我想测试"决策位"和查找表技术，正如我在回答原始问题时所讨论的那样：https://stackoverflow.com/a/17782979/166949

我对原作做了一些改动。有些只是反映我个人喜好的风格。然而，我在原始代码中发现了错误，我认为测量正确的代码很重要。

我让Python代码现在从命令行获取参数，并编写了一个shell脚本，该脚本使用Python 2.x、Python 3.x和PyPy运行Python脚本。确切的版本是：Python 2.7.6、Python 3.4.0和PyPy 2.7.3

我在Linux Mint 17 64位版本上进行了测试。CPU是AMD Phenom 9850，运行频率为2.5 GHz，内存为16 GB。根据uname -r，Linux内核版本为：3.13.0-24通用

我让代码从命令行获取参数的原因是327是一个很短的列表。我认为当列表更长时，sum()和生成器表达式会做得更好，所以我从命令行传递列表大小和试用次数。结果显示了哪些Python，以及列表的长度和试用次数。

结论：令我惊讶的是，即使有很长的列表，Python在列表理解方面使用sum()的速度也是最快的！运行生成器会有一些开销，这似乎比构建列表然后将其拆下的开销慢。

然而，如果列表变得非常大，我预计生成器将开始执行列表理解。有了一百万个随机值的列表，listcomp仍然更快，但当我使用1600万个随机值时，genexp变得更快。对于较短的列表，生成器表达式的速度惩罚并不大。因此，我仍然喜欢生成器表达式作为解决Python中这个问题的首选方法。

有趣的是，PyPy的表查找速度最快。这是有道理的：这是我在C中找到的最快的方法，PyPy从JIT生成本机代码。

对于具有虚拟机的CPython来说，调用单个操作比调用多个操作更快；PythonVM的开销可能超过更昂贵的基本操作。因此，整数除法比比特掩蔽加比特移位更快，因为除法是单个操作。但在PyPy中，比特掩蔽+移位比除法快得多。

此外，在CPython中，使用sum()可以让代码在C内部运行，因此速度非常快；但在PyPy中，sum()比只写一个straighborward循环要慢，JIT可能会把它变成一个非常快的本地循环。我的猜测是，PyPy很难将生成器机制挖掘和优化为本地代码。

shell脚本：

for P in python python3 pypy; do
echo "$P ($1, $2)"
$P test_branches.py $1 $2
echo
done

Python代码：

import random
import sys
import timeit
try:
RANGE = xrange
except NameError:
RANGE = range
if len(sys.argv) != 3:
print("Usage: python test_branches.py <length_of_array> <number_of_trials>")
sys.exit(1)
TEST_DATA_LEN = int(sys.argv[1])
NUM_REPEATS = int(sys.argv[2])
_test_data = [random.randint(0,255) for _ in RANGE(TEST_DATA_LEN)]
def test0(data):
"""original way"""
total = 0
for i in RANGE(TEST_DATA_LEN):
if data[i] >= 128:
total += data[i]
return total

def test1(data):
"""better loop"""
total = 0
for n in data:
if n >= 128:
total += n
return total
def test2(data):
"""sum + generator"""
return sum(n for n in data if n >= 128)
def test3(data):
"""sum + listcomp"""
return sum([n for n in data if n >= 128])
def test4(data):
"""decision bit -- bit mask and shift"""
lst = [0, 0]
for n in data:
lst[(n & 0x80) >> 7] += n
return lst[1]
def test5(data):
"""decision bit -- division"""
lst = [0, 0]
for n in data:
lst[n // 128] += n
return lst[1]
_lut = [0 if n < 128 else n for n in RANGE(256)]
def test6(data):
"""lookup table"""
total = 0
for n in data:
total += _lut[n]
return total
def test7(data):
"""lookup table with sum()"""
return sum(_lut[n] for n in data)
test_functions = [v for k,v in globals().items() if k.startswith("test")]
test_functions.sort(key=lambda x: x.__name__)
correct = test0(_test_data)
for fn in test_functions:
name = fn.__name__
doc = fn.__doc__
if fn(_test_data) != correct:
print("{}() not correct!!!".format(name))
s_call = "{}(_test_data)".format(name)
s_import = "from __main__ import {},_test_data".format(name)
t = timeit.timeit(s_call,s_import,number=NUM_REPEATS)
print("{:7.03f}: {}".format(t, doc))

结果：

python (327, 100000)
3.170: original way
2.211: better loop
2.378: sum + generator
2.188: sum + listcomp
5.321: decision bit -- bit mask and shift
4.977: decision bit -- division
2.937: lookup table
3.464: lookup table with sum()
python3 (327, 100000)
5.786: original way
3.444: better loop
3.286: sum + generator
2.968: sum + listcomp
8.858: decision bit -- bit mask and shift
7.056: decision bit -- division
4.640: lookup table
4.783: lookup table with sum()
pypy (327, 100000)
0.296: original way
0.312: better loop
1.932: sum + generator
1.011: sum + listcomp
0.172: decision bit -- bit mask and shift
0.613: decision bit -- division
0.140: lookup table
1.977: lookup table with sum()

python (65536, 1000)
6.528: original way
4.661: better loop
4.974: sum + generator
4.150: sum + listcomp
10.971: decision bit -- bit mask and shift
10.218: decision bit -- division
6.052: lookup table
7.070: lookup table with sum()
python3 (65536, 1000)
12.999: original way
7.618: better loop
6.826: sum + generator
5.587: sum + listcomp
19.326: decision bit -- bit mask and shift
14.917: decision bit -- division
9.779: lookup table
9.575: lookup table with sum()
pypy (65536, 1000)
0.681: original way
0.884: better loop
2.640: sum + generator
2.642: sum + listcomp
0.316: decision bit -- bit mask and shift
1.573: decision bit -- division
0.280: lookup table
1.561: lookup table with sum()

python (1048576, 100)
10.371: original way
7.065: better loop
7.910: sum + generator
6.579: sum + listcomp
17.583: decision bit -- bit mask and shift
15.426: decision bit -- division
9.285: lookup table
10.850: lookup table with sum()
python3 (1048576, 100)
20.435: original way
11.221: better loop
10.162: sum + generator
8.981: sum + listcomp
29.108: decision bit -- bit mask and shift
23.626: decision bit -- division
14.706: lookup table
14.173: lookup table with sum()
pypy (1048576, 100)
0.985: original way
0.926: better loop
5.462: sum + generator
6.623: sum + listcomp
0.527: decision bit -- bit mask and shift
2.334: decision bit -- division
0.481: lookup table
5.800: lookup table with sum()

python (16777216, 10)
15.704: original way
11.331: better loop
11.995: sum + generator
13.787: sum + listcomp
28.527: decision bit -- bit mask and shift
25.204: decision bit -- division
15.349: lookup table
17.607: lookup table with sum()
python3 (16777216, 10)
32.822: original way
18.530: better loop
16.339: sum + generator
14.999: sum + listcomp
47.755: decision bit -- bit mask and shift
38.930: decision bit -- division
23.704: lookup table
22.693: lookup table with sum()
pypy (16777216, 10)
1.559: original way
2.234: better loop
6.586: sum + generator
10.931: sum + listcomp
0.817: decision bit -- bit mask and shift
3.714: decision bit -- division
0.752: lookup table
3.837: lookup table with sum()

一个小的优化，也是一个风格问题，列表可以直接迭代，而不需要对它们进行索引：

for d in data:
if d >= 128:
sum += d

这样可以节省一些函数调用。

如果使用内置的sum函数，这个循环也可能会更快：

my_sum += sum( d for d in data if d>=128 )

列表comp可能比上面的生成器更快(以牺牲一点额外内存为代价)：

my_sum += sum( [d for d in data if d>=128] )

当然，从算法的角度来看，我们可以去掉最外循环，因为内循环的总和不会改变：

my_sum = 100000 * sum( d for d in data if d>=128 )

但我猜你已经知道了。。。

最后，我将如何对其进行基准测试：

import random
import timeit
N = 327
testarr = [random.randint(1,256) for _ in range(N)]
def test1(data):
"""Original way"""
s = 0
for c in range(N):
if data[c] >= 128:
s += data[c]

def test2(data):
"""better loop"""
s = 0
for d in data:
if d >= 128:
s += d
def test3(data):
""" sum + generator """
sum( d for d in data if d >= 128 )
def test4(data):
""" sum + listcomp """
sum( [ d for d in data if d >= 128 ])
NNUMBER = 100000
print timeit.timeit('test1(testarr)','from __main__ import test1,testarr',number=NNUMBER)
print timeit.timeit('test2(testarr)','from __main__ import test2,testarr',number=NNUMBER)
print timeit.timeit('test3(testarr)','from __main__ import test3,testarr',number=NNUMBER)
print timeit.timeit('test4(testarr)','from __main__ import test4,testarr',number=NNUMBER)

我的结果(OS-X Mavericks，python 2.7.5——里程数可能有所不同)：

2.80526804924  # loop with range
2.04129099846  # loop over elements
1.91441488266  # sum with generator
2.05234098434  # sum with list

对我来说，sum的发电机以微弱优势获胜。具有sum的列表comp和显式循环大约相等。循环索引是最慢的(毫不奇怪)。

相关内容

最新更新

热门标签：