我测试了s.replace('a', '')
,其中s
是一个由200万个'a'
组成的字符串,在开始、中间或结束时可能是单个异常值'b'
。这一个异常值使其速度慢得多:
在TIO,他们的Python 3.8预发布版本:
a = 'a' * 10**6; s = f'{a}{a}' 5 ms 5 ms 5 ms
a = 'a' * 10**6; s = f'b{a}{a}' 24 ms 24 ms 24 ms
a = 'a' * 10**6; s = f'{a}b{a}' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'{a}{a}b' 25 ms 25 ms 25 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 25 ms 25 ms 25 ms
在我的旧笔记本电脑上使用Python 3.10:
a = 'a' * 10**6; s = f'{a}{a}' 4 ms 4 ms 4 ms
a = 'a' * 10**6; s = f'b{a}{a}' 93 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'{a}b{a}' 94 ms 95 ms 95 ms
a = 'a' * 10**6; s = f'{a}{a}b' 94 ms 94 ms 95 ms
a = 'a' * 10**6; s = f'b{a}b{a}b' 95 ms 95 ms 96 ms
单个异常值'b
'是如何使它变得如此缓慢的?
完整的基准代码:
from timeit import repeat
setups = [
"a = 'a' * 10**6; s = f'{a}{a}'",
"a = 'a' * 10**6; s = f'b{a}{a}'",
"a = 'a' * 10**6; s = f'{a}b{a}'",
"a = 'a' * 10**6; s = f'{a}{a}b'",
"a = 'a' * 10**6; s = f'b{a}b{a}b'",
]
for _ in range(3):
for setup in setups:
times = repeat("s.replace('a', '')", setup, number=1)
print(f'{setup:{max(map(len, setups))}}',
*('%3d ms ' % (t * 1e3) for t in sorted(times)[:3]))
print()
str.replace()
在https://github.com/python/cpython/blob/main/Objects/unicodeobject.c#L10604在C函数CCD_ 7中。以下是该函数的摘录。它表明,如果返回的字符串(new_size
(的大小为0,我们会提前停止并返回一个空字符串。否则,我们将执行循环输入字符串并逐个执行替换的长任务。
new_size = slen + n * (len2 - len1);
if (new_size == 0) {
u = unicode_new_empty();
goto done;
}
if (new_size > (PY_SSIZE_T_MAX / rkind)) {
PyErr_SetString(PyExc_OverflowError,
"replace string is too long");
goto error;
}
u = PyUnicode_New(new_size, maxchar);
if (!u)
goto error;
assert(PyUnicode_KIND(u) == rkind);
res = PyUnicode_DATA(u);
ires = i = 0;
if (len1 > 0) {
while (n-- > 0) {
/* look for next match */
j = anylib_find(rkind, self,
sbuf + rkind * i, slen-i,
str1, buf1, len1, i);
if (j == -1)
break;
else if (j > i) {
/* copy unchanged part [i:j] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (j-i));
ires += j - i;
}
/* copy substitution string */
if (len2 > 0) {
memcpy(res + rkind * ires,
buf2,
rkind * len2);
ires += len2;
}
i = j + len1;
}
if (i < slen)
/* copy tail [i:] */
memcpy(res + rkind * ires,
sbuf + rkind * i,
rkind * (slen-i));
}