Is it possible to check whether a word consists of punctuation without any loops?



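I have a list with a lot of words, so I don't want to write nested loops, because the program would take a long time to run. Is there a way to check whether a word consists of punctuation, similar to how any(map(str.isdigit, s1)) checks whether a string contains a digit?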

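Unless the list is very large or the CPU is slow, processing a list of words will not take much time. Consider the example below, which uses 1 million strings of 20 characters each.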

import random
import string
In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]
In [17]: %%timeit -n 3 -r 3
...: [any(map(str.isdigit, s1)) for s1 in s]
...: 
...: 
1.23 s ± 2.53 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
In [18]: %%timeit -n 3 -r 3
...: [any([s2 in string.punctuation for s2 in s1]) for s1 in s]
...: 
...: 
1.72 s ± 18.1 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)

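You can speed it up with a regular expression. Note that re.match only matches at the start of the string, so the call below only tests whether the first character is punctuation; use patt.search(s1) to look for punctuation anywhere in the string.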

import re
import string
In [16]: s = [''.join(random.choices(string.ascii_letters + string.punctuation, k=20)) for _ in range(1000000)]
In [17]: patt = re.compile('[%s]' % re.escape(string.punctuation))
In [18]: %%timeit -n 3 -r 3
[bool(re.match(patt, s1)) for s1 in s]
1.03 s ± 3.23 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
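To make the difference between "contains punctuation" and "consists only of punctuation" concrete, here is a minimal sketch using the same character class as above. The names ANY_PUNCT, ALL_PUNCT, has_punctuation and is_all_punctuation are made up for this illustration, and it assumes "punctuation" means string.punctuation. It is not timed here.

```python
import re
import string

# Character class containing every ASCII punctuation character.
PUNCT_CLASS = "[%s]" % re.escape(string.punctuation)
ANY_PUNCT = re.compile(PUNCT_CLASS)        # one punctuation character
ALL_PUNCT = re.compile(PUNCT_CLASS + "+")  # one or more, for use with fullmatch


def has_punctuation(word):
    """True if at least one character of word is ASCII punctuation."""
    # search scans the whole string, unlike match, which anchors at the start
    return ANY_PUNCT.search(word) is not None


def is_all_punctuation(word):
    """True if word is non-empty and every character is ASCII punctuation."""
    return ALL_PUNCT.fullmatch(word) is not None


words = ["hello", "wor.ld", "?!", "..."]
print([has_punctuation(w) for w in words])     # [False, True, True, True]
print([is_all_punctuation(w) for w in words])  # [False, False, True, True]
```

The list comprehension is still a loop over the words, but the per-character scan happens inside the regex engine rather than in Python code.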

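This may depend on how you define "punctuation". The string module defines string.punctuation as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. You could also define it as "anything that is not alphanumeric" (a-zA-Z0-9) or "anything that is not alphabetic" (a-zA-Z).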
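The three definitions can give different answers for the same character. The sketch below compares them; the function names are hypothetical and chosen only for this example. One caveat worth knowing: str.isalnum and str.isalpha are Unicode-aware, so they accept more than a-zA-Z0-9.

```python
import string


def is_punct_ascii(ch):
    """Definition 1: the character is listed in string.punctuation."""
    return ch in string.punctuation


def is_punct_non_alnum(ch):
    """Definition 2: anything that is not alphanumeric."""
    return not ch.isalnum()


def is_punct_non_alpha(ch):
    """Definition 3: anything that is not alphabetic."""
    return not ch.isalpha()


print(len(string.punctuation))  # 32

for ch in [".", "7", " ", "é"]:
    print(repr(ch), is_punct_ascii(ch), is_punct_non_alnum(ch), is_punct_non_alpha(ch))
# '.' True True True
# '7' False False True
# ' ' False True True
# 'é' False False False
```

Whitespace counts as "punctuation" under definitions 2 and 3, and digits do under definition 3, so the choice matters before any benchmarking does.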

Here I define a very long alphanumeric string, and a second one with a dot . added and shuffled in.

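import numpy as np
import string
# size must be an integer, so convert the float literal
n_chars = int(1e8)
mystr_no_punct = np.random.choice(list(string.ascii_letters) +
                                  list(string.digits), n_chars)
mystr_withpunct = np.append(mystr_no_punct, '.')
# shuffle the array that contains the dot so the dot does not stay at the end
np.random.shuffle(mystr_withpunct)
mystr_withpunct = "".join(mystr_withpunct)
mystr_no_punct = "".join(mystr_no_punct)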

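Below is a naive implementation that iterates with a for loop, followed by a few alternatives depending on what you are looking for, along with a timing comparison.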

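def naive(mystr):
    """Return True if mystr contains no character from string.punctuation."""
    for x in mystr:  # iterate over the argument, not the global string
        if x in string.punctuation:
            return False
    return True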
# naive solution
%timeit naive(mystr_withpunct)
%timeit naive(mystr_no_punct)
# check if string is only alnum
%timeit str.isalnum(mystr_withpunct) 
%timeit str.isalnum(mystr_no_punct)
# reduce to a set of the present characters, compare with the set of punctuation characters
%timeit len(set(mystr_withpunct).intersection(set(string.punctuation))) > 0
%timeit len(set(mystr_no_punct).intersection(set(string.punctuation))) > 0
# use regex
import re
%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_withpunct)) > 0
%timeit len(re.findall(rf"[{re.escape(string.punctuation)}]+", mystr_no_punct)) > 0

The results are as follows:

# naive
53.9 ms ± 928 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
53.1 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# str.isalnum
4.17 ms ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.47 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# sets intersection
8.26 ms ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.2 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# regex
8.43 ms ± 84 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.51 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So using the built-in isalnum is clearly the fastest. But if you have specific needs, regex or set intersection both seem like a good fit.
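Applied to the original question (a list of words rather than one long string), the set approach could look like the sketch below. It assumes "punctuation" means string.punctuation; PUNCT, contains_punct and only_punct are hypothetical names, and the code has not been benchmarked against the timings above.

```python
import string

# Build the set once instead of on every call.
PUNCT = frozenset(string.punctuation)


def contains_punct(word):
    """True if word has at least one ASCII punctuation character."""
    return not PUNCT.isdisjoint(word)


def only_punct(word):
    """True if word is non-empty and made up entirely of ASCII punctuation."""
    return bool(word) and PUNCT.issuperset(word)


words = ["hello", "wor.ld", "?!", "", "--"]
print([w for w in words if contains_punct(w)])  # ['wor.ld', '?!', '--']
print([w for w in words if only_punct(w)])      # ['?!', '--']
```

Both isdisjoint and issuperset accept any iterable, so the string can be passed directly and the per-character work runs in C. The comprehension over the list remains, since some form of iteration over the words is unavoidable.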
