I want to count the number of two-character vowel permutations contained in a list of 5-letter words. The vowel permutations look like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.
I did this successfully with apply(), but it is slow. I'd like to know whether there is a fast, vectorized way to do this; I can't think of one.
Here is what I did:
import pandas as pd
import itertools
vowels = list('aeiou')
vowel_perm = [x[0]+x[1] for x in itertools.product(vowels,vowels)]
def wide_contains(x):
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm)

dfwd['word'].apply(wide_contains).sum()
aa 1
ae 2
ai 12
ao 2
au 8
ea 15
ee 15
ei 1
eo 5
eu 2
ia 7
ie 10
ii 0
io 3
iu 0
oa 2
oe 2
oi 3
oo 11
ou 7
ua 2
ue 9
ui 2
uo 0
uu 0
The above is the expected output for the following data:
word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid',
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute',
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch',
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge',
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali',
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie',
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie',
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']
dfwd = pd.DataFrame(word_lst, columns=['word'])
Well, if it's acceptable to do this computation without Pandas at all, then for the given data a plain old Counter() appears to be about 220x faster on my machine.
from collections import Counter
import timeit
def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")
def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr

def original():
    # the apply-based approach from the question, wrapped for timing
    return dfwd['word'].apply(wide_contains).sum()

timetest(original)
timetest(counter)
print(counter())
Output:
name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
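If you still want a pandas Series that includes the zero-count pairs, the Counter result can be reindexed against the full pair list. A minimal sketch on a small made-up sample (the words below are placeholders, not the question's data):

```python
import itertools
from collections import Counter

import pandas as pd

vowels = list('aeiou')
vowel_perm = [a + b for a, b in itertools.product(vowels, repeat=2)]
word_lst = ['meous', 'gaize', 'taata']  # small illustrative sample

# same counting loop as above, on the sample
ctr = Counter()
for word in word_lst:
    for perm in vowel_perm:
        if perm in word:
            ctr[perm] += 1

# align with all 25 pairs, filling the missing ones with 0
s = pd.Series(ctr).reindex(vowel_perm, fill_value=0)
```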
How about a dict comprehension? It should be faster than using apply:
{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{'aa': 1,
'ae': 2,
'ai': 12,
'ao': 2,
'au': 8,
'ea': 15,
'ee': 15,
'ei': 1,
'eo': 5,
'eu': 2,
'ia': 7,
'ie': 10,
'ii': 0,
'io': 3,
'iu': 0,
'oa': 2,
'oe': 2,
'oi': 3,
'oo': 11,
'ou': 7,
'ua': 2,
'ue': 9,
'ui': 2,
'uo': 0,
'uu': 0}
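One caveat with str.count: it tallies every non-overlapping occurrence, so a word containing the same pair twice contributes 2. If the goal is the number of words containing the pair, str.contains gives that directly. A small sketch with made-up words:

```python
import pandas as pd

words = pd.Series(['taataa', 'gaize'])  # 'taataa' contains 'aa' twice

occurrences = words.str.count('aa').sum()    # every non-overlapping match: 2
containing = words.str.contains('aa').sum()  # words containing the pair: 1
```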
Another option is to simply iterate over the vowel pairs and count the occurrences of each pair in word_lst. Note that for the task at hand there is also no need to create the explicit list vowel_perm; just iterate over the map object:
out = pd.Series({pair: sum(True for w in word_lst if pair in w)
                 for pair in map(''.join, itertools.product(vowels, vowels))})
On my machine, a benchmark shows:
>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0]+x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
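As a minor simplification of the comprehension above, `pair in w` is already a bool, so the `True for ... if ...` generator can be written as a direct sum (sketch on a tiny sample list standing in for the question's data):

```python
import itertools

vowels = 'aeiou'
word_lst = ['meous', 'gaize']  # illustrative sample

counts = {pair: sum(pair in w for w in word_lst)
          for pair in map(''.join, itertools.product(vowels, repeat=2))}
```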
Here is another approach:
dfwd['word'].str.extractall('([aeiou]{2})').groupby([0]).size()
Output:
0
aa 1
ae 2
ai 12
ao 2
au 8
ea 15
ee 15
ei 1
eo 5
eu 2
ia 7
ie 10
io 3
oa 2
oe 2
oi 3
oo 11
ou 6
ua 2
ue 9
ui 2
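Note that this output differs slightly from the expected counts ('ou' is 6 rather than 7, and the zero-count pairs are missing) because extractall matches are non-overlapping: in 'meous', matching 'eo' consumes the 'o', so 'ou' is never seen. Wrapping the pattern in a lookahead captures overlapping pairs, and reindex restores the zeros. A sketch on a small sample:

```python
import itertools

import pandas as pd

vowel_perm = [a + b for a, b in itertools.product('aeiou', repeat=2)]
words = pd.Series(['meous', 'gaize'])  # 'meous' has overlapping 'eo' and 'ou'

counts = (words
          .str.extractall(r'(?=([aeiou]{2}))')[0]  # zero-width lookahead keeps overlaps
          .value_counts()
          .reindex(vowel_perm, fill_value=0)
          .sort_index())
```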
You can use numpy.char.find to search for a substring in an array of strings:
from itertools import product

import numpy as np
import pandas as pd
word_lst = np.array([
'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid',
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute',
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch',
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge',
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali',
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie',
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie',
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
], dtype="U")
out = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in product("aeiou", repeat=2)]
})
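The remaining Python-level loop over the pairs can also be pushed into NumPy, since np.char functions broadcast element-wise like ufuncs: give find a column of words against a row of pairs and sum down the columns. A sketch on a small sample array:

```python
import numpy as np

words = np.array(['meous', 'gaize'], dtype='U')  # illustrative sample
pairs = np.array(['ou', 'ai', 'uu'], dtype='U')

# (n_words, 1) against (n_pairs,) broadcasts to an (n_words, n_pairs) grid
found = np.char.find(words[:, None], pairs) != -1
counts = found.sum(axis=0)  # per-pair totals
```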