Is there a fast (vectorized) way to count which strings from one list are contained in another list of strings?



I want to count the number of two-character vowel permutations contained in a list of 5-letter words. The vowel permutations look like 'aa', 'ae', 'ai', ..., 'ui', 'uo', 'uu'.

I did this successfully using apply(), but it is slow. I would like to see if there is a fast, vectorized way to do it, and I can't come up with one.

Here is what I did:

import pandas as pd
import itertools

vowels = list('aeiou')
vowel_perm = [x[0] + x[1] for x in itertools.product(vowels, vowels)]

def wide_contains(x):
    return pd.Series(data=[c in x for c in vowel_perm], index=vowel_perm)

dfwd['word'].apply(wide_contains).sum()
aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
ii     0
io     3
iu     0
oa     2
oe     2
oi     3
oo    11
ou     7
ua     2
ue     9
ui     2
uo     0
uu     0

The above is the expected output, using the following data:

word_lst = ['gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch']
dfwd = pd.DataFrame(word_lst, columns=['word'])

Well, if it is acceptable to do this computation without using Pandas at all, then for the given data a plain old Counter() looks about 220x faster on my machine.

from collections import Counter
import timeit

def timetest(func, name=None):
    name = name or getattr(func, "__name__", None)
    iters, time = timeit.Timer(func).autorange()
    iters_per_sec = iters / time
    print(f"{name=} {iters=} {time=:.3f} {iters_per_sec=:.2f}")

def original():
    # the apply()-based approach from the question, timed for comparison
    return dfwd['word'].apply(wide_contains).sum()

def counter():
    ctr = Counter()
    for word in dfwd['word']:
        for perm in vowel_perm:
            if perm in word:
                ctr[perm] += 1
    return ctr

timetest(original)
timetest(counter)
print(counter())

Output:

name='original' iters=10 time=0.229 iters_per_sec=43.59
name='counter' iters=2000 time=0.212 iters_per_sec=9434.29
Counter({'ea': 15, 'ee': 15, 'ai': 12, 'oo': 11, 'ie': 10, 'ue': 9, 'au': 8, 'ou': 7, 'ia': 7, 'eo': 5, 'io': 3, 'oi': 3, 'oa': 2, 'ui': 2, 'ao': 2, 'ua': 2, 'eu': 2, 'oe': 2, 'ae': 2, 'ei': 1, 'aa': 1})
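If a pandas Series covering all 25 pairs (including the zero counts) is still wanted afterwards, the Counter can be wrapped back into one with reindex. A small sketch using a three-word sample rather than the full word_lst:

```python
from collections import Counter
import itertools

import pandas as pd

vowels = list('aeiou')
vowel_perm = [a + b for a, b in itertools.product(vowels, vowels)]
words = ['gaize', 'musie', 'roars']  # tiny sample, not the full word_lst

ctr = Counter()
for word in words:
    for perm in vowel_perm:
        if perm in word:
            ctr[perm] += 1

# reindex over all 25 pairs so zero-count pairs ('ii', 'uu', ...) show up too
out = pd.Series(ctr).reindex(vowel_perm, fill_value=0)
```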

How about a dict comprehension? It should be faster than using apply:

{v: dfwd['word'].str.count(v).sum() for v in vowel_perm}
# 6.9 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
{'aa': 1,
'ae': 2,
'ai': 12,
'ao': 2,
'au': 8,
'ea': 15,
'ee': 15,
'ei': 1,
'eo': 5,
'eu': 2,
'ia': 7,
'ie': 10,
'ii': 0,
'io': 3,
'iu': 0,
'oa': 2,
'oe': 2,
'oi': 3,
'oo': 11,
'ou': 7,
'ua': 2,
'ue': 9,
'ui': 2,
'uo': 0,
'uu': 0}
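One caveat: `str.count` counts every (non-overlapping) occurrence of the pattern inside each word, while the apply()/Counter approaches count each word at most once per pair. The results happen to coincide for this data set, but they can diverge when a word repeats a pair:

```python
import pandas as pd

s = pd.Series(['aabaa'])
# str.count tallies both occurrences of 'aa' in the single word ...
per_occurrence = int(s.str.count('aa').sum())   # 2
# ... while a containment test counts the word only once
per_word = sum('aa' in w for w in s)            # 1
```

Note also that `str.count` interprets its argument as a regex; for plain vowel pairs that is harmless.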

Another option is to simply iterate over the vowel pairs and count how many words in word_lst contain each pair. Note that for the task at hand there is no need to create the explicit list vowel_perm; just iterate over the map object:

out = pd.Series({pair: sum(True for w in word_lst if pair in w)
                 for pair in map(''.join, itertools.product(vowels, vowels))})

On my machine, a benchmark shows:

>>> %timeit out = pd.Series({pair: sum(True for w in word_lst if pair in w) for pair in map(''.join, itertools.product(vowels,vowels))})
492 µs ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit vowel_perm = [x[0]+x[1] for x in itertools.product(vowels,vowels)]; out = dfwd['word'].apply(wide_contains).sum()
40.6 ms ± 2.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here is another approach:

dfwd['word'].str.extractall('([aeiou]{2})').groupby([0]).size()

Output:

0
aa     1
ae     2
ai    12
ao     2
au     8
ea    15
ee    15
ei     1
eo     5
eu     2
ia     7
ie    10
io     3
oa     2
oe     2
oi     3
oo    11
ou     6
ua     2
ue     9
ui     2
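A caveat about this regex approach: `extractall` finds non-overlapping matches, so in a run of three vowels only the first pair is captured. That is why 'ou' comes out as 6 here rather than 7 ('meous' contributes only 'eo'), and why pairs that never match ('ii', 'iu', 'uo', 'uu') are absent from the result entirely. A minimal illustration:

```python
import pandas as pd

s = pd.Series(['meous'])
# 'meous' contains both 'eo' and 'ou', but matching 'eo' consumes the 'o',
# so the overlapping 'ou' is never captured
matches = s.str.extractall('([aeiou]{2})')[0].tolist()
```

If the zero-count pairs matter, the result can be reindexed over the full list of 25 pairs with fill_value=0.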

You can use numpy.char.find to search for a substring in an array of strings:

from itertools import product

import numpy as np
import pandas as pd
word_lst = np.array([
'gaize', 'musie', 'dauts', 'orgue', 'tough', 'medio', 'roars', 'leath', 'quire', 'kaons', 'iatry', 'tuath', 'tarea', 'hairs', 'sloid', 
'beode', 'fours', 'belie', 'qaids', 'cobia', 'cokie', 'wreat', 'spoom', 'soaps', 'usque', 'frees', 'rials', 'youve', 'dreed', 'feute', 
'saugh', 'esque', 'revue', 'noels', 'seism', 'sneer', 'geode', 'vicua', 'maids', 'fiord', 'bread', 'squet', 'goers', 'sneap', 'teuch', 
'arcae', 'roosa', 'spues', 'could', 'tweeg', 'coiny', 'cread', 'airns', 'gauds', 'aview', 'mudee', 'vario', 'spaid', 'pooka', 'bauge', 
'beano', 'snies', 'boose', 'holia', 'doums', 'goopy', 'feaze', 'kneel', 'gains', 'acoin', 'crood', 'juise', 'gluey', 'zowie', 'biali', 
'leads', 'twaes', 'fogie', 'wreak', 'keech', 'bairn', 'spies', 'ghoom', 'foody', 'jails', 'waird', 'iambs', 'woold', 'belue', 'bisie', 
'hauls', 'deans', 'eaten', 'aurar', 'anour', 'utees', 'sayee', 'droob', 'gagee', 'roleo', 'burao', 'tains', 'daubs', 'geeky', 'civie', 
'scoop', 'sidia', 'tuque', 'fairy', 'taata', 'eater', 'beele', 'obeah', 'feeds', 'feods', 'absee', 'meous', 'cream', 'beefy', 'nauch'
], dtype="U")
dfwd = pd.Series({
    perm: (np.char.find(word_lst, perm) != -1).sum()
    for perm in ["".join(p) for p in product(list("aoeui"), repeat=2)]
})
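One note on this version: because it iterates product(list("aoeui"), repeat=2), the resulting index is not alphabetical; sort_index() restores the ordering used by the other answers. A sketch on a three-word sample:

```python
from itertools import product

import numpy as np
import pandas as pd

words = np.array(['gaize', 'musie', 'roars'], dtype="U")
# np.char.find returns the match position or -1, so "!= -1" marks containment
counts = pd.Series({
    "".join(p): int((np.char.find(words, "".join(p)) != -1).sum())
    for p in product("aoeui", repeat=2)
}).sort_index()
```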
