NLP: how to speed up spelling correction on 147k rows of short messages

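I'm trying to speed up spell checking on a large dataset of 147k rows. The function below has been running all afternoon and is still running. Is there a way to speed up the spell check? The messages have already been case-folded, stripped of punctuation, and lemmatized, and they are all in string format.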

import autocorrect
from autocorrect import Speller
spell = Speller()
def spell_check(x):
    correct_word = []
    mispelled_word = x.split()
    for word in mispelled_word:
        correct_word.append(spell(word))
    return ' '.join(correct_word)
df['clean'] = df['old'].apply(spell_check)

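The autocorrect library is not very efficient and is not suited to the task you describe. It generates every possible candidate that is one or two typos away from the word and checks which of them are valid words, and it does all of this in plain Python.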

Take a six-letter word such as "source":

from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305

autocorrect generates as many as 131654 candidates to test (349 + 131305), just for this one word. What if the word is longer? Let's try "transcompilation":

print(sum(1 for c in Word('transcompilation').typos()))
# => 889
print(sum(1 for c in Word('transcompilation').double_typos()))
# => 813325

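That's 814214 candidates, just for one word! Note that numpy can't speed this up, because the values are Python strings and you are calling a Python function on every row. The only way to make this faster is to change the method you use for spell checking: for example, use the aspell-python-py3 library instead (a wrapper for aspell, which AFAIK is the best free spell checker for Unix).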
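To see where those numbers come from, here is a minimal sketch that enumerates single and double edits in pure Python. It is not autocorrect's actual code. It assumes a 26-letter lowercase alphabet and the usual four edit types (deletion, transposition, replacement, insertion), and the names `single_edits` and `double_edits` are made up for illustration. Under those assumptions a word of length n has 54n + 25 single-edit strings (n deletions, n - 1 transpositions, 26n replacements, 26(n + 1) insertions), which matches the 349 and 889 reported above.

```python
import string

# Assumption: candidates are built from the 26 lowercase English letters.
ALPHABET = string.ascii_lowercase


def single_edits(word):
    """Yield every string one edit away from `word`, duplicates included."""
    for i in range(len(word) + 1):
        left, right = word[:i], word[i:]
        if right:
            yield left + right[1:]  # deletion
        if len(right) > 1:
            yield left + right[1] + right[0] + right[2:]  # transposition
        for ch in ALPHABET:
            if right:
                yield left + ch + right[1:]  # replacement
            yield left + ch + right  # insertion


def double_edits(word):
    """Yield every string two edits away by editing each single edit again."""
    for once in single_edits(word):
        yield from single_edits(once)


def count(iterable):
    """Count items without keeping them in memory."""
    return sum(1 for _ in iterable)


single_count = count(single_edits("source"))
double_count = count(double_edits("source"))
print(single_count, double_count)
```

The double-edit count grows roughly with the square of the single-edit count, which is why longer words get so expensive.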

In addition to what @Amadan said, which is absolutely right (autocorrect does its correction in a very inefficient way):

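You treat every word in the giant dataset as if it were being looked up for the first time, because you call spell() on each one. In reality (at least after a while) almost every word has already been looked up, so storing those results and reusing them would be much more efficient.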

Here is one way to do it:

import autocorrect
from autocorrect import Speller
spell = Speller()
# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}
# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}
# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]
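The same idea can also be expressed as memoization, which leaves the original `spell_check` function almost unchanged. The sketch below is only an illustration: `slow_correct` is a hypothetical stand-in for `spell(word)` so that the example runs without autocorrect installed, and the tiny correction table inside it is invented. The `calls` counter is there only to show how rarely the expensive function actually runs.

```python
from functools import lru_cache

calls = 0  # how many times the expensive corrector actually ran


def slow_correct(word):
    """Hypothetical stand-in for spell(word); the table below is made up."""
    global calls
    calls += 1
    return {"teh": "the", "wrld": "world"}.get(word, word)


@lru_cache(maxsize=None)
def cached_correct(word):
    """Correct a word once and serve later requests from the cache."""
    return slow_correct(word)


def spell_check(message):
    """Correct each whitespace-separated word in a message."""
    return " ".join(cached_correct(word) for word in message.split())


messages = ["teh wrld", "hello teh wrld", "teh teh teh"]
cleaned = [spell_check(message) for message in messages]
print(cleaned)
print(calls)  # one call per distinct word, not per occurrence
```

With a real speller you would wrap `spell` in `cached_correct` instead. The dictionary approach above does the same work up front, while the cache fills lazily as rows are processed.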
