How do I correct the words in a pandas DataFrame?



I am trying to correct spelling mistakes in a CSV file that contains sentences.

input_csv:

id  text
0   my telephon not working
1   I have mobil in my bag
2   car is expensiv

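The approach uses enchant, which corrects words by offering suggestions. I want to use it to correct the words in a pandas DataFrame. In the code below, each sentence is first tokenized, then every word is spell-checked and the best suggestion is picked: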

import difflib

import enchant
import pandas as pd
from nltk.tokenize import word_tokenize

d = enchant.Dict("en_US")  # dictionary used for the suggestions
text = "telephon mobil"  # This is only a sample
token = word_tokenize(text)
for word in token:
    best_words = []
    best_ratio = 0
    a = set(d.suggest(word))
    for b in a:
        # Similarity between the misspelled word and the suggestion
        tmp = difflib.SequenceMatcher(None, word, b).ratio()
        if tmp > best_ratio:
            best_words = [b]
            best_ratio = tmp
        elif tmp == best_ratio:
            best_words.append(b)
    print('word:[', word, '] -> best suggest:[', best_words[0], ']')
word:[ telephon ] -> best suggest:[ telephone ]
word:[ mobil ] -> best suggest:[ mobile ]

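My question is: how can I apply this to my pandas DataFrame so that the spelling mistakes in every row are corrected and I get the following output?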

output_csv:

id  text
0   my telephone not working
1   I have mobile in my bag
2   car is expensive

Put the code in a function, then use apply to call it on each row:
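d = enchant.Dict("en_US")  # build the dictionary once, not on every call

def word_suggest(word):
    # Correctly spelled words are returned unchanged
    if d.check(word):
        return word
    best_words = []
    best_ratio = 0
    a = set(d.suggest(word))
    for b in a:
        tmp = difflib.SequenceMatcher(None, word, b).ratio()
        if tmp > best_ratio:
            best_words = [b]
            best_ratio = tmp
        elif tmp == best_ratio:
            best_words.append(b)
    # Fall back to the original word when enchant has no suggestion
    return best_words[0] if best_words else word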
>>> df["text"].apply(lambda x: " ".join(word_suggest(word) for word in word_tokenize(x)))
0    my telephone not working
1     I have mobile in my bag
2            car is expensive
Name: text, dtype: object
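
For anyone who wants to try the ranking step without enchant installed, here is a minimal sketch of the same idea with the dictionary lookup swapped for a stub. The names `TOY_SUGGESTIONS`, `toy_suggest`, `best_match`, `correct_word` and `correct_sentence` are hypothetical and not part of the answer above; in real use the stub would presumably be replaced by `d.suggest`. It also splits on whitespace instead of calling `word_tokenize`, which is a simplification.

```python
import difflib
from functools import lru_cache

# Hypothetical stand-in for a spell checker's suggestions (demo data only).
TOY_SUGGESTIONS = {
    "telephon": ["telephone", "telethon"],
    "mobil": ["mobile", "mob", "moil"],
    "expensiv": ["expensive", "expansive"],
}


def toy_suggest(word):
    """Return candidate corrections; an empty list means no suggestion."""
    return TOY_SUGGESTIONS.get(word, [])


def best_match(word, candidates):
    """Pick the candidate most similar to word, or word itself if there is none."""
    if not candidates:
        return word
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, word, c).ratio(),
    )


@lru_cache(maxsize=None)
def correct_word(word):
    """Cache results so a repeated word is only ranked once."""
    return best_match(word, toy_suggest(word))


def correct_sentence(text):
    """Correct every whitespace-separated word and join the result again."""
    return " ".join(correct_word(w) for w in text.split())


rows = ["my telephon not working", "I have mobil in my bag", "car is expensiv"]
corrected = [correct_sentence(r) for r in rows]
print(corrected)
```

On a DataFrame the same function could be passed directly as `df["text"].apply(correct_sentence)`. The cache is a design choice: the same words tend to recur across many rows, so remembering each result avoids ranking the suggestions again.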
