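I am trying to automate spell checking on a string column of a data.table/data.frame.
Looking around, I found several approaches, but all of them fail with a "subscript out of bounds" error if hunspell_suggest returns no suggestions for a word (i.e. an empty list element, e.g. for "pippasnjfjsfiadjg"); see the approach here (the accepted answer here produces NA, so it works in principle) and here.
It seems we need to unlist to identify these empty suggestions and then exclude them from the part of the code that selects the first suggestion, but I cannot figure out how.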
library(dplyr)
library(stringi)
library(hunspell)
df1 <- data.frame("Index" = 1:7,
                  "Text" = c("pippasnjfjsfiadjg came to dinner with us tonigh.",
                             "Wuld you like to trave with me?",
                             "There is so muh to undestand.",
                             "Sentences cone in many shaes and sizes.",
                             "Learnin R is fun",
                             "yesterday was Friday",
                             "bing search engine"),
                  stringsAsFactors = FALSE)
# Get bad words.
badwords <- hunspell(df1$Text) %>% unlist
# Extract the first suggestion for each bad word.
suggestions <- sapply(hunspell_suggest(badwords), "[[", 1)
out <- mutate(df1, Text = stri_replace_all_fixed(str = Text,
                                                 pattern = badwords,
                                                 replacement = suggestions,
                                                 vectorize_all = FALSE))
You need to filter both the bad words and the suggestions, dropping the words for which there is no suggestion:
badwords <- hunspell(df1$Text) %>% unlist()
# note use of '[' rather than '[['
suggestions <- sapply(hunspell_suggest(badwords), '[', 1)
badwords <- badwords[!is.na(suggestions)]
suggestions <- suggestions[!is.na(suggestions)]
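To see why '[' works where '[[' fails, here is a minimal base R sketch. It does not call hunspell at all: `mock_suggestions` is a hand-written stand-in for what hunspell_suggest returns (one character vector per word, empty when there is nothing to suggest), and the suggested words in it are made up for illustration, so treat them as assumptions rather than real hunspell output.

```r
# Hypothetical stand-in for hunspell_suggest() output: one character vector
# per bad word; character(0) means "no suggestions".
badwords <- c("pippasnjfjsfiadjg", "tonigh", "Wuld")
mock_suggestions <- list(character(0),
                         c("tonight", "tonic"),
                         c("Would", "Wild"))

# '[[' raises "subscript out of bounds" on the empty element,
# so the whole sapply() call fails.
double_bracket <- tryCatch(
  sapply(mock_suggestions, "[[", 1),
  error = function(e) "error"
)

# '[' returns NA for the empty element instead of failing.
first_suggestion <- sapply(mock_suggestions, "[", 1)

# Keep only the words that actually have a suggestion.
keep <- !is.na(first_suggestion)
badwords_ok <- badwords[keep]
suggestions_ok <- first_suggestion[keep]
```

After filtering, the two vectors have the same length again and can be passed as `pattern` and `replacement` to the stri_replace_all_fixed call from the question; words without a suggestion are simply left unchanged in the text.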