How to separate wrongly joined words in OCRed text



I have the text of a long document that was OCRed by someone else, and it contains a lot of instances where the spacing wasn't recognized properly and two words are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a relatively quick way using awk, sed, or the like to find strings that are not words and check whether they can be separated into legitimate words?

Or is there some other quick way to fix them? For instance, I notice that Chrome is able to flag the combined words as misspellings, and when I right-click, the suggested correction is pretty much always the one I want, but I don't know a quick way to just auto-fix them all (and there are thousands).
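
One quick way to at least get a list of the run-together candidates - a minimal sketch, assuming GNU aspell is installed; it only finds them, it doesn't split them - would be something like:

$ aspell list < file | sort -u > not-words.txt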

Thanks!

Matt, you may well introduce new errors while fixing others if you try to do this with command-line tools, but FWIW here's an approach using GNU awk for patsplit() and a multi-char RS (the latter just in case any of your files have DOS line endings):

$ cat words
bar
disco
discontent
exchange
experts
foo
is
now
of
tent
winter
$ cat file
now is the freezing winter
of ExPeRtSeXcHaNgE discontent

$ cat tst.awk
BEGIN {
    RS = "\r?\n"
    minSubLgth = 2
    minWordLgth = minSubLgth * 2
}
NR==FNR {
    realWords[tolower($0)]
    next
}
{
    n = patsplit($0,words,"[[:alpha:]]{"minWordLgth",}+",seps)
    printf "%s", seps[0]
    for (i=1; i<=n; i++) {
        word = words[i]
        lcword = tolower(word)
        if ( !(lcword in realWords) ) {
            found = 0
            for (j=length(lcword)-minSubLgth; j>=minSubLgth; j--) {
                head = substr(lcword,1,j)
                tail = substr(lcword,j+1)
                if ( (head in realWords) && (tail in realWords) ) {
                    found = 1
                    break
                }
            }
            word = (found ? "[[[" substr(word,1,j) " " substr(word,j+1) "]]]" : "<<<" word ">>>")
        }
        printf "%s%s", word, seps[i]
    }
    print ""
}

$ awk -f tst.awk words file
now is the <<<freezing>>> winter
of [[[ExPeRtS eXcHaNgE]]] discontent

It identifies (case-insensitively) alphabetic strings that aren't in the word list, then iteratively creates pairs of substrings from each of those and checks whether both substrings are present in realWords[]. It'll be somewhat slow and approximate, and it only works when 2 words are run together, not 3 or more, but maybe it's good enough. Give some thought to the algorithm, since the way it splits the substrings may not be the best approach (I didn't think about it much), tweak it to not look up words of fewer than some number of characters (I used 4 above) and to not split into substrings of fewer than some other number of characters (I used 2 above), and you may really want to highlight words that don't appear in realWords[] but that also can't be split into substrings that do appear (like freezing above).
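
If you also need to handle three or more words run together, one possible extension - a rough, untested sketch, and it reports the split in lower case rather than preserving the original capitalization - is to pull the splitting logic out into a recursive function:

$ cat tst3.awk
BEGIN {
    RS = "\r?\n"
    minSubLgth = 2
    minWordLgth = minSubLgth * 2
}
NR==FNR {
    realWords[tolower($0)]
    next
}
{
    n = patsplit($0,words,"[[:alpha:]]{"minWordLgth",}",seps)
    printf "%s", seps[0]
    for (i=1; i<=n; i++) {
        word = words[i]
        lcword = tolower(word)
        if ( !(lcword in realWords) ) {
            fixed = splitWord(lcword)
            word = (fixed != "" ? "[[[" fixed "]]]" : "<<<" word ">>>")
        }
        printf "%s%s", word, seps[i]
    }
    print ""
}
# Try every split position; if the head is a real word, accept the tail when it
# is a real word too, otherwise try to split the tail again (3+ word runs).
function splitWord(w,    j, head, tail, rest) {
    for (j=length(w)-minSubLgth; j>=minSubLgth; j--) {
        head = substr(w,1,j)
        tail = substr(w,j+1)
        if ( head in realWords ) {
            if ( tail in realWords )
                return head " " tail
            rest = splitWord(tail)
            if ( rest != "" )
                return head " " rest
        }
    }
    return ""
}

You'd run it the same way, e.g. awk -f tst3.awk words_alpha.txt file.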

FWIW I downloaded the word list from https://github.com/dwyl/english-words/blob/master/words_alpha.txt (you may want to google around for a better list - that one contains "words" like wasn and ll), and used a version of the text from your question with some additional spaces removed, so you can see some things it can catch, some it can't, and some it gets wrong:

$ cat file
I have the textof a long document that was OCRed by someoneelse that contains
a lot ofinstances where the spacingwasn't recognized properly and two words
are run together (ex: divisionbetween, hasalready, everyoneelse). Is there a
relatively quickway using awk, sed, or the like tofind strings that are not
words andcheck if they can separatedintolegitimate words?
Or is there someother quick way to fix them? Forinstance, Inotice that
Chrome is able toflag the combined words asmisspellings and when you right
click, thesuggested correction is pretty much always the oneIwant, but I
don't know a quickway to just auto-fix themall(and there are thousands).
$ awk -f tst.awk words_alpha.txt file
I have the [[[text of]]] a long document that was [[[OC Red]]] by [[[someone else]]] that contains
a lot [[[of instances]]] where the [[[spacing wasn]]]'t recognized properly and two words
are run together (ex: [[[division between]]], [[[has already]]], [[[everyone else]]]). Is there a
relatively [[[quick way]]] using awk, sed, or the like [[[to find]]] strings that are not
words [[[and check]]] if they can <<<separatedintolegitimate>>> words?
Or is there [[[some other]]] quick way to fix them? [[[For instance]]], [[[Ino tice]]] that
Chrome is able [[[to flag]]] the combined words [[[as misspellings]]] and when you right
click, [[[the suggested]]] correction is pretty much always the <<<oneIwant>>>, but I
don't know a [[[quick way]]] to just auto-fix [[[thema ll]]](and there are thousands).

FWIW that took about half a second to run on my underpowered laptop.
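
If you want to grab the same word list to experiment with (assuming the raw file still lives at that path in the dwyl/english-words repo):

$ curl -L -o words_alpha.txt https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt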
