Replace terms with their corresponding abbreviations from another file when they match

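I have two files:
1. A pattern file = pattern.txt
2. A file containing different terms = terms.txt

pattern.txt contains two columns separated by ;
In the first column I have several terms, and in the second column the abbreviations
that belong to the terms in the first column, on the same line.

terms.txt contains single words as well as terms made up of a combination of words.

pattern.txt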

Berlin;Brln
Barcelona;Barcln
Checkpoint Charly;ChckpntChrl
Friedrichstrasse;Fridrchstr
Hall of Barcelona;HllOfBarcln
Paris;Prs
Yesterday;Ystrdy

terms.txt

Berlin  
The Berlinale ended yesterday  
Checkpoint Charly is still in Friedrichstrasse  
There will be a fiesta in the Hall of Barcelona  
Paris is a very nice city

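The goal is to replace the terms with the standardized abbreviations and to find out which terms
have no abbreviation.
As a result, I would like to have two files.
The first file is a new terms file in which every term that can be replaced has been replaced by its abbreviation.
The second file contains a list of all terms that have no abbreviation.
Matching is case-insensitive: I make no distinction between "The" and "the".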

new_terms.txt

Brln  
The Berlinale ended Ystrdy  
ChckpntChrl is still in Fridrchstr  
There will be a fiesta in the HllOfBarcln  
Prs is a very nice city

terms_without_abbreviations.txt

a  
be  
Berlinale  
city  
ended  
fiesta  
in  
is  
nice  
of  
still  
The  
There  
very  
will

I would appreciate your help, and thank you in advance for your time and hints!

This is mostly what you need:

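# Requires gawk: asorti() is a gawk extension.
# Pass pattern.txt first, then terms.txt. Everything is printed to stdout.
# Only single-word terms are matched; multi-word terms are left untouched.
BEGIN { FS=";"; }
# First file: map each lowercased term to its abbreviation.
FNR==NR { dict[tolower($1)] = $2; next }
# Second file: replace word by word.
{
    line = "";
    count = split($0, words, / +/);
    for (i = 1; i <= count; i++) {
        key = tolower(words[i]);
        if (key in dict) {
            words[i] = dict[key];
        } else {
            # No abbreviation: remember the word under its lowercased key.
            result[key] = words[i];
        }
        line = line " " words[i];
    }
    print substr(line, 2);
}
END {
    # List the words without an abbreviation, sorted by lowercased key.
    count = asorti(result, sorted);
    for (i = 1; i <= count; i++) {
        print result[sorted[i]];
    }
}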

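OK, so I have had a bit of a hack at it, but I will explain the issues:

If several entries in pattern.txt can apply to the same line, the first replacement is made and the second is not (for example, Barcelona;Barcln and Hall of Barcelona;HllOfBarcln: if Barcelona has already been replaced by Barcln by the time you reach the longer term, the longer term no longer exists in the line, so that replacement is never made).
Similar to the above, the word "Hall" has no abbreviation of its own, so if the assumption above holds and only the first replacement is made, your file of terms without abbreviations will include hall.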

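#!/usr/bin/awk -f
# Needs awk to be gawk: IGNORECASE and the \< \> word-boundary
# operators are gawk extensions.

BEGIN{
    FS = ";"
    IGNORECASE = 1
}
# First file (pattern.txt): map each lowercased term to its abbreviation.
FNR == NR{
    abbr[tolower($1)] = $2
    next
}
# Second file (terms.txt): switch to whitespace splitting and re-split the
# first record, which was already split on ";".
FNR == 1{ FS = " "; $0 = $0 }
{
    # Record each word that has no abbreviation, once, in lowercase.
    for(i = 1; i <= NF; i++){
        item = tolower($i)
        if(!(item in abbr) && !(item in twa)){
            twa[item]
            print item > "terms_without_abbreviations.txt"
        }
    }
    # Replace whole-word matches. The backslashes are doubled so that the
    # regex, not the string literal, sees \< and \>. The order in which
    # "for (i in abbr)" visits the terms is unspecified.
    for(i in abbr)
        gsub("\\<" i "\\>", abbr[i])
    print > "new_terms.txt"
}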

There may be other junk to look out for, but this is a rough direction. Not sure how you would get around my points?
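One possible way around both points is to stop replacing term by term and instead walk each line word by word, trying the longest phrase first at every position. The sketch below is an assumption-laden illustration, not part of either answer: the function name `abbreviate`, its three arguments, and the variable names are all made up here. It uses only POSIX awk features, assumes words are separated by spaces with no attached punctuation, and assumes the third argument (the output file name) contains no spaces or shell metacharacters.

```shell
# Hypothetical helper, not taken from the answers above.
# $1 = pattern file, $2 = terms file,
# $3 = output file for words without an abbreviation
#      (assumed to contain no spaces or shell metacharacters).
# The rewritten terms are printed to stdout.
abbreviate() {
    awk -v out="$3" '
    BEGIN { FS = ";" }

    # First file: load the table and track the longest term, counted in words.
    FNR == NR {
        if (NF < 2) next
        n = split($1, parts, / +/)
        if (n > maxwords) maxwords = n
        dict[tolower($1)] = $2
        next
    }

    # Second file: at each position, try the longest phrase first.
    {
        gsub(/^ +| +$/, "")                 # trim leading and trailing spaces
        count = split($0, words, / +/)
        line = ""
        i = 1
        while (i <= count) {
            limit = count - i + 1
            if (limit > maxwords) limit = maxwords
            matched = 0
            for (n = limit; n >= 1; n--) {
                phrase = words[i]
                for (j = 1; j < n; j++) phrase = phrase " " words[i + j]
                key = tolower(phrase)
                if (key in dict) {          # phrase of n words has an abbreviation
                    line = line " " dict[key]
                    i += n
                    matched = 1
                    break
                }
            }
            if (!matched) {                 # single word with no abbreviation
                key = tolower(words[i])
                if (!(key in missing)) missing[key] = words[i]
                line = line " " words[i]
                i++
            }
        }
        print substr(line, 2)
    }

    # Write the unabbreviated words, sorted case-insensitively.
    END {
        sorter = "sort -f > " out
        for (key in missing) print missing[key] | sorter
        close(sorter)
    }
    ' "$1" "$2"
}
```

With the sample files from the question, it could be called as `abbreviate pattern.txt terms.txt terms_without_abbreviations.txt > new_terms.txt`. Because "Hall of Barcelona" is tried before "Barcelona", the longer term wins, and words consumed by a matched phrase (Hall, of) are not reported as lacking an abbreviation. Note that this differs from the expected list in the question, which does include "of".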
