Python: How to get from STTS to the universal tagset with TaggedCorpusReader

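I am working on a POS tagger using Python and Keras. The data I have is tagged with STTS, but I am supposed to build a tagger for the universal tagset, so I need to translate the tags.

At first I thought of building a dictionary and simply searching and replacing the tags, but then I saw that a TaggedCorpusReader can be given a tagset option (e.g. 'brown').

But I can't find a list of the tagsets that can be used there. Can I use the STTS tagset somehow, or do I have to build the dictionary myself?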

Example source: Code #3: Mapping corpus tags to the universal tagset https://www.geeksforgeeks.org/nlp-customization-using-tagged-corpus-reader/

corpus = TaggedCorpusReader(filePath, "standard_pos_tagged.txt", tagset='STTS') #?? doesn't work sadly
# ....
trainingCorpus.tagged_sents(tagset='universal')[1]

In the end it looked like this (many thanks to Alexis):

with open(resultFileName, "w") as output:
    for sent in stts_corpus.tagged_sents():
        for word, tag in sent:
            try:
                newTag = mapping_dict[tag]
                output.write(word + "/" + newTag + " ")
            except KeyError:
                # Report tags that are missing from the mapping instead of failing
                print("except " + str(word) + " - " + str(tag))
        output.write("\n")  # one sentence per line

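Just create a dictionary and replace the tags, as you were considering. NLTK's universal tagset support is provided by the module nltk/tag/mapping.py. It relies on a set of mapping files, which you will find in NLTK_DATA/taggers/universal_tagset. For example, in en-brown.map you'll find lines like these, which map a whole bunch of tags to PRT, ABX to DET, and so on: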

ABL     PRT
ABN     PRT
ABN-HL  PRT
ABN-NC  PRT
ABN-TL  PRT
ABX     DET
AP      ADJ

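These files are read into a dictionary that is used for the translation. By creating a mapping file in the same format, you could use NLTK's functionality to perform the translation, but honestly, if your task is just to produce a corpus in the universal format, I would simply do the translation by hand. But not through "search and replace": work with the tuples provided by the NLTK corpus readers, and just replace the POS tags by direct lookup in your mapping dictionary.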
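As a rough sketch of what "read into a dictionary" means, the snippet below parses lines in the two-column format shown above into a plain Python dict. The helper name `parse_tag_map` and the sample lines are illustrative assumptions, not part of NLTK; NLTK's own loading code in nltk/tag/mapping.py may differ in detail.

```python
def parse_tag_map(lines):
    """Build {source_tag: universal_tag} from whitespace-separated 'SRC  DST' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source_tag, universal_tag = line.split()
        mapping[source_tag] = universal_tag
    return mapping


# Sample lines copied from the en-brown.map excerpt above
sample_lines = ["ABL     PRT", "ABX     DET", "AP      ADJ"]
brown_to_universal = parse_tag_map(sample_lines)
print(brown_to_universal["ABX"])  # prints DET
```

Because the function accepts any iterable of lines, an open file object for your own STTS map file could be passed in directly.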

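Let's assume you know how to persuade an NLTK TaggedCorpusReader to read your corpus, and you now have a stts_corpus reader object with the methods tagged_words(), tagged_sents(), etc. You also need the mapping dictionary, whose keys are STTS tags and whose values are universal tags; if ABL were an STTS tag, mapping_dict["ABL"] should return the value PRT. Your remapping then goes something like this: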

for filename in stts_corpus.fileids():
    with open("new_dir/" + filename, "w") as output:
        # Pass the file id so that only this file's words are written to its output
        for word, tag in stts_corpus.tagged_words(filename):
            output.write(word + "/" + mapping_dict[tag] + " ")
        output.write("\n")

And that's really all there is to it, unless you want to add luxuries such as breaking the text up into lines.
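One such luxury, writing one sentence per line, might look like the sketch below. It assumes sentences shaped like the output of tagged_sents(), that is, lists of (word, tag) pairs. The function name `remap_sentences`, the tiny sample mapping, and the choice to fall back to the universal tag X for unmapped tags are assumptions made for illustration; the few STTS-to-universal pairs shown are only a sample, not a complete table.

```python
def remap_sentences(tagged_sents, mapping, unknown="X"):
    """Return one 'word/TAG word/TAG ...' string per sentence."""
    lines = []
    for sent in tagged_sents:
        # Unmapped tags fall back to `unknown` instead of raising KeyError
        tokens = [f"{word}/{mapping.get(tag, unknown)}" for word, tag in sent]
        lines.append(" ".join(tokens))
    return lines


# Hypothetical, incomplete STTS -> universal sample
sample_mapping = {"ART": "DET", "NN": "NOUN", "VVFIN": "VERB", "$.": "."}
sample_sents = [[("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN"), (".", "$.")]]

for line in remap_sentences(sample_sents, sample_mapping):
    print(line)
```

With a real reader you would pass stts_corpus.tagged_sents(filename) and write each returned line to the output file followed by a newline.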
