POS tagger in Python without NLTK



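I am trying to build a POS tagger for determiners and prepositions in Sorani Kurdish. I am using the following code to put a tag after each preposition or determiner in my Kurdish text.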

import os
SOR = open("SOR-1.txt", "r+", encoding = 'utf-8')
old_text = SOR.read()
punkt = [".", "!", ",", ":", ";"]
text = ""
for i in old_text:
    if i in punkt:
        text+=" "+i
    else:
        text += i
d = {"DET":["ئێمە" , "ئێوە" , "ئەم" , "ئەو" , "ئەوان" , "ئەوەی", "چەند" ], "PREP":["بۆ","بێ","بێجگە","بە","بەبێ","بەدەم","بەردەم","بەرلە","بەرەوی","بەرەوە","بەلای","بەپێی","تۆ","تێ","جگە","دوای","دەگەڵ","سەر","لێ","لە","لەبابەت","لەباتی","لەبارەی","لەبرێتی","لەبن","لەبەینی","لەبەر","لەدەم","لەرێ","لەرێگا","لەرەوی","لەسەر","لەلایەن","لەناو","لەنێو","لەو","لەپێناوی","لەژێر","لەگەڵ","ناو","نێوان","وەک","وەک","پاش","پێش","" ], "punkt":[".", ",", "!"]}
text = text.split()
for w in text:
    for pos in d:
        if w in d[pos]:
            SOR.write(w+"/"+pos+" ")
SOR.close()

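What I want is to add the POS tag in the text, right after each word that appears in the dictionary I defined. Instead, I get a separate list of words and POS tags at the end of the file.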

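Keep in mind that old_text is a single string. So when you loop over it as in
for i in old_text:
    if i in punkt:

you are looping over characters. I think you intended to loop over the lines of old_text. If that is the case, you can open the file with a with statement that specifies read/write mode. Something like: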

with open("SOR-1.txt", 'r+', encoding = 'utf-8') as f:
    old_text = f.readlines()
    for line in old_text:
        for punctuationMark in punkt:     # punkt is the list defined in the question
            # when you read the file, every line is terminated with a newline character '\n'
            if punctuationMark in line.strip('\n'):
                pass     # give more instructions here
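
As for why the tags end up at the bottom of the file: after SOR.read() the file position is at the end of the file, so every SOR.write() in 'r+' mode appends there, and only the matched words are written. One way around this, as a rough sketch rather than a definitive fix, is to build the tagged text in memory and write it out afterwards. The names build_lookup, tag_line, tag_text and tag_file are made up for this example, LEXICON is a shortened stand-in for the dictionary d from the question, and the punctuation set is the punkt list from the question.

```python
import re

# Shortened stand-in for the dictionary `d` in the question (assumption:
# the real lists are much longer and may also contain a "punkt" entry).
LEXICON = {
    "DET": ["ئەم", "ئەو"],
    "PREP": ["بۆ", "لە", "لەسەر"],
}


def build_lookup(lexicon):
    """Invert {tag: [words]} into {word: tag}, skipping empty strings."""
    return {word: tag
            for tag, words in lexicon.items()
            for word in words
            if word}


def tag_line(line, lookup):
    """Return the line with /TAG appended to every token found in lookup."""
    # Put a space before each punctuation mark so it becomes its own token.
    spaced = re.sub(r"([.!,:;])", r" \1", line)
    return " ".join(
        f"{tok}/{lookup[tok]}" if tok in lookup else tok
        for tok in spaced.split()
    )


def tag_text(text, lookup):
    """Tag a whole text line by line, keeping the line breaks."""
    return "\n".join(tag_line(line, lookup) for line in text.splitlines())


def tag_file(src, dst, lookup):
    """Read src, tag it, and write the result to dst (a separate file)."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(tag_text(text, lookup))
```

You would call it with something like tag_file("SOR-1.txt", "SOR-1.tagged.txt", build_lookup(LEXICON)), where the output file name is just a placeholder. Writing to a second file keeps the original text intact. Note that this only matches whole tokens, so a word with a clitic or suffix attached will not be tagged, and runs of spaces inside a line are collapsed to a single space.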
