I am developing a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here is a short sample that outputs what I expect:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
Output: 'worse'
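Everything is fine here. The behaviour is the same for other adjectives such as 'better' (an irregular form) or 'older'. (Note that the same test with 'elder' never outputs 'old', but I guess WordNet is not an exhaustive list of all existing English words.)
My problem comes when trying the word 'further':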
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective
Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb
Output: 'far'
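This is exactly the opposite of the behaviour for the word 'worse'!
Can anybody explain why? Is it a bug in the WordNet synsets data, or does it come from my misunderstanding of English grammar?
Please excuse me if the question has already been answered. I searched Google and SO, but with "further" as a keyword the results are cluttered with unrelated hits because the word is so common.
Thank you in advance, Romain G.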
WordNetLemmatizer uses the ._morphy function to look up the possible lemmas of a word (see http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the shortest of them:
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
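The "shortest candidate wins" step can be tried without NLTK installed. This is a minimal sketch: `pick_lemma` is a hypothetical helper that mirrors the last line of `lemmatize`, and the candidate lists are hand-written for illustration rather than pulled from WordNet.

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate, or the word itself if there are none.

    Hypothetical helper mirroring `min(lemmas, key=len) if lemmas else word`.
    """
    return min(candidates, key=len) if candidates else word


# Hand-written candidate lists (illustrative only, not real WordNet output).
print(pick_lemma("worse", ["worse", "bad"]))  # the shorter candidate is chosen
print(pick_lemma("xyzzy", []))                # no candidates: word comes back unchanged
```

Because the choice is made purely on length, a word that is itself a valid lemma is still replaced whenever a shorter candidate is also returned.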
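The ._morphy function in turn applies rules iteratively to obtain a lemma. The rules keep shortening the word by replacing its affixes according to MORPHOLOGICAL_SUBSTITUTIONS, and after each pass the function checks whether any of the reduced forms is a known word in the database: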
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
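To see the control flow in isolation, here is a stripped-down, standalone version of the same steps. It is only a sketch: `LEXICON`, `SUBSTITUTIONS` and `EXCEPTIONS` are tiny made-up stand-ins for the WordNet data, `toy_morphy` is a hypothetical name, and part of speech is ignored entirely.

```python
# Toy stand-ins for WordNet data; every entry is an illustrative assumption.
LEXICON = {"dog", "box", "mouse"}
SUBSTITUTIONS = [("s", ""), ("es", "")]  # (suffix to strip, replacement)
EXCEPTIONS = {"mice": ["mouse"]}


def apply_rules(forms):
    # Apply every matching suffix rule once to every form.
    return [f[:-len(old)] + new
            for f in forms
            for old, new in SUBSTITUTIONS
            if f.endswith(old)]


def filter_forms(forms):
    # Keep forms found in the lexicon, in order, without duplicates.
    seen, result = set(), []
    for f in forms:
        if f in LEXICON and f not in seen:
            result.append(f)
            seen.add(f)
    return result


def toy_morphy(form):
    # 0. The exception list short-circuits the rules.
    if form in EXCEPTIONS:
        return filter_forms([form] + EXCEPTIONS[form])
    # 1-2. Apply the rules once and check the original form too.
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results
    # 3. Keep applying rules until something matches or nothing is left.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


for word in ("boxes", "mice", "dogss", "zzz"):
    print(word, toy_morphy(word))
```

Note that "mice" never reaches the suffix rules: once a word is found in the exception list, only the word itself and its listed lemmas are considered.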
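However, if the word is in the exception list, the rules are skipped and the candidates are the word itself plus the fixed value(s) stored for it in exceptions. See _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html: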
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
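The parsing step is simple enough to reproduce on its own. A minimal sketch, assuming the `.exc` format is one whitespace-separated line per inflected form (as the loader above implies); `parse_exceptions` is a hypothetical name, and the sample lines are a few of the adv.exc entries discussed in this answer.

```python
def parse_exceptions(lines):
    """Build {inflected_form: [lemma, ...]} from whitespace-separated lines."""
    table = {}
    for line in lines:
        terms = line.split()
        if terms:  # skip blank lines (a small safety check added in this sketch)
            table[terms[0]] = terms[1:]
    return table


# A few adv.exc-style lines, used here as in-memory sample input.
sample = ["further far", "farther far", "best well"]
adv = parse_exceptions(sample)
print(adv["further"])
```

The first token on each line becomes the key and the remaining tokens become its list of lemmas, which is exactly what `_morphy` later reads from `exceptions[form]`.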
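Going back to your example, worse -> bad and further -> far cannot be produced by the rules, so they must come from the exception lists. Since these are hand-maintained lists of exceptions, with a separate file for each part of speech, there are bound to be inconsistencies between them.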
The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.
From adv.exc:
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
From adj.exc:
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
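Putting the two excerpts together reproduces the asymmetry from the question. This is only a sketch under stated assumptions: the exception entries are the ones quoted above, `KNOWN` is a made-up stand-in for WordNet's lemma index (it assumes all four words exist under both parts of speech, which is consistent with the outputs in the question but is not checked against WordNet here), the suffix rules are left out, and `toy_lemmatize` is a hypothetical name. The keys 'a' and 'r' are the values of wordnet.ADJ and wordnet.ADV.

```python
# Exception entries copied from the excerpts above.
EXCEPTIONS = {
    "a": {"worse": ["bad"], "worst": ["bad"]},      # from adj.exc
    "r": {"further": ["far"], "farther": ["far"]},  # from adv.exc
}

# Assumption: a stand-in for WordNet's lemma index, per part of speech.
KNOWN = {
    "a": {"worse", "bad", "further", "far"},
    "r": {"worse", "bad", "further", "far"},
}


def toy_lemmatize(word, pos):
    # Candidates: the word itself plus any exception-list lemmas for this POS.
    candidates = [word] + EXCEPTIONS[pos].get(word, [])
    lemmas = [w for w in candidates if w in KNOWN[pos]]
    # Same tie-break as WordNetLemmatizer.lemmatize: the shortest lemma wins.
    return min(lemmas, key=len) if lemmas else word


for word in ("worse", "further"):
    for pos in ("a", "r"):
        print(word, pos, toy_lemmatize(word, pos))
```

In this sketch 'worse' only changes as an adjective and 'further' only changes as an adverb, simply because each word is listed in one exception file and not the other.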