WordNet: Iterating over synsets

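For a project I would like to measure the number of "human-centered" words in a text. I plan to do this with WordNet. I have never used it and am not quite sure how to approach the task. I want to use WordNet to count the words that belong to certain synsets, for example the synsets "human" and "person".

I came up with the following (simple) piece of code: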

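from nltk.corpus import wordnet as wn

word = 'girlfriend'
word_synsets = wn.synsets(word)[0]                # first sense of the word
hypernyms = word_synsets.hypernym_paths()[0]      # first path from the root down to the synset
for element in hypernyms:
    print(element)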

This results in:

Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')

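My first question is: how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when I use an "if" statement, for example: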

count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print('found person hypernym')
        count_humancenteredness += 1

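I get "AttributeError: 'str' object has no attribute '_name'". What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the human-centeredness count) when a word does indeed belong to the "person" or "human" synset?

Secondly, is this an efficient approach? I assume that iterating over several texts, and over the hypernyms of every noun in them, will take quite some time. Perhaps there is another way to use WordNet that performs my task more efficiently.

Thanks for your help!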

Regarding the error message

hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synset objects.

Hence

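if element == 'person':

tries to compare a Synset object against a string. Synset does not support that kind of comparison; the AttributeError about '_name' suggests that the equality check reads an attribute a plain string does not have.

Try something like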

target_synsets = wn.synsets('person')
if element in target_synsets:
    ...

or

if u'person' in element.lemma_names():
    ...

instead.
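As a minimal sketch of how the first suggestion could be folded into the counting loop from the question: the helper names `has_hypernym` and `count_human_centered` are made up for illustration, and the code only assumes objects that behave like NLTK Synsets (a `hypernym_paths()` method and Synset-to-Synset equality). Like the question's code, it looks only at the first sense of each word, which is a simplification.

```python
def has_hypernym(synset, target):
    """Return True if `target` is `synset` itself or one of its hypernyms."""
    # hypernym_paths() returns a list of paths; each path is a list of
    # Synset-like objects running from the root down to `synset` itself.
    return any(target in path for path in synset.hypernym_paths())


def count_human_centered(words, lookup, target):
    """Count words whose first listed sense falls under `target`.

    `lookup` maps a word to a list of synsets (for NLTK, `wn.synsets`).
    Words without any synset are skipped.
    """
    count = 0
    for word in words:
        senses = lookup(word)
        if senses and has_hypernym(senses[0], target):
            count += 1
    return count
```

With NLTK and the WordNet corpus installed, this could be called as count_human_centered(tokens, wn.synsets, wn.synset('person.n.01')), where tokens is your own list of words. Comparing against a Synset object rather than the string 'person' is what avoids the AttributeError.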

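Regarding efficiency

Currently, you do a hypernym lookup for every word in your input text. As you note, this is not necessarily efficient. However, if it is fast enough, stop here and do not optimize what isn't broken.

To speed up the lookup, you can pre-compile a list of "person-related" words in advance by taking the transitive closure over the hyponyms of the person synset.

Something like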

p = wn.synset('person.n.01')  # root synset; p was left undefined in the original snippet
person_words = set(w for s in p.closure(lambda s: s.hyponyms()) for w in s.lemma_names())

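should do the trick. This returns a set of ~10,000 words, which is small enough to keep in main memory.

A simple version of the word counter then becomes something along the lines of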

from collections import Counter

# words: the tokens of the input text
word_count = Counter()
for word in (w.lower() for w in words if w in person_words):
    word_count[word] += 1

You might still need to pre-process the input words with stemming or some other morphological reduction before passing them on to WordNet, though.
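One possible way to combine the counter with such a normalization step is sketched below. The function name `count_person_words` and its `normalize` parameter are assumptions made for illustration; the default only lowercases each token, and the membership test is done on the normalized form rather than on the raw token.

```python
from collections import Counter


def count_person_words(tokens, person_words, normalize=str.lower):
    """Count tokens whose normalized form is in `person_words`."""
    counts = Counter()
    for token in tokens:
        form = normalize(token)
        # A normalizer may return None for forms it does not recognize;
        # those tokens are skipped.
        if form is not None and form in person_words:
            counts[form] += 1
    return counts
```

With NLTK available, passing normalize=lambda t: wn.morphy(t.lower(), wn.NOUN) should reduce inflected nouns such as plurals to their base form; wn.morphy returns None for words it cannot resolve, which is why None is skipped above. Note that lowercasing can miss capitalized lemma names such as proper nouns.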

To get all hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3; dhke's closure trick does not work on this version):

def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())

Example:

from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526
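Because a synset can be reachable through more than one parent, the recursive function may walk the same subtree several times. The following is a hedged, iterative variant that keeps a set of visited synsets; the names `all_hyponyms` and `lemma_names_under` are made up for this sketch, and it only assumes hashable objects with `hyponyms()` and `lemma_names()` methods, as NLTK Synsets provide.

```python
def all_hyponyms(root):
    """Return the set of all direct and indirect hyponyms of `root`."""
    seen = set()
    stack = list(root.hyponyms())
    while stack:
        synset = stack.pop()
        if synset in seen:
            continue  # already expanded via another parent
        seen.add(synset)
        stack.extend(synset.hyponyms())
    return seen


def lemma_names_under(root):
    """Collect the lemma names of `root` and of every hyponym below it."""
    names = set(root.lemma_names())
    for synset in all_hyponyms(root):
        names.update(synset.lemma_names())
    return names
```

Calling lemma_names_under(wordnet.synset('person.n.01')) would be one way to build the pre-compiled word set without relying on closure(). Unlike the closure-based snippet, this sketch also includes the lemma names of the root synset itself.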
