How to get the domain of a word using WordNet in Python



How do I find the domain of a word using the nltk Python module and WordNet?

Suppose I have words like (transaction, demand draft, check, passbook), and the domain for all of these words is "BANK". How can we get this in Python using nltk and WordNet?

I am trying it with the hypernym and hyponym relations:

For example:

>>> from nltk.corpus import wordnet as wn
>>> sports = wn.synset('sport.n.01')
>>> sports.hyponyms()
[Synset('judo.n.01'), Synset('athletic_game.n.01'), Synset('spectator_sport.n.01'), Synset('contact_sport.n.01'), Synset('cycling.n.01'), Synset('funambulism.n.01'), Synset('water_sport.n.01'), Synset('riding.n.01'), Synset('gymnastics.n.01'), Synset('sledding.n.01'), Synset('skating.n.01'), Synset('skiing.n.01'), Synset('outdoor_sport.n.01'), Synset('rowing.n.01'), Synset('track_and_field.n.01'), Synset('archery.n.01'), Synset('team_sport.n.01'), Synset('rock_climbing.n.01'), Synset('racing.n.01'), Synset('blood_sport.n.01')]

>>> bark = wn.synset('bark.n.02')
>>> bark.hypernyms()
[Synset('noise.n.01')]
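
For illustration (my addition, not part of the original question): climbing the hypernym chain only moves toward ever more general concepts, so it never bottoms out in a compact domain label like BANK:

>>> # every noun hypernym chain ultimately ends at the generic root synset
>>> bark.root_hypernyms()
[Synset('entity.n.01')]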

There is no explicit domain information in the Princeton WordNet nor in NLTK's WN API.

I would suggest you get a copy of the WordNet Domains resource and then link your synsets using the domains; see http://wndomains.fbk.eu/

After registering and completing the download, you will see a wn-domains-3.2-20070223 text file. It is a tab-separated file where the first column is the offset-PartOfSpeech identifier and the second column contains the domain labels separated by spaces, e.g.

00584282-v  military pedagogy
00584395-v  military school university
00584526-v  animals pedagogy
00584634-v  pedagogy
00584743-v  school university
00585097-v  school university
00585271-v  pedagogy
00585495-v  pedagogy
00585683-v  psychological_features

Then use the following script to access the domains of the synsets:

from collections import defaultdict
from nltk.corpus import wordnet as wn

# Loading the Wordnet domains.
domain2synsets = defaultdict(list)
synset2domains = defaultdict(list)
for i in open('wn-domains-3.2-20070223', 'r'):
    ssid, doms = i.strip().split('\t')
    doms = doms.split()
    synset2domains[ssid] = doms
    for d in doms:
        domain2synsets[d].append(ssid)

# Gets domains given synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domain.
        print ss, ssid, synset2domains[ssid]

# Gets synsets given domain.
for dom in sorted(domain2synsets):
    print dom, domain2synsets[dom][:3]
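
To connect this back to the question, here is a minimal sketch of my own (written in Python 3 syntax, assuming a recent NLTK where offset is a method, see the note further down) that reuses the synset2domains dictionary loaded above to look up the WND labels for some of the words from the question:

# Look up the domain labels for a few of the question's words, reusing synset2domains.
# Note: this matches NLTK's synset offsets against the WND table, the same approach
# as the script above; see a later answer for the WordNet version caveat.
for word in ['transaction', 'cheque', 'passbook']:
    for ss in wn.synsets(word):
        ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
        if synset2domains[ssid]:
            print(word, ss.name(), synset2domains[ssid])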

Also look for wn-affect, which is very useful for disambiguating sentiment words within the WordNet Domains resource.


With the update to NLTK v3.0, it comes with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), and since the French synsets share the same offset IDs, you can simply use the WND as a cross-lingual resource. The French lemma names can be accessed like this:

# Gets domains given synset.
for ss in wn.all_synsets():
    ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
    if synset2domains[ssid]:  # not all synsets are in WordNet Domain.
        print ss, ss.lemma_names('fre'), ssid, synset2domains[ssid]

Note that the most recent version of NLTK changes the synset properties into "get" functions: Synset.offset -> Synset.offset()
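
If your code needs to run against both old and new NLTK releases, a small helper (my own sketch, not from the answer) can hide that difference:

def synset_offset(ss):
    # works whether Synset.offset is an attribute (old NLTK) or a method (NLTK >= 3.0)
    off = ss.offset
    return off() if callable(off) else off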

As suggested by @alvas, you can use WordNetDomains. You have to download both WordNet 2.0 (in its current state, WordNetDomains does not support the sense inventory of WordNet 3.0, which is the default version of WordNet used by NLTK) and WordNetDomains.

  • WordNet 2.0 can be downloaded from here

  • WordNetDomains can be downloaded from here (after being granted permission).

I created a very simple Python API that loads both resources in Python 3.x and provides some common routines you might need (such as getting the set of domains linked to a given term or to a given synset, etc.). The data loading of WordNetDomains is taken from @alvas's answer.

This is how it looks (with most comments omitted):

from collections import defaultdict
from nltk.corpus import WordNetCorpusReader
from os.path import exists


class WordNetDomains:
    def __init__(self, wordnet_home):
        # This class assumes you have downloaded WordNet-2.0 and WordNetDomains and that they are in the same data home.
        assert exists(f'{wordnet_home}/WordNet-2.0'), f'error: missing WordNet-2.0 in {wordnet_home}'
        assert exists(f'{wordnet_home}/wn-domains-3.2'), f'error: missing WordNetDomains in {wordnet_home}'

        # load WordNet2.0
        self.wn = WordNetCorpusReader(f'{wordnet_home}/WordNet-2.0/dict', 'WordNet-2.0/dict')

        # load WordNetDomains (based on https://stackoverflow.com/a/21904027/8759307)
        self.domain2synsets = defaultdict(list)
        self.synset2domains = defaultdict(list)
        for i in open(f'{wordnet_home}/wn-domains-3.2/wn-domains-3.2-20070223', 'r'):
            ssid, doms = i.strip().split('\t')
            doms = doms.split()
            self.synset2domains[ssid] = doms
            for d in doms:
                self.domain2synsets[d].append(ssid)

    def get_domains(self, word, pos=None):
        word_synsets = self.wn.synsets(word, pos=pos)
        domains = []
        for synset in word_synsets:
            domains.extend(self.get_domains_from_synset(synset))
        return set(domains)

    def get_domains_from_synset(self, synset):
        return self.synset2domains.get(self._askey_from_synset(synset), set())

    def get_synsets(self, domain):
        return [self._synset_from_key(key) for key in self.domain2synsets.get(domain, [])]

    def get_all_domains(self):
        return set(self.domain2synsets.keys())

    def _synset_from_key(self, key):
        offset, pos = key.split('-')
        return self.wn.synset_from_pos_and_offset(pos, int(offset))

    def _askey_from_synset(self, synset):
        return self._askey_from_offset_pos(synset.offset(), synset.pos())

    def _askey_from_offset_pos(self, offset, pos):
        return str(offset).zfill(8) + "-" + pos
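
Usage might then look like this (a sketch; the data home path is hypothetical and should point to wherever you unpacked the two resources):

wnd = WordNetDomains('/path/to/wordnet_data')  # directory containing WordNet-2.0/ and wn-domains-3.2/
print(wnd.get_domains('passbook', pos='n'))    # domain labels attached to the noun senses of 'passbook'
print(wnd.get_synsets('banking')[:3])          # a few synsets tagged with the 'banking' domain
print(len(wnd.get_all_domains()))              # number of distinct domain labels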

I think you can also use the spacy library; see the code below.

The code is taken from the official spacy-wordnet page, https://pypi.org/project/spacy-wordnet/:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator

# Load an spacy model (supported models are "es" and "en")
nlp = spacy.load('en')
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')
token = nlp('prices')[0]

# wordnet object link spacy token with nltk wordnet interface by giving acces to
# synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()

# And automatically tags with wordnet domains
token._.wordnet.wordnet_domains()

# Imagine we want to enrich the following sentence with synonyms
sentence = nlp('I want to withdraw 5,000 euros')

# spaCy WordNet lets you find synonyms by domain of interest
# for example economy
economy_domains = ['finance', 'banking']
enriched_sentence = []

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if synsets:
        lemmas_for_synset = []
        for s in synsets:
            # If we found a synset in the economy domains
            # we get the variants and add them to the enriched sentence
            lemmas_for_synset.extend(s.lemma_names())
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))
    else:
        enriched_sentence.append(token.text)

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros

Branching off of @sel's answer, I used spacy_wordnet (which uses nltk's wordnet under the hood).

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator  # must be imported for pipe creation

nlp = spacy.load("en_core_web_md")  # I was using medium, but may be able to get away with small
# this adds `wordnet` capabilities to your tokens when processed by the `nlp` pipeline
nlp.add_pipe("spacy_wordnet", after="tagger", config={"lang": nlp.lang})

# your words
words = ["transaction", "Demand Draft", "cheque", "passbook"]

for word in words:
    # process text with spacy
    doc: spacy.tokens.Doc = nlp(word)

    for token in doc:
        # get all wordnet domains for token
        token_wordnet_domains = token._.wordnet.wordnet_domains()
        print(token, token_wordnet_domains)

As an example, for the word "transaction" this will print out:

transaction ['social', 'diplomacy', 'book_keeping', 'money', 'finance', 'industry', 'economy', 'telephony', 'tax', 'exchange', 'betting', 'law', 'commerce', 'insurance', 'banking', 'enterprise']

You can check whether "banking" is among the domains with a condition:

for word in words:
    # convert each word into a spacy.tokens.Doc
    doc: spacy.tokens.Doc = nlp(word)

    for token in doc:
        # get all wordnet domains for token
        token_wordnet_domains = token._.wordnet.wordnet_domains()
        # print(token, token_wordnet_domains)
        print(token, "banking" in token_wordnet_domains)

Output:

transaction True
Demand True
Draft True
cheque True
passbook True
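
If you want a single verdict per term rather than per token (useful for a multi-word term like "Demand Draft"), one simple heuristic, which is my addition rather than part of the answer, is to require every token of the term to carry the 'banking' label:

# Hypothetical helper: a term counts as banking-related only if all of its
# tokens carry the 'banking' WordNet domain.
def is_banking_term(term: str) -> bool:
    doc = nlp(term)
    return all("banking" in t._.wordnet.wordnet_domains() for t in doc)

print({w: is_banking_term(w) for w in words})
# e.g. {'transaction': True, 'Demand Draft': True, 'cheque': True, 'passbook': True}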
