Separating NLTK subtrees based on a label



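I have an NLTK parse tree, and I want to separate its leaves based only on the "S" label. Note that the leaves of one S should not overlap with those of another.

Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."

The tree from CoreNLP is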

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

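The idea is to extract the two "S" subtrees and their leaves without any overlap between them. So the expected output should be "He won the Gusher Marathon ," and "finishing in 30 minutes".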

from nltk import Tree

# Tree manipulation
# Extract phrases from a parsed (chunked) tree
# phrase = tag of the phrase (sub-tree) to extract
# Returns: list of deep copies; recursive
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        if isinstance(child, Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if len(list_of_phrases) > 0:
                myPhrases.extend(list_of_phrases)
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            # leaves() returns every leaf below this node, nested S included
            subtexts.add(' '.join(subtree.leaves()))
            #break
subtexts = list(subtexts)
print(subtexts)

I get the output

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate this at the string level but at the tree level, so the expected output would be:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]
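One possible tree-level approach, given here only as a sketch: for each subtree labelled "S", collect its leaves while skipping any nested subtree that is itself labelled "S". The helper names leaves_without_nested and split_by_label are hypothetical (they are not part of NLTK); only Tree.fromstring, Tree.label and Tree.subtrees are NLTK calls.

```python
from nltk import Tree

def leaves_without_nested(tree, label="S"):
    """Return the leaves of `tree`, skipping nested subtrees tagged `label`.

    Hypothetical helper for illustration; not an NLTK function.
    """
    words = []
    for child in tree:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # the nested clause is reported as its own entry
            words.extend(leaves_without_nested(child, label))
        else:
            words.append(child)  # a plain string leaf
    return words

def split_by_label(tree, label="S"):
    """One string per `label` subtree, in pre-order, with no shared leaves."""
    return [
        " ".join(leaves_without_nested(sub, label))
        for sub in tree.subtrees(lambda t: t.label() == label)
    ]

parse = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

print(split_by_label(parse))
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

In this sketch the final period stays with the outer clause, because (. .) is a child of the top-level S only; attaching it to the inner clause as well would need an extra rule.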

Here is my sample input:

a = '''
FREEDOM FROM RELIGION FOUNDATION
Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.
EVOLUTION DESIGNS
Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.
'''

import nltk

# Split into sentences, tokenize, POS-tag, then run the named-entity chunker
sentences = nltk.sent_tokenize(a)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
tagged_sentences = nltk.pos_tag_sents(sentences)
chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))
for sent in chunked_sentences:
    for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
        print(subtree)

This is my output:

(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)
(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)
(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)
(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)
