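I am trying to extract keywords from a text. Using the "en_core_sci_lg" model, I get a tuple of phrases/words that contains some duplicates, which I am trying to remove. I tried the usual deduplication approaches for lists and tuples, but they failed. Can anyone help? I would really appreciate it.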
text = """spaCy is an open-source software library for advanced natural language processing,
written in the programming languages Python and Cython. The MIT library is published under the MIT license and its main developers are Matthew Honnibal and Ines Honnibal, the founders of the software company Explosion."""
The code I tried:
import spacy
nlp = spacy.load("en_core_sci_lg")
doc = nlp(text)
my_tuple = list(set(doc.ents))
print('original tuple', doc.ents, len(doc.ents))
print('after set function', my_tuple, len(my_tuple))
Output:
original tuple: (spaCy, open-source software library, programming languages, Python, Cython, MIT, library, published, MIT, license, developers, Matthew Honnibal, Ines, Honnibal, founders, software company Explosion) 16
after set function: [Honnibal, MIT, Ines, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, MIT, published, open-source software library, spaCy] 16
Desired output (there should be only one MIT, and the name Ines Honnibal should stay together):
[Ines Honnibal, MIT, software company Explosion, founders, programming languages, library, Matthew Honnibal, license, Cython, Python, developers, published, open-source software library, spaCy]
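doc.ents is not a list of strings. It is a tuple of Span objects. When you print one, it prints its text, but each is a separate object tied to its own position in the document, which is why set does not see them as duplicates. The clue is that there are no quotes in your printed output. If these were strings, you would see quotes.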
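The effect can be reproduced without spaCy or the model. The sketch below uses a hypothetical FakeSpan class, a stand-in and not spaCy's Span, that prints as bare text but is compared by its text plus its position. That positions take part in the comparison of real Span objects is an assumption here, made to mirror the behaviour seen in the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True, repr=False)
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span: text plus token offsets."""
    text: str
    start: int
    end: int

    def __repr__(self):
        # Print only the text, without quotes, like the output in the question.
        return self.text


# Two "MIT" entries with identical text but different (made-up) positions.
ents = (FakeSpan("MIT", 20, 21), FakeSpan("library", 21, 22), FakeSpan("MIT", 26, 27))

print(ents)                         # (MIT, library, MIT)
print(len(set(ents)))               # 3: the two MIT objects differ by position
print(len({e.text for e in ents}))  # 2: the plain strings compare equal
```

Once the comparison is done on the text alone, the two MIT entries collapse into one.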
To remove the duplicates, compare the entities by their text rather than by the Span objects themselves. For example, you can do this:
my_tuple = list(set(e.text for e in doc.ents))
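Note that set does not keep the original order of the entities. If the order matters, one possible variant is to deduplicate with dict.fromkeys, which keeps the first occurrence of each key. This is a minimal sketch: Ent is a hypothetical stand-in for the Span objects and unique_texts is an illustrative helper name, neither comes from spaCy.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy Span objects; only the .text attribute is used.
Ent = namedtuple("Ent", ["text"])


def unique_texts(ents):
    """Return the entity texts without duplicates, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(e.text for e in ents))


sample = [Ent(t) for t in ["spaCy", "MIT", "library", "MIT", "license"]]
print(unique_texts(sample))  # ['spaCy', 'MIT', 'library', 'license']
```

With a real document this would be called as unique_texts(doc.ents). It only removes repeated texts. It does not join Ines and Honnibal, because the model returned those as two separate entities.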