New to NLP: need help using spaCy to get POS tags



I have the list below. For each token, I want to get the corresponding POS tag. A sample of the desired output follows.

processed_lst = [['The', 'wild', 'is', 'dangerous'], ['The', 'rockstar', 'is', 'wild']]
I want to use the spaCy library and get output like:
final_lst = [[(The, DET), (wild, NOUN), (is, AUX), (dangerous, ADJ)], [(The, DET), (rockstar, NOUN), (is, AUX), (wild, ADJ) ]]

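Once the tokens have been converted into a spaCy Doc, you can do this with each token's .pos_ attribute. The code below is taken from this article on part-of-speech tagging.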

import spacy

nlp = spacy.load("en_core_web_sm")

text = """This is where the calculation can get tricky. Here’s the thing about solar energy. Solar energy comes from the sun. That means solar panels cannot produce energy 24 hours a day. They only produce energy during sunlight hours. That energy then has to be stored somewhere while it is not being used. Energy storage is a whole other topic in and of itself. Let me get back to the point, there’s only an average of 4 peak sunlight hours a day. A solar panel may get more than that, but let’s take a conservative estimate of our solar power generation and confine it to those 4 hours only.
Back to the calculations. At 4 acres of solar panels to generate a megawatt-hour and 4 hours of power generation time a day, a 1 MW solar farm would generate 4 MWh of power over 4 acres every day. At 110,000 megawatt-hours of power needed a day to power America, we would need about 110,000 acres of solar farm. 110,000 acres? That sounds huge, that’s more land than the entire Mojave desert. It’s not as daunting as it sounds, there are 1.9 billion acres in the continental United States, and 110,000 acres is only slightly more than 0.5 percent of the total land of the continental US."""

doc = nlp(text)

for token in doc:
    print(token.text, token.pos_, token.tag_)
