Tokenization and POS tagging in Python from a CSV file



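I am new to Python and want to do POS tagging after importing a CSV file from my local machine. I looked up some resources online and found that the following code works.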

from nltk import tokenize
from nltk.tag import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# Adjacent string literals inside parentheses are joined into one string.
text = ('Senator Elizabeth Warren from Massachusetts announced her support of '
        'Social Security in Washington, D.C. on Tuesday. Warren joined other '
        'Democrats in support.')

# Split the text into sentences.
sentences = tokenize.sent_tokenize(text)

# Tokenize each sentence into words.
tokenizer = TreebankWordTokenizer()
texttokens = []
for sent in sentences:
    texttokens.append(tokenizer.tokenize(sent))

# POS-tag each tokenized sentence.
taggedsentences = []
for sentencetokens in texttokens:
    taggedsentences.append(pos_tag(sentencetokens))

print(taggedsentences)

Because I printed it, the result of the code above looks like this.

[[('Senator', 'NNP'), ('Elizabeth', 'NNP'), ('Warren', 'NNP'), ('from', 
'IN'), ('Massachusetts', 'NNP'), ('announced', 'VBD'), ('her', 'PRP$'), 
('support', 'NN'), ('of', 'IN'), ('Social', 'NNP'), ('Security', 'NNP'), 
('in', 'IN'), ('Washington', 'NNP'), (',', ','), ('D.C.', 'NNP'), ('on', 
'IN'), ('Tuesday', 'NNP'), ('.', '.')], [('Warren', 'NNP'), ('joined', 
'VBD'), ('other', 'JJ'), ('Democrats', 'NNPS'), ('in', 'IN'), ('support', 
'NN'), ('.', '.')]]

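This is the ideal result I want, but I would like to get it after importing a CSV file that contains several rows, where each row holds several sentences. For example, the CSV file looks like this: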

---------------------------------------------------------------
I like this product. This product is beautiful. I love it. 
---------------------------------------------------------------
This product is awesome. It have many convenient features.
---------------------------------------------------------------
I went this restaurant three days ago. The food is too bad.
---------------------------------------------------------------

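Finally, I want to save the ideal POS tagging results shown above after importing the CSV file. I would like to write each POS-tagged sentence to a file in CSV format.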

Two formats are possible. The first could look like this (no header, one POS-tagged sentence per row):

----------------------------------------------------------------------------
[[('I', 'PRON'), ('like', 'VBD'), ('this', 'PRON'), ('product', 'NN')]]
----------------------------------------------------------------------------
[[('This', 'PRON'), ('product', 'NN'), ('is', 'VERB'), ('beautiful', 'ADJ')]]
---------------------------------------------------------------------------
[[('I', 'PRON'), ('love', 'VERB'), ('it', 'PRON')]]
----------------------------------------------------------------------------
...

The second format could look like this (no header, each token and POS tag pair stored in its own cell):

----------------------------------------------------------------------------
('I', 'PRON')    | ('like', 'VBD')   | ('this', 'PRON') | ('product', 'NN')
----------------------------------------------------------------------------
('This', 'PRON') | ('product', 'NN') | ('is', 'VERB')   | ('beautiful', 'ADJ')
---------------------------------------------------------------------------
('I', 'PRON')    | ('love', 'VERB')  | ('it', 'PRON')   |
----------------------------------------------------------------------------
...

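I prefer the second format.

The Python code I wrote here works well, but I want to do the same thing for a CSV file and then save the result on my local computer.

The ultimate goal is to extract only the noun types (e.g., NN, NNP) from the sentences.

Can someone help me fix the Python code?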

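Please see the question that has already been answered here. You only need to do some tagging and then filter out the nouns mentioned in your post.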
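As a rough illustration of what that could look like for the CSV case, here is a minimal sketch. It assumes the text sits in the first column of each input row and that NLTK's tokenizer and tagger data have already been fetched with nltk.download. The names tag_csv, nltk_tag_sentences, and NOUN_TAGS are made up for this example and are not part of NLTK. It writes the second format from the question: one output row per sentence, one (token, tag) pair per cell.

```python
import csv

# Penn Treebank noun tags, as produced by nltk.tag.pos_tag for English.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nltk_tag_sentences(text):
    """Split text into sentences and POS-tag each one, as in the question."""
    # Imported lazily so the CSV logic below can be tried without NLTK.
    from nltk.tag import pos_tag
    from nltk.tokenize import TreebankWordTokenizer, sent_tokenize

    tokenizer = TreebankWordTokenizer()
    return [pos_tag(tokenizer.tokenize(sent)) for sent in sent_tokenize(text)]


def tag_csv(in_path, out_path, tag_sentences=nltk_tag_sentences,
            nouns_only=False):
    """Write one row per sentence, with one (token, tag) pair per cell."""
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank input lines
            text = row[0]  # assumption: the text is in the first column
            for tagged in tag_sentences(text):
                if nouns_only:
                    # Keep only the pairs whose tag is a noun tag.
                    tagged = [(tok, tag) for tok, tag in tagged
                              if tag in NOUN_TAGS]
                if tagged:
                    # str() turns each pair into a cell like "('I', 'PRP')".
                    writer.writerow([str(pair) for pair in tagged])
```

Calling tag_csv("reviews.csv", "tagged.csv") would then write the tagged sentences, and passing nouns_only=True would keep only the noun pairs; both file names are placeholders. If a text cell contains commas, it has to be quoted in the input file, otherwise the csv module splits it across several columns and only the first part is tagged.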
