I have a list of paragraphs:
paragraphs = ['I do not like green eggs and ham. I am hungry, but I do not find anything to eat', '5.2. I do not like them Sam-I-am. I am Sam.', 'Blah, Blah, Blah']
I want to split these paragraphs at each period (".") and get a single list of all the sentences, so I wrote the following code:
import nltk  # sent_tokenize requires NLTK's Punkt tokenizer data to be installed

sentences = []
for paragraph in paragraphs:
    sentence = nltk.tokenize.sent_tokenize(paragraph)
    sentences.append(sentence)
This gives me a list of lists:
sentences = [['I do not like green eggs and ham.', 'I am hungry, but I do not find anything to eat'], ['5.2.', 'I do not like them Sam-I-am.', 'I am Sam.'], ['Blah, Blah, Blah']]
Instead, I would like to get:
sentences = ['I do not like green eggs and ham.', 'I am hungry, but I do not find anything to eat', '5.2.', 'I do not like them Sam-I-am.', 'I am Sam.', 'Blah, Blah, Blah']
How can I get this?
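In your code, the variable sentence is itself a list of strings, because sent_tokenize returns one list per paragraph. Appending that list to sentences therefore nests it. You can fix this by appending each element of sentence to sentences individually.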
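import nltk

sentences = []
for paragraph in paragraphs:
    # sent_tokenize returns a list of sentence strings for this paragraph
    sentence = nltk.tokenize.sent_tokenize(paragraph)
    for i in sentence:
        # append each sentence on its own so the result stays flat
        sentences.append(i)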
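
As a possible alternative to the inner loop, the same flattening can be done with list.extend, itertools.chain.from_iterable, or a nested list comprehension. The sketch below is only an illustration: it starts from the nested result shown in the question rather than calling NLTK, and the names nested, flat_extend, flat_chain and flat_comp are made up for this example.

```python
from itertools import chain

# Nested result copied from the question: one inner list per paragraph.
nested = [
    ['I do not like green eggs and ham.', 'I am hungry, but I do not find anything to eat'],
    ['5.2.', 'I do not like them Sam-I-am.', 'I am Sam.'],
    ['Blah, Blah, Blah'],
]

# Option 1: extend adds every element of the inner list in one call.
flat_extend = []
for sentence_list in nested:
    flat_extend.extend(sentence_list)

# Option 2: chain.from_iterable walks the inner lists one after another.
flat_chain = list(chain.from_iterable(nested))

# Option 3: a nested list comprehension, outer loop first.
flat_comp = [s for sentence_list in nested for s in sentence_list]

print(flat_extend)
```

In the original loop, this would amount to replacing the inner for loop with sentences.extend(sentence). All three options keep the sentences in their original order.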