List index out of range using stanford-nlp



I'm trying to remove all the blank lines from a large .txt file, but whatever method I use, it always returns this traceback:

Traceback (most recent call last):
  File "C:\Users\svp12\PycharmProjects\practiques\main.py", line 53, in <module>
    doc = nlp(texts[line])
IndexError: list index out of range

If I don't remove those blank lines, I get IndexErrors in the two subsequent for loops (or at least I think that's the cause), which is why I'm using try/except like this:

try:
    for word in doc.sentences[0].words:
        noun.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)
except IndexError:
    errors += 1
    pass

I'd like to be able to remove all the blank lines without having to dodge IndexErrors like this. Any ideas on how to fix it?

The full code is below:

import io
import stanza
import os

def linecount(filename):
    # count newlines by reading the file in 1 MiB chunks
    lines = 0
    buf_size = 1024 * 1024
    with open(filename, 'rb') as ffile:
        read_f = ffile.read
        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
    return lines

errors = 0
with io.open('@_Calvia_2018-01-01_2022-04-01.txt', 'r+', encoding='utf-8') as f:
    text = f.read()
    # replacing eos with \n, numbers and symbols
    texts = text.replace('eos', '.\n')
    texts = (texts.replace('0', ' ').replace('1', ' ').replace('2', ' ').replace('3', ' ').replace('4', ' ')
             .replace('5', ' ').replace('6', ' ').replace('7', ' ').replace('8', ' ').replace('9', ' ').replace(',', ' ')
             .replace('"', ' ').replace('·', ' ').replace('?', ' ').replace('¿', ' ').replace(':', ' ').replace(';', ' ')
             .replace('-', ' ').replace('!', ' ').replace('¡', ' ').replace('.', ' ').splitlines())
os.system("sed -i '/^$/d' @_Calvia_2018-01-01_2022-04-01.txt")      # removing empty lines to avoid IndexError
nlp = stanza.Pipeline(lang='ca')
nouns = []
lemmas = []
poses = []
xposes = []
heads = []
deprels = []
total_lines = linecount('@_Calvia_2018-01-01_2022-04-01.txt') - 1
for line in range(50):                                              # range should be total_lines, which is 6682
    noun = []
    lemma = []
    pos = []
    xpos = []
    head = []
    deprel = []
    # print('analyzing: ' + str(line + 1) + ' / ' + str(len(texts)), end='\r')
    doc = nlp(texts[line])
    try:
        for word in doc.sentences[0].words:
            noun.append(word.text)
            lemma.append(word.lemma)
            pos.append(word.pos)
            xpos.append(word.xpos)
            deprel.append(word.deprel)
    except IndexError:
        errors += 1
        pass
    try:
        for word in doc.sentences[0].words:
            head.extend([lemma[word.head - 1] if word.head > 0 else "root"])
    except IndexError:
        errors += 1
        pass
    nouns.append(noun)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)
print(nouns)
print(lemmas)
print(poses)
print(xposes)
print(heads)
print(deprels)
print("errors: " + str(errors))                                     # weird, seems to be range/2-1

As a side question, is it worth importing os just for this one line? (that is, for removing the blank lines)

os.system("sed -i '/^$/d' @_Calvia_2018-01-01_2022-04-01.txt")
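For comparison, the blank-line deletion that `sed -i '/^$/d'` does here can also be replicated in pure Python, without shelling out (a minimal sketch; the helper name `remove_blank_lines` is illustrative, not from the code above):

```python
def remove_blank_lines(path):
    # read the file, keep only lines that contain non-whitespace characters
    with open(path, 'r', encoding='utf-8') as f:
        lines = [line for line in f if line.strip()]
    # rewrite the file in place with the blank lines dropped
    with open(path, 'w', encoding='utf-8') as f:
        f.writelines(lines)
```

This also avoids the os import and works on Windows, where sed is usually not available.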

I can't guarantee this works, since I can't test it, but it should give you an idea of how to do this in Python. I've left out the head handling / your second loop here; that part is left for you.

I'd suggest throwing some prints in there and looking at the output to make sure you understand what's going on (especially the different data types), looking at example applications that use Stanford NLP, working through some online tutorials (from start to finish, without skipping), etc.

import stanza
import re

def clean(line):
    # function that does the text cleaning
    line = line.replace('eos', '.\n')
    line = re.sub(r'[\d,"·?¿:;!¡.-]', ' ', line)
    return line.strip()

nlp = stanza.Pipeline(lang='ca')

# instead of individual variables, you could keep the values in a dictionary
# (or just leave them as they are - your call)
values_to_extract = ['text', 'lemma', 'pos', 'xpos', 'deprel']
data = {v: [] for v in values_to_extract}

with open('@_Calvia_2018-01-01_2022-04-01.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # clean the text
        line = clean(line)
        # skip empty lines
        if not line:
            continue

        doc = nlp(line)
        # loop over sentences – this will work even if it's an empty list
        for sentence in doc.sentences:
            # append a new list to the dictionary entries
            for v in values_to_extract:
                data[v].append([])
            for word in sentence.words:
                for v in values_to_extract:
                    # extract the attribute (e.g.,
                    # a surface form, a lemma, a pos tag, etc.)
                    attribute = getattr(word, v)
                    # and add it to its slot
                    data[v][-1].append(attribute)

for v in values_to_extract:
    print('Value:', v)
    print(data[v])
    print()

Why hardcode 50 when the text doesn't have 50 lines?

If you only need to remove empty lines, you can simply do text = text.replace("\n\n", "\n")

If you need to remove whitespace-only lines as well, you can do this:

text = '\n'.join(line.rstrip() for line in text.split('\n') if line.strip())
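To see what that one-liner does on sample input (an illustrative check, not from the original post):

```python
text = 'first line\n   \nsecond line  \n\nthird line'
# drop empty and whitespace-only lines, and trim trailing spaces on the rest
cleaned = '\n'.join(line.rstrip() for line in text.split('\n') if line.strip())
print(cleaned)
# first line
# second line
# third line
```

Unlike the replace("\n\n", "\n") version, this also catches lines made of spaces or tabs, and it collapses runs of more than two consecutive newlines.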
