PyTorch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"



I have some sentences that I vectorize using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for some others I get the following error message:

File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector
    encoded_layers = self.eval_fwdprop_biobert(tokenized_text)
File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert
    encoded_layers, _ = self.model(tokens_tensor, segments_tensors)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward
    embedding_output = self.embeddings(input_ids, token_type_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward
    position_embeddings = self.position_embeddings(position_ids)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I found out that for some groups of sentences the problem is related to tags such as <tb>, for example. But for others, even after removing the tags, the error message is still there.
(Unfortunately, I cannot share the code for confidentiality reasons.)

Do you have any idea what the problem could be?

Thanks in advance.

EDIT: You are right, cronoik, it will be better with an example.

Example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."]
biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')
vectors = [biobert.sentence_vector(doc) for doc in sentences]

In my opinion, the last line of code is what causes the error message.

The problem is that the biobert-embedding module does not take care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the following example to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)

Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....
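
If you want to spot the offending texts up front, a minimal check works too. This is only a sketch; it assumes a biobert object and a sentences list like the ones in your question:

#hypothetical diagnostic: list the sentences whose token count exceeds the 512 limit
too_long = [s for s in sentences if len(biobert.process_text(s)) > 512]
print('{} of {} sentences exceed 512 tokens'.format(len(too_long), len(sentences)))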

What you should do is implement a sliding window approach to process these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding
maxtokens = 512
startOffset = 0
docStride = 200
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'
sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()
#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x 22 x 768]
    # `token_vecs` is a tensor with shape [22 x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all 22 token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding

for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)

    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
                          docTokens[startOffset:startOffset+length]
                          , biobert)
                      )

        #stop when the whole document is processed (document has less than 512
        #or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0
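
Note that the loop above appends one vector per window into a single flat vectors list. If you want one embedding per document, one option is to average the window vectors of each document. The helper below is only a sketch (document_vector is hypothetical, not part of biobert-embedding, and assumes averaging window vectors is acceptable for your use case):

import torch

def document_vector(doc, biobert, maxtokens=512, docstride=200):
    #hypothetical helper: same sliding window as above, but the window vectors
    #of one document are collected and averaged into a single embedding
    docTokens = biobert.process_text(doc)
    windowVectors = []
    offset = 0
    while offset < len(docTokens):
        length = min(len(docTokens) - offset, maxtokens)
        windowVectors.append(sentence_vector(docTokens[offset:offset + length], biobert))
        if offset + length == len(docTokens):
            break
        offset += min(length, docstride)
    return torch.mean(torch.stack(windowVectors), dim=0)

docvectors = [document_vector(doc, biobert) for doc in sentences]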

PS: Your partial success with removing <tb> is possible because removing <tb> will remove 4 tokens ('<', 't', '##b', '>').
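
You can check this yourself with the module's tokenizer. A small sketch (the exact sub-token split is an assumption about the WordPiece vocabulary):

from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()
#the tag is split into several sub-tokens, so removing it shortens the
#sequence by more than one token (expected: ['<', 't', '##b', '>'])
print(biobert.process_text("<tb>"))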

Since the original BERT has position encodings of size 512 (0 - 511) and bioBERT is derived from BERT, it is no surprise to get an index error for 512. However, it is a bit strange that you could access 512 at all for some of the sentences you mentioned.
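
The failing call in your traceback is just a lookup into a 512-row position-embedding table. A minimal sketch of the same failure in plain PyTorch (the hidden size 768 is only illustrative):

import torch
import torch.nn as nn

#a 512-row table, like BERT's position embeddings, only accepts indices 0 - 511
position_embeddings = nn.Embedding(512, 768)
position_ids = torch.arange(513).unsqueeze(0)  #position index 512 is out of range
try:
    position_embeddings(position_ids)
except (IndexError, RuntimeError) as e:  #the exception type depends on the PyTorch version
    print(e)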
