Fine-tuning LayoutLM on a FUNSD-like dataset - IndexError: index out of range in self



I am trying to fine-tune microsoft/layoutlmv2-base-uncased via AutoModelForTokenClassification from Hugging Face Transformers on my custom dataset, which is similar to FUNSD (preprocessed and normalized). After a few training iterations I get this error:

Traceback (most recent call last):
  File "layoutlmV2/train.py", line 137, in <module>
    trainer.train()
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1409, in train
    return inner_training_loop(
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2345, in training_step
    loss = self.compute_loss(model, inputs)
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2377, in compute_loss
    outputs = model(**inputs)
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1228, in forward
    outputs = self.layoutlmv2(
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 902, in forward
    text_layout_emb = self._calc_text_embeddings(
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 753, in _calc_text_embeddings
    spatial_position_embeddings = self.embeddings._calc_spatial_position_embeddings(bbox)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 93, in _calc_spatial_position_embeddings
    h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "..../lib/python3.8/site-packages/torch/nn/functional.py", line 2203, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

After further inspection (vocab size, boxes, dimensions, classes, ...) I noticed that the input tensor contains negative values, and these are what cause the error; the input tensors of the previously successful iterations contained only unsigned integers. The negative values are produced in _calc_spatial_position_embeddings(self, bbox) of modeling_layoutlmv2.py, line 92:

h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
  • What could cause these input values to be negative?
  • What can I do to prevent this error from happening?

An example of an input tensor that triggers the error in torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse):

tensor([[ 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11,  9,  9,  9,  9,  9,  9,  9,  9,  9,
9,  9,  9,  9,  9,  9,  9, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12,
12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
8,  5,  5,  5,  5,  5,  5, -6, -6, -6, -6, -6, -6,  1,  1,  1,  1,  1,
5,  5,  5,  5,  5,  5,  7,  5,  7,  7,  0,  0,  0,  0,  0,  0,  0,  0,
0,  0,  0,  0,  0,  0,  0,  0]])
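Note the negative entries (-6) towards the end of the tensor. The spatial position embeddings are plain nn.Embedding lookups, so any negative (or too large) index raises exactly this IndexError. A minimal sketch, independent of my data, that reproduces the same failure:

import torch
import torch.nn as nn

# stand-in for self.h_position_embeddings; the real table has
# config.max_2d_position_embeddings rows, but the exact sizes don't matter here
h_position_embeddings = nn.Embedding(1024, 128)

# "heights" computed as bbox[:, :, 3] - bbox[:, :, 1]; one box has y2 < y1
heights = torch.tensor([[11, 12, 8, -6, 0]])

h_position_embeddings(heights)  # IndexError: index out of range in self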

After taking a closer look at the dataset, specifically at the coordinates of the labels, I found that some rows have bbox coordinates that result in a width or height of zero (or even negative). Here is a simple example:

x1, y1, x2, y2 = dataset_row["bbox"]
print((x2 - x1 < 1) or (y2 - y1 < 1))  # output is sometimes True

After removing these labels from the dataset, the problem was resolved.
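A sketch of that removal step, assuming the dataset is a list of rows with a "bbox" field as above:

def has_positive_area(row):
    x1, y1, x2, y2 = row["bbox"]
    return (x2 - x1 >= 1) and (y2 - y1 >= 1)

# keep only rows whose box has a strictly positive width and height
dataset = [row for row in dataset if has_positive_area(row)]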

More generally, the problem is any box that violates the expected constraints, for example by falling outside the image. Below is code that removes illegal boxes and their associated words before passing boxes and words to the embeddings. It assumes you have two ordered lists containing the normalized bounding boxes and the associated words, respectively. It may not be exhaustive.

Tools such as PaddleOCR are more likely to produce these unconventional bounding boxes, since they can return a wider variety of boxes than PyTesseract, for example when detecting vertical text.

Boxes must be in x1, y1, x2, y2 format, i.e. the top-left and bottom-right corners of the bounding box, where (0, 0) is the top-left corner of the image.

Note: one thing that previously caught me out is that this means the y coordinates must be inverted, i.e. y = distance from the top of the image.

valid_boxes = []
valid_words = []
for box, word in zip(boxes_norm, words):
    if (
        box[0] >= box[2]  # left coordinate actually on the right
        or box[1] >= box[3]  # bottom coordinate actually on top
        or box[0] < 0  # off the page
        or box[1] < 0  # off the page
        or box[2] < 0  # off the page
        or box[3] < 0  # off the page
        or box[0] > 1000  # off the page
        or box[1] > 1000  # off the page
        or box[2] > 1000  # off the page
        or box[3] > 1000  # off the page
    ):
        # print(
        #     "removing invalid box and associated word from image - ",
        #     example["image_path"],
        # )
        # print("box - ", box)
        continue  # drop the invalid box and its associated word
    valid_boxes.append(box)
    valid_words.append(word)

boxes_norm, words = valid_boxes, valid_words
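For completeness, the 0-1000 range checked above comes from the usual LayoutLM box normalization. A sketch of that step, assuming raw pixel coordinates with the origin at the top-left of the image (width and height being the image dimensions in pixels):

def normalize_box(box, width, height):
    # box = (x1, y1, x2, y2) in pixels; y is measured from the top of the image
    x1, y1, x2, y2 = box
    return [
        int(1000 * x1 / width),
        int(1000 * y1 / height),
        int(1000 * x2 / width),
        int(1000 * y2 / height),
    ]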

Latest update