SimpleTransformers "max_seq_length" argument causes CUDA out-of-memory errors in Kaggle and Google Colab



In a multi-class classification task with SimpleTransformers, fine-tuning the sloBERTa Transformer model (which is based on CamemBERT), I want to use the model argument "max_seq_length": 512, because previous work has shown that it gives better results than 128, but including this argument triggers the error below. The error is identical in the Kaggle and Google Colab environments, and stopping the execution and rerunning it does not help. The error is triggered no matter how small the number of training epochs is, and the dataset contains only 600 instances (texts as strings, labels as integers). I have tried lowering max_seq_length to 509, 500 and 128, but the error persists.

The setup without this argument works fine and trains for 90 epochs, so I do have enough memory.

from simpletransformers.classification import ClassificationModel
# define hyperparameter
model_args ={"overwrite_output_dir": True,
             "num_train_epochs": 90,
             "labels_list": LABELS_NUM,
             "learning_rate": 1e-5,
             "train_batch_size": 32,
             "no_cache": True,
             "no_save": True,
             #"max_seq_length": 512,
             "save_steps": -1,
             }
model = ClassificationModel(
    "camembert", "EMBEDDIA/sloberta",
    use_cuda = device,
    num_labels = NUM_LABELS,
    args = model_args)
model.train_model(train_df)

This is the error:

RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_34/2529369927.py in <module>
    19     args = model_args)
    20 
---> 21 model.train_model(train_df)
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train_model(self, train_df, multi_label, output_dir, show_running_loss, args, eval_df, verbose, **kwargs)
   610             eval_df=eval_df,
   611             verbose=verbose,
--> 612             **kwargs,
   613         )
   614 
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in train(self, train_dataloader, output_dir, multi_label, show_running_loss, eval_df, test_df, verbose, **kwargs)
   883                             loss_fct=self.loss_fct,
   884                             num_labels=self.num_labels,
--> 885                             args=self.args,
   886                         )
   887                 else:
/opt/conda/lib/python3.7/site-packages/simpletransformers/classification/classification_model.py in _calculate_loss(self, model, inputs, loss_fct, num_labels, args)
  2256 
  2257     def _calculate_loss(self, model, inputs, loss_fct, num_labels, args):
-> 2258         outputs = model(**inputs)
  2259         # model outputs are always tuple in pytorch-transformers (see doc)
  2260         loss = outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
  1210             output_attentions=output_attentions,
  1211             output_hidden_states=output_hidden_states,
-> 1212             return_dict=return_dict,
  1213         )
  1214         sequence_output = outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   859             output_attentions=output_attentions,
   860             output_hidden_states=output_hidden_states,
--> 861             return_dict=return_dict,
   862         )
   863         sequence_output = encoder_outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   531                     encoder_attention_mask,
   532                     past_key_value,
--> 533                     output_attentions,
   534                 )
   535 
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   415             head_mask,
   416             output_attentions=output_attentions,
--> 417             past_key_value=self_attn_past_key_value,
   418         )
   419         attention_output = self_attention_outputs[0]
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   344             encoder_attention_mask,
   345             past_key_value,
--> 346             output_attentions,
   347         )
   348         attention_output = self.output(self_outputs[0], hidden_states)
/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   720             result = self._slow_forward(*input, **kwargs)
   721         else:
--> 722             result = self.forward(*input, **kwargs)
   723         for hook in itertools.chain(
   724                 _global_forward_hooks.values(),
/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
   273             attention_probs = attention_probs * head_mask
   274 
--> 275         context_layer = torch.matmul(attention_probs, value_layer)
   276 
   277         context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 15.04 GiB already allocated; 15.75 MiB free; 15.12 GiB reserved in total by PyTorch)

Additional code, in case it helps (I have already tried everything about PyTorch I could find online; the full notebook is available at https://www.kaggle.com/tajakuz/0-sloberta-example-max-seq-length-error):

!conda install --yes pytorch>=1.6 cudatoolkit=11.0 -c pytorch
# install simpletransformers
!pip install -q transformers
!pip install --upgrade transformers
!pip install -q simpletransformers
# check installed version
!pip freeze | grep simpletransformers
!pip uninstall -q torch -y
!pip install -q torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# pytorch libraries
import torch # the main pytorch library
import torch.nn as nn # the sub-library containing Softmax, Module and other useful functions
import torch.optim as optim # the sub-library containing the common optimizers (SGD, Adam, etc.)
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
#importing other necessary packages and ClassificationModel for bert
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
from scipy.special import softmax

Thank you very much for your help, it is really appreciated!

This happens because max_seq_length determines the length of the token sequences fed to the model. A longer sequence means much larger intermediate activations during training (the self-attention memory grows roughly with the square of the sequence length), so the model has to allocate far more GPU memory, which can exceed the memory limit you get on these platforms.
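To get a sense of the scale, here is a rough back-of-the-envelope sketch (assuming a base-sized encoder with 12 layers and 12 attention heads, as in CamemBERT-base/sloBERTa) of how just the attention score tensors grow when the sequence length goes from 128 to 512 at batch size 32:

# Rough estimate of the attention-score activations alone (float32).
# Real usage is higher still: gradients, optimizer state and other activations add to it.
def attention_scores_gib(batch_size, seq_len, layers=12, heads=12, bytes_per_float=4):
    # each layer keeps a (batch, heads, seq_len, seq_len) score matrix during the forward pass
    return batch_size * heads * seq_len * seq_len * layers * bytes_per_float / 1024**3

for seq_len in (128, 500, 512):
    print(f"seq_len={seq_len}: ~{attention_scores_gib(32, seq_len):.2f} GiB")
# seq_len=128: ~0.28 GiB
# seq_len=500: ~4.29 GiB
# seq_len=512: ~4.50 GiB

The quadratic growth also explains why lowering max_seq_length only slightly (509 or 500 instead of 512) barely changes the memory footprint, while a much shorter sequence length or a smaller batch size does.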

Most of the time the right max_seq_length depends on your dataset, and setting it higher than needed can waste training time and memory.

What you can do is find the maximum number of tokens in any sample of your training dataset and use that as max_seq_length, as in the sketch below.
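A minimal sketch of that check, assuming the training dataframe keeps its texts in a "text" column (the column name is an assumption; adapt it to your data) and using the Hugging Face tokenizer for the same checkpoint:

from transformers import AutoTokenizer

# tokenize every training text and record its length in tokens (not words:
# the tokenizer splits words into subword pieces, so token counts are what matter)
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
token_lengths = [
    len(tokenizer.encode(text, add_special_tokens=True)) for text in train_df["text"]
]
longest = max(token_lengths)
print(f"Longest training example: {longest} tokens")

# use the observed maximum, capped at the model's 512-token limit, as max_seq_length
model_args["max_seq_length"] = min(longest, 512)

If the longest example is well under 512, this gives you the benefit of covering every text in full without paying for padding you do not need.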
