RuntimeError: Input tensor at index 3 has invalid shape [2, 2, 16, 128, 64] but expected [2, 4, 16, 128, 64]



The error occurs while fine-tuning a pretrained GPT2-medium model with the Hugging Face library on a SageMaker ml.p3.8xlarge instance.

finetuning_gpt2_script.py contains the following.

Libraries:

from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import TextDataset, DataCollatorForLanguageModeling

Pretrained model:

gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")

Train and test dataset construction:

train_dataset = TextDataset(
    tokenizer=gpt2_tokenizer,
    file_path=train_path,
    block_size=128)

test_dataset = TextDataset(
    tokenizer=gpt2_tokenizer,
    file_path=test_path,
    block_size=128)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=gpt2_tokenizer, mlm=False,
)

train_path and test_path are unstructured text data files of size 1.45 million, with 200K lines of data.
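For context on question 1 below: TextDataset tokenizes the whole file and slices it into fixed, non-overlapping blocks of block_size tokens, dropping the trailing remainder. A minimal sketch of that behavior (simplified; the real implementation also adds special tokens and caches results, and build_blocks is an illustrative name, not a library function):

import torch

def build_blocks(tokenizer, file_path, block_size=128):
    # Tokenize the entire file as one long sequence of token ids.
    with open(file_path, encoding="utf-8") as f:
        token_ids = tokenizer.encode(f.read())
    # Slice into contiguous blocks of block_size tokens, dropping the
    # trailing remainder that does not fill a full block.
    return [
        torch.tensor(token_ids[i : i + block_size])
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

Note that the number of blocks produced this way is generally not divisible by the global batch size, so the final batch of an epoch can be smaller than the rest.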

Training arguments:

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-models", # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=1, # number of training epochs
    per_device_train_batch_size=8, # batch size for training #32
    per_device_eval_batch_size=8, # batch size for evaluation #64
    save_steps=100, # a checkpoint is saved every # steps
    warmup_steps=500, # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
)

training_args holds the training arguments used to train the model.
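An aside, not confirmed as the cause of the error below: in recent transformers versions, load_best_model_at_end requires the save strategy to match the evaluation strategy, while the arguments above mix step-based saving (save_steps=100) with epoch-based evaluation. A hedged sketch of a consistent configuration:

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-models",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",  # must match evaluation_strategy when load_best_model_at_end=True
)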

Trainer:

early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)

Training:

trainer.train()
trainer.save_model(model_path)
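model_path is assumed to be defined elsewhere in the script; on SageMaker the conventional save target is the directory exposed via the SM_MODEL_DIR environment variable, e.g.:

import os

# Hedged assumption: saving to SageMaker's model directory so artifacts are
# uploaded to S3 when the training job finishes.
model_path = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")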

Here, training runs for only 1 epoch across the 4 GPUs of the ml.p3.8xlarge instance.

Training is launched via torch.distributed as follows:

python -m torch.distributed.launch finetuning_gpt2_script.py
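For reference (relevant to question 2 below), torch.distributed.launch is usually told how many processes to spawn per node, and it passes a --local_rank argument that the script is expected to handle. A hedged sketch of a 4-GPU launch:

python -m torch.distributed.launch --nproc_per_node=4 finetuning_gpt2_script.py

Without --nproc_per_node, the launcher defaults to a single process, in which case Trainer falls back to wrapping the model across all visible GPUs instead of running one process per GPU.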

During training, at the end of the epoch, the following error is observed:

RuntimeError: Input tensor at index 3 has invalid shape [2, 2, 16, 128, 64] but expected [2, 4, 16, 128, 64]

  1. Is the RuntimeError caused by the way train_dataset and test_dataset are constructed with TextDataset?
  2. Am I doing something wrong with torch.distributed?

Could this be related to the batch-size mismatch suggested here (a batch size of 4 is expected, but a batch size of 2 is received)? The solution offered there is to set the drop_last parameter on the DataLoader, as follows:

from torch.utils.data import DataLoader

train_text = DataLoader(train_dataset, batch_size=args.batch_size, shuffle=True, drop_last=True)
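Note, though, that Trainer builds its own DataLoaders internally, so a standalone DataLoader like the one above would not affect trainer.train(). If dropping the last incomplete batch is indeed the fix, it can be passed through TrainingArguments instead (a sketch, assuming the dataloader_drop_last argument is available in the installed transformers version):

training_args = TrainingArguments(
    output_dir="./gpt2-finetuned-models",
    per_device_train_batch_size=8,
    dataloader_drop_last=True,  # drop the final incomplete batch on each process
)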
