Using the huggingface Trainer with distributed data parallel

To improve performance, I looked into PyTorch's DistributedDataParallel and tried to apply it to the transformers Trainer.

The PyTorch examples for DDP state that it should at least be faster:

DataParallel is single-process, multi-thread, and only works on a single machine, while DistributedDataParallel is multi-process and works for both single- and multi-machine training. DataParallel is usually slower than DistributedDataParallel even on a single machine due to GIL contention across threads, the per-iteration replicated model, and the additional overhead introduced by scattering inputs and gathering outputs.
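Outside of the Trainer, this difference shows up directly in how the model is wrapped in raw PyTorch. A minimal sketch follows; the helper names are just for illustration, the model/rank/world_size arguments are placeholders, and the DDP variant assumes MASTER_ADDR/MASTER_PORT are set as in the spawn code further below:

import torch
import torch.distributed as dist
from torch.nn.parallel import DataParallel, DistributedDataParallel

def wrap_dp(model: torch.nn.Module) -> torch.nn.Module:
    # Single process, multiple threads: the model sits on cuda:0 and is
    # replicated to the other GPUs on every forward pass.
    return DataParallel(model.cuda())

def wrap_ddp(model: torch.nn.Module, rank: int, world_size: int) -> torch.nn.Module:
    # One process per GPU: each process joins the process group once, keeps
    # its own model replica, and only gradients are synchronized.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return DistributedDataParallel(model.to(rank), device_ids=[rank])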

My DataParallel trainer looks like this:

import os
from datetime import datetime
import sys
import torch
from transformers import Trainer, TrainingArguments, BertConfig

training_args = TrainingArguments(
    output_dir=os.path.join(path_storage, 'results', "mlm"),  # output directory
    num_train_epochs=1,  # total # of training epochs
    gradient_accumulation_steps=2,  # for accumulation over multiple steps
    per_device_train_batch_size=4,  # batch size per device during training
    per_device_eval_batch_size=4,  # batch size for evaluation
    logging_dir=os.path.join(path_storage, 'logs', "mlm"),  # directory for storing logs
    evaluate_during_training=False,
    max_steps=20,
)

mlm_train_dataset = ProteinBertMaskedLMDataset(
    path_vocab, os.path.join(path_storage, "data", "uniparc", "uniparc_train_sorted.h5"),
)

mlm_config = BertConfig(
    vocab_size=mlm_train_dataset.tokenizer.vocab_size,
    max_position_embeddings=mlm_train_dataset.input_size
)

mlm_model = ProteinBertForMaskedLM(mlm_config)

trainer = Trainer(
    model=mlm_model,  # the instantiated 🤗 Transformers model to be trained
    args=training_args,  # training arguments, defined above
    train_dataset=mlm_train_dataset,  # training dataset
    data_collator=mlm_train_dataset.collate_fn,
)

print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)
start = datetime.now()
trainer.train()
print(f"finished in {datetime.now() - start} seconds")

Output:

build trainer with on device: cuda:0 with n gpus: 4
finished in 0:02:47.537038 seconds

My DistributedDataParallel trainer is built like this:

def create_transformer_trainer(rank, world_size, train_dataset, model):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    training_args = TrainingArguments(
        output_dir=os.path.join(path_storage, 'results', "mlm"),  # output directory
        num_train_epochs=1,  # total # of training epochs
        gradient_accumulation_steps=2,  # for accumulation over multiple steps
        per_device_train_batch_size=4,  # batch size per device during training
        per_device_eval_batch_size=4,  # batch size for evaluation
        logging_dir=os.path.join(path_storage, 'logs', "mlm"),  # directory for storing logs
        local_rank=rank,
        max_steps=20,
    )

    trainer = Trainer(
        model=model,  # the instantiated 🤗 Transformers model to be trained
        args=training_args,  # training arguments, defined above
        train_dataset=train_dataset,  # training dataset
        data_collator=train_dataset.collate_fn,
    )

    print("build trainer with on device:", training_args.device, "with n gpus:", training_args.n_gpu)
    start = datetime.now()
    trainer.train()
    print(f"finished in {datetime.now() - start} seconds")

mlm_train_dataset = ProteinBertMaskedLMDataset(
    path_vocab, os.path.join(path_storage, "data", "uniparc", "uniparc_train_sorted.h5"))

mlm_config = BertConfig(
    vocab_size=mlm_train_dataset.tokenizer.vocab_size,
    max_position_embeddings=mlm_train_dataset.input_size
)

mlm_model = ProteinBertForMaskedLM(mlm_config)

torch.multiprocessing.spawn(create_transformer_trainer,
                            args=(4, mlm_train_dataset, mlm_model),
                            nprocs=4,
                            join=True)
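For reference, the mp.spawn examples in the PyTorch DDP tutorial set up and tear down the process group explicitly inside each spawned worker, rather than relying on the Trainer to do it from local_rank. Below is a minimal sketch of that generic pattern only; the worker body is a placeholder, not the Trainer code above:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each spawned process joins the same process group before building its
    # model and data pipeline ...
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    try:
        pass  # ... build the model, wrap it in DistributedDataParallel, train ...
    finally:
        # ... and leaves the group cleanly when training is done.
        dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)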

Output:

The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
The current process just got forked. Disabling parallelism to avoid deadlocks...
To disable this warning, please explicitly set TOKENIZERS_PARALLELISM=(true | false)
build trainer with on device: cuda:1 with n gpus: 1
build trainer with on device: cuda:2 with n gpus: 1
build trainer with on device: cuda:3 with n gpus: 1
build trainer with on device: cuda:0 with n gpus: 1
finished in 0:04:15.937331 seconds
finished in 0:04:16.899411 seconds
finished in 0:04:16.938141 seconds
finished in 0:04:17.391887 seconds

Regarding the fork warning at the beginning: what exactly is being forked here, and is this expected?

Regarding the resulting times: am I using the Trainer incorrectly, since this seems to be much slower than the DataParallel approach?

A bit late to the party, but anyway: I'll leave this comment here to help anyone wondering whether it is possible to keep parallelism in the tokenizer.

According to this comment on GitHub, the FastTokenizer seems to be the problem. Also, according to another comment on gitmemory, the tokenizer should not be used before forking the process (which basically means before iterating through your dataloader).

So the solution is to not use the FastTokenizer before training/fine-tuning, or to use the normal (slow) tokenizer instead.

Check the huggingface documentation to see whether you really need the FastTokenizer.
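As a minimal sketch of the two workarounds mentioned above (the model name is only an example, not the ProteinBert tokenizer from the question):

import os
from transformers import AutoTokenizer

# Option 1: set this explicitly before any fork happens (mp.spawn or
# DataLoader workers), so the Rust-backed fast tokenizer does not warn
# or risk a deadlock.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Option 2: load the plain Python ("slow") tokenizer instead of the fast one.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)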
