When I move to another computer I lose all of my training progress, and I don't understand why. It should save the model after every chunk, and it does, because when I restart it on the same computer the loss is the same (low). But when I move it to another computer with the same data, the predictions are nowhere near as good and training starts from a high loss again. When I move to the new computer I just copy /gpt2_folder, right?
import os
import math
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
import random
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 32 # Batch size for training.
epochs = 1 # Number of epochs to train for.
chunk_size = 250000 # Number of samples to train on.
data_path = "data.txt"
start_token = "<start>"
end_token = "<end>"
# Read the number of lines in the data file
with open(data_path, "r", encoding="utf-8") as f:
    num_lines = sum(1 for line in f)
# Calculate the number of chunks needed
num_chunks = math.ceil(num_lines / chunk_size)
# Use GPT-2's tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2_model")
# Add the start and end tokens to the tokenizer
tokenizer.add_tokens([start_token, end_token])
# Load GPT-2's pre-trained model
model = GPT2LMHeadModel.from_pretrained("./gpt2_model")
# Resize the model's token embeddings to include the new tokens
model.resize_token_embeddings(len(tokenizer))
for chunk in range(num_chunks):
    # Clear previous data
    input_texts = []
    filename = ""
    # ...
    with open(data_path, "r", encoding="utf-8") as f:
        # Skip lines that have already been read
        for _ in range(chunk * chunk_size):
            next(f)
        # Read the lines for this chunk
        for _ in range(chunk_size):
            line = f.readline().strip()
            if not line:
                break
            # Split on the first colon only, so targets containing ":" don't raise a ValueError
            input_text, target_text = line.split(":", 1)
            input_texts.append(start_token + input_text + ":" + target_text + end_token)
    filename = "processed_data.txt"
    # Save the processed data into a new file
    with open(str(chunk) + filename, "w", encoding="utf-8") as f:
        for text in input_texts:
            f.write(text + "\n")
    print("file created: " + str(chunk) + filename)
    # Create the dataset using the tokenizer
    dataset = TextDataset(tokenizer=tokenizer, file_path=str(chunk) + filename, block_size=128)
    # Create a data collator for language modeling
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./gpt2_model",
        overwrite_output_dir=True,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        save_steps=10_000,
        save_total_limit=2,
        # Add these arguments to enable multi-GPU training
    )
    # Create a Trainer instance with the model, dataset, and training arguments
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
    # Train the model on this chunk
    trainer.train()
    model.save_pretrained("./gpt2_model")
    tokenizer.save_pretrained("./gpt2_model")
The output directory isn't specified as an absolute path; it's resolved relative to the current working directory.
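A quick way to confirm which folder is actually being read and written is to print the resolved path before training; a minimal sketch (save_dir mirrors the output_dir used above):

import os

# "./gpt2_model" is resolved against the current working directory,
# so the same script can use a different folder depending on where
# it is launched from.
save_dir = "./gpt2_model"
print("working directory:", os.getcwd())
print("model directory resolves to:", os.path.abspath(save_dir))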
Maybe the installation on the new machine is slightly different and you're looking in the wrong place.
Or you may have different software (for example, different dependency versions) causing the problem.
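If you suspect the environments differ, printing the relevant library versions on both machines is a cheap check; a minimal sketch:

import torch
import transformers

# Run this on both computers and compare the output.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)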
I don't believe anything is kept only in memory, so as long as you transfer everything on disk and reference it correctly, the code won't notice that it's running on another computer.
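One way to verify that the transfer itself was lossless is to checksum every file in the model folder on both machines and compare the results; a minimal sketch, assuming the folder is ./gpt2_model:

import hashlib
import os

model_dir = "./gpt2_model"  # adjust if your folder is named differently

def sha256_of(path, block_size=1 << 20):
    # Hash the file in chunks so large weight files don't need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Run on both machines and diff the output; any mismatch means the
# copy was incomplete or altered.
for name in sorted(os.listdir(model_dir)):
    path = os.path.join(model_dir, name)
    if os.path.isfile(path):
        print(sha256_of(path), name)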
A simple way to rule out the hypothesized root cause (that this specific code implicitly writes to a second location) is to have a simple script write something to the path you use here, then try to read it on the new computer after migrating the data.
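A minimal sketch of that test (the marker filename transfer_check.txt is made up for illustration):

import os

# On the old machine: drop a marker file into the same directory the
# trainer saves to, right next to the model weights.
marker = os.path.join("./gpt2_model", "transfer_check.txt")
with open(marker, "w", encoding="utf-8") as f:
    f.write("written on the training machine\n")

# On the new machine, after copying the folder: if this read fails,
# the directory you copied is not the one the script actually uses.
with open(marker, "r", encoding="utf-8") as f:
    print(f.read())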