Fine-tuning GPT2 - attention mask and pad token id errors



I have been trying to fine-tune GPT2 on the wikitext-2 dataset (just to help myself learn the process), and I am running into a warning message I have never seen before:

"注意掩码和pad令牌id未设置。因此,您可能会观察到意想不到的行为。请传递您输入的attention_mask以获得可靠的结果。设置pad_token_ideos_token_id:50256,用于开放端生成。">

This seems strange to me, because I explicitly specify the EOS token in my code when instantiating the tokenizer:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

Training completes without crashing and my loss improves every epoch, but when I run inference with the model it outputs absolute gibberish, sometimes generating only a single word and nothing else. I suspect there is a connection between this warning message and the model's poor performance.

I got my training, validation, and test data from here (I used the .raw files): https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

I manually added <|startoftext|> and <|endoftext|> to the dataset's raw .txt files (a rough script sketch for doing the same thing follows the excerpt below). The resulting training data looks like the following two examples (taken from the middle of the text file):

...
<|startoftext|>
= Perfect Dark ( 2010 video game ) = 

Perfect Dark is a remastered release of the first @-@ person shooter video game by the same name . Developed by 4J Studios and published by Microsoft Game Studios a decade after the original 's 2000 release , the remaster features several technical improvements , including higher resolution textures and models , a higher frame rate , and a multiplayer mode that supports the Xbox Live online service . It was released for the Xbox 360 video game console in March 2010 , through the Xbox Live Arcade download service . The story of the game follows Joanna Dark , an agent of the Carrington Institute organization , as she attempts to stop a conspiracy by rival corporation dataDyne . 
Perfect Dark was under development for nearly a year and its game engine was completely re @-@ written from scratch to support several Xbox 360 features . Therefore , although the game plays exactly the same as the original , the code and renderer is different . The game received generally favorable reviews . Some critics considered the relatively unchanged game to be outdated , but most agreed that the title was a solid revival of a classic . As of the end of 2011 , the game had sold nearly 410 @,@ 000 units . 

= = Gameplay = = 

Perfect Dark is a first @-@ person shooter with elements of stealth games . In the game 's campaign mode , the player controls Joanna Dark through a series of nonlinear levels collected together into missions . Each level requires the player to complete a certain number of objectives , ranging from disguising oneself to hacking computers , collecting objects , and defeating enemies , among others . Players can carry an unlimited number of weapons and almost all of the weapons have two firing modes . The levels in Perfect Dark have no checkpoints , meaning that if Joanna is killed or fails an objective , the player has to start the level from the beginning . Every level can be played on three difficulty settings and several aspects , such as the enemies aggressiveness and the number of objectives that must be completed , among others , can vary in function of the chosen difficulty . Two players can also play the campaign co @-@ operatively or through a " counter @-@ operative " mode , in which one player controls the protagonist , while the other controls enemies throughout the level , attempting to stop the first player from completing objectives . 

= = = Enhancements = = = 

The remaster offers several improvements over the original Perfect Dark that was released for the Nintendo 64 in 2000 . The most remarkable change is that any of the multiplayer modes , including co @-@ operative and counter @-@ operative , can now be played in either splitscreen or through the Xbox Live online service . Combat Simulator matches are still capped at 12 entities , but the game can now comprise eight players online simultaneously , an improvement to the original 's cap of four players and eight Simulants . Players can also play against more than eight Simulants as long as there are enough slots available in a match ; for example , a single player can play against 11 Simulants ; such a feature was not possible in the original game . Unlike the original game , all the multiplayer content is unlocked from the beginning , and weapons from the game 's predecessor , which were originally only available in the missions , are now available to use in multiplayer . The game features an online leaderboard system and players can earn achievements and in @-@ game crowns by accomplishing certain tasks . The game also includes two new control set @-@ ups , entitled " Spartan " and " Duty Calls " , which are based on the popular first @-@ person shooter franchises Halo and Call of Duty respectively . 

<|endoftext|>
<|startoftext|>
= First Ostend Raid = 

The First Ostend Raid ( part of Operation ZO ) was the first of two attacks by the Royal Navy on the German @-@ held port of Ostend during the late spring of 1918 during the First World War . Ostend was attacked in conjunction with the neighbouring harbour of Zeebrugge on 23 April in order to block the vital strategic port of Bruges , situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland and ideally sited to conduct raiding operations on the British coastline and shipping lanes . Bruges and its satellite ports were a vital part of the German plans in their war on Allied commerce ( Handelskrieg ) because Bruges was close to the troopship lanes across the English Channel and allowed much quicker access to the Western Approaches for the U @-@ boat fleet than their bases in Germany . 
The plan of attack was for the British raiding force to sink two obsolete cruisers in the canal mouth at Ostend and three at Zeebrugge , thus preventing raiding ships leaving Bruges . The Ostend canal was the smaller and narrower of the two channels giving access to Bruges and so was considered a secondary target behind the Zeebrugge Raid . Consequently , fewer resources were provided to the force assaulting Ostend . While the attack at Zeebrugge garnered some limited success , the assault on Ostend was a complete failure . The German marines who defended the port had taken careful preparations and drove the British assault ships astray , forcing the abortion of the operation at the final stage . 
Three weeks after the failure of the operation , a second attack was launched which proved more successful in sinking a blockship at the entrance to the canal but ultimately did not close off Bruges completely . Further plans to attack Ostend came to nothing during the summer of 1918 , and the threat from Bruges would not be finally stopped until the last days of the war , when the town was liberated by Allied land forces . 

= = Bruges = = 

Bruges had been captured by the advancing German divisions during the Race for the Sea and had been rapidly identified as an important strategic asset by the German Navy . Bruges was situated 6 mi ( 5 @.@ 2 nmi ; 9 @.@ 7 km ) inland at the centre of a network of canals which emptied into the sea at the small coastal towns of Zeebrugge and Ostend . This land barrier protected Bruges from bombardment by land or sea by all but the very largest calibre artillery and also secured it against raiding parties from the Royal Navy . Capitalising on the natural advantages of the port , the German Navy constructed extensive training and repair facilities at Bruges , equipped to provide support for several flotillas of destroyers , torpedo boats and U @-@ boats . 
By 1916 , these raiding forces were causing serious concern in the Admiralty as the proximity of Bruges to the British coast , to the troopship lanes across the English Channel and for the U @-@ boats , to the Western Approaches ; the heaviest shipping lanes in the World at the time . In the late spring of 1915 , Admiral Reginald Bacon had attempted without success to destroy the lock gates at Ostend with monitors . This effort failed , and Bruges became increasingly important in the Atlantic Campaign , which reached its height in 1917 . By early 1918 , the Admiralty was seeking ever more radical solutions to the problems raised by unrestricted submarine warfare , including instructing the " Allied Naval and Marine Forces " department to plan attacks on U @-@ boat bases in Belgium . 
The " Allied Naval and Marine Forces " was a newly formed department created with the purpose of conducting raids and operations along the coastline of German @-@ held territory . The organisation was able to command extensive resources from both the Royal and French navies and was commanded by Admiral Roger Keyes and his deputy , Commodore Hubert Lynes . Keyes , Lynes and their staff began planning methods of neutralising Bruges in late 1917 and by April 1918 were ready to put their plans into operation . 

= = Planning = = 

To block Bruges , Keyes and Lynes decided to conduct two raids on the ports through which Bruges had access to the sea . Zeebrugge was to be attacked by a large force consisting of three blockships and numerous supporting warships . Ostend was faced by a similar but smaller force under immediate command of Lynes . The plan was for two obsolete cruisers — HMS Sirius and Brilliant — to be expended in blocking the canal which emptied at Ostend . These ships would be stripped to essential fittings and their lower holds and ballast filled with rubble and concrete . This would make them ideal barriers to access if sunk in the correct channel at the correct angle . 
When the weather was right , the force would cross the English Channel in darkness and attack shortly after midnight to coincide with the Zeebrugge Raid a few miles up the coast . By coordinating their operations , the assault forces would stretch the German defenders and hopefully gain the element of surprise . Covering the Inshore Squadron would be heavy bombardment from an offshore squadron of monitors and destroyers as well as artillery support from Royal Marine artillery near Ypres in Allied @-@ held Flanders . Closer support would be offered by several flotillas of motor launches , small torpedo boats and Coastal Motor Boats which would lay smoke screens to obscure the advancing blockships as well as evacuate the crews of the cruisers after they had blocked the channel . 
<|endoftext|> ...
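In case it's useful, here is a rough sketch of how that manual marker insertion could be scripted instead. I actually edited the files by hand, so the file paths, the output name, and the heading regex here are just assumptions about the wikitext layout (top-level articles start with lines like " = Title = "):

import re

# hypothetical helper: wrap every top-level wikitext article in <|startoftext|> / <|endoftext|> markers
def add_markers(in_path, out_path):
    with open(in_path, 'r') as f:
        text = f.read()
    # split right before each level-1 heading; level-2+ headings (" = = ... = = ") are not matched
    articles = re.split(r'(?m)^(?= = [^=].* = \s*$)', text)
    articles = [a for a in articles if a.strip()]
    with open(out_path, 'w') as f:
        for article in articles:
            f.write('<|startoftext|>\n' + article.rstrip() + '\n<|endoftext|>\n')

# add_markers('wikitext-2-raw/wiki.train.raw', 'wiki.train.marked.raw')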

I followed this tutorial very closely: https://colab.research.google.com/drive/13dZVYEOMhXhkXWfvSMVM1TTtUDrT6Aeh?usp=sharing#scrollTo=pBEVY2PYSTXJ

Here is my full code:
import random
import time
import datetime
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup, GPT2Config
smallest_gpt2 = 'gpt2'  # 124M weights (parameters)
# load training texts
with open('wikitext-2-raw/wiki.train.raw', 'r') as o:
    raw_train_text = o.read()  # read() returns the whole file as one string (readlines() would return a list split on '\n')
with open('wikitext-2-raw/wiki.valid.raw', 'r') as o:
    raw_validation_text = o.read()
with open('wikitext-2-raw/wiki.test.raw', 'r') as o:
    raw_test_text = o.read()
# PRE-PROCESSING TRAINING, VALIDATION, AND TEST TEXTS
preprocessed_train = raw_train_text.split('<|startoftext|>')
preprocessed_train = [i for i in preprocessed_train if i]  # removes empty list entries
preprocessed_train = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_train]  # re-adds <|startoftext|> to the start
preprocessed_valid = raw_validation_text.split('<|startoftext|>')
preprocessed_valid = [i for i in preprocessed_valid if i]
preprocessed_valid = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_valid]
preprocessed_test = raw_test_text.split('<|startoftext|>')
preprocessed_test = [i for i in preprocessed_test if i]
preprocessed_test = ['<|startoftext|>' + '\n' + entry for entry in preprocessed_test]
# HYPER PARAMETERS
EPOCHS = 5
BATCH_SIZE = 2  # GPT2 is a large model, so higher batch sizes can lead to memory problems
WARMUP_STEPS = 100
LEARNING_RATE = 5e-4
DECAY = 0
EPSILON = 1e-8

class GPT2Dataset(Dataset):
    def __init__(self, txt_list, _tokenizer, gpt2_type=smallest_gpt2, max_length=768):
        self.tokenizer = _tokenizer
        self.input_ids = []
        self.attn_masks = []
        # this loop will wrap all training data examples in BOS and EOS tokens (beginning/end of sequence)
        # this, again, helps the model understand the "format" of what you're training it for
        # note however, that if a training example is longer than the max length, the EOS token will be truncated, and
        #   this is not a problem for the model's training process
        for txt in txt_list:
            # pre_processed_text = '<|startoftext|>' + txt + '<|endoftext|>'  # i did this manually, so I skip it here
            # print(txt)
            # i handled most of the pre-processing for the training data further up in the code
            encodings_dict = _tokenizer(txt, truncation=True, max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

# loading tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2', bos_token='<|startoftext|>', eos_token='<|endoftext|>',
                                          pad_token='<|pad|>')  # gpt2-medium
print("The max model length is {} for this model, although the actual embedding size for GPT small is 768".format(tokenizer.model_max_length))
print("The beginning of sequence token {} token has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.bos_token_id), tokenizer.bos_token_id))
print("The end of sequence token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.eos_token_id), tokenizer.eos_token_id))
print("The padding token {} has the id {}".format(tokenizer.convert_ids_to_tokens(tokenizer.pad_token_id), tokenizer.pad_token_id))
# create dataset objects
train_dataset = GPT2Dataset(preprocessed_train, tokenizer, max_length=768)
valid_dataset = GPT2Dataset(preprocessed_valid, tokenizer, max_length=768)
test_dataset = GPT2Dataset(preprocessed_test, tokenizer, max_length=768)
# getting size of datasets
train_size = len(train_dataset)
val_size = len(valid_dataset)
print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
# Create the DataLoaders for our training and validation datasets.
# We'll take training samples in random order.
train_dataloader = DataLoader(  # todo learn how dataloader creates targets
    train_dataset,  # The training samples.
    sampler=RandomSampler(train_dataset),  # Select batches randomly
    batch_size=BATCH_SIZE  # Trains with this batch size.
)
# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    valid_dataset,  # The validation samples.
    sampler=SequentialSampler(valid_dataset),  # Pull out batches sequentially.
    batch_size=BATCH_SIZE  # Evaluate with this batch size.
)
# config
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)
# instantiate model
model = GPT2LMHeadModel.from_pretrained(smallest_gpt2, config=configuration)
# this step is necessary because I've added some tokens (bos_token, etc) to the embeddings
# otherwise the tokenizer and model tensors won't match up. NOTE these tokens are already added to tokenizer above
model.resize_token_embeddings(len(tokenizer))
# this produces sample output every 50 steps
sample_every = 50
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, eps=EPSILON)
# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * EPOCHS
# Create the learning rate scheduler.
# This changes the learning rate as the training loop progresses
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)
training_stats = []
total_t0 = time.time()
# device config
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

def format_time(_elapsed):
    return str(datetime.timedelta(seconds=int(round(_elapsed))))

for epoch_i in range(0, EPOCHS):
    # ========================================
    #               Training
    # ========================================
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, EPOCHS))
    print('Training...')
    t0 = time.time()
    total_train_loss = 0
    model.train()  # puts model in training mode
    for step, batch in enumerate(train_dataloader):
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)  # training targets
        b_masks = batch[1].to(device)
        model.zero_grad()
        # feeding the input to the model
        outputs = model(b_input_ids,
                        labels=b_labels,
                        attention_mask=b_masks,
                        token_type_ids=None
                        )
        loss = outputs[0]  # how "wrong" was the model?
        batch_loss = loss.item()
        total_train_loss += batch_loss
        # Get sample every x batches. This is just a check to see how the model is doing.
        if step % sample_every == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Loss: {:>5,}.   Elapsed: {:}.'.format(step, len(train_dataloader),
                                                                                     batch_loss, elapsed))
            model.eval()  # puts model in evaluation mode, where the necessary layers are turned off for inference
            # normally you would use a context manager here so the gradients don't get modified during this inference. However the tutorial I follow does not do this.
            # with torch.no_grad():
            #     ... do inference eval ...
            # Here we are simply using the model to get an output. This is called inference.
            sample_outputs = model.generate(
                bos_token_id=random.randint(1, 30000),  # todo why do we do this line?
                do_sample=True,  # switches on sampling, where model will randomly select next word from the sample pool
                top_k=50,  # only 50 words will be considered for the next word in the sequence
                max_length=200,  # max tokens for total generation
                top_p=0.95,  # smallest set of words whose probabilities summed together reach/exceed top_p value
                num_return_sequences=1  # we only want model to generate one complete response (sequence of words)
                # temperature=1
            )
            # temperature is another parameter we can use when running inference
            # temperature of 0 will choose the highest-probability word each time
            # temperature of 1 is default, and uses the model's base confidence to choose the next word
            # temperature above 1 will make the model choose less-likely words. More creative, but more risk of nonsense
            # we only sample for one return sequence so this for is sort of unnecessary, but whatever
            for i, sample_output in enumerate(sample_outputs):
                print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
            model.train()  # we have to put model back in train mode after eval mode
        loss.backward()  # change weights with backprop
        optimizer.step()
        scheduler.step()
    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")
    t0 = time.time()
    model.eval()
    total_eval_loss = 0
    nb_eval_steps = 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)
        with torch.no_grad():  # weights are not updated
            outputs = model(b_input_ids,
                            # token_type_ids=None,
                            attention_mask=b_masks,
                            labels=b_labels)
        loss = outputs[0]
        batch_loss = loss.item()
        total_eval_loss += batch_loss
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))
    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time() - total_t0)))

I don't think this is related to your model performing poorly, but to answer your question: the warning comes from the generation routine.

As described in this post, the issue can be resolved simply by setting pad_token_id to the tokenizer's eos_token_id when calling generate.

I will only comment on the warning below:

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

The above means that when you call generate on the model, it does not know which pad token you are using. The generate method uses the pad and eos tokens for several purposes, for example to work out what the attention mask should be (i.e. which tokens in the input sequence to ignore) and for various decoding strategies. Unfortunately, many popular tokenizers do not set this token, and people end up seeing these warnings.

To fix this, first add the following code after loading the pretrained tokenizer:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

Then pass it to the generate method like this:

gen_ids = model.generate(**encodings, pad_token_id=tokenizer.pad_token_id, max_new_tokens=200)
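To make the pieces concrete, here is a minimal self-contained sketch of that fixed generation call. The prompt string is just a placeholder; `encodings` is simply the tokenizer output, which already contains the attention mask:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token, since GPT2 has none by default

model = GPT2LMHeadModel.from_pretrained('gpt2')

# the tokenizer returns both input_ids and attention_mask, so **encodings passes the mask along
encodings = tokenizer("The First Ostend Raid was", return_tensors='pt')
gen_ids = model.generate(**encodings, pad_token_id=tokenizer.pad_token_id, max_new_tokens=200)
print(tokenizer.decode(gen_ids[0], skip_special_tokens=True))

With the attention mask passed in and pad_token_id set explicitly, the warning goes away.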

You can see the HuggingFace code that produces this warning here.

You can see a complete working example here: https://github.com/sytelus/jupyter_nbs/blob/main/codegen_decoding.ipynb
