Hey, so I'm trying to do image classification / transfer learning with the monkey species dataset and a ResNet50 whose final fc layer is modified to predict only 10 classes. Everything was working until I used model.train() and model.eval(); after the first epoch the model starts returning NaNs and the accuracy drops, as shown below. I'm curious why this only happens when switching to train/eval...?
First I import the model, attach the classifier, and freeze the parameters:
%%capture
from collections import OrderedDict
import numpy as np
import torch
from torch import nn
from torchvision import models

resnet = models.resnet50(pretrained=True)
# Freeze all the pretrained parameters
for param in resnet.parameters():
    param.requires_grad = False
in_features = resnet.fc.in_features
# Build custom classifier
classifier = nn.Sequential(OrderedDict([('fc1', nn.Linear(in_features, 512)),
                                        ('relu', nn.ReLU()),
                                        ('drop', nn.Dropout(0.05)),
                                        ('fc2', nn.Linear(512, 10)),
                                        # ('output', nn.LogSoftmax(dim=1))
                                        ]))
# Replace the final fully connected layer
resnet.fc = classifier
resnet.to(device)
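As a quick sanity check (a minimal sketch, assuming the resnet defined above), you can list which parameters are still trainable; with the backbone frozen, only the new classifier's weights and biases should show up:

# Sanity check (sketch): list the parameters that will actually receive gradients
trainable = [name for name, p in resnet.named_parameters() if p.requires_grad]
print(f'{len(trainable)} trainable tensors: {trainable}')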
Then I set up my loss function, optimizer, and scheduler:
# Step: Define criterion and optimizer
criterion = nn.CrossEntropyLoss()
# The optimizer is given all parameters; only the unfrozen classifier layers have gradients to update
optimizer = torch.optim.SGD(resnet.parameters(), lr=0.01)
# Scheduler: multiply the learning rate by 0.05 at epoch 10
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)
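As a side note, MultiStepLR here multiplies the learning rate by gamma=0.05 exactly once, at epoch 10, so 0.01 becomes 0.0005. A minimal sketch to preview the schedule (the dummy model is hypothetical, only there to build an optimizer):

# Sketch: preview the MultiStepLR schedule defined above
dummy = nn.Linear(1, 1)  # hypothetical stand-in model
opt = torch.optim.SGD(dummy.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[10], gamma=0.05)
for epoch in range(12):
    opt.step()    # optimizer.step() before scheduler.step(), as PyTorch expects
    sched.step()
    print(f'epoch {epoch}: lr = {opt.param_groups[0]["lr"]}')
# epochs 0-8 print lr = 0.01; from epoch 9 onward lr = 0.0005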
Then I set up the training and validation loops:
epochs = 20
tr_losses = []
avg_epoch_tr_loss = []
tr_accuracy = []
val_losses = []
avg_epoch_val_loss = []
val_accuracy = []
val_loss_min = np.Inf

resnet.train()
for epoch in range(epochs):
    for i, batch in enumerate(train_loader):
        # Pull the data and labels from the batch
        data, label = batch
        # If available, push data and label to GPU
        if train_on_gpu:
            data, label = data.to(device), label.to(device)
        # Compute the logits
        logit = resnet(data)
        # Compute loss
        loss = criterion(logit, label)
        # Clearing the gradient
        resnet.zero_grad()
        # Backpropagate the gradients (accumulate the partial derivatives of loss)
        loss.backward()
        # Apply the updates, stepping in the opposite direction to the gradient
        optimizer.step()
        # Store the loss of each batch
        # loss.item() separates the loss from the comp graph
        tr_losses.append(loss.item())
        # Store the average accuracy of each batch
        tr_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean().item())
        # Print the rolling batch training loss every 40 batches
        if i % 40 == 0:
            print(f'Batch No: {i} \tAverage Training Batch Loss: {torch.tensor(tr_losses).mean():.2f}')
    # Print the average loss for each epoch
    print(f'\nEpoch No: {epoch + 1}, Training Loss: {torch.tensor(tr_losses).mean():.2f}')
    # Print the average accuracy for each epoch
    print(f'Epoch No: {epoch + 1}, Training Accuracy: {torch.tensor(tr_accuracy).mean():.2f}\n')
    # Store the avg epoch loss for plotting
    avg_epoch_tr_loss.append(torch.tensor(tr_losses).mean())

    resnet.eval()
    for i, batch in enumerate(val_loader):
        # Pull the data and labels from the batch
        data, label = batch
        # If available, push data and label to GPU
        if train_on_gpu:
            data, label = data.to(device), label.to(device)
        # Compute the logits without computing the gradients
        with torch.no_grad():
            logit = resnet(data)
        # Compute loss
        loss = criterion(logit, label)
        # Store the validation loss
        val_losses.append(loss.item())
        # Store the accuracy for each batch
        val_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean().item())
        # Print the rolling batch validation loss every 20 batches
        if i % 20 == 0:
            print(f'Batch No: {i+1} \tAverage Val Batch Loss: {torch.tensor(val_losses).mean():.2f}')
    # Print the average loss for each epoch
    print(f'\nEpoch No: {epoch + 1}, Epoch Val Loss: {torch.tensor(val_losses).mean():.2f}')
    # Print the average accuracy for each epoch
    print(f'Epoch No: {epoch + 1}, Epoch Val Accuracy: {torch.tensor(val_accuracy).mean():.2f}\n')
    # Store the avg epoch loss for plotting
    avg_epoch_val_loss.append(torch.tensor(val_losses).mean())
    # Checkpoint the model using a val loss threshold
    if torch.tensor(val_losses).float().mean() <= val_loss_min:
        print("Epoch Val Loss Decreased... Saving model")
        # Save the current model
        torch.save(resnet.state_dict(), '/content/drive/MyDrive/1. Full Projects/Intel Image Classification/model_state.pt')
        val_loss_min = torch.tensor(val_losses).mean()
    # Step the scheduler for the next epoch
    scheduler.step()
    # Print the updated learning rate
    print('Learning Rate Set To: {:.5f}'.format(optimizer.state_dict()['param_groups'][0]['lr']), '\n')
The model starts training, but then the loss slowly turns into NaN values:
Batch No: 0 Average Training Batch Loss: 9.51
Batch No: 40 Average Training Batch Loss: 1.71
Batch No: 80 Average Training Batch Loss: 1.15
Batch No: 120 Average Training Batch Loss: 0.94
Epoch No: 1, Training Loss: 0.83
Epoch No: 1, Training Accuracy: 0.78
Batch No: 1 Average Val Batch Loss: 0.39
Batch No: 21 Average Val Batch Loss: 0.56
Batch No: 41 Average Val Batch Loss: 0.54
Batch No: 61 Average Val Batch Loss: 0.54
Epoch No: 1, Epoch Val Loss: 0.55
Epoch No: 1, Epoch Val Accuracy: 0.81
Epoch Val Loss Decreased... Saving model
Learning Rate Set To: 0.01000
Batch No: 0 Average Training Batch Loss: 0.83
Batch No: 40 Average Training Batch Loss: nan
Batch No: 80 Average Training Batch Loss: nan
I see that resnet.zero_grad() is called after logit = resnet(data), which causes the gradients to explode in your case. Do it as follows instead:
# Clearing the gradient
optimizer.zero_grad()
logit = resnet(data)
# Compute loss
loss = criterion(logit, label)
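Putting it together, a minimal sketch of the per-batch order this implies (assuming the same resnet, criterion, optimizer, and train_loader as above):

for data, label in train_loader:
    data, label = data.to(device), label.to(device)
    optimizer.zero_grad()            # clear old gradients before the forward pass
    logit = resnet(data)             # forward pass
    loss = criterion(logit, label)   # compute loss
    loss.backward()                  # backpropagate
    optimizer.step()                 # update the trainable parameters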