Implemented the model network, but neither the training loss nor the validation loss decreases



Since I am new to PyTorch, this may be a very trivial question, but I would like to ask for your help in solving it.

I implemented a network from a paper, using all of the hyperparameters and all of the layers described in the paper.

But when training starts, the loss does not decrease even though I set the learning rate to 0.001 with decay. Over 100 epochs the training loss stays around 3.3–3.4 and the test loss around 3.5–3.6!

I could change the hyperparameters to improve the model, but since the paper gives the exact numbers, I want to check whether there is a mistake in the training code I implemented.

The code below is what I use for training.

from torch.utils.data.sampler import SubsetRandomSampler
import torch.nn.functional as F
import torch.optim as optim
import torch.nn as nn 
import json
import torch
import math
import time
import os 
model = nn.Sequential(Baseline(), Classification(40)).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
batch = 32
train_path = '/content/mtrain'
train_data = os.listdir(train_path)
test_path = '/content/mtest'
test_data = os.listdir(test_path)
train_loader = torch.utils.data.DataLoader(train_data, batch, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch, shuffle=True)
train_loss, val_loss = [], []
epochs = 100
now = time.time()
print('training start!')
for epoch in range(epochs):
    running_loss = 0.0
    for bidx, trainb32 in enumerate(train_loader):
        bpts, blabel = [], []

        for i, data in enumerate(trainb32):
            path = os.path.join(train_path, data)

            with open(path, 'r') as f:
                jdata = json.load(f)

            label = jdata['label']
            pts = jdata['pts']
            bpts.append(pts)
            blabel.append(label)
        bpts = torch.tensor(bpts).transpose(1, 2).to(device)
        blabel = torch.tensor(blabel).to(device)

        input = data_aug(bpts).to(device)
        optimizer.zero_grad()
        y_pred, feat_stn, glob_feat = model(input)
        # print(f'global_feat is {glob_feat}')
        loss = F.nll_loss(y_pred, blabel) + 0.001 * regularizer(feat_stn)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if bidx % 10 == 9:
            vrunning_loss = 0
            vacc = 0
            model.eval()
            with torch.no_grad():

                # val batch
                for vbidx, testb32 in enumerate(test_loader):
                    bpts, blabel = [], []

                    for j, data in enumerate(testb32):
                        path = os.path.join(test_path, data)

                        with open(path, 'r') as f:
                            jdata = json.load(f)

                        label = jdata['label']
                        pts = jdata['pts']
                        bpts.append(pts)
                        blabel.append(label)
                    bpts = torch.tensor(bpts).transpose(1, 2).to(device)
                    blabel = torch.tensor(blabel).to(device)

                    input = data_aug(bpts).to(device)
                    vy_pred, vfeat_stn, vglob_feat = model(input)
                    # print(f'global_feat is {vglob_feat}')
                    vloss = F.nll_loss(vy_pred, blabel) + 0.001 * regularizer(vfeat_stn)
                    _, vy_max = torch.max(vy_pred, dim=1)
                    vy_acc = torch.sum(vy_max == blabel) / batch

                    vacc += vy_acc
                    vrunning_loss += vloss
            # print every 10th training batch
            train_loss.append(running_loss / len(train_loader))
            val_loss.append(vrunning_loss / len(test_loader))
            print(f"Epoch {epoch+1}/{epochs} {bidx}/{len(train_loader)}.. "
                  f"Train loss: {running_loss / 10:.3f}.."
                  f"Val loss: {vrunning_loss / len(test_loader):.3f}.."
                  f"Val Accuracy: {vacc/len(test_loader):.3f}.."
                  f"Time: {time.time() - now}")
            now = time.time()
            running_loss = 0
            model.train()


print(f'training finish! training time is {time.time() - now}')
print(model.parameters())
savePath = '/content/modelpath.pth'
torch.save(model.state_dict(), '/content/modelpath.pth')

Sorry for the basic question, but I would be very happy to hear that there is no mistake in this training code, and if there is one, please give me any hints for solving it.

I have implemented the PointNet code; the full code is available at https://github.com/RaraKim/PointNet/blob/master/PointNet_pytorch.ipynb

Thank you!

I saw your code, and I believe you have some manually declared tensors. In torch, a tensor's `requires_grad` flag is False by default. I think that because of this your backpropagation is not working properly. Could you try fixing that? If the problem persists, I would be happy to help you further.
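To illustrate what I mean (a minimal sketch unrelated to your actual model, with made-up tensors): a tensor created with `torch.tensor()` defaults to `requires_grad=False`, which is fine for input data, but any tensor that is meant to be learned must have the flag set or autograd will not compute a gradient for it.

    import torch

    # torch.tensor() creates leaf tensors with requires_grad=False by default.
    # Fine for inputs, but a tensor that should be *learned* must be tracked
    # by autograd, or backward() will leave its .grad empty.
    x = torch.tensor([[1.0, 2.0]])        # input data: no grad needed
    w = torch.tensor([[0.5], [0.5]])      # manually declared "weight"
    print(w.requires_grad)                # False: not tracked by autograd

    w_ok = torch.tensor([[0.5], [0.5]], requires_grad=True)  # tracked version
    loss = (x @ w_ok).sum()
    loss.backward()
    print(w_ok.grad)                      # populated: tensor([[1.], [2.]]), i.e. x transposed

In practice, parameters inside `nn.Module` layers are `nn.Parameter` objects and are tracked automatically; the flag only becomes a problem for tensors you declare by hand and expect the optimizer to update.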