RuntimeError: mat1 and mat2 shapes cannot be multiplied in Google Colab on Windows (1x1792 and 2



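I am trying to train a neural network that predicts a value from 3D images of different cases. According to the configuration parameters, the input images I pass to the network have size (8, 1, 96, 96, 96) and the output is a scalar value.

When I run this cell...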

# Init model
model = BrainAgeCNN().to(config.device)
config.lr = 0.01
config.betas = (0.9, 0.999)
config.num_steps = 1400

# Init optimizers
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=config.lr,
    betas=config.betas
)
# Init tensorboard
writer = TensorboardLogger(config.log_dir, config)
# Train
model, step = train(
    config=config,
    model=model,
    optimizer=optimizer,
    train_loader=dataloaders['train'],
    val_loader=dataloaders['val'],
    writer=writer
)

This is the error I get at the end of training; I do not get any error during training itself:

Training:   0%|          | 0/50 [00:00<?, ?it/s]
Training:   2%|▏         | 1/50 [00:00<00:16,  2.89it/s]
Training:   4%|▍         | 2/50 [00:00<00:17,  2.79it/s]
Training:   6%|▌         | 3/50 [00:00<00:14,  3.33it/s]
Training:   8%|▊         | 4/50 [00:01<00:12,  3.67it/s]
Training:  10%|█         | 5/50 [00:01<00:11,  3.87it/s]
Training:  12%|█▏        | 6/50 [00:01<00:10,  4.02it/s]
Training:  14%|█▍        | 7/50 [00:01<00:10,  4.12it/s]
Training:  16%|█▌        | 8/50 [00:02<00:10,  4.15it/s]
Training:  18%|█▊        | 9/50 [00:02<00:09,  4.21it/s]
Training:  20%|██        | 10/50 [00:02<00:09,  4.23it/s]
Training:  22%|██▏       | 11/50 [00:02<00:09,  4.29it/s]
Training:  24%|██▍       | 12/50 [00:03<00:08,  4.26it/s]
Training:  26%|██▌       | 13/50 [00:03<00:08,  4.30it/s]
Training:  28%|██▊       | 14/50 [00:03<00:08,  4.33it/s]
Training:  30%|███       | 15/50 [00:03<00:08,  4.34it/s]
Training:  32%|███▏      | 16/50 [00:03<00:07,  4.30it/s]
Training:  34%|███▍      | 17/50 [00:04<00:07,  4.30it/s]
Training:  36%|███▌      | 18/50 [00:04<00:07,  4.31it/s]
Training:  38%|███▊      | 19/50 [00:04<00:07,  4.33it/s]
Training:  40%|████      | 20/50 [00:04<00:06,  4.33it/s]
Training:  42%|████▏     | 21/50 [00:05<00:06,  4.35it/s]
Training:  44%|████▍     | 22/50 [00:05<00:06,  4.34it/s]
Training:  46%|████▌     | 23/50 [00:05<00:06,  4.36it/s]
Training:  48%|████▊     | 24/50 [00:05<00:05,  4.37it/s]
Training:  50%|█████     | 25/50 [00:06<00:05,  4.37it/s]
Training:  52%|█████▏    | 26/50 [00:06<00:05,  4.36it/s]
Training:  54%|█████▍    | 27/50 [00:06<00:05,  4.38it/s]
Training:  56%|█████▌    | 28/50 [00:06<00:05,  4.36it/s]
Training:  58%|█████▊    | 29/50 [00:06<00:04,  4.34it/s]
Training:  60%|██████    | 30/50 [00:07<00:04,  4.35it/s]
Training:  62%|██████▏   | 31/50 [00:07<00:04,  4.34it/s]
Training:  64%|██████▍   | 32/50 [00:07<00:04,  4.32it/s]
Training:  66%|██████▌   | 33/50 [00:07<00:03,  4.29it/s]
Training:  68%|██████▊   | 34/50 [00:08<00:03,  4.23it/s]
Training:  70%|███████   | 35/50 [00:08<00:03,  4.26it/s]
Training:  72%|███████▏  | 36/50 [00:08<00:03,  4.25it/s]
Training:  74%|███████▍  | 37/50 [00:08<00:03,  4.25it/s]
Training:  76%|███████▌  | 38/50 [00:09<00:02,  4.27it/s]
Training:  78%|███████▊  | 39/50 [00:09<00:02,  4.25it/s]
Training:  80%|████████  | 40/50 [00:09<00:02,  4.22it/s]
Training:  82%|████████▏ | 41/50 [00:09<00:02,  4.27it/s]
Training:  84%|████████▍ | 42/50 [00:09<00:01,  4.24it/s]
Training:  86%|████████▌ | 43/50 [00:10<00:01,  4.25it/s]
Training:  88%|████████▊ | 44/50 [00:10<00:01,  4.25it/s]
Training:  90%|█████████ | 45/50 [00:10<00:01,  4.27it/s]
Training:  92%|█████████▏| 46/50 [00:10<00:00,  4.27it/s]
Training:  94%|█████████▍| 47/50 [00:11<00:00,  4.27it/s]
Training:  96%|█████████▌| 48/50 [00:11<00:00,  4.28it/s]
Training:  98%|█████████▊| 49/50 [00:11<00:00,  4.27it/s]
Training: 100%|██████████| 50/50 [00:11<00:00,  4.25it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-59-ba791e9bf3a2> in <module>
14     train_loader=dataloaders['train'],
15     val_loader=dataloaders['val'],
---> 16     writer=writer
17 )
5 frames
<ipython-input-29-98abf7b06208> in train(config, model, optimizer, train_loader, val_loader, writer)
41                     model,
42                     val_loader,
---> 43                     config,
44                 )
45 
<ipython-input-29-98abf7b06208> in validate(model, val_loader, config, show_plot)
76 
77         with torch.no_grad(): # Context-manager that disabled gradient calculation
---> 78             loss, pred = model.train_step(x, y, return_prediction=True)
79         avg_val_loss.add(loss.item())
80         preds.append(pred.cpu())
/content/ai-in-medicine-practical-session1/models.py in train_step(self, imgs, labels, return_prediction)
112         :return pred
113         """
--> 114         pred = torch.squeeze(self.forward(imgs.float()))  # (N)
115 
116         # ----------------------- ADD YOUR CODE HERE --------------------------
/content/ai-in-medicine-practical-session1/models.py in forward(self, imgs)
93 
94         x = x.view(-1, x.shape[0]*x.shape[1]*x.shape[2]*x.shape[3]*x.shape[4])
---> 95         pred = self.relu1_5(self.fc1(x))
96 
97         # ------------------------------- END ---------------------------------
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130             return forward_call(*input, **kwargs)
1131         # Do not call functions when jit is used
1132         full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py in forward(self, input)
112 
113     def forward(self, input: Tensor) -> Tensor:
--> 114         return F.linear(input, self.weight, self.bias)
115 
116     def extra_repr(self) -> str:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x1792 and 2048x1)

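From what I can see in the model parameters, the model should train fine, and it looks like I set the batch size to 8. However, at the end of training this value changes to 7 (I don't know why) and I get the error above.

This is the training function: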

def train(config, model, optimizer, train_loader, val_loader, writer):
    model.train()
    step = 0
    pbar = tqdm(total=config.val_freq,
                desc=f'Training')  # Progress bar
    avg_loss = AvgMeter()  # Computes and stores the average and current value.

    while True:
        for x, y in train_loader:
            x = x.to(config.device)
            y = y.to(config.device)
            pbar.update(1)  # Update progress bar 1 value
            # Training step
            optimizer.zero_grad()  # Sets the gradients of all optimized torch.Tensor s to zero
            loss = model.train_step(x, y)  # Calculate the loss
            loss.backward()  # Computes dloss/dx for every parameter x which has requires_grad=True (x.grad += dloss/dx)
            optimizer.step()  # Updates the value of x using the gradient x.grad (x += -lr * x.grad)
            # optimizer.zero_grad() clears x.grad for every parameter x in the optimizer. It’s important to call this before loss.backward(),
            # otherwise you’ll accumulate the gradients from multiple passes.
            avg_loss.add(loss.detach().item())
            # .detach() will return a tensor, which is detached from the computation graph, while .item() will return the Python scalar
            # Increment step
            step += 1
            if step % config.log_freq == 0 and not step % config.val_freq == 0:
                train_loss = avg_loss.compute()
                writer.log({'train/loss': train_loss}, step=step)
            # Validate and log at validation frequency
            if step % config.val_freq == 0:
                # Reset avg_loss
                train_loss = avg_loss.compute()
                avg_loss = AvgMeter()
                # Get validation results
                val_results = validate(
                    model,
                    val_loader,
                    config,
                )
                # Print current performance
                print(f"Finished step {step} of {config.num_steps}. "
                      f"Train loss: {train_loss} - "
                      f"val loss: {val_results['val/loss']:.4f} - "
                      f"val MAE: {val_results['val/MAE']:.4f}")
                # Write to tensorboard
                writer.log(val_results, step=step)
                # Reset progress bar
                pbar = tqdm(total=config.val_freq, desc='Training')
            if step >= config.num_steps:
                print(f'\nFinished training after {step} steps\n')
                return model, step

def validate(model, val_loader, config, show_plot=False):
    model.eval()
    # model.eval() is a kind of switch for some specific layers/parts of the model that behave differently during training
    # and inference (evaluating) time. For example, Dropouts Layers, BatchNorm Layers etc. You need to turn off them during model
    # evaluation, and .eval() will do it for you. In addition, the common practice for evaluating/validation is using torch.no_grad()
    # in pair with model.eval() to turn off gradients computation
    avg_val_loss = AvgMeter()
    preds = []
    targets = []
    for x, y in val_loader:
        x = x.to(config.device)
        y = y.to(config.device)
        with torch.no_grad():  # Context-manager that disables gradient calculation
            loss, pred = model.train_step(x, y, return_prediction=True)
        avg_val_loss.add(loss.item())
        preds.append(pred.cpu())
        targets.append(y.cpu())
    # torch.cat() Concatenates the given sequence of seq tensors in the given dimension
    # All tensors must either have the same shape (except in the concatenating dimension) or be empty
    preds = torch.cat(preds)
    targets = torch.cat(targets)
    mae = mean_absolute_error(preds, targets)
    f = plot_results(preds, targets, show_plot)
    model.train()
    return {
        'val/loss': avg_val_loss.compute(),
        'val/MAE': mae,
        'val/MAE_plot': f
    }

def plot_results(preds: Tensor, targets: Tensor, show_plot: bool = False):
    # Compute the mean absolute error
    mae_test = mean_absolute_error(preds, targets)
    # Sort preds and targets to ascending targets
    sort_inds = targets.argsort()  # It returns an array of indices along the given axis of the same shape as the input array, in sorted order
    targets = targets[sort_inds].numpy()  # Converts a tensor object into an numpy.ndarray object
    preds = preds.view(targets.shape)
    preds = preds[sort_inds].numpy()  # Converts a tensor object into an numpy.ndarray object
    f = plt.figure()
    plt.plot(targets, targets, 'r.')
    plt.plot(targets, preds, '.')
    plt.plot(targets, targets + mae_test, 'gray')
    plt.plot(targets, targets - mae_test, 'gray')
    plt.suptitle('Mean Average Error')
    plt.xlabel('True Age')
    plt.ylabel('Age predicted')
    if show_plot:
        plt.show()
    return f

This is the neural network I am training. It uses 3D convolutions, batch normalization, ReLU() and a fully connected layer at the end.

from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

class BrainAgeCNN(nn.Module):
    """
    The BrainAgeCNN predicts the age given a brain MR-image.
    """
    def __init__(self) -> None:
        super().__init__()
        self.loss = torch.nn.MSELoss()

        # Feel free to also add arguments to __init__ if you want.
        # ----------------------- ADD YOUR CODE HERE --------------------------
        self.conv1_1 = nn.Conv3d(in_channels = 1, out_channels = 4, kernel_size = 3, stride = 1, padding = 0)
        self.relu1_1 = nn.ReLU()
        self.conv2_1 = nn.Conv3d(in_channels = 4, out_channels = 4, kernel_size = 3, stride = 1, padding = 0)
        self.bnn1_1 = nn.BatchNorm3d(num_features = 4)
        self.relu2_1 = nn.ReLU()
        self.maxp1_1 = nn.MaxPool3d(kernel_size = 2, stride=2, padding=0)
        self.conv1_2 = nn.Conv3d(in_channels = 4, out_channels = 8, kernel_size = 3, stride = 1, padding = 0)
        self.relu1_2 = nn.ReLU()
        self.conv2_2 = nn.Conv3d(in_channels = 8, out_channels = 8, kernel_size = 3, stride = 1, padding = 0)
        self.bnn1_2 = nn.BatchNorm3d(num_features = 8)
        self.relu2_2 = nn.ReLU()
        self.maxp1_2 = nn.MaxPool3d(kernel_size = 2, stride=2, padding=0)
        self.conv1_3 = nn.Conv3d(in_channels = 8, out_channels = 16, kernel_size = 3, stride = 1, padding = 0)
        self.relu1_3 = nn.ReLU()
        self.conv2_3 = nn.Conv3d(in_channels = 16, out_channels = 16, kernel_size = 3, stride = 1, padding = 0)
        self.bnn1_3 = nn.BatchNorm3d(num_features = 16)
        self.relu2_3 = nn.ReLU()
        self.maxp1_3 = nn.MaxPool3d(kernel_size = 2, stride=2, padding=0)
        self.conv1_4 = nn.Conv3d(in_channels = 16, out_channels = 32, kernel_size = 3, stride = 1, padding = 0)
        self.relu1_4 = nn.ReLU()
        self.conv2_4 = nn.Conv3d(in_channels = 32, out_channels = 32, kernel_size = 3, stride = 1, padding = 0)
        self.bnn1_4 = nn.BatchNorm3d(num_features = 32)
        self.relu2_4 = nn.ReLU()
        self.maxp1_4 = nn.MaxPool3d(kernel_size = 2, stride=2, padding=0)
        self.fc1 = nn.Linear(2048, 1)
        self.relu1_5 = nn.ReLU()

        # ------------------------------- END ---------------------------------

    def forward(self, imgs: Tensor) -> Tensor:
        """
        Forward pass of your model.
        :param imgs: Batch of input images. Shape (N, 1, H, W, D)
        :return pred: Batch of predicted ages. Shape (N)
        """
        # ----------------------- ADD YOUR CODE HERE --------------------------

        x = self.relu1_1(self.conv1_1(imgs))
        x = self.maxp1_1(self.relu2_1(self.bnn1_1(self.conv2_1(x))))
        x = self.relu1_2(self.conv1_2(x))
        x = self.maxp1_2(self.relu2_2(self.bnn1_2(self.conv2_2(x))))
        x = self.relu1_3(self.conv1_3(x))
        x = self.maxp1_3(self.relu2_3(self.bnn1_3(self.conv2_3(x))))
        x = self.relu1_4(self.conv1_4(x))
        x = self.maxp1_4(self.relu2_4(self.bnn1_4(self.conv2_4(x))))
        x = x.view(-1, x.shape[0]*x.shape[1]*x.shape[2]*x.shape[3]*x.shape[4])
        pred = self.relu1_5(self.fc1(x))

        # ------------------------------- END ---------------------------------
        return pred

    def train_step(
        self,
        imgs: Tensor,
        labels: Tensor,
        return_prediction: Optional[bool] = False
    ):
        """Perform a training step. Predict the age for a batch of images and
        return the loss.
        :param imgs: Batch of input images (N, 1, H, W, D)
        :param labels: Batch of target labels (N)
        :return loss: The current loss, a single scalar.
        :return pred
        """
        pred = torch.squeeze(self.forward(imgs.float()))  # (N)
        # ----------------------- ADD YOUR CODE HERE --------------------------

        loss = self.loss(labels.float(), pred)
        # ------------------------------- END ---------------------------------
        if return_prediction:
            return loss, pred
        else:
            return loss

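Any help is welcome.

I tried changing the batch size, but I get the same error, only with different values in the matrix multiplication.

What I expect is to train the neural network in Google Colab without any error: obtaining a scalar value from input images of size (8, 1, 96, 96, 96).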

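Judging from the error, the failure happens in the validation that runs inside train. That means you should look at your data loaders (you have two, the training and validation loaders, dataloaders['train'] and dataloaders['val']). If you use batches for both, the size of your validation set is probably not a multiple of the batch size, so the last batch is incomplete. You can use drop_last in the DataLoader to skip that last batch.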
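
To illustrate the drop_last suggestion, here is a minimal sketch. The dataset is a hypothetical stand-in (23 tiny zero-filled volumes, not the asker's data), chosen only so that the sample count is not a multiple of the batch size of 8.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 23 tiny volumes with 23 scalar targets.
images = torch.zeros(23, 1, 4, 4, 4)
ages = torch.zeros(23)
val_dataset = TensorDataset(images, ages)

# Default behaviour: the last batch holds the 7 leftover samples.
keep_last = DataLoader(val_dataset, batch_size=8, shuffle=False)
# drop_last=True: the incomplete last batch is skipped.
drop_last = DataLoader(val_dataset, batch_size=8, shuffle=False, drop_last=True)

sizes_keep = [x.shape[0] for x, _ in keep_last]
sizes_drop = [x.shape[0] for x, _ in drop_last]
print(sizes_keep)  # [8, 8, 7]
print(sizes_drop)  # [8, 8]
```

Note that with drop_last=True the skipped samples never reach the model, so on a validation loader they are also left out of the reported loss and MAE.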

https://pytorch.org/docs/stable/data.html
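
It may also be worth checking where the numbers 1792 and 2048 come from. The sketch below is my own reading of the posted model, not part of the answer above: it follows the spatial size through the four conv/conv/pool blocks and compares the flatten used in forward with a per-sample flatten. The zero tensor is a stand-in for the real feature map.

```python
import torch

def spatial_size_after_blocks(size: int, num_blocks: int = 4) -> int:
    """Side length after num_blocks of (conv k=3, conv k=3, max-pool 2), all with padding=0."""
    for _ in range(num_blocks):
        size = size - 2   # first 3x3x3 convolution
        size = size - 2   # second 3x3x3 convolution
        size = size // 2  # MaxPool3d(kernel_size=2, stride=2) rounds down
    return size

side = spatial_size_after_blocks(96)   # 96 -> 46 -> 21 -> 8 -> 2
features_per_sample = 32 * side ** 3   # 32 channels * 2 * 2 * 2 = 256

# Stand-in for the feature map of an incomplete batch of 7 samples.
x = torch.zeros(7, 32, side, side, side)

# Flatten as written in the question: the batch dimension is folded into the features.
folded = x.view(-1, x.shape[0] * x.shape[1] * x.shape[2] * x.shape[3] * x.shape[4])
# Per-sample flatten: one row of 256 features for each image.
per_sample = x.view(x.shape[0], -1)

print(folded.shape)      # torch.Size([1, 1792])
print(per_sample.shape)  # torch.Size([7, 256])
```

Under this reading, 2048 is 8 * 256, so nn.Linear(2048, 1) only fits a full batch of 8 and returns a single prediction for the whole batch. A per-sample flatten combined with nn.Linear(256, 1) would accept any batch size; that is an assumption about the intended design, so verify it against your own model before relying on it.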

An answer for those whose LLM inference works on Colab but not locally:

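  • Make sure the CUDA version your torch build targets matches the installed CUDA version (bitsandbytes uses it)
  • Check your torch version with torch.__version__
  • Check which CUDA version the path points to: CUDA_HOME=/usr/local/cuda, where cuda is usually a symbolic link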
ls -altr /usr/local/cuda
lrwxrwxrwx 1 root root 21 Tem  2 11:45 /usr/local/cuda -> /usr/local/cuda-11.8/
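  • If you are not using a virtual environment, make sure you have not accidentally installed torch or other libraries into a different installation path (for example, pip install and pip3 install may install to different paths)

  • nvidia-smi reports the CUDA version of the installed graphics driver, not the CUDA toolkit that is actually installed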
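
The checks in the list above can be gathered into one small script. This is only a sketch: describe_environment is a hypothetical helper name, /usr/local/cuda is the path taken from the listing above, and torch is imported only if it is installed.

```python
import importlib.util
import os
import shutil
import sys

def describe_environment() -> dict:
    """Collect the paths and versions that the checklist asks you to compare."""
    info = {
        "python": sys.executable,
        "pip": shutil.which("pip"),    # may differ from pip3
        "pip3": shutil.which("pip3"),
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "cuda_symlink_target": None,
        "torch_version": None,
        "torch_cuda_version": None,
    }
    cuda_path = "/usr/local/cuda"  # path from the listing above; adjust for your system
    if os.path.exists(cuda_path):
        info["cuda_symlink_target"] = os.path.realpath(cuda_path)
    if importlib.util.find_spec("torch") is not None:
        import torch
        info["torch_version"] = torch.__version__
        info["torch_cuda_version"] = torch.version.cuda  # None on CPU-only builds
    return info

for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If torch_cuda_version and the version in cuda_symlink_target disagree, or pip and pip3 resolve to different locations, that is the kind of mismatch the list above warns about.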

4-bit LLM loading problems (Falcon, Llama, etc.), which produce this "shapes cannot be multiplied" error

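  • To run a 4-bit model, the configuration looks like this:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# model_name is the model id or local path you are loading
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
)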
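  • Print the model and check the linear layers. If you intend to use 4-bit, there should be Linear4bit layers rather than Linear. If you do not see any Linear4bit layers, your base model was not loaded in 4-bit; most likely you cannot use LoRA with it, and you will get the "mat1 and mat2 shapes cannot be multiplied" error.

  • For a solution, check the peft and bitsandbytes libraries for any bugs or loading issues.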
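
Instead of reading the printed model by eye, the Linear4bit check can be sketched as a count of module class names. The helper names count_layer_types and looks_4bit are hypothetical, and the stub classes below only mimic the .modules() method of a PyTorch model so that the example runs without downloading anything.

```python
from collections import Counter

def count_layer_types(model) -> Counter:
    """Count modules by class name; works for any object with a PyTorch-style .modules()."""
    return Counter(type(module).__name__ for module in model.modules())

def looks_4bit(model) -> bool:
    """True if at least one module is named Linear4bit."""
    return count_layer_types(model)["Linear4bit"] > 0

# Hypothetical stubs standing in for real layers and a real model.
class Linear:
    pass

class Linear4bit:
    pass

class FakeModel:
    def __init__(self, layers):
        self._layers = layers

    def modules(self):
        return iter(self._layers)

quantized = FakeModel([Linear4bit(), Linear4bit(), Linear()])
not_quantized = FakeModel([Linear(), Linear()])

print(count_layer_types(quantized))
print(looks_4bit(quantized), looks_4bit(not_quantized))  # True False
```

With a real model loaded through from_pretrained, calling count_layer_types(model) should give the same kind of summary, assuming the quantized layers keep the class name Linear4bit.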
