What happens when the label’s dimension is different from the neural network’s output layer’s dimension in PyTorch?
It makes intuitive sense to me that the label’s dimension should be the same as the neural network’s last layer’s dimension. However, with some experiments using PyTorch, it turns out that it somehow works.
import torch
import torch.nn as nn

# NB: the tensor values were lost in the original post; these are illustrative
X = torch.tensor([[1.], [2.], [3.], [4.]], dtype=torch.float32)  # training input
Y = torch.tensor([[2.], [4.], [6.], [8.]], dtype=torch.float32)  # training label

model = nn.Linear(1, 3)
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()

for epoch in range(10):
    y_pred = model(X)
    loss = criterion(y_pred, Y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In the above code, model = nn.Linear(1, 3) is used instead of model = nn.Linear(1, 1). As a result, while Y.shape is (4, 1), y_pred.shape is (4, 3).
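To see why the shapes differ, here is a minimal sketch (random input, not the original data) confirming that nn.Linear(1, 3) maps a (4, 1) batch to a (4, 3) output:

```python
import torch
import torch.nn as nn

# nn.Linear(1, 3) applies y = x @ W.T + b with W of shape (3, 1),
# so each 1-feature input row becomes a 3-component output row.
model = nn.Linear(1, 3)
X = torch.rand(4, 1)   # batch of 4 samples, 1 feature each
print(model(X).shape)  # torch.Size([4, 3])
```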
The code runs, but with a warning saying: “Using a target size that is different to the input size will likely lead to incorrect results due to broadcasting.”
When I ran the model after training, I got the following output:
tensor([20.0089, 19.6121, 19.1967], grad_fn=<AddBackward0>)
All three outputs seem correct to me. But how is the loss calculated when the sizes of the tensors differ?
Should we ever use a target size that is different from the input size? Is there any benefit to this?
Assuming you are working with batch_size=4, you are using a target with 1 component versus 3 components for your predicted tensor. You don’t actually see the intermediate results when computing the loss with nn.MSELoss; using the reduction='none' option will allow you to do so:
>>> criterion = nn.MSELoss(reduction='none')
>>> y = torch.rand(2, 1)
>>> y_hat = torch.rand(2, 3)
>>> criterion(y_hat, y).shape
torch.Size([2, 3])
Considering this, you can conclude that the target y, being too small, has been broadcast to match the predicted tensor y_hat. Essentially, in your example, you will get the same result (without the warning) as:
>>> y_repeat = y.repeat(1, 3)
>>> criterion(y_hat, y_repeat)
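As a quick check (a sketch with random tensors, not data from the original post), the broadcast loss and the explicitly repeated target produce identical element-wise results:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss(reduction='none')
y = torch.rand(2, 1)      # target: one component per batch element
y_hat = torch.rand(2, 3)  # prediction: three components per batch element

# MSELoss broadcasts the (2, 1) target to (2, 3) internally (with a warning);
# repeating the target column three times gives exactly the same values.
loss_broadcast = criterion(y_hat, y)
loss_repeated = criterion(y_hat, y.repeat(1, 3))
print(torch.equal(loss_broadcast, loss_repeated))  # True
```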
This means that, for each batch element, you are L2-optimizing all its components against a single target value: MSE(y_hat[0,0], y[0]), MSE(y_hat[0,1], y[0]), and MSE(y_hat[0,2], y[0]); the same goes for the other elements in the batch.
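With the default reduction='mean', the scalar loss is then just the mean of all these per-component squared errors. A sketch with random data:

```python
import torch
import torch.nn as nn

y = torch.rand(2, 1)
y_hat = torch.rand(2, 3)

# Each column of y_hat is compared against its row's single target value,
# then all 2 * 3 squared errors are averaged into one scalar.
manual = ((y_hat - y) ** 2).mean()   # broadcasting: (2, 3) - (2, 1) -> (2, 3)
builtin = nn.MSELoss()(y_hat, y)     # default reduction='mean'
print(torch.allclose(manual, builtin))  # True
```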
The warning is there to make sure you’re conscious of this broadcast operation. Maybe this is what you’re looking to do; in that case, you should broadcast the target tensor yourself. Otherwise, it doesn’t make sense to do so.
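If the broadcast is intentional, one way to make it explicit (and silence the warning) is to expand the target yourself before calling the loss; a sketch assuming the same shapes as above:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()
y = torch.rand(2, 1)
y_hat = torch.rand(2, 3)

# expand_as creates a (2, 3) view of y without copying data; since the
# target now matches the input size, MSELoss emits no broadcasting warning.
loss = criterion(y_hat, y.expand_as(y_hat))
print(y.expand_as(y_hat).shape)  # torch.Size([2, 3])
```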