What happens when the label’s dimension is different from neural network’s output layer’s dimension in PyTorch?

It makes intuitive sense to me that the label’s dimension should be the same as the neural network’s last layer’s dimension. However, some experiments in PyTorch show that it somehow works anyway.

Code:

import torch
import torch.nn as nn

X = torch.tensor([[1], [2], [3], [4]], dtype=torch.float32) # training input, shape (4, 1)
Y = torch.tensor([[2], [4], [6], [8]], dtype=torch.float32) # training label, shape (4, 1)

model = nn.Linear(1, 3)   # 3 output features, so predictions have shape (4, 3)
criterion = nn.MSELoss()
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

for epoch in range(10):
    y_pred = model(X)              # forward pass, shape (4, 3)
    loss = criterion(y_pred, Y)    # MSELoss(input, target); Y has shape (4, 1)
    optimizer.zero_grad()          # clear stale gradients before backprop
    loss.backward()
    optimizer.step()

In the above code, model = nn.Linear(1,3) is used instead of model = nn.Linear(1,1). As a result, while Y.shape is (4,1), y_pred.shape is (4,3).
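For reference, a quick check in the interpreter (with the tensors defined above) shows the mismatch:

>>> Y.shape
torch.Size([4, 1])
>>> model(X).shape
torch.Size([4, 3])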

The code runs, but with a warning: “Using a target size that is different to the input size will likely lead to incorrect results due to broadcasting.”

I got the following output when I executed model(torch.tensor([10], dtype=torch.float32)):

tensor([20.0089, 19.6121, 19.1967], grad_fn=<AddBackward0>)

All three outputs seem correct to me. But how is the loss calculated when the sizes of the prediction and the label differ?

Should we ever use a target size that is different from the input size? Is there any benefit to doing this?

Assuming you are working with batch_size=4, your target has 1 component per sample versus 3 for the predicted tensor. You don’t normally see the intermediate results when computing the loss with nn.MSELoss; the reduction='none' option lets you inspect them:

>>> criterion = nn.MSELoss(reduction='none')
>>> y = torch.rand(2,1)
>>> y_hat = torch.rand(2,3)
>>> criterion(y_hat, y).shape
torch.Size([2, 3])

Considering this, you can conclude that the target y, being smaller, has been broadcast to the shape of the predicted tensor y_hat. Essentially, in your example, you will get the same result (without the warning) as:

>>> y_repeat = y.repeat(1, 3)
>>> criterion(y_hat, y_repeat)
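
As a quick sanity check (my addition, using the same y, y_hat and criterion as above), the broadcast loss and the explicitly repeated one are identical:

>>> torch.equal(criterion(y_hat, y), criterion(y_hat, y_repeat))
True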

This means that, for each element in the batch, you are L2-optimizing all of its predicted components against a single target value: MSE(y_hat[0,0], y[0]), MSE(y_hat[0,1], y[0]), and MSE(y_hat[0,2], y[0]); the same goes for the remaining rows.
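
To make that concrete, here is a small check (again my addition, same tensors as above) that a given column of the elementwise loss compares every prediction in that column against the single target column:

>>> torch.allclose(criterion(y_hat, y)[:, 2], (y_hat[:, 2] - y[:, 0]) ** 2)
True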

The warning is there to make sure you’re conscious of this broadcast operation. If this is really what you’re looking to do, you should broadcast the target tensor yourself so the intent is explicit. Otherwise, it doesn’t make sense to train this way.
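
For instance, a minimal sketch of broadcasting the target yourself, reusing Y and the model from the question (Y_expanded is a name introduced here for illustration):

>>> Y_expanded = Y.expand(-1, 3)                 # view of Y with shape (4, 3), no copy
>>> loss = nn.MSELoss()(model(X), Y_expanded)    # no size-mismatch warning

Y.repeat(1, 3) would work just as well, at the cost of materializing an actual copy.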
