How can I give input to a 3D convolutional neural network?


3D CNNs are used with videos, MRI, and other scan datasets. If I want to feed video input to a 3D CNN and train its weights, how can I do that? A 3D CNN expects 5-dimensional input:

[batch size, channels, depth, height, width]

How can I extract depth from the videos?

Suppose I have 10 videos, one for each of 10 classes. Each video is 6 seconds long. I extract 2 frames per second, which gives 12 frames per video.

Size of RGB videos is 112×112 –> Height = 112, Width=112, and Channels=3

If I keep the batch size equal to 2:

1 video –> 6 seconds –> 12 frames (1sec == 2frames) [each frame (3,112,112)]

10 videos (10 classes) –> 60 seconds –> 120 frames

So the 5 dimensions will be something like this; [2, 3, 12, 112, 112]

2 –> Two videos are processed per batch

3 –> RGB channels

12 –> each video contains 12 frames

112 –> Height of each frame

112 –> Width of each frame

Am I right?

Yes, that makes sense if you want to use a 3D CNN. You are essentially adding a temporal dimension to your input, and it is logical to use the depth dimension for it. This way the channel axis remains the feature channel (i.e. not a spatio-temporal dimension).
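As a minimal sketch of that layout, the snippet below builds a batch of shape [2, 3, 12, 112, 112] with NumPy. It assumes each clip has already been decoded into an array of shape (frames, height, width, channels), which is a common decoder output but not something stated in the question. The random arrays are placeholders for real frames, and `video_to_cdhw` is a hypothetical helper name.

```python
import numpy as np

def video_to_cdhw(frames):
    """Reorder decoded frames (T, H, W, C) to channels-first (C, T, H, W).

    The frame axis T becomes the 'depth' axis of the 3D CNN input.
    """
    return np.transpose(frames, (3, 0, 1, 2))

rng = np.random.default_rng(0)

# Placeholder data standing in for two decoded clips:
# 12 frames each (6 seconds x 2 frames per second), 112x112 RGB.
clips = [rng.random((12, 112, 112, 3), dtype=np.float32) for _ in range(2)]

# Stack the clips along a new leading batch axis.
batch = np.stack([video_to_cdhw(clip) for clip in clips], axis=0)

print(batch.shape)  # (batch, channels, depth, height, width)
```

So "extracting depth" amounts to sampling frames at a fixed rate and stacking them along one axis; no separate depth computation is needed.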

Keep in mind that 3D CNNs are very memory-intensive. Other methods exist for working with time-dependent input. Here you are not really dealing with a third spatial dimension, so you are not required to use a 3D CNN.
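To get a rough feel for the memory cost, here is a back-of-the-envelope sketch that compares the size of one layer's output activations for the batch above against the same layer applied to a single frame per video. The 64 output channels, the unchanged output resolution, and 32-bit floats are all assumptions made for illustration; `activation_bytes` is a hypothetical helper.

```python
def activation_bytes(batch, channels, *dims, bytes_per_value=4):
    """Bytes needed to store one activation tensor of the given shape."""
    count = batch * channels
    for size in dims:
        count *= size
    return count * bytes_per_value

out_channels = 64  # assumed width of the first layer

# 3D case: the output keeps the depth axis (12 frames), assuming padding
# that preserves depth, height, and width.
mem_3d = activation_bytes(2, out_channels, 12, 112, 112)

# 2D case: one 112x112 frame per video, no depth axis.
mem_2d = activation_bytes(2, out_channels, 112, 112)

print(mem_3d, mem_2d, mem_3d // mem_2d)
```

Under these assumptions the 3D activations are larger by a factor equal to the number of frames, and that cost repeats at every layer that keeps the depth axis.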


If I give input of the above dimensions to the 3D CNN, will it learn both kinds of features (spatial and temporal)? […] Can you help me understand spatial and temporal features?

If you use a 3D CNN, your filters will have 3D kernels, and the convolution will be three-dimensional: along the two spatial dimensions (width and height) as well as the depth dimension (which here corresponds to the temporal dimension, since you are using the depth dimension for the sequence of video frames). A 3D CNN will allow you to capture local spatial and temporal information. It is 'local' because the receptive field is limited by the kernel sizes and the overall number of layers in the CNN.
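To make the sliding 3D kernel concrete, here is a deliberately naive NumPy sketch of a single-filter 3D convolution with no padding and stride 1. It is written for readability, not speed, and is not how a deep learning library implements it; `conv3d_single` is a hypothetical name, and the small 8x8 frames are chosen only to keep the loops fast.

```python
import numpy as np

def conv3d_single(x, kernel):
    """Naive 3D convolution (cross-correlation) of one clip with one filter.

    x:      array of shape (C, D, H, W)
    kernel: array of shape (C, kD, kH, kW)
    Returns an array of shape (D - kD + 1, H - kH + 1, W - kW + 1).
    """
    c, d, h, w = x.shape
    kc, kd, kh, kw = kernel.shape
    assert c == kc, "kernel must span all input channels"
    out = np.zeros((d - kd + 1, h - kh + 1, w - kw + 1), dtype=x.dtype)
    for t in range(out.shape[0]):          # slide along time (depth)
        for i in range(out.shape[1]):      # slide along height
            for j in range(out.shape[2]):  # slide along width
                # Each output value sees kD consecutive frames and a
                # kH x kW spatial window across all channels.
                patch = x[:, t:t + kd, i:i + kh, j:j + kw]
                out[t, i, j] = np.sum(patch * kernel)
    return out

clip = np.ones((3, 12, 8, 8), dtype=np.float32)    # 12 small RGB frames
kernel = np.ones((3, 3, 3, 3), dtype=np.float32)   # 3x3x3 kernel over 3 channels
feature_map = conv3d_single(clip, kernel)

print(feature_map.shape)
```

Each output value depends on only 3 consecutive frames and a 3x3 spatial window, which is the sense in which the captured information is local. Stacking more layers widens that window in both time and space.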