Tensorflow 2 Object Detection API Function call stack error

I am trying to train a model with the TensorFlow 2 Object Detection API, but when I start the training process I get the following error message:

2021-08-01 08:38:32.187042: W tensorflow/core/common_runtime/bfc_allocator.cc:467] __________________________________________________________________________*****x__****x**_**********
2021-08-01 08:38:32.187117: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at gather_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "Tensorflow/models/research/object_detection/model_main_tf2.py", line 115, in <module>
    tf.compat.v1.app.run()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "Tensorflow/models/research/object_detection/model_main_tf2.py", line 112, in main
    record_summaries=FLAGS.record_summaries)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 603, in train_loop
    train_input, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 394, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/usr/local/lib/python3.7/dist-packages/object_detection/model_lib_v2.py", line 176, in _ensure_model_is_built
    labels,
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 679, in _call_for_each_replica
    self._container_strategy(), fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 86, in call_for_each_replica
    return wrapped(args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3024, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/GatherV2_1 (defined at /local/lib/python3.7/dist-packages/object_detection/utils/spatial_transform_ops.py:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference__dummy_computation_fn_67462]

Errors may have originated from an input operation.
Input Source operations connected to node MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/GatherV2_1:
 MultiLevelMatMulCropAndResize/MultiLevelRoIAlign/mul_14 (defined at /local/lib/python3.7/dist-packages/object_detection/utils/spatial_transform_ops.py:274)

Function call stack:
_dummy_computation_fn

I have tried several models from the API, but I could not get past this error. I am using Colab as my working environment, and I have also tried training on a local machine with an NVIDIA GTX 1660 Ti (6 GB of VRAM), but the same error occurred there too. When I change the model's batch size (especially when I decrease it to a low value such as 6 or 7), the error message changes (it is too long to share here, more than 5,000 lines). Could anyone help me overcome this problem?

So here’s the most important part of that error message:

tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[3763200,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

You’re running out of memory while trying to allocate a tensor of shape 3763200 by 1024. Let’s work this out quickly: a matrix with 3,763,200 rows and 1,024 columns has 3,853,516,800 (3.8 billion!) entries. Assuming you’re using float32, each entry is 32 bits or 4 bytes, so the tensor is about 15.4 billion bytes, or 15.4 GB. That won’t fit on a 6 GB GPU (and I don’t think Colab provides 16 GB GPUs either).
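The arithmetic above is worth sketching out, since it’s the same check you can run on any OOM message (the shape and dtype here are taken from the error text; the variable names are just for this sketch):

```python
# Estimate the memory footprint of the tensor from the OOM message.
rows = 3_763_200          # first dimension from shape[3763200,1024]
cols = 1_024              # second dimension
bytes_per_entry = 4       # float32 = 32 bits = 4 bytes

entries = rows * cols
tensor_bytes = entries * bytes_per_entry
tensor_gb = tensor_bytes / 1e9  # decimal gigabytes

print(f"{entries:,} entries")   # 3,853,516,800 entries
print(f"{tensor_gb:.1f} GB")    # 15.4 GB
```

Any tensor whose estimated size exceeds your GPU’s memory (6 GB here) is guaranteed to trigger this allocator failure, before even counting the model weights and other activations.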

You’re right to consider changing the batch size: smaller batches need less memory, so they fit more easily onto whatever hardware you’re using. Try a batch size of 1, which should almost certainly fit, then increase from there until you see the error again and back off to the last size that worked.
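For the TF2 Object Detection API specifically, the batch size is set in the train_config block of your model’s pipeline.config file (the surrounding fields vary by model, but batch_size itself is standard):

```
train_config {
  batch_size: 1   # start at 1, then increase until you hit OOM again
  # ... rest of train_config unchanged ...
}
```

Restart the model_main_tf2.py run after each change so the new config is picked up.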