
Grounding DINO: out of memory

I am a bit surprised: I cannot evaluate the model using the Grounding DINO notebook.

• Hardware: RTX 4060 (8 GB dedicated / 16 GB shared)

Docker and the notebooks run inside WSL.
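For reference, here is a minimal way to check how much GPU memory PyTorch actually sees inside the container (plain PyTorch calls, nothing TAO-specific is assumed):

import torch

# Standard PyTorch API: torch.cuda.mem_get_info() returns (free, total) in bytes
# for the current device. Run inside the tao-toolkit container under WSL.
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print("Device:", torch.cuda.get_device_name(0))
    print(f"Free : {free / 1024**3:.2f} GiB")
    print(f"Total: {total / 1024**3:.2f} GiB")
else:
    print("CUDA is not visible in this environment")

Here is the full log from the evaluation run: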
2025-01-20 14:26:08,889 [TAO Toolkit] [INFO] root 160: Registry: [‘nvcr.io’]
2025-01-20 14:26:08,925 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-20 14:26:08,941 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
sys:1: UserWarning:
‘evaluate.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
‘evaluate.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See … for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results/evaluate/status.json
rank_zero_warn(
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
Evaluate results will be saved at: /results/evaluate
final text_encoder_type: bert-base-uncased

tokenizer_config.json: 100%|██████████| 48.0/48.0 [00:00<00:00, 561kB/s]
config.json: 100%|██████████| 570/570 [00:00<00:00, 6.46MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 1.97MB/s]
tokenizer.json: 100%|██████████| 466k/466k [00:00<00:00, 2.40MB/s]
final text_encoder_type: bert-base-uncased
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
The following callbacks returned in LightningModule.configure_callbacks will override existing callbacks passed to Trainer: ModelCheckpoint
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0: 0%| | 0/25 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:993: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:91: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(

Testing DataLoader 0: 12%|█▏ | 3/25 [00:49<06:01, 0.06it/s]
Error executing job with overrides: [‘evaluate.checkpoint=/workspace/tao-experiments/grounding_dino/grounding_dino_vgrounding_dino_swin_tiny_commercial_trainable_v1.0/grounding_dino_swin_tiny_commercial_trainable.pth’, ‘results_dir=/results’]
Traceback (most recent call last):
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py”, line 69, in _func
raise e
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py”, line 48, in _func
runner(cfg, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/grounding_dino/scripts/evaluate.py”, line 81, in main
run_experiment(experiment_config=cfg)
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/grounding_dino/scripts/evaluate.py”, line 61, in run_experiment
trainer.test(model, datamodule=dm)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py”, line 753, in test
return call._call_and_handle_interrupt(
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py”, line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py”, line 793, in _test_impl
results = self._run(model, ckpt_path=ckpt_path)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py”, line 986, in _run
results = self._run_stage()
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py”, line 1025, in _run_stage
return self._evaluation_loop.run()
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py”, line 182, in _decorator
return loop_run(self, *args, **kwargs)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/evaluation_loop.py”, line 128, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fetchers.py”, line 133, in __next__
batch = super().__next__()
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/fetchers.py”, line 60, in __next__
batch = next(self.iterator)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/combined_loader.py”, line 341, in __next__
out = next(self._iterator)
File “/usr/local/lib/python3.10/dist-packages/pytorch_lightning/utilities/combined_loader.py”, line 142, in __next__
out = next(self.iterators[0])
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py”, line 631, in __next__
data = self._next_data()
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py”, line 1346, in _next_data
return self._process_data(data)
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py”, line 1372, in _process_data
data.reraise()
File “/usr/local/lib/python3.10/dist-packages/torch/_utils.py”, line 705, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/pin_memory.py”, line 37, in do_one_step
data = pin_memory(data, device)
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/pin_memory.py”, line 79, in pin_memory
return [pin_memory(sample, device) for sample in data] # Backwards compatibility.
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/pin_memory.py”, line 79, in <listcomp>
return [pin_memory(sample, device) for sample in data] # Backwards compatibility.
File “/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/pin_memory.py”, line 58, in pin_memory
return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
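
The failure is raised in the pin-memory thread rather than in the model forward pass, so as a rough, hedged sketch (plain PyTorch, not a TAO command, and the tensor shape is an arbitrary assumption, not the real Grounding DINO batch), one can check whether allocating page-locked host memory already fails under WSL:

import torch

# Hedged repro of only the step that fails in the traceback: Tensor.pin_memory()
# allocates page-locked (pinned) host memory through the CUDA driver.
# The shape below is an arbitrary stand-in, roughly one padded image.
x = torch.empty(3, 1333, 1333)
try:
    x = x.pin_memory()
    print("pin_memory succeeded, is_pinned =", x.is_pinned())
except RuntimeError as e:
    print("pin_memory failed:", e)

If this small check already fails, the problem would be pinned host memory under WSL rather than the model itself; if it succeeds, I suppose the model plus the evaluation batch simply does not fit in 8 GB.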
