• Hardware: NVIDIA RTX 2000 Ada Generation Laptop GPU
• Network type: Image Classification (PyTorch)
• Docker image: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
Training appears to need more VRAM than my GPU has, and I cannot find a way to reduce the batch size. The output shows a batch size of 8. I tried lowering it to 1 by changing the “samples_per_gpu” parameter, since that seems to correspond to batch size, but I still get the same error.
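One way to confirm which batch size the trainer actually received is to read it back from the config that mmengine prints at startup. Below is a minimal Python sketch, assuming the console output has been saved as text. The helper name `dumped_batch_sizes` is hypothetical, and the sample text is abridged from the output in this post.

```python
import re


def dumped_batch_sizes(log_text: str) -> dict[str, int]:
    """Map each '*_dataloader' block in an mmengine config dump to its batch_size."""
    sizes = {}
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        # A line such as "train_dataloader = dict(" opens a dataloader block.
        header = re.match(r"(\w+_dataloader)\s*=\s*dict\(", stripped)
        if header:
            current = header.group(1)
            continue
        # The first batch_size seen after a header belongs to that block.
        size = re.match(r"batch_size=(\d+)", stripped)
        if size and current is not None:
            sizes[current] = int(size.group(1))
            current = None
    return sizes


# Abridged from the config dump in this post.
sample = """\
auto_scale_lr = dict(base_batch_size=1024)
test_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
train_dataloader = dict(
batch_size=8,
val_dataloader = dict(
batch_size=8,
"""

print(dumped_batch_sizes(sample))
```

If the values still read 8 after the spec is edited, the edited key is probably not the one the trainer reads, or a different spec file is reaching the container. Either explanation is a guess rather than something the log confirms.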
I am training using the TAO launcher command:
tao model classification_pyt train -e $SPECS_DIR/spec_pyt.yaml
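The launcher runs this command inside a container, so debug variables such as `NCCL_DEBUG=INFO` and `HYDRA_FULL_ERROR=1` (both suggested by the error output) have to reach the container, not just the host shell. Below is a minimal Python sketch, assuming the launcher reads an `"Envs"` list of `{"variable", "value"}` entries from `~/.tao_mounts.json`. Both the file location and the schema are assumptions to check against the launcher documentation, and the helper name `with_debug_envs` is hypothetical. The sketch only prints the proposed config and does not write the file.

```python
import json
from pathlib import Path


def with_debug_envs(config: dict) -> dict:
    """Return a copy of a launcher config with debug variables added to "Envs"."""
    wanted = {"NCCL_DEBUG": "INFO", "HYDRA_FULL_ERROR": "1"}
    updated = dict(config)
    # Keep existing entries that are not being overridden.
    envs = [e for e in config.get("Envs", []) if e.get("variable") not in wanted]
    envs += [{"variable": name, "value": value} for name, value in wanted.items()]
    updated["Envs"] = envs
    return updated


# Assumed location of the launcher config; adjust if yours differs.
mounts_path = Path.home() / ".tao_mounts.json"
config = json.loads(mounts_path.read_text()) if mounts_path.exists() else {}

# Print the proposed config for review instead of overwriting the file.
print(json.dumps(with_debug_envs(config), indent=4))
```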
I am getting the following error:
2024-08-27 20:33:16,165 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-08-27 20:33:16,392 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2024-08-27 20:33:17,077 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2024-08-28 00:33:24,781 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
Train results will be saved at: /workspace/tao-experiments/classification_pyt/output
08/28 00:33:34 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 49
GPU 0: NVIDIA RTX 2000 Ada Generation Laptop GPU
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.3.0a0+6ddf5cf85e.nv24.04
PyTorch compiling details: PyTorch built with:
- GCC 11.2
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2021.1-Product Build 20201104 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.3.2 (Git Hash N/A)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 12.4
- NVCC architecture flags: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_72,code=sm_72;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_87,code=sm_87;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
- CuDNN 90.1
- Magma 2.6.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/gcc-toolset-11/root/usr/bin/c++, CXX_FLAGS=-fno-gnu-unique -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.3.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.18.0a0
OpenCV: 4.7.0
MMEngine: 0.10.4
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 49
deterministic: False
Distributed launcher: pytorch
Distributed training: True
GPU number: 1
------------------------------------------------------------
08/28 00:33:34 - mmengine - INFO - Config:
auto_scale_lr = dict(base_batch_size=1024)
custom_hooks = [
dict(momentum=4e-05, priority='ABOVE_NORMAL', type='EMAHook'),
]
data_preprocessor = dict(
mean=[
123.675,
116.28,
103.53,
],
num_classes=5,
std=[
58.395,
57.12,
57.375,
],
to_rgb=True)
dataset_type = 'ImageNet'
default_hooks = dict(
checkpoint=dict(interval=1, type='CheckpointHook'),
logger=dict(interval=500, type='TaoTextLoggerHook'),
param_scheduler=dict(type='ParamSchedulerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
timer=dict(type='IterTimerHook'),
visualization=dict(enable=False, type='VisualizationHook'))
default_scope = 'mmpretrain'
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
find_unused_parameters = False
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
model = dict(
backbone=dict(
drop_path=0.1,
freeze=False,
init_cfg=None,
pretrained='',
type='fan_small_12_p4_hybrid'),
head=dict(
binary=False,
head_init_scale=1,
in_channels=384,
loss=dict(loss_weight=1.0, type='CrossEntropyLoss', use_soft=False),
num_classes=5,
type='TAOLinearClsHead'),
neck=None,
train_cfg=dict(augments=None),
type='ImageClassifier')
optim_wrapper = dict(
optimizer=dict(lr=0.001, type='AdamW', weight_decay=0.05),
paramwise_cfg=None)
param_scheduler = [
dict(type='CosineAnnealingLR'),
]
randomness = dict(deterministic=False, seed=49)
resume = False
test_cfg = dict()
test_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
ann_file=None,
classes=None,
data_prefix='/workspace/tao-experiments/test',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='Resize'),
dict(crop_size=224, type='CenterCrop'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=4,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
test_evaluator = dict(topk=(1, ), type='Accuracy')
train_cfg = dict(by_epoch=True, max_epochs=40, val_interval=1)
train_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
classes=None,
data_prefix='/workspace/tao-experiments/train',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='RandomResizedCrop'),
dict(direction='horizontal', prob=0.5, type='RandomFlip'),
dict(
brightness=0.4,
contrast=0.4,
saturation=0.4,
type='ColorJitter'),
dict(erase_prob=0.3, type='RandomErasing'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=4,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
val_cfg = dict()
val_dataloader = dict(
batch_size=8,
collate_fn=dict(type='default_collate'),
dataset=dict(
ann_file=None,
classes=None,
data_prefix='/workspace/tao-experiments/val',
pipeline=[
dict(type='LoadImageFromFile'),
dict(scale=224, type='Resize'),
dict(crop_size=224, type='CenterCrop'),
dict(type='PackInputs'),
],
type='ImageNet'),
num_workers=4,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
val_evaluator = dict(topk=(1, ), type='Accuracy')
vis_backends = [
dict(type='LocalVisBackend'),
]
visualizer = dict(
type='UniversalVisualizer', vis_backends=[
dict(type='LocalVisBackend'),
])
work_dir = '/workspace/tao-experiments/classification_pyt/output'
08/28 00:33:34 - mmengine - INFO - Because batch augmentations are enabled, the data preprocessor automatically enables the `to_onehot` option to generate one-hot format labels.
No pretrained configuration specified for convnext_base_in22k model. Using a default. Please add a config to the model pretrained_cfg registry or pass explicitly.
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 88, in main
run_experiment(cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 73, in run_experiment
runner = Runner.from_cfg(train_cfg)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 462, in from_cfg
runner = cls(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 431, in __init__
self.model = self.wrap_model(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 898, in wrap_model
model = MMDistributedDataParallel(
File "/usr/local/lib/python3.10/dist-packages/mmengine/model/wrappers/distributed.py", line 93, in __init__
super().__init__(module=module, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA host alloc 2147483648 bytes
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E0828 00:33:37.994000 140492393055360 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 363) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]: time : 2024-08-28_00:33:37
host : 5ba3b212b4af
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 363)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
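The size in the last NCCL line is worth decoding. 2147483648 bytes is exactly 2 GiB, and the message says "host alloc", which suggests the failing request was for host memory and not GPU VRAM. That would be consistent with the batch size having no effect, but it is an interpretation, not something the log confirms. Below is a minimal Python sketch that converts the size and, on Linux, prints how much host memory the current environment reports. The helper names are hypothetical, and the numbers are only meaningful when the script runs inside the container.

```python
from pathlib import Path


def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30


def meminfo_gib(text: str, key: str) -> float:
    """Read one /proc/meminfo entry (reported in kB) and convert it to GiB."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == key:
            return bytes_to_gib(int(rest.split()[0]) * 1024)
    raise KeyError(key)


# Size taken from the "Failed to CUDA host alloc" line above.
requested = 2147483648
print(f"requested: {bytes_to_gib(requested):g} GiB")

meminfo = Path("/proc/meminfo")
if meminfo.exists():  # present on Linux only
    text = meminfo.read_text()
    for key in ("MemTotal", "MemAvailable"):
        print(f"{key}: {meminfo_gib(text, key):.2f} GiB")
```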