I'm trying to run the tutorial code from Kaggle on my computer. However, the kernel crashes in the model training step, at history=ConvNeXt_model.fit().
Here is the Jupyter notebook log:
error 16:45:45.530: Disposing session as kernel process died ExitCode: 3221226505, Reason: c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
warn(
c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '00cfbd3c-ac34-43be-a838-9653221d1a82' instead of 'b"00cfbd3c-ac34-43be-a838-9653221d1a82"'.
warn(
2023-02-13 16:45:23.989108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 16:45:24.253410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21348 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-02-13 16:45:44.398973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2023-02-13 16:45:44.799017: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-02-13 16:45:45.141061: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x1e2b8a88750 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 16:45:45.141144: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-02-13 16:45:45.191262: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code -1, output: ' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
info 16:45:45.530: Dispose Kernel process 24268.
error 16:45:45.530: Raw kernel process exited code: 3221226505
error 16:45:45.531: Error in waiting for cell to complete [Error: Canceled future for execute_request message before replies were done
at t.KernelShellFutureHandler.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:33213)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52265
at Map.forEach (<anonymous>)
at y._clearKernelState (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52250)
at y.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:45732)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139244
at Z (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:1608939)
at Kp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139221)
at qp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:146518)
at process.processTicksAndRejections (node:internal/process/task_queues:96:5)]
warn 16:45:45.531: Cell completed with errors {
message: 'Canceled future for execute_request message before replies were done'
}
It is strange that I can successfully train other models (such as ResNet or EfficientNet) on the GPU, and only the ConvNeXt model fails. I followed the instructions to install TensorFlow.
I guess the error happens somewhere in XLA, but I do not know how to fix it.
All the code is running in VS Code on Windows 10.
Device information:
Nvidia Driver 527.56
CUDA 11.2
cuDNN 8.1.0
Python 3.9.10
TensorFlow 2.10.1
GPU Nvidia RTX 4090
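For reference, here is a minimal sketch of how the versions above could be checked against the GPU. The table of minimum CUDA toolkit releases per compute capability is an assumption written from memory of NVIDIA's release notes (in particular, that compute capability 8.9 is first targeted by CUDA 11.8), and `toolkit_can_target` is a hypothetical helper, not part of TensorFlow or CUDA. If that assumption holds, the `ptxas` shipped with CUDA 11.2 would not know the RTX 4090's architecture, which would be consistent with the ptxas failure in the log.

```python
import shutil

# Assumed minimum CUDA toolkit version able to target each compute capability.
# These pairs are an assumption from NVIDIA release notes; verify before relying on them.
MIN_CUDA_FOR_CC = {
    (6, 1): (8, 0),
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
    (8, 6): (11, 1),
    (8, 9): (11, 8),
}


def toolkit_can_target(compute_capability, cuda_version):
    """Return True/False if the table knows the GPU, or None if it does not."""
    required = MIN_CUDA_FOR_CC.get(compute_capability)
    if required is None:
        return None
    # Tuples compare element by element, so (11, 2) < (11, 8).
    return cuda_version >= required


# Values taken from the device information above: RTX 4090 (8.9), CUDA 11.2.
print("CUDA 11.2 can target 8.9:", toolkit_can_target((8, 9), (11, 2)))

# Shows which ptxas executable (if any) would be found on PATH.
print("ptxas on PATH:", shutil.which("ptxas"))
```

If the first line prints `False`, a newer CUDA toolkit (or at least a newer `ptxas` earlier on PATH) may be worth trying, although this sketch alone does not prove that is the cause.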
I enabled the GPU by going to Runtime > Change runtime type and then choosing GPU.
But when I run my code I get this error:
usage: train.py [-h] [--pre PRETRAINED] TRAIN TEST GPU TASK
train.py: error: the following arguments are required: GPU, TASK
This is the part of the code that causes the error:
! python train.py part_A_train.json part_A_val.json
I also get this warning:
Warning: You are connected to a GPU runtime, but not utilising the GPU.
But when I run this code, it looks like the GPU is active!
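The usage line above suggests that train.py expects four positional arguments, while the command passes only two. A minimal sketch of an `argparse` parser that would produce a usage line of that shape follows; the `dest` names and the parser itself are assumptions for illustration, not the actual contents of train.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the usage line shown in the error."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("train_json", metavar="TRAIN")
    parser.add_argument("test_json", metavar="TEST")
    parser.add_argument("--pre", metavar="PRETRAINED", default=None)
    parser.add_argument("gpu", metavar="GPU")    # e.g. a GPU id such as "0"
    parser.add_argument("task", metavar="TASK")  # e.g. a task/run id
    return parser


parser = build_parser()

# All four positionals supplied: parsing succeeds.
args = parser.parse_args(["part_A_train.json", "part_A_val.json", "0", "0"])
print(args.gpu, args.task)

# Only two positionals supplied: argparse reports the missing ones and exits.
try:
    parser.parse_args(["part_A_train.json", "part_A_val.json"])
except SystemExit as exc:
    print("parser exited with code", exc.code)
```

Under that assumption, the command would need two more values, for example `! python train.py part_A_train.json part_A_val.json 0 0`, where both trailing `0` values are placeholders; the meaning of GPU and TASK has to be checked in train.py itself.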
Some info:
Ubuntu 18.04
cuda-10.2
NVIDIA-SMI 440.33.01
Graphics card GeForce GTX 1080
An error occurs when running usb_cam and YOLO at the same time.
darknet_ros: /home/foscar/ISCC_2022/src/vision_team/darknet_ros/darknet/src/cuda.c:36: check_error: Assertion `0' failed.
[darknet_ros-2] process has died [pid 5975, exit code -6, cmd /home/foscar/ISCC_2022/devel/lib/darknet_ros/darknet_ros camera/rgb/image_raw:=/usb_cam/image_raw __name:=darknet_ros __log:=/home/foscar/.ros/log/456e13fe-0109-11ed-98f3-705dccfd163a/darknet_ros-2.log].
log file: /home/foscar/.ros/log/456e13fe-0109-11ed-98f3-705dccfd163a/darknet_ros-2*.log
I am using tensorflow-gpu 2.3.0 with CUDA_VERSION=8.0.61 and CUDNN_VERSION=6.0.21.
I just ran some TensorFlow code and got a FailedPreconditionError:
'tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0 [Op:VarHandleOp] name: Variable/'
What can I do to fix this?
Thanks
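As a starting point, here is a minimal sketch that compares the versions quoted above with the build configuration TensorFlow 2.3.0 was tested against. The table entry (CUDA 10.1, cuDNN 7.6) is taken from memory of the TensorFlow tested build configurations and should be treated as an assumption; `find_mismatches` is a hypothetical helper. A mismatch does not prove it is the cause of the FailedPreconditionError, but it may be the first thing to rule out.

```python
# Assumed tested GPU build configuration; verify against the TensorFlow install guide.
TESTED_GPU_BUILDS = {
    "2.3.0": {"cuda": "10.1", "cudnn": "7.6"},
}


def major_minor(version):
    """Reduce a version such as '8.0.61' to its 'major.minor' part."""
    return ".".join(version.split(".")[:2])


def find_mismatches(tf_version, cuda_version, cudnn_version):
    """List the libraries whose major.minor version differs from the tested build."""
    expected = TESTED_GPU_BUILDS[tf_version]
    installed = {
        "cuda": major_minor(cuda_version),
        "cudnn": major_minor(cudnn_version),
    }
    return [
        f"{name}: installed {installed[name]}, expected {expected[name]}"
        for name in ("cuda", "cudnn")
        if installed[name] != expected[name]
    ]


# Versions quoted in the question.
for line in find_mismatches("2.3.0", "8.0.61", "6.0.21"):
    print(line)
```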
I'm using Google Colab for object detection with YOLO.
In the "Train Custom YOLOv4 Detector" step, I get this error:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 841 : build time: Nov 26 2020 - 16:49:52
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: File exists
darknet: ./src/utils.c:325: error: Assertion `0' failed.
Can you help me, please?
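"No kernel image is available for execution on the device" usually indicates that the binary was not compiled for the compute capability of the GPU it is running on. A minimal sketch of building the matching `-gencode` flag for the `ARCH=` line of the darknet Makefile is shown below; `gencode_flag` is a hypothetical helper, and the value 7.5 (a Tesla T4) is only an example, since the GPU assigned by Colab is not stated in the question.

```python
def gencode_flag(compute_capability):
    """Build an nvcc -gencode flag from a compute capability such as '7.5'."""
    major, minor = compute_capability.split(".")
    sm = f"{int(major)}{int(minor)}"
    return f"-gencode arch=compute_{sm},code=[sm_{sm},compute_{sm}]"


# Example value only: replace "7.5" with the capability of the assigned GPU.
print("ARCH=", gencode_flag("7.5"))
```

After changing the `ARCH=` line, darknet would need to be rebuilt (for example with `make clean` followed by `make`) so that the new architecture is actually compiled in.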