Running Tensorflow on GeForce 940M (Ubuntu) - tensorflow

I'm running the CIFAR-10 classification from the Tensorflow for the very first time on my laptop with GeForce 940M. I'm running the training with the pre-defined parameters as follows:
python cifar10_train.py
after step 1800 I'm getting the following errors:
E tensorflow/stream_executor/cuda/cuda_event.cc:33] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS
E tensorflow/stream_executor/cuda/cuda_driver.cc:1182] failed to enqueue async memcpy from device to host: CUDA_ERROR_ILLEGAL_ADDRESS; host dst: 0x7ff8e9bf26c0; GPU src: 0x5011c0600; size: 16=0x10
F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:105] Unexpected Event status: 1
I tensorflow/stream_executor/stream.cc:3304] stream 0x35e7190 did not block host until done; was already in an error state
Aborted (core dumped)
Does anybody have any idea?
Thanks a lot in advance for your help! Any advice is kindly appreciated!

Related

Jupyter kernel crashed only while training ConvNext model

I'm trying to run the tutorial code from Kaggle on my computer. However, the kernel crashed in the model training part history=ConvNeXt_model.fit().
Here is the jupyter notebook log:
warn 16:45:23.988: StdErr from Kernel Process 2023-02-13 16:45:23.989108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neu
warn 16:45:23.988: StdErr from Kernel Process ral Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
warn 16:45:24.253: StdErr from Kernel Process 2023-02-13 16:45:24.253410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/repli
warn 16:45:24.253: StdErr from Kernel Process ca:0/task:0/device:GPU:0 with 21348 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
warn 16:45:44.398: StdErr from Kernel Process 2023-02-13 16:45:44.398973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
warn 16:45:44.798: StdErr from Kernel Process 2023-02-13 16:45:44.799017: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logge
warn 16:45:44.799: StdErr from Kernel Process d once.
warn 16:45:45.140: StdErr from Kernel Process 2023-02-13 16:45:45.141061: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x1e2b8a88750 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 16:45:45.141144: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0
warn 16:45:45.141: StdErr from Kernel Process ): NVIDIA GeForce RTX 4090, Compute Capability 8.9
warn 16:45:45.191: StdErr from Kernel Process 2023-02-13 16:45:45.191262: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code -1, output: ' If the error message indicates that a file could not be written, please verify that sufficient
warn 16:45:45.191: StdErr from Kernel Process filesystem space is provided.
error 16:45:45.530: Disposing session as kernel process died ExitCode: 3221226505, Reason: c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
warn(
c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '00cfbd3c-ac34-43be-a838-9653221d1a82' instead of 'b"00cfbd3c-ac34-43be-a838-9653221d1a82"'.
warn(
2023-02-13 16:45:23.989108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 16:45:24.253410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21348 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-02-13 16:45:44.398973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2023-02-13 16:45:44.799017: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-02-13 16:45:45.141061: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x1e2b8a88750 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 16:45:45.141144: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-02-13 16:45:45.191262: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code -1, output: ' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
info 16:45:45.530: Dispose Kernel process 24268.
error 16:45:45.530: Raw kernel process exited code: 3221226505
error 16:45:45.531: Error in waiting for cell to complete [Error: Canceled future for execute_request message before replies were done
at t.KernelShellFutureHandler.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:33213)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52265
at Map.forEach (<anonymous>)
at y._clearKernelState (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52250)
at y.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:45732)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139244
at Z (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:1608939)
at Kp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139221)
at qp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:146518)
at process.processTicksAndRejections (node:internal/process/task_queues:96:5)]
warn 16:45:45.531: Cell completed with errors {
message: 'Canceled future for execute_request message before replies were done'
}
It is weird that I can successfully train other models (such as ResNet or EfficientNet) using the GPU but only failed in the ConvNext model. And I followed the instruction to install the TensorFlow.
I guess the error may happen in XLA implementation, but I do not know how the fix it.
All the codes are running on win10 VScode.
Device information:
Nvidia Driver 527.56
CUDA 11.2
cuDNN 8.1.0
Python 3.9.10
TensorFlow 2.10.1
GPU Nvidia RTX 4090

In google colab I enable GPU but it isn't used

I enable GPU by going to >>runtime>> change runtime type >> then choose GPU.
But when I run my code I get this error:
usage: train.py [-h] [--pre PRETRAINED] TRAIN TEST GPU TASK
train.py: error: the following arguments are required: GPU, TASK
this is the part of the code that make error:
! python train.py part_A_train.json part_A_val.json
I also have this warning:
Warning: You are connected to a GPU runtime, but not utilising the GPU.
but by running this code looks like that GPU is active!

I get an error related to darknet when running yolo, how to fix it

Some info;
ubutu 18.04
cuda-10.2
NVIDIA-SMI 440.33.01
Grapic card GeForce GTX 1080
An error occurs when running usbcam and YOLO at the same time.
darknet_ros:
/home/foscar/ISCC_2022/src/vision_team/darknet_ros/darknet/src/cuda.c:36:
check_error: Assertion `0' failed. [darknet_ros-2] process has died
[pid 5975, exit code -6, cmd
/home/foscar/ISCC_2022/devel/lib/darknet_ros/darknet_ros
camera/rgb/image_raw:=/usb_cam/image_raw __name:=darknet_ros
__log:=/home/foscar/.ros/log/456e13fe-0109-11ed-98f3-705dccfd163a/darknet_ros-2.log].
log file:
/home/foscar/.ros/log/456e13fe-0109-11ed-98f3-705dccfd163a/darknet_ros-2*.log

tensorflow.python.framework.errors_impl.FailedPreconditionError

I am using tensorflow-gpu 2.3.0 with CUDA_VERSION=8.0.61 and CUDNN_VERSION=6.0.21.
I just run a tensorflow code and get FailedPreconditionError:
'tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0 [Op:VarHandleOp] name: Variable/'
What can I do to fix this?
Thanks

Google colab: problem in the train of YOLOv4-tiny-Darknet-Roboflow

i'm using google colab for the detection of object with Yolo.
in the step of the Train Custom YOLOv4 Detector, i have this error
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 841 : build time: Nov 26 2020 - 16:49:52
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: File exists
darknet: ./src/utils.c:325: error: Assertion `0' failed.
can you help me please