!rm darknet
!make
!./darknet detector train yolo.data cfg/yolov4-custom.cfg yolov4.conv.137 -dont_show
When I train YOLOv4 on my dataset using darknet in Colab, I get this error:
CUDA-version: 11000 (11020), cuDNN: 7.6.5, GPU count: 1
OpenCV version: 3.2.0
yolov4-custom
*** stack smashing detected ***: <unknown> terminated
I don't understand why this error occurs. Just a week ago, the same code ran without any problem.
Related
I'm trying to run the tutorial code from Kaggle on my computer. However, the kernel crashes in the model training part, at history=ConvNeXt_model.fit().
Here is the Jupyter notebook log:
warn 16:45:23.988: StdErr from Kernel Process 2023-02-13 16:45:23.989108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neu
warn 16:45:23.988: StdErr from Kernel Process ral Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
warn 16:45:24.253: StdErr from Kernel Process 2023-02-13 16:45:24.253410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/repli
warn 16:45:24.253: StdErr from Kernel Process ca:0/task:0/device:GPU:0 with 21348 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
warn 16:45:44.398: StdErr from Kernel Process 2023-02-13 16:45:44.398973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
warn 16:45:44.798: StdErr from Kernel Process 2023-02-13 16:45:44.799017: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logge
warn 16:45:44.799: StdErr from Kernel Process d once.
warn 16:45:45.140: StdErr from Kernel Process 2023-02-13 16:45:45.141061: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x1e2b8a88750 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 16:45:45.141144: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0
warn 16:45:45.141: StdErr from Kernel Process ): NVIDIA GeForce RTX 4090, Compute Capability 8.9
warn 16:45:45.191: StdErr from Kernel Process 2023-02-13 16:45:45.191262: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code -1, output: ' If the error message indicates that a file could not be written, please verify that sufficient
warn 16:45:45.191: StdErr from Kernel Process filesystem space is provided.
error 16:45:45.530: Disposing session as kernel process died ExitCode: 3221226505, Reason: c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
warn(
c:\Users\User\anaconda3\envs\tf\lib\site-packages\traitlets\traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '00cfbd3c-ac34-43be-a838-9653221d1a82' instead of 'b"00cfbd3c-ac34-43be-a838-9653221d1a82"'.
warn(
2023-02-13 16:45:23.989108: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 16:45:24.253410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21348 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
2023-02-13 16:45:44.398973: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8100
2023-02-13 16:45:44.799017: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-02-13 16:45:45.141061: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x1e2b8a88750 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 16:45:45.141144: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-02-13 16:45:45.191262: F tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:453] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas exited with non-zero error code -1, output: ' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
info 16:45:45.530: Dispose Kernel process 24268.
error 16:45:45.530: Raw kernel process exited code: 3221226505
error 16:45:45.531: Error in waiting for cell to complete [Error: Canceled future for execute_request message before replies were done
at t.KernelShellFutureHandler.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:33213)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52265
at Map.forEach (<anonymous>)
at y._clearKernelState (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:52250)
at y.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:45732)
at c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139244
at Z (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:2:1608939)
at Kp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:139221)
at qp.dispose (c:\Users\User\.vscode\extensions\ms-toolsai.jupyter-2023.1.2010391206\out\extension.node.js:17:146518)
at process.processTicksAndRejections (node:internal/process/task_queues:96:5)]
warn 16:45:45.531: Cell completed with errors {
message: 'Canceled future for execute_request message before replies were done'
}
Strangely, I can train other models (such as ResNet or EfficientNet) on the GPU without problems; only the ConvNeXt model fails. I followed the official instructions to install TensorFlow.
I suspect the error comes from the XLA implementation, but I do not know how to fix it.
All the code runs in VS Code on Windows 10.
Device information:
Nvidia Driver 527.56
CUDA 11.2
cuDNN 8.1.0
Python 3.9.10
TensorFlow 2.10.1
GPU Nvidia RTX 4090
I am new to deep learning. I have an A100 GPU with CUDA 11.6 installed. Using conda, I installed tensorflow 1.15 and tensorflow-gpu 1.15, cudatoolkit 10.0, and Python 3.7. The code I am trying to run from GitHub comes with the note below, and it fails with errors that I find difficult to interpret, so I cannot tell where I went wrong. The error is:
failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2022-06-30 09:37:12.049400: I tensorflow/stream_executor/stream.cc:4925] [stream=0x55d668879990,impl=0x55d668878ac0] did not memcpy device-to-host; source: 0x7f2fe2d0d400
2022-06-30 09:37:12.056385: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at iterator_ops.cc:867 : Cancelled: Operation was cancelled
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(25, 25), b.shape=(25, 102400), m=25, n=102400, k=25
[[{{node Hyperprior/HyperAnalysis/layer_Hyperprior_1/MatMul}}]]
(1) Internal: Blas GEMM launch failed : a.shape=(25, 25), b.shape=(25, 102400), m=25, n=102400, k=25
[[{{node Hyperprior/HyperAnalysis/layer_Hyperprior_1/MatMul}}]]
[[Hyperprior/truediv_3/_3633]]
NOTE: At the moment, we only support CUDA 10.0, Python 3.6-3.7, TensorFlow 1.15, and Tensorflow Compression 1.3. TensorFlow must be installed via pip, not conda. Unfortunately, newer versions of Tensorflow or Python will not work due to various constraints in the dependencies and in the TF binary API.
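As a rough illustration, the constraints in that NOTE can be written down as data and compared against a description of the environment. This is only a sketch: the dictionary keys and the function names are made up for the example, and the environment values are the ones described in the question (a conda install with cudatoolkit 10.0).

```python
# Constraints copied from the NOTE above; key names are hypothetical.
SUPPORTED = {
    "python": ("3.6", "3.7"),
    "tensorflow": ("1.15",),
    "cuda": ("10.0",),
    "installer": ("pip",),
}

def major_minor(version):
    """Reduce '3.7.12' to '3.7' so patch releases compare equal."""
    return ".".join(version.split(".")[:2])

def find_violations(env):
    """Return one message per setting that falls outside SUPPORTED."""
    problems = []
    for key, allowed in SUPPORTED.items():
        value = env.get(key)
        if value is None:
            problems.append(f"{key}: not specified")
            continue
        # The installer is a name, not a version, so it is compared as-is.
        normalized = value if key == "installer" else major_minor(value)
        if normalized not in allowed:
            problems.append(f"{key}: found {value}, supported {', '.join(allowed)}")
    return problems

# Setup as described in the question: installed with conda, cudatoolkit 10.0.
env = {"python": "3.7", "tensorflow": "1.15", "cuda": "10.0", "installer": "conda"}
for line in find_violations(env):
    print(line)
```

With the values from the question, the only mismatch this reports is the installer (conda instead of pip). It cannot tell whether that is the cause of the cuBLAS failure; it only shows where the setup departs from the NOTE.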
I am using tensorflow-gpu 2.3.0 with CUDA_VERSION=8.0.61 and CUDNN_VERSION=6.0.21.
I ran some TensorFlow code and got a FailedPreconditionError:
'tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to allocate scratch buffer for device 0 [Op:VarHandleOp] name: Variable/'
What can I do to fix this?
Thanks
I'm using Google Colab for object detection with YOLO.
In the Train Custom YOLOv4 Detector step, I get this error:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 841 : build time: Nov 26 2020 - 16:49:52
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: File exists
darknet: ./src/utils.c:325: error: Assertion `0' failed.
Can you help me, please?
I ran a simple keras script that trains a conv net on the MNIST database. This script works on my laptop yet not on my PC with the GeForce RTX 2070 graphics card.
The error is this:
File "/home/squall/spencer/kaggle/understanding_cloud_organization/mnist_model.py", line 67, in <module>
validation_data=(x_test, y_test))
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
validation_freq=validation_freq)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
outs = fit_function(ins_batch)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
run_metadata=self.run_metadata)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[metrics/accuracy/Identity/_91]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
CUDA is 10.1. The driver is 418.56. cuDNN is 7.4.2. TensorFlow is 1.14. According to the official Nvidia chart, these are all compatible versions.
Any ideas?
Try this
PS: CUDA is 10.0 and cuDNN is 7.6.3 (the build for CUDA 10.0).
For TensorFlow 1.14 to work, you need CUDA 10.0. Uninstall CUDA 10.1 completely and install the supported CUDA 10.0. You can read about the requirements for TF here.
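To confirm which toolkit release is actually on the PATH, one option is to parse the output of nvcc --version. The sketch below assumes that output contains the usual "release X.Y" line. The function names and the sample text are illustrative only, and the required release is simply the one named in this answer.

```python
import re

REQUIRED_CUDA = "10.0"  # release named in the answer above for TensorFlow 1.14

def parse_nvcc_release(text):
    """Return the 'X.Y' release found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None

def check_cuda(text, required=REQUIRED_CUDA):
    """Compare the parsed release against the required one."""
    found = parse_nvcc_release(text)
    if found is None:
        return "could not find a CUDA release in the nvcc output"
    if found != required:
        return f"found CUDA {found}, need CUDA {required}"
    return "ok"

# Illustrative sample of the nvcc output format, not a real capture.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(check_cuda(sample))
```

In practice the text would come from something like subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout. Keep in mind that nvcc reports the toolkit found on the PATH, which is not necessarily the set of libraries TensorFlow loads at runtime.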