cuda runtime error (48) : unrecognized error code - google-colaboratory

I can run mmdetection code on a T4 GPU, but it doesn't work on K80 and P100 GPUs. When I execute the test code below, I get the following error:
# This is the test code
from mmdet.apis import init_detector, inference_detector, show_result
config_file = 'configs/my_faster_rcnn_r50_fpn_1x.py'
checkpoint_file = './work_dirs/faster_rcnn_r50_fpn_1x/epoch_9.pth'
model = init_detector(config_file, checkpoint_file, device='cuda:0')
img = 'demo.jpg'
result = inference_detector(model, img)
show_result(img, result, model.CLASSES)
This is the error output:
# error log
/content/drive/My Drive/model_train/10_酒瓶检测/mmdetection/mmdet/ops/roi_align/roi_align.py in forward(ctx, features, rois, out_size, spatial_scale, sample_num)
     25     # print('entered')
     26     roi_align_cuda.forward(features, rois, out_h, out_w, spatial_scale,
---> 27                            sample_num, output)
     28     else:
     29         print('error occurred')
RuntimeError: cuda runtime error (48) : unrecognized error code at mmdet/ops/roi_align/src/roi_align_kernel.cu:140
Has anyone encountered this problem? Can you help me solve it?

I faced the same issue; on analysis, it appears the error is due to an incompatibility between the PyTorch and CUDA versions. After installing a PyTorch build that matches the runtime's CUDA version, the code worked for me.
You can refer to the link below for matching versions:
https://pytorch.org/get-started/previous-versions/
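For illustration, a minimal sketch of what such a pinned install can look like. The versions here are an assumption; match them to the CUDA version your runtime actually reports (check with torch.version.cuda or !nvcc --version):
# Hypothetical example: install PyTorch builds compiled for CUDA 10.0
!pip install torch==1.4.0+cu100 torchvision==0.5.0+cu100 -f https://download.pytorch.org/whl/torch_stable.html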

Related

How to load a model using Tensorflow Hub and make a prediction?

This should be a simple task: download a model saved in tensorflow_hub format, load it using tensorflow_hub, and use it.
This is the model I am trying to use (simCLR stored in Google Cloud): https://console.cloud.google.com/storage/browser/simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
I downloaded the /hub folder as they say, using
gsutil -m cp -r \
"gs://simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0/hub" \
.
The /hub folder contains the files:
/saved_model.pb
/tfhub_module.pb
/variables/variables.index
/variables/variables.data-00000-of-00001
So far so good.
Now, in Python 3 with TensorFlow 2 and tensorflow_hub 0.12, I run the following code:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
path_to_hub = '/home/my_name/my_path/simclr/hub'
# Attempt 1
m = tf.keras.models.Sequential([hub.KerasLayer(path_to_hub, input_shape=(224,224,3))])
# Attempt 2
m = tf.keras.models.Sequential(hub.KerasLayer(path_to_hub))
m.build(input_shape=[None,224,224,3])
# Attempt 3
m = hub.KerasLayer(hub.load(path_to_hub))
# Toy Data Test
X = np.random.random((1,244,244,3)).astype(np.float32)  # note: 244 here, not 224
y = m.predict(X)
None of these 3 options to load the hub model work, with the following errors:
Attempt 1 :
ValueError: Error when checking input: expected keras_layer_2_input to have shape (224, 224, 3) but got array with shape (244, 244, 3)
Attempt 2:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node sequential_3/keras_layer_3/StatefulPartitionedCall/base_model/conv2d/Conv2D}}]] [Op:__inference_keras_scratch_graph_46402]
Function call stack:
keras_scratch_graph
Attempt 3:
ValueError: Expected a string, got <tensorflow.python.training.tracking.tracking.AutoTrackable object at 0x7fa71c7a2dd0>
These three attempts are all code taken from tensorflow_hub tutorials and repeated in other Stack Overflow answers, but none of them works, and I don't know how to proceed from these error messages.
Appreciate any help, thanks.
Update 1:
The same issues happen if I try with this ResNet50 hub folder:
https://storage.cloud.google.com/simclr-gcs/checkpoints/ResNet50_1x.zip
As @Frightera pointed out, there was an error with the input shapes. The error in "Attempt 2" was solved by allowing memory growth on the selected GPU. "Attempt 3" still does not work, but at least there are two working methods for loading and using a model saved in /hub format:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
hubmod = 'https://tfhub.dev/google/imagenet/mobilenet_v2_035_96/feature_vector/5'
# Alternative 1 - Works!
m = tf.keras.models.Sequential([hub.KerasLayer(hubmod, input_shape=(96,96,3))])
print(m.summary())
# Alternative 2 - Works!
m = tf.keras.models.Sequential(hub.KerasLayer(hubmod))
m.build(input_shape=[None, 96,96,3])
print(m.summary())
# Alternative 3 - Doesn't work
#m = hub.KerasLayer(hub.load(hubmod))
#m.build(input_shape=[None, 96,96,3])
#print(m.summary())
# Test
X = np.random.random((1,96,96,3)).astype(np.float32)
y = m.predict(X)
print(y.shape)
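As a possible direction for Alternative 3 (an assumption on my part, not something verified above): hub.resolve() downloads the module and returns a local path string, which hub.KerasLayer accepts directly, sidestepping the AutoTrackable object that hub.load() returns:
# Hypothetical fix for Alternative 3: pass a path string, not a loaded object
local_path = hub.resolve(hubmod)  # resolve the handle to a local directory
m = tf.keras.models.Sequential([hub.KerasLayer(local_path, input_shape=(96, 96, 3))])
print(m.summary())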

theano.function() throws up a long exception in Colab

I am using Google Colab to run the BinaryNet Neural Network implemented using theano by the authors of the original paper here: https://github.com/MatthieuCourbariaux/BinaryNet
When I run the following line from /Train-time/mnist.py (line 199):
train_fn = theano.function([input, target, LR], loss, updates=updates)
Colab throws up this error:
You can find the C code in this temporary file: /tmp/theano_compilation_error_5_e2lq4v
[garbled compiler output: the log repeatedly reports system headers under x86_64-linux-gnu — bits/libc-header-start.h:33, 7/include-fixed/limits.h:194, 7/include-fixed/syslimits.h:7, 7/include-fixed/limits.h:34, and bits/mathcalls.h:298 — as "not found"]
Exception: ('The following error happened while compiling the node',
Elemwise{Composite{(i0 * (i1 + (i0 * round3(clip(i2, i3, i4)))) *
i5)}}[(0, 2)](TensorConstant{(1, 1) of 2.0}, TensorConstant{(1, 1) of
-1.0}, Elemwise{Composite{(i0 * (i1 + (i2 * i3 * i4) + i5))}}.0, TensorConstant{(1, 1) of 0}, TensorConstant{(1, 1) of 1},
Elemwise{Composite{Cast{float64}(LT(i0, i1))}}[(0, 0)].0), '\n',
"Compilation failed (return status=1):
/root/.theano/compiledir_Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic-x86_64-3.6.9-64/tmp9q80fef3/mod.cpp:932:2:
warning: character constant too long for its type. ['V15_tmp2'] =
round(['V15_tmp1']);. ^~~~~~~~~~.
/root/.theano/compiledir_Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic-x86_64-3.6.9-64/tmp9q80fef3/mod.cpp:932:23:
warning: character constant too long for its type. ['V15_tmp2'] =
round(['V15_tmp1']);. ^~~~~~~~~~.
/root/.theano/compiledir_Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic-x86_64-3.6.9-64/tmp9q80fef3/mod.cpp:1054:2:
warning: character constant too long for its type. ['V15_tmp2'] =
round(['V15_tmp1']);. ^~~~~~~~~~.
/root/.theano/compiledir_Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic-x86_64-3.6.9-64/tmp9q80fef3/mod.cpp:1054:23: warning: character constant too long for its type. ['V15_tmp2'] =
round(['V15_tmp1']);. ^~~~~~~~~~.
/root/.theano/compiledir_Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic-x86_64-3.6.9-64/tmp9q80fef3/mod.cpp: In member function ‘in...
I used this to install theano and lasagne:
!pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
!pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip
I am using the exact same code as in the GitHub repo, with the only difference being that I used Keras to import the MNIST dataset instead of pylearn2 (see the sketch after the edit below).
Could someone please help me figure out why this is happening? Thank you!
EDIT
I ran my code in Python 2.7 and it worked! This question deals with using Python 2 in Colab.
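For reference, a minimal sketch of the Keras-based MNIST import mentioned above (my own reconstruction; BinaryNet's exact preprocessing, such as scaling inputs to [-1, 1] and using 2*y-1 targets, should be copied from the repo's pylearn2 code):
import numpy as np
from keras.datasets import mnist

# Load MNIST via Keras instead of pylearn2
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Flatten to vectors and scale to [0, 1]; adjust to match BinaryNet's pipeline
X_train = X_train.reshape(-1, 784).astype('float32') / 255.
X_test = X_test.reshape(-1, 784).astype('float32') / 255.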

Error writing XGBoost Classifier to pmml with sklearn2pmml

I want to save my XGBoost model as PMML using sklearn2pmml. I'm using Python 3.7.3 with scikit-learn 0.20.3 and sklearn2pmml 0.53.0. My data is mainly binary, with just 3 columns of continuous data. I'm running my notebook in Databricks and convert my Spark dataframe to a pandas dataframe. Code snippet below:
import xgboost as xgb
from sklearn_pandas import DataFrameMapper
from sklearn.compose import ColumnTransformer
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import StandardScaler
X = pdf[continuous_features + numericCols]
y = pdf["Label"]
mapper = DataFrameMapper(
    [([cont_column], [ContinuousDomain(), StandardScaler()]) for cont_column in continuous_features] +
    [([c for c in numericCols], None)]  # no transformation
)
clf = xgb.XGBClassifier(objective='multi:softprob', eval_metric='auc', num_class=2,
                        n_jobs=6, max_delta_step=1, min_child_weight=14, gamma=1.5,
                        subsample=0.8, colsample_bytree=0.5, max_depth=10, learning_rate=0.1)
pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", clf)
])
pipeline.fit(X, y.values.reshape(-1,))
sklearn2pmml(pipeline, "xgb_V1.pmml", with_repr=True)
The pipeline fits the data and produces a score and predictions with pipeline.score(X, y) and pipeline.predict(X), but when I try to write it to PMML, I get the following error:
Standard output is empty
Standard error:
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 47 ms.
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
INFO: Converting..
Feb 21, 2020 1:53:30 PM sklearn2pmml.pipeline.PMMLPipeline initTargetFields
WARNING: Attribute 'sklearn2pmml.pipeline.PMMLPipeline.target_fields' is not set. Assuming y as the name of the target field
Feb 21, 2020 1:53:30 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
... 7 more
Exception in thread "main" java.lang.IllegalArgumentException: Attribute 'xgboost.sklearn.XGBClassifier._le' has an unsupported value (Python class xgboost.compat.XGBoostLabelEncoder)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:45)
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:82)
at sklearn.LabelEncoderClassifier.getLabelEncoder(LabelEncoderClassifier.java:40)
at sklearn.LabelEncoderClassifier.getClasses(LabelEncoderClassifier.java:34)
at sklearn.ClassifierUtil.getClasses(ClassifierUtil.java:32)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:133)
at org.jpmml.sklearn.Main.run(Main.java:145)
at org.jpmml.sklearn.Main.main(Main.java:94)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.preprocessing.LabelEncoder
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.sklearn.CastFunction.apply(CastFunction.java:43)
I thought it might be a version incompatibility issue between scikit-learn and sklearn2pmml as per this post https://github.com/jpmml/sklearn2pmml/issues/197, but I think the versions I have installed should be OK. Any ideas on what's going on here? Thanks in advance.
It is probably an XGBoost package version issue. The SkLearn2PMML package expects the label encoder (the XGBClassifier._le attribute) to be a "normal" Scikit-Learn label encoder class (sklearn.preprocessing.(label|_label).LabelEncoder), but in your case it's something different (xgboost.compat.XGBoostLabelEncoder).
In which XGBoost package version was this xgboost.compat.XGBoostLabelEncoder introduced? It's either something very old or something very new.
In any case, please open a feature request with the JPMML-SkLearn project here to have this issue sorted out.
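As a stopgap, one hedged workaround, assuming xgboost.compat.XGBoostLabelEncoder arrived with the XGBoost 1.0 line (an assumption worth verifying against your installed version), would be to pin an older XGBoost whose XGBClassifier._le is still a plain sklearn LabelEncoder, then re-fit and re-export:
# Hypothetical workaround: pin a pre-1.0 XGBoost release, then re-run the
# pipeline.fit(...) and sklearn2pmml(...) steps above. Verify that this
# actually removes the cast error before relying on it.
!pip install "xgboost==0.90"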

Keras: Error when downloading Fashion_MNIST Data

I am trying to download data from Fashion MNIST, but it produces an error. Originally it was downloading and working properly, but I had to terminate it because I had to turn off my computer. Once I opened the file up again, it gives me an error. I'm not sure what the problem is; could it be that I already downloaded part of the data once, and Keras doesn't recognize that? I am using a Jupyter notebook in a conda environment.
Here is the link to the image:
https://i.stack.imgur.com/wLGDm.png
You missed adding tf. to the line
fashion_mnist = keras.datasets.fashion_mnist
The code below works perfectly for me. Importing the fashion_mnist dataset is outlined in the TensorFlow documentation here.
Change your code to:
import tensorflow as tf
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
or, use the better way below, which avoids creating the extra variable fashion_mnist:
import tensorflow as tf
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()
I am using tensorflow 1.9.0, keras 2.2.2 and python 3.6.6 on Windows 10 x64 OS.
I know my PC well: I can't download anything larger than 2.7 MB in the terminal, due to WinError 8, so I manually downloaded all of the packs from storage.googleapis.com (since some packs are 25 MB) and then pasted them all into \datasets\fashion-mnist.
The next time you run your code, it should be fixed.
Note: if you have VS Code, just Ctrl+click the link and you can download the packs easily.
I had an error regarding the cURL connection, and by looking into the error message I was able to track the file where the URL was declared. In my case it was:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow_core/python/keras/datasets/fashion_mnist.py
At line 44 I have commented out the line:
# base = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/'
And declared a different base URL, which I had found looking into the documentation of the original dataset:
base = 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/'
The download started immediately and gave no errors. Hope this helps.
This is because, for some reason, you have an incomplete download of the MNIST dataset.
You will have to manually delete the downloaded data, which usually resides in ~/.keras/datasets or any path you specified relative to it (in your case, MNIST_data).
Go to C:\Users\<Username>\.keras\datasets
and delete the dataset that you want to re-download or that produced the error.
You should be good to go!
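A minimal sketch of doing that deletion programmatically (the cache entry names below are assumptions; adjust them to whatever actually sits in your cache):
import os
import shutil

cache_dir = os.path.expanduser('~/.keras/datasets')
for name in ('fashion-mnist', 'mnist.npz'):  # assumed cache entries
    path = os.path.join(cache_dir, name)
    if os.path.isdir(path):
        shutil.rmtree(path)  # remove a partially downloaded folder
    elif os.path.isfile(path):
        os.remove(path)  # remove a partially downloaded archive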
You can also manually add a print statement for the path from which the dataset is loaded, e.g. print(paths) in fashion_mnist.py:
with gzip.open(paths[3], 'rb') as imgpath:
    print(paths)  # debug print in fashion_mnist.py
    x_test = np.frombuffer(
        imgpath.read(), np.uint8, offset=16).reshape(len(y_test), 28, 28)
Then remove the files from that path, and fresh data will start to download.
Change the base address to 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/' as described previously. It worked for me.
I was getting an error downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz:
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 448, in load
return pickle.load(fid, **pickle_kwargs)
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\AsadA\AppData\Local\Programs\Python\Python38\lib\site-packages\numpy\lib\npyio.py", line 450, in load
raise IOError(
OSError: Failed to interpret file 'C:\\Users\\AsadA\\.keras\\datasets\\mnist.npz' as a pickle
Go to C:\Users\AsadA\AppData\Local\Programs\Python\Python38\Lib\site-packages\tensorflow\python\keras\datasets (in my case) and, as in the answers above, delete the broken cached file so it can be re-downloaded.

TensorFlow New Op: AttributeError: 'module' object has no attribute 'custom_op'

I am creating new ops (https://www.tensorflow.org/extend/adding_an_op) for TensorFlow (r1.0) running both on x86 and ARMv7.
Minor code modifications are necessary to run TensorFlow on ARMv7, but this guide helps a lot:
https://github.com/samjabrahams/tensorflow-on-raspberry-pi/blob/master/GUIDE.md.
But I noticed that the custom operations do not work on my ARMv7 installation of TensorFlow.
For example, when I test my custom operation in a Python script on ARMv7:
import tensorflow as tf
_custom_op_module = tf.load_op_library('custom_op.so')
custom_op = _custom_op_module.add_stub
I get the following error (that does not show up on x86):
$ python test_custom_op.py
Traceback (most recent call last):
File "custom_op.py", line 3, in <module>
add_stub = _custom_op_module.add_stub
AttributeError: 'module' object has no attribute 'custom_op'
I investigated the issue further, and apparently my custom operation is not present in the .so library file.
$ python
>>> import tensorflow as tf
>>> _custom_op_module = tf.load_op_library('custom_op.so')
>>> dir(_custom_op_module)
>>> ['LIB_HANDLE', 'OP_LIST', '_InitOpDefLibrary', '__builtins__', '__doc__', '__name__', '__package__', '_collections', '_common_shapes', '_op_def_lib', '_op_def_library', '_op_def_pb2', '_op_def_registry', '_ops', '_text_format']
>>> _custom_op_module.OP_LIST
>>>
The same commands on x86 have the following output:
>>> import tensorflow as tf
>>> _custom_op_module = tf.load_op_library('custom_op.so')
>>> dir(_custom_op_module)
>>> ['LIB_HANDLE', 'OP_LIST', '_InitOpDefLibrary', '__builtins__', '__doc__', '__name__', '__package__', '_add_stub_outputs', '_collections', '_common_shapes', '_op_def_lib', '_op_def_library', '_op_def_pb2', '_op_def_registry', '_ops', '_text_format', 'custom_op']
>>> _custom_op_module.OP_LIST
op {
name: "CustomOp"
...
}
>>>
Does anybody have similar issue? Can we consider this a bug?
I hit a similar issue with a similar error message when I tried to load my new op; however, my problem was that I had registered a custom op with the same op name as an existing TensorFlow op, which led to a name collision. Changing the name fixed it without recompiling TF.
The error message I encountered:
AttributeError: module '6e237d88703da016805889179d3f5baa' has no attribute 'custom_op'
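A quick way to check for such collisions is to inspect which ops the shared library actually registered. A minimal sketch, assuming TF 1.x and a library file named custom_op.so:
import tensorflow as tf

mod = tf.load_op_library('custom_op.so')
# OP_LIST is an OpList proto describing every op this .so registered;
# an empty list here matches the ARMv7 symptom shown above.
print([op.name for op in mod.OP_LIST.op])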
Apparently, recompiling and re-installing TF made it work.