Invalid device id error when I assign more device_ids

When I assign more than two devices to device_ids, I get an invalid device id error, even though I use almost the same code with many GPUs for other models.
torch.backends.cudnn.benchmark = True
models = torch.nn.DataParallel(models, device_ids=[0, 1])
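A quick sanity check (this snippet is illustrative and not from the original post) is to confirm how many GPUs PyTorch actually sees before passing their indices to DataParallel, since every entry in device_ids must refer to a visible device:
import torch

# device_ids passed to DataParallel must refer to GPUs that PyTorch can see
# (also check CUDA_VISIBLE_DEVICES), so list the visible devices first.
num_gpus = torch.cuda.device_count()
print("visible GPUs:", num_gpus)

model = torch.nn.Linear(10, 10)      # stand-in for the real model
device_ids = list(range(num_gpus))   # e.g. [0, 1] on a machine with 2 visible GPUs
if num_gpus > 1:
    model = torch.nn.DataParallel(model.cuda(), device_ids=device_ids)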

Related

Error when simplifying graph after converting graph to gdfs and then back to graph

I ran into this problem when converting a graph to gdfs, then uploading gdfs into a postgres/postgis database and then downloading them and reconstructing the graph. I think (!?) I have simplified the issue so it can be recreated easily. Basically, I convert a graph to gdfs and then reconstruct the graph. Although NO errors occur when I create the graph from the gdfs, when I try to run some operations (e.g., simplify_graph) on the graph I get an error. Here is a simple example:
G = ox.graph_from_place('Encinitas, CA', simplify=False, network_type='drive_service')
gdf_nodes, gdf_edges = ox.graph_to_gdfs(G, nodes=True, edges=True, node_geometry=True, fill_edge_geometry=True)
G_new = ox.graph_from_gdfs(gdf_nodes, gdf_edges)
G_new_simplified = ox.simplify_graph(G_new)
This returns the following error:
...\AppData\Local\Continuum\anaconda3\envs\bananas\lib\site-packages\osmnx\simplification.py", line 273, in simplify_graph
    elif len(set(path_attributes[attr])) == 1:
TypeError: unhashable type: 'LineString'
I get no error if I simplify the graph before converting to gdfs and back, i.e.:
G = ox.graph_from_place('Encinitas, CA', simplify=False, network_type='drive_service')
G_simplified = ox.simplify_graph(G)
This suggests it has something to do with converting to gdfs and then back to a graph.
This is similar to this previous gdfs-to-graph and vice versa question, but I am using the newest version of OSMnx (i.e., 1.1.2).
It might also be related to this other previous post but I'm still struggling with some of the specifics in the answer and how the graph class is constructed (especially with regard to attributes and its relation to path_attributes in the simplify function).
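One workaround worth trying (a hedged sketch, not a confirmed fix): assuming the unhashable LineString comes from the per-edge geometry attribute that graph_to_gdfs fills in, dropping that attribute from the reconstructed graph before simplifying may avoid the TypeError:
import osmnx as ox

G = ox.graph_from_place('Encinitas, CA', simplify=False, network_type='drive_service')
gdf_nodes, gdf_edges = ox.graph_to_gdfs(G, nodes=True, edges=True, node_geometry=True, fill_edge_geometry=True)
G_new = ox.graph_from_gdfs(gdf_nodes, gdf_edges)

# Drop the per-edge LineString geometry before simplifying, since LineStrings
# cannot be placed in a set() inside simplify_graph.
for _, _, data in G_new.edges(data=True):
    data.pop('geometry', None)

G_new_simplified = ox.simplify_graph(G_new)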

Cannot run GMMHMM (Hidden Markov Model with Gaussian Mixture emissions in hmmlearn) with a high number of mixtures

I'm trying to use a Gaussian Mixture Model HMM from the hmmlearn package with the following configuration, for a time series with 49792 samples:
model = GMMHMM(n_components=40, n_mix = 5, tol = 1e-6, covariance_type = "full", n_iter=100, verbose=True)
I get the following error:
ValueError: n_samples=3 should be >= n_clusters=5
I cannot figure out why n_samples = 3, which is what triggers the error (it seems that during the initial random clustering, some clusters end up with very few samples). Is there any way to get around this problem?
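One hedged way around this (illustrative values, not a verified fix) is to keep the number of states and mixtures small enough that the initial per-state clustering always has at least n_mix samples per state, for example by lowering n_components and/or n_mix:
from hmmlearn.hmm import GMMHMM

# With 40 states x 5 mixtures, the initial per-state clustering can end up
# with fewer samples than clusters; fewer states/mixtures avoids that.
model = GMMHMM(
    n_components=20,        # fewer hidden states (was 40)
    n_mix=3,                # fewer Gaussian mixtures per state (was 5)
    covariance_type="full",
    tol=1e-6,
    n_iter=100,
    verbose=True,
)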

When forwarding with MXNet, how do I handle a varying batch size in data_shapes?

Hi, I have a question: how can I run prediction with input data whose shape is not fixed? I will try to describe it clearly and in detail:
I use MTCNN for face detection (it's fine if you're unfamiliar with it), which employs three networks: PNet, RNet, and ONet. PNet detects a large number of proposal face bounding boxes, then these boxes are refined coarse-to-fine by the remaining networks one after another, finally producing precise face bounding boxes. When an image is fed to PNet, its size is not fixed, and the number of proposal boxes output by PNet is also not fixed; the same goes for RNet and ONet. Following another MTCNN implementation, I set a large data_shapes (e.g., image size, batch size) when I bind the module, initialize everything to zero, and then run prediction. That works, but isn't it redundant computation? (Question 1)
PNet:
max_img_w=1000
max_img_h=1000
sym, arg_params, aux_params = mx.model.load_checkpoint('det1', 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.PNets.bind(data_shapes=[('data', (1, 3, max_img_w, max_img_h))], for_training=False)
self.PNets.set_params(arg_params, aux_params)
RNet:
sym, arg_params, aux_params = mx.model.load_checkpoint('det2', 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.RNet.bind(data_shapes=[('data', (2048, 3, 24, 24))], for_training=False)
self.RNet.set_params(arg_params, aux_params, allow_missing=True)
ONet:
sym, arg_params, aux_params = mx.model.load_checkpoint('det3', 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.ONet.bind(data_shapes=[('data', (256, 3, 48, 48))], for_training=False)
self.ONet.set_params(arg_params, aux_params, allow_missing=True)
I also tried mx.mod.Module.reshape before prediction, to adjust the data shape according to the previous network's output, but I get this error: (Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
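For reference, the reshape attempt presumably looks something like the sketch below; num_boxes is an illustrative placeholder for the number of boxes produced by the previous stage, not a name from the original code:
# Hypothetical sketch of the reshape attempt described above: rebind the data
# shape to the number of candidate boxes coming out of the previous network.
num_boxes = 137  # illustrative value; in practice this comes from PNet's output
self.RNet.reshape(data_shapes=[('data', (num_boxes, 3, 24, 24))])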
One more thing: the MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primarily uses a deprecated function to load the models:
self.PNet = mx.model.FeedForward.load('det1', 0)
A single line that works with arbitrary data_shapes, so why was this function deprecated? (Question 3)
I also noticed a small difference: after loading the model, FeedForward takes 0 MB of memory before making a prediction, while mx.mod.Module takes up memory as soon as it is loaded, and its usage grows noticeably after making one prediction.
You can use Gluon, MXNet's imperative API, which lets you use different batch sizes.
If, as in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format ('-0001.params' and '-symbol.json' files, for example), you can load it in Gluon this way:
import mxnet as mx
from mxnet import gluon

ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60-minute crash course.

Restored model in tensorflow gives different results for relu operation

The weights retrieved from the restored model don't change and the input is also constant, but the output of the 'Relu:0' operation gives different results each time.
Below is my code:
import cv2
import numpy as np
import tensorflow as tf

sess = tf.Session()
saver = tf.train.import_meta_graph('checkpoints/checkpoints_otherapproach_1/cameranetwork_RAID_CNN-3100.meta')
saver.restore(sess, tf.train.latest_checkpoint(checkpoint_dir='checkpoints/checkpoints_otherapproach_1/'))

images = tf.get_default_graph().get_tensor_by_name('images:0')
phase = tf.get_default_graph().get_tensor_by_name('phase:0')
Activ = tf.get_default_graph().get_tensor_by_name('network/siamese_model/convolution_1/conv_1/Relu:0')

image_array = np.zeros(shape=[1, 3, 128, 64, 3])  # *******
imagepath = 'RAiD_Dataset' + '/images_afterremoving_persons_notinallcameras/' + 'test' + '/camera_' + str(1)
fullfile_name = imagepath + "/" + 'camera_1_person_23_index_1.jpg'
image_array[0][0] = cv2.imread(fullfile_name)
image_array[0][1] = image_array[0][0]
image_array[0][2] = image_array[0][0]
image_array = image_array.astype(np.float32)

feed_dict_values = {images: image_array, phase: False}
temp2 = sess.run(Activ, feed_dict=feed_dict_values)
temp1 = sess.run(Activ, feed_dict=feed_dict_values)
print((temp1 == temp2).all())  # output is False
There are two possible reasons for this:
Some of the TensorFlow ops inherit non-deterministic behavior from CUDA. This results in small numerical errors (which might be amplified by non-linearities). See this answer on how to try running your model on a single CPU thread. If the two arrays turn out to be identical under those conditions, then this is the cause.
I'm assuming that you know the graph you are loading, but the graph itself might produce inconsistent results 'by design' due to operations that deliberately introduce randomness or non-constant data. For example, consider operations that use the random number generator, or operations that update variables (e.g., tf.assign) each time Activ is evaluated.
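For the first check, here is a minimal sketch of forcing single-threaded CPU execution in TF 1.x (the session-configuration details are my assumption, not part of the original answer):
import tensorflow as tf

# Run on CPU only with a single thread inside and across ops, so CUDA and
# threading non-determinism are ruled out; if the two activations then match
# exactly, the first reason above is the cause.
config = tf.ConfigProto(
    device_count={'GPU': 0},
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1,
)
sess = tf.Session(config=config)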

Serialized neural network size in Torch after splitting layers

After training a neural network, let's say a multi-layer perceptron, at prediction time I want to split the first layer from all the others.
To do so, the only way I found that produces files of the correct size is the following:
I loop through all the layers and add each one to one of two containers (the first layer, or all the others), which I then save separately using the torch.save function. The odd bit is that I need to retrieve the parameters of each layer before adding it to either container; otherwise, when saved, both files (the first layer and all the other layers) have the same file size.
A code snippet will be more helpful than my explanation:
local function split_model(network)
    -- For some reason all the saved models end up with the same size
    -- unless 'getParameters()' is called on each layer before splitting.
    local first_layer = nn.Sequential()
    local all_the_rest = nn.Sequential()
    for i = 1, network:size() do
        local l = network:get(i)
        -- Retrieving the layer's parameters here is what keeps the saved files small.
        local l_params, _ = l:getParameters()
        if i == 1 then
            first_layer:add(l)
        else
            all_the_rest:add(l)
        end
    end
    return first_layer, all_the_rest
end

local first_layer, all_the_rest = split_model(network)
torch.save("checkpoints/mlp.t7", network)
torch.save("checkpoints/first_layer.t7", first_layer)
torch.save("checkpoints/all_the_rest.t7", all_the_rest)
The same question was posted in a Google group; this is the answer by Alban Desmaison:
Hi,
The reason for this behavior is the way getParameters works:
https://github.com/torch/nn/blob/master/doc/module.md#flatparameters-flatgradparameters-getparameters
To be able to return a flat tensor containing all the parameters, it actually creates a single Storage containing all the weights, and then each module's weight is a part of this storage. When you save the weights of any element of the network, it has to save the weight tensor and, to do so, saves the underlying storage. Hence, if you called getParameters on the complete network, saving any single module will save all the network's weights. Here, when you call getParameters on the single module, it re-creates this single storage but for this single module only, so when you save it, it contains only the weights that you want. BUT note that the flattened parameters returned by the getParameters you did on the complete network are not valid anymore!
You have two solutions here:
- If you don't want to use the params coming from getParameters on the complete network, you can just call getParameters on each subset of your network before saving it. This will change the underlying storage to contain only this subset of the network, and you will save only what you need (shared storages are stored only once).
- If you want to keep using the params from the original getParameters, you can do the same as above but run getParameters and the saving on a cloned version of the subset.
And because a code snippet is always better:
require 'nn'
local subset1 = nn.Linear(2,2)
local subset2 = nn.Linear(2,2)
local network = nn.Sequential():add(subset1):add(subset2)
print("Before getParameters:", subset1.weight:storage():size()) -- 4 elements
network_params,_ = network:getParameters()
print("After getParameters:", subset1.weight:storage():size()) -- 12 elements
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- true
-- Keeping network_params valid
local clone_subset1 = subset1:clone()
print("Cloned subset1 before getParameters:", clone_subset1.weight:storage():size()) -- 12 elements
clone_subset1:getParameters()
print("Cloned subset1 after getParameters:", clone_subset1.weight:storage():size()) -- 6 elements (4 weights + 2 bias)
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- true
-- Not keeping network_params valid (should be faster)
local clone_subset1 = subset1:clone()
print("subset1 before getParameters:", subset1.weight:storage():size()) -- 12 elements
subset1:getParameters()
print("subset1 after getParameters:", subset1.weight:storage():size()) -- 6 elements (4 weights + 2 bias)
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- false