How can I solve this elusive error in my multi-GPU Pytorch setup?

I have spent the past day trying to figure out how to use multiple GPUs. In theory, parallelizing models across multiple GPUs is supposed to be as easy as wrapping them with nn.DataParallel. However, this does not work for me. As the simplest and most canonical proof I could find, I ran the code in the Data Parallelism tutorial, line for line.
I have tried everything from only having a specific permutation of my GPUs be visible to CUDA to reinstalling everything related to CUDA but can't figure out why I cannot run with multiple GPUs. Some information about my machine:
Operating System: Ubuntu 16.04
GPUs: 4 x 1080 Ti
Pytorch version: 1.01
CUDA version: 10.0
The error code is the following:
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-3-0f0d83e9ef13> in <module>
1 for data in rand_loader:
2 input =
----> 3 output = model(input)
4 print("Outside: input size", input.size(),
5 "output_size", output.size())
/usr/local/lib/python3.6/site-packages/torch/nn/modules/ in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
/usr/local/lib/python3.6/site-packages/torch/nn/parallel/ in forward(self, *inputs, **kwargs)
141 return self.module(*inputs[0], **kwargs[0])
142 replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 143 outputs = self.parallel_apply(replicas, inputs, kwargs)
144 return self.gather(outputs, self.output_device)
/usr/local/lib/python3.6/site-packages/torch/nn/parallel/ in parallel_apply(self, replicas, inputs, kwargs)
152 def parallel_apply(self, replicas, inputs, kwargs):
--> 153 return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
155 def gather(self, outputs, output_device):
/usr/local/lib/python3.6/site-packages/torch/nn/parallel/ in parallel_apply(modules, inputs, kwargs_tup, devices)
73 thread.start()
74 for thread in threads:
---> 75 thread.join()
76 else:
77 _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])
/usr/local/lib/python3.6/ in join(self, timeout)
1055 if timeout is None:
-> 1056 self._wait_for_tstate_lock()
1057 else:
1058 # the behavior of a negative timeout isn't documented, but
/usr/local/lib/python3.6/ in _wait_for_tstate_lock(self, block, timeout)
1070 if lock is None: # already determined that the C code is done
1071 assert self._is_stopped
-> 1072 elif lock.acquire(block, timeout):
1073 lock.release()
1074 self._stop()
Any insight into this error would be very much appreciated. From my relatively limited systems and CUDA knowledge, it has to do with some sort of locking, but I can't for the life of me figure out how to fix this.


Evaluating the state value function when using the SAC agent of TF-Agents

The state value function v at states x is a quantity of interest of the Markov decision process (MDP) which I intend to solve. (My MDP is fully observable: observation = state.)
I use the SAC agent of TF-agents to learn action value function q(x,a) and policy π. Thus given a state x, the policy returns an approximately optimal action a = π(x) so that v(x) ≈ q(x,π(x)).
Problem description: How can one write q(x,π(x)) as a TF-Agents expression?
I can examine the problem already with the SAC tutorial by adding the following lines to the end of the tutorial:
# Resetting the environment to obtain a TimeStep object
time_step = env.reset()
# An observation which respects the observation specs of env, corresponding to x above
observation = time_step.observation
# Calling the evaluation policy we obtain an action, this is essentially π(x) above
action = eval_policy.action(time_step).action
# I was expecting that the next line would return q(x,π(x))
critic_net((observation,action))
The reason for the last line was that the input_tensor_spec of a CriticNetwork was described as a tuple of (observation, action) in
However instead critic_net((observation,action)) raises the following error:
InvalidArgumentError Traceback (most recent call last)
<ipython-input-32-8446b099696b> in <module>
----> 1 critic_net((observation,action))
2 frames
/usr/local/lib/python3.8/dist-packages/tf_agents/networks/ in __call__(self, inputs, *args, **kwargs)
425 normalized_kwargs.pop("network_state", None)
--> 427 outputs, new_state = super(Network, self).__call__(**normalized_kwargs) # pytype: disable=attribute-error # typed-keras
429 nest_utils.assert_matching_dtypes_and_inner_shapes(
/usr/local/lib/python3.8/dist-packages/keras/utils/ in error_handler(*args, **kwargs)
68 # To get the full stack trace, call:
69 # `tf.debugging.disable_traceback_filtering()`
---> 70 raise e.with_traceback(filtered_tb) from None
71 finally:
72 del filtered_tb
/usr/local/lib/python3.8/dist-packages/tf_agents/agents/ddpg/ in call(***failed resolving arguments***)
166 actions = layer(actions, training=training)
--> 168 joint = tf.concat([observations, actions], 1)
169 for layer in self._joint_layers:
170 joint = layer(joint, training=training)
InvalidArgumentError: Exception encountered when calling layer 'CriticNetwork' (type CriticNetwork).
{{function_node __wrapped__ConcatV2_N_2_device_/job:localhost/replica:0/task:0/device:CPU:0}} ConcatOp : Dimension 0 in both shapes must be equal: shape[0] = [28,1] vs. shape[1] = [8,1] [Op:ConcatV2] name: concat
Call arguments received by layer 'CriticNetwork' (type CriticNetwork):
• inputs=('tf.Tensor(shape=(28,), dtype=float32)', 'tf.Tensor(shape=(8,), dtype=float32)')
• step_type=()
• network_state=()
• training=False
Can someone help me with the evaluation of the critic network?
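One plausible reading of the error: the call arguments show tensors of shape (28,) and (8,), so the observation and action were passed without a leading batch dimension, while the critic concatenates its two inputs along axis 1. The sketch below reproduces only that shape logic with NumPy; the zero-filled arrays and the helper name joint_input are stand-ins, not TF-Agents code. In TensorFlow, tf.expand_dims(x, 0) adds the same leading axis. Whether batching is the only change the tutorial setup needs is an assumption.

```python
import numpy as np

# Stand-ins with the shapes reported in the error message.
observation = np.zeros(28, dtype=np.float32)  # shape (28,)
action = np.zeros(8, dtype=np.float32)        # shape (8,)


def joint_input(obs, act):
    """Concatenate along axis 1, mirroring the critic's tf.concat([...], 1)."""
    return np.concatenate([obs, act], axis=1)


# Unbatched inputs have no axis 1, so the concatenation fails.
try:
    joint_input(observation, action)
    unbatched_ok = True
except ValueError:
    unbatched_ok = False

# Adding a batch axis of size 1 gives shapes (1, 28) and (1, 8),
# which concatenate cleanly along axis 1.
batched = joint_input(observation[np.newaxis, :], action[np.newaxis, :])
print(unbatched_ok, batched.shape)
```

Note that, as the traceback line `outputs, new_state = ...` indicates, calling the network returns a pair, so the Q-value would be the first element of the result.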

My trained TensorFlow model raises an error in the transformers pipeline

I'm working from a GitHub text summarization project and have run into a problem. I have been struggling with it for two weeks and cannot figure it out.
I'm using a notebook from this github repository:
notebook link:
After training the model I want to use huggingface transformer pipeline to generate summarizations.
from transformers import pipeline
summarizer = pipeline("summarization", model=model, tokenizer="t5-small", framework="tf")
summarizer("some text")
but it returns the following error:
AttributeError: 'Functional' object has no attribute 'config'
Does anyone have any idea how I can solve it?
full error:
AttributeError Traceback (most recent call last)
/tmp/ipykernel_20/ in
----> 1 summarizer = pipeline("summarization", model=model, tokenizer="t5-small", framework="tf")
3 summarizer("The US has passed the peak on new coronavirus cases, President Donald Trump said and predicted that some states would reopen")
/opt/conda/lib/python3.7/site-packages/transformers/pipelines/ in pipeline(task, model, config, tokenizer, framework, revision, use_fast, use_auth_token, model_kwargs, **kwargs)
432 break
--> 434 return task_class(model=model, tokenizer=tokenizer, modelcard=modelcard, framework=framework, task=task, **kwargs)
/opt/conda/lib/python3.7/site-packages/transformers/pipelines/ in __init__(self, *args, **kwargs)
38 def __init__(self, *args, **kwargs):
---> 39 super().__init__(*args, **kwargs)
41 self.check_model_type(
/opt/conda/lib/python3.7/site-packages/transformers/pipelines/ in __init__(self, model, tokenizer, modelcard, framework, task, args_parser, device, binary_output)
549 # Update config with task specific parameters
--> 550 task_specific_params = self.model.config.task_specific_params
551 if task_specific_params is not None and task in task_specific_params:
552 self.model.config.update(task_specific_params.get(task))
AttributeError: 'Functional' object has no attribute 'config'
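The traceback shows the pipeline reading self.model.config.task_specific_params, so any object passed as model needs a config attribute; a plain Keras functional model does not have one. The sketch below is a hypothetical pre-flight check, not part of transformers: check_pipeline_model, the stand-in Functional class, and the hf_like object are all invented for illustration and only mirror the attribute accesses visible in the traceback.

```python
from types import SimpleNamespace


def check_pipeline_model(model):
    """List problems that would break the attribute accesses seen in the traceback."""
    problems = []
    config = getattr(model, "config", None)
    if config is None:
        problems.append(f"{type(model).__name__} object has no attribute 'config'")
    elif not hasattr(config, "task_specific_params"):
        problems.append("model.config has no 'task_specific_params'")
    return problems


class Functional:
    """Stand-in for a plain Keras functional model (no `config` attribute)."""


# Stand-in for a model object that carries a config, as the pipeline expects.
hf_like = SimpleNamespace(config=SimpleNamespace(task_specific_params=None))

print(check_pipeline_model(Functional()))
print(check_pipeline_model(hf_like))
```

If the check reports a missing config, the object is probably not a transformers model class, and the pipeline will fail in the same place as above.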

tensorflow UnknownError: Graph execution error: JIT compilation failed. [Op:__inference_restored_function_body_9127]

I was trying to use the Universal Sentence Encoder from TensorFlow Hub.
I downloaded and extracted the Universal Sentence Encoder from the hub,
and when I tried to embed a sentence it raised an error saying
UnknownError: Graph execution error:
JIT compilation failed.
import tensorflow_hub as hub
# Load the downloaded and untarred Universal Sentence Encoder
embed = hub.load("./universal-sentence-encoder_4/")
# Sentences to embed, passed as a list to embed()
Sentences = [
    "How old are you"
]
embeddings = embed(Sentences)
and got error
2022-11-25 06:29:46.006767: I tensorflow/core/common_runtime/gpu/] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2630 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2022-11-25 06:29:50.652156: W tensorflow/core/framework/] UNKNOWN: JIT compilation failed.
UnknownError Traceback (most recent call last)
Input In [1], in <cell line: 25>()
17 # Load pre-trained universal sentence encoder model
18 # embed = hub.load("")
20 # Sentences for which you want to create embeddings,
21 # passed as an array in embed()
22 Sentences = [
23 "How old are you"
24 ]
---> 25 embeddings = embed(Sentences)
27 # Printing embeddings of each sentence
28 print(embeddings)
File ~/miniconda3/envs/tf/lib/python3.10/site-packages/tensorflow/python/saved_model/, in _call_attribute(instance, *args, **kwargs)
703 def _call_attribute(instance, *args, **kwargs):
--> 704 return instance.__call__(*args, **kwargs)
File ~/miniconda3/envs/tf/lib/python3.10/site-packages/tensorflow/python/util/, in filter_traceback.<locals>.error_handler(*args, **kwargs)
151 except Exception as e:
152 filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153 raise e.with_traceback(filtered_tb) from None
154 finally:
155 del filtered_tb
File ~/miniconda3/envs/tf/lib/python3.10/site-packages/tensorflow/python/eager/, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
52 try:
53 ctx.ensure_initialized()
---> 54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
57 if name is not None:
UnknownError: Graph execution error:
JIT compilation failed.
[[{{node EncoderDNN/EmbeddingLookup/EmbeddingLookupUnique/embedding_lookup/mod}}]] [Op:__inference_restored_function_body_4561]
How do I fix it?
I just want it working.
I had the error too and just did it with my CPU and it worked.
with tf.device('/CPU:0'):
    embeddings = embed(Sentences)
First of all, there is a bug in falling back from GPU to CPU; tf.estimators() and TensorFlow Hub (embedding-4) required dedicated hardware.
Sample: Try adding the CUDA path to your OS environment variables. The errors you got indicate an incomplete installation or setup; follow the instructions they give.
import tensorflow as tf
import tensorflow_hub as hub
physical_devices = tf.config.experimental.list_physical_devices('GPU')
# Example value: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
# set_memory_growth returns None, so there is nothing to assign
tf.config.experimental.set_memory_growth(physical_devices[0], True)
embed = hub.load("")
embeddings = embed([
"The quick brown fox jumps over the lazy dog.",
"I am a sentence for which I would like to get its embedding"])
Output: the embeddings, one 512-dimensional vector per sentence.
[[-0.03133015 -0.06338634 -0.01607501 ... -0.03242781 -0.0457574
[ 0.05080863 -0.01652433 0.0157378 ... 0.00976659 0.03170122
0.01788119]], shape=(2, 512), dtype=float32)
I faced similar issues with TensorFlow 2.10.1 and fixed them by downgrading to 2.8.0.
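The log line "Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice" suggests XLA could not locate the CUDA data directory. One way to act on the "add the CUDA path" advice is to set XLA_FLAGS with --xla_gpu_cuda_data_dir before TensorFlow is imported. The sketch below is a minimal version of that idea using only the standard library; the helper names and the candidate directories are assumptions, so adjust them to wherever nvvm/libdevice actually lives on your machine.

```python
import os
from pathlib import Path


def find_cuda_dir_with_libdevice(candidates):
    """Return the first candidate directory that contains nvvm/libdevice, or None."""
    for candidate in candidates:
        root = Path(candidate)
        if (root / "nvvm" / "libdevice").is_dir():
            return str(root)
    return None


def point_xla_at_cuda(candidates, environ=os.environ):
    """Append --xla_gpu_cuda_data_dir to XLA_FLAGS if a suitable directory exists."""
    cuda_dir = find_cuda_dir_with_libdevice(candidates)
    if cuda_dir is not None:
        flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
        existing = environ.get("XLA_FLAGS", "")
        if flag not in existing:
            environ["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return cuda_dir


# Hypothetical candidate locations; empty entries are dropped.
CANDIDATES = [
    path
    for path in (
        os.environ.get("CUDA_DIR", ""),
        os.environ.get("CONDA_PREFIX", ""),
        "/usr/local/cuda",
        "/usr/lib/cuda",
    )
    if path
]
```

Calling point_xla_at_cuda(CANDIDATES) at the top of the script, before import tensorflow, would set the flag when one of the candidates matches; if it returns None, none of the guessed locations contains libdevice.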

OSError: [Errno 95] Operation not supported: '/content/drive/Mask_RCNN' on Google Colab

Hello, I am trying to use saved weights for a Mask RCNN model in Colab and keep hitting the error message below. I have tried different ways of accessing the .h5 file, which was an issue before, and now I have hit a brick wall. I have tried training different parts of the model, but nothing works. I could not find anything specific to Google Colab for this situation.
The following is the cell that throws the issue:
# Training dataset.
dataset_train = linkedinDataset()
dataset_train.load_dataset(dataset_dir, "train")
# Validation dataset
dataset_val = linkedinDataset()
dataset_val.load_dataset(dataset_dir, "val")
# *** This training schedule is an example. Update to your needs ***
print("Training network heads")
```Training network heads
OSError Traceback (most recent call last)
<ipython-input-19-174a93609e58> in <module>()
17 learning_rate=config.LEARNING_RATE,
18 epochs=5,
---> 19 layers='heads')
2 frames
/content/Mask_RCNN/mrcnn/ in train(self, train_dataset, val_dataset, learning_rate, epochs,
layers, augmentation, custom_callbacks, no_augmentation_sources)
2334 # Create log_dir if it does not exist
2335 if not os.path.exists(self.log_dir):
-> 2336 os.makedirs(self.log_dir)
2338 # Callbacks
/usr/lib/python3.6/ in makedirs(name, mode, exist_ok)
208 if head and tail and not path.exists(head):
209 try:
--> 210 makedirs(head, mode, exist_ok)
211 except FileExistsError:
212 # Defeats race condition when another thread created the path
/usr/lib/python3.6/ in makedirs(name, mode, exist_ok)
218 return
219 try:
--> 220 mkdir(name, mode)
221 except OSError:
222 # Cannot rely on checking for EEXIST, since the operating system
OSError: [Errno 95] Operation not supported: '/content/drive/Mask_RCNN'```
You cannot use
You should save to either
Or, if to use Google Drive,
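As a separate defensive sketch (not part of the answer above): the traceback shows os.makedirs failing while creating the log directory, so one option is to try the preferred location and fall back to a directory that is known to be writable. The helper name ensure_log_dir and both path parameters are hypothetical; which concrete paths to pass depends on your Colab setup.

```python
import os


def ensure_log_dir(preferred, fallback):
    """Create and return `preferred`; if that raises OSError, use `fallback`."""
    try:
        os.makedirs(preferred, exist_ok=True)
        return preferred
    except OSError as err:
        # Errno 95 (operation not supported) from the traceback lands here too.
        print(f"Cannot create {preferred!r} ({err}); using {fallback!r} instead")
        os.makedirs(fallback, exist_ok=True)
        return fallback
```

The returned path could then be used as the model's log directory, so training does not abort when the first location rejects mkdir.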

ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 1]))

ValueError Traceback (most recent call last)
<ipython-input-30-33821ccddf5f> in <module>
23 output = model(data)
24 # calculate the batch loss
---> 25 loss = criterion(output, target)
26 # backward pass: compute gradient of the loss with respect to model parameters
27 loss.backward()
C:\Users\mnauf\Anaconda3\envs\federated_learning\lib\site-packages\torch\nn\modules\ in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)
C:\Users\mnauf\Anaconda3\envs\federated_learning\lib\site-packages\torch\nn\modules\ in forward(self, input, target)
593 self.weight,
594 pos_weight=self.pos_weight,
--> 595 reduction=self.reduction)
C:\Users\mnauf\Anaconda3\envs\federated_learning\lib\site-packages\torch\nn\ in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2074 if not (target.size() == input.size()):
-> 2075 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2077 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 1]))
I am training a CNN on the Horses vs. Humans dataset. This is my code. I am using criterion = nn.BCEWithLogitsLoss() and optimizer = optim.RMSprop(model.parameters(), lr=0.01). My final layer is self.fc2 = nn.Linear(512, 1). Our last neuron will output 1 for horse and 0 for human, right? Or should I use 2 output neurons?
16 is the batch size. The error says ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 1])), but I don't understand where I need to make a change to fix it.
target = target.unsqueeze(1), before passing target to criterion, changed the target tensor size from [16] to [16,1]. Doing it solved the issue. Furthermore, I also needed to do target = target.float() before passing it to criterion, because our outputs are in float. Besides, there was another error in the code. I was using sigmoid activation function in the last layer, but I shouldn’t because the criterion I am using already comes with sigmoid builtin.
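The fix described above can be sketched in isolation. The random logits stand in for the output of the final nn.Linear(512, 1) layer and the random integer labels stand in for a batch from the data loader; both are assumptions made so the snippet runs without the original model or dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size = 16

# Stand-in for model(data): raw scores (no sigmoid) of shape [16, 1].
logits = torch.randn(batch_size, 1)
# Stand-in for the labels from the loader: integers of shape [16].
target = torch.randint(0, 2, (batch_size,))

criterion = nn.BCEWithLogitsLoss()

# Match the logits' shape and dtype: [16] -> [16, 1], int64 -> float32.
target = target.unsqueeze(1).float()

# BCEWithLogitsLoss applies the sigmoid internally, so raw logits go in.
loss = criterion(logits, target)
print(target.shape, loss.item())
```

An equivalent alternative would be to squeeze the model output to shape [16] instead of unsqueezing the target; either way, both tensors must end up with the same shape.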
You can also try _, pred = torch.max(output, 1) and then pass the pred variable into Loss function.
I had the same error when I ran my model. I was able to correct it by returning torch.tensor([target]).float().to(device) at the Dataset class.