Related
I created a Python package to train a PyTorch model on GCP. I have stored my training data in the GCS bucket I am using, but when I try to read the CSV file I get an error:
Exception ignored in: <function ClientSession.__del__ at 0x7f1ede3bd710>
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 Traceback (most recent call last):
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/aiohttp/client.py", line 326, in __del__
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 if not self.closed:
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/aiohttp/client.py", line 963, in closed
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 return self._connector is None or self._connector.closed
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 AttributeError: 'ClientSession' object has no attribute '_connector'
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 ERROR:root:Traceback (most recent call last):
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 "__main__", mod_spec)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 exec(code, run_globals)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 78, in <module>
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 main()
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 74, in main
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 experiment.run(args)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 66, in run
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 train_dataset, test_dataset = utils.load_data()
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/trainer/utils.py", line 65, in load_data
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 train_dataset = pd.read_csv(TRAIN_DIR)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 return _read(filepath_or_buffer, kwds)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 462, in _read
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 parser = TextFileReader(filepath_or_buffer, **kwds)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 819, in __init__
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 self._engine = self._make_engine(self.engine)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1867, in __init__
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 self._open_handles(src, kwds)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/parsers.py", line 1368, in _open_handles
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 storage_options=kwds.get("storage_options", None),
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/common.py", line 563, in get_handle
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 storage_options=storage_options,
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/pandas/io/common.py", line 334, in _get_filepath_or_buffer
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 filepath_or_buffer, mode=fsspec_mode, **(storage_options or {})
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/core.py", line 476, in open
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 **kwargs,
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/core.py", line 306, in open_files
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 expand=expand,
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/core.py", line 657, in get_fs_token_paths
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 fs = cls(**options)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/spec.py", line 76, in __call__
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 obj = super().__call__(*args, **kwargs)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/opt/conda/lib/python3.7/site-packages/gcsfs/core.py", line 270, in __init__
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 self.loop, get_client, callback_timeout=self.callback_timeout
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/asyn.py", line 66, in sync
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 raise return_result
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/asyn.py", line 26, in _runner
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 result[0] = await coro
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 File "/root/.local/lib/python3.7/site-packages/fsspec/implementations/http.py", line 29, in get_client
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 return aiohttp.ClientSession(**kwargs)
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 TypeError: __init__() got an unexpected keyword argument 'callback_timeout'
ERROR 2022-07-04 13:39:43 +0200 master-replica-0 NoneType: None
INFO 2022-07-04 13:39:43 +0200 master-replica-0 ERROR:Command '['python3', '-m', 'trainer.task', '--model-name=bigbert_large_finetuned_cnnhead', '--job-dir=gs://led-test-run/bigbert_large_finetuned_cnnhead/models/bigbert_large_finetuned_cnnhead_20220704_133325']' returned non-zero exit status 1.
This is the code in my utils.py:
import os
import pandas as pd

GCS_BUCKET = "led-test-run"  #@param {type:"string"}
# Setting the location where training logs and checkpoints will be stored
GCS_BASE_ROOT = f"gs://{GCS_BUCKET}"
TRAIN_DIR = os.path.join(GCS_BASE_ROOT, "bigpatent_sample_corpus_BIG_train_df.csv")
VAL_DIR = os.path.join(GCS_BASE_ROOT, "bigpatent_sample_corpus_BIG_val_df.csv")

train_dataset = pd.read_csv(TRAIN_DIR)
Does anyone know where the problem is, or how I can read my training data in another way?
Thanks!
EDIT:
I now tried to read the file in another way, but I get the same error when I try to use GCSFileSystem.
import gcsfs
import pandas as pd

fs = gcsfs.GCSFileSystem(project='master-thesis-351911')
with fs.open('led-test-run/bigpatent_sample_corpus_BIG_train_df.csv') as f:
    train_dataset = pd.read_csv(f)
The problem seems to occur when connecting to the project.
The replica master 0 exited with a non-zero status of 1.
ERROR 2022-07-04 14:27:34 +0200 service Traceback (most recent call last):
ERROR 2022-07-04 14:27:34 +0200 service File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
ERROR 2022-07-04 14:27:34 +0200 service "__main__", mod_spec)
ERROR 2022-07-04 14:27:34 +0200 service File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
ERROR 2022-07-04 14:27:34 +0200 service exec(code, run_globals)
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 3, in <module>
ERROR 2022-07-04 14:27:34 +0200 service from trainer import experiment
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 13, in <module>
ERROR 2022-07-04 14:27:34 +0200 service from trainer import model, metadata, utils
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/trainer/utils.py", line 24, in <module>
ERROR 2022-07-04 14:27:34 +0200 service fs = gcsfs.GCSFileSystem(project='master-thesis-351911')
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/fsspec/spec.py", line 76, in __call__
ERROR 2022-07-04 14:27:34 +0200 service obj = super().__call__(*args, **kwargs)
ERROR 2022-07-04 14:27:34 +0200 service File "/opt/conda/lib/python3.7/site-packages/gcsfs/core.py", line 270, in __init__
ERROR 2022-07-04 14:27:34 +0200 service self.loop, get_client, callback_timeout=self.callback_timeout
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/fsspec/asyn.py", line 66, in sync
ERROR 2022-07-04 14:27:34 +0200 service raise return_result
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/fsspec/asyn.py", line 26, in _runner
ERROR 2022-07-04 14:27:34 +0200 service result[0] = await coro
ERROR 2022-07-04 14:27:34 +0200 service File "/root/.local/lib/python3.7/site-packages/fsspec/implementations/http.py", line 29, in get_client
ERROR 2022-07-04 14:27:34 +0200 service return aiohttp.ClientSession(**kwargs)
ERROR 2022-07-04 14:27:34 +0200 service TypeError: __init__() got an unexpected keyword argument 'callback_timeout'
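For reference, a minimal sketch of reading the same file with the google-cloud-storage client instead of going through gcsfs/fsspec (project, bucket, and object names are taken from the snippets above; it assumes the training VM's default service-account credentials can read the bucket and that a reasonably recent google-cloud-storage is installed):

from io import BytesIO

import pandas as pd
from google.cloud import storage

# Download the training CSV as bytes and hand it straight to pandas,
# bypassing the gcsfs/fsspec layer that appears in the traceback.
client = storage.Client(project="master-thesis-351911")
blob = client.bucket("led-test-run").blob("bigpatent_sample_corpus_BIG_train_df.csv")
train_dataset = pd.read_csv(BytesIO(blob.download_as_bytes()))

Since the traceback shows gcsfs imported from /opt/conda but fsspec from /root/.local, pinning both packages to a matching release pair in the trainer's setup.py would be another thing worth trying.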
I recently downloaded cuDNN and am getting some errors. My model starts training but then quickly dies out; with a smaller network it trains for longer before dying. I want to understand whether these errors are GPU OOM related or something else.
I am using TensorFlow 1.15.2 and cuDNN 7.6.5 with CUDA 10.0.
Errors:
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 49.104568
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 54.958607
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 3 | Loss: 46.999936
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 4 | Loss: 69.989386
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 5 | Loss: 67.471436
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 6 | Loss: 66.270167
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 24, 4, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 24, 4, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
feed_dict=feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 24, 4, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 24, 4, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
0 successful operations.
0 derived errors ignored.
The error indicates that you're running out of GPU memory. Either remove samples that are too long from your dataset or reduce your batch size.
If that doesn't help, look at this thread: there has been discussion there around this known bug (similar to the one you reported), and the proposed working solution is to set:
TF_CUDNN_RESET_RND_GEN_STATE=1
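For example, set the flag in the shell that launches DeepSpeech.py (export TF_CUDNN_RESET_RND_GEN_STATE=1), or, as a minimal sketch, from Python at the top of the training script:

import os

# Set before training starts so TensorFlow's cuDNN RNN ops see it.
os.environ["TF_CUDNN_RESET_RND_GEN_STATE"] = "1"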
Hello TensorFlow and Google Cloud users/developers,
When I submit a job that requires GPU support, ml-engine fails while loading the libnccl.so.2 file. Here is the output from the gcloud logs:
INFO 2019-01-07 15:13:58 +0000 master-replica-0 Error reported to Coordinator: libnccl.so.2: cannot open shared object file: No such file or directory
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 Traceback (most recent call last):
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 "__main__", mod_spec)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 exec(code, run_globals)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/root/.local/lib/python3.5/site-packages/main/task.py", line 220, in <module>
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 main()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/root/.local/lib/python3.5/site-packages/main/task.py", line 185, in main
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 valid_spec
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return executor.run()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 637, in run
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 getattr(self, task_to_run)()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 674, in run_master
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 self._start_distributed_training(saving_listeners=saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 788, in _start_distributed_training
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 saving_listeners=saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 loss = self._train_model(input_fn, hooks, saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1205, in _train_model
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._train_model_distributed(input_fn, hooks, saving_listeners)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1316, in _train_model_distributed
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 self.config)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 721, in call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._call_for_each_tower(fn, *args, **kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 556, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return _call_for_each_tower(self, fn, *args, **kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 183, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 coord.join(threads)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 six.reraise(*self._exc_info_to_raise)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 raise value
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 yield
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 177, in _call_for_each_tower
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 **merge_kwargs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 661, in _distributed_apply
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 variable_scope.VariableAggregation.SUM, grads_and_vars)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/distribute.py", line 776, in batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._batch_reduce(aggregation, value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/mirrored_strategy.py", line 628, in _batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 243, in batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return self._batch_reduce(aggregation, value_destination_pairs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 597, in _batch_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 [v[0] for v in value_destination_pairs])
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py", line 631, in _batch_all_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 device_grad_packs)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py", line 41, in aggregate_gradients_using_nccl
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 agg_grads = nccl.all_sum(single_grads)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 49, in all_sum
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 return _apply_all_reduce('sum', tensors)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 217, in _apply_all_reduce
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 _validate_and_load_nccl_so()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 288, in _validate_and_load_nccl_so
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 _maybe_load_nccl_ops_so()
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py", line 274, in _maybe_load_nccl_ops_so
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 resource_loader.get_path_to_datafile('_nccl_ops.so'))
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/util/loader.py", line 56, in load_op_library
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 ret = load_library.load_op_library(path)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/load_library.py", line 60, in load_op_library
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 lib_handle = py_tf.TF_LoadLibrary(library_filename)
ERROR 2019-01-07 15:13:58 +0000 master-replica-0 tensorflow.python.framework.errors_impl.NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
Am I supposed to install NCCL on ml-engine? I specify "tensorflow-gpu (>=1.12)" in the required_packages field of setup.py, and my config.yaml file looks like this:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerType: complex_model_m_gpu
  parameterServerType: large_model
  workerCount: 0
  parameterServerCount: 0
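A rough sketch of such a setup.py (only the tensorflow-gpu requirement comes from the question above; the package name is taken from the task module path in the logs, and the rest of the metadata is a placeholder):

from setuptools import find_packages, setup

# Only the tensorflow-gpu pin below reflects the requirement described above;
# the remaining metadata is illustrative.
REQUIRED_PACKAGES = ["tensorflow-gpu>=1.12"]

setup(
    name="main",
    version="0.1",
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
    description="ML Engine training package",
)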
My quota allows me to use four K80 devices in the europe-west1 region.
Thanks for the help in advance.
I use Anaconda2 with a Python 3.5 based tensorflow-gpu environment on Windows 10. I tested the installation of TensorFlow (v1.2) by running:
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
There is no problem with the installation.
Then I further tested it by running two of the provided examples:
reader_test.py
ptb_word_lm.py  # this uses an LSTM to model the Penn Treebank data
But neither of the two programs runs successfully:
For the first case:
For the second case:
# run in the Anaconda prompt
(tensorflow-gpu) D:\Research\Manuscript\Simplified LSTM\models-master\models-master\tutorials\rnn\ptb>python ptb_word_lm.py --data_path=D:\simple-examples\data
Resultant error messages:
2017-06-30 18:06:05.819002: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819089: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819770: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819816: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819843: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819866: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819889: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:05.819911: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-06-30 18:06:06.317871: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:940] Found device 0 with properties:
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.66GiB
2017-06-30 18:06:06.317961: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:961] DMA: 0
2017-06-30 18:06:06.321380: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0: Y
2017-06-30 18:06:06.322688: I c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:01:00.0)
WARNING:tensorflow:Standard services need a 'logdir' passed to the SessionManager
Epoch: 1 Learning rate: 1.000
2017-06-30 18:06:11.106452: E c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\stream_executor\cuda\cuda_blas.cc:365] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2017-06-30 18:06:11.106573: W c:\tf_jenkins\home\workspace\release-win\m\windows-gpu\py\35\tensorflow\stream_executor\stream.cc:1601] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\contextlib.py", line 66, in __exit__
next(self.gen)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20, 400), b.shape=(400, 800), m=20, n=800, k=400
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/concat, Model/RNN/multi_rnn_cell/cell_0/basic_lstm_cell/kernel/read)]]
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39/_123 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6049_Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ptb_word_lm.py", line 395, in <module>
tf.app.run()
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "ptb_word_lm.py", line 381, in main
verbose=True)
File "ptb_word_lm.py", line 310, in run_epoch
vals = session.run(fetches, feed_dict)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(20, 400), b.shape=(400, 800), m=20, n=800, k=400
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/concat, Model/RNN/multi_rnn_cell/cell_0/basic_lstm_cell/kernel/read)]]
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39/_123 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6049_Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op 'Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/MatMul', defined at:
File "ptb_word_lm.py", line 395, in <module>
tf.app.run()
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "ptb_word_lm.py", line 357, in main
m = PTBModel(is_training=True, config=config, input_=train_input)
File "ptb_word_lm.py", line 157, in __init__
(cell_output, state) = cell(inputs[:, time_step, :], state)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py", line 180, in __call__
return super(RNNCell, self).__call__(inputs, state)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\layers\base.py", line 441, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py", line 916, in call
cur_inp, new_state = cell(cur_inp, cur_state)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py", line 180, in __call__
return super(RNNCell, self).__call__(inputs, state)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\layers\base.py", line 441, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py", line 383, in call
concat = _linear([inputs, h], 4 * self._num_units, True)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\rnn_cell_impl.py", line 1021, in _linear
res = math_ops.matmul(array_ops.concat(args, 1), weights)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1816, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1217, in _mat_mul
transpose_b=transpose_b, name=name)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\Y L\Anaconda2\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(20, 400), b.shape=(400, 800), m=20, n=800, k=400
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](Train/Model/RNN/RNN/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/basic_lstm_cell/concat, Model/RNN/multi_rnn_cell/cell_0/basic_lstm_cell/kernel/read)]]
[[Node: Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39/_123 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_6049_Train/Model/RNN/RNN/multi_rnn_cell/cell_1/cell_1/basic_lstm_cell/add_39", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
I solved the problem by updating Anaconda (conda update --all) and then restarting the PC.
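A workaround that is also commonly suggested for CUBLAS_STATUS_ALLOC_FAILED / Blas GEMM failures on small-memory GPUs (separate from the fix described above, and only a sketch for the TF 1.x API) is to let TensorFlow allocate GPU memory on demand instead of reserving it all up front:

import tensorflow as tf

# Enable on-demand GPU memory allocation (TF 1.x API).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)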
Running OpenStack Newton, fresh install, I am getting this on the compute node (in nova-compute.log) when trying to launch an instance:
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [req-af37e2ee-0ef9-4d4e-b3ce-d7a1bf27a780 - - - - -] [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] An error occurred while refreshing the network cache.
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] Traceback (most recent call last):
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5766, in _heal_instance_info_cache
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] self.network_api.get_instance_nw_info(context, instance)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/nova/network/api.py", line 369, in get_instance_nw_info
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] **kwargs)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/nova/network/base_api.py", line 249, in get_instance_nw_info
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] result = self._get_instance_nw_info(context, instance, **kwargs)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/nova/network/api.py", line 378, in _get_instance_nw_info
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] nw_info = self.network_rpcapi.get_instance_nw_info(context, **args)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/nova/network/rpcapi.py", line 211, in get_instance_nw_info
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] host=host, project_id=project_id)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] retry=self.retry)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 97, in _send
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] timeout=timeout, retry=retry)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 464, in send
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] retry=retry)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 453, in _send
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] result = self._waiter.wait(msg_id, timeout)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in wait
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] message = self.waiters.get(msg_id, timeout=timeout)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 238, in get
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] 'to message ID %s' % msg_id)
2017-04-04 19:28:47.546 31726 ERROR nova.compute.manager [instance: 6ecaf72c-88bc-4f26-8907-dc19d7924327] MessagingTimeout: Timed out waiting for a reply to message ID bb7d1a5d89c8469aa1243f9102656d3
This only happens for messages in exchange 'nova' topic 'network':
Apr 4 19:26:47 ip-192-168-99-11 nova-compute[31726]: 2017-04-04 19:26:47.544 31726 DEBUG oslo_messaging._drivers.amqpdriver [req-af37e2ee-0ef9-4d4e-b3ce-d7a1bf27a780 - - - - -] CALL msg_id: bb7d1a5d89c8469aa1243f9102656d3f exchange 'nova' topic 'network' _send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448
(e.g., messages on topic 'conductor' go through fine).
I notice that in RabbitMQ there is a conductor queue (with a conductor routing key) but no network queue (which corresponds to what is described at https://ilearnstack.com/2013/04/24/messaging-in-openstack-using-rabbitmq/).
The connectivity between the compute node and controller (where Rabbit runs) is fine.
I tried turning on RabbitMQ tracing (http://www.rabbitmq.com/firehose.html), and I see all messages except the ones in question.
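A sketch of one way to inspect the queues via the RabbitMQ management API (the controller hostname and guest credentials here are placeholders, and it assumes the management plugin is enabled):

import requests

# List queues and their consumer/message counts; a missing or consumer-less
# 'network' queue would explain why CALLs on that topic time out.
resp = requests.get("http://controller:15672/api/queues", auth=("guest", "guest"))
for q in resp.json():
    if "network" in q["name"] or "conductor" in q["name"]:
        print(q["name"], q.get("consumers", 0), q.get("messages", 0))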
Any pointers?