I tried to create a new cluster using OpsCenter 5.2.4, but I got this error:
Error: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
In the OpsCenter log, /var/log/opscenter/opscenterd.log, I see these errors:
2016-10-09 10:02:06+0000 [] INFO: Determining ssh fingerprints of new instances.
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] Error determining fingerprints
Traceback (most recent call last):
Failure: exceptions.TypeError: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR: Fingerprint Detection failed: sequence item 0: expected string, NoneType found sequence item 0: expected string, NoneType found
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1018, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python2.7/dist-packages/opscenterd/cloud/Ec2Launcher.py", line 582, in _determine_fingerprints
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1018, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python2.7/dist-packages/opscenterd/SecureShell.py", line 148, in get_remote_ssh_key_map
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python2.7/dist-packages/opscenterd/SecureShell.py", line 370, in _get_remote_ssh_keys_in_bulk
2016-10-09 10:02:06+0000 [] WARN: Marking request 1ca29951-2ce1-49a2-9a92-5555c93064ce as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] WARN: Marking request 3e815163-648d-4681-9893-7abf5dd67aff as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR:
2016-10-09 10:02:06+0000 [] WARN: Marking request f59e8ea2-1765-4145-a981-c087d8825b50 as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] Unexpected error provisioning cluster.
Traceback (most recent call last):
Failure: exceptions.TypeError: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] WARN: Marking request f59e8ea2-1765-4145-a981-c087d8825b50 as failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR: Launching instances failed (with an unexpected error): Fingerprint Detection failed: sequence item 0: expected string, NoneType found
Any idea?
OpsCenter provisioning dev here. This is the part of the log telling you which operation failed:
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known
That is probably network or name-resolution related: "getaddrinfo None: Name or service not known" means ssh-keyscan was handed a hostname it could not resolve. Can you run ssh-keyscan against the host in question, or SSH to it at all? I'm guessing not, and that you'll need to adjust your network firewalls, iptables rules, and/or routing until you can SSH from the OpsCenter server to the target hosts.
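For example, a minimal check from the OpsCenter host, sketched in Python (the hostname is a placeholder for one of the instances being provisioned, and the -T 5 connect timeout is just an example):
import subprocess

# Placeholder target: substitute one of the instance hostnames/IPs from the failed request.
target = "target-host"

# Roughly what the log above shows OpsCenter invoking. A key line on stdout means the
# name resolves and port 22 answers; empty stdout with a getaddrinfo/refused message on
# stderr points at DNS, firewall, or routing problems.
proc = subprocess.Popen(["ssh-keyscan", "-T", "5", target],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print("stdout:", out)
print("stderr:", err)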
Also note that the provisioning features in OpsCenter 6.0.x are light-years ahead of what's in 5.2.x. I strongly suggest you upgrade.
Related
I am trying to run TensorFlow on my GPU and have followed the instructions at this link. After running the commands in Step 6, I get the expected output.
However, when I try to run the model I am actually building, I get the following error:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_10' defined at (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_10'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_10}}]] [Op:__inference_train_function_8591]
After doing some research, it appears that the relevant errors are the following:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
For context, this is running on Ubuntu 20.04 with Python 3.9. Any ideas on how to fix this?
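Based on the warning above, I assume the intended workaround is something along these lines (the path is a placeholder; I have not yet found where libdevice lives in my conda environment):
import os

# Placeholder path: it must be a directory containing nvvm/libdevice/libdevice.10.bc,
# which apparently is not the case for /usr/local/cuda-11.2 on this machine.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/path/to/cuda"

import tensorflow as tf  # set the variable before TensorFlow is imported so XLA picks it up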
I downloaded MODIS data from its HTTP server and am trying to load it with xarray on Google Colab. I added the .netrc file for authentication, and the file itself is not corrupted, since gdalinfo on it reports no errors.
URL = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.061/2019.02.24/MOD09GA.A2019055.h09v07.061.2020288120208.hdf'
result = requests.get(URL)
filename = 'test.hdf'
with open(filename, 'wb') as f:
    f.write(result.content)
When I run
xr.open_dataset('test.hdf',engine = 'netcdf4' )
this is the error I get:
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/content/test1.hdf',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
203 kwargs = kwargs.copy()
204 kwargs["mode"] = self._mode
--> 205 file = self._opener(*self._args, **kwargs)
206 if self._mode == "w":
207 # ensure file doesn't get overriden when opened again
src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built.: b'/content/test.hdf'
What does this error mean, and how do I resolve it?
netcdf4 version 1.5.8
xarray version 0.18.1
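Since gdalinfo can read the file, a GDAL-backed reader might work as a cross-check; roughly something like this (rioxarray is an extra install, not part of the snippet above):
import rioxarray

# Cross-check through GDAL; for a MODIS HDF with subdatasets this may return a list,
# one entry per subdataset.
data = rioxarray.open_rasterio('test.hdf')
print(data)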
I'm trying to train a cGAN on a g4dn.xlarge GPU EC2 instance, and it crashes every time after exactly 8 epochs with the following message:
Traceback (most recent call last):
File "pix2pix_tf2.py", line 841, in <module>
main()
File "pix2pix_tf2.py", line 802, in main
results = sess.run(fetches, options=options, run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
[[Func/encode_images/target_pngs/while/body/_47/input/_154/_773]]
0 successful operations.
0 derived errors ignored.
env spec:
tensorflow 2.2.0
CUDA V10.0.130
cudnn 7.6.5
Updating CUDA to 10.1 solved the issue. (TensorFlow 2.2 is built against CUDA 10.1, so the CUDA 10.0 toolkit listed above was a mismatch.)
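A quick way to sanity-check the setup after switching toolkits (just a sketch; it only confirms the build has CUDA support and that the runtime can see the GPU):
import tensorflow as tf  # 2.2.0 in this case

print(tf.test.is_built_with_cuda())            # should be True for the GPU build
print(tf.config.list_physical_devices('GPU'))  # should list the T4 on a g4dn.xlarge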
I am trying to create a version under Google Cloud ML Engine models for a successfully trained TensorFlow Estimator model. I believe I am providing the correct URI (in Google Storage), which includes saved_model.pb.
Framework: TensorFlow
Framework version: 1.13.1
Runtime version: 1.13
Python version: 3.5
Here is the traceback of the error:
Traceback (most recent call last):
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 985, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 795, in Run
resources = command_instance.Run(args)
File "/google/google-cloud-sdk/lib/surface/ml_engine/versions/create.py", line 119, in Run
python_version=args.python_version)
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 114, in Create
message='Creating version (this might take a few minutes)...')
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 75, in WaitForOpMaybe
return operations_client.WaitForOperation(op, message=message).response
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/ml_engine/operations.py", line 114, in WaitForOperation
sleep_ms=5000)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 264, in WaitFor
sleep_ms, _StatusUpdate)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 326, in PollUntilDone
sleep_ms=sleep_ms)
File "/google/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
if not should_retry(result, state):
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
return not poller.IsDone(operation)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
raise OperationError(operation.error.message)
OperationError: Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
Do you have any idea what might be the problem?
EDIT
I am using:
tf.estimator.LatestExporter('exporter', model.serving_input_fn)
as the estimator exporter.
serving_input_fn:
def serving_input_fn():
    inputs = {'string1': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH]),
              'string2': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)
PS: my model takes two inputs and returns one binary output.
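For reference, a quick way to double-check that the URI points at the timestamped directory LatestExporter wrote (the bucket path below is made up):
import tensorflow as tf  # 1.13

# Hypothetical path: the directory given to ml-engine versions create should itself
# contain saved_model.pb and variables/, i.e. the export/exporter/<timestamp>/ folder.
export_dir = 'gs://my-bucket/model/export/exporter/1557510000'
print(tf.gfile.ListDirectory(export_dir))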
My Redis server is running under Ubuntu 16.04, and I have RQ Dashboard running to monitor the queue. The Redis server has a password, which I supply for the initial connection. Here's my code:
from rq import Queue, Connection, Worker
from redis import Redis
from dblogger import DbLogger

def _redisCon():
    redis_host = "192.168.1.169"
    redis_port = "6379"
    redis_password = "SecretPassword"
    return Redis(host=redis_host, port=redis_port, password=redis_password)

rcon = _redisCon()
if rcon is not None:
    with Connection(rcon):
        DbLogger.log("rqworker", 0, "Launching Worker", "launching an RQ Worker - default Queue")
        worker = Worker(list(map(Queue, 'default')))  # this works - I see the worker registered in RQ dashboard
        worker.work()  # this eventually fails with the Connection error:
"""
16:28:49 RQ worker 'rq:worker:steve-imac.95379' started, version 0.12.0
16:28:49 *** Listening on default...
16:28:49 Cleaning registries for queue: default
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 177, in _read_from_socket
raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 668, in execute_command
return self.parse_response(connection, command_name, **options)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 191, in _read_from_socket
(e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 489, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 61 connecting to 192.168.1.169:6379. Connection refused.
"""
I've tried removing the password and enabling the Unix socket in redis.conf; neither seemed to help. This looks like some sort of timeout, since in other testing the worker actually loads a task and executes it before eventually dying with this error.
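In case it is an idle-connection drop, one variation on the connection above that might be worth trying (parameter values are examples only, and health_check_interval needs redis-py 3.3 or newer):
from redis import Redis

def _redisCon():
    # Same connection as above, with TCP keepalive enabled and periodic PINGs on idle
    # connections; values here are just examples.
    return Redis(host="192.168.1.169",
                 port=6379,
                 password="SecretPassword",
                 socket_keepalive=True,
                 health_check_interval=30)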