Error when determining SSH fingerprints of new instances (DataStax OpsCenter)

I tried to create a new cluster using OpsCenter 5.2.4, but I got this error:
Error: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
In the OpsCenter log (/var/log/opscenter/opscenterd.log) I see the following:
2016-10-09 10:02:06+0000 [] INFO: Determining ssh fingerprints of new instances.
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known^M
2016-10-09 10:02:06+0000 [] Error determining fingerprints
Traceback (most recent call last):
Failure: exceptions.TypeError: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR: Fingerprint Detection failed: sequence item 0: expected string, NoneType found sequence item 0: expected string, NoneType found
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1018, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python2.7/dist-packages/opscenterd/cloud/Ec2Launcher.py", line 582, in _determine_fingerprints
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1018, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/usr/lib/python2.7/dist-packages/opscenterd/SecureShell.py", line 148, in get_remote_ssh_key_map
File "/usr/share/opscenter/lib/py-debian/2.7/amd64/twisted/internet/defer.py", line 1020, in _inlineCallbacks
result = g.send(result)
File "/usr/lib/python2.7/dist-packages/opscenterd/SecureShell.py", line 370, in _get_remote_ssh_keys_in_bulk
2016-10-09 10:02:06+0000 [] WARN: Marking request 1ca29951-2ce1-49a2-9a92-5555c93064ce as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] WARN: Marking request 3e815163-648d-4681-9893-7abf5dd67aff as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR:
2016-10-09 10:02:06+0000 [] WARN: Marking request f59e8ea2-1765-4145-a981-c087d8825b50 as failed: Fingerprint Detection failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] Unexpected error provisioning cluster.
Traceback (most recent call last):
Failure: exceptions.TypeError: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] WARN: Marking request f59e8ea2-1765-4145-a981-c087d8825b50 as failed: sequence item 0: expected string, NoneType found
2016-10-09 10:02:06+0000 [] ERROR: Launching instances failed (with an unexpected error): Fingerprint Detection failed: sequence item 0: expected string, NoneType found
Any idea?

OpsCenter provisioning dev here. This is telling you the operation that failed:
2016-10-09 10:02:06+0000 [] ERROR: /usr/bin/ssh-keyscan had some issues:
stdout=
stderr= getaddrinfo None: Name or service not known
This is probably network-related. Can you run ssh-keyscan against the host in question, or are you able to SSH to it at all? I'm guessing not, and that you'll need to adjust network firewalls, iptables, and/or your routing until you can SSH from the OpsCenter server to the target hosts.
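For a quick sanity check from the OpsCenter server, a minimal sketch along these lines (Python 3.7+; the target address is a placeholder) will tell you whether port 22 is reachable and whether ssh-keyscan returns any keys:

import socket
import subprocess

host = "10.0.0.12"  # placeholder: substitute the address of the target instance

# 1. Can we open a TCP connection to the SSH port at all?
try:
    with socket.create_connection((host, 22), timeout=5):
        print("TCP connection to %s:22 OK" % host)
except OSError as exc:
    print("Cannot reach %s:22 -> %s" % (host, exc))

# 2. Does ssh-keyscan return host keys? (This is the same binary OpsCenter runs.)
result = subprocess.run(["/usr/bin/ssh-keyscan", "-T", "5", host],
                        capture_output=True, text=True)
print("ssh-keyscan stdout:", result.stdout)
print("ssh-keyscan stderr:", result.stderr)

If the TCP connection fails or the keyscan output is empty, fix the network path first.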
Also note that the provisioning features in OpsCenter 6.0.x are light-years ahead of what's in 5.2.x. I strongly suggest you upgrade.

Related

libdevice not found (TensorFlow)

I am trying to run TensorFlow on my GPU and have followed the instructions at this link. After running the commands in Step 6, I get the expected output.
Then, when I try to run the model I am building, I get the following error.
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_10' defined at (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_10'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_10}}]] [Op:__inference_train_function_8591]
After doing some research, it appears that the relevant errors are the following:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
For context, this is running on Ubuntu 20.04 with Python 3.9. Any ideas on how to fix this?
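Based on the hint in the log itself (XLA searched ./cuda_sdk_lib, /usr/local/cuda-11.2, /usr/local/cuda and . for nvvm/libdevice and suggests XLA_FLAGS), I assume the fix it is asking for looks roughly like the sketch below, assuming the CUDA toolkit containing nvvm/libdevice/libdevice.10.bc really is installed under /usr/local/cuda-11.2:

import os

# Assumption: the CUDA toolkit with nvvm/libdevice/libdevice.10.bc lives here;
# adjust the path to the actual installation.
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=/usr/local/cuda-11.2"

import tensorflow as tf  # set the flag before TensorFlow/XLA initializes

The same flag can instead be exported as an environment variable in the shell before launching the script.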

OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built.: b'/content/test.hdf'

I downloaded MODIS data from its HTTP server and was trying to load it with xarray on Google Colab. I added the .netrc file, and the file is not corrupted, since gdalinfo on it reports no errors.
import requests
import xarray as xr

URL = 'https://e4ftl01.cr.usgs.gov/MOLT/MOD09GA.061/2019.02.24/MOD09GA.A2019055.h09v07.061.2020288120208.hdf'
result = requests.get(URL)
filename = 'test.hdf'
with open(filename, 'wb') as f:
    f.write(result.content)
When I run
xr.open_dataset('test.hdf', engine='netcdf4')
I get this error:
KeyError: [<class 'netCDF4._netCDF4.Dataset'>, ('/content/test1.hdf',), 'r', (('clobber', True), ('diskless', False), ('format', 'NETCDF4'), ('persist', False))]
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
203 kwargs = kwargs.copy()
204 kwargs["mode"] = self._mode
--> 205 file = self._opener(*self._args, **kwargs)
206 if self._mode == "w":
207 # ensure file doesn't get overriden when opened again
src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__()
src/netCDF4/_netCDF4.pyx in netCDF4._netCDF4._ensure_nc_success()
OSError: [Errno -128] NetCDF: Attempt to use feature that was not turned on when netCDF was built.: b'/content/test.hdf'
What is this error about, and how can I resolve it?
netCDF4 version: 1.5.8
xarray version: 0.18.1

TensorFlow 2 on g4dn.xlarge GPU crashes after 8 epochs

I'm trying to train a cGAN on a g4dn.xlarge GPU EC2 machine, and it crashes every time after exactly 8 epochs with the following message:
Traceback (most recent call last):
File "pix2pix_tf2.py", line 841, in <module>
main()
File "pix2pix_tf2.py", line 802, in main
results = sess.run(fetches, options=options, run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
[[Func/encode_images/target_pngs/while/body/_47/input/_154/_773]]
0 successful operations.
0 derived errors ignored.
Environment spec:
tensorflow 2.2.0
CUDA V10.0.130
cuDNN 7.6.5
Update: upgrading CUDA to 10.1 solved the issue.
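To confirm which CUDA toolkit is installed and what the TensorFlow build sees before and after such an upgrade, a quick check like the following sketch can help (it assumes nvcc is on the PATH):

import subprocess
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

# Installed CUDA toolkit version (assumes nvcc is on the PATH).
print(subprocess.check_output(["nvcc", "--version"]).decode())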

GCloud MLEngine:Create Version failed. Bad model detected with error: Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)

I am trying to create a version under Google Cloud ML models for a successfully trained TensorFlow estimator model. I believe I am providing the correct URI (in Google Storage), which includes saved_model.pb.
Framework: TensorFlow
Framework version: 1.13.1
Runtime version: 1.13
Python: 3.5
Here is the traceback of the error:
Traceback (most recent call last):
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 985, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 795, in Run
resources = command_instance.Run(args)
File "/google/google-cloud-sdk/lib/surface/ml_engine/versions/create.py", line 119, in Run
python_version=args.python_version)
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 114, in Create
message='Creating version (this might take a few minutes)...')
File "/google/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine/versions_util.py", line 75, in WaitForOpMaybe
return operations_client.WaitForOperation(op, message=message).response
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/ml_engine/operations.py", line 114, in WaitForOperation
sleep_ms=5000)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 264, in WaitFor
sleep_ms, _StatusUpdate)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 326, in PollUntilDone
sleep_ms=sleep_ms)
File "/google/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 229, in RetryOnResult
if not should_retry(result, state):
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 320, in _IsNotDone
return not poller.IsDone(operation)
File "/google/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/waiter.py", line 122, in IsDone
raise OperationError(operation.error.message)
OperationError: Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: a bytes-like object is required, not 'str' (Error code: 0)"
Do you have any idea what might be the problem?
EDIT
I am using:
tf.estimator.LatestExporter('exporter', model.serving_input_fn)
as an estimator exporter.
Here is my serving_input_fn:
def serving_input_fn():
    inputs = {'string1': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH]),
              'string2': tf.placeholder(tf.int16, [None, MAX_SEQUENCE_LENGTH])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)
PS: my model takes two inputs and returns one binary output.
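For reference, this is roughly how the exporter is wired into training with the TF 1.13 Estimator API; in this sketch, estimator, train_input_fn and eval_input_fn are placeholders standing in for my actual code:

exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)

# Placeholders: substitute the real estimator and input functions.
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=10000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, exporters=[exporter])

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)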

RQ Redis: Connection refused after a successful connection

My Redis server is running under Ubuntu 16.04, and I have RQ Dashboard running to monitor the queue. The Redis server has a password, which I supply for the initial connection. Here's my code:
from rq import Queue, Connection, Worker
from redis import Redis
from dblogger import DbLogger

def _redisCon():
    redis_host = "192.168.1.169"
    redis_port = "6379"
    redis_password = "SecretPassword"
    return Redis(host=redis_host, port=redis_port, password=redis_password)

rcon = _redisCon()
if rcon is not None:
    with Connection(rcon):
        DbLogger.log("rqworker", 0, "Launching Worker", "launching an RQ Worker - default Queue")
        worker = Worker(list(map(Queue, 'default')))  # this works - I see the worker registered in RQ Dashboard
        worker.work()  # this eventually fails with the connection error below:
"""
16:28:49 RQ worker 'rq:worker:steve-imac.95379' started, version 0.12.0
16:28:49 *** Listening on default...
16:28:49 Cleaning registries for queue: default
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 177, in _read_from_socket
raise socket.error(SERVER_CLOSED_CONNECTION_ERROR)
OSError: Connection closed by server.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 668, in execute_command
return self.parse_response(connection, command_name, **options)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/client.py", line 680, in parse_response
response = connection.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 624, in read_response
response = self._parser.read_response()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 284, in read_response
response = self._buffer.readline()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 216, in readline
self._read_from_socket()
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 191, in _read_from_socket
(e.args,))
redis.exceptions.ConnectionError: Error while reading from socket: ('Connection closed by server.',)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/redis/connection.py", line 489, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 61 connecting to 192.168.1.169:6379. Connection refused.
"""
I've tried removing the password and enabling the Unix socket in redis.conf; neither seemed to help. This seems to be some sort of timeout, since in other testing the worker actually loads a task and executes it before eventually dying with this error.
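One thing I'm considering, since redis-py's Redis constructor accepts socket options such as socket_keepalive, socket_timeout and socket_connect_timeout, is passing them through my helper. A sketch (the values here are illustrative only, not recommendations):

def _redisCon():
    redis_host = "192.168.1.169"
    redis_port = 6379
    redis_password = "SecretPassword"
    # socket_keepalive asks the OS to keep the idle TCP connection alive;
    # socket_connect_timeout bounds how long the initial connect may take.
    return Redis(host=redis_host, port=redis_port, password=redis_password,
                 socket_keepalive=True,
                 socket_connect_timeout=10)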