Failed to run tensorflow distributed MNIST test

I installed TensorFlow 0.8 by building it from source.
I am using an AWS EC2 g2.8xlarge instance, which has 4 GPUs.
I tried to run the TensorFlow distributed MNIST test; the code is here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/scripts/dist_mnist_test.sh
My script:
bash dist_mnist_test.sh "grpc://localhost:2223 grpc://localhost:2224"
and I got this message:
E0609 14:53:07.430440599 62872 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:2223': socket error: connection refused
E0609 14:53:07.445297934 62873 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:2224': socket error: connection refused
Anyone know what is wrong here? Thanks a lot!

This script does not run standalone. In particular, it expects that you have created a TensorFlow cluster with workers running at each of the addresses before running the script. The create_tf_cluster.sh script can set up such a cluster using Kubernetes. The dist_test.sh script runs these scripts end-to-end.
See my answer to your other question, which has a suggested script for running MNIST on distributed TensorFlow.
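If you just want to smoke-test locally without Kubernetes, a minimal sketch (assuming a TensorFlow 0.8+ build where tf.train.Server and tf.train.ClusterSpec are available; the cluster spec below simply mirrors the addresses from the question) is to start one gRPC worker per address yourself before invoking the script:
# Start two in-process gRPC workers on the ports the script expects.
$ python -c "import tensorflow as tf; tf.train.Server(tf.train.ClusterSpec({'worker': ['localhost:2223', 'localhost:2224']}), job_name='worker', task_index=0).join()" &
$ python -c "import tensorflow as tf; tf.train.Server(tf.train.ClusterSpec({'worker': ['localhost:2223', 'localhost:2224']}), job_name='worker', task_index=1).join()" &
# Now the connections should no longer be refused.
$ bash dist_mnist_test.sh "grpc://localhost:2223 grpc://localhost:2224"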

I suspect there's a networking problem here. The first debugging step I would take is to ensure that ports 2223 and 2224 actually have processes listening on them, using a tool like netstat. Here's a good general description of how to do that:
https://askubuntu.com/questions/278448/how-to-know-what-program-is-listening-on-a-given-port
If that works, then try using telnet to connect to the port manually, to make sure the network addressing is working correctly.
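For example (the port numbers come from the question; these are the common Linux invocations):
# List listening TCP sockets and look for the expected ports.
$ sudo netstat -tlnp | grep -E '2223|2224'
# Then try a manual connection; "Connection refused" means nothing is listening.
$ telnet localhost 2223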

Related

How do you manage 100 virtual servers for inference?

I need to boot 100 virtual servers and use each one for TensorFlow model inference for 30 days. What are some tools to do this?
I currently boot the servers using an image and manually open two tmux sessions: one for the model client and the other for the TensorFlow server. I receive a Slack notification if any server's CPU stops working, as a way to know when a server fails (I also manually SSH in to debug/restart the server).
Would appreciate any tips!
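For reference, a scripted version of the manual tmux step described above might look like this (the session names and start scripts are placeholders, not from the question):
# Boot-time sketch: start both processes in detached tmux sessions
# instead of opening them by hand after each boot.
$ tmux new-session -d -s tf_server './start_tf_server.sh'
$ tmux new-session -d -s tf_client './start_model_client.sh'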

Google Compute Engine SSH from browser stopped working Error 13

A compute instance I had running stopped working, and I am no longer able to SSH to it from the browser. When I try, it hangs for a long while and eventually I get the error message:
You cannot connect to the VM instance because of an unexpected error.
Wait a few moments and then try again. (#13)
I looked here for common issues. I made a snapshot and tried recreating the instance with a larger disk, in a different region, and with a bigger machine type, but I was still unable to connect. When other users try to connect, they have the same problem. I'm using a standard container, so I expect the Google daemon should be running.
This instance was collecting tweets and writing output to GCS regularly. Since SSH stopped working, the instance has also stopped writing output.
Does anyone have any idea what could have gone wrong?
I would also suggest checking the Serial Console of the machine to see if there are any messages that provide clues. For example, if the boot disk has run out of space (which can prevent SSH connectivity), there will be messages in the Serial Console indicating this.
You could also try connecting to the machine via the Serial Console to troubleshoot the issue by following the advice here.
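For example, from Cloud Shell (both are standard gcloud subcommands; substitute your instance name and zone):
# Dump the boot/kernel log to look for disk-full or sshd errors.
$ gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE
# Open an interactive serial console session (serial-port access must be enabled on the instance).
$ gcloud compute connect-to-serial-port INSTANCE_NAME --zone ZONE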
When you try to SSH into the instance, for example from Cloud Shell using the following command, the output should provide some clues as to why you cannot SSH into the machine:
$ gcloud compute ssh INSTANCE_NAME --zone ZONE
If you are on a VPC network, check for the applicable network tag that allows instances to receive SSH, and attach that tag to your instance; it could be that firewall rules are blocking your instance from accepting the SSH connection.
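As a sketch, you can inspect the rules and attach a matching tag like this (the tag name allow-ssh is an example, not a fixed value):
# See which firewall rules currently allow tcp:22 and which target tags they use.
$ gcloud compute firewall-rules list
# Attach the tag required by your SSH-allowing rule to the instance.
$ gcloud compute instances add-tags INSTANCE_NAME --tags allow-ssh --zone ZONE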

Meld error when setting up a new cluster

I am evaluating DataStax OpsCenter on a virtual machine to start managing/monitoring Cassandra. I am following the online docs to create cluster topology models via OpsCenter LCM, but the error message doesn't provide much information for me to continue. The job status is:
error- MeldError, 400 Client Error: Bad Request for url: http://[ip_address]:8888/api/v1/lcm/internal/nodes/6185c776-9034-45b4-a54f-6eb9511274a2/package_information
Meld failed on name="testnode1" ssh-management-address="[ip_address]" node-id="6185c776-9034-45b4-a54f-6eb9511274a2" node-name="testnode1" job-id="1b792c69-bcca-489f-ad12-a6285ba84d59" stdout=" Meld has started... " stderr=""
My question is: what might be wrong, and are there any hints on how to resolve it?
I am new to the Cassandra and DataStax communities, so please forgive me if I ask a silly question!
Q: I used to be a Buildbot user, and the DataStax Agent looks like a Buildbot slave. Why don't we need to set up the agent on the remote machine to work with OpsCenter? Is the agent's working directory configured in OpsCenter?
The opscenterd.log: https://pastebin.com/TJsvmr6t
According to the tool compatibility matrix at https://docs.datastax.com/en/landing_page/doc/landing_page/compatibility.html#compatibilityDocument__opsc-compatibility , I actually use OpsCenter v5.2 for monitoring and basic DB operations. After some trial and error with the Agent's .yaml and Cassandra 2.2's .conf files, the Dashboard works!
Knowledge gained:
OpsCenter 5.2 actually works with Cassandra 2.2, even though this is not listed in the compatibility table.
For beginners: if you are not sure where to start, try installing all the components on one machine to get an idea of the least viable working setup, and from there configure the actual dev/test/production environment.

How to fix the ZooKeeper error for HBase

The main OS is Windows 7 64-bit. I used VM Player to create two CentOS 5.6 VMs; the network connection is bridged. I installed HBase on both CentOS systems, one as master and the other as slave. I entered the HBase shell and ran status 'details'.
The error from master is
zookeeper.ZKConfig: no valid quorum servers found in zoo.cfg
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: An error is preventing HBase from connecting to ZooKeeper
And the error from slave is
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
Please give me some suggestions.
Thanks a lot!
Check if these lines are in your .bashrc; if not, add them and restart all HBase services (do not forget to run the exports in your current shell as well). That did it for me with a pseudo-distributed installation. My problem (and maybe yours as well) was that HBase wasn't detecting its configuration:
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HBASE_CONF_DIR=/etc/hbase/conf
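Since the master's error mentions quorum servers, it may also be worth verifying the quorum configuration directly (a sketch; the config path follows the export above, 2181 is ZooKeeper's default client port, and master-hostname is a placeholder):
# Confirm the quorum hosts HBase will actually read.
$ grep -A1 hbase.zookeeper.quorum $HBASE_CONF_DIR/hbase-site.xml
# Ask the ZooKeeper server if it is healthy; a live server replies "imok" (requires netcat).
$ echo ruok | nc master-hostname 2181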
I see this very often on my machine. I don't have a failsafe cure, but I end up running stop-all.sh and deleting every place that Hadoop and DFS (it's a DFS failure) store their temp files. It seems to happen after my computer goes to sleep while DFS is running.
I am going to experiment with single-user mode to avoid this. I don't need distribution while developing.
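A rough sketch of that reset (the path is Hadoop's default hadoop.tmp.dir and may differ on your install; note this wipes all HDFS data, so only do it on a throwaway dev setup):
# Stop everything, clear the temp/DFS directories, and reinitialize.
$ stop-all.sh
$ rm -rf /tmp/hadoop-$USER
$ hadoop namenode -format   # needed after deleting the DFS storage
$ start-all.sh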

Cannot create DB2 index, getting SQL30081N error

I am trying to create an index (and run some long queries) on DB2 v9.1 and failing with the following error message:
SQL30081N (A communication error has been detected. Communication protocol being used: "TCP/IP". Communication API being used: "SOCKETS". Location where the error was detected: "". Communication function detecting the error...")
I have tried to follow the advice given by IBM here regarding setting QUERYTIMEOUTINTERVAL=0 (http://www-01.ibm.com/support/docview.wss?rs=71&uid=swg21164785), but it did not take.
Any ideas? Queries and commands seem to time out at about 15 minutes.
You can rule out any network interference by running the DDL and SQL locally on the server. By using nohup on UNIX or schtasks on Windows, you can start a DB2 job that will run to completion even if the database server loses all network connectivity.
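For example, on UNIX (a sketch; the script name is a placeholder for your DDL file, which should include its own CONNECT statement):
# Run the DDL in the background on the server itself, surviving logout.
# -t: statements end with ';', -v: echo each statement, -f: read from file.
$ nohup db2 -tvf create_index.sql > create_index.log 2>&1 &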
It seems like a network error; probably your client machine is losing its connection to the server. Are you on an unstable network connection, for example a VPN over the internet?