Distributed Tensorflow: The worker does not responded

Distributed Tensorflow: The worker does not responded - tensorflow

I am running Tensorflow Distributed mode on AWS instances. PS is on one machine and each worker is on a different machine. I am running in the following problem:
tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
I found somebody already posts exactly the same problem that I face right now but his answer is not clear for me, Tensorflow distributed: CreateSession still waiting for response from worker: /job:ps/replica:0/task:0.
Can anybody suggest what might be the solution?

Now, after I solved my error, I will share my solution. It was not a bug in the TF code but one of two things that I tried solved my problem stated above. Because I am working on EC2 instances, the firewall prevents the connection between nodes. So, I made the rule to accept all the traffics to the instances. Second, I was using only IP-address:port-No in the command line. Instead, I wrote it like this, ec2-IP.compute-1.amazonaws.com:2222.

Related

Unable to add apache Nifi in ambari?

I am trying to add Apache Nifi in ambari but continuously failing with error Error occured during stack advisor command invocation:
Unable to delete directory /var/run/ambari-server/stack-recommendations/1.
There is a similar thread with the same error in hortonworks community, I have tried everything mentioned in that thread but unable to fix it. My sandbox is installed in vmware workstation 12 player. I also tried to create and remove directory manually but it is failing with the error invalid argument. Created a thread for this error also on stackexchange. Please help!!!

Here is a link to Hortonworks forum thread. And it seems like sandbox is just broken:
This is due to a docker issue in this 2.5 sandbox build. It will be
fixed in next revision of the sandbox.
There are also some workarounds described (like use older HDP 2.4 or establishing own cluser based on the HDP 2.5 docker image)
Updated sandbox arrived: http://hortonworks.com/downloads

Trust me, active member of community see your posts in multiple locations. In a good, no Big Brother ways :) but cross-posting is an old as world ... Well, you got it.
Did you see a notice for this service in Ambari? Telling it's been deprecated? Same note in the github. There's a good reason for that, it's now been implemented properly by the dev team and with many more features. I.e. all the action is there now.
I think I replied a similar question, though not sure it was yours, take a look in HCC.

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in local virtual machine another in remote cloud. Both clusters in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to finish this:
Write codes(just simple word-count) in IntelliJ IDEA locally(on my laptp), and set the Spark Master URL to spark://local1:7077 and spark://remote1:7077, then run my codes in IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I got some problem:
When I use the local cluster, everything goes well. Run codes in IntelliJ IDEA or use spark-submit can submit job to cluster and can finish the job.
But When I use the remote cluster, I got a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It is sufficient resources not sufficient memory!
And this log keep printing, no further actions. Both spark-submit and run codes in IntelliJ IDEA result the same.
I want to know:
Is it possible to submit codes from IntelliJ IDEA to remote cluster?
If it's OK, does it need configuration?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my scene is different. When I run my codes in IntelliJ IDEA, and set Spark Master to local virtual machine cluster, it works. But I got Initial job has not accepted any resources;... warning instead.
I want to know whether the security policy or fireworks can cause this?

Submitting code programatically (e.g. via SparkSubmit) is quite tricky. At the least there is a variety of environment settings and considerations -handled by the spark-submit script - that are quite difficult to replicate within a scala program. I am still uncertain of how to achieve it: and there have been a number of long running threads within the spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically there were a mismatch on the requested memory and/or number of cores from your job versus what were available on the cluster. Possibly when submitting from IJ the
$SPARK_HOME/conf/spark-defaults.conf
were not properly matching the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the spark ui on port 8080 to verify that the parameters you requested are actually available on the cluster.

Redis server not starting - Forked Process did not respond in a timely manner

Redis server which was working fine got stopped suddenly and the error is:
BeginForkOperation: system error caught. error code=0x00000000, message=Forked
Process did not respond in a timely manner.
Not able to figure out why it is happening, and also when I am restarting my machine then
if I start the redis-server it's working fine.
Please help me in this regard.

You should try updating your Redis version, the guys from MSOpenTech fixed a lot of bugs in the last months and this one looks related, at least the error message is identical: https://github.com/MSOpenTech/redis/issues/144

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we run into a disturbing solr issue.
After a restart of the whole cluster one of the shard stop being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looks fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in Cluster mode (5 shards) on jetty.
Each shard run on a different host with one zookeeper server.
I checked the zookeeper log and I cannot see anything there.
The only difference is that in the /overseer_election/election folder I see this specific server repeated 3 times, while the other server are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other check can we do?

We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds
You will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly, it thought the service was dead, but it wasn't, and at some point 2 services were started. Because the 2nd service will not be able to start (as they both can't listen on the same port) it seems to just sit there in a failed state hanging, or fails to start the process but just messes up the other solr processes somehow by possibly overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that zookeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, killall instances of solr (or java if you know it's safe), and restart the solr service.

We figured out!
The issue was that jetty didn't really stop so we had 2 running processes, for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.

jmeter hangs up and won't return

I am running 340 concurrent users to load test on server using jmeter.
But on most of the cases jmeter hangs up and won' t return, even if I try to close the connection it just hangs up. and eventually I have to close the application.
Any idea how to check what is holding the requests and how to check the requests sent by jmeter and find the bottleneck.
Got the following message on closing the thread
Shutting down thread please be patient message

I've hit this several times over the past few years. In each of my cases (may not be in your's) the issue was with the Load Balance (F5) I was sending my traffic through. Basically a property called OneConnect was holding the connections in a time-wait state and never killing the connection.
Run a pack tool like wireshark and see what's happening with the requests.

Try distributed testing, 340 concurrent users is not a big deal, but still you can try if that decreases your pain. Also take a look at the following link:
http://jmeter.apache.org/usermanual/best-practices.html#lean_mean

First check you script is ok with one user.
Ensure you use assertions.
Then run you test following jmeter best practices:
no gui
no costly listeners
You should then be able to see in csv output the longest request and be able to fix your issue.

I also encountered this problem before when I run my JMeter on my laptop(Core 2 Duo 1.5Ghz) it always hang-up in the middle of the processing. I tried to run on another pc which is more powerful than my laptop and its works now smoothly. Therefore, JMeter will run effectively if your pc or laptop has a better specs.
Note: It is also advisable to run your JMeter in non-gui mode.
Example to run JMeter in Linux box:
$ ./jmeter -t test.jmx -n -l /Users/home/test.jtl

I had the
one or more test threads won't exit
because of a firewall blocking some requests. So I had to leap in the firewalls timeout for all blocked request... then it returned.

You are getting this error probably because JVM is not capable of running so many threads. If you take a look at your terminal, you will see the exception you get:
Uncaught Exception java.lang.OutOfMemoryError: unable to create new native thread. See log file for details.
You can solve this by doing Remote Testing and have multiple clusters running, instead of one.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas