Google CloudML job fails with "CreateSession still waiting for response from worker"

It's intermittent and seemingly non-deterministic: the exact same job will sometimes work perfectly, sometimes it will stall and print dozens of these errors, then work, and sometimes it stalls for a long time, then dies.
Other StackOverflow users who have run into this say it's a bad Cluster config (typically wrong port #s), but we're not setting any cluster params, instead relying on tf.contrib.learn.Experiment to do all the distributed config. Also if it were just a bad config, then it would either always work, or never work.
Full error looks like:
10:53:28.899 2017-10-20 17:53:28.899466: I tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

Handling cache warm-up with twisted and systemd

I have a simple twisted application which I run using a systemd service, executing a script, which subsequently executes a .tac file.
The application is structured as a JSON RPC endpoint (fastjsonrpc), built into a t.w.r.Resource, which is in a t.w.s.Site, served by a t.a.i.TCPServer, with the whole thing packed into a t.a.Application. This works fine.
Where I do run into trouble is when I try to warm up caches at startup. This warm-up process is pretty slow (~300 seconds), and makes systemd time out and kill the process. Increasing the timeout is not really a viable option, since I wouldn't want this to block system boot.
Analogous code is used in a separate stack running on Flask from within Apache and wsgi. That server starts itself off and lets systemd go on while it takes its time building the caches. This behaviour is fine for me.
I've tried calling the warmup function using the following within the setup function of the t.w.r.Resource:
reactor.callLater(1, ep.warmup, None)
I've not yet tried using this from within systemd, and have been testing it with twistd directly on the command line. The server does work as expected; however, it no longer responds to SIGINT (^C). Removing the callLater is all that's needed to let the server respond to SIGINT.
If the warmup function is called directly (not by callLater, i.e., the arrangement which makes systemd give up while waiting for warm up to complete), the resulting server also continues to respond to SIGINT.
Is there a better / good way to handle this sort of long-running warmup code?
Why would twistd / the reactor not respond to SIGINT? Am I missing something here?

Twisted's reactor is single-threaded. It sounds like your "cache warmup" code is blocking the reactor for those 300 seconds, which would also explain why nothing else, SIGINT included, gets handled during that time. One easy way to fix this would be to use deferToThread so the warmup runs without blocking the reactor.

What happens AFTER Apache says "Script timed out before returning headers" to the running script?

I have a Perl web app served by Apache httpd using plain mod_cgi or optionally mod_perl with PerlHandler ModPerl::Registry. Recently the app encountered the error Script timed out before returning headers on some invocations and behaved differently afterwards: While some requests seemed to be processed successfully in the background, after httpd sent status 504 to the client, others didn't.
So how exactly does httpd behave AFTER it has reached its configured timeout and sent the error to the client? The request/response cycle is finished at that point, so I guess things like KeepAlive come into play to decide whether the TCP connection stays alive or not, etc. But what happens to the running script, and in which environment, e.g. mod_cgi vs. mod_perl?
Especially in mod_cgi, where new processes are started for each request, I would have guessed that httpd keeps the processes simply running. Because all our Perl files have a shebang, I'm not even sure if httpd is able to track the processes and does so or not. That could be completely different with mod_perl, because in that case httpd is aware of the interpreters and what they are doing etc. In fact, the same operation which timed out using plain mod_cgi, succeeded using mod_perl without any timeout, but even with a timeout in mod_cgi at least one request succeeded afterwards as well.
I find this question interesting for other runtimes than Perl as well, because they share the concepts of plain mod_cgi vs. some persistent runtime embedded into the httpd processes or using some external daemons.
So, my question is NOT about how to make the error message go away. Instead I want to understand how httpd behaves AFTER the error has occurred, because I can't seem to find much information on that topic. Everything I find is about increasing configuration values and trying to avoid the problem in the first place, which is fine, but not what I need to know currently.
Thanks!

Both mod_cgi and mod_cgid set a cleanup function on the request scope to kill the child process, but they do it in slightly different ways. This happens shortly after the timeout is reported (allowing a little time for mod_cgi to return control, the error response to be written, the request to be logged, etc.).
mod_cgi uses a core facility in httpd that sends SIGTERM, sleeps for 3 seconds, then sends SIGKILL.
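The SIGTERM, wait, SIGKILL sequence described above can be illustrated with a short Python sketch. This is not httpd's implementation, only the same pattern applied to a child process on a POSIX system; `terminate_then_kill` and the sleeping child are hypothetical names chosen for the example.

```python
import signal
import subprocess
import sys


def terminate_then_kill(proc, grace_seconds=3.0):
    """Send SIGTERM, wait up to grace_seconds, then send SIGKILL if still alive."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
    return proc.returncode


# Hypothetical stand-in for a long-running CGI script.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
code = terminate_then_kill(child)
# On POSIX a negative return code means the child was ended by that signal.
print("ended by SIGTERM" if code == -signal.SIGTERM else "return code %s" % code)
```

A script that wants to finish its work after the timeout would therefore have to handle SIGTERM and complete within the grace period, since the follow-up SIGKILL cannot be handled.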

mod_perl2 with apache 2.22 Apache2::RequestIO::print: (103) Software caused connection abort

I’m trying to get a mod_perl2 application ported to AWS. As part of the port I thought I’d move from Debian Squeeze to Wheezy with the latest stable mod_perl & Apache2 combination.
The application works right up to the point I try and write JSON responses to the client. At this point, each request is canceled on the client and on the server I get the error
Apache2::RequestIO::print: (103) Software caused connection abort
whenever I write to the client, i.e.:
$self->req->print($output);
I’ve tried tcpdumping the response to the client, and I can see it being written out, but no response is received on the client end and it just barfs chips. I can’t find any information on how to get around this.

I found quite a few people asking about this question on the net without many answers. The solution to my problem was very specific but I thought I’d post what I did anyway, it may help someone.
The client was canceling the request before the response was fully written, which was crapping out Apache2::RequestIO (for reasons I still don’t know).
I couldn’t work out why I was seeing this behavior.
By using tcpdump I could see that data was being written out to the client – and it looked fine.
By inspecting the page in Chrome and looking at the network tab, I could see that my request for data was being canceled after no response was received (which was odd because the code worked fine on other servers and I could see the response was being written). Debugging was made harder because, with Apache bailing out with an error in print, I couldn’t check whether the bytes written equaled the bytes of data. I wasn’t sure if something was getting stuck on the server side.
So, I changed the Content-Type of the response from application/json to text/html, so that I could query the page and just look at the actual response as text. Once I did that, I could see that the response was fine.
I started to look for other causes, and I found that in the migration to the new server, I’d missed altering some URLs in the DB to point to the new server, which meant my application was trying to get some data from the old DB.
This in turn was causing a load of timing issues, which was causing my problems. Once I fixed the config, the problems went away.

ORA-29273: HTTP request failed intermittent error using the utl_http package

I'm using the utl_http package to make HTTP GET requests to an IIS site on the same server (local) as Oracle. Sometimes it works and I get the response, but more often than not it hangs for about 15 seconds and then I get this error:
ORA-29273: HTTP request failed
ORA-06512: at "SYS.UTL_HTTP", line 1722
ORA-29263: HTTP protocol error
As a test, I've got a small static text file in the IIS site, so this is how I'm testing it:
select utl_http.request('http://domain.com/test.txt') from dual
I get the same problem if I run it in Oracle Apex instead of direct on the db.
The other thing I've tried is to create a package of my own that does the HTTP request using the longer utl_http.begin_request() method, instead of the utl_http.request() shortcut. This gives the exact same problem (works sometimes but errors mostly, with the same error).
The pattern I'm seeing is if I wait a while and then try, it works for the first 2-10 times, and then begins erroring. When it does work, I get the response instantly and when it errors, there is always the delay before the error.
If I request the text file URL (or any other resource in the site) using a remote web browser then I get the correct response every time.
I have tried setting a timeout like below but it doesn't have any effect. For example instead of timing out after 3 seconds it continues for 10 or 15 seconds before the error is shown.
UTL_HTTP.set_transfer_timeout(3);
I think I can rule out ACL because it works sometimes.
Does anyone know what might cause this behaviour?

Possible reasons
-> You may have a problem with your TNS-Listener.
From the command prompt window, try to run TNSPING service_name .. try to run it quickly several times and check if it fails in some of them.
I once had a similar problem. Try to re-configure your TNS-Listener.
There should also be an option to specify an IP address in the TNS-Listener definition. This sometimes solves this kind of problem too.
-> IIS problem.

Read about SET_PERSISTENT_CONN_SUPPORT Procedure:
https://docs.oracle.com/cd/B28359_01/appdev.111/b28419/u_http.htm#i1027673
Using: utl_http.set_persistent_conn_support(true, 30);

Could you be exceeding the limit of concurrent HTTP connections? I vaguely remember that I ran into a similar problem when I forgot to close the HTTP connection.

jmeter hangs up and won't return

I am running 340 concurrent users to load test a server using jmeter.
But in most cases jmeter hangs and won't return; even if I try to close the connection it just hangs, and eventually I have to close the application.
Any idea how to check what is holding the requests, how to inspect the requests sent by jmeter, and how to find the bottleneck?
Got the following message on closing the thread
Shutting down thread please be patient message

I've hit this several times over the past few years. In each of my cases (it may not be so in yours) the issue was with the load balancer (F5) I was sending my traffic through. Basically a property called OneConnect was holding the connections in a time-wait state and never killing the connection.
Run a packet capture tool like Wireshark and see what's happening with the requests.

Try distributed testing, 340 concurrent users is not a big deal, but still you can try if that decreases your pain. Also take a look at the following link:
http://jmeter.apache.org/usermanual/best-practices.html#lean_mean

First check that your script is OK with one user.
Ensure you use assertions.
Then run your test following jmeter best practices:
no gui
no costly listeners
You should then be able to see in csv output the longest request and be able to fix your issue.
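As a rough illustration of that last step, here is a small Python sketch that picks the slowest samples out of a results file. It assumes the file was saved as CSV with a header row that includes `elapsed` (milliseconds) and `label` columns; `slowest_samples` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def slowest_samples(csv_file, top=5):
    """Return the `top` slowest (elapsed_ms, label) pairs from a CSV results file."""
    rows = csv.DictReader(csv_file)
    timings = [(int(row["elapsed"]), row["label"]) for row in rows]
    # Sort by elapsed time, largest first, and keep the first `top` entries.
    return sorted(timings, reverse=True)[:top]


# Tiny made-up sample in the assumed layout (not real measurements).
sample = io.StringIO(
    "timeStamp,elapsed,label,responseCode,success\n"
    "1500000000000,120,Home,200,true\n"
    "1500000000100,4300,Search,200,true\n"
    "1500000000200,95,Login,200,true\n"
)
print(slowest_samples(sample, top=2))
```

With a real run, the same function could be given `open("test.jtl", newline="")`, where the file name is whatever was passed to `-l`.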

I also encountered this problem when I ran JMeter on my laptop (Core 2 Duo 1.5 GHz): it always hung in the middle of processing. I then ran it on another PC that is more powerful than my laptop, and it now runs smoothly. So JMeter will run more reliably if your PC or laptop has better specs.
Note: It is also advisable to run your JMeter in non-gui mode.
Example to run JMeter in Linux box:
$ ./jmeter -t test.jmx -n -l /Users/home/test.jtl

I had the
one or more test threads won't exit
because of a firewall blocking some requests. So I had to wait out the firewall's timeout for every blocked request... then it returned.

You are probably getting this error because the JVM is not capable of running so many threads. If you take a look at your terminal, you will see the exception you get:
Uncaught Exception java.lang.OutOfMemoryError: unable to create new native thread. See log file for details.
You can solve this by doing Remote Testing and have multiple clusters running, instead of one.
