Using TPU on GKE: Error recorded from infeed: Socket closed - tensorflow

Once in a while, our GKE TPUEstimator-based training job using TPUs fails with:
Error recorded from infeed: Socket closed
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
I have two questions about that:
What is happening here? I checked the pod's memory usage and it did not spike. The TPU allocated to the pod is still there as well.
The job doesn't always raise an error to the pod. It continues to show as running unless someone manually checks the state and then takes action to restart it. Is there any way to make it always restart automatically?

Related

Timeout exception (wcf) while running python script

I am running a Python script that controls a piece of ADI software named ACE, capturing data remotely through port 8080. ACE attempts to create an IPC server on that port. It takes around 12 minutes to capture one data set, and the script runs normally for captures under 10 minutes, so I suspect that the exception occurs because the default ReceiveTimeout is 10 minutes.
I adjusted the ReceiveTimeout in the SMSvcHost.exe.config file, but it still did not fix the issue. Other threads suggest that I need to adjust the ReceiveTimeout in app.config/web.config. I am not sure how to locate these files, and I am worried I might make changes to the wrong config file. Please advise!
Tracing information (screenshot): https://i.stack.imgur.com/AKeHA.png
Here is the error message:
This request operation sent to net.tcp://localhost:8080/ did not receive a reply within the configured timeout (00:10:00). The time allotted to this operation may have been a portion of a longer timeout. This may be because the service is still processing the operation or because the service was unable to send a reply message. Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.

Failed Deployment in App Engine Google Cloud

I am deploying my Node.js application to Google Cloud App Engine, but when making requests it gives this error:
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
This request may thus take longer and use more CPU than a typical request for your application.
I have also seen some Stack Overflow answers, but they didn't work for me.
My app.yaml has this config:
runtime: nodejs10
Can anyone help me out?
You could add the following to your app.yaml:
inbound_services:
- warmup
And then implement a handler that will catch all warmup requests, so that your application doesn't get the full load. The full explanation is given here. Another detailed post about this topic can be found here.
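For illustration, here is a minimal sketch of such a warmup handler, assuming an Express-based Node.js app; the initialize() function and the warm-up work inside it are placeholders, not something from your project:

import express from "express";

const app = express();
let warmedUp = false;

// Placeholder warm-up work: open database connections, prime caches, etc.
async function initialize(): Promise<void> {
  warmedUp = true;
}

// With "inbound_services: - warmup" in app.yaml, App Engine sends
// GET /_ah/warmup to a new instance before routing real traffic to it.
app.get("/_ah/warmup", async (_req, res) => {
  if (!warmedUp) {
    await initialize();
  }
  res.status(200).send("warmed up");
});

app.listen(Number(process.env.PORT) || 8080);

This way the expensive initialization happens during the warmup request instead of during the first user request.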
Additionally, you can add automatic scaling options. You can play a bit with those to find the optimum for your application; the latency-related variables are especially important. Note that these can be set in a standard GAE environment.
automatic_scaling:
min_idle_instances: automatic
max_idle_instances: automatic
min_pending_latency: automatic
max_pending_latency: automatic
More scaling options can be found here.
The "request caused a new process to be started" notification usually occurred when there is no warm up request present in your application.
Can you try to implement a health check handler that only returns a ready status when the application is warmed up. This will allow your service to not receive traffic until it is ready.
Warning: Legacy health checks using the /_ah/health path are now deprecated, and you should migrate to use split health checks.
Here you can find Split health checks for Node.js.
Liveness checks
Liveness checks confirm that the VM and the Docker container are running. Instances that are deemed unhealthy are restarted.
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
Readiness checks
Readiness checks confirm that an instance can accept incoming requests. Instances that don't pass the readiness check are not added to the pool of available instances.
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
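As a rough sketch of what such handlers could look like, again assuming an Express app; the ready flag and markReady() are illustrative, not an App Engine API:

import express from "express";

const app = express();
let ready = false;

// Call this once your own initialization / warm-up work has finished.
export function markReady(): void {
  ready = true;
}

// Liveness: the process is up and able to respond at all.
app.get("/liveness_check", (_req, res) => {
  res.status(200).send("alive");
});

// Readiness: only report healthy once the app is actually warmed up,
// so the instance is not added to the serving pool before it is ready.
app.get("/readiness_check", (_req, res) => {
  if (ready) {
    res.status(200).send("ready");
  } else {
    res.status(503).send("warming up");
  }
});

app.listen(Number(process.env.PORT) || 8080);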
Edit
For App Engine Standard, which doesn't afford you that flexibility, hardware and software failures that cause early termination or frequent restarts can occur without prior warning (link).
App Engine attempts to keep manual and basic scaling instances running indefinitely. However, at this time there is no guaranteed uptime for manual and basic scaling instances. Hardware and software failures that cause early termination or frequent restarts can occur without prior warning and can take considerable time to resolve; thus, you should construct your application in a way that tolerates these failures.
Here are some good strategies for avoiding downtime due to instance restarts:
Reduce the amount of time it takes for your instances to restart or for new ones to start.
For long-running computations, periodically create checkpoints so that you can resume from that state (see the sketch below).
Your app should be "stateless" so that nothing is stored on the instance.
Use queues for performing asynchronous task execution.
If you configure your instances to manual scaling: use load balancing across multiple instances, configure more instances than required to handle normal traffic, and write fall-back logic that uses cached results when a manual scaling instance is unavailable.
Instance Uptime
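As a small sketch of the checkpointing idea from the list above; the file name, interval, and state shape are illustrative, and in practice you would persist to Cloud Storage rather than the instance's local disk so the instance stays stateless:

import { promises as fs } from "fs";

interface Checkpoint {
  step: number;
  partialResult: number;
}

const CHECKPOINT_FILE = "/tmp/checkpoint.json"; // illustrative location

async function loadCheckpoint(): Promise<Checkpoint> {
  try {
    return JSON.parse(await fs.readFile(CHECKPOINT_FILE, "utf8"));
  } catch {
    return { step: 0, partialResult: 0 }; // no checkpoint yet: start fresh
  }
}

async function saveCheckpoint(cp: Checkpoint): Promise<void> {
  await fs.writeFile(CHECKPOINT_FILE, JSON.stringify(cp));
}

async function run(totalSteps: number): Promise<number> {
  let cp = await loadCheckpoint(); // resume from the last saved step
  for (let step = cp.step; step < totalSteps; step++) {
    cp = { step: step + 1, partialResult: cp.partialResult + step };
    if (cp.step % 1000 === 0) {
      await saveCheckpoint(cp); // periodic checkpoint
    }
  }
  await saveCheckpoint(cp); // final state
  return cp.partialResult;
}

If the instance is restarted, the next run picks up at the last checkpointed step instead of starting over.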

GCP VM consistently shutting down without warning

I've been using a GCP preemptible VM for a few months without problems, but in the last 4 weeks my instances have consistently shut off anywhere from 10 to 20 minutes into operation.
I'll be in the middle of training, and my notebook will suddenly disconnect. The terminal will show this error:
jupyter#fastai-instance:~$ Connection to 104.154.142.171 closed by remote host.
Connection to 104.154.142.171 closed.
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
I then check the status of my VM and see that it has shut down.
I searched the terminal traceback and found this thread, which seemed promising: ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
When I ran sudo gcloud compute config-ssh, my VM ran for much longer than usual before shutting down, yet shut down in the same way after about an hour. Since then, it has been back to the same behavior.
I know preemptible instances can be shut down when the platform needs resources, but my understanding is that this comes with some kind of warning. I've checked the status of GCP's servers after shutdowns and they appear to be fine. This is also happening the same way every time I turn my VM on, which seems too frequent for preemption.
I am not sure where to look for any clues. Has anyone else had a problem like this? What's especially puzzling to me is, if it is in fact an SSH problem, why would that cause the VM itself to shut down, rather than just break the connection?
Thanks very much for any help!
Did you try to set a shutdown script and print something to a file to check the state of the VM when it goes down?
Try this as a shutdown script:
#!/bin/bash
curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google" > /tmp/preempted.log
If there is TRUE in the file, it's because the VM has been preempted.
If a VM stops and you have an active SSH connection to that VM (via gcloud compute ssh), then it's normal that you receive an error. Since the VM goes down, all connections are closed, including your SSH connection (you cannot connect to a stopped instance). The VM termination causes the SSH error, not the opposite.
When using preemptible instances, Google can reclaim the instance whenever it's needed. Note that (from the docs about preemptible instances limitations) :
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
It means that one day your instance may run for 24 hours without being terminated, but another day your instance may be stopped 30 minutes after being started if Compute Engine needs to reclaim some resources.
A comment on the "continuously shutting down" part:
(I have experienced this as well)
Keep in mind that Google prefers to shut down RECENTLY STARTED preemptible instances, over ones started earlier.
The link below (and supplied earlier) has the statement:
Generally, Compute Engine avoids preempting too many instances from a single customer and preempts new instances over older instances whenever possible.
This would generally mean that, yes, if you are preempted and boot up again, it is quite likely that you are going to be preempted again and again until the load in the zone decreases.
I'm surprised that Google doesn't simply preclude you from starting the preemptible VM for a while (say 30-60 minutes). How much CPU is being wasted bouncing VMs up and down and crossing our fingers?
P.S. There is a dirty trick to get around your frustration: have two VMs identically configured, except for preemptibility, but only one underlying boot disk. If you are having a bad day with preempts, simply 'move' the boot disk to the non-preemptible VM, boot it, and carry on. It's a couple of simple gcloud commands to achieve this, easily scripted and very fast. Don't tell Google I told ya....
https://cloud.google.com/compute/docs/instances/preemptible#limitations

can stale connection exception cause slowness?

While checking SystemOut.log during a reported slowness in the application, I found StaleConnectionException occurring frequently. This exception was not observed earlier, and I wonder whether it is the reason for the slowness and needs to be resolved.
StaleConnectionException usually happens when WebSphere is disconnected from the database. It can be caused by a database restart or by a network issue, e.g. a firewall that drops connections after some time. If it happens frequently, make sure the Purge policy for that datasource is set to Entire Pool, not Failing Connections. If you have a firewall between WAS and the DB, set the Aged timeout to a lower value than the timeout on the firewall (try 1200, for example).
Can this be a reason for slowness?
It can, a bit: when the application gets a StaleConnectionException, that request fails, and either the application has implemented logic to retry it or the end user gets an error and retries the same request.

Load Testing - Socket Connection Aborted While Running The Load Test

I am running a load test in order to see the performance of the WCF services during peak time (heavy load). I am using a step load, where we add virtual users step by step. When I start running the load test, for the first few minutes the test runs smoothly; then, as the load increases over time, all of a sudden the error below is triggered:
"Test method "XYZ" threw exception:
System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host".
I tried a lot of solutions that I found online, but none of them worked for me. I tried changing the default timeouts, max connections, max concurrent connections, etc., in the config files. I would really appreciate any help on this.
I had similar problems a while ago when I moved to WCF. It probably has something to do with the way WCF handles connections.
The solution for me was to move each test into a separate ApplicationDomain.