GCP VM consistently shutting down without warning - ssh

Been using a GCP preemptible VM for a few months without problems, but in the last 4 weeks my instances have consistently shut off anywhere from 10 minutes to 20 minutes into operation.
I'll be in the middle of training, and my notebook will suddenly disconnect. The terminal will show this error:
jupyter@fastai-instance:~$ Connection to 104.154.142.171 closed by remote host.
Connection to 104.154.142.171 closed.
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
I then check the status of my VM, only to see that it has shut down.
I searched the terminal traceback and found this thread, which seemed promising: ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
When I ran sudo gcloud compute config-ssh, my VM ran for much longer than usual before shutting down, yet it shut down in the same way after about an hour. Since then, it's back to the same behavior.
I know preemptible instances can be shut down when the platform needs resources, but my understanding is that this comes with some kind of warning. I've checked the status of GCP's servers after shutdowns and they appear to be fine. This also happens the same way every time I turn my VM on, which seems too frequent for preemption.
I am not sure where to look for clues – has anyone else had a problem like this? What's especially puzzling to me is, if it is in fact an SSH problem, why would that cause the VM itself to shut down, rather than just break the connection?
Thanks very much for any help!

Have you tried setting a shutdown script that writes something to a file, so you can check the state of the VM when it goes down?
Try this as the shutdown script:
#!/bin/bash
curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google" > /tmp/preempted.log
If there is TRUE in the file, it's because the VM has been preempted.
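If it helps, this is roughly how you would attach that script to the instance as metadata (the instance name, zone and file name below are just placeholders):
# upload the script above so Compute Engine runs it on shutdown
gcloud compute instances add-metadata fastai-instance \
    --zone us-central1-a \
    --metadata-from-file shutdown-script=shutdown.sh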

If a VM stops and you have an active SSH connection to that VM (via gcloud compute ssh), then it's normal that you receive an error. When the VM goes down, all connections are closed, including your SSH connection (you cannot connect to a stopped instance). The VM termination causes the SSH error, not the other way around.
When using preemptible instances, Google can reclaim the instance whenever it's needed. Note this, from the docs on preemptible instance limitations:
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
It means that one day your instance may run for 24 hours without being terminated, but another day your instance may be stopped 30 minutes after being started if Compute Engine needs to reclaim some resources.
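If you want to confirm afterwards that a shutdown really was a preemption, the Compute Engine docs also describe listing the project's preemption events, roughly like this (adjust the filter to your needs):
# list system operations that record an instance being preempted
gcloud compute operations list \
    --filter="operationType=compute.instances.preempted"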

A comment on the "continuously shutting down" part:
(I have experienced this as well)
Keep in mind that Google prefers to shut down RECENTLY STARTED preemptible instances, over ones started earlier.
The link below (and supplied earlier) has the statement:
Generally, Compute Engine avoids preempting too many instances from a single customer and preempts new instances over older instances whenever possible.
This would generally mean that, yes, I suppose, if you are preempted and boot up again, it is quite likely you are going to be preempted again and again until the load in the zone drops.
I'm surprised that Google doesn't simply stop you from restarting the preemptible VM for a while (like 30-60 minutes?). How much CPU is being wasted bouncing VMs up and down while we cross our fingers???
P.S. There is a dirty trick to end-around your frustration - have 2 VMs identically configured, except for preemptibility, but only 1 underlying boot disk. If you are having a bad day with preempts, simply 'move' the boot disk to the non-preemptible VM, boot it, and carry on. It's a couple of simple gcloud commands to achieve this, easily scripted and very fast (see the sketch after the link below). Don't tell Google I told ya....
https://cloud.google.com/compute/docs/instances/preemptible#limitations
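For what it's worth, a rough sketch of that disk 'move' (the VM names, disk name and zone are made up, and it assumes the non-preemptible VM is stopped with no boot disk of its own attached):
# stop the preemptible VM so its boot disk can be detached
gcloud compute instances stop preempt-vm --zone us-central1-a
gcloud compute instances detach-disk preempt-vm --disk shared-boot --zone us-central1-a

# attach the same disk as the boot disk of the regular VM and start it
gcloud compute instances attach-disk ondemand-vm --disk shared-boot --boot --zone us-central1-a
gcloud compute instances start ondemand-vm --zone us-central1-a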

Related

Raspberry (PiHole): 500 Internal Server Error

It has already happened twice today: I try to reach the dashboard and get "500 Internal Server Error"; I can ping the Raspberry Pi but SSH does not work (connection closed by peer).
A reboot will fix the problem
Any ideas?
There are a ton of possible situations here; you probably need some level of monitoring on your system. If it happens "a lot", I'd look at a partition such as /var filling up and the OS not being able to write to it.
With the reboot, your /tmp and /var partitions typically get cleaned out, allowing you to administer the machine again.
So, tl;dr? The best thing would be to set up some type of monitoring on your Raspberry Pi and watch the graphs. If you have no idea where to start, Datadog will help you get off the ground.
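If full-blown monitoring feels like too much, even a tiny cron-driven logger (a sketch below; the paths are just a suggestion) would tell you whether a partition is filling up before the lockups:
#!/bin/bash
# append a timestamped disk-usage snapshot; run e.g. every 15 minutes from cron
# log under /home so it still works if /var is the partition that fills up
LOG=/home/pi/disk-usage.log
echo "=== $(date) ===" >> "$LOG"
df -h / /var /tmp >> "$LOG" 2>&1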

ClientAliveInterval is not closing the idle connection

I have been tasked with closing idle ssh connections if they are idle for more than 5 minutes. I have tried setting these values in sshd_config:
TCPKeepAlive no
ClientAliveInterval 300
ClientAliveCountMax 0
But nothing seems to work; the idle connection stays active and does not get dropped even after 5 minutes of idle time.
Then I came across this https://bbs.archlinux.org/viewtopic.php?id=254707 where the reply says:
These are not for user-idle circumstances, they are - as that man page
excerpt notes - for unresponsive SSH clients. The client will be
unresponsive if the client program has frozen or the connection has
been broken. The client should not be unresponsive simply because the
human user has stepped away from the keyboard: the ssh client will
still receive packets sent from the server.
I can't even use TMOUT because there are ssh client scripts that do not run a bash program.
How to achieve this?
Openssh version
OpenSSH_8.2p1 Ubuntu-4ubuntu0.4, OpenSSL 1.1.1f 31 Mar 2020
close the idle ssh connection if they are idle for more than 5 minutes
This task is surprisingly difficult. OpenSSH itself has no functionality to set an idle timeout on shell sessions, probably for a good reason: killing "idle" shells is itself non-trivial:
There are multiple ways to define "idleness", e.g. no stdin, no stdout, no I/O activity whatsoever, no CPU consumption, etc.
Even when a process is deemed "idle", it's difficult to kill the process and all the child processes it may have created.
Given that, it's not surprising that there are only a few solutions for killing idle shell sessions in general. Those I could find with (little) research rely on background daemons that check the idle status of all processes running on a system (e.g. doinkd/idled, idleout).
One possible solution is to check if any of those solutions can be adapted to enforce an idle timeout on a specific shell session.
Another option is to adapt the OpenSSH source code to support your specific requirement. In principle, OpenSSH should be able to easily access console I/O activity and session duration, so assessing the "idle" property is probably relatively easy. As for killing the shell and all involved children, running (and killing) the remote shell in a PID namespace is an effective option on Linux systems.
Both options are relatively complex -- so before pursuing them, I'd first check whether there are existing solutions that enforce an idle timeout on a shell session. Using them under OpenSSH should be straightforward.
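As a rough illustration of the first option, here is a minimal cron-driven watchdog, assuming "idle" means no keyboard input on the pty for 5 minutes (the same signal the w command uses for its IDLE column); it only catches processes still attached to the terminal, so take it as a sketch rather than a robust solution:
#!/bin/bash
# kill shell sessions whose pty has seen no input for more than LIMIT seconds
LIMIT=300
NOW=$(date +%s)
for tty in /dev/pts/[0-9]*; do
    [ -e "$tty" ] || continue
    ATIME=$(stat -c %X "$tty")          # atime is updated on keystrokes
    if [ $(( NOW - ATIME )) -gt "$LIMIT" ]; then
        pkill -HUP -t "${tty#/dev/}"    # signal every process on that pty
    fi
done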

Google Compute Engine SSH from browser stopped working Error 13

A compute instance I had running stopped working and I am no longer able to ssh to it from the browser. When I try, it hangs forever and eventually I get the error message:
You cannot connect to the VM instance because of an unexpected error.
Wait a few moments and then try again. (#13)
I looked here for common issues. I made a snapshot and tried recreating with a larger disk, in a different region and with a bigger compute instance but I was still unable to connect. When other users try to connect they have the same problem. I'm using a standard container so I expect the google daemon should be running.
This instance was collecting tweets and writing output to GCS regularly. Since ssh stopped working the instance has also stopped writing output.
Does anyone have any idea what could have gone wrong?
I would also suggest checking the Serial Console of the machine to see if there are any messages which provide any clues. For example, if the boot disk has run out of space (which can prevent SSH connectivity), there will be some messages displayed in the Serial Console implying this.
You could also try connecting to the machine via the Serial Console to troubleshoot the issue by following the advice here.
When you try to SSH into the instance from the Cloud Shell for example, using the following command, the output should provide some clues as to why you cannot SSH into the machine:
$ gcloud compute ssh INSTANCE_NAME --zone ZONE
If you are on a VPC network, check that a firewall rule allows SSH (tcp:22) to the instance, and that the instance carries whatever network tag that rule targets; it could be the firewall rules that are blocking your instance from accepting the ssh connection.
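A couple of commands that may help with both checks (instance name, zone and the tag name are placeholders; you are just looking for a rule allowing tcp:22 that applies to the VM):
# read the serial console log for boot, OOM or disk-full messages
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE

# review the project's firewall rules and look for one allowing tcp:22
gcloud compute firewall-rules list

# if that rule targets a network tag, add the tag to the instance
gcloud compute instances add-tags INSTANCE_NAME --tags allow-ssh --zone ZONE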

Celery workers missing heartbeats and getting substantial drift over Ec2

I am testing my celery implementation over 3 ec2 machines right now. I am pretty confident in my implementation now, but I am getting problems with the actual worker execution. My test structure is as follows:
1 ec2 machine is designated as the broker, also runs a celery worker
1 ec2 machine is designated as the client (runs the client celery script that enqueues all the tasks using .delay()); it also runs a celery worker
1 ec2 machine is purely a worker.
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery@[other ec2 ip]".
The machine would be doing very little work at this point, so my AutoScaling config in ec2 would shut down the instance automatically once CPU utilization got very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought celery handled this) with these commands, which were run at startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my ec2 instances stalled and shut down. The weird thing is, the drift increased and then decreased sometimes.
Any idea on why this is happening?
I am using the newest version of celery (3.1) and rabbitmq
EDIT: It should be noted that I am utilizing us-west-1a and us-west-1c availability zones on ec2.
EDIT2: I am starting to think memory problems might be an issue. I am using a t2.micro instance, and running 3 celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.

how to handle memory leaks in amazon web services t1.micro?

I have a t1.micro instance in Amazon Web Services running a virtual image (specifically a formhub image), and sometimes I get an error about memory that cannot be allocated; I solve it by rebooting the instance. Any clues?
Is it possible to reboot the instance automatically every day?
The micro instances are quite constrained, with only 600 MB or so of RAM. You may solve the problem by moving up to a small or medium instance, or even one of the new T2 instances - even the smallest one has 1 GB of RAM.
If this is not an option for you, you can add a cron job to restart the instance at a particular time of day.
ssh in to the instance and type the command:
sudo crontab -e
Enter a line like:
0 5 * * * /sbin/reboot
to restart the system at 5am each day. This is for an Ubuntu system - the reboot command may be elsewhere in other distributions. Run the command which reboot to check.
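To double-check the setup afterwards (these commands are standard and should work on Ubuntu and most other distributions):
sudo crontab -l      # confirm the reboot entry was saved
free -m              # see how much memory is actually free, in MB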