Airflow BashOperator freezes after some time - ssh

BashOperator executes a ssh command that runs process on another server and waits for its execution. Problem is a process on the other server freezes after some time (usually a few hours) whilst Airflow task is still in running mode. What might be the cause of a problem?
Problem exists on both Airflow 1 and Airflow 2 which are running on the same server.
I read that ssh sessions freeze after some time so I added theses configurations https://unix.stackexchange.com/questions/200239/how-can-i-keep-my-ssh-sessions-from-freezing
but unfortunately it didn't help :(
Also, I tried using SSHOperator instead, but I faced that problem again.

Related

Ansible playbook stops after loosing connection (even for few seconds) with ssh window of VM on which it is running?

My ansible playbook consist several task in it and I am running my ansible playbook on Virtual Machine. I am using ssh method to log in to VM and run the playbook. if my ssh window gets closed during the execution of any task (when internet connection is not stable and not reliable), the execution of ansible playbook stops as the ssh window already got closed.
It takes around 1 hour for My play book to run, and sometimes even if I loose internet connectivity for few seconds , the ssh terminal lost its connection and thus entire playbook stops. any idea how to make ansible script more redundant to avoid this problem ?
Thanks in advance !!
If you need to run a job on an external system that hangs for a long time and it is relevant that the task completes. It is extremly bad idea to run that job in the foreground.
It is not important that the task is Ansible or the connection is SSH. In every case you would always just "push" the command to the remote host and send it to background with something like "nohup" if available. The problem is of course the tree of processes. Your connection creates a process on the remote system and that creates the job you want to run. Is the connection gets lost, al subprocesses will be killed automatically by the OS.
So - under Windows - maybe use RDP to open a screen that stays available even after connection is lost or use something like Cygwin and nohup via SSH to change the hung up your process from the ssh session.
Or - when you need to run a playbook on that system install for example a AWX container and use that. There are many options based on your requirements, resources and administrative options.

Application stops receiving data from remote machine

I am developing an app in VB.NET that connects to a remote Linux server to execute a task; my application sends input data for the server program, starts the execution in the server and retrieves output information from the server when the command execution is done. I use Renci SSH.NET to exchange files (SFTP) and monitor the remote server (SSH). One of the critical tasks is to detect when the remote machine finishes command execution, and depending on the input this can take between 5 minutes and many hours.
Here is my problem: for short runs (5 minutes or less), the application runs smoothly, no problems whatsoever. But, when the execution of the remote command takes more than 10 minutes, my application does not detect any response from the server and just hangs, since it never receives the indication from the server that the remote task is finished. I use the Renci.SshNet.Common.Stream.DataAvailable property to detect when the command is done by testing if the shell prompt string is present when I invoke Stream.Read().
Now, I know that the problem is the app, since when I execute the remote code "manually" (logging into the server with a terminal app and executing the command) it finishes without problems. I suspect it has something to do with the way Renci ShellStream object polls the remote machine for data. Any ideas as to what might be happening? Thank you in advance for your answers.

GCP VM consistently shutting down without warning

Been using a GCP preemptible VM for a few months without problems, but in the last 4 weeks my instances have consistently shut off anywhere from 10 minutes to 20 minutes into operation.
I'll be in the middle of training, and my notebook will suddenly disconnect. The terminal will show this error:
jupyter#fastai-instance:~$ Connection to 104.154.142.171 closed by remote host.
Connection to 104.154.142.171 closed.
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255].
I then check the status of my VM, to see that it has shutdown.
I searched the terminal traceback and found this thread, which seemed promising: ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code [255]
When I ran sudo gcloud compute config-ssh my VM ran for much longer than usual before shutting down, yet shutdown in the same way after about an hour. Since then, back to the same behavior.
I know preemptible instances can be shutdown when the platform needs resources, but my understanding is that comes with some kind of warning. I've checked the status of GCP's servers after shutdowns and they appear to be fine. This is also happening the same way every time I turn my VM on, which seems too frequent for preempting.
I am not sure where to look for any clues – has anyone else had a problem like this? What's especially puzzling to me is, if it is in fact an SSH problem, why would that cause the VM itself to shutdown, rather than just break the connection?
Thanks very much for any help!
Did you try to set a shutdown script and to print something in a file for validating the state of the VM when it goes down ?
Try this as shutdown script
#!/bin/bash
curl "http://metadata.google.internal/computeMetadata/v1/instance/preempted" -H "Metadata-Flavor: Google" > /tmp/preempted.log
If there is TRUE in the file, it's because the VM has been preempted.
If a VM stops and you have an active SSH connection to that VM (via gcloud compute ssh), then it's normal that you are receiving an error. Since the VM goes down, all connections are closed, so does your SSH connection (you cannot connect to a stopped instance). The VM termination causes the SSH error, not the opposite.
When using preemptible instances, Google can reclaim the instance whenever it's needed. Note that (from the docs about preemptible instances limitations) :
Compute Engine might terminate preemptible instances at any time due to system events. The probability that Compute Engine will terminate a preemptible instance for a system event is generally low, but might vary from day to day and from zone to zone depending on current conditions.
It means that one day, your instance may be running for 24 hours without being terminated, but an other day, your instance may be stopped 30 minutes after being started if Compute Engine needs to reclaim some resources.
A comment on the "continuously shutting down" part:
(I have experienced this as well)
Keep in mind that Google prefers to shut down RECENTLY STARTED preemptible instances, over ones started earlier.
The link below (and supplied earlier) has the statement:
Generally, Compute Engine avoids preempting too many instances from a single customer and preempts new instances over older instances whenever possible.
This would generally mean that, yes, I suppose, if you are preempted, and boot up again, it is quite likely that you are going to be preempted again and again until the load in the zone reduces.
I'm surprised that Google don't simply preclude you starting the preemptible VM for a while (like 30-60 minutes?). - How much CPU is being wasted bouncing VMs up and down and crossing our fingers???
P.S. There is a dirty trick to end-around your frustration - Have 2 VMs identically configured, except for preemptibility, but only 1 underlying book disk. If you are having a bad day with preempts, simply 'move' the boot disk to the non-preemptible VM, boot it, and carry on. - It's a couple of simple gcloud commands to achieve this, easily scripted and very fast. Don't tell Google I told ya....
https://cloud.google.com/compute/docs/instances/preemptible#limitations

How to use rotate_logs on a log file that is 80+gb's for RabbitMQ on windows server

I need to run rabbitmqctl rotate_logs on a rabbitmq log file that is over 80gb's in size. When I tried to run this the first time it froze rabbit and no messages could be received. The freeze lasted 20 mins before I had to kill the command and restart the rabbit server.
This is a production server and completing this in a small amount of time without losing messages or killing the broker would be optimal.
Would it be possible to shut down the service and move the current log file to another location and restart the service and then run the rotate_logs command?
I'm fairly new to rabbitmq and I am not sure what the best way to handle this would be.
This is installed on a windows 2008 server as a service for a heavy traffic production site (However the message queue has a small load and only affects the administrative side of things).
Any help or insight would be appreciated.
I ran into a similar situation, but with only about 4GB of log file instead of 80.
the workaround I used was pretty much what you suggested... stop the service, move the log file and restart the service as quickly as possible.
for me, specifically, instead of moving the file while the service was stopped i just renamed it. i also wrote a commandline script to do the work for me.
this allowed me to stop the service, rename the file and restart the service in a matter of seconds.
once the service was back up and running, i was free to move / rename / whatever the large log file as needed.

Keeping access to a Fedora Amazon EC2 instance

I run a Fedora instance in Amazon EC2. I can access and work on it perfectly by Putty.
I also set Seconds between keepalives to 1 for not losing the connection due to inactivity (I mean in Putty).
Nevertheless, if a network/electric failure happens on my local computer, it shuts down the Putty connection, so the session logs off and the executions in the instances stop.
Can anybody help me in keeping a session alive and being able to connect/disconnect to it whenever I want?
Use the screen command.