Run Ansible Tasks against failed hosts - error-handling

I'm running an Ansible playbook with several tasks and hosts. In this playbook I'm trying to rerun tasks against failed hosts. I'll try to reconstruct the situation:
Inventory:
[hostgroup_1]
host1 ansible_host=1.1.1.1
host2 ansible_host=1.1.1.2
[hostgroup_2]
host3 ansible_host=1.1.1.3
host4 ansible_host=1.1.1.4
The hosts from "hostgroup_1" are supposed to fail, so I can check the error-handling on the two hosts.
Playbook:
---
- name: firstplaybook
  hosts: all
  gather_facts: false
  connection: network_cli
  vars:
    - ansible_network_os: ios
  tasks:
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
As expected, the first two hosts (1.1.1.1 & 1.1.1.2) are failing and won't be considered for the second task. After looking at several Ansible documentation pages I found the meta clear_host_errors task. So I tried to run the playbook like this:
---
- name: firstplaybook
  hosts: all
  gather_facts: false
  connection: network_cli
  vars:
    - ansible_network_os: ios
  tasks:
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
    - meta: clear_host_errors
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
Sadly, the meta task did not reset the hosts, and the playbook went on without considering the failed hosts again.
Basically, I would just like to know how Ansible can consider failed hosts in a run again, so I can continue with them.
Thank y'all in advance
Regards, Lucas

Do you get any different results when using:
ignore_errors: true
or
ignore_unreachable: yes
with the first task?
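For reference, a minimal sketch of what that would look like applied to the playbook above (an assumption: if the hostgroup_1 hosts fail because they cannot be reached, ignore_unreachable is the relevant flag; if the task itself errors out, ignore_errors covers it):

```yaml
---
- name: firstplaybook
  hosts: all
  gather_facts: false
  connection: network_cli
  vars:
    - ansible_network_os: ios
  tasks:
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
      ignore_unreachable: true   # keep hosts that could not be reached (Ansible >= 2.7)
      ignore_errors: true        # keep hosts whose task failed for other reasons
    - name: sh run
      cisco.ios.ios_command:
        commands: show run
```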

Q: "How can Ansible consider failed hosts in a run again?"
A: Use ignore_unreachable (new in version 2.7). For example, in the play below the host test_99 is unreachable:
- hosts: test_11,test_12,test_99
  gather_facts: false
  tasks:
    - ping:
    - debug:
        var: inventory_hostname
As expected, the debug task omits the unreachable host:
PLAY [test_11,test_12,test_99] ********************************************
TASK [ping] ***************************************************************
fatal: [test_99]: UNREACHABLE! => changed=false
  msg: 'Failed to connect to the host via ssh: ssh: Could not resolve
    hostname test_99: Name or service not known'
  unreachable: true
ok: [test_11]
ok: [test_12]
TASK [debug] ***************************************************************
ok: [test_11] =>
  inventory_hostname: test_11
ok: [test_12] =>
  inventory_hostname: test_12
PLAY RECAP *****************************************************************
If you set ignore_unreachable: true, the host will be skipped for the failing task and included in the next one:
- hosts: test_11,test_12,test_99
  gather_facts: false
  tasks:
    - ping:
      ignore_unreachable: true
    - debug:
        var: inventory_hostname
PLAY [test_11,test_12,test_99] ********************************************
TASK [ping] ***************************************************************
fatal: [test_99]: UNREACHABLE! => changed=false
  msg: 'Failed to connect to the host via ssh: ssh: Could not resolve
    hostname test_99: Name or service not known'
  skip_reason: Host test_99 is unreachable
  unreachable: true
ok: [test_11]
ok: [test_12]
TASK [debug] ***************************************************************
ok: [test_11] =>
  inventory_hostname: test_11
ok: [test_12] =>
  inventory_hostname: test_12
ok: [test_99] =>
  inventory_hostname: test_99
PLAY RECAP *****************************************************************
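For completeness, ignore_unreachable only covers connection failures; for hosts that are reachable but whose task fails, the per-task counterpart is ignore_errors. A minimal sketch (the /bin/false command is just a stand-in for any failing task):

```yaml
- hosts: test_11,test_12
  gather_facts: false
  tasks:
    - command: /bin/false    # fails on every host
      ignore_errors: true    # hosts stay in the play despite the failure
    - debug:
        var: inventory_hostname
```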


delegate_to ignores configured ssh_port

Use-Case:
We are deploying virtual machines into a cloud with a default linux image (Ubuntu 22.04 at the moment). After deploying a machine, we configure our default users and change the SSH port from 22 to 2222 with Ansible.
Side note: We are using a jump concept through the internet - Ansible automation platform / AWS => internet => SSH jump host => target host
To keep Ansible able to connect to the new machine after changing the SSH port, I found multiple Stack Overflow / blog entries that check and set ansible_ssh_port, basically by running wait_for on ports 22 and 2222 and setting the SSH variable depending on the result (code below).
Right now this works fine for the first SSH host (jumphost), but always fails for the second host due to issues with establishing the ssh connection.
Side note: The SSH daemon is running. If I use my user from the jump host, I can get a SSH response from 22/2222 (depending on the current state of deployment).
Edit from the comments:
The deployment tasks should only run on the target host, not on the jumphost as well.
I run the deployment on the jumphost first and make sure it is up, running and configured.
After that, I run the deployment on all machines behind the jumphost to configure them.
This also ensures that if I ever need to reboot, I don't kill all tunneled SSH sessions by accident.
Ansible inventory example:
all:
  hosts:
  children:
    jumphosts:
      hosts:
        example_jumphost:
          ansible_host: 123.123.123.123
    cloud_hosts:
      hosts:
        example_cloud_host01: # local DNS is resolved on the jumphost - no ansible_host here (yet)
          ansible_ssh_common_args: '-oProxyCommand="ssh -W %h:%p -oStrictHostKeyChecking=no -q ansible@123.123.123.123 -p 2222"' # tunnel through the appropriate jumphost
          delegation_host: "ansible@123.123.123.123" # delegate jobs to the jumphost in each project if needed
      vars:
        ansible_ssh_port: 2222
SSH check_port role:
- name: Set SSH port to 2222
  set_fact:
    ansible_ssh_port: 2222

- name: "Check backend port 2222"
  wait_for:
    port: 2222
    state: "started"
    host: "{{ inventory_hostname }}"
    connect_timeout: "5"
    timeout: "5"
  # delegate_to: "{{ delegation_host }}"
  # vars:
  #   ansible_ssh_port: 2222
  ignore_errors: true
  register: ssh_port

- name: "Check backend port 22"
  wait_for:
    port: "22"
    state: "started"
    host: "{{ inventory_hostname }}"
    connect_timeout: "5"
    timeout: "5"
  # delegate_to: "{{ delegation_host }}"
  # vars:
  #   ansible_ssh_port: 2222
  ignore_errors: true
  register: ssh_port_default
  when:
    - ssh_port is defined
    - ssh_port.state is undefined

- name: Set backend SSH port to 22
  set_fact:
    ansible_ssh_port: 22
  when:
    - ssh_port_default.state is defined
The playbook itself:
- hosts: "example_cloud_host01"
  gather_facts: false
  roles:
    - role: check_port    # check if we already have the correct port or need 22
    - role: sshd          # set port to 2222 and restart sshd
    - role: check_port    # check the port again, after it has been changed
    - role: install_apps
    - role: configure_apps
Error message:
with delegate_to for the task Check backend port 2222:
fatal: [example_cloud_host01 -> ansible@123.123.123.123]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 123.123.123.123 port 22: Connection refused", "unreachable": true}
This confuses me, because I expect the delegation host to use the same ansible_ssh_port as the target host.
Without delegate_to for the tasks Check backend port 2222 and Check backend port 22:
fatal: [example_cloud_host01]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"}, "changed": false, "elapsed": 5, "msg": "Timeout when waiting for example_cloud_host01:2222"}
fatal: [example_cloud_host01]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/bin/python3"}, "changed": false, "elapsed": 5, "msg": "Timeout when waiting for example_cloud_host01:22"}
I have no idea why this happens. If I try the connection manually, it works fine.
What I tried so far:
I played around with delegate_to, vars, ... as mentioned above.
I wanted to see if I can provide delegate_to with the proper port 2222 for the jump host.
I wanted to see if I can run this without delegate_to (since it should automatically use the proxy command to run through the jump host anyway).
Neither way gave me a solution on how to connect to my second tier servers after changing the SSH port.
Right now, I split the playbook into two:
deploy sshd config with port 22
run our full deploy afterwards on port 2222
I would do the following (I somewhat tested this with fake values in the inventory, using localhost as a jumphost to check ports on localhost as well).
Edit: modified my examples to try to show you a way after your comments on your question and on this answer.
Inventory:
---
all:
  vars:
    ansible_ssh_port: 2222
proxies:
  vars:
    ansible_user: ansible
  hosts:
    example_jumphost1:
      ansible_host: 123.123.123.123
    example_jumphost2:
      ansible_host: 231.231.231.231
    # ... and more jump hosts ...
cloud_hosts:
  vars:
    jump_vars: "{{ hostvars[jump_host] }}"
    ansible_ssh_common_args: '-oProxyCommand="ssh -W %h:%p -oStrictHostKeyChecking=no -q {{ jump_vars.ansible_user }}@{{ jump_vars.ansible_host }} -p {{ jump_vars.ansible_ssh_port | d(22) }}"'
  children:
    cloud_hosts_north:
      vars:
        jump_host: example_jumphost1
      hosts:
        example_cloud_host01:
        example_cloud_host02:
        # ... and more ...
    cloud_hosts_south:
      vars:
        jump_host: example_jumphost2
      hosts:
        example_cloud_host03:
        example_cloud_host04:
        # ... and more ...
# ... and more cloud groups ...
Tasks to check ports:
- name: "Check backend inventory configured port {{ ansible_ssh_port }}"
  wait_for:
    port: "{{ ansible_ssh_port }}"
    state: "started"
    host: "{{ inventory_hostname }}"
    connect_timeout: "5"
    timeout: "5"
  delegate_to: "{{ jump_host }}"
  ignore_errors: true
  register: ssh_port

- name: "Check backend default ssh port if relevant"
  wait_for:
    port: "22"
    state: "started"
    host: "{{ inventory_hostname }}"
    connect_timeout: "5"
    timeout: "5"
  delegate_to: "{{ jump_host }}"
  ignore_errors: true
  register: ssh_port_default
  when: ssh_port is failed

- name: "Set backend SSH port to 22 if we did not change it yet"
  set_fact:
    ansible_ssh_port: 22
  when:
    - ssh_port_default is not skipped
    - ssh_port_default is success
Please note that if the checks for ports 22 and 2222 both fail, your configured port will still be 2222, and any later task will obviously fail. You might want to fail fast after the checks for the relevant hosts:
- name: "Fail host if no port is available"
  fail:
    msg:
      - "Host {{ inventory_hostname }} does not have"
      - "any ssh port available (tested 22 and 2222)"
  when:
    - ssh_port is failed
    - ssh_port_default is failed
With this in place, you can use different targets on your play to reach the relevant hosts:
For jump hosts
Run on a single bastion host: e.g. hosts: example_jumphost1
Run on all bastion hosts: hosts: proxies
For cloud hosts
Run on all cloud hosts: hosts: cloud_hosts
Run on a single child group: e.g. hosts: cloud_hosts_north
Run on all cloud hosts except a subgroup: e.g. hosts: cloud_hosts:!cloud_hosts_south
For more, see Ansible patterns.
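For example, a play header using one of those patterns (group names taken from the inventory sketch above):

```yaml
- hosts: cloud_hosts:!cloud_hosts_south   # all cloud hosts except the southern group
  gather_facts: false
  tasks:
    - ping:
```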

Ansible wait_for not response correct error

I'm trying to test a batch of connections, but for the connections that fail, all error responses are "Timeout", even though I know (I tested) that some of them are actually "no route to host".
How can I get the correct error with wait_for in Ansible?
- name: Test connectivity flow
  wait_for:
    host: "{{ item.destination_ip }}"
    port: "{{ item.destination_port }}"
    state: started    # Port should be open
    delay: 0          # No wait before first check (sec)
    timeout: 3        # Stop checking after timeout (sec)
  delegate_to: "{{ item.source_ip }}"
  failed_when: false
  register: test_connectivity_flow_result

- name: Append result message to result list msg
  set_fact:
    result_list_msg: "{% if test_connectivity_flow_result.msg is defined %}{{ result_list_msg + [test_connectivity_flow_result.msg] }}{% else %}{{ result_list_msg + [ '' ] }}{% endif %}"
Current response: Timeout when waiting for 1.1.1.1:1040
Expected response: No route to host 1.1.1.1:1040
Quoting the title of the documentation of the wait_for module
wait_for – Waits for a condition before continuing
If I "rephrase" the condition you have written, it reads something like: "wait for host X to be a resolvable destination and for the given port to be open on that destination; retry with no delay and time out after 3 seconds".
This could typically be a test you launch after starting a new VM and registering it in a DNS: you wait for the DNS to propagate AND for the SSH port to become available.
In your case, you get a timeout because your hostname never becomes a resolvable address.
If you specifically want to test that there is no route to the host, and don't want to wait until the route eventually becomes available, you need to do that another way. Here is a simple example playbook with the ping module:
---
- name: Very basic connection test
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Test if host is reachable (will report no route if so)
      ping:
      delegate_to: nonexistent.host.local
Which results in:
PLAY [Very basic connection test] *****************************************************
TASK [Test if host is reachable (will report no route if so)] *************************
fatal: [localhost]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: Could not resolve hostname nonexistent.host.local: Name or service not known", "unreachable": true}
PLAY RECAP ****************************************************************************
localhost : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
Note that the ping module:
Will report that there is no route to host if so
Will implicitly try to connect to port 22
Will make sure that the host has python installed and is ready to be managed through Ansible.
If the host you are trying to check should not meet all of the above conditions (e.g. you want the test to succeed even if python is not installed), you will need another approach. Running an ICMP ping through the command module is one of several solutions.
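As a sketch of that last option (variable names reuse the ones from the question above; the ping flags assume a Linux iputils ping on the source host):

```yaml
- name: ICMP ping from the source host (no python or ssh needed on the destination)
  command: ping -c 1 -W 2 {{ item.destination_ip }}
  delegate_to: "{{ item.source_ip }}"
  register: icmp_result
  failed_when: false    # inspect icmp_result.rc and icmp_result.stderr instead
```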
I ended up doing something like this:
- name: Check if port {{ R_PORT }} is open. If it is, let's fail and investigate manually.
  shell: ss -ltpn | grep :{{ R_PORT }} | wc -l
  register: open_redis_port

- debug: msg="{{ open_redis_port.stdout }}"

- name: Fail playbook execution if port {{ R_PORT }} is open
  fail:
    msg: Port {{ R_PORT }} is open. Failing, please investigate manually.
  when: open_redis_port.stdout == "2" or open_redis_port.stdout == "1"
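A possible alternative to shelling out to ss: wait_for itself can assert that a port is closed via state: stopped, failing after the timeout if something is still listening (R_PORT as in the snippet above):

```yaml
- name: Fail if something is still listening on port {{ R_PORT }}
  wait_for:
    port: "{{ R_PORT }}"
    state: stopped   # wait until nothing listens on the port
    timeout: 3       # fail if it is still open after 3 seconds
```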

Ansible how to ignore unreachable hosts before ansible 2.7.x

I'm using Ansible to run a command against multiple servers at once. I want to ignore any hosts that fail with the '"SSH Error: data could not be sent to remote host \"1.2.3.4\". Make sure this host can be reached over ssh"' error, because some of the hosts in the list will be offline. How can I do this? Is there a default option in Ansible to ignore offline hosts without failing the playbook? Is there an option to do this with a single Ansible CLI argument outside of a playbook?
Update: I am aware that ignore_unreachable: true works for Ansible 2.7 or greater, but I am working in an Ansible 2.6.1 environment.
I found a good solution here. You ping each host locally to see if you can connect and then run commands against the hosts that passed:
---
- hosts: all
  connection: local
  gather_facts: no
  tasks:
    - block:
        - name: determine hosts that are up
          wait_for_connection:
            timeout: 5
          vars:
            ansible_connection: ssh
        - name: add devices with connectivity to the "running_hosts" group
          group_by:
            key: "running_hosts"
      rescue:
        - debug: msg="cannot connect to {{ inventory_hostname }}"

- hosts: running_hosts
  gather_facts: no
  tasks:
    - command: date
With a current version of Ansible (2.8), something like this is possible:
- name: identify reachable hosts
  hosts: all
  gather_facts: false
  ignore_errors: true
  ignore_unreachable: true
  tasks:
    - block:
        - name: this does nothing
          shell: exit 1
          register: result
      always:
        - add_host:
            name: "{{ inventory_hostname }}"
            group: reachable

- name: Converge
  hosts: reachable
  gather_facts: false
  tasks:
    - debug: msg="{{ inventory_hostname }} is reachable"

ansible playbook [setup] gather facts - SSH UNREACHABLE Connection timed out during banner

I'm on a Mac machine.
$ which ansible
/Library/Frameworks/Python.framework/Versions/3.5/bin/ansible
On other systems, ansible may be located at a more generic path such as /usr/bin/ansible (e.g. on CentOS/Ubuntu).
$ ansible --version
ansible 2.2.0.0
Running the following playbook works fine from my other vagrant / Ubuntu box.
The playbook file looks like:
- hosts: all
  become: true
  gather_facts: true
  roles:
    - a_role_which_just_say_hello_world_debug_msg
From my local machine, I can successfully SSH to the target servers (without any password, as I have already added the .pem key file using ssh-add), yet those same servers fail in Ansible's [setup] (gather facts) step during the playbook run.
On the Mac machine, I'm sometimes (not every time) getting this error: Failed to connect to the host via ssh: Connection timed out during banner exchange.
$ ansible-playbook -i inventory -l tag_cluster_mycluster myplabook.yml
PLAY [all] *********************************************************************
TASK [setup] *******************************************************************
ok: [myclusterSomeServer01_i_07f318688f6339971]
fatal: [myclusterSomeServer02_i_03df6f1f988e665d9]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Connection timed out during banner exchange\r\n", "unreachable": true}
OK, tried a couple of times, same behavior: out of the 15 servers in the mycluster cluster, the [setup] step fails while gathering facts on one run and works fine on the next.
Retried:
$ ansible-playbook -i inventory -l tag_cluster_mycluster myplabook.yml
PLAY [all] *********************************************************************
TASK [setup] *******************************************************************
ok: [myclusterSomeServer01_i_07f318688f6339971]
ok: [myclusterSomeServer02_i_03df6f1f988e665d9]
ok: [myclusterSomeServer03_i_057dfr56u88e665d9]
...
.....more...this time it worked for all servers.
As you can see above, this time the step worked fine for all servers. The same issue (SSH connection timed out) also happens during some tasks where I'm trying to install something using the Ansible yum module. If I try again, it works fine for the server that failed last time, but then it may fail for another server that succeeded before. In other words, the behavior is random.
My /etc/ansible/ansible.cfg file has:
[ssh_connection]
scp_if_ssh = True
Adding the following timeout setting to the /etc/ansible/ansible.cfg config file worked once I increased it to 25. When it was 10 or 15, I still saw connection-timed-out-during-banner-exchange errors on some servers.
[defaults]
timeout = 25
[ssh_connection]
scp_if_ssh = True
Apart from the above, I had to use serial: N or serial: "N%" (where N is a number) to run my playbook on N servers, or N percent of the servers, at a time; then it worked fine.
i.e.
- hosts: all
  become: true
  gather_facts: true
  serial: 2
  #serial: "10%"
  #serial: "{{ serialNumber }}"
  #serial: "{{ serialNumber }}%"
  vars:
    - serialNumber: 5
  roles:
    - a_role_which_just_say_hello_world_debug_msg

Failed to connect to host via SSH on Vagrant with Ansible Playbook

I was not able to find where the actual problem is. I executed the playbook below with my private key:
---
- hosts: localhost
  gather_facts: false
  sudo: yes
  tasks:
    - name: Install package libpcre3-dev
      apt: name=libpcre3-dev state=latest
But I am getting the error below on the Vagrant Ubuntu machine:
PLAY [localhost] *********************************************************************
TASK [Install package ] ***************************************************
fatal: [vagrant]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Permission denied (publickey,password).\r\n", "unreachable": true}
        to retry, use: --limit @/home/vagrant/playbooks/p1.retry
PLAY RECAP *********************************************************************
vagrant : ok=0 changed=0 unreachable=1 failed=0
What could be the problem, and what would you suggest?
You are running a playbook against localhost with an SSH connection (the default in Ansible) and this fails, most likely because you never configured the account on your machine to accept a key from itself. Using defaults, you'd need to add ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys.
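If you do want to keep the SSH connection, the missing self-authorization could be set up with a one-off local task like the following (a sketch; the key path is an assumption and depends on your setup):

```yaml
- hosts: localhost
  connection: local   # run locally, since SSH to ourselves does not work yet
  tasks:
    - name: Allow this account to SSH into itself (assumed key path)
      authorized_key:
        user: "{{ ansible_user_id }}"
        key: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_rsa.pub') }}"
```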
Instead, to run locally, add connection: local to the play:
---
- hosts: localhost
  connection: local
  tasks:
    - debug:
And it will give you a proper response:
TASK [debug] *******************************************************************
ok: [localhost] => {
"msg": "Hello world!"
}