I am getting this log from my swarm manager container:
time="2016-04-15T02:47:59Z" level=debug msg="Failed to validate pending node: lookup node1 on 10.0.2.3:53: server misbehaving" Addr="node1:2376"
I have set up a github repo to reproduce my problem: https://github.com/casertap/playing-with-swarm-tls
I am running a cluster of 2 machines (built with Vagrant):
$script2 = <<STOP
service docker stop
sed -i 's/DOCKER_OPTS=/DOCKER_OPTS="-H tcp:\\/\\/0.0.0.0:2376 -H unix:\\/\\/\\/var\\/run\\/docker.sock --tlsverify --tlscacert=\\/home\\/vagrant\\/.certs\\/ca.pem --tlscert=\\/home\\/vagrant\\/.certs\\/cert.pem --tlskey=\\/home\\/vagrant\\/.certs\\/key.pem"/' /etc/init/docker.conf
service docker start
STOP
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.define "node1" do |app|
    app.vm.network "private_network", ip: "192.168.33.10"
    app.vm.provision "file", source: "ca.pem", destination: "~/.certs/ca.pem"
    app.vm.provision "file", source: "node1-cert.pem", destination: "~/.certs/cert.pem"
    app.vm.provision "file", source: "node1-priv-key.pem", destination: "~/.certs/key.pem"
    app.vm.provision "file", source: "node1.csr", destination: "~/.certs/node1.csr"
    app.vm.provision "docker"
    app.vm.provision :shell, :inline => $script2
  end

  config.vm.define "swarm" do |app|
    app.vm.network "private_network", ip: "192.168.33.12"
    app.vm.provision "shell", inline: "echo '192.168.33.10 node1' >> /etc/hosts"
    app.vm.provision "shell", inline: "echo '192.168.33.12 swarm' >> /etc/hosts"
    app.vm.provision "docker"
    app.vm.provision "file", source: "ca.pem", destination: "~/.certs/ca.pem"
    app.vm.provision "file", source: "swarm-cert.pem", destination: "~/.certs/cert.pem"
    app.vm.provision "file", source: "swarm-priv-key.pem", destination: "~/.certs/key.pem"
    app.vm.provision "file", source: "swarm.csr", destination: "~/.certs/swarm.csr"
  end
end
As you can see, after this sed substitution my node1 /etc/init/docker.conf has the options:
DOCKER_OPTS="-H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --tlsverify --tlscacert=/home/vagrant/.certs/ca.pem --tlscert=/home/vagrant/.certs/cert.pem --tlskey=/home/vagrant/.certs/key.pem"
I do
vagrant up
then I connect to swarm
vagrant ssh swarm
export TOKEN=$(docker run swarm create)
#dd182b8d2bc8c03f417376296558ba29
docker run -d swarm join --advertise node1:2376 token://dd182b8d2bc8c03f417376296558ba29
node1 is defined in the /etc/hosts file (on the swarm VM), as you can see in the Vagrantfile above.
Start the swarm manager with debug log level (without -d):
docker run -p 3376:3376 -v /home/vagrant/.certs:/certs:ro swarm -l debug manage --tlsverify --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem --host=0.0.0.0:3376 token://dd182b8d2bc8c03f417376296558ba29
The log is showing me:
time="2016-04-15T02:47:59Z" level=debug msg="Failed to validate pending node: lookup node1 on 10.0.2.3:53: server misbehaving" Addr="node1:2376"
The node1 IP address in /etc/hosts is actually:
192.168.33.10 node1
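One way to double-check name resolution and TLS connectivity from the swarm VM itself (outside any container) is the following sanity check, reusing the certs provisioned above:
getent hosts node1
docker --tlsverify --tlscacert=/home/vagrant/.certs/ca.pem --tlscert=/home/vagrant/.certs/cert.pem --tlskey=/home/vagrant/.certs/key.pem -H tcp://node1:2376 version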
It seems that Docker is trying to look up the node1 alias via the wrong network's DNS server?
========== more info:
You can check this URL to see whether the discovery service found your node1, and it does:
https://discovery.hub.docker.com/v1/clusters/dd182b8d2bc8c03f417376296558ba29
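For example, from the command line (the comment shows roughly what I'd expect if the join worked; the exact formatting may differ):
curl https://discovery.hub.docker.com/v1/clusters/dd182b8d2bc8c03f417376296558ba29
# should return a JSON list containing "node1:2376"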
Now if you run the swarm manager with -d and do:
vagrant@vagrant-ubuntu-trusty-64:~$ docker --tlsverify --tlscacert=/home/vagrant/.certs/ca.pem --tlscert=/home/vagrant/.certs/cert.pem --tlskey=/home/vagrant/.certs/key.pem -H swarm:3376 info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 1
(unknown): node1:2376
└ Status: Pending
└ Containers: 0
└ Reserved CPUs: 0 / 0
└ Reserved Memory: 0 B / 0 B
└ Labels:
└ Error: (none)
└ UpdatedAt: 2016-04-15T03:03:28Z
└ ServerVersion:
Plugins:
Volume:
Network:
Kernel Version: 3.13.0-85-generic
Operating System: linux
Architecture: amd64
CPUs: 0
Total Memory: 0 B
Name: ee85273cbb64
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
You can see the node status is Pending.
Although you define node1 in your machine's /etc/hosts, the container the Swarm manager runs in doesn't have node1 in its /etc/hosts file. By default a container doesn't share the host's file system. See https://docs.docker.com/engine/userguide/containers/dockervolumes/. The Swarm manager tries to look up node1 through the DNS resolver and fails.
There are several options to resolve this:
Use a resolvable FQDN so the Swarm manager in the container can resolve the node.
Or provide node1's IP address in the swarm join command.
Or pass the host's /etc/hosts file into the Swarm manager container using the -v option. See the link above.
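For example, here is a minimal sketch of the last two options, reusing the token and paths from the question:
# Option: advertise the node's IP instead of a hostname when joining
docker run -d swarm join --advertise 192.168.33.10:2376 token://dd182b8d2bc8c03f417376296558ba29

# Option: bind-mount the host's /etc/hosts into the manager container so it can resolve node1
docker run -p 3376:3376 \
  -v /home/vagrant/.certs:/certs:ro \
  -v /etc/hosts:/etc/hosts:ro \
  swarm -l debug manage --tlsverify \
  --tlscacert=/certs/ca.pem --tlscert=/certs/cert.pem --tlskey=/certs/key.pem \
  --host=0.0.0.0:3376 token://dd182b8d2bc8c03f417376296558ba29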
Related
I deployed https://github.com/bitnami/charts/tree/master/bitnami/redis-cluster
with the following command:
helm install redis-cluster bitnami/redis-cluster --create-namespace -n redis -f redis-values.yaml
redis-values.yaml
cluster:
  init: true
  nodes: 6
  replicas: 1
podDisruptionBudget:
  minAvailable: 2
persistence:
  size: 1Gi
password: "redis#pass"
redis:
  configmap: |-
    maxmemory 600mb
    maxmemory-policy allkeys-lru
    maxclients 40000
    cluster-require-full-coverage no
    cluster-allow-reads-when-down yes
sysctlImage:
  enabled: true
  mountHostSys: true
  command:
    - /bin/sh
    - -c
    - |-
      insta
      sysctl -w net.core.somaxconn=10000
      echo never > /host-sys/kernel/mm/transparent_hugepage/enabled
      # echo never > /host-sys/kernel/mm/transparent_hugepage/defrag
metrics:
  enabled: true
Now the cluster works fine, but the issue is that if I delete any pod, Redis goes down and I start getting errors from Redis.
Here is my Quarkus config for connecting:
quarkus.redis.hosts=redis://redis-cluster.redis.svc.local:6379
quarkus.redis.master-name=redis-cluster
quarkus.redis.password=redis#pass
quarkus.redis.client-type=cluster
Don't connect through the service; use the individual nodes instead. Change from:
quarkus.redis.hosts=redis://redis-cluster:6379
to:
quarkus.redis.hosts=redis://redis-cluster-0.redis-cluster-headless.redis.svc.cluster.local:6379,redis://redis-cluster-1.redis-cluster-headless.redis.svc.cluster.local:6379,redis://redis-cluster-2.redis-cluster-headless.redis.svc.cluster.local:6379,redis://redis-cluster-3.redis-cluster-headless.redis.svc.cluster.local:6379,redis://redis-cluster-4.redis-cluster-headless.redis.svc.cluster.local:6379,redis://redis-cluster-5.redis-cluster-headless.redis.svc.cluster.local:6379
The format for each host is the following:
POD-NAME.HEADLESS-SVC-NAME.NAMESPACE.svc.cluster.local:PORT
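A quick way to confirm that these per-pod names resolve inside the cluster (a sketch; it assumes kubectl access and the busybox image, and uses the first node name from above):
kubectl run -n redis dns-test --rm -it --restart=Never --image=busybox -- nslookup redis-cluster-0.redis-cluster-headless.redis.svc.cluster.local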
I've been following the instructions on GitHub to set up an Azure Files volume.
apiVersion: v1
kind: Secret
metadata:
  name: azure-files-secret
type: Opaque
data:
  azurestorageaccountname: Yn...redacted...=
  azurestorageaccountkey: 3+w52/...redacted...MKeiiJyg==
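For reference, the two data values should be the base64 of the plain account name and key; something like the following produces them (placeholder values, and note -n so no trailing newline ends up in the secret):
echo -n 'mystorageaccountname' | base64
echo -n 'myRawStorageAccountKey==' | base64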
I then in my pod config have:
...stuff
volumeMounts:
- mountPath: /var/ccd
  name: openvpn-ccd
...more stuff
volumes:
- name: openvpn-ccd
  azureFile:
    secretName: azure-files-secret
    shareName: az-files
    readOnly: false
Creating the containers then fails:
MountVolume.SetUp failed for volume "kubernetes.io/azure-file/007adb39-30df-11e7-b61e-000d3ab6ece2-openvpn-ccd" (spec.Name: "openvpn-ccd") pod "007adb39-30df-11e7-b61e-000d3ab6ece2" (UID: "007adb39-30df-11e7-b61e-000d3ab6ece2") with: mount failed: exit status 32 Mounting command: mount Mounting arguments: //xxx.file.core.windows.net/az-files /var/lib/kubelet/pods/007adb39-30df-11e7-b61e-000d3ab6ece2/volumes/kubernetes.io~azure-file/openvpn-ccd cifs [vers=3.0,username=xxx,password=xxx,dir_mode=0777,file_mode=0777] Output: mount error(13): Permission denied Refer to the mount.cifs(8) manual page (e.g. man mount.cifs)
I was previously getting password errors, as I hadn't base64-encoded the account key, but that is resolved now, and I get the more generic Permission denied error, which I suspect is maybe on the mount point rather than the file storage. In any case, I need advice on how to troubleshoot further, please.
This appears to be an auth error to your storage account. Un-base64 your password, and then validate it using an Ubuntu image in the same region as the storage account.
Here is a sample script to validate that the Azure Files share mounts correctly:
if [ $# -ne 3 ]; then
    echo "you must pass arguments STORAGEACCOUNT STORAGEACCOUNTKEY SHARE"
    exit 1
fi
ACCOUNT=$1
ACCOUNTKEY=$2
SHARE=$3
MOUNTSHARE=/mnt/${SHARE}
apt-get update && apt-get install -y cifs-utils
mkdir -p /mnt/$SHARE
mount -t cifs //${ACCOUNT}.file.core.windows.net/${SHARE} ${MOUNTSHARE} -o vers=2.1,username=${ACCOUNT},password=${ACCOUNTKEY}
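An example invocation, assuming the script above is saved as check_share.sh (the name and values here are placeholders; pass the raw, un-base64'd key):
sudo bash check_share.sh mystorageaccount 'RAW_STORAGE_ACCOUNT_KEY' az-files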
I get a strange error with Ansible. The first role works fine, but when Ansible tries to execute the second one, it fails because of an SSH error.
Environment:
OS: CentOS 7
Ansible version: 2.2.1.0
Python version: 2.7.5
OpenSSH version: OpenSSH_6.6.1p1, OpenSSL 1.0.1e-fips 11 Feb 2013
Ansible command which is executed:
ansible-playbook -vvvv -i inventory/dev playbook_update_system.yml --limit "db[0]"
Playbook:
- name: "HUB Playbook | Updating system packages on {{ ansible_hostname }}"
hosts: release_first_half
roles:
- upgrade_system_package
- reboot_server
Role: upgrade_system_package:
- name: "upgrading CentOS system packages on {{ ansible_hostname }}"
shell: sudo puppet apply -e 'exec{"upgrade-package":command => "/usr/bin/yum clean all; /usr/bin/yum -y update;"}'
when: ansible_distribution == 'CentOS' and 'cassandra' not in group_names
Role: reboot_server:
- name: "reboot CentOS [{{ ansible_hostname }}] server"
shell: sudo puppet apply -e 'exec{"reboot-os":command => "/usr/sbin/reboot"}'
when: ansible_distribution == 'CentOS' and 'cassandra' not in group_names
Current behavior:
Connection to the "db1" node and execution of the role "upgrade_system_package" => OK
Attempt to connect to "db1" and execute the role "reboot_server" => fails due to SSH.
Error message returned by Ansible:
fatal: [db1]: UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: OpenSSH_6.6.1, OpenSSL 1.0.1e-fips 11 Feb 2013\r\ndebug1: Reading configuration data /USR/newtprod/.ssh/config\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 56: Applying options for *\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 64994\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Control master terminated unexpectedly\r\nShared connection to db1 closed.\r\n",
"unreachable": true
}
I don't understand, because the previous role was executed successfully on this node. Moreover, we have a lot of playbooks that use the same inventory file and they work fine. I tried on another node too, with the same result.
It's a simple and pretty well-known issue: the shutdown process causes the SSH daemon to quit, and this breaks the current SSH session (you get the "broken pipe" error). The server reboots properly, but the Ansible flow gets interrupted.
You need to add a delay to your shell command and run it with the async option, so that Ansible's SSH session can finish before it gets killed.
shell: sleep 5; sudo puppet apply -e 'exec{"reboot-os":command => "/usr/sbin/reboot"}'
async: 1   # must be non-zero for the task to run asynchronously (fire-and-forget with poll: 0)
poll: 0
I would like to provision my three nodes from the last one by using Ansible.
My host machine is Windows 10.
My Vagrantfile looks like:
Vagrant.configure("2") do |config|
(1..3).each do |index|
config.vm.define "node#{index}" do |node|
node.vm.box = "ubuntu"
node.vm.box = "../boxes/ubuntu_base.box"
node.vm.network :private_network, ip: "192.168.10.#{10 + index}"
if index == 3
node.vm.provision :setup, type: :ansible_local do |ansible|
ansible.playbook = "playbook.yml"
ansible.provisioning_path = "/vagrant/ansible"
ansible.inventory_path = "/vagrant/ansible/hosts"
ansible.limit = :all
ansible.install_mode = :pip
ansible.version = "2.0"
end
end
end
end
end
My playbook looks like:
---
# my little playbook
- name: My little playbook
  hosts: webservers
  gather_facts: false
  roles:
    - create_user
My hosts file looks like:
[webservers]
192.168.10.11
192.168.10.12
[dbservers]
192.168.10.11
192.168.10.13
[all:vars]
ansible_connection=ssh
ansible_ssh_user=vagrant
ansible_ssh_pass=vagrant
After executing vagrant up --provision I got the following error:
Bringing machine 'node1' up with 'virtualbox' provider...
Bringing machine 'node2' up with 'virtualbox' provider...
Bringing machine 'node3' up with 'virtualbox' provider...
==> node3: Running provisioner: setup (ansible_local)...
node3: Running ansible-playbook...
PLAY [My little playbook] ******************************************************
TASK [create_user : Create group] **********************************************
fatal: [192.168.10.11]: FAILED! => {"failed": true, "msg": "ERROR! Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this. Please add this host's fingerprint to your known_hosts file to manage this host."}
fatal: [192.168.10.12]: FAILED! => {"failed": true, "msg": "ERROR! Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this. Please add this host's fingerprint to your known_hosts file to manage this host."}
PLAY RECAP *********************************************************************
192.168.10.11 : ok=0 changed=0 unreachable=0 failed=1
192.168.10.12 : ok=0 changed=0 unreachable=0 failed=1
Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.
I extended my Vagrantfile with ansible.limit = :all and added [all:vars] to the hosts file, but still cannot get past the error.
Has anyone encountered the same issue?
Create a file ansible/ansible.cfg in your project directory (i.e. ansible.cfg in the provisioning_path on the target) with the following contents:
[defaults]
host_key_checking = false
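If you prefer to script it (for example from a provisioner), something like this creates the file; the ansible/ path matches the provisioning_path layout described above:
mkdir -p ansible
printf '[defaults]\nhost_key_checking = false\n' > ansible/ansible.cfg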
This assumes that your Vagrant box already has sshpass installed. It's unclear whether it does: the error message in your question suggests it was installed (otherwise it would be "ERROR! to use the 'ssh' connection type with passwords, you must install the sshpass program"), but in your answer you add it explicitly (sudo apt-get install sshpass), as if it was not.
I'm using Ansible version 2.6.2, and the solution with host_key_checking = false doesn't work.
Adding the environment variable export ANSIBLE_HOST_KEY_CHECKING=False skips the fingerprint check.
This error can also be solved by simply exporting the ANSIBLE_HOST_KEY_CHECKING variable:
export ANSIBLE_HOST_KEY_CHECKING=False
source: https://github.com/ansible/ansible/issues/9442
This SO post gave the answer.
I just extended the known_hosts file on the machine that is responsible for the provisioning like this:
Snippet from my modified Vagrantfile:
...
if index == 3
node.vm.provision :pre, type: :shell, path: "install.sh"
node.vm.provision :setup, type: :ansible_local do |ansible|
...
My install.sh looks like:
# add web/database hosts to known_hosts (IP is defined in Vagrantfile)
ssh-keyscan -H 192.168.10.11 >> /home/vagrant/.ssh/known_hosts
ssh-keyscan -H 192.168.10.12 >> /home/vagrant/.ssh/known_hosts
ssh-keyscan -H 192.168.10.13 >> /home/vagrant/.ssh/known_hosts
chown vagrant:vagrant /home/vagrant/.ssh/known_hosts
# reload ssh in order to load the known hosts
/etc/init.d/ssh reload
I had a similar challenge when working with Ansible 2.9.6 on Ubuntu 20.04.
When I run the command:
ansible all -m ping -i inventory.txt
I get the error:
target | FAILED! => {
"msg": "Using a SSH password instead of a key is not possible because Host Key checking is enabled and sshpass does not support this. Please add this host's fingerprint to your known_hosts file to manage this host."
}
Here's how I fixed it:
When you install Ansible, it creates a file called ansible.cfg, which can be found in the /etc/ansible directory. Simply open the file:
sudo nano /etc/ansible/ansible.cfg
Uncomment this line to disable SSH host key checking:
host_key_checking = False
Now save the file and you should be fine.
Note: You could also try adding the host's fingerprint to your known_hosts file by SSHing into the server from your machine, which prompts you to save the host's fingerprint:
promisepreston@ubuntu:~$ ssh myusername@192.168.43.240
The authenticity of host '192.168.43.240 (192.168.43.240)' can't be established.
ECDSA key fingerprint is SHA256:9Zib8lwSOHjA9khFkeEPk9MjOE67YN7qPC4mm/nuZNU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '192.168.43.240' (ECDSA) to the list of known hosts.
myusername@192.168.43.240's password:
Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-53-generic x86_64)
That's all.
I hope this helps
Run the command below; it resolved my issue:
export ANSIBLE_HOST_KEY_CHECKING=False && ansible-playbook -i
All the provided solutions require changes to a global config file or setting an environment variable, which creates problems when onboarding new people.
Instead, you can add the following variable to your inventory or host vars:
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
Add ansible_ssh_common_args='-o StrictHostKeyChecking=no' to your inventory, like:
[all:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[all:children]
servers
[servers]
host1
OR:
[servers]
host1 ansible_ssh_common_args='-o StrictHostKeyChecking=no'
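Either way, a quick check that the option is picked up (the inventory path here is just an example):
ansible all -m ping -i inventory.txt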
I'm unable to get an NFS volume mounted for a Docker Swarm, and the lack of proper official documentation regarding the --mount syntax (https://docs.docker.com/engine/reference/commandline/service_create/) doesn't help.
I have tried basically this command line to create a simple nginx service with a /kkk directory mounted to an NFS volume:
docker service create --mount type=volume,src=vol_name,volume-driver=local,dst=/kkk,volume-opt=type=nfs,volume-opt=device=192.168.1.1:/your/nfs/path --name test nginx
The command line is accepted and the service is scheduled by Swarm, but the container never reaches the "running" state, and Swarm tries to start a new instance every few seconds. I set the daemon to debug, but no error regarding the volume shows up...
What is the right syntax to create a service with an NFS volume?
Thanks a lot
I found an article here that shows how to mount an NFS share (and it works for me): http://collabnix.com/docker-1-12-swarm-mode-persistent-storage-using-nfs/
sudo docker service create \
--mount type=volume,volume-opt=o=addr=192.168.x.x,volume-opt=device=:/data/nfs,volume-opt=type=nfs,source=vol_collab,target=/mount \
--replicas 3 --name testnfs \
alpine /bin/sh -c "while true; do echo 'OK'; sleep 2; done"
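To check that the tasks actually reach the running state and that the volume was created with the expected options (names taken from the command above; run the volume inspect on a node where a task is scheduled):
docker service ps testnfs
docker volume inspect vol_collab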
Update:
In case you want to use it with docker-compose, you can do it as follows:
version: '3'
services:
  alpine:
    image: alpine
    volumes:
      - vol_collab:/mount
    deploy:
      mode: replicated
      replicas: 2
    command: /bin/sh -c "while true; do echo 'OK'; sleep 2; done"
volumes:
  vol_collab:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.xx.xx
      device: ":/data/nfs"
and then run it with
docker stack deploy -c docker-compose.yml test
You could also do this in Docker Compose to create an NFS volume:
data:
  driver: local
  driver_opts:
    type: "nfs"
    o: addr=<nfs-Host-domain-name>,rw,sync,nfsvers=4.1
    device: ":<path to directory in nfs server>"