Sometimes cannot connect via SSH to GCP VM for a certain amount of time

I am setting up a project with Terraform and Google Compute. In this project I start up multiple VMs and configure them directly afterwards via SSH. Sometimes there are one or more VMs to which I cannot connect via SSH with my usual account. The problem magically disappears after approximately 5 minutes, even if I do not do anything. After this time everything works normally again. I am, however, able to SSH into the instance through the web interface during the downtime.
I am not able to reliably reproduce this issue. It just happens sometimes to a random number of VMs for about 5 minutes.
I am pretty lost on this and would really appreciate any pointer as to where I might find a solution.
Here is a short summary of the problem:
Cannot connect to GCP VM via SSH with predefined user
Only happens sometimes (issue is not reliably reproducible)
Only lasts for a few minutes (~5 minutes)
During this time I can SSH into the VM via GCP's web interface
This is the Terraform code I am using to start the instances:
module.google:
variable "project"{}
variable "credentials"{}
variable "count"{default = 1}
variable "name_machine"{}
variable "zone"{}
provider "google" {
credentials = "${var.credentials}"
project = "${var.project}"
}
resource "google_compute_instance" "vm" {
count = "${var.count}"
zone = "${var.zone}"
name = "${var.name_machine}${count.index}"
machine_type = "n1-standard-1"
boot_disk {
initialize_params {
image = "ubuntu-os-cloud/ubuntu-1604-lts"
}
}
network_interface {
network = "default"
access_config {
}
}
}
EDIT The code I use to SSH into the instance.
resource "null_resource" "node"{
provisioner "remote-exec" {
inline="${data.template_file.start_up_script.rendered}"
}
connection {
user = "${var.ssh_user}"
host = "${var.ip_address}"
type = "ssh"
private_key="${var.ssh_private_key}"
}
}
EDIT 2 Terraform Output
null_resource.node (remote-exec): Connecting to remote host via SSH...
null_resource.node (remote-exec): Host: xx.xx.xx.xx
null_resource.node (remote-exec): User: Nopx
null_resource.node (remote-exec): Password: false
null_resource.node (remote-exec): Private key: true
null_resource.node (remote-exec): SSH Agent: false
null_resource.node: Still creating... (10s elapsed)
null_resource.node (remote-exec): Connecting to remote host via SSH...
null_resource.node (remote-exec): Host: xx.xx.xx.xx
null_resource.node (remote-exec): User: Nopx
null_resource.node (remote-exec): Password: false
null_resource.node (remote-exec): Private key: true
null_resource.node (remote-exec): SSH Agent: false
null_resource.node (remote-exec): Connecting to remote host via SSH...
null_resource.node (remote-exec): Host: xx.xx.xx.xx
null_resource.node (remote-exec): User: Nopx
null_resource.node (remote-exec): Password: false
null_resource.node (remote-exec): Private key: true
null_resource.node (remote-exec): SSH Agent: false
null_resource.node: Still creating... (20s elapsed)
null_resource.node (remote-exec): Connecting to remote host via SSH...
null_resource.node (remote-exec): Host: xx.xx.xx.xx
null_resource.node (remote-exec): User: Nopx
null_resource.node (remote-exec): Password: false
null_resource.node (remote-exec): Private key: true
null_resource.node (remote-exec): SSH Agent: false
...

Related

Does a terraform connection w/ bastion work similarly to "ssh -J"?

I am able to connect through an existing jump server using ssh:
ssh -o "CertificateFile ~/.ssh/id_rsa-cert.pub" -J <JUMP_USER>@<JUMP_HOST> <USER>@<HOST> echo connected
And I thought this connection block would work the same way:
resource "null_resource" "connect" {
connection {
type = "ssh"
port = 22
host = "<HOST>"
user = "<USER>"
bastion_port = 22
bastion_host = "<JUMP_HOST>"
bastion_user = "<JUMP_USER>"
bastion_certificate = "~/.ssh/id_rsa-cert.pub"
agent = true
timeout = "30s"
}
provisioner "remote-exec" {
inline = [ "echo connected" ]
}
}
The ssh command succeeded:
% ssh -o "CertificateFile ~/.ssh/id_rsa-cert.pub" -J $BASTION_USER@$BASTION_HOST $USER@$HOST echo connected
connected
But the terraform run appears to end up in a retry loop:
% terraform apply
[...]
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
null_resource.connect: Destroying... [id=5962206430386145659]
null_resource.connect: Destruction complete after 0s
null_resource.connect: Creating...
null_resource.connect: Provisioning with 'remote-exec'...
null_resource.connect (remote-exec): Connecting to remote host via SSH...
null_resource.connect (remote-exec): Host: 3.87.64.117
null_resource.connect (remote-exec): User: <USER>
null_resource.connect (remote-exec): Password: false
null_resource.connect (remote-exec): Private key: false
null_resource.connect (remote-exec): Certificate: false
null_resource.connect (remote-exec): SSH Agent: true
null_resource.connect (remote-exec): Checking Host Key: false
null_resource.connect (remote-exec): Target Platform: unix
null_resource.connect (remote-exec): Using configured bastion host...
null_resource.connect (remote-exec): Host: <JUMP_HOST>
null_resource.connect (remote-exec): User: <JUMP_USER>
null_resource.connect (remote-exec): Password: false
null_resource.connect (remote-exec): Private key: false
null_resource.connect (remote-exec): Certificate: true
null_resource.connect (remote-exec): SSH Agent: true
null_resource.connect (remote-exec): Checking Host Key: false
The Connecting to remote host via SSH... and Using configured bastion host... repeat until
null_resource.connect: Still creating... [20s elapsed]
null_resource.connect: Still creating... [30s elapsed]
╷
│ Error: remote-exec provisioner error
│
│ with null_resource.connect,
│ on t.tf line 15, in resource "null_resource" "connect":
│ 15: provisioner "remote-exec" {
│
│ timeout - last error: Error connecting to bastion: ssh: handshake failed: ssh: unable to authenticate,
│ attempted methods [none publickey], no supported methods remain
I've tried a bunch of permutations without success:
setting certificate instead of bastion_certificate
using file("~/.ssh/id_rsa-cert.pub") for certificate and bastion_certificate
setting agent = false

Can't ping Terraform created droplets with Ansible

Using Terraform I have created 3 droplets on DigitalOcean. As part of the same run, Terraform writes the SSH key and an inventory.txt file into a folder.
Here is how it looks in the Terraform code:
resource "local_file" "servers_ipv4" {
content = join("\n", [
for idx, s in module.openvpn_do_infrastructure_module.servers_ipv4:
<<EOT
${var.droplet_names[idx]} ansible_host=${s} ansible_user=root ansible_ssh_private_key=openvpn_do_ssh.key
EOT
])
filename = "${path.module}/ansible/inventory.txt"
}
resource "local_file" "ssh_keys" {
content = module.openvpn_do_infrastructure_module.ssh_keys
filename = "${path.module}/ansible/openvpn_do_ssh.key"
}
Then there is the ansible folder. After the script has run and the droplets have been created, this folder contains 3 files. The first file is just ansible.cfg:
[defaults]
host_key_checking = false
inventory = ./inventory.txt
The other 2 are created by Terraform: the SSH key, openvpn_do_ssh.key, and inventory.txt:
certificate-authority-server ansible_host=123.123.123.121 ansible_user=root ansible_ssh_private_key=openvpn_do_ssh.key
openvpn-server ansible_host=123.123.123.122 ansible_user=root ansible_ssh_private_key=openvpn_do_ssh.key
nextcloud-server ansible_host=123.123.123.123 ansible_user=root ansible_ssh_private_key=openvpn_do_ssh.key
And here is the problem. When I do ansible all -m ping, I get errors:
certificate-authority-server | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: root@123.123.123.121: Permission denied (publickey).",
"unreachable": true
}
nextcloud-server | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: root@123.123.123.122: Permission denied (publickey).",
"unreachable": true
}
openvpn-server | UNREACHABLE! => {
"changed": false,
"msg": "Failed to connect to the host via ssh: root@123.123.123.123: Permission denied (publickey).",
"unreachable": true
}
Also, I can connect to those droplets with plain SSH and everything is just fine. Even when I change the permissions of the .key file, I still get the same error. I tried to get more logs with the -vvv flag, and here is the most interesting info I found:
ESTABLISH SSH CONNECTION FOR USER: root
...
<123.123.123.121> (255, b'', b"Warning: Permanently added '123.123.123.121' (ED25519) to the list of known hosts.\r\nroot@123.123.123.121: Permission denied (publickey).\r\n")
<123.123.123.121> (255, b'', b'root@123.123.123.121: Permission denied (publickey).\r\n')
I have solved this problem. This is what helped me:
First of all, I changed the extension of the SSH key file from .key to .pem.
Then I added the private_key_file line to ansible.cfg:
[defaults]
host_key_checking = false
inventory = ./inventory.txt
private_key_file = ./openvpn_do_ssh.pem
The last thing I did was add a read-only file_permission for the SSH key:
resource "local_file" "ssh_keys" {
content = module.openvpn_do_infrastructure_module.ssh_keys
filename = "${path.module}/ansible/openvpn_do_ssh.pem"
file_permission = "0400"
}
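A possible explanation for why this helped, offered as a guess rather than a confirmed diagnosis: the inventory in the question sets ansible_ssh_private_key, while the per-host variable that takes a path to a key file is ansible_ssh_private_key_file. If that is what went wrong, the key was never offered to the server, and private_key_file in ansible.cfg fixed it by supplying the key for all hosts. A minimal sketch of the inventory resource with the variable renamed, assuming the .pem file name used above and assuming Ansible is run from inside the ansible folder so that the relative path resolves:

```hcl
# Sketch only: the inventory resource from the question, with the per-host
# variable spelled ansible_ssh_private_key_file (the one that takes a path).
resource "local_file" "servers_ipv4" {
  content = join("\n", [
    for idx, s in module.openvpn_do_infrastructure_module.servers_ipv4 :
    # Assumption: the key was renamed to openvpn_do_ssh.pem and sits next to
    # inventory.txt in the ansible folder.
    "${var.droplet_names[idx]} ansible_host=${s} ansible_user=root ansible_ssh_private_key_file=openvpn_do_ssh.pem"
  ])
  filename = "${path.module}/ansible/inventory.txt"
}
```

With the key referenced per host like this, the private_key_file entry in ansible.cfg should become optional, but keeping both does no harm.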
Hope it can help someone...

Vagrant multi vm ssh connection setup works on one but not the others

I have searched many of the similar issues but can't seem to figure out the one I'm having. I have a Vagrantfile with which I setup 3 VMs. I add a public key to each VM so I can run Ansible against the boxes after vagrant up command (I don't want to use the ansible provisioner). I forward all the SSH ports on each box.
I can vagrant ssh <server_name> on to each box successfully.
With the following:
ssh vagrant@192.168.56.2 -p 2711 -i ~/.ssh/ansible <-- successful connection
ssh vagrant@192.168.56.3 -p 2712 -i ~/.ssh/ansible <-- connection error
ssh: connect to host 192.168.56.3 port 2712: Connection refused
ssh vagrant@192.168.56.4 -p 2713 -i ~/.ssh/ansible <-- connection error
ssh: connect to host 192.168.56.4 port 2713: Connection refused
And
ssh vagrant@localhost -p 2711 -i ~/.ssh/ansible <-- successful connection
ssh vagrant@localhost -p 2712 -i ~/.ssh/ansible <-- successful connection
ssh vagrant@localhost -p 2713 -i ~/.ssh/ansible <-- successful connection
Ansible can connect to the first one (vagrant@192.168.56.2) but not to the other 2 either. I can't seem to find out why it connects to one and not the others. Any ideas what I could be doing wrong?
The Ansible inventory:
{
"all": {
"hosts": {
"kubemaster": {
"ansible_host": "192.168.56.2",
"ansible_user": "vagrant",
"ansible_ssh_port": 2711
},
"kubenode01": {
"ansible_host": "192.168.56.3",
"ansible_user": "vagrant",
"ansible_ssh_port": 2712
},
"kubenode02": {
"ansible_host": "192.168.56.4",
"ansible_user": "vagrant",
"ansible_ssh_port": 2713
}
},
"children": {},
"vars": {}
}
}
The Vagrantfile:
# Define the number of master and worker nodes
NUM_MASTER_NODE = 1
NUM_WORKER_NODE = 2
PRIV_IP_NW = "192.168.56."
MASTER_IP_START = 1
NODE_IP_START = 2
# Vagrant configuration
Vagrant.configure("2") do |config|
# The most common configuration options are documented and commented below.
# For a complete reference, please see the online documentation at
# https://docs.vagrantup.com.
# default box
config.vm.box = "ubuntu/jammy64"
# automatic box update checking.
config.vm.box_check_update = false
# Provision master nodes
(1..NUM_MASTER_NODE).each do |i|
config.vm.define "kubemaster" do |node|
# Name shown in the GUI
node.vm.provider "virtualbox" do |vb|
vb.name = "kubemaster"
vb.memory = 2048
vb.cpus = 2
end
node.vm.hostname = "kubemaster"
node.vm.network :private_network, ip: PRIV_IP_NW + "#{MASTER_IP_START + i}"
node.vm.network :forwarded_port, guest: 22, host: "#{2710 + i}"
# argo and traefik access
node.vm.network "forwarded_port", guest: 8080, host: "#{8080}"
node.vm.network "forwarded_port", guest: 9000, host: "#{9000}"
# synced folder for kubernetes setup yaml
node.vm.synced_folder "sync_folder", "/vagrant_data", create: true, owner: "root", group: "root"
node.vm.synced_folder ".", "/vagrant", disabled: true
# setup the hosts, dns and ansible keys
node.vm.provision "setup-hosts", :type => "shell", :path => "vagrant/setup-hosts.sh" do |s|
s.args = ["enp0s8"]
end
node.vm.provision "setup-dns", type: "shell", :path => "vagrant/update-dns.sh"
node.vm.provision "shell" do |s|
ssh_pub_key = File.readlines("#{Dir.home}/.ssh/ansible.pub").first.strip
s.inline = <<-SHELL
echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys
echo #{ssh_pub_key} >> /root/.ssh/authorized_keys
SHELL
end
end
end
# Provision Worker Nodes
(1..NUM_WORKER_NODE).each do |i|
config.vm.define "kubenode0#{i}" do |node|
node.vm.provider "virtualbox" do |vb|
vb.name = "kubenode0#{i}"
vb.memory = 2048
vb.cpus = 2
end
node.vm.hostname = "kubenode0#{i}"
node.vm.network :private_network, ip: PRIV_IP_NW + "#{NODE_IP_START + i}"
node.vm.network :forwarded_port, guest: 22, host: "#{2711 + i}"
# synced folder for kubernetes setup yaml
node.vm.synced_folder ".", "/vagrant", disabled: true
# setup the hosts, dns and ansible keys
node.vm.provision "setup-hosts", :type => "shell", :path => "vagrant/setup-hosts.sh" do |s|
s.args = ["enp0s8"]
end
node.vm.provision "setup-dns", type: "shell", :path => "vagrant/update-dns.sh"
node.vm.provision "shell" do |s|
ssh_pub_key = File.readlines("#{Dir.home}/.ssh/ansible.pub").first.strip
s.inline = <<-SHELL
echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys
echo #{ssh_pub_key} >> /root/.ssh/authorized_keys
SHELL
end
end
end
end
Your Vagrantfile confirms what I suspected:
You define port forwarding as follows:
node.vm.network :forwarded_port, guest: 22, host: "#{2710 + i}"
That means, port 22 of the guest is made reachable on the host under port 2710+i. For your 3 VMs, from the host's point of view, this means:
192.168.2.1:22 -> localhost:2711
192.168.2.2:22 -> localhost:2712
192.168.2.3:22 -> localhost:2713
As IP addresses for your VMs you have defined the range 192.168.2.0/24, but you try to access the range 192.168.56.0/24.
If a Private IP address is defined (for your 1st node e.g. 192.168.2.2), Vagrant implements this in the VM on VirtualBox as follows:
Two network adapters are defined for the VM:
NAT: this gives the VM Internet access
Host-Only: this gives the host access to the VM via IP 192.168.2.2.
For each /24 network, VirtualBox (and Vagrant) creates a separate VirtualBox Host-Only Ethernet Adapter, and the host is .1 on each of these networks.
What this means for you is that if you use an IP address from the 192.168.2.0/24 network, an adapter is created on your host that always gets the IP address 192.168.2.1/24, so you have the addresses 192.168.2.2 - 192.168.2.254 available for your VMs.
This means that the IP address of your master collides with the IP address of your host!
But why does the access to your first VM work?
ssh vagrant@192.168.56.1 -p 2711 -i ~/.ssh/ansible <-- successful connection
That is relatively simple: The network 192.168.56.0/24 is the default network for Host-Only under VirtualBox, so you probably have a VirtualBox Host-Only Ethernet Adapter with the address 192.168.56.1/24.
Because you have defined port forwarding in your Vagrantfile, the 1st VM is mapped to localhost:2711. If you now access 192.168.56.1:2711, you are talking to your own host, i.e. localhost, and the SSH port of the 1st VM is mapped to port 2711 on this host.
So what do you have to do now?
Change the IP addresses of your VMs, e.g. use 192.168.2.11 - 192.168.2.13.
The access to the VMs is possible as follows:
Node        via Guest-IP      via localhost
kubemaster  192.168.2.11:22   localhost:2711
kubenode01  192.168.2.12:22   localhost:2712
kubenode02  192.168.2.13:22   localhost:2713
Note: If you want to access a VM via its guest IP address, use port 22. If you want to access it via localhost, use the port 2710+i that you defined.

Remote-exec provisioner on gcp not connecting with host

I'm trying to use the remote-exec provisioner for a use case in my project on GCP with Terraform version 12. Following the format specified in the Terraform docs, I get a known-hosts key mismatch error after the provisioner times out.
resource "google_compute_instance" "secondvm" {
name = "secondvm"
machine_type = "n1-standard-1"
zone = "us-central1-a"
boot_disk {
initialize_params {
image = "centos-7-v20190905"
}
}
network_interface {
network = "default"
access_config {
nat_ip = google_compute_address.second.address
network_tier = "PREMIUM"
}
}
#metadata = {
#ssh-keys = "root:${file("~/.ssh/id_rsa.pub")}"
#}
metadata_startup_script = "cd /; touch makefile.txt; sudo echo \"string xyz bgv\" >>./makefile.txt"
provisioner "remote-exec" {
inline = [
"sudo sed -i 's/xyz/google_compute_address.first.address/gI' /makefile.txt"
]
connection {
type = "ssh"
#port = 22
host = self.network_interface[0].access_config[0].nat_ip
user = "root"
timeout = "120s"
#agent = false
private_key = file("~/.ssh/id_rsa")
#host_key = file("~/.ssh/google_compute_engine.pub")
host_key = file("~/.ssh/id_rsa.pub")
}
}
depends_on = [google_compute_address.second]
}
I'm not sure what exactly I'm doing wrong with the keys here, but the error I get is:
google_compute_instance.secondvm: Still creating... [2m10s elapsed]
google_compute_instance.secondvm (remote-exec): Connecting to remote host via SSH...
google_compute_instance.secondvm (remote-exec): Host: 104.155.186.128
google_compute_instance.secondvm (remote-exec): User: root
google_compute_instance.secondvm (remote-exec): Password: false
google_compute_instance.secondvm (remote-exec): Private key: true
google_compute_instance.secondvm (remote-exec): Certificate: false
google_compute_instance.secondvm (remote-exec): SSH Agent: false
google_compute_instance.secondvm (remote-exec): Checking Host Key: true
google_compute_instance.secondvm: Still creating... [2m20s elapsed]
Error: timeout - last error: SSH authentication failed (root@104.155.186.128:22): ssh: handshake failed: knownhosts: key mismatch

Terraform remote-exec on windows with ssh

I have set up a Windows server and installed ssh using Chocolatey. If I run this manually I have no problems connecting and running my commands. When I try to use Terraform to run my commands, it connects successfully but doesn't run any commands.
I started by using winrm, and with that I could run commands, but due to a problem with creating a Service Fabric cluster over winrm I decided to try ssh instead. When running things manually over ssh it worked and the cluster went up, so that seems to be the way forward.
I have set up a Linux VM and got ssh working with the private key. I tried to use the same config on the Windows server, but it still asked me for my password.
What could be the reason that I can run commands over ssh manually, while Terraform only connects and runs no commands? I am running this on OpenStack with Windows 2016.
null_resource.sf_cluster_install (remote-exec): Connecting to remote host via SSH...
null_resource.sf_cluster_install (remote-exec): Host: 1.1.1.1
null_resource.sf_cluster_install (remote-exec): User: Administrator
null_resource.sf_cluster_install (remote-exec): Password: true
null_resource.sf_cluster_install (remote-exec): Private key: false
null_resource.sf_cluster_install (remote-exec): SSH Agent: false
null_resource.sf_cluster_install (remote-exec): Checking Host Key: false
null_resource.sf_cluster_install (remote-exec): Connected!
null_resource.sf_cluster_install: Creation complete after 4s (ID: 5017581117349235118)
Here is the script I'm using to run the commands:
resource "null_resource" "sf_cluster_install" {
# count = "${local.sf_count}"
depends_on = ["null_resource.copy_sf_package"]
# Changes to any instance of the cluster requires re-provisioning
triggers = {
cluster_instance_ids = "${openstack_compute_instance_v2.sf_servers.0.id}"
}
connection = {
type = "ssh"
host = "${openstack_networking_floatingip_v2.sf_floatIP.0.address}"
user = "Administrator"
# private_key = "${file("~/.ssh/id_rsa")}"
password = "${var.admin_pass}"
}
provisioner "remote-exec" {
inline = [
"echo hello",
"powershell.exe Write-Host hello",
"powershell.exe New-Item C:/tmp/hello.txt -type file"
]
}
}
Put the connection block inside the provisioner block:
provisioner "remote-exec" {
connection = {
type = "ssh"
...
}
inline = [
"echo hello",
"powershell.exe Write-Host hello",
"powershell.exe New-Item C:/tmp/hello.txt -type file"
]
}
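For reference, here is roughly what the whole resource from the question looks like with that change applied. This is a sketch assembled from the question's own values, not a tested configuration. The connection is written as a nested block without the equals sign, which is the documented block syntax; depends_on and the commented-out lines are left out for brevity and can stay as they are in your file.

```hcl
# Sketch only: the question's resource with the connection settings moved
# inside the provisioner that uses them.
resource "null_resource" "sf_cluster_install" {
  # Changes to any instance of the cluster require re-provisioning
  triggers = {
    cluster_instance_ids = "${openstack_compute_instance_v2.sf_servers.0.id}"
  }

  provisioner "remote-exec" {
    # Nested block form: "connection {", not "connection = {"
    connection {
      type     = "ssh"
      host     = "${openstack_networking_floatingip_v2.sf_floatIP.0.address}"
      user     = "Administrator"
      password = "${var.admin_pass}"
    }

    inline = [
      "echo hello",
      "powershell.exe Write-Host hello",
      "powershell.exe New-Item C:/tmp/hello.txt -type file"
    ]
  }
}
```

If more provisioners are added to the same resource later, each one needs its own connection block when the settings are scoped this way.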