MPICH2 on multiple machines (HYDU_sock_connect error)

MPICH2 on multiple machines (HYDU_sock_connect error) - ssh

I am trying to execute an MPI program in 2 different PCs. However, when I ran this command in pc1:
mpirun -hosts user#host -n 4 bin/Demo_01.exe
I'm getting this error:
[proxy:0:0#pc2] HYDU_sock_connect (./utils/sock/sock.c:203): unable to connect from "pc2" to "pc1" (Connection refused)
[proxy:0:0#pc2] main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at port 57395 (check for firewalls!)
Although I configured SSH connections as without password and disabled firewalls on each machines, the error is still there. My operating system is Ubuntu 12.04 and mpi is MPICH2.
Is there anyone to help?

the error is caused by the the client not connecting back to server as it doesnt know the ip of the server i.e
..main (./pm/pmiserv/pmip.c:209): unable to connect to server ubuntu at...etc
the fix is to add each of hostname and related ip in the /etc/hosts i.e
172.17.0.2 master
172.17.0.3 node1
172.17.0.4 node2
this should allow for bi-directional communication of the master and the node clients

I had the same error, but the accepted answer did not help me.
For me in the hosts file I had:
localhost:8
CPUX:2
I should of had:
CPUZ:8
CPUX:2
I.e the name of the node instead of localhost. Maybe this might help some one.

Fixed. After I followed these steps, the error disappeared:
Create administrator user accounts in both machines with the same username and password.
Define hostnames by editing the file: /etc/hosts
Make a clean install of ssh in both machines.
Configure ssh for connecting without a password. To do this follow these links:
http://www.thegeekstuff.com/2008/11/3-steps-to-perform-ssh-login-without-password-using-ssh-keygen-ssh-copy-id/ and http://dustymabe.com/2012/08/18/exchanging-ssh-keys-using-ssh-copy-id/
Locate the executable MPI program into the same paths in both machines.

montekristo_07's answer is mostly correct but not minimal; steps #2 and #3 are not strictly necessary.
You do not need to edit all your hosts' /etc/hosts files, and, if your LAN uses DHCP and you have any local DNS service running, you should not edit all your hosts' /etc/hosts files.
Insure that:
only externally-resolvable hostnames are referenced in your mpiexec command line (i.e. not "localhost"), and
the /etc/hosts file on the master (the machine on which you run mpiexec) does not have a line associating the public name of the master with the loopback address (127.0.0.1)
A simple test is to use literal IP addresses in your mpiexec command line. If this fixes your problem, then it's a hostname resolution problem...somewhere.
What is essential is to remember is that what is passed on your mpiexec command line, in particular host names, are going to be sent to and resolved on remote hosts.

Related

Ubuntu Jump Host in Open Telekom Cloud not working as expected

Currently, I have built a small datacenter environment in OTC with Terraform. based on Ubuntu 20.04 images.
The idea is to have a jump host in the setup phase and for operational purposes that allows spontaneous access to service frontends via ssh proxy jumps without permanently routing them to the public net.
Basic setup works fine so far - I can access the jump host with ssh, and can access the internal machines from there with ssh when I put the private key onto the jump host. So, cloudwise the security seems to be fine. Key pair is generated with ed25519, I use the same key for jump host and internal servers (for now).
What I cannot achieve is the proxy jump as a chained command from my outside machine.
On the jump host, I set AllowTcpForwarding to "yes" in /etc/ssh/sshd_config and restarted ssh and sshd services.
My current local ssh config looks like this:
Host otc
User ubuntu
Hostname <FloatingIP-Address>
Port 22
StrictHostKeyChecking=no
UserKnownHostsFile=/dev/null
IdentityFile= ~/.ssh/ssh_access
ControlPath ~/.ssh/cm-%r#%h:%p
ControlMaster auto
ControlPersist 10m
Host 10.*
User ubuntu
Port 22
IdentityFile=~/.ssh/ssh_access
ProxyJump otc
StrictHostKeyChecking=no
UserKnownHostsFile=/dev/null
I can use this to ssh otc to the jump host.
What I would expect is that I could use e.g. ssh 10.0.0.56 to reach an internal host without further ado. As well I should be able to use commands like ssh -L 8080:10.0.0.56:8080 10.0.0.56 -N to map an internal server's port to a localhost port on my external machine. This is how I managed that successfully on other hosting scenarios in the public cloud.
All I get is:
Stdio forwarding request failed: Session open refused by peer
kex_exchange_identification: Connection closed by remote host
Journal on the Jump host says:
Jul 30 07:19:04 dev-nc-o-bastion sshd[2176]: refused local port forward: originator 127.0.0.1 port 65535, target 10.0.0.56 port 22
What I checked as well:
ufw is off on the Jump Host.
replaced ProxyJump configuration with ProxyCommand
So I am at the end of my knowledge. Has anyone a hint what else could be the reason? Any help welcome!

Ok, cause is found (but not yet fully explained).
My local ssh setting was allowing multiplexed forwards (ControlMaster auto ) which caused the creation of a unix socket file for the Controlpath in ~/.ssh.
I had to login to the jump host to AllowTcpForwarding in the first place.
After rebooting the sshd, I returned to the local machine and the failure occured when trying to forward to the remote internal machine.
After deleting the socket file in ~/.ssh, the connection can now be established as needed. Obviously, the persistent tunnel was not impacted by the restarted daemon on the jump host and simply refused to follow the new directive.
This cost me two days. On the bright side, I learned a lot about ssh :o

Connecting erlang observer to remote machine via public IP

Background
I have a machine in production running an elixir application (no access to iex, only to erl) and I am tasked with running an analysis on why we are consuming so much CPU. The idea here would be to launch observer, check the processes tab and see the processes with the most reductions.
How am I connecting?
To connect I am following a tutorial from a blog:
https://sgeos.github.io/elixir/erlang/observer/2016/09/16/elixir_erlang_running_otp_observer_remotely.html 1
Their instructions are as follows:
launch the app in the production machine with a cookie and a name
from local run: ssh user#public_ip "epmd -names" to get the name of the app and the port used
from local create a ssh tunnel to the remote machine: ssh -L 4369:user#public_ip:4369 -L 42877:user#public_ip:42877 user#public_ip (4369 is the epmd port by default, 42877 is the port of the app)
from local connect to the remote machine using the node's name: erl -name "user#app_name" -setcookie "mah_cookie" -hidden -run observer
Problem
And now in theory I should be able to use observer on the machine. Instead however I am greeted with the following error:
Protocol ‘inet_tcp’: register/listen error: epmd_close
So, after scouring the dark side of internet, I decided to use sudo journalctl -f to check all the logs of the machine and I found this:
channel 3: open failed: administratively prohibited: open failed
my_app_name sshd[8917]: error: connect_to flame#99.999.99.999: unknown host (Name or service not known)
/scripts/watchdog.sh")
my_app_name CRON[9985]: pam_unix(cron:session): session closed for user flame
Where:
erlang -name: my_app_name
machine user: flame
machine public ip: 99.999.99.999 (obviously not real)
so it tells me, unknown host ?? I am confused since 99.999.99.999 is the public IP of the machine itself!
Questions
What am I doing wrong?
I read that in older versions of erlang I can’t monitor a machine with observer if they are in different networks (which is the case, because I want to monitor this machine from my localhost) but I didn’t find any information regarding this in modern days.
If this is in fact impossible, what alternatives do I have?

Solution
After 3 days of non-stop searching, I finally found something that works.
To summarize I am putting it here everything I did.
All steps in local machine:
get the ports from the remote server:
> ssh remote-user#remote-ip "epmd -names"
epmd: up and running on port 4369 with data:
name super_duper_app at port 43175
create a ssh tunel with the ports:
ssh remote-user#remote-ip -L4369:localhost:4369 -L43175:localhost:43175
On another terminal in your local machine, run a iex terminal with the cookie the app in your remote server is using. Then connect to it and start observer:
iex --name observer#127.0.0.1 --cookie super_duper_cookie
Node.connect :"super_duper_app#127.0.0.1"
> true
:observer.start
With observer started, select the machine from the Nodes menu.
Possible setbacks
If you have tried this and it didn't work there are a few things you can check for:
Check if the EPMD port on your local machine is free, if not, kill the process using it and free it.
Check your ssh tunneling keys and configurations for permissions. As #Roberto Aloi pointed out this link can be useful: https://unix.stackexchange.com/questions/14160/ssh-tunneling-error-channel-1-open-failed-administratively-prohibited-open

GPG key forwarding on Debian

I'm trying to use the private key from my openpgp card from my Debian laptop to a RPi. I followed the different hints found on google, in particular:
extra-socket in ~/.gnupg/gpg-agent.conf
removed it again when founding that this extra socket already will be created in /run/user/<uid>/gnupg
forward this socket using ~/.ssh/config
Host homegear
HostName homegear
RemoteForward ~/.gnupg/S.gpg-agent /run/user/1000/gnupg/S.gpg-agent.extra
changed the order of the both sockets in the RemoteForward line since I'm always confused which one should be the first one
add the following into /etc/ssh/sshd_config of the RPi
StreamLocalBindUnlink yes
reload the gpg-agent on the laptop
open new ssh connection to RPi
But I always get
Warning: remote port forwarding failed for listen path ~/.gnupg/S.gpg-agent
when connecting to the RPi.
openssh on both laptop and RPi is 7.4 (Debian Stretch), gpg is 2.1.18.
Forwarding the agent connect to be used as ssh private key (for connecting to gitlab from RPi) works perfectly, forwarding gpg private key (for signing commits) doesn't. I'm a bit helpless at the moment. Is there anything obviously wrong? Or is there still a problem with forwarding unix domain socket and I need to use the socat workaround?
Thank you!

I've run into exactly the same issue except between two Macs running 10.14.2 and GPG 2.2.11, and the only way I was able to get it to work was to specify the absolute path to the sockets on both ends. :( Having a relative path for either the remote or local socket both failed in various ways, which makes it a bit of a pain if you're connecting as different usernames on various machines.
I was able to work around that by specifying a number of different Match exec blocks in my ~/.ssh/config:
# Source machine is a personal Mac, connecting to another personal Mac on my local network
Match exec "hostname | grep -F .core" Host *.core
RemoteForward /Users/virtualwolf/.gnupg/S.gpg-agent /Users/virtualwolf/.gnupg/S.gpg-agent.extra
# Source machine is a personal Mac, connecting to my Linux box
Match exec "hostname | grep -F .core" Host <name of the host block for my Linux box>
RemoteForward /home/virtualwolf/.gnupg/S.gpg-agent /Users/virtualwolf/.gnupg/S.gpg-agent.extra
# Source machine is my work Mac, connecting to my Linux box
Match exec "hostname | grep -F <work machine name>" Host <name of the host block for my Linux box>
RemoteForward /home/virtualwolf/.gnupg/S.gpg-agent /Users/<work username>/.gnupg/S.gpg-agent.extra
(SSH bits are taken from this answer).

could not resolve hostname with scp

I am accessing an ubuntu server over ssh with putty on my windows machine and trying to download a single file to my local windows machine
my windows username is Mark and my hostname per cmd is Marks I am trying the following command on the remote server
scp backup.sql mark#marks:desktop
and I get could not resolve hostname I have tried to put in what I think myip address is and the connection times out

The syntax is this, relative to where you're issuing the command:
scp user#host_from:location/file user#host_to:location/file
And of course if you're local you can omit the user#host prefixes:
scp local_file me#host_to:~/local_file
The direction is always from > to relative to where you issue the command.

binarysubstrate is right about the syntax. The problem is, if the OP puts the name (or address) of his windows client in the 'to' part of the scp command, it probably won't work for a number of reasons:
his windows machine may not have a resolvable FQDN,
his windows machine may be behind a NAT firewall that is not setup to port-forward SSH requests,
he probably does not have an SSH daemon running on his windows machine.
To simply copy a file from the remote server down to a windows client, I would recommend WinSCP.

From the ser you ping your machine name ? Try replace machine name for the IP Address, or add your machine name to hosts configuration file from the server.

ssh server connect to host xxx port 22: Connection timed out on linux-ubuntu [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed last month.
Improve this question
I am trying to connect to remote server via ssh but getting connection timeout.
I ran the following command
ssh testkamer#test.dommainname.com
and getting following result
ssh: connect to host testkamer#test.dommainname.com port 22: Connection timed out
but if try to connect on another remote server then I can login successfully.
So I think there is no problem in ssh and other person try to login with same login and password he can successfully login to server.
Please help me
Thanks.

Here are a couple of things that could be preventing you from connecting to your Linode instance:
DNS problem: if the computer that you're using to connect to your
remote server isn't resolving test.kameronderdehamer.nl properly
then you won't be able to reach your host. Try to connect using the
public IP address assigned to your Linode and see if it works (e.g.
ssh user#123.123.123.123). If you can connect using the public IP
but not using the hostname that would confirm that you're having
some problem with domain name resolution.
Network issues: there
might be some network issues preventing you from establishing a
connection to your server. For example, there may be a misconfigured
router in the path between you and your host, or you may be
experiencing packet loss. While this is not frequent, it has
happenned to me several times with Linode and can be very annoying.
It could be a good idea to check this just in case. You can have a look
at Diagnosing network issues with MTR (from the Linode
library).

That error message means the server to which you are connecting does not reply to SSH connection attempts on port 22. There are three possible reasons for that:
You're not running an SSH server on the machine. You'll need to install it to be able to ssh to it.
You are running an SSH server on that machine, but on a different port. You need to figure out on which port it is running; say it's on port 1234, you then run ssh -p 1234 hostname.
You are running an SSH server on that machine, and it does use the port on which you are trying to connect, but the machine has a firewall that does not allow you to connect to it. You'll need to figure out how to change the firewall, or maybe you need to ssh from a different host to be allowed in.
EDIT: as (correctly) pointed out in the comments, the third is certainly the case; the other two would result in the server sending a TCP "reset" package back upon the client's connection attempt, resulting in a "connection refused" error message, rather than the timeout you're getting. The other two might also be the case, but you need to fix the third first before you can move on.

I got this error and found that I don't have my SSH port (non standard number) whitelisted in config server firewall.

Just adding this here because it worked for me. Without changing any settings (to my knowledge), I was no longer able to access my AWS EC2 instance with: ssh -i /path/to/key/key_name.pem admin#ecx-x-x-xxx-xx.eu-west-2.compute.amazonaws.com
It turned out I needed to add a rule for inbound SSH traffic, as explained here by AWS. For Port range 22, I added 0.0.0.0/0, which allows all IPv4 addresses to access the instance using SSH.
Note that making the instance accessible to all IPv4 addresses is a security risk; it is acceptable for a short time in a test environment, but you'll likely need a longer term solution.

If you are on Public Network, Firewall will block all incoming connections by default. check your firewall settings or use private network to SSL

The possibility could be, the SSH might not be enabled on your server/system.
Check sudo systemctl status ssh is Active or not.
If it's not active, try installing with the help of these commands
sudo apt update
sudo apt install openssh-server
Now try to access the server/system with following command
ssh username#ip_address

This happens because of firewall connection.
Reset your firewall connection from your hosting website.
It will start working.
After connecting to the server again add this to your (ufw) security
sudo ufw allow 22/tcp

There can be many possible reasons for this failure.
Some are listed above. I faced the same issue, it is very hard to find the root cause of the failure.
I will recommend you to check the session timeout for shh from ssh_config file.
Try to increase the session timeout and see if it fails again

My VPN connection was not enabled. I was trying all possible way to open up the Firwall and Ports until I realized, I am working from home and my VPN connection was down.
But yes, Firewall and ssh configurations can be a reason.

Try connecting to a vpn, if possible. That was the reason I was facing problem.
Tip: if you're using an ec2 machine, try rebooting it. This worked for me the other day :)

I had this issue while trying to ssh into a local nextcloud server from my Mac.
I had no issues ssh-ing in once, but if I tried to have more than one concurrent connection, it would hang until it timed out.
Note, I was sshing to my user#public-ip-address.
I realized the second connection only didn't work when I tried to ssh into it when on the same network, ie my home network
Furthermore, when I tried ssh user#server-domain it worked!
The end fix was to use ssh user#server-domain rather than ssh user#public-ip

I have experienced a couple of nasty issues that lead to these errors, and these are different from everyone else's answer here:
Wrong folder access rights. You need to have specific directory permissions on you ssh folders and files.
a. The .ssh directory permissions should be 700 (drwx------).
b. The public key (.pub file) should be 644 (-rw-r--r--).
c. The private key (id_rsa) on the client host, and the authorized_keys file on the server, should be 600 (-rw-------).
Nasty docker network configuration. This just happened to me on an AWS EC2 instance. It turned out that I had a docker network with an ip range that interfered with the ssh access granted by the security group and VPC. The docker network's range was e.g. 192.168.176.0/20 (i.e. a range from 192.168.176.1->192.168.191.254), whereas the security group had a range of 192.168.179.0/24; interfering with the SSH access.

I had this error when trying to SSH into my Raspberry pi from my MBP via bash terminal. My RPI was connected to the network via wifi/wlan0 and this IP had been changed upon restart by my routers DHCP.
Check IP being used to login via SSH is correct. Re-check IP of device being SSH'd into (in my case the RPI), which can be checked using hostname -I
Confirm/amend SSH login credentials on "guest" device (in my case the MBP) and it worked fine in my attempt.

I faced a similar issue. I checked for the below:
if ssh is not installed on your machine, you will have to install it firstly. (You will get a message saying ssh is not recognized as a command).
Port 22 is open or not on the server you are trying to ssh.
If the control of remote server is in your hands and you have permissions, try to disable firewall on it.
Try to ssh again.
If port is not an issue then you would have to check for firewall settings as it is the one that is blocking your connection.
For me too it was a firewall issue between my machine and remote server.I disabled the firewall on the remote server and I was able to make a connection using ssh.

my main machine is windows 10 and I have CEntOS 7 VBox
Search in your main machine for "known_hosts"
usually, known_host location in windows in "user/.ssh/known_host"
open it using notepad and delete the line where your centos vbox ip
then try connect in your terminal
in mac os user you can find known_hosts in "~/.ssh/known_hosts"

Make sure to ask the admin to authorize your device.
On Linux run:
sudo zerotier-cli listnetworks
if it returns status ACCESS DENIED ask the admin to authorize your node. This is mentioned here.
https://discuss.zerotier.com/t/solved-cant-join-network/1919

This issue is also caused if the Dynamic Host Configuration Protocol is not set-up properly.
To solve this first check if your IP Address is configured using
ping ipaddress,
If there is no packet loss and the IP Address is working fine try any other solution. If there is no response and you have 100% packet loss, it means that your IP Address is not working and not configured.
Now configure your IP Address using,
sudo dhclient -v devicename
To check your device you can use the 'ip a' command
For eg. My device was usb0 since I had connected the device through usb
This will configure an IP Address automatically and you can even see which one is configured. You can again check with the 'ip a' command to confirm.

This may be very case specific and work in some cases only but
check to see if you were previously connecting through some VPN software/application.
Try connecting again to the VPN. Worked in my case.

This happened to me after enabling port 22 with "sudo ufw allow ssh". Before that, I was getting a refusal from my machine when entering with ssh from another one. After enabling it, I thought it would work, but instead it showed the message "connection timed out". As I had just installed Ubuntu with the option of installing basic functions alongside, I checked whether I had the openssh-server with the command sudo apt list --installed | grep openssh-server. It turned out that Ubuntu had installed by defect the openssh-client instead. I uninstalled it and installed the openssh-server following the basic commands:
sudo apt-get purge openssh-client
sudo apt update
sudo apt install openssh-server
After that, a simple "sudo ufw allow ssh" worked perfectly and I was finally able to access the machine with an ssh command.

What worked for me was that i went to my security group and reset my IP and it worked

Here are some considerations which i took to resolve a similar issue that I had:
Port 22
IGW (Internet Gateway)
VPC
Scene 1> This is for port 22 not enabled with right configurations. If the port is set to custom or myip, the probable scene is this won't work.
Scene 2> When you delete the internet gateway, the network is created and the instance will be functional too, but the routing from the internet will not work. Hence make sure that if there is a VPC, it has an Internet Gateway attached.
Scene 3> Check the VPC for the subnet associations and routing table entries. This might probably tell you the cause. I found one in this kind of troubleshooting. The route used to land up in a "blackhole" (shows up in the route table section of the console). To fix this I had to check and find out my internet gateway and found the issue with the IGW.
Moral of the story: always trace backward in the network!

In my case I'm on windows, I reset my firewall settings, and it fixed

If you get any error check the basic a version control request with ssh -V and If it is not installed, install it with the sudo apt-get install openssh-server command.
Check your virtual machine ssh connection with sudo service ssh status at console.
Check "Active" rows and if write a inactive(dead) the console write sudo service ssh start
Result: Now you can check your connection with sudo service ssh status command and send ssh connection request.

Reset the firewall and reboot your VPS from your hosting service, it will start working perfectly fine

check whether accidentally you have deleted the default vpc or default subnets ,while creating your own vpc and subnets.
I have done this mistake while creating vpc, hence got this error while connecting via ssh.
alos check whether u have attched IGW to public subnets.

Its not complicated.
First, go disable your firewall(USE YOUR CONTROL PANEL)after you check if your openssh is active.
Disable firewall, then use putty or any alternative to basically disable using this command sudo ufw disable
try now

Update the security group of that instance. Your local IP must have updated. Every time it’s IP flips. You will have to go update the Security group.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

MPICH2 on multiple machines (HYDU_sock_connect error) - ssh

I had the same error, but the accepted answer did not help me. For me in the hosts file I had: localhost:8 CPUX:2 I should of had: CPUZ:8 CPUX:2 I.e the name of the node instead of localhost. Maybe this might help some one.

Related

Ubuntu Jump Host in Open Telekom Cloud not working as expected

Connecting erlang observer to remote machine via public IP

GPG key forwarding on Debian

could not resolve hostname with scp

ssh server connect to host xxx port 22: Connection timed out on linux-ubuntu [closed]

Categories

Resources