Can SSH be fault-tolerant, or, is there a way to overcome RST messing up my TCP connections (some kind of retry pipe at both ends?) - ssh

I'm trying to use "scp" to copy TB-sized files, which is fine, until whatever router or other issue throws a tantrum and drops my connections (lost packets or unwanted RSTs or whatever).
# scp user@rmt1:/home/user/*z .
user@rmt1's password:
log_backups_2019_02_09_07h44m14.gz      16% 6552MB   6.3MB/s   1:27:46 ETA
client_loop: send disconnect: Broken pipe
lost connection
It occurs to me that (if ssh doesn't already support this) it should be possible for something at each end of the connection to talk to its peer and, when "stuff goes wrong", to transparently just bloody handle it (basically retry and reconnect indefinitely).
Anyone know the solution?
My "normal" way of tunnelling remote machines into a local connection is using ssh of course, catch-22 - that's the thing that's breaking so I can't do that here...

SSH uses TCP, and TCP is generally designed to be relatively fault-tolerant, with retries for dropped packets, acknowledgements, and other techniques to overcome occasional network problems.
If you're seeing dropped connections nevertheless, then either the network problems are excessive (more than any standard protocol can be expected to handle) or an attacker is intentionally disrupting the connection (which cannot be prevented at this layer). Neither is something a reasonable network protocol can overcome, so you're going to have to deal with the underlying cause. That's true whether you're using SSH or some other protocol.
You could try using SFTP instead of SCP, because SFTP supports resuming (e.g., put -a), but that's the best that's going to be possible. You can also try a command like lftp, which may have more scripting possibilities to copy and retry (e.g., mirror --continue --loop), and can also use SFTP under the hood.
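For example, an untested sketch (assuming lftp is installed on the local machine and the remote files match the question's glob) that resumes partial downloads over SFTP and keeps retrying until the mirror completes cleanly:
lftp -e 'mirror --continue --loop --include-glob "*z" /home/user/ . ; quit' sftp://user@rmt1
Plain OpenSSH sftp can also resume a single interrupted download with reget (or get -a), though it won't reconnect for you.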
Your best bet is to find out what the network problem is and get that fixed. mtr may be helpful for finding where your packet loss is.
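For example, something like this (hostname as in the question) sends 100 probes to every hop on the path and prints a per-hop loss report:
mtr --report --report-cycles 100 rmt1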

Related

What is the simplest way to emulate a bidirectional UDP connection between two ports on localhost?

I'm adapting code that used a direct connection between udp://localhost:9080 and udp://localhost:5554 to insert ports 19080 and 15554. On one side, 9080 now talks and listens to 19080 instead of directly to 5554. Similarly, 5554 now talks and listens to 15554. What's missing is a bidirectional connection between 19080 and 15554. All the socat examples I've seen seem to ignore this simplest of cases in favor of specialized ones of limited usefulness.
I previously seemed to have success with:
sudo socat UDP4:localhost:19080 UDP4:localhost:15554 &
but I found that it may have been due to a program bug that bypassed the connection. It no longer works.
I've also been given tentative suggestions to use a pair of more cryptic commands that likewise don't work:
sudo socat UDP4-RECVFROM:19080,fork UDP4-SENDTO:localhost:15554 &
sudo socat UDP4-RECVFROM:15554,fork UDP4-SENDTO:localhost:19080 &
and additionally seem to overcomplicate the manpage statement that "Socat is a command line based utility that establishes two bidirectional byte streams and transfers data between them."
I can see from Wireshark that both sides are correctly using their respective sides of the connection to send UDP packets, but neither side is receiving what the other side has sent, due to the opacity of socat used in either of these ways.
Has anyone implemented this simplest of cases simply, reproducibly, and unambiguously? It was suggested to me as a way around writing my own emulator to pass packets back and forth between the ports, but the time spent getting socat to cooperate could likewise be put to better use.
You use fixed ports, and you do not specify whether one particular direction initiates the transfers.
Therefore the datagram addresses are preferable. Something like the following command should do the trick:
socat \
UDP-DATAGRAM:localhost:9080,bind=localhost:19080,sourceport=9080 \
UDP-DATAGRAM:localhost:5554,bind=localhost:15554,sourceport=5554
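If you want to check the relay on its own before wiring in the real programs, an untested sketch: with the relay above running, emulate the two endpoints in separate terminals and type into either one; the text should come out of the other.
socat UDP-DATAGRAM:localhost:19080,bind=localhost:9080 STDIO
socat UDP-DATAGRAM:localhost:15554,bind=localhost:5554 STDIO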
Only the 5-digit port numbers belong in the socat commands. The connections from or to 9988, 9080, and 5554 are direct existing connections. I only need socat for the emulated connections that would exist if an actual appliance existed.
I haven't tested this, but it seems possible that the two 'more cryptic' commands create an undesirable loop. Perhaps the destination ports could be modified as shown below, which may help achieve your objective. This may not be viable for your application, since you may need to adjust your receive sockets accordingly.
sudo socat UDP4-RECVFROM:19080,fork UDP4-SENDTO:localhost:5554 &
sudo socat UDP4-RECVFROM:15554,fork UDP4-SENDTO:localhost:9080 &

Extra TCP connections on the RabbitMQ server after resource alarm

I have RabbitMQ Server 3.6.0 installed on Windows (I know it's time to upgrade, I've already done that on the other server node).
Heartbeats are enabled on both server and client side (heartbeat interval 60s).
I had a resource alarm (RAM limit), and after it I observed a rise in the number of TCP connections to the RMQ server.
At the moment there are 18,000 connections, while the normal amount is 6,000.
Via the management plugin I can see a lot of connections with 0 channels, while our "normal" connections have at least 1 channel.
Even restarting the RMQ server doesn't help: all the connections simply re-establish themselves.
   1. Does that mean all of them are really alive?
A similar issue was described here: https://github.com/rabbitmq/rabbitmq-server/issues/384, but as far as I can see it was fixed in v3.6.0.
   2. Do I understand correctly that before RMQ Server v3.6.0 the behavior after a resource alarm was as follows: several TCP connections could hang on the server side per one real client auto-recovery connection?
Maybe important: we have haProxy between the server and the clients. 
   3. Could haProxy be an explanation for these extra connections? Maybe it prevents the clients from receiving a signal that the connection was closed due to the resource alarm?
Are all of them alive?
Only you can answer this, but I would ask - how is it that you are ending up with many thousands of connections? Really, you should only create one connection per logical process. So if you really have 6,000 logical processes connecting to the server, that might be a reason for that many connections, but in my opinion, you're well beyond reasonable design limits even in that case.
To check, see how many connections decrease when you kill one of your logical processes.
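To see what the extra connections actually are, the management CLI can list them with their channel counts and states (standard rabbitmqctl; these column names are connection info items supported in 3.6.x):
rabbitmqctl list_connections name peer_host peer_port state channels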
Do I understand correctly that before RMQ Server v3.6.0 the behavior after a resource alarm was as follows: several TCP connections could hang on the server side per one real client auto-recovery connection?
As far as I can tell, yes. It looks like the developer in this case ran across a common problem with sockets: detecting dropped connections. If I had a dollar for every time someone misunderstood how TCP works, I'd have more money than Bezos. What they found is that someone had made some bad assumptions, when in fact a read or write is required to detect a dead socket, and the developer wrote code to (attempt to) handle it properly. It is important to note that this does not look like a very comprehensive fix, so if the same conceptual design problem exists in another part of the code, this bug might still be around in some form. Searching the bug reports, or asking on that support list, might give you a more detailed answer.
Could haProxy be an explanation for these extra connections?
That depends. In theory, haProxy is just a pass-through. For a connection to be recognized by the broker, it has to go through a handshake, which is a deliberate process and cannot happen inadvertently. Closing a connection also requires a handshake, which is where haProxy might be the culprit: if haProxy thinks the connection is dead and drops it without that process, it could be a contributing cause. But it is not in and of itself creating these new connections.
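If haProxy does turn out to be dropping idle AMQP connections without a proper close, the usual knobs are its client/server idle timeouts; an illustrative sketch for the relevant haproxy.cfg listen/backend section (example values only, and they should be comfortably longer than the 60s heartbeat interval):
timeout client 3h
timeout server 3h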
The RabbitMQ team monitors this mailing list and only sometimes answers questions on StackOverflow.
I recommended that this user upgrade from Erlang 18, which has known TCP connection issues -
https://groups.google.com/d/msg/rabbitmq-users/R3700QdIVJs/taDYKI6bAgAJ
I've managed to reproduce the problem: in the end it was a bug in the way our client used RMQ connections.
It created 1 auto-recovery connection (that part is fine) and sometimes it created a separate, simple connection for "temporary" purposes.
Steps to reproduce my problem were:
   1. Reach the memory alarm in RabbitMQ (e.g. set up an easily reached RAM limit, as sketched after this list, and push a lot of big messages). Connections will go into the state "blocking".
   2. Start sending a message from our client over this new "temp" connection.
   3. Ensure the connection is in the state "blocked".
   4. Without eliminating the resource alarm, restart the RabbitMQ node.
The "temp" connection was still there! Despite the fact that auto-recovery was not enabled for it. And it kept sending heartbeats, so the server didn't close it.
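A minimal way to trigger the alarm for testing (standard rabbitmqctl; the first value is deliberately tiny, the second restores the default afterwards):
rabbitmqctl set_vm_memory_high_watermark 0.001
rabbitmqctl set_vm_memory_high_watermark 0.4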
We will fix the client to always use one and only one connection.
Plus, of course, we will upgrade Erlang.

libevent2 http not detecting client network broken

I am not pasting the source code here; if anyone wants to reproduce the problem, download the code from this GitHub project:
It is a Comet server; the server uses the http API of libevent-2.0.21-stable.
To reproduce the problem:
   1. Start the icomet-server on machine S.
   2. Run curl http://ip:8100/stream from another machine C; the server S will show a message that C has connected.
   3. If I press CTRL + C to terminate curl, the server knows that C has disconnected, as expected.
   4. If I pull out the network cable from machine C (a physical network break), the server will NOT know that C is disconnected, which it SHOULD!
I ask anyone who is familiar with libevent: how do I make libevent 2 detect that the client's network is broken?
When the physical network link is interrupted, you won't always get a packet back to tell you that you lost the connection. If you want to find out about a disconnection, send a ping (a request that just asks for a no-op reply) periodically, and if the reply doesn't come within some reasonable timeout, assume something went wrong. Or just disconnect the client if they're idle for long enough.
When you did that Ctrl-C, the OS that the other end was running on was still working, and so it was able to generate a TCP RST packet to inform your server that the client has gone away. But when you break that physical link, the client is no longer capable of sending that cry for help. Something else has to infer that the client went away.
Now, if you try to send the client some data, the server kernel will notice (sooner or later) that the client is not replying to its messages. At this point you'll see the disconnect - but it may take several minutes for this to happen. If you're not sending any data, then it'll stay open until either you disconnect it, or the kernel attempts a TCP keepalive (a low-level way of the kernel asking "Hey, I haven't heard from you for a while, are you still there?") potentially hours later (or it might not even do a keepalive at all for you, depending on how things are configured).
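If you would rather lean on the kernel than implement an application-level ping, the usual route is to enable SO_KEEPALIVE on the accepted sockets and tighten the system-wide keepalive timers; on Linux the relevant sysctls are these (illustrative values: idle seconds before probing, seconds between probes, and probe count - they only affect sockets that have keepalive enabled):
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=5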

using "vim" can lead ssh timeout but "top" not

When I use ssh to log in to a remote server and open vim, if I don't type anything the session will time out and I have to log in again.
But if I run a command like top, the session never times out.
What's the reason?
Note that the behavior you're seeing isn't related to vim or to top. Chances are good some router along the way is culling "dead" TCP sessions. This is often done by a NAT firewall or a stateful firewall to reduce memory pressure and protect against simple denial of service attacks.
Probably the ServerAliveInterval configuration option can keep your idle-looking sessions from being reaped:
ServerAliveInterval
Sets a timeout interval in seconds after which if no
data has been received from the server, ssh(1) will
send a message through the encrypted channel to request
a response from the server. The default is 0,
indicating that these messages will not be sent to the
server, or 300 if the BatchMode option is set. This
option applies to protocol version 2 only.
ProtocolKeepAlives and SetupTimeOut are Debian-specific
compatibility aliases for this option.
Try adding ServerAliveInterval 180 to your ~/.ssh/config file. This will ask for the keepalive probes every three minutes, which should be faster than many firewall timeouts.
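For example, a minimal client-side snippet in ~/.ssh/config (applies to all hosts here; it can equally go under a specific Host entry):
Host *
    ServerAliveInterval 180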
vim will just sit there waiting for input, and (unless you've got a clock or something on the terminal screen) will also produce no output. If this continues for very long, most firewalls will see the connection as dead and kill it, since there's no activity.
top, by comparison, updates the screen every few seconds, which is seen as activity, so the connection is kept open, since there IS data flowing over it on a regular basis.
There are options you can add to the SSH server's configuration to send timed "null" packets to keep a connection alive even though no actual user data is going across the link: http://www.howtogeek.com/howto/linux/keep-your-linux-ssh-session-from-disconnecting/
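A sketch of the server-side equivalent (these are standard sshd_config options; sshd needs to be reloaded after editing /etc/ssh/sshd_config):
ClientAliveInterval 180
ClientAliveCountMax 3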
Because "top" is always returning data through your SSH console, it will remain active.
"vim" will not because it is static and only transmits data according to your key presses.
The lack of transferred data causes the SSH session to time out

Too many TIME_WAIT connections

We have a fairly busy website (1 million page views/day) using Apache mod_proxy, and it keeps getting overloaded with connections (>1,000) in the TIME_WAIT state. The connections are to port 3306 (MySQL), but MySQL itself only shows a few connections (SHOW PROCESSLIST) and is performing fine.
We have tried changing a bunch of things (keepalive on/off), but nothing seems to help. All other system resources are within a reasonable range.
I've searched around, and most of what I found suggests changing tcp_time_wait_interval. But that seems a bit drastic. I've worked on busy websites before and never had this problem.
Any suggestions?
Each time_wait connection is a connection that has been closed.
You're probably connecting to mysql, issuing a query, then disconnecting, and repeating that for each query on the page. Consider using a connection pooling tool or, at the very least, a global variable that holds on to your database connection. If you use a global, you'll have to close the connection at the end of the page. Hopefully you have somewhere common you can put that, like a footer include.
As a bonus, you should get a faster page load. MySQL is quick to connect, but not having to re-connect is even faster.
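To confirm that the TIME_WAIT sockets really are these short-lived MySQL connections, something like this counts them (assuming the usual Linux netstat layout, where the fifth column is the foreign address and the sixth is the state):
netstat -tan | awk '$6 == "TIME_WAIT" && $5 ~ /:3306$/' | wc -l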
If your client applications are using JDBC, you might be hitting this bug:
http://bugs.mysql.com/bug.php?id=56979
I believe that PHP has the same problem.
We had a similar problem, where our web servers all froze up because our php was making connections to a mysql server that was set up to do reverse host lookups on incoming connections.
When things were slow it worked fine, but under load the response times shot through the roof and all the Apache servers got stuck with connections in TIME_WAIT.
The way we figured the problem out was by using xdebug to create profiling data for the scripts under high load and looking at that: the mysql_connect calls took up 80-90% of the execution time.