According to the images below (RabbitMQ 3.6.6-1), I am wondering where all the memory for "Binaries" is being used, since the "Binary references" breakdown doesn't show the same memory usage.
Can anyone enlighten?
I suspect something needs to be "Cleaned up"... but what?
This big "Binaries" consumption can also be seen on machines with 4 queues and no messages...
EDIT 17/07/2017:
We have found that this is mainly due to the fact that we open and close multiple connections to RabbitMQ, which somehow does not seem to free up the memory cleanly.
The fact that the biggest part of used memory isn't associated with any particular messages (the "binary references" part of your screenshot), suggests this memory is being used by operating system resources not directly managed by RabbitMQ. My biggest suspect would be open connections.
To test this theory, you can run netstat and see if you get similar results (assuming you're running rabbitmq on the default port - 5672):
root@rabbitmqhost:~# netstat -ntpo | grep -E ':5672\>'
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp6 0 0 xxx.xxx.xxx.xxx:5672 yyy.yyy.yyy.yyy:57656 ESTABLISHED 27872/beam.smp off (0.00/0/0)
tcp6 0 0 xxx.xxx.xxx.xxx:5672 yyy.yyy.yyy.yyy:49962 ESTABLISHED 27872/beam.smp off (0.00/0/0)
tcp6 0 0 xxx.xxx.xxx.xxx:5672 yyy.yyy.yyy.yyy:56546 ESTABLISHED 27872/beam.smp off (0.00/0/0)
tcp6 0 0 xxx.xxx.xxx.xxx:5672 yyy.yyy.yyy.yyy:50726 ESTABLISHED 27872/beam.smp off (0.00/0/0)
⋮
The interesting part is the last column showing "Timer off". This indicates these connections aren't using keepalives, which means they'll just dangle there eating up resources if the client dies without a chance to close them gracefully.
There are two ways to avoid this problem:
TCP keepalives
application protocol heartbeats
TCP Keepalives
These are handled by the kernel. Whenever a connection sees no packets for a certain amount of time, the kernel sends some probes to check whether the other side is still there.
Since the current Linux (e.g. 4.12) timeout defaults are pretty high (7200 seconds + 9 probes every 75 seconds > 2 hours), RabbitMQ does not use them by default.
To activate them, you have to add it to your rabbitmq.config:
[
  {rabbit, [
    ⋮
    {tcp_listen_options, [
      ⋮
      {keepalive, true},
      ⋮
    ]},
    ⋮
  ]},
  ⋮
].
and probably lower the timeouts to some more sensible values. Something like this might work (but of course YMMV):
root@rabbitmqhost:~# sysctl net.ipv4.tcp_keepalive_intvl=30
root@rabbitmqhost:~# sysctl net.ipv4.tcp_keepalive_probes=3
root@rabbitmqhost:~# sysctl net.ipv4.tcp_keepalive_time=60
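The same three knobs can also be tuned per socket instead of system-wide. As a minimal sketch (assuming Linux, where the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT socket options are available):

```python
import socket

def make_keepalive_socket(idle=60, interval=30, probes=3):
    """Create a TCP socket with keepalive probes enabled.

    Mirrors the sysctl values above, but scoped to this one socket
    (these options are Linux-specific; other OSes differ)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first probe is sent:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Seconds between subsequent probes:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Number of unanswered probes before the connection is dropped:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s
```

This is what RabbitMQ's {keepalive, true} listener option effectively requests for its accepted sockets, with the timing still coming from the sysctls unless overridden.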
Application protocol heartbeats
These are handled by the actual messaging protocols (e.g. AMQP, STOMP, MQTT), but require the client to opt-in. Since each protocol is different, you have to check the documentation to set it up in your client applications.
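The bookkeeping behind such heartbeats is simple either way. As an illustrative sketch (not any particular protocol's implementation; AMQP, for instance, conventionally closes a connection after roughly two missed heartbeat intervals, and real clients negotiate the interval with the broker):

```python
import time

class HeartbeatMonitor:
    """Track peer liveness: if no traffic has been seen for more than
    `misses` heartbeat intervals, consider the peer dead."""

    def __init__(self, interval, misses=2, now=time.monotonic):
        self.interval = interval
        self.misses = misses
        self.now = now          # injectable clock, handy for testing
        self.last_seen = now()

    def record_activity(self):
        """Call whenever any frame (heartbeat or data) arrives."""
        self.last_seen = self.now()

    def peer_is_dead(self):
        return self.now() - self.last_seen > self.interval * self.misses
```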
Conclusion
The safest option, from the perspective of avoiding dangling resources, is TCP keepalives, since you don't have to rely on your client applications behaving.
However, they are less versatile and if misconfigured on a high-throughput but "bursty" system, may lead to worse performance, since false-positives will cause reconnects.
Application protocol heartbeats are the more fine-grained option if you need to avoid this problem while also keeping your system's performance, but they require more coordination, in the sense that clients have to opt in and choose their own sensible timeouts.
Since you can never be 100% sure your clients won't die without gracefully closing connections, enabling TCP keepalives as a fallback (even with higher timeouts) might also be a good idea.
Related
Trying to configure a syslog-ng server to send all of the logs that it receives, to another syslog-ng server over TLS. Both running RHEL 7. Everything seems to be working from an encryption and cert perspective. Not seeing any error messages in the logs, an openssl s_client test connection works successfully, I can see the packets coming in over the port that I'm using for TLS, but nothing is being written to disk on the second syslog-ng server. Here's the summary of the config on the syslog server that I'm trying to send the logs to:
source:
source s_encrypted_syslog {
    syslog(ip(0.0.0.0) port(1470) transport("tls")
        tls(key-file("/etc/syslog-ng/key.d/privkey.pem")
            cert-file("/etc/syslog-ng/cert.d/servercert.pem")
            peer-verify(optional-untrusted)  # changing to trusted once issue is fixed
        )
    );
};
destination:
destination d_syslog_facility_f {
    file("/mnt/syslog/$LOGHOST/log/$R_YEAR-$R_MONTH-$R_DAY/$HOST_FROM/$HOST/$FACILITY.log"
        dir-owner("syslogng") dir-group("syslogng") owner("syslogng") group("syslogng"));
};
log setting:
log { source (s_encrypted_syslog); destination (d_syslog_facility_f); };
syslog-ng is currently running as root to rule out permission issues, and SELinux is currently set to permissive. I tried increasing the verbosity of the syslog-ng logs and turned on debugging, but nothing jumps out at me as far as errors or issues go. Also, oddly, I have a very similar config on the first syslog-ng server and it's receiving and storing logs just fine.
Also, I should note that there could be some small typos in the config above, as I'm not able to copy and paste it. syslog-ng starts up with no errors using the config I have loaded currently. It's simply not writing the data it receives to the destination I have specified.
It happens quite often that the packet filter prevents a connection to the syslog port, or in your case port 1470. In that case the server starts up successfully, you might even be able to connect using openssl s_client on the same host, but the client will not be able to establish a connection to the server.
Please check that you can actually connect to the server from the client computer (e.g. via openssl s_client, or at least with something like netcat or telnet).
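A plain TCP handshake check is a useful first step, because TLS is only negotiated after the TCP connection succeeds; this isolates packet-filter problems from certificate problems. A minimal sketch (hostname and port are whatever your setup uses):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Reachability probe: can we complete a TCP handshake?
    Returns False on refusal, timeout, or any other socket error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from the client machine against the server's port 1470; if it returns False there but True on the server itself, the packet filter between them is the prime suspect.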
If the connection works, another issue might be that the client is not routing messages to this encrypted destination. syslog-ng only performs the SSL handshake as messages are being sent. No messages would result in the connection being open but not really exchanging packets on the TCP level.
Couple of troubleshooting tips:
You can check if there is a connection between the client and the server with "netstat -antp | grep syslog-ng" on the server or the client. You should see connections in the ESTABLISHED state on both sides of the connection (with local/remote addresses switched of course).
Check that your packet filter lets port 1470 connections through. You are most likely using iptables, try reviewing your ruleset and see if port 1470 on TCP is allowed to pass in the INPUT chain. You could try adding a "LOG" rule right before the default rule to see if the packets are dropped at that level. If you already have LOG rules, you might check the kernel logs of the server to see if that LOG rule produced any messages.
You can also confirm if there's traffic with tcpdump on the server (e.g. tcpdump -pen port 1470). If you write the traffic dump to a file (e.g. the -w argument to tcpdump, along with -s 0 to avoid truncation), then this dump file can be analyzed with wireshark to see if the negotiation takes place. You should at the very least see a "Client Hello" and a "Server Hello" packet which are not encrypted at the beginning of the handshake.
I have a pcap file that I modified with tcprewrite to set both source and destination IP to 127.0.0.1, while the port numbers are different. I also set both MAC addresses to 00:00:00:00:00:00, as I understand that comms over localhost ignore MACs. I made sure the checksums were fixed.
When I run tcpreplay -i lo test-lo.pcap in one shell, and tcpdump -i lo -p udp port 50001 in another, I see the traffic. Yet, when I try to view the traffic with netcat -l -u 50001, it sees nothing. Wireshark is capturing the traffic correctly.
Side note: I'm seeing the following warning when running tcpreplay on localhost:
Warning: Unsupported physical layer type 0x0304 on lo. Maybe it works, maybe it won't. See tickets #123/318 That seems worrisome.
I'm asking because my own UDP listener code is also having the same problem as netcat and thought that maybe I'm missing something. Why would traffic be seen by tcpdump and wireshark, and not by netcat?
Look at this image of the kernel packet flow from Wikipedia:
As you can see, there are different places along the path where packets can be accessed. Wireshark uses libpcap, which uses an AF_PACKET socket to see packets. Your UDP listener, like netcat, uses regular user-space sockets. Let's highlight both on this image. Wireshark obtains packets via the red path, netcat via the purple one:
As you can see, there is a whole sequence of steps packets have to go through in the kernel to get to a local process socket. These steps include bridging, routing, filtering etc. Which step drops your packets? I don't know. You can try tweaking the packets and maybe you'll get lucky.
If you want a more systematic approach, use a tool like dropwatch. It hooks into the kernel and shows you counters of where the kernel drops packets.
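As a baseline, it helps to confirm that a datagram sent through ordinary sockets does reach a listener on loopback. This sketch takes the "purple path" end to end; if it works while your replayed traffic doesn't, the injected frames are being dropped somewhere between the capture point and socket delivery:

```python
import socket

def loopback_udp_roundtrip(payload=b"ping"):
    """Send a UDP datagram over loopback with regular sockets and
    return what the listener received. Unlike frames injected by
    tcpreplay, this traverses the full kernel delivery path
    (routing, filtering, checksum validation, socket demux)."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))   # any free port, like netcat -l -u
    rx.settimeout(2.0)
    port = rx.getsockname()[1]
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(payload, ("127.0.0.1", port))
    data, _ = rx.recvfrom(65535)
    tx.close()
    rx.close()
    return data
```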
What is the default tcp-keepalive for Redis 3 if I don't specify it? I commented out the tcp-keepalive option in the redis.conf file.
# A reasonable value for this option is 60 seconds.
#tcp-keepalive 0
The default is 0.
You can verify this by running CONFIG GET tcp-keepalive
127.0.0.1:6379> CONFIG GET tcp-keepalive
1) "tcp-keepalive"
2) "0"
or looking at the source code.
Depends.
"…
TCP keepalive
Recent versions of Redis (3.2 or greater) have TCP keepalive (SO_KEEPALIVE socket option) enabled by default and set to about 300 seconds. This option is useful in order to detect dead peers (clients that cannot be reached even if they look connected). Moreover, if there is network equipment between clients and servers that needs to see some traffic in order to keep the connection open, the option will prevent unexpected connection closed events.
…"
I'm tunneling all of my internet traffic through a remote computer running Debian, using sshd. But my internet connection becomes very slow (something around 5 to 10 kbps!). Could anything in the default configuration cause this problem?
Thanks in advance,
Tunneling TCP within another TCP stream can sometimes work -- but when things go wrong, they go wrong very quickly.
Consider what happens when the "real world" loses one of your TCP packets: after a certain amount of not getting an ACK packet back in response to new data packets, the sending side realizes a packet has gone missing and re-sends the data.
If that packet happens to be a TCP packet whose payload is another TCP packet, then you have two TCP stacks that are upset about their missing packet. The tunneled TCP layer will re-send packets and the outer TCP layer will also resend packets. This causes a giant pileup of duplicate packets that will eventually be delivered and must be dropped on the floor -- because the outer TCP reliably delivered the packet, eventually.
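The compounding effect can be sketched numerically. Using the classic doubling of the retransmission timeout as a simplified model of real TCP behaviour (the starting RTO of 1 second is an illustrative assumption):

```python
def backoff_delays(rto, retries):
    """Exponential backoff: each retransmission roughly doubles the
    retransmission timeout (a simplification of TCP's RTO logic)."""
    return [rto * 2 ** i for i in range(retries)]

# With TCP-over-TCP, both layers back off independently. By the time the
# outer layer finally delivers the lost segment, the inner layer's timer
# may have doubled several times, so the inner stream keeps crawling at
# the inflated RTO even after the path has recovered ("TCP meltdown").
```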
I believe you would be much better served by a more dedicated tunneling method such as GRE tunnels or IPSec.
Yes, tunnelling traffic over a TCP connection is not a good idea. See http://sites.inka.de/bigred/devel/tcp-tcp.html
I'm on a local LAN with only 8 connected computers, using a Netgear 24-port gigabit switch. Network load is really low, and send/receive buffers on all involved nodes (running Slackware 11) have been set to 16 MB. I'm also running tcpdump on each node to monitor the traffic.
A sending node sends a 10044-byte UDP packet which, more often than not (3 out of 4 times), does not end up in the receiving side's application. In these cases I notice (using tcpdump) that the first x fragments are missing and only the last 3 (all with offsets > 0, and in order) are caught by tcpdump. The fragmented UDP packet can therefore not be reassembled and is most likely thrown away.
I find the missing fragments strange, since I have also tried a simple load test bursting out 10000 UDP messages of the same size; the receiving application sends a response, and all tests so far give 100% of the responses back.
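For context on how many fragments are in flight, here is the arithmetic, as a sketch that assumes a standard 1500-byte Ethernet MTU, a 20-byte IP header with no options, and the 8-byte UDP header riding in the first fragment:

```python
import math

def udp_fragment_count(udp_payload, mtu=1500):
    """Number of IP fragments for a UDP datagram of `udp_payload` bytes.
    Every fragment except the last must carry a multiple of 8 data bytes."""
    total = udp_payload + 8            # add the UDP header
    per_frag = (mtu - 20) // 8 * 8     # 1480 for a 1500-byte MTU
    return math.ceil(total / per_frag)
```

For the 10044-byte datagram this gives 7 fragments, so losing even the first few makes reassembly impossible and the whole datagram is discarded.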
Any clues or hints?
Update!
After resuming the testing of the above mentioned software I found a repeatable way of recreating the error.
Using windump on the sending Windows machine and tcpdump on the receiving machine, after having left the application idle for some time (~5 minutes), I tried sending the UDP message but ended up with only a single fragment caught by windump and tcpdump; the 3 remaining fragments were lost. Sending the same message one more time works fine: both windump and tcpdump catch all 4 fragments, and the application on the receiving side gets the message. The pattern is repeatable.
I started searching and found the following information, but to me it is still not a clear answer.
http://www.eggheadcafe.com/software/aspnet/32856705/first-udp-message-to-a-sp.aspx
Re-examining the logs, I now notice an ARP request/reply being sent, which matches one of the ideas given in the link above.
NOTE! I filter windump on the sending side using: "dst host receivernode"
Capture from windump: first failed udp message, should be 4 fragments long
14:52:45.342266 arp who-has receivernode tell sendernode
14:52:45.342599 IP sendernode > receivernode: udp
Capture from windump: second udp message, exactly the same contents, all 4 fragments caught
14:52:54.132383 IP sendernode.10104 > receivernode.10113: UDP, length 6019
14:52:54.132397 IP sendernode > receivernode: udp
14:52:54.132406 IP sendernode > receivernode: udp
14:52:54.132414 IP sendernode > receivernode: udp
14:52:54.132422 IP sendernode > receivernode: udp
14:52:56.142421 arp reply sendernode is-at 00:11:11:XX:XX:fd (oui unknown)
Anyone who has a good idea about what's happening? Please elaborate!