RabbitMQ - TCP connection left open after Celery worker finishes, no heartbeats are sent

Info
A hanging TCP connection remains on the RabbitMQ broker server after any Celery worker finishes.
We are using preemptible instances in Google Cloud Platform as workers in a processing pipeline. The number of connections builds up until eventually the Debian server running RabbitMQ runs out of memory.
Scenario summary
Worker boots and connects to RabbitMQ, two TCP connections are established
Worker finishes and the instance is stopped and removed
Worker is dead, connection A is closed, connection B remains
The same problem appears with two different RabbitMQ and Erlang versions:
RabbitMQ 3.7.17 + Erlang 22.0.7-1
RabbitMQ 3.10.14 + Erlang 25.0.4-1
Scenario
Worker boots and connects to RabbitMQ, two TCP connections are established.
Two connections on two different ports are established from the worker's IP to the RabbitMQ instance:
Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 A running
epic 10.240.60.56 B running
netstat shows two connections to RabbitMQ (5672)
Worker finishes and the instance is stopped and removed
Connection on port 36654 (port B in the dump below)
tcpdump shows the following
# last data packet from the broker to the worker on port B (36654)
17:05:24.395092 IP 10.240.50.2.5672 > 10.240.60.56.B: Flags [P.], seq 2769:2790, ack 48864, win 273, options [nop,nop,TS val 991690205 ecr 1201716502], length 21
# the worker (port B) ACKs the broker's last message
17:05:24.395252 IP 10.240.60.56.B > 10.240.50.2.5672: Flags [.], ack 2790, win 507, options [nop,nop,TS val 1201716502 ecr 991690205], length 0
# broker repeatedly retransmits the last 8 bytes (seq 4232:4240) to the worker on port A, with no reply
17:05:29.922421 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691587 ecr 1201692028], length 8
17:05:30.127621 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691639 ecr 1201692028], length 8
17:05:30.335615 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691691 ecr 1201692028], length 8
17:05:30.771599 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691800 ecr 1201692028], length 8
17:05:31.603593 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692008 ecr 1201692028], length 8
17:05:33.267555 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692424 ecr 1201692028], length 8
17:05:36.563603 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991693248 ecr 1201692028], length 8
17:05:43.219601 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991694912 ecr 1201692028], length 8
17:05:56.531566 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991698240 ecr 1201692028], length 8
17:06:23.923626 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991705088 ecr 1201692028], length 8
# broker resets (RST) connection A after giving up
17:06:59.920635 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [R.], seq 4240, ack 1324, win 58, options [nop,nop,TS val 991714087 ecr 1201692028], length 0
Worker is dead, connection A is closed, connection B remains
RabbitMQ reports that one connection remains, and netstat still shows a connection to port 5672:
Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 B running
This connection remains until RabbitMQ or server is restarted.
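As a stop-gap, the stale connection can also be dropped by hand from the broker (cleanup only, not a fix); a rough sketch:
# find the connection's Erlang pid, then close it with a reason
sudo rabbitmqctl list_connections pid user peer_host peer_port
sudo rabbitmqctl close_connection "<pid from the list above>" "worker instance was preempted"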
I expect RabbitMQ to send heartbeats on the remaining connection, discover that the peer is gone, and close the connection. It seems no heartbeat is sent.
Tried the following:
upgrading RabbitMQ and Erlang: the same problem remained => no effect
lowering the kernel TCP keepalive (net.ipv4.tcp_keepalive_time) from 60 seconds to 5 => no effect
lowering the RabbitMQ heartbeat interval from 60s to 10s => no effect (see the sketch after this list)
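Two things worth checking, neither of which is visible in the captures above: the AMQP heartbeat interval is negotiated per connection, so a client that requests 0 disables heartbeats on that connection regardless of the broker's own setting (for Celery this is the broker_heartbeat setting), and net.ipv4.tcp_keepalive_time only has an effect if SO_KEEPALIVE is actually enabled on the broker's sockets, which RabbitMQ does not do by default. A minimal sketch of how to check and change both (the config file path is the usual Debian one, an assumption):
# show the negotiated heartbeat per connection; a timeout of 0 means the client
# asked for no heartbeats, so the broker will never probe this peer
sudo rabbitmqctl list_connections user peer_host peer_port timeout
# enable TCP keepalive on the broker's listen sockets so the kernel keepalive
# sysctls apply (add to /etc/rabbitmq/rabbitmq.conf, then restart RabbitMQ):
#   tcp_listen_options.keepalive = true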
Debugging tools
To see connections I use:
sudo rabbitmqctl list_connections
and (RabbitMQ runs on port 5672)
sudo netstat -ntpo | grep -E ':5672\>'|wc -l
To see what packets are sent I use tcpdump and the IP+port to identify the two different connections. For readability I'll replace the two worker ports with A and B.
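For reference, the dumps above can be reproduced with something along these lines (the interface name and exact filter are my guess, not the original command):
# capture only AMQP traffic to/from the worker, without name resolution
sudo tcpdump -ni eth0 'tcp port 5672 and host 10.240.60.56'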

Related

Problem with testpmd on DPDK and OVS in Ubuntu 18.04

I have an X520-SR2 10G network card. I want to use it to create 2 virtual interfaces with Open vSwitch compiled with DPDK (installed from the Ubuntu 18.04 repository) and test these virtual interfaces with testpmd. I do the following:
Create Bridge
$ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev
Bind DPDK ports
$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-devargs=0000:01:00.0 ofport_request=1
$ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk options:dpdk-devargs=0000:01:00.1 ofport_request=2
Create dpdkvhostuser ports
$ ovs-vsctl add-port br0 dpdkvhostuser0 -- set Interface dpdkvhostuser0 type=dpdkvhostuser ofport_request=3
$ ovs-vsctl add-port br0 dpdkvhostuser1 -- set Interface dpdkvhostuser1 type=dpdkvhostuser ofport_request=4
Define flow rules
# clear all existing flows
$ ovs-ofctl del-flows br0
Add new flow rules
$ ovs-ofctl add-flow br0 in_port=3,dl_type=0x800,idle_timeout=0,action=output:4
$ ovs-ofctl add-flow br0 in_port=4,dl_type=0x800,idle_timeout=0,action=output:3
Dump flow rules
$ ovs-ofctl dump-flows br0
cookie=0x0, duration=851.504s, table=0, n_packets=0, n_bytes=0, ip,in_port=dpdkvhostuser0 actions=output:dpdkvhostuser1
cookie=0x0, duration=851.500s, table=0, n_packets=0, n_bytes=0, ip,in_port=dpdkvhostuser1 actions=output:dpdkvhostuser0
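Given where this ends up (see the fix at the end of this question), it may also be worth confirming before starting testpmd that hugepages are actually free on both NUMA nodes; a quick check that was not part of the original post:
$ grep Huge /proc/meminfo
$ cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/free_hugepages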
Now I run testpmd:
$ testpmd -c 0x3 -n 4 --socket-mem 512,512 --proc-type auto --file-prefix testpmd --no-pci --vdev=virtio_user0,path=/var/run/openvswitch/dpdkvhostuser0 --vdev=virtio_user1,path=/var/run/openvswitch/dpdkvhostuser1 -- --burst=64 -i --txqflags=0xf00 --disable-hw-vlan
EAL: Detected 32 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
Interactive-mode selected
Warning: NUMA should be configured manually by using --port-numa-config and --ring-numa-config parameters along with --numa.
USER1: create a new mbuf pool <mbuf_pool_socket_0>: n=155456, size=2176, socket=0
USER1: create a new mbuf pool <mbuf_pool_socket_1>: n=155456, size=2176, socket=1
Configuring Port 0 (socket 0)
Port 0: DA:17:DC:5E:B0:6F
Configuring Port 1 (socket 0)
Port 1: 3A:74:CF:43:1C:85
Checking link statuses...
Done
testpmd> start tx_first
io packet forwarding - ports=2 - cores=1 - streams=2 - NUMA support enabled, MP over anonymous pages disabled
Logical Core 1 (socket 0) forwards packets on 2 streams:
RX P=0/Q=0 (socket 0) -> TX P=1/Q=0 (socket 0) peer=02:00:00:00:00:01
RX P=1/Q=0 (socket 0) -> TX P=0/Q=0 (socket 0) peer=02:00:00:00:00:00
io packet forwarding packets/burst=64
nb forwarding cores=1 - nb forwarding ports=2
port 0:
CRC stripping enabled
RX queues=1 - RX desc=128 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX queues=1 - TX desc=512 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX RS bit threshold=0 - TXQ flags=0xf00
port 1:
CRC stripping enabled
RX queues=1 - RX desc=128 - RX free threshold=0
RX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX queues=1 - TX desc=512 - TX free threshold=0
TX threshold registers: pthresh=0 hthresh=0 wthresh=0
TX RS bit threshold=0 - TXQ flags=0xf00
testpmd> stop
Telling cores to stop...
Waiting for lcores to finish...
---------------------- Forward statistics for port 0 ----------------------
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 64 TX-dropped: 0 TX-total: 64
----------------------------------------------------------------------------
---------------------- Forward statistics for port 1 ----------------------
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 64 TX-dropped: 0 TX-total: 64
----------------------------------------------------------------------------
+++++++++++++++ Accumulated forward statistics for all ports+++++++++++++++
RX-packets: 0 RX-dropped: 0 RX-total: 0
TX-packets: 128 TX-dropped: 0 TX-total: 128
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Done.
testpmd>
Software versions:
OS: Ubuntu 18.04
Linux Kernel: 4.15
OVS: 2.9
DPDK: 17.11.3
What should I do now?
Where does the problem come from?
I finally caught the problem. The issue was the size of the socket memory allocation: I changed the --socket-mem value to 1024,1024 (1024 MB for each NUMA node) and generated packets with pktgen (also using --socket-mem 1024,1024). Everything works fine.
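For reference, the corrected invocation would look roughly like this (identical to the original command, only --socket-mem changed):
$ testpmd -c 0x3 -n 4 --socket-mem 1024,1024 --proc-type auto --file-prefix testpmd --no-pci --vdev=virtio_user0,path=/var/run/openvswitch/dpdkvhostuser0 --vdev=virtio_user1,path=/var/run/openvswitch/dpdkvhostuser1 -- --burst=64 -i --txqflags=0xf00 --disable-hw-vlan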

Making UDP broadcast work with wifi router

I'd like to test out UDP broadcast on a very simple network: an old wifi router (WRT54GS) that's not connected to the internet at all, an android tablet, and my macbook:
[Tablet] <\/\/\/\/\/> [Wifi Router] <\/\/\/\/\/> [Macbook]
where the wavy lines indicate wireless connections.
The Macbook has IP address 192.168.1.101, the tablet has IP address 192.168.1.102. The router is 192.168.1.1.
To avoid too much low-level detail, I wanted to use netcat to do my testing. I decided to use port 11011 because it was easy to type.
As a first step, I thought I'd try just making this work from the macbook back to itself. In two terminal windows, I ran these programs
Window 1: % nc -ul 11011
which I started up first, and then:
Window 2: % echo 'foo' | nc -v -u 255.255.255.255 11011
Nothing showed up in Window 1. The result in Window 2 was this:
found 0 associations
found 1 connections:
1: flags=82<CONNECTED,PREFERRED>
outif (null)
src 192.168.1.2 port 61985
dst 255.255.255.255 port 11011
rank info not available
I'm fairly certain I'm missing something obvious here. Can someone familiar with nc spot my obvious error?
This is a multi-part answer, gleaned from other SO and SuperUser answers and a bit of guesswork.
Mac-to-mac communication via UDP broadcast over wifi
The first thing is that the mac version of netcat (nc) as of Oct 2018 doesn't support broadcast, so you have to switch to "socat", which is far more general and powerful in what it can send. As for the listening side, what worked for me, eventually, was
Terminal 1: % nc -l -u 11011
What about the sending side? Well, it turns out I needed more information. For instance, trying this with the localhost doesn't work at all, because that particular "interface" (gosh, I hate the overloading of words in CS; as a mathematician, I'd hope that CS people might have learned from our experience what a bad idea this is...) doesn't support broadcast. And how did I learn that? Via ifconfig, a tool that shows how your network is configured. In my case, the output was this:
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
inet 127.0.0.1 netmask 0xff000000
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
nd6 options=201<PERFORMNUD,DAD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
ether 98:01:a7:8a:6b:35
inet 192.168.1.101 netmask 0xffffff00 broadcast 192.168.1.255
media: autoselect
status: active
en1: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
options=60<TSO4,TSO6>
ether 4a:00:05:f3:ac:30
media: autoselect <full-duplex>
status: inactive
en2: flags=963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX> mtu 1500
options=60<TSO4,TSO6>
ether 4a:00:05:f3:ac:31
media: autoselect <full-duplex>
status: inactive
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=63<RXCSUM,TXCSUM,TSO4,TSO6>
ether 4a:00:05:f3:ac:30
Configuration:
id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
ipfilter disabled flags 0x2
member: en1 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 5 priority 0 path cost 0
member: en2 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 6 priority 0 path cost 0
media: <unknown type>
status: inactive
p2p0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 2304
ether 0a:01:a7:8a:6b:35
media: autoselect
status: inactive
awdl0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1484
ether 7e:00:76:6d:5c:09
inet6 fe80::7c00:76ff:fe6d:5c09%awdl0 prefixlen 64 scopeid 0x9
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
inet6 fe80::773a:6d9e:1d47:7502%utun0 prefixlen 64 scopeid 0xa
nd6 options=201<PERFORMNUD,DAD>
most of which means nothing to me. But look at "en0", the ethernet connection to the wireless network (192.168). The data there really tells you something. The flags tell you that it supports broadcast and multicast. Two lines later, the word broadcast appears again, followed by 192.168.1.255, which suggested to me that this might be the right address to which to send broadcast packets.
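Incidentally, if you just want that broadcast address without eyeballing the whole ifconfig dump, something like this works (my shortcut, not part of the original write-up):
% ifconfig en0 | awk '/broadcast/ {print $NF}'
192.168.1.255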
With that in mind, I tried this:
Terminal 2: % echo -n "TEST" | socat - udp-datagram:192.168.1.255:11011,broadcast
with the result that in Terminal 1, the word TEST appeared!
When I retyped the same command in Terminal 2, nothing more appeared in Terminal 1; it seems that the "listen" is listening for a single message, for reasons I do not understand. But hey, at least it's getting me somewhere!
Mac to tablet communication
First, on the tablet, I tried to mimic the listening side of the mac version above. The termux version of nc didn't support the -u flag, so I had to do something else. I decided to use socat. As a first step, I got it working mac-to-mac (via the wifi router of course). It turns out that to listen for UDP packets, you have to use udp-listen rather than udp-datagram, but otherwise it was pretty simple. In the end, it looked like this:
Terminal 1: % socat udp-listen:11011 -
meaning "listen for stuff on port 11011 and copy to standard output", and
Terminal 2: % echo -n "TEST" | socat - udp-datagram:192.168.1.255:11011,broadcast
Together, this got data from Terminal 2 to Terminal 1.
Then I tried it on the tablet. As I mentioned, nc on the tablet was feeble. But socat was missing entirely.
I tried it, found it wasn't installed, and installed it.
Once I'd done that, on the Tablet I typed
Tablet: % socat udp-listen:11011 -
and on the mac, in Terminal 2, I once again typed
Terminal 2: echo -n "TEST" | socat - udp-datagram:192.168.1.255:11011,broadcast
and sure enough, the word TEST appeared on the tablet!
Even better, by reading the docs I found I could use
socat udp-recv:11011 -
which not only listens, but continues to listen, and hence will report multiple UDP packets, one after another. (udp-listen, by contrast, seems to wait for one packet and then try to communicate back with the sender of that packet, which isn't what I wanted at all.)
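Another variant that keeps listening is udp-recvfrom with the fork option, which hands each datagram to its own child process; I believe socat supports this, though it is not something the experiment above used:
% socat -u udp-recvfrom:11011,fork -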

ryu-manager's --observe-links option generates 'Unknown version (0x04)' on switches

I am trying to configure an SDN using 1 Ryu controller and 3 Open vSwitch datapaths.
Here are the commands I run on my datapaths to let them talk to the controller:
ovs-vsctl set bridge br0 protocols=[OpenFlow13]
ovs-vsctl set-controller br0 tcp:192.168.100.1:6633
Then, to get the topology of the network via HTTP/REST, I run this on the controller:
ryu-manager --observe-links /path-to-apps/rest_topology.py
Running tcpdump on any of the switches I see errors like this:
version unknown (0x04), type 0x03, length 8, xid 0x0000000 09:56:34.645491 IP 192.168.100.1.6633 > 192.168.100.2.53550: Flags [P.], seq 1:9, ack 8, win 235, options [nop,nop,TS val 2070367608 ecr 1308752524], length 8: OpenFlow
(I get this error for every ryu application I run, even "simple_switch_13.py")
I tried removing the line ovs-vsctl set bridge br0 protocols=[OpenFlow13] but it did not work: the switches were not connecting to the controller at all.
Any suggestion?
Thanks
"Version unknown" means that the tcpdump tool does not know which protocol version 0x04 is.
That is a well-formed packet, not an error!
So if you want to know what 0x04 is, try using Wireshark or a more complete tool.
It will turn out to be an OpenFlow packet: 0x04 is the wire version used by OpenFlow 1.3.
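If you want to confirm from the switch side that OpenFlow 1.3 (wire version 0x04) is what is actually in use, rather than upgrading the capture tooling, something like this should do it:
# what the bridge is configured to speak
ovs-vsctl get bridge br0 protocols
# talk to the bridge explicitly as OpenFlow 1.3
ovs-ofctl -O OpenFlow13 show br0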

Dynamic port forwarding fails after turning off and on Google Cloud virtual machine (compute engine)

I'm connecting to my Spark cluster master node with dynamic port forwarding so that I can open the Jupyter notebook web interface on my local machine.
I followed the instructions from this Google Cloud Dataproc tutorial: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
I created the SSH tunnel with the following command, as advised:
gcloud compute ssh --zone=<cluster-zone> --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" "<cluster-name>-m"
And opened web interface:
<browser executable path> \
"http://<cluster-name>-m:8123" \
--proxy-server="socks5://localhost:10000" \
--host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
--user-data-dir=/tmp/
It worked perfectly fine the first time I tried.
However, once I turned my Google Compute Engine instance off and turned it back on after a while, the exact same commands don't work, giving the error message below:
debug1: Connection to port 10000 forwarding to socks port 0 requested.
debug2: fd 8 setting TCP_NODELAY
debug3: fd 8 is O_NONBLOCK
debug3: fd 8 is O_NONBLOCK
debug1: channel 2: new [dynamic-tcpip]
debug2: channel 2: pre_dynamic: have 0
debug2: channel 2: pre_dynamic: have 3
debug2: channel 2: decode socks5
debug2: channel 2: socks5 auth done
debug2: channel 2: pre_dynamic: need more
debug2: channel 2: pre_dynamic: have 0
debug2: channel 2: pre_dynamic: have 19
debug2: channel 2: decode socks5
debug2: channel 2: socks5 post auth
debug2: channel 2: dynamic request: socks5 host cluster-1-m port 8123 command 1
channel 2: open failed: connect failed: Connection refused
debug2: channel 2: zombie
debug2: channel 2: garbage collecting
debug1: channel 2: free: direct-tcpip: listening port 10000 for cluster-1-m port 8123, connect from ::1 port 49535 to ::1 port 10000, nchannels 3
debug3: channel 2: status: The following connections are open:
Waiting for help:D
The Jupyter notebook server is not relaunched after reboots. You'll need to restart the notebook manually once the machine has booted, e.g.:
gcloud compute ssh <cluster-name>-m
nohup /usr/local/bin/miniconda/bin/jupyter notebook --no-browser > /var/log/jupyter_notebook.log 2>&1 &
Once the notebook is up and running, you should be able to access the web UI through the proxy.
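A quick way to confirm that the notebook server is listening again before re-opening the SOCKS tunnel (the port is the 8123 used above; ss may need to be swapped for netstat on older images):
gcloud compute ssh <cluster-name>-m --command "sudo ss -ltnp | grep 8123"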
Note: In general, Dataproc does not support stopping or restarting the entire cluster.

Google compute network load balancing health checks are failing

I have Debian instances with nginx on port 80. Firewall rules allow port 80:
Source Ranges: 0.0.0.0/0
Allowed Protocols or Ports: tcp:80
GCE health checks are failing for those instances, while curl correctly returns a 200 OK response.
On those instances I have installed upstart instead of the default System V init.
Could it be related? Are there any special services that should be running on the instance for health checks to work?
Here is the instance's tcpdump output, showing repeated unanswered SYNs coming from the load balancer health checker (169.254.169.254, as described here):
19:13:20.513882 IP 169.254.169.254.49291 > 130.211.125.185.80: Flags [S], seq 503850, win 8096, options [mss 1024], length 0
19:13:23.016788 IP 169.254.169.254.49291 > 130.211.125.185.80: Flags [S], seq 503850, win 8096, options [mss 1024], length 0
19:13:26.017750 IP 169.254.169.254.49291 > 130.211.125.185.80: Flags [S], seq 503850, win 8096, options [mss 1024], length 0
Since you changed the init daemon, it is very probable that your problem is related to the google-address-manager script not running. You can try to start the process manually or add the load balancer's IP address to the local routing table, as described in Google Compute Engine health checks failing.
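A sketch of that workaround, assuming eth0 and the load-balanced IP seen in the capture above; the service name is taken from the answer and may differ per image:
# try restarting the address manager
sudo service google-address-manager restart
# or tell the kernel to accept traffic addressed to the load-balanced IP locally
sudo ip route add to local 130.211.125.185/32 dev eth0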