We have an issue on the acceptance environment during a handshake process. External service tries to send some data and during handshake sometimes we reset the connection after timeout around 2 minutes. In the picture below you can see communication between two services our server IP ends with 11 and external service IP ends with 5.
The strange thing is that is more than 50% it's working and when it's happened (they will try to send us data every hour and we reject each of them). In between, if we send data to them the next try from them will be successful (picture below). In this case, they use the server IP ends with .6.
Does someone have a clue what can be a problem here? We have tried to find something in our logs but nothing wasn't logged. Some help regarding additional logging will be appreciated (we tried with https://learn.microsoft.com/en-us/dotnet/framework/network-programming/how-to-configure-network-tracing and https://learn.microsoft.com/en-us/dotnet/framework/wcf/diagnostics/tracing/configuring-tracing?redirectedfrom=MSDN). Our backed in written in C# WCF. The additional fact, when we try to send data to them we never have an issue, it's always working.
Currently, we plan to upgrade our product to use MQ(RabbitMQ or ActiveMQ) for message transfer between server and client. And now we are using a network lib(evpp) for doing so.
Because I don't use MQ before, so excpet for a lot of new features of MQ, I can't figure out the essential difference between them, and don't know exactly when and where should we use MQ or just use network library is fine.
And the purpose that we want to use MQ is that we want to solve the unreliability of communication, such as message loss or other problems caused by unstable network environment.
Hope there is someone familiar with both of them could release my confusion. Thanks for advance.
Message queuing systems (MQ, Qpid, RabbitMQ, Kafka, etc.) are higher-layer systems purpose-built for handling messages reliably and flexibly.
Network programming libraries/frameworks (ACE, asio, etc.) are helpful tools for building message queueing (and many other types of) systems.
Note that in the case of ACE, which encompasses much more than just networking, you can use a message queuing system like the above and drive it with a program that also uses ACE's classes for thread management, OS abstraction, event handling, etc.
Like in any network-programming, when a client sends a request to the server, the server responds with a response. But for this to happen the following conditions must be met
The server must be UP and running
The client should be able to make some sort of connection between them
The connection should not break while the server is sending the response to the client or vice-versa
But in case of a message queue, whatever the server wants to tell the client, the message is placed in a message-queue i.e., separate server/instance. The client listens to the message-queue and processes the message. On a positive acknowledgement from the client, the message is removed from the message queue. Obviously a connection has to made by the server to push a message to the message-queue instance. Even if the client is down, the message stays in the queue.
I have an app which creates two instances of RTCPeerConnection (within the same JS context) which attempt to connect to each other. While I'm developing, I reload the page often, maybe several times per minute. About 10% of the time, WebRTC will fail to progress to the 'iceConnectionState == "connected"' stage. This failure occurs even when I pass no STUN/TURN servers to createPeer().
I primarily use Chrome (OSX, currently version 81.0.4044.138). I have never been able to reproduce this on Firefox.
I have captured nearly-identical dumps of the success and failure cases using chrome://webrtc-internals.
I have spent many hours on this and haven't found any clue as to why this might be failing. Is it just some kind of temporary local network outage? Is there anything I can do within the code to have a 100% local connection rate?
I have seen similar flakiness because of mDNS candidates. Try disabling #enable-webrtc-hide-local-ips-with-mdns in chrome://flags and see if that helps!
After that I would grab a tcpdump and confirm you see ICE traffic flowing each way.
I am not pasting the source code here, if any one whats to reproduce the problem, download the code from this github project:
It is a Comet server, the server use libevent-2.0.21-stable http.
To reproduce the problem:
start the icomet-server from machine S
run curl http://ip:8100/stream from another machine C, the server S will show message that C has connected
if I press CTRL + C to terminate curl, the server knows that C is disconnected as expected.
if I pull out the network line from machine C(a physical network broken), the server will NOT know that C is disconnected, which it SHOULD know!
I will askany one who is familiar with libevent, how to make libevent 2 to detecting client network broken?
When the physical network link is interrupted, you won't always get a packet back to tell you that you lost the connection. If you want to find out about a disconnection, send a ping (a request that just asks for a no-op reply) periodically, and if the reply doesn't come within some reasonable timeout, assume something went wrong. Or just disconnect the client if they're idle for long enough.
When you did that Ctrl-C, the OS that the other end was running on was still working, and so it was able to generate a TCP RST packet to inform your server that the client has gone away. But when you break that physical link, the client is no longer capable of sending that cry for help. Something else has to infer that the client went away.
Now, if you try to send the client some data, the server kernel will notice (sooner or later) that the client is not replying to its messages. At this point you'll see the disconnect - but it may take several minutes for this to happen. If you're not sending any data, then it'll stay open until either you disconnect it, or the kernel attempts a TCP keepalive (a low-level way of the kernel asking "Hey, I haven't heard from you for a while, are you still there?") potentially hours later (or it might not even do a keepalive at all for you, depending on how things are configured).
Short story: My DDS subscriber cannot see data from my DDS publisher. What am I missing?
Long story:
QNX 6.4.1 VM A -- Broken Publisher. IP ends with .113
QNX 6.4.1 VM B -- Working Publisher. IP ends with .114
Windows 7 -- Subscriber/Main Dev box. IP ends with .203
RTI DDS 5.0 -- Middleware version
I have a QNX VM (hosted on the network, not on my machine) that is publishing some data via RTI DDS. The data never shows up in my Windows 7 subscriber application.
Interestingly enough, I can put the same code on VM B, and the subscriber gets data. Thinking this must be a Windows 7 firewall issue, I swapped VM A's IP address with VM B, but this did not help.
Using Wireshark, I can see some heartbeat traffic from VM A, but no data. From VM B, I see the heartbeat traffic and the data. Below is a sanitized Wireshark snippet.
NDDS_DISCOVERY_PEERS is set to include both multicast and the explicit IP address of the other side of each conversation. The QOS profiles are the same, and the RTI Analyzer indicates the Match Analysis was successful (all green).
VM A:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
VM B:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
Windows 7 box:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.113,udpv4://BLAH.114
When I include them in the NDDS_DISCOVERY_PEERS line, other folks on the network can see DDS traffic from VM A with DDS SPY on their Windows 7 box. My Windows 7 box can not.
Windows 7 event log does not appear to show any firewall or WFP stopping the data packets.
RTI DDS Spy run from my Windows 7 machine shows that VM A (0A061071) writers are visible on the network, but no data is being received. It also shows that the readers in my subscriber on my Windows 7 machine are visible, though it shows up at an odd IP address.
Bonus question (out of curiosity only, NOT the primary question): why does traffic on my local machine show up in DDS SPY as 192.168.11.1 instead of my machine's IP or even 127.0.0.1?
Main Question: What am I missing?
Update:
route print on my Windows 7 box appears to show that I have joined a multicast group with VM A.
netsh interface ip show joins seemed to concur.
Investigation Update:
I rebooted the VM to no effect.
I rebooted the Windows box to no effect.
I removed the multicast from the NDDS_DISCOVERY_PEERS environment
variables on both sides to no effect.
The Windows 7 box has three network interfaces (plus loopback): 1
LAN connection and 2 (unrelated) VM adapters. We are working with
the LAN connection. The QNX VM has one network interface (plus
loopback). Note that the working VM and the broken VM use a
different ethernet driver than each other, as they are slightly
different flavors of QNX 6.4.1. The broken one has wm0 as the
interface, and the working one has en0 as the interface. I don't believe this is the issue, but it is a difference.
I ran DDS SPY on the "broken" QNX VM while it was playing back, and
I got DDS data. I don't have a good method to sniff the network
between where the VM is hosted and my Windows 7 machine to see if it
makes it out of the interface, but looking at the transmitted packet
count out of the ethernet interface on the QNX VM indicates that it
is definitely transmitting something, and the Wireshark captures on the Windows 7 machine itself show that at least some traffic is making it through.
Other folks on the LAN here can see the DDS traffic from the
"broken" VM, which leads me to believe it is a Windows setup issue,
rather than a broken VM--I just can't see what it could be. I've
re-checked the firewall, to no avail. I would have thought that if it were a firewall issue, the problem would have gone away when I swapped IP addresses between VM A and VM B. In any case, the Windows 7 firewall is currently off, to no avail.
Below are several screens of Wireshark output. I skipped a bunch between the third and the fourth, as after the fourth, the traffic tended to look like the bottom of the fourth until the end.
(Skipped a bunch here)
(Pretty much continues on like the last 11 lines above)
What else should I try?
Update:
To answer Rose's question below, using rtiddsping -publisher on the bad VM and rtiddsping -subscriber works appropriately.
I wonder if this issue is caused by the weird IP address. The IP address it happens to publish and somehow latch on to is a local VM ethernet adapter (separate from VM A). See the screenshot below.
The address I would like it to attach to is 10.6.6.203. Any way I can specify that?
More than a year later this happened to me again with a different virtual machine. I had it working yesterday, so I was very suspicious. I scoured all my code changes for the past 24 hours for issues, but didn't find any. Then I decided to see if IT had pushed any patches to my computer.
Guess what? The Windows Firewall had been aggressively updated since the day before. Rules missing or changed, etc. The log said packets were being dropped. I opened the firewall filters up a bit, and suddenly, everything worked again. I hesitate to close this issue, as I am not 100% this was exactly the same as last year, but it feels very similar. I suspect that last year the settings in the firewall were not logging the packet drops.
Long and short of it: if DDS suddenly stops working, check your firewall settings.
A couple of things to try:
Try running rtiddsping -publisher on the broken VM and rtiddsping -subscriber on Windows. This has two advantages:
The data type is small and well-known, so if there's some problem with the data being fragmented due to the different Ethernet drivers, it will not happen with rtiddsping, and may help track down the problem.
Rtiddsping prints out when the publisher and subscriber discover each other, so you will be able to confirm that discovery is completing correctly on both sides. I am guessing discovery is working, because Analyzer is showing both applications, but it is good to confirm.
If you see the same problem with rtiddsping that you see with your application, increase the verbosity to rtiddsping -verbosity 3, and then 5. At the highest verbosity level, this will print (a lot of) additional output, which may give us a hint about what is happening.
To answer your bonus question about spy: The reason why spy is showing that IP address is because that is one of the addresses that is being announced as part of discovery. During discovery, a DomainParticipant can announce up to four IP addresses that can be used to reach it. Spy will choose one of those to display, but it may not be the actual address that is being used to communicate with the application. If your machine does not have any interface with the 192.168.11.1 IP address, this could indicate a larger problem. (This may be normal, though - as long as the correct IP is one of the four that are announced.)
Looking through the packet trace images, there is nothing that is obviously the problem. A few things I notice:
There seems to be a normal pattern of heartbeats/ACKNACKs in the final packet trace image. This indicates that there is some bidirectional communication between the two applications.
It is difficult to tell from these images whether the DATA being sent from .113 to .203 consists of participant-to-participant messages, or real discovery messages - except for two packets: packet #805, and packet #816 (fragments 811-815) look like discovery announcements that are being sent to .203. This indicates that you have at least four entities (DataWriters or DataReaders) in your application on .113.
So, discovery data is being sent by the application on .113. It is being received and reassembled by WireShark, but that doesn't always mean it was received correctly by the application.
Packet #816 has a heartbeat on the end of it. It is possible that packet #818 or #819 might be the ACKNACK that is responding to that heartbeat, but I can't be sure from the image. The next step is to look at those ACKNACKs from .203 to .113 to see if .203 thinks it has received all the discovery data. Here is an example of a HB/ACKNACK pair where a discovery DataReader has received all data:
Submessage: HEARTBEAT
...
firstSeqNumber: 1
lastSeqNumber: 1
The heartbeat sequence number is 1, which indicates it has only sent an announcement about a single DataReader.
Submessage: ACKNACK
...
readerSNState: 2/0:
bitmapBase: 2
numBits: 0
The readerSNState is 2/0, meaning it has received everything before sequence number two, and there is nothing missing. If there is something other than a 0 in the bitmap, it indicates the DataReader did not receive some data.
If you can confirm that the application is receiving all the discovery data correctly, it will be helpful if you can use a WireShark filter to show only user data, since the images aren't highlighting discovery vs. user data.
WireShark filter for just rtps2 user data:
rtps2 && (rtps2.traffic_nature == 3 || rtps2.traffic_nature == 1)
We had a similar issue with this. Here is the environment in a very summarized way:
A publisher
A working subscriber (laptop)
A non-working subscriber (desktop)
Both subscribers held exactly the same software (the desktop was a clone from the laptop, through Clonezilla), but rtiddsspy was blind from the desktop point of view; however, the opposite way worked well: the publisher machine's rtiddsspy saw the desktop. Laptop and publisher machines' always worked well. Laptop and desktop too (they saw each other's subscriptions)
The workaround for this (based on https://community.rti.com/content/forum-topic/discovery-issues) was to increase the MTU on the desktop NIC. Don't ask me why, but it worked.
EDIT: At the beginning, the MTU in the publisher was set to a higher value than the subscriber. So, we changed the MTU in the subscriber to match the publisher's.