Seemingly random WebRTC ICE connection failure when connecting to same machine - webrtc

I have an app which creates two instances of RTCPeerConnection (within the same JS context) which attempt to connect to each other. While I'm developing, I reload the page often, maybe several times per minute. About 10% of the time, WebRTC will fail to progress to the 'iceConnectionState == "connected"' stage. This failure occurs even when I pass no STUN/TURN servers to createPeer().
I primarily use Chrome (OSX, currently version 81.0.4044.138). I have never been able to reproduce this on Firefox.
I have captured nearly-identical dumps of the success and failure cases using chrome://webrtc-internals.
I have spent many hours on this and haven't found any clue as to why this might be failing. Is it just some kind of temporary local network outage? Is there anything I can do within the code to have a 100% local connection rate?

I have seen similar flakiness because of mDNS candidates. Try disabling #enable-webrtc-hide-local-ips-with-mdns in chrome://flags and see if that helps!
After that I would grab a tcpdump and confirm you see ICE traffic flowing each way.

Related

When is an endpoint bundle-aware and when not?

From
Link: www.w3.org/TR/webrtc/#dom-rtcbundlepolicy
Content: 4.2.5 RTCBundlePolicy Enum
"If the remote endpoint is bundle-aware, all media tracks and data channels are bundled onto the same transport."
When is an endpoint bundle-aware and when not? And what does bundle-aware means?
To establish a p2p connection, WebRTC will allocate and do STUN network checks on up to 3 ports (multiplied by ways they can be reached) on either end, and as they're discovered (which takes time), ask JS to trickle-exchange info on each of these "ICE candidates" across a signaling channel, once for video, once for audio, and once for data (if you have it).
WebRTC does this mostly to support connecting to non-browser legacy devices, because all modern browsers support BUNDLE, which is when all but one candidate end up being thrown away, and all media gets bundled over that single port.
WebRTC even has a "max-compat" mode that goes even further, allocating a port for each piece of media, just in case the other endpoint is really old.
WebRTC doesn't know the other endpoint is a browser until it receives an "answer" from it, but if you know, you can specify "max-bundle" and save a couple of milliseconds.

Debugging Performance of Spring Cloud Gateway with Netty

I have Spring Cloud Gateway (Greenwich) running with Netty. This application receives request and then sends request downstream applications depending on the route configuration.
Randomly few request take lot of time(> 70s). Even though the downstream server responded back within 5 sec, Netty threads (reactor-http-epoll-*) are not picking up the response. I have enabled debug logs to see what those threads are doing. From preliminary analysis, it look like those threads are processing something else and are always in runnable state. When this happens the traffic to server is not unusual and it's same as before.
My question here is:
Why response was not processed by reactor threads while response was received(according to the logging of the downstream app, it sent the response. However, spring-cloud app received response way too late in the logs). Is it possible that all the threads are busy doing other things.
Is there any run book on how such issues should be investigated?
Some-places in logs I do see high number of inactive connections in logs but not sure if that is impacting anything. (Channel cleaned, now 56 active connections and 1400 inactive connections)
Any general guidance on how to proceed with investigation to understand why random slowness is happening in application will really help. Thanks for the help in advance.
Okay, so I ended up doing below things and after lot of investigation it started working fine for me.
Enable logging. Look at how many connections are getting created. In my case, lot of new connections were getting created and and they were not getting re-used.
io.netty.leakDetectionLevel=paranoid
logging.level.reactor.netty=DEBUG
logging.level.reactor.netty.channel.FluxReceive=DEBUG
spring.cloud.gateway.httpclient.wiretap=true
spring.cloud.gateway.httpserver.wiretap=true
Make sure there is no blocking code running on reactor-http-epoll-* threads.
I upgraded Spring Cloud dependencies from Greenwhich train to latest version of Hoxton train.

libevent2 http not detecting client network broken

I am not pasting the source code here, if any one whats to reproduce the problem, download the code from this github project:
It is a Comet server, the server use libevent-2.0.21-stable http.
To reproduce the problem:
start the icomet-server from machine S
run curl http://ip:8100/stream from another machine C, the server S will show message that C has connected
if I press CTRL + C to terminate curl, the server knows that C is disconnected as expected.
if I pull out the network line from machine C(a physical network broken), the server will NOT know that C is disconnected, which it SHOULD know!
I will askany one who is familiar with libevent, how to make libevent 2 to detecting client network broken?
When the physical network link is interrupted, you won't always get a packet back to tell you that you lost the connection. If you want to find out about a disconnection, send a ping (a request that just asks for a no-op reply) periodically, and if the reply doesn't come within some reasonable timeout, assume something went wrong. Or just disconnect the client if they're idle for long enough.
When you did that Ctrl-C, the OS that the other end was running on was still working, and so it was able to generate a TCP RST packet to inform your server that the client has gone away. But when you break that physical link, the client is no longer capable of sending that cry for help. Something else has to infer that the client went away.
Now, if you try to send the client some data, the server kernel will notice (sooner or later) that the client is not replying to its messages. At this point you'll see the disconnect - but it may take several minutes for this to happen. If you're not sending any data, then it'll stay open until either you disconnect it, or the kernel attempts a TCP keepalive (a low-level way of the kernel asking "Hey, I haven't heard from you for a while, are you still there?") potentially hours later (or it might not even do a keepalive at all for you, depending on how things are configured).

RTI DDS subscriber not getting data from publisher

Short story: My DDS subscriber cannot see data from my DDS publisher. What am I missing?
Long story:
QNX 6.4.1 VM A -- Broken Publisher. IP ends with .113
QNX 6.4.1 VM B -- Working Publisher. IP ends with .114
Windows 7 -- Subscriber/Main Dev box. IP ends with .203
RTI DDS 5.0 -- Middleware version
I have a QNX VM (hosted on the network, not on my machine) that is publishing some data via RTI DDS. The data never shows up in my Windows 7 subscriber application.
Interestingly enough, I can put the same code on VM B, and the subscriber gets data. Thinking this must be a Windows 7 firewall issue, I swapped VM A's IP address with VM B, but this did not help.
Using Wireshark, I can see some heartbeat traffic from VM A, but no data. From VM B, I see the heartbeat traffic and the data. Below is a sanitized Wireshark snippet.
NDDS_DISCOVERY_PEERS is set to include both multicast and the explicit IP address of the other side of each conversation. The QOS profiles are the same, and the RTI Analyzer indicates the Match Analysis was successful (all green).
VM A:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
VM B:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.203
Windows 7 box:
NDDS_DISCOVERY_PEERS=udpv4://239.255.0.1,udpv4://127.0.0.1,udpv4://BLAH.113,udpv4://BLAH.114
When I include them in the NDDS_DISCOVERY_PEERS line, other folks on the network can see DDS traffic from VM A with DDS SPY on their Windows 7 box. My Windows 7 box can not.
Windows 7 event log does not appear to show any firewall or WFP stopping the data packets.
RTI DDS Spy run from my Windows 7 machine shows that VM A (0A061071) writers are visible on the network, but no data is being received. It also shows that the readers in my subscriber on my Windows 7 machine are visible, though it shows up at an odd IP address.
Bonus question (out of curiosity only, NOT the primary question): why does traffic on my local machine show up in DDS SPY as 192.168.11.1 instead of my machine's IP or even 127.0.0.1?
Main Question: What am I missing?
Update:
route print on my Windows 7 box appears to show that I have joined a multicast group with VM A.
netsh interface ip show joins seemed to concur.
Investigation Update:
I rebooted the VM to no effect.
I rebooted the Windows box to no effect.
I removed the multicast from the NDDS_DISCOVERY_PEERS environment
variables on both sides to no effect.
The Windows 7 box has three network interfaces (plus loopback): 1
LAN connection and 2 (unrelated) VM adapters. We are working with
the LAN connection. The QNX VM has one network interface (plus
loopback). Note that the working VM and the broken VM use a
different ethernet driver than each other, as they are slightly
different flavors of QNX 6.4.1. The broken one has wm0 as the
interface, and the working one has en0 as the interface. I don't believe this is the issue, but it is a difference.
I ran DDS SPY on the "broken" QNX VM while it was playing back, and
I got DDS data. I don't have a good method to sniff the network
between where the VM is hosted and my Windows 7 machine to see if it
makes it out of the interface, but looking at the transmitted packet
count out of the ethernet interface on the QNX VM indicates that it
is definitely transmitting something, and the Wireshark captures on the Windows 7 machine itself show that at least some traffic is making it through.
Other folks on the LAN here can see the DDS traffic from the
"broken" VM, which leads me to believe it is a Windows setup issue,
rather than a broken VM--I just can't see what it could be. I've
re-checked the firewall, to no avail. I would have thought that if it were a firewall issue, the problem would have gone away when I swapped IP addresses between VM A and VM B. In any case, the Windows 7 firewall is currently off, to no avail.
Below are several screens of Wireshark output. I skipped a bunch between the third and the fourth, as after the fourth, the traffic tended to look like the bottom of the fourth until the end.
(Skipped a bunch here)
(Pretty much continues on like the last 11 lines above)
What else should I try?
Update:
To answer Rose's question below, using rtiddsping -publisher on the bad VM and rtiddsping -subscriber works appropriately.
I wonder if this issue is caused by the weird IP address. The IP address it happens to publish and somehow latch on to is a local VM ethernet adapter (separate from VM A). See the screenshot below.
The address I would like it to attach to is 10.6.6.203. Any way I can specify that?
More than a year later this happened to me again with a different virtual machine. I had it working yesterday, so I was very suspicious. I scoured all my code changes for the past 24 hours for issues, but didn't find any. Then I decided to see if IT had pushed any patches to my computer.
Guess what? The Windows Firewall had been aggressively updated since the day before. Rules missing or changed, etc. The log said packets were being dropped. I opened the firewall filters up a bit, and suddenly, everything worked again. I hesitate to close this issue, as I am not 100% this was exactly the same as last year, but it feels very similar. I suspect that last year the settings in the firewall were not logging the packet drops.
Long and short of it: if DDS suddenly stops working, check your firewall settings.
A couple of things to try:
Try running rtiddsping -publisher on the broken VM and rtiddsping -subscriber on Windows. This has two advantages:
The data type is small and well-known, so if there's some problem with the data being fragmented due to the different Ethernet drivers, it will not happen with rtiddsping, and may help track down the problem.
Rtiddsping prints out when the publisher and subscriber discover each other, so you will be able to confirm that discovery is completing correctly on both sides. I am guessing discovery is working, because Analyzer is showing both applications, but it is good to confirm.
If you see the same problem with rtiddsping that you see with your application, increase the verbosity to rtiddsping -verbosity 3, and then 5. At the highest verbosity level, this will print (a lot of) additional output, which may give us a hint about what is happening.
To answer your bonus question about spy: The reason why spy is showing that IP address is because that is one of the addresses that is being announced as part of discovery. During discovery, a DomainParticipant can announce up to four IP addresses that can be used to reach it. Spy will choose one of those to display, but it may not be the actual address that is being used to communicate with the application. If your machine does not have any interface with the 192.168.11.1 IP address, this could indicate a larger problem. (This may be normal, though - as long as the correct IP is one of the four that are announced.)
Looking through the packet trace images, there is nothing that is obviously the problem. A few things I notice:
There seems to be a normal pattern of heartbeats/ACKNACKs in the final packet trace image. This indicates that there is some bidirectional communication between the two applications.
It is difficult to tell from these images whether the DATA being sent from .113 to .203 consists of participant-to-participant messages, or real discovery messages - except for two packets: packet #805, and packet #816 (fragments 811-815) look like discovery announcements that are being sent to .203. This indicates that you have at least four entities (DataWriters or DataReaders) in your application on .113.
So, discovery data is being sent by the application on .113. It is being received and reassembled by WireShark, but that doesn't always mean it was received correctly by the application.
Packet #816 has a heartbeat on the end of it. It is possible that packet #818 or #819 might be the ACKNACK that is responding to that heartbeat, but I can't be sure from the image. The next step is to look at those ACKNACKs from .203 to .113 to see if .203 thinks it has received all the discovery data. Here is an example of a HB/ACKNACK pair where a discovery DataReader has received all data:
Submessage: HEARTBEAT
...
firstSeqNumber: 1
lastSeqNumber: 1
The heartbeat sequence number is 1, which indicates it has only sent an announcement about a single DataReader.
Submessage: ACKNACK
...
readerSNState: 2/0:
bitmapBase: 2
numBits: 0
The readerSNState is 2/0, meaning it has received everything before sequence number two, and there is nothing missing. If there is something other than a 0 in the bitmap, it indicates the DataReader did not receive some data.
If you can confirm that the application is receiving all the discovery data correctly, it will be helpful if you can use a WireShark filter to show only user data, since the images aren't highlighting discovery vs. user data.
WireShark filter for just rtps2 user data:
rtps2 && (rtps2.traffic_nature == 3 || rtps2.traffic_nature == 1)
We had a similar issue with this. Here is the environment in a very summarized way:
A publisher
A working subscriber (laptop)
A non-working subscriber (desktop)
Both subscribers held exactly the same software (the desktop was a clone from the laptop, through Clonezilla), but rtiddsspy was blind from the desktop point of view; however, the opposite way worked well: the publisher machine's rtiddsspy saw the desktop. Laptop and publisher machines' always worked well. Laptop and desktop too (they saw each other's subscriptions)
The workaround for this (based on https://community.rti.com/content/forum-topic/discovery-issues) was to increase the MTU on the desktop NIC. Don't ask me why, but it worked.
EDIT: At the beginning, the MTU in the publisher was set to a higher value than the subscriber. So, we changed the MTU in the subscriber to match the publisher's.

Maximum number of concurrent NSURLConnections to the same host?

I'm running in to an issue in an OS X app that creates multiple, persistent connections to the same host using NSURLConnection. I create a separate connection for different rooms, and it stays connected the entire time the room is open to consume a streaming API. When opening many rooms, it stops working correctly.
I created a separate sample app that creates 10 connections, and it seems to only allow 6 connections to work, and the others are queued. Does anyone know if there is a way to override this limit? I can't find it documented anywhere. The only workaround I've found is it seems to be per host name, so testing with "localhost" and "127.0.0.1" allows 6 connections per host. I uploaded a sample project with client and server here - http://cl.ly/1x3K0D1F072V3U2T0C0I.
I filed a Radar for something that seems like the same issue but on iOS. I found that you can't have more than 5 connections open at once. The connections don't have to be pointing to the same domain. Anything after that would be queued. So if you have 5 connections open to an extremely slow endpoint, any other connections will not go through.
Radar: http://openradar.appspot.com/radar?id=2542401
Apple's reply:
This is the effect of our NSURLConnection connection cache. It is expected. We expect to address this type of configuration with new API.
I asked if they could give me anymore information (does it vary? does the type of connection affect it?) and they said:
Unfortunately, we can't give details about the connection limit behavior.
User agents in general (Chrome, Firefox, Safari) use six simultaneous TCP connections per hostname, with potential one-offs.
You could break this limitation by using CFNetwork API (CFHTTPMessage).
Here is the CFNetwork Programming Guide.
https://developer.apple.com/library/mac/documentation/Networking/Conceptual/CFNetwork/Introduction/Introduction.html#//apple_ref/doc/uid/TP30001132
BTW, if you decide to use CFNetwork, you'll need to work around the proxy and authenticate.
Wish this could helped!