RTI DDS reader fails to identify topic - data-distribution-service

and subsequently obviously to read/take the topic. The problematic topic is published under BuiltinQosLibExp::Generic.KeepLastReliable.TransientLocal policy and the message is fired only once at the startup of the publisher application. Few things to consider:
Im not using this policy and taking the default policy configuration in code
dds::sub::qos::DataReaderQos tempQos = inSubScriber->default_datareader_qos();
m_EntitySpecReader = new dds::sub::DataReader<XXX_ICD::Entity_Specification_DT>(*inSubScriber, topicLocal, tempQos, m_EntitySpecListener);
from subscriber
The problem is not Firewall or some connection issue, as I know to receive other cyclic topics without any problem.
It is frustrating that I see this topic if Im trying to monitor either with rtiddsspy or RTI administration console.
Last bullet and most frustrating, when I actually felt stuck, is that I have a listener configured with all available callbacks and I thought to receive if not the data at least some callback clue regarding the possible mismatch, lost, something .... but it keeps silence no matter what Im trying to do :)
Will be more than happy to understand if somebody has an answer or potential direction to check :)

You are using the default QoS for your DataReader. This means that its Durability policy is VOLATILE. Even though the DataWriter is configured as TRANSIENT_LOCAL, it will not deliver "old" samples to your DataReader since it is not requesting those due to its volatile durability. In this context, "old" samples are samples that were written before the DataWriter discovered the DataReader.
Things should start working as expected when you configure your DataReader with a Durability policy as TRANSIENT_LOCAL as well.
If you instrumented a Listener on the DataReader, it should show you that a match has taken place though, or that it has failed. If you implemented both the on_subscription_matched and on_requested_incompatible_qos callbacks, then at least one of those two should fire if you have both applications started and if they are able to discover each other.
Since you discovered that the problem was a type mismatch, I wanted to show how the AdminConsole tool could have helped you finding that. Reproducing your issue, this is what it showed:

Related

Handling Stale Data in Prometheus Gauges

Context: I want to build my own exporter for RabbitMQ. For that I've set an HTTP server that queries the management API, parses the response and builds the appropriate response with Prometheus format
I'm measuring number of messages in a queue to get alerted when a queue has too many messages in it. For that, i've set up the following gauge:
rabbitmq_queue_messages{queue_name="Q1"}
My question is: what happens if a queue gets deleted? for example:
at T1 the exporter returns rabbitmq_queue_messages{queue_name="Q1"} 5
at T2 the queue is being deleted for some reason
at T3 my exporter is being asked for the metrics again.
As I understand, at T3, even though the queue doesn't exist anymore, it will return the same rabbitmq_queue_messages{queue_name="Q1"} 5 response since this is how gauges work on Prometheus. For me it seems odd because at T3, Q1 doesn't exist anymore so I'd expect to stop receiving data points for this queue, instead of receiving stale data.
The workaround I found for this is to build a new prometheus registry on each request to the exporter to start with a clean sheet, but it seems a bit hacky and I really don't feel comfortable working this way.
So, how can I avoid having stale gauge data in a more Prometheus idiomatic way?
If this is a Java exporter, written using client_java, you can simply clear your Gauge (in myGauge.clear()) instead of building a whole new Registry.
Or, if that is too heavy-weight and you have a way to get notified when a queue is deleted, just call Gauge.remove(queueName) when you get the notification.
Edit: Never actually seen any Ruby code before, but it would seem that Registry.unregister("rabbitmq_queue_messages") might be the less heavy-handed way of clearing just the one metric (with all its label combinations, i.e. in your case for all queues). I don't see anything similar to the Java client's Gauge.remove() that would allow removing just one sample/label combination, but I might be missing something.

Continuous device and connection issues with routed Tokbox session

We’ve been using the Tokbox platform for several months now with a Javascript web-client as well as an Android phone client, where sessions and connections are managed by a Python server. While integration and bring-up went well on both ends (client and server), we continue to encounter problems with the in-session audio and video experience.
Sessions are always routed and always between two participants only, with much use of a collaborative editor.
The in-session experience is like a coin toss: we never know how it’s going to go, and that’s becoming a business threat.
Web-Client: A/V Resources
The most common problem is the acquisition of audio and/or video: at the beginning of a session, one or the other participants may have problems hearing or seeing the other. Allocating a new connection to establish new streams does not fix that, nor does restarting the browser.
Question: What’s the recommended way to detect possible resource locks (e.g. does another application hog the camera/microphone)?
Web-Client: Network
Bandwidth and packet loss are a challenge, for example this inspector graph:
Audio and video of both participants is all over the place, and while we can not control the network connections the web-client should be able to reliably give useful information.
Question: Other than continuous connection monitoring with getStats() and maybe the experimental navigator.connection property, how can the web-client monitor network connectivity?
Pre-Call Test
We recommend to customers to run a pre-call test and have implemented it on our site as well. However, results of that test often times do not reflect the in-session connectivity. Worse, a pre-call test may detect a low (no video) bandwidth while Skype works just fine.
Question: How can that be?
I'm a member of the TokBox development team. I remember you reported an issue with the Python SDK, thanks for that!
Web-Client: A/V Resources
Most acquisition issues are detected by the JS SDK and if they aren't then we'd really like to hear about it! Please report reproduction steps or affected session IDs to TokBox support (referencing this StackOverflow question): https://support.tokbox.com/hc/en-us/requests/new
Most acquisition errors appear as OT_HARDWARE_UNAVAILABLE or OT_MEDIA_ERR_ABORTED errors. Are you detecting and surfacing these errors to your users? There is also the special OT_CHROME_MICROPHONE_ACQUISITION_ERROR error which is due to a known issue with Chrome that has been mostly fixed since Chrome 63 (see https://bugs.chromium.org/p/webrtc/issues/detail?id=4799).
Web-Client: Network
Unfortunately this is one of the more difficult issues to troubleshoot. Yes, Subscriber#getStats() is the best tool we have at our disposal and is a wrapper around the native RTCPeerConnection#getStats() function. Unfortunately we don't have much control over the values returned by the native function and if you think our SDK is returning incorrect values when compared with values from RTCPeerConnection#getStats() then please let us know!
It would be worthwhile confirming whether the issue is reproducible in all browsers or only a particular one. If you have detailed data regarding the inaccuracy of the native RTCPeerConnection#getStats() function then we could work together to report it to the browser vendor(s).
Fortunately we have just released the new Publisher#getStats() function which lets you get the publisher side of the stats. This should help you narrow down the cause of a connectivity issue to either a publisher or subscriber side. Please let us know if this helps with tracking down these issues.
Pre-Call Test
Again, these tests are based on Subscriber#getStats() which in turn are based on RTCPeerConnection#getStats(), the accuracy of which is out of our hands, but we'd love any reproduction steps to either fix a bug in our client SDK or report a bug to the browser vendors.
Just to confirm though, when you say you've implemented a pre-call test in your site, did you use the official JavaScript network test module? https://github.com/opentok/opentok-network-test-js This is actually what's used by the TokBox pre-call test.
#Aiham, thanks for responding, I've been looking at the the new Publisher#getStats() you linked to (thank you!), so we too can give our users some way of visibly seeing the network conditions that might be affected the quality of their call (and who's causing it). However, it seems as though bytes / packets sent goes up sharply as the number of subscribers increases, even though we're in a routed session.
Am I wrong to expect the Publisher#getStats() statistics to stay fairly stable regardless of the number of subscribers then receiving that stream in a routed session? I expected the nature of a routed call to mean it's sent once to the OpenTok Media Servers, and the statistics would end there.

Ice connection state , Completed vs Connected

Can someone please clarify the difference between iceConnectionstate:completed vs iceConnectionstate:connected.
When I connect to browsers with webrtc I am able to exchange data using datachannel but for some reason the the iceConnectionstate on browser that made the offer reaming completed wheres the browser that accepted the offers changes to connected.
Any idea if this is normal?
In short:
connected: Found a working candidate pair, but still performing connectivity checks to find a better one.
completed: Found a working candidate pair and done performing connectivity checks.
For most purposes, you can probably treat the connected/completed states as the same thing.
Note that, as mentioned by Ajay, there are some notable difference between how the standard defines the states and how they're implemented in Chrome. The main ones that come to mind:
There's no "end-of-candidates" signaling, so none of those parts of the candidate state definitions are implemented. This means if a remote candidate arrives late, it's possible to go from "completed" back to "connected" without an ICE restart. Though I assume this is rare in practice.
The ICE state is actually a combination ICE+DTLS state (see: https://bugs.chromium.org/p/webrtc/issues/detail?id=6145). This is because it was implemented before there was such thing as "RTCPeerConnectionState". This can lead to confusion if there's actually a DTLS-level issue, since the only way to really notice is to look in a native Chrome log.
We definitely plan on fixing all the discrepancies. But for a while we held off on it because the standard was still in flux. And right now our priority is more on implementing unified plan SDP and the RtpSender/RtpReceiver APIs.
ICE Connection state transition is a bit tricky, with below flow diagram you can get clear idea on possible transitions.
In simple words:
new/checking: Not at connected
connected/completed: Media path is available
disconnected/failed: Media path is not available (Whatever data you are sending on data channel won't reach other end)
Read full summary here
Still WebRTC team is working hard to make it stable & spec compliant.
Current chrome behavior is confusing so i filed a bug, you can star it to get notified.

blocked requests in io_service

I have implemented client server program using boost::asio library.
In my implementation there are times when io_service.run() blocks indefinitely. In case I pass another request to io_service, the blocked call begins to execute normally.
Is there any way to see what are the pending requests inside the io_service queue ?
I have not used work object to block the run call!
There are no official ways to query into the io_service to find all pending request. However, there are a few techniques to debug the problem:
Boost 1.47 introduced handler tracking. Simply define BOOST_ASIO_ENABLE_HANDLER_TRACKING and Boost.Asio will write debug output, including timestamps, an identifier, and the operation type, to the standard error stream.
Attach a debugger dig through the layers to find and examine operation queues. This answer covers both understanding handler tracking and using a debugger to examine an operation queue for the epoll_reactor.
Finally, if you believe it is a bug, then it may be worth updating to the latest version or checking the revision history for relevant changes. Regardless, describing the problem in more detail may allow others to help identify the source of the problem and potential solutions.
Now i spent a few hours reading and experimenting (i need more boost::asio functionality for work as well) and it turns out: Kind of.
But it is not as straightforward or readable as one might hope.
Under the hood (well, under the outermost hood) io_service has a bunch of other services registered, which do the work async_ operations of their respective fields require.
These are the "Services" described in the reference.
Now sadly, the services stay registered, wether there is work to do or not. For example if your io_service has a udp socket, it will still have all the corresponding services, even if the socket itself is inactive.
But you can ask your io_service which services it has. Lets say you want to know wether your io_service called m_io_service has an udp datagram_socket_service. Then you can call something like:
if (boost::asio::has_service<boost::asio::datagram_socket_service<boost::asio::ip::udp> >(m_io_service))
{
//Whatever
}
That does not help a lot, because it will be true no matter wether the socket is active or not. But after you know, that you have that service, you can get a ref to it using use_service instead of has_service but with the same elegant amount of <>.
And now you can inspect the service to see what it is up to. Sadly, it will not tell you what the outstanding handlers names are (probably partly because it does not know them) but if it is a socket, you can get its implemention_type and with that check whether it currently is_open or find either the local_endpoint as well as the remote_endpoint.
In case of a deadline_timer_service you can, among other stuff, find out when it expires_at.
See the reference for more information what the service is and is not willing to tell you.
http://www.boost.org/doc/libs/1_54_0/doc/html/boost_asio/reference.html
This information should then hopefully allow you to determine which async_ operation did not return.
And if not, at the very least you can cancel any unexpectedly active services.

WCF Trace Log analysis - help

I am having trouble deciphering a WCF trace file, and I hope someone can help me determine where in the pipeline I am incurring latency.
The trace for "Processing Message XX" is shown below, where there appears to be 997ms delay between the Activity Boundary and the transfer to "Process Action" where my service code is executed (which takes approx 50ms).
First, I am unsure whether I am right in understanding the "Time" column to represent start time for the activity item. I believe this to be the case because, drilling into the "Processing action" trace displays a list of activities with the first timestamp equal to the timestamp shown in the above trace for the "Processing action" item.
My primary question is this: how do I determine what is happening during this 997ms time span? As I read about the service trace viewer, it seems that this activity type involves "transport or security processing", which leads me to believe it is a network issue, but I cannot be sure.
In case it is relevant, below is a snapshot of the drill-down to "Process action" trace.
Does anyone have some insight on how to drill further into this activity to pinpoint the cause of delay?
(I should mention that the response time varies from ~60ms to over a full second, and only seems to do so in a specific environment, which further leads me to the idea of a networking issue)
Thank you in advance!
I was having the identical problem. My transfer times ranged from 100's of milliseconds to 4 seconds. I installed Wireshark on the server and saw numerous network packet transmission errors. It was impressive that the network stack could sort it all out and the messages eventually went through. Eventually I noticed that the "Speed and Duplex" setting for the server NIC driver was set at 100Mb Full. The test client was at Auto and there were a couple of switches between them. I would think that all the devices could sort this out, but evidently not. Changing the server value to Auto resolved the network errors and the trace transfer delays went away.
I would suggest adding an additional trace source specifically the network tracing trace sources see How to: Configure Network Tracing
You can System.Net and System.Net.Sockets. This should help corroborate your supected networking issue.
As an aside, you mentioned that the activity in question involves transport or security processing, in previous experience I have discovered that if you are using certificate based security for client identities or message security using certificates, the WCF channel can be affected by the latency of traversing the certificate chain to verify the certificates. This may not apply to you as you may not be using certificate based security.