WCF Trace Log analysis - help

I am having trouble deciphering a WCF trace file, and I hope someone can help me determine where in the pipeline I am incurring latency.
The trace for "Processing Message XX" is shown below, where there appears to be a 997 ms delay between the Activity Boundary and the transfer to "Process action", where my service code is executed (which takes approximately 50 ms).
First, I am unsure whether I am right in understanding the "Time" column to represent the start time of each activity item. I believe this to be the case because drilling into the "Processing action" trace displays a list of activities whose first timestamp equals the timestamp shown in the trace above for the "Processing action" item.
My primary question is this: how do I determine what is happening during this 997ms time span? As I read about the service trace viewer, it seems that this activity type involves "transport or security processing", which leads me to believe it is a network issue, but I cannot be sure.
In case it is relevant, below is a snapshot of the drill-down to "Process action" trace.
Does anyone have some insight on how to drill further into this activity to pinpoint the cause of delay?
(I should mention that the response time varies from ~60ms to over a full second, and only seems to do so in a specific environment, which further leads me to the idea of a networking issue)
Thank you in advance!

I was having the identical problem. My transfer times ranged from hundreds of milliseconds to 4 seconds. I installed Wireshark on the server and saw numerous network packet transmission errors. It was impressive that the network stack could sort it all out and the messages eventually got through. Eventually I noticed that the "Speed and Duplex" setting for the server's NIC driver was set to 100Mb Full. The test client was set to Auto, and there were a couple of switches between them. I would have thought that all the devices could sort this out, but evidently not. Changing the server value to Auto resolved the network errors, and the trace transfer delays went away.

I would suggest adding additional trace sources, specifically the network tracing trace sources; see How to: Configure Network Tracing.
You can add System.Net and System.Net.Sockets. This should help corroborate your suspected networking issue.
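For reference, a minimal sketch of what that looks like in app.config/web.config, alongside the System.ServiceModel tracing you already have enabled (the output file name and the Verbose switch values below are placeholders; tune them to taste):

<system.diagnostics>
  <sources>
    <!-- Network-level trace sources to corroborate a suspected network issue -->
    <source name="System.Net" tracemode="includehex" maxdatasize="1024">
      <listeners>
        <add name="System.Net" />
      </listeners>
    </source>
    <source name="System.Net.Sockets">
      <listeners>
        <add name="System.Net" />
      </listeners>
    </source>
  </sources>
  <switches>
    <!-- Verbose is noisy; dial it down once you have what you need -->
    <add name="System.Net" value="Verbose" />
    <add name="System.Net.Sockets" value="Verbose" />
  </switches>
  <sharedListeners>
    <!-- Placeholder output file name -->
    <add name="System.Net"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="network.log" />
  </sharedListeners>
  <trace autoflush="true" />
</system.diagnostics>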
As an aside, you mentioned that the activity in question involves transport or security processing. In previous experience I have found that, if you are using certificate-based security for client identities, or message security using certificates, the WCF channel can be affected by the latency of traversing the certificate chain to verify the certificates. This may not apply to you, as you may not be using certificate-based security.

Related

Error Domain=NSPOSIXErrorDomain Code=28 “No space left on device” UserInfo={_kCFStreamErrorCodeKey=28, _kCFStreamErrorDomainKey=1}

Lately, numerous network requests made with Alamofire from our iOS devices fail with the following error:
Error Domain=NSPOSIXErrorDomain Code=28 "No space left on device"
UserInfo={_NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask .<3>,
_kCFStreamErrorDomainKey=1,
_NSURLErrorRelatedURLSessionTaskErrorKey=( "LocalDataTask .<3>" ),
_kCFStreamErrorCodeKey=28}
Our app has a mechanism to send a network request if the user has moved ±10 meters. This is checked every 5 seconds, so in theory a call can be made every 5 seconds. The network request fails occasionally with this message, returning no status code and the above error.
The message implies the error has to do with available disk/memory space on the device. However, after checking both, we found no link: there is plenty of space available. Also, the error occurs on multiple devices, all running iOS 14.4 or higher.
Is there information available regarding error code 28 and what could be the culprit on iOS devices? Even better; how can this error be prevented?
To answer the occurrence of the error itself:
NSPOSIXErrorDomain Code=28 "No space left on device"
With logs in the Xcode terminal:
2021-05-07 15:56:50.873428+0200 MYAPP[21757:7406020] [] nw_path_evaluator_create_flow_inner NECP_CLIENT_ACTION_ADD_FLOW 05CD829A-810D-412F-B86E-7524369359E8 [28: No space left on device]
2021-05-07 15:56:50.877243+0200 MYAPP[21757:7400322] Task <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1> finished with error [28] Error Domain=NSPOSIXErrorDomain Code=28 "No space left on device" UserInfo={_NSURLErrorFailingURLSessionTaskErrorKey=LocalUploadTask <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1>, _kCFStreamErrorDomainKey=1, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalUploadTask <5504BCDF-7DFE-4045-BD4B-E75054636D5B>.<1>"
), _kCFStreamErrorCodeKey=28}
It appears to be thrown when too many NSURLSessions are created, reaching a limit of (in our tests) 600-700 sessions that are not maintained or closed properly. The error started to be thrown as of iOS 14, so it would be interesting to know whether a limit was introduced there.
Linked below is a GitHub issue reporting the same problem in the ktor microservices framework by JetBrains, pointing in the same direction and mentioning the invalidation of sessions to prevent this issue:
https://github.com/ktorio/ktor/issues/1341
In our own project, the origin of the problem turned out to be our implementation of the StarScream websocket library. This might not be relevant for the issues others are having, but it is explained anyway to give a complete picture of the problem. It is the cause and fix of our specific situation.
At first we assumed it had something to do with the URLSession created by Alamofire (the networking library we use), since POST requests started to get cancelled and killing the app seemed to be the only way to make requests again.
However, we also make use of websocket connections using the StarScream library, which attempts to connect to a socket and, if that fails, retries every two seconds for a maximum of two hours. This means that for two hours, every two seconds, we connect to the socket -> receive a failure to connect -> disconnect the socket -> connect again. Because we used a singleton for the socket, we thought there was no possibility of creating multiple URLSessions, since the socket was only initiated once. However, calling connect on the socket again created a new nw_connection object every single time, since the library did not handle the disconnect properly.
(Image: NWConcrete_nw_connection objects generated in the socket connection)
This was validated using the Instruments app to check for the creation of new nw_connection objects. They showed up there logged as a "memory leak", and the solution was to make sure we disconnect the socket (invalidate the session) properly before connecting again.
I hope this answers a big part of the issue, and I will mark my own question as answered since this was the solution to the problem at hand. I think Apple should consider reporting accurately that the number of created objects is limited, instead of returning a "No space left on device" error.
Just wanted to chime in with more info, since we're experiencing the same issue.
Based on our analytics, this issue only started happening since iOS 14. We've verified it happening on 14.2, 14.4 and 14.5. Naturally the most straightforward cause for this error would be low memory or disk storage. We've excluded this option with additional logging, as you seem to have done as well.
A possibly related SO post has attributed the issue to a network inspecting framework that was enabled in their release build. It's worth checking if you use a similar tool.
Another report of this issue, this time on the Github of AFNetworking (predecessor to the Alamofire library you use), says they were able to fix it by limiting the creation of URLSession objects.
For us personally, neither of these did the trick. We created a support ticket with Apple, but this hasn't led to a solution. They requested a small sample project that reproduces the issue, but the error only manifested after 7 days of continuous use in our app. If you have a faster way to reproduce this, it may be worth submitting your own support ticket.
Hopefully this helps you find a solution, if you do please add this to your post to help others!

RTI DDS reader fails to identify topic

...and subsequently, obviously, fails to read/take it. The problematic topic is published under the BuiltinQosLibExp::Generic.KeepLastReliable.TransientLocal profile, and the message is fired only once, at the startup of the publisher application. A few things to consider:
I'm not using this profile; I'm taking the default QoS configuration in code (the snippet below is from the subscriber):
dds::sub::qos::DataReaderQos tempQos = inSubScriber->default_datareader_qos();
m_EntitySpecReader = new dds::sub::DataReader<XXX_ICD::Entity_Specification_DT>(*inSubScriber, topicLocal, tempQos, m_EntitySpecListener);
The problem is not a firewall or some other connection issue, as I receive other cyclic topics without any problem.
It is frustrating that I can see this topic when monitoring with either rtiddsspy or the RTI Administration Console.
The last and most frustrating point, where I actually felt stuck, is that I have a listener configured with all available callbacks, and I thought I would receive, if not the data, then at least some callback clue about a possible mismatch, a loss, something... but it stays silent no matter what I try :)
I will be more than happy if somebody has an answer or a potential direction to check :)
You are using the default QoS for your DataReader. This means that its Durability policy is VOLATILE. Even though the DataWriter is configured as TRANSIENT_LOCAL, it will not deliver "old" samples to your DataReader since it is not requesting those due to its volatile durability. In this context, "old" samples are samples that were written before the DataWriter discovered the DataReader.
Things should start working as expected when you configure your DataReader with a Durability policy as TRANSIENT_LOCAL as well.
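For illustration, one way to request that is with an XML QoS profile applied to the DataReader (a sketch only; the library and profile names below are placeholders, and you could equally set the Durability policy directly on tempQos in code). It mirrors the writer's KeepLastReliable.TransientLocal settings:

<dds>
  <qos_library name="MyQosLibrary">
    <qos_profile name="KeepLastReliableTransientLocalReader">
      <datareader_qos>
        <!-- Match the writer's TRANSIENT_LOCAL durability so historical samples are delivered -->
        <durability>
          <kind>TRANSIENT_LOCAL_DURABILITY_QOS</kind>
        </durability>
        <reliability>
          <kind>RELIABLE_RELIABILITY_QOS</kind>
        </reliability>
        <history>
          <kind>KEEP_LAST_HISTORY_QOS</kind>
          <depth>1</depth>
        </history>
      </datareader_qos>
    </qos_profile>
  </qos_library>
</dds>

The profile then needs to be loaded (for example from a USER_QOS_PROFILES.xml in the application's working directory) and selected when creating the DataReader.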
If you instrumented a Listener on the DataReader, it should show you that a match has taken place though, or that it has failed. If you implemented both the on_subscription_matched and on_requested_incompatible_qos callbacks, then at least one of those two should fire if you have both applications started and if they are able to discover each other.
Since you discovered that the problem was a type mismatch, I wanted to show how the AdminConsole tool could have helped you find that. Reproducing your issue, this is what it showed:

Continuous device and connection issues with routed Tokbox session

We’ve been using the Tokbox platform for several months now with a Javascript web-client as well as an Android phone client, where sessions and connections are managed by a Python server. While integration and bring-up went well on both ends (client and server), we continue to encounter problems with the in-session audio and video experience.
Sessions are always routed and always between two participants only, with much use of a collaborative editor.
The in-session experience is like a coin toss: we never know how it’s going to go, and that’s becoming a business threat.
Web-Client: A/V Resources
The most common problem is the acquisition of audio and/or video: at the beginning of a session, one or the other participant may have problems hearing or seeing the other. Allocating a new connection to establish new streams does not fix that, nor does restarting the browser.
Question: What’s the recommended way to detect possible resource locks (e.g. does another application hog the camera/microphone)?
Web-Client: Network
Bandwidth and packet loss are a challenge, for example this inspector graph:
The audio and video of both participants are all over the place, and while we cannot control the network connections, the web client should be able to reliably give useful information.
Question: Other than continuous connection monitoring with getStats() and maybe the experimental navigator.connection property, how can the web-client monitor network connectivity?
Pre-Call Test
We recommend that customers run a pre-call test and have implemented it on our site as well. However, the results of that test often do not reflect the in-session connectivity. Worse, a pre-call test may detect low (no video) bandwidth while Skype works just fine.
Question: How can that be?
I'm a member of the TokBox development team. I remember you reported an issue with the Python SDK, thanks for that!
Web-Client: A/V Resources
Most acquisition issues are detected by the JS SDK and if they aren't then we'd really like to hear about it! Please report reproduction steps or affected session IDs to TokBox support (referencing this StackOverflow question): https://support.tokbox.com/hc/en-us/requests/new
Most acquisition errors appear as OT_HARDWARE_UNAVAILABLE or OT_MEDIA_ERR_ABORTED errors. Are you detecting and surfacing these errors to your users? There is also the special OT_CHROME_MICROPHONE_ACQUISITION_ERROR error which is due to a known issue with Chrome that has been mostly fixed since Chrome 63 (see https://bugs.chromium.org/p/webrtc/issues/detail?id=4799).
Web-Client: Network
Unfortunately this is one of the more difficult issues to troubleshoot. Yes, Subscriber#getStats() is the best tool we have at our disposal and is a wrapper around the native RTCPeerConnection#getStats() function. Unfortunately we don't have much control over the values returned by the native function and if you think our SDK is returning incorrect values when compared with values from RTCPeerConnection#getStats() then please let us know!
It would be worthwhile confirming whether the issue is reproducible in all browsers or only a particular one. If you have detailed data regarding the inaccuracy of the native RTCPeerConnection#getStats() function then we could work together to report it to the browser vendor(s).
Fortunately we have just released the new Publisher#getStats() function which lets you get the publisher side of the stats. This should help you narrow down the cause of a connectivity issue to either a publisher or subscriber side. Please let us know if this helps with tracking down these issues.
Pre-Call Test
Again, these tests are based on Subscriber#getStats() which in turn are based on RTCPeerConnection#getStats(), the accuracy of which is out of our hands, but we'd love any reproduction steps to either fix a bug in our client SDK or report a bug to the browser vendors.
Just to confirm though, when you say you've implemented a pre-call test in your site, did you use the official JavaScript network test module? https://github.com/opentok/opentok-network-test-js This is actually what's used by the TokBox pre-call test.
@Aiham, thanks for responding. I've been looking at the new Publisher#getStats() you linked to (thank you!), so we too can give our users some way of visibly seeing the network conditions that might be affecting the quality of their call (and who's causing it). However, it seems as though bytes / packets sent go up sharply as the number of subscribers increases, even though we're in a routed session.
Am I wrong to expect the Publisher#getStats() statistics to stay fairly stable regardless of the number of subscribers then receiving that stream in a routed session? I expected the nature of a routed call to mean it's sent once to the OpenTok Media Servers, and the statistics would end there.

How do I correctly configure a WCF NetTcp Duplex Reliable Session?

Please excuse the obvious self-Q/A, but this information is widely misunderstood and almost always incorrectly answered. So I wanted to place this information here for people searching for a definitive answer to this problem.
Even so, there's still some information I haven't been able to nail down. I will put this towards the end of the question (skip to that if you are not interested in the preamble).
How do I correctly configure a WCF NetTcp Duplex Reliable Session?
There are many questions and answers regarding this topic, and nearly all of them suggest setting inactivityTimeout="Infinite" in your configuration. This doesn't really seem to work correctly, particularly for the case of NetTcp (It may work correctly for WSDualHttp Bindings, but I have never used those).
There are a number of other issues that are often associated with this, including: the channel not faulting after the client or server unexpectedly disconnects; the channel disconnecting after 10 minutes; the channel randomly disconnecting; the channel throwing an exception when trying to open; and being unable to configure Metadata on the same endpoint...
Please note: There are two concepts that are important below. Infrastructure messages are internal to the way WCF communicates, and are used by the framework to keep things running smoothly. Operation messages are messages that occur because your app has done something, like send a message across the wire. Infrastructure messages are largely invisible to your app (but they still occur in the background) while operation messages are the result of an action your app has taken.
Information I have figured out, through hard-won trial and error:
Infinite does not appear to be a valid configuration setting in all situations (and certainly, the Visual Studio validation schema doesn't know about it).
There are two special configuration converters, called InfiniteIntConverter and InfiniteTimeSpanConverter which will sometimes work to convert the value Infinite to either Int.MaxValue or TimeSpan.MaxValue, but I haven't yet figured out the situations in which this appears to be valid as sometimes it works, and sometimes it doesn't. What's more, it appears that some libraries will allow Infinite in the config, while others will not, so you can succeed in one part of a configuration, but fail in another.
You must configure BOTH inactivityTimeout and receiveTimeout, on both the client and the server. While these values do not HAVE to be the same, they probably should be, as differing values are likely to cause confusion. (Technically, you can leave inactivityTimeout at its default value if you want, but you should be aware of its value and what it does.)
inactivityTimeout should NEVER be set to a large value, much less Infinite or TimeSpan.MaxValue.
inactivityTimeout has two functions (and this is not widely understood). The first function defines the maximum amount of time that can elapse on a channel without receiving any "infrastructure" or "operation" messages. The second function defines the time period in which infrastructure messages are sent (half the time specified). If no infrastructure or operation messages have been received during the timeout period, the connection is aborted.
receiveTimeout specifies the maximum amount of time that can elapse between operation messages only. This value can be set to a large value, such as TimeSpan.MaxValue (particularly if your channel runs internally over a trusted network or over a vpn). This value is what defines how long the reliable session will "stay alive" if there is no activity between client and server (other than infrastructure messages). ie, your client does not call any methods of the interface, and your server does not call back into the client.
Setting a short inactivityTimeout and a large receiveTimeout keeps your reliable session "tacked up" even when there is no operational activity between your client and server. The short inactivity timeout (I like to keep the default of 10 minutes or less) sends infrastructure "ping" messages to keep the TCP connection alive, while the long receive timeout keeps the reliable session active, while at the same time providing a reasonable timeout in case of disconnection. (See the configuration sketch after this list.)
If you set inactivityTimeout to a large value, then the reliable session will not be reliable as it has no way to keep the Tcp connection alive, nor does it have any way to verify the integrity of the connection. It won't know if a user has disconnected unexpectedly until you try and send a message to that client and find out the connection is no longer there. This is why many people who use Infinite for this setting resort to creating a "Ping" method in their service, which is completely unnecessary if you've configured these settings correctly.
If you set inactivityTimeout to a value larger than receiveTimeout, then it will likewise be unreliable, as you will still be governed by the receiveTimeout for operation messages. I.e., if you forget to set receiveTimeout and leave it at the default 10 minutes, then if the user is idle for 10 minutes, the connection will be aborted.
When the client or server unexpectedly disconnects (app crashes, network failure, someone trips over the power cord, etc.), the other side may not notice right away. I have attached various ChannelFaulted event handlers in various test situations, and sometimes the connection is faulted right away... other times it doesn't seem to fault at all. What I have discovered through trial and error is that when it doesn't seem to fault, it will actually fault after the inactivityTimeout expires on that end. (So if it's set to 10 minutes, then after 10 minutes it will call the ChannelFaulted event.)
I have not yet figured out why in some situations it notices the disconnection right away, and others it waits for the timer to expire. In both cases, I notice internal first chance communication exceptions thrown and handled by the framework, and there are calls to Abort the connection... but somehow the call to the event handler gets lost and it must wait for the timeout. My suspicion is this is somehow thread related.
When trying to configure Metadata to work across the NetTcp channel, I have had sporadic results. Sometimes it works, sometimes it doesn't. I've read many reports that Metadata doesn't work over NetTcp and that you have to use an Http channel for the Metadata, but I have in fact had it work on several occasions using the net.tcp:// url to generate the proxy. Then I would change something, recompile and it would no longer work. Changing things back, it wouldn't work again. So it was very confusing what magic incantation was necessary to make Metadata function over net.tcp, shared with the endpoint on the same port (obviously with a different address).
When configuring both a NetTcp and a Metadata endpoint on the same service, and specifying non-default settings for connection parameters like listenBacklog and maxConnections, you also need to make sure the Metadata endpoint uses the same settings, which typically means you have to define a custom binding, since these settings are not available from the standard tcp mex binding. This includes setting listenBacklog and maxPendingConnections on tcpTransport, and groupName and maxOutboundConnectionsPerEndpoint on connectionPoolSettings. (A sketch of such a custom binding is included after this list.)
The default setting for the Ordered setting of ReliableSession is True. This uses a lot more overhead than turning it off. If you don't need ordered messages, I would suggest turning it off (you need to set this on both sides).
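To make the above concrete, here is a minimal sketch of the binding configuration described in these notes. It is illustrative only: the binding names, timeout values, and connection numbers are placeholders, and the same settings should be applied on both the client and the service.

<system.serviceModel>
  <bindings>
    <netTcpBinding>
      <!-- Short inactivityTimeout keeps infrastructure pings flowing;
           long receiveTimeout keeps the reliable session alive between operation messages.
           Values are placeholders. -->
      <binding name="DuplexReliableTcp"
               receiveTimeout="7.00:00:00"
               listenBacklog="200"
               maxConnections="200">
        <reliableSession enabled="true"
                         ordered="false"
                         inactivityTimeout="00:10:00" />
      </binding>
    </netTcpBinding>
    <customBinding>
      <!-- Hypothetical custom binding for the net.tcp mex endpoint, mirroring the
           non-default transport settings used by the application endpoint. -->
      <binding name="TcpMexMatchingTransport">
        <binaryMessageEncoding />
        <tcpTransport listenBacklog="200" maxPendingConnections="200">
          <connectionPoolSettings groupName="MyPool"
                                  maxOutboundConnectionsPerEndpoint="200" />
        </tcpTransport>
      </binding>
    </customBinding>
  </bindings>
</system.serviceModel>

The metadata endpoint would then reference the custom binding, e.g.:

<endpoint address="mex"
          binding="customBinding"
          bindingConfiguration="TcpMexMatchingTransport"
          contract="IMetadataExchange" />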
-
Configuration I still need to understand:
How do I correctly configure the shared net.tcp Metadata endpoint? (I will add an example when I get a chance.) Currently, I'm specifying an http get url to bypass the problem. It's inconsistent as to why it sometimes works and sometimes does not. I kept getting the error "The URI prefix is not recognized" when generating the proxy in Visual Studio.
Why does WCF sometimes Fault the channel immediately upon disconnect, and sometimes waits for inactivityTimeout to expire? What controls/causes one vs the other behavior?

Exactly how many users can the BlazeDS messaging service support? And to support more users, what do we need to do (polling)?

I designed an online trading application which uses BlazeDS and Jetty.
In it I used AMF long-polling as the channel, with the following parameters:
The problem is that each message is not reaching all of the connected users; messages are missed by some users (300 receiving out of 600)...
What do we need to do to deliver instant messages to everyone online?
Can someone please help?
Your question is too generic; it's not possible to give an answer because it depends on too many things: the network, the size of the messages, your system architecture, etc. My suggestion is to invest heavily in reading the BlazeDS developer guide and to turn the debug messages on (there is a lot of useful information displayed by BlazeDS). It would also help to study the BlazeDS source code.
In the case of AMF long-polling, the request is parked on the server, and if too many requests are parked at a time they will consume all of the server's available threads, so the next client won't be able to connect.
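To illustrate where that limit lives, a long-polling channel definition in services-config.xml typically bounds the number of parked requests. The values below are placeholders; the important point is to keep max-waiting-poll-requests well below the servlet container's request thread pool size so that new clients can still connect:

<channel-definition id="my-amf-longpolling" class="mx.messaging.channels.AMFChannel">
  <endpoint url="http://{server.name}:{server.port}/{context.root}/messagebroker/amflongpolling"
            class="flex.messaging.endpoints.AMFEndpoint"/>
  <properties>
    <polling-enabled>true</polling-enabled>
    <!-- Client poll interval in milliseconds (placeholder value) -->
    <polling-interval-millis>3000</polling-interval-millis>
    <!-- How long a poll request may be parked on the server; -1 = until data arrives -->
    <wait-interval-millis>-1</wait-interval-millis>
    <!-- Placeholder: keep this well below the container's thread pool size -->
    <max-waiting-poll-requests>100</max-waiting-poll-requests>
  </properties>
</channel-definition>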
In your case I am assuming the message size is not very big, and the solution can be one of the following:
Increase the number of available threads. For that you can run multiple server instances and distribute your clients over them.
You can make use of LCDS.
You don't get that problem in LCDS as it makes use of NIO endpoints that don't block the thread. I have come to know that this thread restriction is not a problem with Servlet 3.0, and in that case you can support more clients with BlazeDS itself. You can check more about it HERE.