We have an issue in our acceptance environment during the handshake process. An external service tries to send us data, and during the handshake we sometimes reset the connection after a timeout of around 2 minutes. In the picture below you can see the communication between the two services; our server's IP ends with .11 and the external service's IP ends with .5.
The strange thing is that it works more than 50% of the time, but once the problem occurs it persists (they retry sending us data every hour and we reject every attempt). In between, if we send data to them, their next attempt succeeds (picture below). In this case they use the server whose IP ends with .6.
Does anyone have a clue what the problem could be here? We have tried to find something in our logs, but nothing was logged. Any help with additional logging would be appreciated (we tried https://learn.microsoft.com/en-us/dotnet/framework/network-programming/how-to-configure-network-tracing and https://learn.microsoft.com/en-us/dotnet/framework/wcf/diagnostics/tracing/configuring-tracing?redirectedfrom=MSDN). Our backend is written in C# WCF. One additional fact: when we send data to them, we never have an issue; it always works.
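For reference, what we tried is essentially the stock System.Net tracing section from the first linked article, placed inside the <configuration> element of the service's web.config (the log file name below is just ours; the System.Net.Sockets source at Verbose level is the one that records connect and close events, which should show which side resets the connection):

<system.diagnostics>
  <sources>
    <source name="System.Net" tracemode="includehex" maxdatasize="1024">
      <listeners>
        <add name="System.Net" />
      </listeners>
    </source>
    <source name="System.Net.Sockets">
      <listeners>
        <add name="System.Net" />
      </listeners>
    </source>
  </sources>
  <switches>
    <add name="System.Net" value="Verbose" />
    <add name="System.Net.Sockets" value="Verbose" />
  </switches>
  <sharedListeners>
    <add name="System.Net"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="network.trace.log" />
  </sharedListeners>
  <trace autoflush="true" />
</system.diagnostics>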
Related
I have some images in my queue, and I pass each image to my Flask server, where the image processing is done and a response is received by my RabbitMQ server. After receiving the response, I get this error: "pika.exceptions.StreamLostError: Stream connection lost(104,'Connection reset by peer')". This happens when the RabbitMQ channel starts consuming on the connection again. I don't understand why this happens. I would also like to restart the server automatically if this error persists. Is there any way to do that?
Your consume callback is probably taking too long to complete and send the Ack/Nack back to the server. Because that work blocks pika's I/O loop, the client cannot send heartbeats, so the server assumes the client is dead and closes the connection. Then, on the client side, you receive:
pika.exceptions.StreamLostError: Stream connection lost(104,'Connection reset by peer')
You should check the server logs as well; they probably contain something like this:
missed heartbeats from client, timeout: 60s
See this issue for more information.
Do your work on another thread so the connection's I/O loop stays responsive and heartbeats can still be sent. See this code as an example:
https://github.com/pika/pika/blob/master/examples/basic_consumer_threaded.py
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
You can change the heartbeat timeout by setting heartbeat in ConnectionParameters:
connection_params = pika.ConnectionParameters(heartbeat=10)
where the number is in seconds; the example sets the heartbeat timeout to 10 seconds. Note that an AMQP heartbeat is not the same thing as a TCP keepalive; see the second link below for the distinction.
More information: https://www.rabbitmq.com/heartbeats.html and https://www.rabbitmq.com/heartbeats.html#tcp-keepalives
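If you are on .NET rather than Python (like the asker in the first question above), the equivalent knob in the .NET client looks roughly like this sketch (assuming RabbitMQ.Client 6.x, where RequestedHeartbeat is a TimeSpan; the host name is a placeholder):

using System;
using RabbitMQ.Client;

var factory = new ConnectionFactory
{
    HostName = "localhost",                        // placeholder broker host
    RequestedHeartbeat = TimeSpan.FromSeconds(10)  // same 10-second heartbeat as above
};
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();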
For a benchmarking test, I have a very basic setup: a single user looping 100 times (loop delay 100 ms), hitting an HTTPS endpoint (GET) with the HttpClient4 implementation; keep-alive has been turned on.
In the test results, I have observed a pattern where every 5th or 6th request the connect metric is higher, as if a full SSL handshake were occurring; check the image below. I am a bit confused by this. Any ideas on what's going on here and why the connect times are higher every n requests?
[UPDATE]
I was able to troubleshoot this issue a bit further today after turning on access logs on the load balancer (the target of this test), and I can see a pattern where JMeter seems to be switching client-side ports every few requests; the frequency matches the pattern observed previously in the JMeter test results.
This probably explains the elevated connect times; the question now is why JMeter switches ports.
This could be keep-alive; it certainly was for my issue. First, make sure keep-alive is enabled on the sampler. Then there is also a JMeter property that controls how long connections are kept alive:
httpclient4.time_to_live
I set it to 120000 in jmeter.properties, although according to the docs the user.properties file should be used. I know that jmeter.properties with a value of 120000 worked for me.
I set the value high to check whether an HTTP keep-alive timeout was causing the port switching. Whatever you set it to, make sure the client you are emulating does the same.
Since you get some quick results, I would guess it is a short timer somewhere rather than the server side disallowing keep-alive altogether. Wireshark can help you pinpoint this, as it could be the server side resetting the connection after a certain time. The config shown below extends the client-side time, which may get you the information you need; if not, have a look at the server-side equivalent, which will vary depending on what serves the endpoint.
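For reference, the exact line (using the 120000 ms value mentioned above) would look like this in user.properties:

# user.properties (value in milliseconds; match the client you are emulating)
httpclient4.time_to_live=120000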
On this previous question: Tell when wcf client lost connection, one of the commenters states:
Your service should not care whether a network cable was disconnected.
One feature of TCP is that unless someone is actively sending data, it
can tolerate momentary interruptions in network connectivity.
This is even more true in WCF, where there are layers of extra
framework to help protect you against network unreliability.
I'm having an issue where this is not working correctly. I have a WCF client that makes a connection to the server using a DuplexChannelFactory. The connection stays open for 3 minutes. I disconnect the client from the internet and reconnect. The client regains internet connectivity; however, any calls made from the server to that client fail. Once the client establishes a new connection, it begins working again.
When I pull the plug on the internet, the client throws several exceptions, but the channel is still listed as being in an open state. Once the connection is regained and I make a request from the server to the client, I get errors such as: The communication object, System.ServiceModel.Channels.ServiceChannel, cannot be used for communication because it has been Aborted.
Obviously, if a request comes in while the client is offline it won't work, but I'm trying to get this channel to recover once the internet comes back, without having to set up a new connection.
Should this work as-is, based on the comment I quoted above? Or is there something I need to change to actually make that work?
The issue here is that the channel you're trying to use is in a faulted state and cannot be used any longer (as the error message indicates).
Your client needs to trap (catch) that exception, then abort the current channel and create a new one. WCF will not do that for you automatically; you have to code for it yourself.
You could also check the channel's CommunicationState to see whether it is Faulted, and recover that way.
A final option would be to handle the channel's Faulted event: when the channel faults, abort it and create a new one.
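Putting those options together, here is a minimal sketch of the recovery pattern (the contract, callback class and endpoint name below are placeholders, not from your code):

using System;
using System.ServiceModel;

// Placeholder contracts standing in for the question's actual service contract.
[ServiceContract(CallbackContract = typeof(IMyCallback))]
public interface IMyDuplexService
{
    [OperationContract]
    void Ping();
}

public interface IMyCallback
{
    [OperationContract(IsOneWay = true)]
    void Notify(string message);
}

public class MyCallback : IMyCallback
{
    public void Notify(string message) { Console.WriteLine(message); }
}

public static class ResilientClient
{
    private static DuplexChannelFactory<IMyDuplexService> _factory;
    private static IMyDuplexService _channel;

    // Returns a usable channel, replacing the old one if it faulted or closed.
    public static IMyDuplexService GetChannel()
    {
        var comm = _channel as ICommunicationObject;
        if (comm == null
            || comm.State == CommunicationState.Faulted
            || comm.State == CommunicationState.Closed)
        {
            if (comm != null)
                comm.Abort();                     // a faulted channel must be aborted, not closed

            if (_factory == null)
                _factory = new DuplexChannelFactory<IMyDuplexService>(
                    new InstanceContext(new MyCallback()),
                    "MyDuplexEndpoint");          // endpoint name in config (placeholder)

            _channel = _factory.CreateChannel();
        }
        return _channel;
    }
}

Each call should then be wrapped in a try/catch for CommunicationException and TimeoutException, calling GetChannel() again on the next attempt so a fresh channel replaces the aborted one.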
I am not pasting the source code here; if anyone wants to reproduce the problem, download the code from this GitHub project:
It is a Comet server; the server uses the http module of libevent-2.0.21-stable.
To reproduce the problem:
start icomet-server on machine S
run curl http://ip:8100/stream from another machine C; server S will show a message that C has connected
if I press CTRL + C to terminate curl, the server knows that C is disconnected, as expected
if I pull out the network cable from machine C (a physical network break), the server will NOT know that C is disconnected, which it SHOULD!
I would like to ask anyone who is familiar with libevent: how can I make libevent 2 detect that the client's network link is broken?
When the physical network link is interrupted, you won't always get a packet back to tell you that you lost the connection. If you want to find out about a disconnection, send a ping (a request that just asks for a no-op reply) periodically, and if the reply doesn't come within some reasonable timeout, assume something went wrong. Or just disconnect the client if they're idle for long enough.
When you did that Ctrl-C, the OS that the other end was running on was still working, and so it was able to generate a TCP RST packet to inform your server that the client has gone away. But when you break that physical link, the client is no longer capable of sending that cry for help. Something else has to infer that the client went away.
Now, if you try to send the client some data, the server kernel will notice (sooner or later) that the client is not replying to its messages. At this point you'll see the disconnect - but it may take several minutes for this to happen. If you're not sending any data, then it'll stay open until either you disconnect it, or the kernel attempts a TCP keepalive (a low-level way of the kernel asking "Hey, I haven't heard from you for a while, are you still there?") potentially hours later (or it might not even do a keepalive at all for you, depending on how things are configured).
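To illustrate the kernel-level knob mentioned above in code (shown in C# for consistency with the rest of this page; in C you would make the corresponding setsockopt calls), here is a rough sketch of opting in to more aggressive TCP keepalives. The three Tcp* option names require .NET Core 3.0 or later, and the values are just examples:

using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// Ask the kernel to probe idle connections at all.
socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

// .NET Core 3.0+ only: tighten the default timers (often 2 hours idle)
// so a vanished peer is detected within about a minute.
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, 30);      // seconds idle before first probe
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, 10);  // seconds between probes
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, 3); // failed probes before the connection is reset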
I have a WCF service that is set up to use MSMQ to transmit to a service on another machine. We are trying to move the client onto a different machine, but it's not working. Enabling the MSMQ.End2End event log gives us
Message with ID {6940f8fa-3d31-4db0-ae2b-59bc98c99f2c}\25321 was sent to queue DIRECT=OS:iisapp1-vvpm\private$\TransactionalEmailService/TransactionalEmail.Service.TransactionalEmailService.svc
which makes me think that it is working correctly from our machine, but we can't find any trace of it on the target machine. The service is not being invoked, and we can't find the message in the dead-letter queue (or anywhere else we can think of to look).
Also, running the code directly from Visual Studio on my machine causes it to work.
Changing the receiving queue to the DEV machine also makes the code work, which further suggests it's a problem with the receiving machine (I just have no idea what).
UPDATE 1:
I came back to it and noticed all the messages I had tried to send sitting in the transactional dead-letter queue. The error message is "the time-to-reach-queue has elapsed". Looking at the connection state, it is inactive, and sending another message does not cause it to become connected. I restart the machine and it is "Connected" again. I try to send the message again and look at the queue state: there are 12 messages, all of which are unacknowledged (0 are unprocessed).
So it started happening again once the endpoint machine was restarted. I came across this article, which was the real solution:
http://blogs.msdn.com/b/johnbreakwell/archive/2007/02/06/msmq-prefers-to-be-unique.aspx
(In case the link goes away:)
It looks like there is a machine ID in MSMQ that is sent as part of each message. The remote host uses that ID as a key into a cache to determine where to send the ack back. If you clone a machine, the clone gets the same registry value for that ID, so the other machine cannot tell the two apart. It will send the ack to the wrong machine, which will discard it, and the sender will be stuck with a pile of messages it keeps trying to send. This also explains why it just started working one day: the cache entry expired and the "correct" machine got put back in as the endpoint.
Reinstalling MSMQ on the cloned machine fixes the issue.
I'm really not sure this is the case here (I don't have any experience with WCF in the context of MSMQ), but one of the more common reasons for the kind of behaviour you're describing is getting the obligatory, case-sensitive casing of the FormatName prefix wrong in your queue name when using the MessageQueue constructor (it must be written exactly as 'FormatName:DIRECT=...'), or otherwise getting the name wrong. The queue name in the message looks a bit odd with the .svc ending, but that could just be a WCF thing? Hope this at least helps get you in the right direction.
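For illustration, here is the prefix I mean, using System.Messaging directly (machine and queue names are made up, not the asker's):

using System.Messaging;

// "FormatName:" is case-sensitive and bypasses name resolution, which is
// what you want when addressing a remote private queue directly.
var queue = new MessageQueue(@"FormatName:DIRECT=OS:remotemachine\private$\MyQueue");
queue.Send("hello", MessageQueueTransactionType.Single);  // transactional queues need an explicit transaction type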
Not sure what to say here, but it works now. Reading some stuff helped point me to the status of the queue (click on Outgoing Queues under Features/Message Queuing). From there I found this KB article with a hotfix: http://support.microsoft.com/kb/976438. It didn't seem applicable, but the symptoms people described were all the same. Our guys tried to install it, but it failed and they didn't restart... yet for some reason the message queues started working.
If someone comes along with some insight, I'll gladly upvote them or give them the bounty (if it's soon enough). But I'll just accept this as the answer for now.
This is usually caused by permissions on the remote queue; the usual scenario is that you are using a private queue and it is accessed remotely by your WCF service.
Try using a public queue.
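Alternatively, if the queue must stay private, granting the sending identity explicit write access may be enough. A sketch using System.Messaging, where the account and queue names are placeholders:

using System.Messaging;

// Run this on the machine that owns the private queue.
var queue = new MessageQueue(@".\private$\TransactionalEmailService");  // placeholder local queue path
queue.SetPermissions(@"DOMAIN\ServiceAccount",                          // placeholder sending identity
                     MessageQueueAccessRights.WriteMessage,
                     AccessControlEntryType.Allow);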