I'm working on a backoff strategy for a robot that connects through the Twitter Streaming API. The API documentation states:
Back off linearly for TCP/IP level network errors. These problems are generally temporary and tend to clear quickly. Increase the delay in reconnects by 250ms each attempt, up to 16 seconds.
I understand these errors to occur when, for whatever reason, the client cannot communicate with the server (i.e., no Internet service). However, I'm not sure whether HTTP status codes equal to or greater than 500 should be treated as TCP/IP level network errors too (e.g., 503 Service Unavailable), because, in order to receive these error codes, a successful connection between client and server must already have been established.
Could someone please help me understand this?
Thanks.
Since this post earned me a "Tumbleweed" badge :-), I decided to post the reply I received at Twitter Developers:
#kurrik
I think your intuition is correct in that an HTTP status code is not a TCP/IP error, and you should use the exponential backoff. The slower backoff for these kinds of errors is so that your connection does not get rate limited. A 5XX error is a bit unusual, as it indicates an error which may have happened before or after the connection attempt was logged by the rate limiter. To be safe, I'd say use exponential backoff for this case (although most 503 issues should be cleared after the first reconnect attempt).
I ended up using this strategy (a code sketch follows the list):
Disconnections and 500 errors: linear backoff starting at 0.25s and adding 0.25s each time, up to 16s.
All errors greater than 500: exponential backoff starting at 5s and doubling the amount each time, up to 5 minutes.
Rate limited (420 / 429): exponential backoff starting at 1m and doubling the amount each time.
No reconnection: all 4XX errors other than the rate-limit errors.
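For concreteness, here is a minimal Python sketch of that schedule. The `kind` labels and the function name are mine, not part of any Twitter client library:

```python
from typing import Optional

def backoff_delay(kind: str, attempt: int) -> Optional[float]:
    """Seconds to wait before reconnect attempt `attempt` (0-based),
    or None if no reconnection should be made."""
    if kind == "network":        # disconnections and 500 errors
        return min(0.25 * (attempt + 1), 16.0)   # linear, capped at 16s
    if kind == "http_5xx":       # errors greater than 500
        return min(5.0 * 2 ** attempt, 300.0)    # exponential, capped at 5 min
    if kind == "rate_limited":   # 420 / 429
        return 60.0 * 2 ** attempt               # exponential, starting at 1 min
    return None                  # other 4XX: give up

# First four delays for each case:
for kind in ("network", "http_5xx", "rate_limited"):
    print(kind, [backoff_delay(kind, n) for n in range(4)])
```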
Related
Every once in a while, Bitmex disconnects our websocket connection, which forces us to reconnect. However, they provide a connection pool of 40 connections per hour. In times of low volatility this doesn't seem to be a problem at all, but as soon as trading activity goes up, we run through those 40 connections in no time, eventually leaving our connection dead.
We do have a keep-alive, but it does not solve the problem at all.
We haven't seen any specifics in the API documentation regarding how to deal with this problem, or the specific reasons we get so many close opcodes whenever volatility rises.
Does anyone know if we are doing something wrong?
EDIT: a heartbeat is also in place
I suggest implementing heartbeats as per https://www.bitmex.com/app/wsAPI#Heartbeats
In general, WebSocket connections can drop if the connection remains idle for too long without transmitting any data.
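As a rough illustration, here is a sketch using Python's `websockets` library. It combines protocol-level pings (via `ping_interval`/`ping_timeout`) with the application-level "ping"/"pong" strings the BitMEX page describes; the timing values are illustrative, not a recommendation:

```python
import asyncio
import websockets

async def listen(url: str) -> None:
    # ping_interval/ping_timeout enable protocol-level keepalive pings;
    # many idle-connection drops are avoided by this alone.
    async with websockets.connect(url, ping_interval=5, ping_timeout=10) as ws:
        while True:
            try:
                msg = await asyncio.wait_for(ws.recv(), timeout=5)
            except asyncio.TimeoutError:
                await ws.send("ping")   # idle: send an application-level ping
                continue
            if msg == "pong":
                continue                # heartbeat reply, nothing to process
            print(msg)                  # hand off to your real handler here

asyncio.run(listen("wss://www.bitmex.com/realtime"))
```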
If the API gateway (the single entry point to the system) fails, then none of the services can be accessed. Is there any HA (High Availability) design to handle API gateway failure?
1) Depending on your project's location, you can choose one more region as your disaster recovery plan. Whenever something fails in one region, you can immediately switch to the other region just by changing the endpoint.
2) You can use services like Route 53 to divide your traffic between two regions or two API gateways. That way you will keep at least part of your traffic flowing even if one API gateway fails.
3) Always keep CloudWatch alarms in place to get notified about any failures in your system.
4) It is very unlikely that an API gateway will fail. It is AWS, my friend.
"node_saini" has a great response and it's correct. I tried to comment but don't have the reputation to do so yet... the comment would say:
5) Configure your timeout to fail fast based on baselines, and implement retries with exponential backoff on 5xx errors to alleviate the small percentage of failures that may occur.
With all applications, temporary failures are expected, but permanent failures after retries can be a sign of a real problem brewing.
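A minimal sketch of point 5, assuming Python's `requests` library; the timeout baseline, retry budget, and backoff constants here are all illustrative:

```python
import random
import time
import requests

def get_with_retries(url, attempts=5, base=0.5, cap=30.0, timeout=2.0):
    for attempt in range(attempts):
        try:
            resp = requests.get(url, timeout=timeout)  # fail fast per baseline
            if resp.status_code < 500:
                return resp            # success, or a 4xx not worth retrying
        except requests.RequestException:
            pass                       # network-level failure: treat like a 5xx
        # Exponential backoff with jitter, capped to avoid huge sleeps.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError(f"still failing after {attempts} attempts: {url}")
```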
RabbitMQ allows you to "heartbeat" a connection, i.e. from time to time the client and the server check (using empty messages) that the other party is still there and available. So far, so good.
Unfortunately, I was not able to find a place in the documentation that suggests a reasonable value for this. I know that you need to specify the heartbeat in seconds, but what is a real-world best-practice value?
Obviously, it should not be too frequent (traffic), but also not too infrequent (proxies, …). Any suggestions?
Is 15 seconds fine? 30? 60? …?
This answer is for RabbitMQ < 3.5.5; for newer versions, see the answer from #bmaupin.
It depends on your application's needs. Out of the box it is 10 minutes for RabbitMQ. If you fail to ack the heartbeat twice (20 minutes of inactivity), the connection will be closed immediately, without any connection.close method or any error being sent from the broker side.
The case for using heartbeats is firewalls that close connections which have been inactive for a long time, or other network settings that don't allow you to keep idle connections open.
In fact, a heartbeat is not a must. From the RabbitMQ config doc:
heartbeat
Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 580
Note that having the heartbeat interval too short may result in significant network overhead. Keep in mind that heartbeat frames are sent only when there has been no other activity on the connection for a heartbeat interval.
The RabbitMQ documentation now provides a recommended heartbeat timeout value between 5 and 20 seconds:
Setting heartbeat timeout value too low can lead to false positives (peer being considered unavailable while it really isn't the case) due to transient network congestion, short-lived server flow control, and so on. This should be taken into consideration when picking a timeout value.
Several years worth of feedback from the users and client library maintainers suggest that values lower than 5 seconds are fairly likely to cause false positives, and values of 1 second or lower are very likely to do so. Values within the 5 to 20 seconds range are optimal for most environments.
Source: https://www.rabbitmq.com/heartbeats.html#false-positives
In addition, as of RabbitMQ 3.5.5 the default heartbeat timeout value is 60 seconds (https://www.rabbitmq.com/heartbeats.html#heartbeats-timeout)
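For what it's worth, here is a minimal sketch of setting the value from the client side, assuming the Python `pika` client (1.x, where the parameter is named `heartbeat`); 15 seconds is just an illustrative pick from the 5-20 second range above:

```python
import pika

params = pika.ConnectionParameters(
    host="localhost",  # assumption: a local broker
    heartbeat=15,      # seconds; 0 would disable heartbeats entirely
)
connection = pika.BlockingConnection(params)
channel = connection.channel()
# ... declare queues, publish, consume as usual ...
connection.close()
```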
I have a request that takes more than 30 seconds and it breaks.
What is the solution for this? I am not sure if I add more dynos this will work.
Thanks
You should probably see the Heroku Dev Center article regarding this, as the information there will be more helpful. Here's a small summary:
To answer the timeout question:
Cedar supports long-polling and streaming responses. Your app has an initial 30 second window to respond with a single byte back to the client. After each byte sent (either received from the client or sent by your application) you reset a rolling 55 second window. If no data is sent during the 55 second window your connection will be terminated.
(That is, if you had Cedar instead of Aspen or Bamboo, you could send a byte every thirty seconds or so just to trick the system. It might work.)
To answer your dynos question:
Additional concurrency is of no help whatsoever if you are encountering request timeouts. You can crank your dynos to the maximum and you'll still get a request timeout, since it is a single request that is failing to serve in the correct amount of time. Extra dynos increase your concurrency, not the speed of your requests.
(That is, don't bother adding more dynos.)
On request timeouts:
Check your code for infinite loops. If you're doing something big:
If so, you should move this heavy lifting into a background job which can run asynchronously from your web request. See Queueing for details.
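As a rough sketch of that pattern, assuming Redis plus the `rq` library (one of several queueing options on Heroku); `generate_report` stands in for whatever slow work your request is doing:

```python
from redis import Redis
from rq import Queue

def generate_report(user_id):
    ...  # the heavy lifting that used to run inside the web request

queue = Queue(connection=Redis())

def handle_request(user_id):
    # Enqueue and return within the 30-second window; a separate
    # worker dyno (`rq worker`) executes the job asynchronously.
    job = queue.enqueue(generate_report, user_id)
    return {"status": "queued", "job_id": job.id}
```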
Getting this WCF error, and no idea how to fix it:
System.ServiceModel.CommunicationException: The sequence has been terminated by the remote endpoint. The user specified maximum retry count for a particular message has been exceeded. Because of this the reliable session cannot continue. The reliable session was faulted.
Any ideas welcome :(
From the error message, it would appear that you're using reliable messaging. One of its features is that if a message transfer fails, it will be retried, up to a maximum number of attempts.
Evidently, in your setup, that maximum has been exceeded. This might indicate a problem with the network, with your service code, or both. It's really hard to tell from here without knowing what you're doing and what your setup is.
I guess the main question would be: do you really need the reliable messaging feature? What are you trying to achieve with it? If you could turn it off, you wouldn't be seeing those errors. Can you switch to some other mechanism, maybe message queuing (MSMQ)? Or can you rearchitect your app so you can live with the odd chance that a message might get delivered "out of band"?