An AWS Step Functions state machine has a Lambda function at its core that does heavy writes to an S3 bucket. When the state machine gets a usage spike, the function starts failing because S3 throttles further requests (com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate.). This obviously causes the state machine execution as a whole to fail, and it takes the whole system a few minutes to fully recover.
I looked into the AWS Lambda function scaling documentation and found that when we reduce the reserved concurrency setting, the function starts returning 429 status codes as soon as it can't handle new events.
So my idea for load-controlling the function execution can be summarized as follows:
Set the reserved concurrency to some lower value.
Catch the 429 errors in the Step Function and retry with a backoff rate.
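Roughly, for the first step I picture something like this (just a boto3 sketch; the function name and the limit of 10 are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# Cap concurrent executions; beyond this limit Lambda rejects new
# invocations with a throttling (429) error instead of scaling out.
lambda_client.put_function_concurrency(
    FunctionName="my-heavy-s3-writer",   # placeholder name
    ReservedConcurrentExecutions=10,     # placeholder limit
)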
I'd like to have feedback from you guys on the following aspects:
a. Does my approach make sense, or am I missing an obviously better way? I first thought of managing the load with AWS SQS or some execution-wide locking/semaphore, but didn't really get any further.
b. Is there maybe another way to tackle the issue from the S3 side?
This approach worked well for me:
States:
  MyFunction:
    Type: Task
    End: true
    Resource: "..."
    Retry:
      - ErrorEquals:
          - TooManyRequestsException
        IntervalSeconds: 30
        MaxAttempts: 5
        BackoffRate: 2
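As for tackling it from the S3 side (question b), one complementary option, just a sketch and not something the retry block above depends on, is to let the SDK inside the Lambda back off on its own when S3 throttles, e.g. with boto3's adaptive retry mode:

import boto3
from botocore.config import Config

# "adaptive" retry mode backs off and client-side rate-limits when S3
# returns throttling errors such as "Please reduce your request rate."
s3 = boto3.client(
    "s3",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

s3.put_object(Bucket="my-bucket", Key="some/key", Body=b"payload")  # placeholder bucket/key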
Does anyone know how to identify the issue and fix it?
Many thanks!
The latency does not look like a problem.
The Cloud Debugger API latency metrics will include ListActiveBreakpoints RPCs that deliberately stay open for a long period of time in order to reduce polling frequency. Essentially, the RPC returns when a new breakpoint is set or when the request times out (at roughly 50 seconds, based on your screenshot).
The Cloud Debugger API requests occur in the background so the latency should not affect your system in any meaningful way.
I've been reading a lot about error handling for AWS Lambdas, and nothing covers the topic of a running Lambda container just crashing.
Is this a possibility? Because it seems like one. I'm building an event-driven system using Lambdas, triggered by a file upload to S3, and I'm uncertain whether I should bother building in logic to pick up processing if a Lambda has died.
e.g. A file object is created on S3 -> S3 notifies Lambda of the event -> the Lambda instance happens to crash before it can start processing -> the event is now gone forever* (assumption here; I'm unsure if that's true, but can't find anything to say the contrary).
I'm debating building in logic to reconcile what is on S3 with what was processed each day, so I can detect the (albeit rare) scenario where a Lambda died (died and couldn't even write a failure to a DLQ) and we need to process those files. Is this worth it? Would S3 somehow know that the Lambda died and that it needs to put the event on a DLQ of its own?
From https://docs.aws.amazon.com/fr_fr/lambda/latest/dg/with-s3.html, S3 invokes Lambda asynchronously.
Next, from https://docs.aws.amazon.com/lambda/latest/dg/invocation-retries.html, asynchronous Lambda invocations are retried twice, without any queuing.
I guess if more retries are needed, it's better to set up SNS/SQS queuing.
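One concrete way to do that, as I read it, is to attach an SQS dead-letter queue to the function; a boto3 sketch (the function name and queue ARN are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# Events that still fail after the automatic retries are sent to the DLQ,
# so "lost" S3 notifications can be inspected and replayed later.
lambda_client.update_function_configuration(
    FunctionName="process-s3-upload",   # placeholder
    DeadLetterConfig={
        "TargetArn": "arn:aws:sqs:eu-west-1:123456789012:lambda-dlq"   # placeholder
    },
)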
I have implemented a simple Lambda function which gets triggered whenever an object is created on an S3 bucket.
Whenever an object is created on S3, the Lambda gets triggered. However, once the Lambda is triggered, it keeps executing at a certain interval even if there is no new upload to the S3 bucket.
Any suggestions would be really helpful.
Your function is timing out because you aren't calling the callback or using the context.succeed() method. I believe the retry count for errors is two, with backoff, but with timeouts S3 will keep retrying for a period of time that is not guaranteed but is usually quite long (a day?).
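If the handler were written in Python rather than Node.js, the equivalent fix is simply to return (or raise) so the invocation finishes instead of timing out; a rough sketch, with the actual processing omitted:

def handler(event, context):
    # Process each S3 record contained in the notification event.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print("processing s3://%s/%s" % (bucket, key))
    # Returning ends the invocation cleanly; in Node.js the equivalent is
    # calling the callback or context.succeed(), as described above.
    return {"status": "done"}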
I'm developing an Elasticsearch plugin that extracts terms from fields matching a pattern. To get all the scaffolding done, I started out from this plugin: https://github.com/jprante/elasticsearch-index-termlist. So I extend TransportBroadcastOperationAction, have it broadcast the request to activePrimaryShardsGrouped, and then in newResponse I merge the shard results, count the failed shards, and eventually pass that counter to the BroadcastOperationResponse constructor.
I call this on the ES client like this:
TermListResponse resp = TermListAction.INSTANCE.newRequestBuilder(client)
    .setIndices("foo")
    .setFields("bar", "baaz")
    .setPattern("wombat*")
    .execute()
    .actionGet();
My problem is that the above does not throw an exception when there were failed shards, although it indicates that in resp.getFailedShards(). Is this how it's supposed to be, or am I doing something wrong? Checking resp.getFailedShards() after every invocation doesn't look very safe, because someone can forget to do that and accidentally work with a partial term list.
Furthermore, the cause of the failed shards in my case was that the cluster had recently been restarted, so the client could already connect but some shards weren't ready yet. I think it would be nice if the action just waited for the broadcast target shards to become ready (with some timeout, of course), just like search requests apparently do. Maybe that means waiting for the "yellow" cluster health state, but where am I supposed to do that if I want to stay true to the ES approach?
I'm using Django Celery with Redis to run a few tasks like this:
header = [
    tasks.invalidate_user.subtask(args=(user,)),
    tasks.invalidate_details.subtask(args=(user,))
]
callback = tasks.rebuild.subtask()
chord(header)(callback)
So it's basically the same as stated in the documentation.
My problem is that when this chord is called, the celery.chord_unlock task keeps retrying forever. The tasks in the header finish successfully, but because chord_unlock is never done, the callback is never called.
Guessing that my problem is with not being able to detect that the tasks from the header are finished, I turned to the documentation to see how this can be customized. I found a section describing how the synchronization is implemented, and there is an example provided; what I'm missing is how I get that example function to be called (i.e. is there a signal for this?).
Further, there's a note that this method is not used with the Redis backend:
This is used by all result backends except Redis and Memcached, which increment a counter after each task in the header, then applying the callback when the counter exceeds the number of tasks in the set.
But it also says that the Redis approach is better:
The Redis and Memcached approach is a much better solution
What approach is that? How is it implemented?
So, why is chord_unlock never done and how can I make it detect finished header tasks?
I'm using: Django 1.4, celery 2.5.3, django-celery 2.5.5, redis 2.4.12
You don't have an example of your tasks, but I had the same problem and my solution might apply.
I had ignore_result=True on the tasks that I was adding to a chord, defined like so:
@task(ignore_result=True)
Apparently ignoring the result makes it so that the chord_unlock task doesn't know the header tasks are complete. After I removed ignore_result (even if the task only returns True), the chord called the callback properly.
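A rough sketch of what the fixed setup looks like (imports assume a Celery 3.x style project; the task names mirror the question and the bodies are placeholders):

from celery import chord, task

@task   # note: no ignore_result=True, so the result backend records completion
def invalidate_user(user):
    return True

@task
def invalidate_details(user):
    return True

@task
def rebuild(results):
    # receives the list of header results once chord_unlock sees them all
    return results

user = "some-user-id"   # placeholder
header = [invalidate_user.subtask(args=(user,)),
          invalidate_details.subtask(args=(user,))]
chord(header)(rebuild.subtask())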
I had the same error. I changed the broker to RabbitMQ, and chord_unlock keeps working until my tasks finish (2-3 minute tasks).
When using Redis, the tasks finish but chord_unlock only retried about 8-10 times, every 1s, so the callback was not executed correctly.
[2012-08-24 16:31:05,804: INFO/MainProcess] Task celery.chord_unlock[5a46e8ac-de40-484f-8dc1-7cf01693df7a] retry: Retry in 1s
[2012-08-24 16:31:06,817: INFO/MainProcess] Got task from broker: celery.chord_unlock[5a46e8ac-de40-484f-8dc1-7cf01693df7a] eta:[2012-08-24 16:31:07.815719-05:00]
... just like 8-10 times....
Changing the broker worked for me. Now I am testing @Chris's solution, and my callback function never receives the results from the header subtasks :S, so it does not work for me.
celery==3.0.6
django==1.4
django-celery==3.0.6
redis==2.6
broker: redis-2.4.16 on Mac OS X
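For reference, switching the broker in a django-celery setup is roughly a one-line settings change; a sketch (URL and credentials are placeholders):

# settings.py (django-celery / Celery 3.0 style)
import djcelery
djcelery.setup_loader()

# point Celery at RabbitMQ instead of Redis for the broker
BROKER_URL = "amqp://guest:guest@localhost:5672//"

# a result backend is still needed for the chord results
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"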
This could also cause such a problem. From the documentation:
Note:
If you are using chords with the Redis result backend and also overriding the Task.after_return() method, you need to make sure to call the super method or else the chord callback will not be applied.
def after_return(self, *args, **kwargs):
    do_something()
    super(MyTask, self).after_return(*args, **kwargs)
As I understand it, if you have overridden the after_return method in your task, it must either be removed or at least call the super method.
See the bottom of this page: http://celery.readthedocs.org/en/latest/userguide/canvas.html#important-notes
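In a class-based task, that would look roughly like this (MyTask and do_something are placeholders taken from the snippet above):

from celery import Task

def do_something():
    pass   # whatever cleanup the original after_return did

class MyTask(Task):
    def run(self, user):
        return True

    def after_return(self, *args, **kwargs):
        do_something()
        # without this super() call the chord callback is never applied
        # when using the Redis result backend
        super(MyTask, self).after_return(*args, **kwargs)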