AWS ECS Fargate task submission pending time

We have an ECS Fargate cluster that we have just created, and when testing we noticed that a newly submitted task takes about 2-3 minutes to go from PENDING to RUNNING.
Since we run a new task there every minute, that's not good enough for us.
Is there any way to optimize the PENDING-to-RUNNING time?

This largely depends on the size of your container image. For example, I make heavy use of Go from-scratch containers, so they are only about 15 MB, and I get launch times from nothing to running in roughly 15-20 seconds.
The biggest thing you can do right now to improve launch times is to reduce the size of your container image.

Related

Hangfire job creation throughput performance with Redis RDB

In the official documentation there is a chart which indicates that job creation throughput with Redis RDB can be around 6,000 jobs per second. I have tried different Hangfire, Redis, and hardware configurations, but I always get at most around 200 jobs per second. I even created a simple example that reproduces it (Hangfire configuration, job creation).
Am I doing something wrong? What job creation throughput are you getting?
I am using the latest versions: Hangfire 1.7.24, Hangfire.Pro 2.3.0, Hangfire.Pro.Redis 2.8.10, and Redis 6.2.1.
The point is that in the referenced sample application, background jobs are created sequentially, one after another. In this case background jobs aren't created fast enough to reach higher throughput, because of the I/O delays (round-trips to the storage). And since there is also a call to Hangfire.Console that requires even more I/O, the creation process is slower still.
Try creating the background jobs in a Parallel.For loop, so they are created in parallel and the latency is amortized. And try to create all the background jobs before starting the server, to keep a clear distinction between the created/sec and performed/sec metrics as shown below; otherwise everything will be mixed up.
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Hangfire;

// Assumes Hangfire storage (e.g. Redis) is configured at startup.
var sw = Stopwatch.StartNew();
// Create the jobs in parallel to amortize storage round-trip latency.
Parallel.For(0, 100000, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, i =>
{
    BackgroundJob.Enqueue(() => Jobs.Empty());
});
Console.WriteLine(sw.Elapsed);
// Start the server only after creation finishes, so the created/sec and
// performed/sec metrics stay separate.
using (new BackgroundJobServer())
{
    Console.ReadLine();
}

public static class Jobs { public static void Empty() { } } // no-op job body
On my development machine it took 7.7 sec to create 100,000 background jobs (~13,000 jobs/sec), and the Dashboard UI told me the perform rate was ~3,500 jobs/sec. That's a bit lower than displayed on the chart, but that's because there are more extension filters in Hangfire now than six years ago, when that chart was created. If we clear them with GlobalJobFilters.Filters.Clear(), we get about 4,000 jobs/sec.
To avoid confusion, I've removed the absolute numbers from those charts today. Absolute numbers differ between environments, e.g. on-premise (can be faster) and cloud (will be slower). That chart was created to show the relative difference between SQL Server and Redis in different modes, which is approximately the same in different environments, not to show precise numbers, which depend on a lot of factors, especially when a network is involved.

ADF Dataflows: do I have any control or influence over cluster startup time? (NOT "TTL")

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Acquiring compute from an existing "warm" cluster (i.e. one which has been kept alive using TTL) for a new dataflow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per source day).
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes is quite a substantial overhead. (I have a pipeline where fully half the execution time is spent waiting for those 1-2 minute "Acquire Compute" startups.)
Do I have any control at all over how long startup takes in each case? Is there anything I can do to speed up the startup, or anything I should avoid so as not to make things even worse?
There's a new feature in town to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.

Is it possible to limit the number of Oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario, and why we need this, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some with multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool for Oozie to start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate these spikes, but that would be a massive waste (hundreds of cores and GBs of memory that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will run behind sometimes. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into a single coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You distribute your actions across K queues in your workflow, and this ensures that you never have more than K actions running at the same time, limiting the load on the cluster.
We use a script to generate the fork queues inside the workflow automatically and to distribute the actions (of course this is only for actions that can run in parallel, i.e. where there are no data dependencies etc.).
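By way of illustration, here is a minimal C# sketch of such a generator (the action names, the count of 12 actions, and K=4 are invented for the example, and the emitted action bodies are left as placeholders); it round-robins independent actions into K queues and prints the fork/join fragment of the workflow XML:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical generator: chains N independent actions into K queues behind
// one fork/join pair, so at most K actions ever run at the same time.
class ForkQueueGenerator
{
    static void Main()
    {
        var actions = Enumerable.Range(0, 12).Select(i => "action-" + i).ToList();
        const int k = 4; // maximum degree of parallelism

        // Round-robin the actions into k queues.
        var queues = new List<string>[k];
        for (int q = 0; q < k; q++) queues[q] = new List<string>();
        for (int i = 0; i < actions.Count; i++) queues[i % k].Add(actions[i]);

        var xml = new StringBuilder();
        xml.AppendLine("<fork name=\"fork-queues\">");
        foreach (var queue in queues.Where(q => q.Count > 0))
            xml.AppendLine("  <path start=\"" + queue[0] + "\"/>");
        xml.AppendLine("</fork>");

        // Each queue becomes a sequential chain that ends at the common join.
        foreach (var queue in queues)
            for (int i = 0; i < queue.Count; i++)
            {
                string next = i + 1 < queue.Count ? queue[i + 1] : "join-queues";
                xml.AppendLine("<action name=\"" + queue[i] + "\">");
                xml.AppendLine("  <!-- sqoop/shell/hive action body goes here -->");
                xml.AppendLine("  <ok to=\"" + next + "\"/><error to=\"fail\"/>");
                xml.AppendLine("</action>");
            }
        xml.AppendLine("<join name=\"join-queues\" to=\"end\"/>");
        Console.WriteLine(xml);
    }
}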
Hope this helps

Can a WinRT background task be long-lived if within CPU and Network limits?

Microsoft's documentation states:
Background tasks are meant to be short-lived tasks that do not consume a lot of resources.
It also says:
Each app on the lock screen receives 2 seconds of CPU time every 15 minutes, which can be used by all of the background tasks of the app. At the end of 15 minutes, each app on the lock screen receives another 2 seconds of CPU time for use by its background tasks.
I need to run a background task every two minutes to update my live tile.
My app is a lock-screen app.
The computation is within the CPU and network usage constraints.
Can I create a permanent background task (e.g. something which polls a web service, pulls information, waits, and loops) to create a OneShot TimeTrigger every two minutes, or is there a better way of doing this?
My concern with the background task option is whether the runtime would deem the task inactive while it was sleeping and close it, or whether something else applies, like a limit on the number of times a live tile can be updated within 15 minutes...
Yes, if by long-lived you mean under 25 minutes.
Time triggers cannot execute more frequently than every 15 minutes. Creating a OneShot trigger that executes in 2 minutes, though, is an interesting idea and should work. Yes, background tasks can register other background tasks to keep the chain going. Should the user's machine be off when one is due to execute, it will be queued to run later.
Having said that, updating your tile that frequently from a background task is not a wise solution, because it is unreliable. Background tasks can be disabled, for one. And running every two minutes means roughly seven executions sharing the 2 seconds of CPU per 15-minute window, so you are going to exceed your quota. Try using a scheduled tile instead.
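For the scheduled-tile route, here is a minimal sketch (the tile template and text payload are placeholders, and the class/method names are made up): it pre-schedules one tile update every two minutes for the next ~15 minutes, with no background task involved, and can be re-run periodically to keep the schedule topped up:
using System;
using Windows.Data.Xml.Dom;
using Windows.UI.Notifications;

static class TileScheduler
{
    public static void ScheduleNextQuarterHour()
    {
        var updater = TileUpdateManager.CreateTileUpdaterForApplication();
        for (int i = 1; i <= 7; i++)
        {
            // Placeholder payload; substitute your own template and content.
            var xml = new XmlDocument();
            xml.LoadXml("<tile><visual><binding template='TileSquareText04'>" +
                        "<text id='1'>Update " + i + "</text></binding></visual></tile>");
            // Deliver one update every two minutes.
            updater.AddToSchedule(new ScheduledTileNotification(
                xml, DateTimeOffset.Now.AddMinutes(2 * i)));
        }
    }
}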

Rails processing background jobs in real-time

I use the hirefire gem with Delayed::Job 3 on the Heroku Cedar stack, and it works pretty well in terms of hiring/firing workers, but the performance of job execution is terrible: firing off a background job and seeing the results in the UI takes about 5-8 seconds locally and about 25-30 seconds (!) on Heroku.
The processing time of the jobs themselves is about the same locally and deployed, but hiring workers (scaling up, starting, ...) seems to take a lot of time(?).
Is that a common issue? Is there a solution (rake tasks, etc.)?
Thanks a lot.
Best, Phil
It's down to the fact that your worker isn't running all the time but is spun up for each individual job; the lag is the code start-up time.
If you run a full-time worker dyno instead, the jobs should be processed almost instantaneously.