Hazelcast cluster member crash results in losing all scheduled tasks

We are running 4 instances of our Java application in a Hazelcast cluster. We scheduled around 2000 tasks using the scheduled executor service's schedule method, and Hazelcast partitioned these 2000 tasks across the 4 instances. When one of the cluster members crashes for some reason, all the tasks assigned to the partitions owned by the crashed node are lost, while the other 3 cluster members complete their assigned tasks.
How can we overcome this problem and avoid losing those tasks?

Try using the Durable Executor.
It is probably also a good idea to find out why the process crashed in the first place.
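As a minimal sketch (assuming Hazelcast 3.8 or later; the executor name "tasks" and the durability value are illustrative), tasks submitted through a durable executor are backed up to other members, so a backup member can re-run them if the owner crashes. Note that the durable executor exposes submit-style methods rather than the delay-based schedule methods:

    import java.io.Serializable;
    import java.util.concurrent.Callable;

    import com.hazelcast.config.Config;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.durableexecutor.DurableExecutorService;

    public class DurableTaskExample {

        // Tasks must be serializable so Hazelcast can replicate them to backup members.
        static class MyTask implements Callable<String>, Serializable {
            @Override
            public String call() {
                return "done";
            }
        }

        public static void main(String[] args) {
            Config config = new Config();
            // Keep 2 backup copies of each submitted task so it survives a member crash.
            config.getDurableExecutorConfig("tasks").setDurability(2);

            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

            // Submit through the durable executor; if the owning member dies,
            // a member holding a backup re-executes the task.
            DurableExecutorService executor = hz.getDurableExecutorService("tasks");
            executor.submit(new MyTask());
        }
    }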

Related

Multiple service instances using Hangfire (shared tasks/objects), is it possible?

I need to run multiple instances of the same service, with the same database, for redundancy reasons.
I found some questions about "Hangfire multiple instances", but for a different purpose than mine: usually about running multiple instances for different tasks on the same database, or similar to this.
I need to know whether there are concurrency problems when 2 or more instances of Hangfire use the same database (we want to use MongoDB), and whether this is the right way to make the service resilient.
The goal is to have one instance take over all the jobs when another instance goes down.
Any suggestion for covering this scenario is welcome.
In our environment, we have a replica set used by about 10 Hangfire servers. If multiple Hangfire servers service the same queue, they share the load: whichever Hangfire server checks the queue first picks up the job and continues. If you remove all but 1 server, the jobs will still be processed (as long as there are enough workers; otherwise they remain queued until a worker is available).
To answer your question: yes, you can have 2 or more Hangfire servers using the same MongoDB. MongoDB provides multi-threading support, so it is safe to have several servers accessing the same database backend. If you have two servers, both will be active, and if one instance goes offline, the other instance (based on the queues) will continue to process the jobs in the queue.
Keep in mind that Hangfire servers process jobs from specific queues. If both servers are part of the same queue, you are load balancing the jobs between the two servers. If they are part of different queues, you get the scenario you read about, where each Hangfire instance processes different jobs (because they are part of different queues).
Read about configuring Job Queues here

Is there a way to reuse a single running databricks cluster in multiple mapping data flows

Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use that same running cluster in all of the data flows, instead of letting each data flow instance spin up its own cluster, which takes around 6 minutes of setup per cluster?
Yes. Set the TTL in the Azure Integration Runtime under "Data Flow Properties" to an amount of time that covers the gap between data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engine (Databricks in this case), so you have to kick off a dummy data flow to warm it up.
That cluster start-up will take 5-6 minutes. But now, if you use that same Azure IR in your subsequent activities, as long as they are scheduled to execute within that TTL window, ADF can grab the existing VM resources to spin up the Spark clusters and marshal your data flow definition to the Spark job execution.
End-to-end that process should now take just 2 mins.

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some with multiple running in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only allocated enough memory and cores to this pool for Oozie to start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate these spikes, but that would be a massive waste (hundreds of cores and GBs of memory that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will run behind schedule sometimes. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into a single coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You can then distribute your actions across K queues in your workflow, which ensures that you never have more than K actions running at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course, this only works for actions that can run in parallel, i.e. with no data dependencies etc.).
Hope this helps.

Can we add more Amazon Elastic MapReduce instances to an existing Amazon Elastic MapReduce job flow?

I am new to Amazon services and facing some issues.
Suppose I am running a job flow on Amazon Elastic MapReduce with a total of 3 instances. While running my job flow I find that the job is taking too long to execute. In that case I need to add more instances to it, so that the instance count increases and the job executes faster.
My question is: how do I add instances to an existing job flow? Terminating the existing instances and creating new ones with a higher count is time consuming.
Is there any way to do it? If yes, please suggest how.
I am doing all of this through the CLI, so please share the answers with commands as well as the GUI steps in the AWS Management Console.
Thanks.
Yes, you can do this with the command line tool.
To add more instances to the CORE group:
elastic-mapreduce --modify-instance-group CORE --instance-count 40
To create a TASK group (no datanodes) with 40 instances:
elastic-mapreduce --add-instance-group TASK --instance-count 40 --instance-type c1.medium
It's important to note that CORE instance-group instances cannot be reduced, since they participate as data nodes; they can only be increased.
TASK instances only do processing and can be both increased and reduced.

WebLogic 10 TimerManager: avoiding propagation of security context to scheduled tasks

We are using WebLogic 10, and I am using commonj's TimerManager (which is part of WebLogic) to schedule a task. Everything is fine, but I have one problem: the security context of the thread that scheduled the TimerListener task is somehow stored in the TimerListener task and is used for the work done in the TimerListener task, and this is causing a problem for me. Can anyone point me to how to avoid propagating the security context from the scheduling thread to the scheduled tasks?
This is way late, but anyway: one way to avoid propagating the context is to use unmanaged threads, i.e. spawn threads without commonj. Of course, this throws the baby out with the bathwater.
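For illustration only, here is a minimal sketch of the unmanaged approach using a plain JDK ScheduledExecutorService instead of the commonj TimerManager (the class and method names below are made up for the example). Because these threads are created outside the container, WebLogic does not attach the scheduling thread's security context to them:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class UnmanagedScheduler {

        // A plain JDK scheduler: its threads are created outside the container,
        // so no WebLogic security context is propagated to them.
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public void scheduleTask(Runnable task, long delayMillis) {
            scheduler.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
        }

        public void shutdown() {
            scheduler.shutdown();
        }
    }

The trade-off is exactly the one noted above: you lose the container-managed thread pool and any context propagation you do want, so this is only suitable where that is acceptable.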