Python APScheduler: stop jobs before starting a new one - apscheduler

I need to start a job every 30 minutes, but before a new run starts I want the previous run of the same job to be terminated. This is to make sure the job always fetches the newest data file, which is constantly being updated.
Right now I'm using the BlockingScheduler paired with my own condition to stop the job (stop the job if it has processed 1k data, etc.). I was wondering if APScheduler natively supports this "only one instance at a time, and stop the old one before the new one" behavior.
I've read the docs, but I think the closest is still the default behavior, which equals max_instances=1; this just prevents new runs from firing before the old run finishes, which is not what I'm looking for.
Any help is appreciated. Thanks!

After further research I came to the conclusion that this is not supported natively in APScheduler. But, inspired by
Get number of active instances for BackgroundScheduler jobs
, I modified that answer into a working way of detecting the number of currently running instances of the same job. So when you have an infinite loop/long task executing and you want the new instance to replace the old one, you can add something like
if scheduler._executors['default']._instances['set_an_id_you_like'] > 1:
    # if multiple instances are running, break the loop / return
    return
and this is what it should look like when you start the scheduler:
scheduler = BlockingScheduler(timezone='Asia/Taipei')
scheduler.add_job(main, 'cron', minute='*/30', max_instances=3,
                  next_run_time=datetime.now(), id='set_an_id_you_like')
scheduler.start()
But, like the answer in the link says, please refrain from doing this if someday there is a native way to do it, since it relies on private internals. I'm currently using APScheduler 3.10.
This method at least doesn't rely on calling time.time() or datetime.datetime.now() in every iteration to check how much time has passed since the loop started. In my case the job runs every 30 minutes and I didn't want to compute a time delta on each iteration, so this is what I went with. I hope this hacky method helps someone who, like me, googled for a few days before ending up here.
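Putting the pieces together, a minimal self-contained sketch of the pattern could look like the following. fetch_newest_data() and process() are hypothetical placeholders for your own logic, and the _executors/_instances lookup is the same private-API trick as above, so treat it as a workaround rather than a supported API.

from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

JOB_ID = 'set_an_id_you_like'
scheduler = BlockingScheduler(timezone='Asia/Taipei')

def main():
    # long-running loop over the constantly updated data file
    for chunk in fetch_newest_data():   # placeholder: your own fetch logic
        # _instances is a private attribute of the executor, so this may
        # break between APScheduler releases (it works on 3.10)
        if scheduler._executors['default']._instances[JOB_ID] > 1:
            # a newer run has started: let it take over and end this one
            return
        process(chunk)                  # placeholder: your own processing

scheduler.add_job(main, 'cron', minute='*/30', max_instances=3,
                  next_run_time=datetime.now(), id=JOB_ID)
scheduler.start()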

Related

Airflow Operator BigQueryTablePartitionExistenceSensor Question

I'm trying to use the BigQueryTablePartitionExistenceSensor operator in Airflow and I was wondering whether this operator checks that the partition is fully loaded, or whether it can potentially mark success even if the data isn't complete yet.
For example, if my table is partitioned on DAY and the load for 20220420 has started but isn't complete, would this sensor trigger? Or would it wait until that load step has been completed before marking success?
Thanks
The operator will not wait until your data has loaded; it just checks for the existence of the partition value at that moment in time. So if a single row gets inserted into that partition, this sensor would return True. See the sensor code that gets called by this operator.
An idea I've used in the past for similar problems has been to use a sentinel label on the partitioned table to mark a load as "in-progress" or "done".
As has already been answered, it does not await anything except the existence of the partition.
If your data is streamed into partitions, and you have ordered delivery, you can probably add a sensor for the next-day partition — on the assumption that the previous day is complete when events have started streaming into the next.
If the load is managed by the same Airflow instance, I'd suggest using an ExternalTaskSensor on the load job. If not, you might be able to use the more generic SqlSensor and run a custom SQL query on metadata tables to determine whether a partition is complete; perhaps the load job can add a label or marker that you then query for.
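As a rough illustration of the SqlSensor idea (recent Airflow 2.x, common SQL provider), the sketch below assumes a hypothetical audit table my_project.etl_audit.load_status that the load job writes a 'done' row into once a partition is complete, and a connection named bigquery_default; those names are placeholders for your own setup, not something Airflow or BigQuery provide.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.sensors.sql import SqlSensor

with DAG(
    dag_id="wait_for_partition",
    start_date=datetime(2022, 4, 20),
    schedule="@daily",
    catchup=False,
) as dag:
    # Poke until the load job has written a "done" marker for today's
    # partition. The audit table and its columns are hypothetical; the
    # load job is responsible for inserting that row when the load ends.
    wait_for_load = SqlSensor(
        task_id="wait_for_load_done",
        conn_id="bigquery_default",
        sql="""
            SELECT COUNT(*)
            FROM `my_project.etl_audit.load_status`
            WHERE partition_day = '{{ ds_nodash }}'
              AND status = 'done'
        """,
        poke_interval=300,    # check every 5 minutes
        timeout=6 * 60 * 60,  # give up after 6 hours
        mode="reschedule",    # release the worker slot between pokes
    )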

Check the current position in a Redis list of some list element

I have a simple job queue on Redis where new jobs are pushed with RPUSH and consumed with BLPOP. The jobs are stringified JSON objects that have an id field among other things (the json string is parsed by the workers).
Each job takes some time to do, so there can be a meaningful wait time. I'd like to be able to find a job's current position in the queue, so that I can give an update to whatever is waiting on that job. That is, be able to do something like "your current position is 300... 250... 200... 100... 10... your job is now being processed".
It can be assumed that the list may grow long but never too long, i.e. possibly 1000 entries but not 1 million.
After looking through the docs a bit, it seems like this is maybe easier said than done. A possible naive solution seems to be to just loop through the list until the element is found. Are there any performance issues with calling LINDEX a couple hundred times at a time like that?
Would appreciate any suggestions on other ways this can be done (or confirmation that LINDEX is the only way). The whole structure (even the usage of a list, or addition of some helper map/list) can be changed if needed, only requirement is that it run on Redis.
You can use a sorted set and a counter to solve the problem more elegantly.
Push a job
Call INCR counter to get a new sequence number.
Use that number as the score of the job, and call ZADD jobs counter job-name.
Pop a job
Call BZPOPMIN jobs to get the first unprocessed job (the one with the lowest score).
Get job position
Call ZRANK jobs job-name to get the rank of the job, i.e. its current position in the queue.
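A rough redis-py sketch of that flow might look like the following. The key names (jobs:counter, jobs:queue, jobs:data) are placeholders, and the JSON payload is stored in a hash keyed by the job's id so that ZRANK can be called with just the id instead of the full serialized job.

import json

import redis

r = redis.Redis(decode_responses=True)

def push_job(job):
    # INCR gives a monotonically increasing sequence number; using it as
    # the score keeps the sorted set in FIFO order.
    score = r.incr("jobs:counter")
    r.hset("jobs:data", job["id"], json.dumps(job))  # payload keyed by job id
    r.zadd("jobs:queue", {job["id"]: score})         # the id is the set member

def pop_job():
    # BZPOPMIN blocks until a member is available and returns the one with
    # the lowest score, i.e. the oldest job. The reply is (key, member, score).
    _key, job_id, _score = r.bzpopmin("jobs:queue", timeout=0)
    payload = r.hget("jobs:data", job_id)
    r.hdel("jobs:data", job_id)
    return json.loads(payload)

def job_position(job_id):
    # ZRANK is O(log N) and returns the 0-based position in the queue,
    # or None if the job has already been popped.
    return r.zrank("jobs:queue", job_id)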

Is it possible to use scheduleAtFixedRate to trigger a function every first of the month?

I am new to Kotlin and I am wondering if I can do the following...
I want to call a method on the first of each month. I found this and saw a couple of examples like this:
timer.schedule(1000) {
    println("hello world!")
}
I am wondering if it is possible to use (instead of a fixed delay) a calendar day, like the first of the month?
There's no built-in way to do this.
If the exact time of day doesn't matter*, then one approach is to schedule a task to fire every 24 hours, and have it check whether the current time is the first day of the month, and if so, perform the task.
(* It will drift slightly when summer time starts or ends, or leap-seconds get added, but that may not be significant.)
A more powerful (but more complex) approach is to set the timer to go off once, at the appropriate time on the 1st of next month.  Then, after performing the task, it could re-schedule itself for the 1st of the following month.  (You'd need to take care that it always did so, even if the task threw an exception.)
You could put all this into a timer class of your own, to separate it from the business logic you want to run.
For examples in Java (which will translate directly to Kotlin/JVM), see the answers to these questions.

RecurringJob or BackgroundJob + loop for infinite task

I'm considering using Hangfire to schedule an update task. I want this task to be executed continuously: as soon as the task finishes, I want to execute it again.
I don't know whether the best method is to use a RecurringJob or to use a loop with a BackgroundJob.
What would you recommend? Are there any other options?
You can use a RecurringJob, which triggers every x minutes or x hours (based on your requirement - you can set a CRON expression) and will run the task/work after every such interval. You need to use this in conjunction with the DisableConcurrentExecution attribute so that multiple instances of the same task are not triggered; this attribute also makes sure that the second instance is only processed once the first one has completed.
Alternatively, you can use BackgroundJob to enqueue a task/job, but this processes the job only once, so you would need to write some code that checks the job's status and re-enqueues the same job once the first run has completed.
I would suggest the best way is to use RecurringJob.AddOrUpdate in conjunction with DisableConcurrentExecution.

Understanding Domain Class in Project Job Scheduling

I am new to OptaPlanner, and right now I am focusing on trying to understand the project job scheduling example. I am trying to run this example using the sample data from the OptaPlanner manual, like in the picture below:
I have some questions about the domain classes in this example:
What is the difference between GlobalResource and LocalResource? In the example, all the resources are GlobalResources, right? Then what is the use of LocalResource?
There are 3 JobTypes: SOURCE, STANDARD and SINK. What is the meaning of each of them? Does SOURCE mean the job should be the first to start, before the others? Does STANDARD mean it should run after its predecessor jobs have finished, but not after the SINK job? Does SINK mean it is the last job, to be done after all other jobs have finished?
What is the meaning of the releaseDate and criticalPathDuration properties in the Project class? If we relate them to the picture above, what are their values for projects Book1 and Book2?
What is the meaning of requirement in ResourceRequirement?
I would be really thankful if someone could help me create the XML sample data like in the OptaPlanner distribution, as it would help me understand this example much faster. Thanks & Regards.
A LocalResource belongs to a specific Project; a GlobalResource is shared between projects.
So a LocalResource only has to worry about being used by other jobs in the same Project, while a GlobalResource has to worry about tasks from all projects.
That's an implementation trick. The source and sink jobs are basically dummies. Because a project might start with multiple jobs in parallel, a SOURCE job is put in front of them to have a single root. Same for the end: it can end with multiple jobs, so a SINK job is put after them to have a single tail. This makes it easier and faster to determine the makespan, etc.
IIRC, releaseDate is the first date on which we are allowed to start the first job. For example: you have to create a book, but you'll only get the actual final content next Monday, so the releaseDate is next Monday (you can't start any work before that date).
The criticalPathDuration is a theoretical minimum duration (if we can happily ignore resources, IIRC). For example: if job A takes 5 days, job B takes 2 days, and B has to be done after A, then the critical path duration is 7 days. Adding a job C which takes 1 day and can be done in parallel with the others doesn't affect that.
ResourceRequirement is the many-to-many relationship between ExecutionMode and Resource. Remember that an ExecutionMode belongs to a specific Job. For example: doing job A in executionMode A1 requires 1 laborer and 5 days; doing job A in executionMode A2 requires 2 laborers and 3 days.