Execute multiple pyiron jobs with dependencies - pyiron

I have 4 jobs (A, B, C, D), which I want to start using pyiron. All jobs need to run on a remote cluster using SLURM. Some of the jobs need results from other jobs as input.
Ideally, I would like to have a workflow like:
Job A is started by the user.
Jobs B and C start automatically and in parallel (!) as soon as job A is done.
Job D starts automatically as soon as the jobs B and C are finished.
I realize that I could implement this in Jupyter using some if-conditions and the sleep-command.
However, the jobs A, B, and C could run for multiple days and I don't want to keep my Jupyter notebook running for so long.
Is there a more convenient way to realize these job dependencies in pyiron?

I guess the easiest way would be to submit the whole Jupyter notebook to the queue using the script job class:
from pyiron import Project

pr = Project("remote_workflow")  # example project name
job = pr.create.job.ScriptJob("script")
job.script_path = 'workflow.ipynb'
job.server.queue = 'my_queue'
job.server.cores = 32
job.run()
Here workflow.ipynb would be your current notebook, my_queue your SLURM queue for remote submission, and 32 the total number of cores for the allocation.
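Inside workflow.ipynb itself the A -> (B, C) -> D chain can then be written with ordinary pyiron runs, since the whole notebook is already executing inside the SLURM allocation. A minimal sketch, assuming a placeholder job type (Lammps) plus the non-modal run mode and a pr.wait_for_job helper; check the API of your pyiron version:

from pyiron import Project

pr = Project("abcd_workflow")  # hypothetical project name

# Job A runs first; the default (modal) run blocks until it is finished.
job_a = pr.create.job.Lammps("job_a")  # placeholder job type
job_a.run()

# Jobs B and C take A's output as input (details depend on your job types)
# and are started non-modally so they run in parallel.
job_b = pr.create.job.Lammps("job_b")
job_b.server.run_mode.non_modal = True
job_b.run()

job_c = pr.create.job.Lammps("job_c")
job_c.server.run_mode.non_modal = True
job_c.run()

# Block until both B and C are finished before starting D.
pr.wait_for_job(job_b)
pr.wait_for_job(job_c)

job_d = pr.create.job.Lammps("job_d")
job_d.run()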

Related

Avoid running same job by two EC2 instances

I am using APScheduler with its decorator API to run jobs at certain intervals. The problem is that when the code below is deployed on two EC2 instances, the same job runs twice at the same time, differing only by milliseconds.
My question is: how do I avoid the same job being run by two EC2 instances at the same time, or do I need a different code design pattern in this case? I want this job to run only once, on either one of the servers.
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    print("Hello World")

sched.start()
If you can share any locking mechanism examples, that would be appreciated.
You can use the AWS SDK or AWS CLI to look up the instance ID the code is running on, and then guard the job body with a check like:
if instance_id == "your instance id":
    # write your code here
The scheduler will still fire on each instance you have, but the job body will only be executed from that specific instance.
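A minimal sketch of that guard, reusing the APScheduler job from the question and the EC2 instance-metadata endpoint for the lookup (the LEADER_INSTANCE_ID value is a placeholder, not something from the original answer):

import urllib.request
from apscheduler.schedulers.blocking import BlockingScheduler

LEADER_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the instance allowed to run the job

def current_instance_id():
    # EC2 instance-metadata endpoint (IMDSv1 style, kept short for the example)
    url = "http://169.254.169.254/latest/meta-data/instance-id"
    return urllib.request.urlopen(url, timeout=2).read().decode()

sched = BlockingScheduler()

@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    if current_instance_id() != LEADER_INSTANCE_ID:
        return  # not the designated instance, skip this run
    print("Hello World")

sched.start()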

Python multiprocessing between Ubuntu and CentOS

I am trying to run some parallel jobs through Python multiprocessing. Here is an example code:
import multiprocessing as mp
import os
def f(name, total):
    print('process {:d} starting doing business in {:d}'.format(name, total))
    # there will be some unix command to run external program

if __name__ == '__main__':
    total_task_num = 100
    mp.Queue()  # note: this queue is created but never used
    all_processes = []
    for i in range(total_task_num):
        p = mp.Process(target=f, args=(i, total_task_num))
        all_processes.append(p)
        p.start()

    for p in all_processes:
        p.join()
I also set export OMP_NUM_THREADS=1 to make sure that each process uses only one thread.
My desktop has 20 cores. For the 100 jobs I want 5 cycles, so that each core runs one job at a time (20*5=100).
I ran the same code on CentOS and Ubuntu. It seems that CentOS automatically splits the jobs: only 20 jobs run in parallel at any one time. Ubuntu, however, starts all 100 jobs simultaneously, so each core is occupied by 5 jobs. This significantly increases the total run time due to the high workload.
I wonder if there is an elegant solution to teach Ubuntu to run only 1 job per core.
To pin a process to a specific CPU on Linux you can use the taskset command. Based on "taskset -p [mask] [pid]" you can build a loop that assigns each process to a specific core.
Python also exposes affinity control via sched_setaffinity, which can be used to confine a process to specific cores: "os.sched_setaffinity(pid, mask)", where pid is the process id of the process (0 means the current process) and mask is the set of CPUs the process shall be confined to.
There are also other tools in Python, such as https://pypi.org/project/affinity/, that can be explored.
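A minimal sketch of the os.sched_setaffinity approach applied to the example above (Linux only; pinning worker i to core i % 20 is just one possible choice). Note that pinning alone does not limit how many processes run at once, it only controls which core each one may use; if the goal is really "at most 20 jobs at a time", a multiprocessing.Pool(processes=20) caps the concurrency directly.

import multiprocessing as mp
import os

NUM_CORES = 20  # assumption: matches the 20-core desktop above

def f(name, total):
    # Confine this worker to a single core (pid 0 = the current process).
    core = name % NUM_CORES
    os.sched_setaffinity(0, {core})
    print('process {:d} of {:d} pinned to core {:d}'.format(name, total, core))
    # external program call would go here

if __name__ == '__main__':
    total_task_num = 100
    all_processes = []
    for i in range(total_task_num):
        p = mp.Process(target=f, args=(i, total_task_num))
        all_processes.append(p)
        p.start()
    for p in all_processes:
        p.join()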

Scrapyd: How to cancel all jobs with one command?

I am running over 40 spiders which until now were scheduled via cron and issued via scrapy crawl. For several reasons I am now switching to scrapyd; one of them is to be able to see which jobs are running in case I need to do maintenance and reboot - so I can cancel a job.
Is it possible to cancel multiple jobs at once? I noticed that multiple jobs might be running at once, with many more waiting in the queue with status "pending". Stopping the crawl might therefore require multiple calls of the cancel.json endpoint.
How to stop (or better pause) all jobs?
The scrapyd API (as of v1.3.0) does not support pausing. It can only stop one job per call, so you have to loop over the jobs yourself.
I took @kolas's script from this question and updated it to work with Python 3.
import json, os

PROJECT_NAME = "MY_PROJECT"

# Dump the job list for the project to a file.
os.system('curl http://localhost:6800/listjobs.json?project={} > kill_job.text'.format(PROJECT_NAME))

with open('kill_job.text', 'r') as f:
    a = json.loads(f.readlines()[0])

# Cancel every pending job, one cancel.json call per job.
pending_jobs = a['pending']
for job in pending_jobs:
    job_id = job['id']
    kill = 'curl http://localhost:6800/cancel.json -d project={} -d job={}'.format(PROJECT_NAME, job_id)
    os.system(kill)
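A variant of the same loop that also cancels running jobs and talks to scrapyd with the requests library instead of shelling out to curl (assuming requests is installed; host and port are the scrapyd defaults used above):

import requests

PROJECT_NAME = "MY_PROJECT"
BASE = "http://localhost:6800"

jobs = requests.get(BASE + "/listjobs.json", params={"project": PROJECT_NAME}).json()

# Cancel running jobs first, then the pending ones.
for job in jobs.get("running", []) + jobs.get("pending", []):
    requests.post(BASE + "/cancel.json", data={"project": PROJECT_NAME, "job": job["id"]})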

Long running jobs redelivering after broker visibility timeout with celery and redis

I am using celery 4.3 + Redis + flower. I have a few long-running jobs with acks_late=True and task_reject_on_worker_lost=True. I am using a celery group to run child jobs in parallel and append their results for use in the parent job.
In this scenario a few of my jobs run for more than an hour, and after every hour the same child jobs are redelivered to the worker again.
The sample jobs are shown below.
from celery import group

@app.task(queue='q1', bind=True, acks_late=True, task_reject_on_worker_lost=True, max_retries=3)
def job_1(self):
    do_something()
    task_group = group(job_2.s(batch) for batch in range(0, len([1, 2, 3, 4, 5, 6]), 3))
    result_group = task_group.apply_async()

@app.task(queue='q1', bind=True, acks_late=True, task_reject_on_worker_lost=True, max_retries=3)
def job_2(self, batch):
    do_something()
    return result
The above job_2 runs for more than an hour, and after one hour it is redelivered to the worker again.
My celery setup and config are shown below:
c = Celery(app.import_name,
           backend=app.config['CELERY_RESULT_BACKEND'],
           broker=app.config['CELERY_BROKER_URL'])
config.py:
CELERY_BROKER_URL = os.environ['CELERY_BROKER_URL']
CELERY_RESULT_BACKEND = os.environ['CELERY_RESULT_BACKEND']
CELERY_BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 36000}
After hitting the redelivery issue I tried increasing the visibility timeout to 10 hours in the configuration as shown above, but it does not seem to work.
Please help with this issue and let me know if there is any solution.
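For reference, a minimal sketch of how the Redis visibility timeout is usually set directly on the Celery app configuration in 4.x, using the lowercase broker_transport_options setting (the broker URL is a placeholder; whether this alone fixes the redelivery depends on the rest of the setup):

from celery import Celery

c = Celery('app',
           broker='redis://localhost:6379/0',    # placeholder Redis broker
           backend='redis://localhost:6379/0')

# Visibility timeout larger than the longest expected task runtime.
c.conf.update(broker_transport_options={'visibility_timeout': 43200})  # 12 hours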

Celery task schedule (Celery, Django and RabbitMQ)

I want to have a task that executes every 5 minutes, but waits for the last execution to finish before starting to count those 5 minutes. (This way I can also be sure that there is only one task running.) The easiest way I found is to run the django application's manage.py shell and run this:
while True:
    result = task.delay()
    result.wait()
    sleep(5)
but for each task that I want to execute this way I have to run its own shell. Is there an easier way to do it? Maybe some kind of custom django celery scheduler?
Wow it's amazing how no one understands this person's question. They are asking not about running tasks periodically, but how to ensure that Celery does not run two instances of the same task simultaneously. I don't think there's a way to do this with Celery directly, but what you can do is have one of the tasks acquire a lock right when it begins, and if it fails, to try again in a few seconds (using retry). The task would release the lock right before it returns; you can make the lock auto-expire after a few minutes if it ever crashes or times out.
For the lock you can probably just use your database or something like Redis.
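A minimal sketch of that locking pattern using Django's cache as the lock store (the lock key, timeout, retry delay, and do_work placeholder are assumptions; a shared backend such as Redis or Memcached works the same way):

from celery import shared_task
from django.core.cache import cache

LOCK_KEY = "my-task-lock"   # hypothetical lock name
LOCK_TTL = 10 * 60          # auto-expire after 10 minutes in case of a crash

@shared_task(bind=True, max_retries=None)
def my_exclusive_task(self):
    # cache.add is atomic on shared backends: it only sets the key if it does not exist yet.
    if not cache.add(LOCK_KEY, "locked", LOCK_TTL):
        # Another instance holds the lock; try again in a few seconds.
        raise self.retry(countdown=10)
    try:
        do_work()               # placeholder for the real task body
    finally:
        cache.delete(LOCK_KEY)  # release the lock right before returning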
You may be interested in this simpler method that requires no changes to a celery conf.
@celery.decorators.periodic_task(run_every=datetime.timedelta(minutes=5))
def my_task():
    # Insert fun-stuff here
    pass
All you need to do is specify in the celery conf which task you want to run periodically and with which interval.
Example: Run the tasks.add task every 30 seconds
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    "runs-every-30-seconds": {
        "task": "tasks.add",
        "schedule": timedelta(seconds=30),
        "args": (16, 16)
    },
}
Remember that you have to run celery in beat mode with the -B option:
python manage.py celeryd -B
You can also use the crontab style instead of time interval, checkout this:
http://ask.github.com/celery/userguide/periodic-tasks.html
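For example, a crontab-based entry would look something like this (a sketch along the lines of the docs' examples, reusing the tasks.add placeholder from above):

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "add-every-monday-morning": {
        "task": "tasks.add",
        "schedule": crontab(hour=7, minute=30, day_of_week=1),
        "args": (16, 16),
    },
}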
If you are using django-celery, remember that you can also use the django db as the scheduler for periodic tasks; that way you can easily add new periodic tasks through the django-celery admin panel.
To do that you need to set the celerybeat scheduler in settings.py like this:
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
To expand on @MauroRocco's post, from http://docs.celeryproject.org/en/v2.2.4/userguide/periodic-tasks.html:
Using a timedelta for the schedule means the task will be executed 30 seconds after celerybeat starts, and then every 30 seconds after the last run. A crontab like schedule also exists, see the section on Crontab schedules.
So this will indeed achieve the goal you want.
Because celery.decorators is deprecated, you can use the periodic_task decorator like this:
from celery.task.base import periodic_task
from datetime import timedelta

@periodic_task(run_every=timedelta(seconds=5))
def my_background_process():
    # insert code
    pass
Add that task to a separate queue, and then use a separate worker for that queue with the concurrency option set to 1.
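A minimal sketch of that setup (the queue name and task path are placeholders; the worker command goes in its own shell and is shown here as a comment):

# celery config: route the task to its own queue.
CELERY_ROUTES = {
    "myapp.tasks.my_task": {"queue": "exclusive"},  # hypothetical task path
}

# Then start a dedicated worker that only consumes that queue with a single process:
#   celery -A proj worker -Q exclusive --concurrency=1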