Celery: Task Singleton? - singleton

I have a task that I need to run asynchronously from the web page that triggered it. This task runs rather long, and as the web page could be getting a lot of these requests, I'd like celery to only run one instance of this task at a given time.
Is there any way I can do this in Celery natively? I'm tempted to create a database table that holds this state for all the tasks to communicate with, but it feels hacky.

You can probably create a dedicated worker for that task, configured with CELERYD_CONCURRENCY=1; all tasks on that worker will then run one at a time.

You can use memcache/redis for that.
There is an example on the celery official site - http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html
If you prefer Redis, here is a Django implementation that you can easily adapt to your needs:
from celery import Task
from celery.utils.log import get_task_logger
from django.core.cache import cache

logger = get_task_logger(__name__)

class SingletonTask(Task):
    def __call__(self, *args, **kwargs):
        # cache.lock() is not part of Django's generic cache API; this
        # assumes a Redis cache backend such as django-redis
        lock = cache.lock(self.name)
        if not lock.acquire(blocking=False):
            logger.info("{} failed to lock".format(self.name))
            return
        try:
            return super(SingletonTask, self).__call__(*args, **kwargs)
        finally:
            # Release the lock whether the task succeeded or raised
            lock.release()
And then use it as a base task:
from celery import shared_task

@shared_task(base=SingletonTask)
def test_task():
    from time import sleep
    sleep(10)
This implementation is non-blocking: if the lock is already held, the task logs a message and returns immediately. If you want the next task to wait for the previous one, change blocking=False to blocking=True and add a timeout.


Avoid running same job by two EC2 instances

I am using APScheduler's decorator style to run jobs at fixed intervals. The problem is that when the code below is deployed on two EC2 instances, the same job runs twice at the same time, only milliseconds apart.
My question is: how can I avoid running the same job on two EC2 instances at the same time, or do I need a different design pattern in this case? I want the job to run only once, on either one of the servers.
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    print("Hello World")

# start() blocks, so call it after the jobs are registered
sched.start()
If you can share any examples of a locking mechanism, I would appreciate it.
You can use the AWS SDK or the AWS CLI to look up the ID of the instance the code is running on, and then guard the job with a condition:
if instance_id == "your instance id":
    # write your code here
The cron job will still fire on every instance you have, but your code will execute only on that specific instance.
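The instance-ID check described above can be sketched as a small wrapper. This is only an illustration under assumptions: make_guarded_job, get_instance_id and designated_id are hypothetical names, and how the current instance ID is obtained (for example from the EC2 instance metadata service or the AWS SDK) is left to the caller.

```python
def make_guarded_job(job, get_instance_id, designated_id):
    """Wrap job so that it only runs on the designated instance.

    get_instance_id is any callable returning the ID of the machine the
    code is running on (hypothetical; supply your own lookup).
    """
    def guarded():
        if get_instance_id() != designated_id:
            # Every instance's scheduler fires, but the others do nothing.
            return None
        return job()
    return guarded


def job_function():
    return "Hello World"


# The lambda stands in for a real instance-ID lookup.
guarded_job = make_guarded_job(job_function, lambda: "i-aaa", "i-aaa")
print(guarded_job())
```

Note that pinning the job to one instance makes that instance a single point of failure: if it goes down, the job stops running. A lock in a store shared by both instances (a database row or a Redis key, for example) avoids that, at the cost of an extra dependency.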

Why is Python object id different after the Process starts but the pid remains the same?

import time
from multiprocessing import Process, freeze_support

class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """
    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__()
        self.daemon = True
        self.upload_size = 0
        self.upload_queue = set()
        self.pending_uploads = set()
        self.completed_uploads = set()
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return True
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(10)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}, ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(7)

OUTPUT
Initial ID: 2894698869712
Object ID: 2894698869712
Process Info: {'STOPPED'}, ID After: 2894698869712
Not Started! Process Info: {'STOPPED'}
STARTING NEW PROCESS...
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
File Upload Queue Empty.
Not Started! Process Info: {'STOPPED'}
File Upload Queue Empty.
Why does the Process object have the same id and attribute values before and after it has started, but a different id when the run method starts?
Initial ID: 2894698869712
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
Process Info: {'STOPPED'}, ID After: 2894698869712
I fixed your indentation, and I also removed everything from your script that was not actually being used. It is now a minimum, reproducible example that anyone can run. In the future, please adhere to the site guidelines, and please proofread your questions. It will save everybody's time and you will get better answers.
I would also like to point out that the question in your title is not at all the same as the question asked in your text. At no point do you retrieve the process ID, which is an operating system value. You are printing out the ID of the object, which is a value that has meaning only within the Python runtime environment.
import time
from multiprocessing import Process
# Removed freeze_support since it was unused

class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """
    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__(daemon=True)
        # The daemon flag is passed above as a keyword-only argument.
        # Assigning self.daemon = True before start() has the same
        # effect, so the separate assignment is redundant:
        # self.daemon = True
        # I removed a bunch of unused variables for this test program
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return  # Removed True return value (it was unused)
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(1.0)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}",
          f"ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(0.7)
Your question is, why is the id of upload_manager the same before and after it is started. Simple answer: because it's the same object. It does not become another object just because you called one of its functions. That would not make any sense.
I suppose you might be wondering why the ID of the FileUploadManager object is different when you print it out from its "run" method. It's the same simple answer: because it's a different object. Your script actually creates two instances of FileUploadManager, although it's not obvious. In Python, each Process has its own memory space. When you start a secondary Process (upload_manager.start()), Python makes a second instance of FileUploadManager to execute in this new Process. The two instances are completely separate and "know" nothing about each other.
You did not say that your script doesn't terminate, but it actually does not. It runs forever, stuck in the loop while 'STARTED' not in upload_manager.status_info. That's because 'STARTED' was added to self.status_info in the secondary Process. That Process is working with a different instance of FileUploadManager. The changes you make there do not get automatically reflected in the first instance, which lives in the main Process. Therefore the first instance of FileUploadManager never changes, and the loop never exits.
This all makes perfect sense once you realize that each Process works with its own separate objects. If you need to pass data from one Process to another, that can be done with Pipes, Queues, Managers and shared variables. That is documented in the Concurrent Execution section of the standard library.
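To illustrate the last point, here is a minimal sketch of one way the parent could learn that the child has started: a multiprocessing.Event instead of a plain set. The started attribute and the start_and_wait helper are names made up for this example, and the short sleep stands in for the real upload loop.

```python
import time
from multiprocessing import Event, Process


class FileUploadManager(Process):
    """Background worker that reports its start-up through a shared Event."""

    def __init__(self):
        super().__init__(daemon=True)
        # Unlike a plain set, an Event is backed by shared OS-level state,
        # so setting it in the child process is visible in the parent.
        self.started = Event()

    def run(self):
        # Executes in the child process, on the child's copy of this object.
        self.started.set()
        time.sleep(0.1)  # stand-in for the real upload loop


def start_and_wait(timeout=10.0):
    """Start the worker and report whether the parent saw it start."""
    manager = FileUploadManager()
    manager.start()
    seen = manager.started.wait(timeout)
    manager.join(timeout)
    return seen


if __name__ == "__main__":
    print(start_and_wait())
```

For richer data than a single flag (a set of statuses, say), a Queue or a Manager would probably be a better fit than an Event.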

Python rq-scheduler: enqueue a failed job after some interval

I am using Python RQ to execute a job in the background. The job calls a third-party REST API and stores the response in the database (see the code below).
@classmethod
def fetch_resource(cls, resource_id):
    import requests
    clsmgr = cls(resource_id)
    clsmgr.__sign_headers()
    res = requests.get(url=f'http://api.demo-resource.com/{resource_id}', headers=clsmgr._headers)
    if not res.ok:
        raise MyThirdPartyAPIException(res)
    ....
The third-party API has a rate limit, e.g. 7 requests/minute. I have created a retry handler to gracefully handle the 429 Too Many Requests HTTP status code and re-queue the job after a minute (the time unit changes based on the rate limit). To re-queue the job after some interval I am using rq-scheduler.
The handler code is below:
from redis import Redis
from rq_scheduler import Scheduler

def retry_failed_job(job, exc_type, exc_value, traceback):
    if isinstance(exc_value, MyThirdPartyAPIException) and exc_value.status_code == 429:
        import datetime as dt
        sch = Scheduler(connection=Redis())
        # sch.enqueue_in(dt.timedelta(seconds=60), job.func_name, *job.args, **job.kwargs)
I am having trouble re-queueing the failed job into the task queue, because I cannot directly call sch.enqueue_in(dt.timedelta(seconds=60), job) in the handler (as per the docs, a job represents the delayed function call). How can I re-queue the job function with all its args and kwargs?
Ah, the following statement does the job:
sch.enqueue_in(dt.timedelta(seconds=60), job.func, *job.args, **job.kwargs)
The question is still open; let me know if anyone has a better approach.
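Putting the pieces together, the handler could be sketched roughly as below. This is an illustration, not a tested integration: make_retry_handler is a hypothetical helper, the scheduler is passed in rather than created inside the handler, and the check uses only the status_code attribute that the question's exception class exposes.

```python
import datetime as dt


def make_retry_handler(scheduler, delay_seconds=60):
    """Build an RQ-style exception handler that re-queues rate-limited jobs.

    scheduler is expected to offer enqueue_in(timedelta, func, *args, **kwargs),
    as rq-scheduler's Scheduler does.
    """
    def retry_failed_job(job, exc_type, exc_value, traceback):
        if getattr(exc_value, "status_code", None) == 429:
            scheduler.enqueue_in(
                dt.timedelta(seconds=delay_seconds),
                job.func, *job.args, **job.kwargs,
            )
            return False  # handled here; do not fall through to later handlers
        return True  # not a rate-limit error; let the next handler deal with it
    return retry_failed_job


# Possible wiring (needs a running Redis, so it is left as a comment):
#   handler = make_retry_handler(Scheduler(connection=Redis()))
#   worker = Worker(queues, exception_handlers=[handler])
```

Passing the scheduler in avoids opening a new Redis connection on every failure and makes the handler easy to exercise with a stand-in object.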

Most elegant way to execute CPU-bound operations in asyncio application?

I am trying to develop part of a system that has the following requirements:
- send health status to a remote server (every X seconds)
- receive requests to execute/cancel CPU-bound job(s) (for example: clone a git repo, compile it using conan, etc.)
I am using socketio.AsyncClient to handle these requirements.
import git
import socketio

class CompileJobHandler(socketio.AsyncClientNamespace):
    def __init__(self, namespace_val):
        super().__init__(namespace_val)
        # some init variables

    async def _clone_git_repo(self, git_repo: str):
        # clone repo and return its instance
        return repo

    async def on_availability_check(self, data):
        # the health status
        await self.emit('availability_check', " all good ")

    async def on_cancel_job(self, data):
        # cancel the current job
        ...

    def _reset_job(self):
        # reset job logics
        ...

    def _reset_to_specific_commit(self, repo: git.Repo, commit_hash: str):
        # reset to specific commit
        ...

    def _compile(self, is_debug):
        # compile logics - might be CPU intensive
        ...

    async def on_execute_job(self, data):
        # **request to execute the job (compile in our case)**
        try:
            repo = await self._clone_git_repo(job_details.git_repo)
            self._reset_to_specific_commit(repo, job_details.commit_hash)
            self._compile(job_details.is_debug)
            await self.emit('execute_job_response',
                            self._prepare_response("SUCCESS", "compile successfully"))
        except Exception as e:
            await self.emit('execute_job_response',
                            self._prepare_response(e.args[0], e.args[1]))
        finally:
            self._reset_job()
The problem with this code is that when an execute_job message arrives, the blocking code stalls the whole asyncio event loop.
To solve this, I used a ProcessPoolExecutor with the asyncio event loop, as shown here: https://stackoverflow.com/questions/49978320/asyncio-run-in-executor-using-processpoolexecutor
With that change, the clone/compile functions run in another process, which almost achieves my goal.
My questions are:
1. How can I design the code that runs in the process more elegantly? (Right now I have some static functions, and I don't like it.) One approach is to keep it as is; another is to create an object (let's call it CompileExecuter), initialize it before starting the process, and then let the process use it.
2. How can I stop the process in the middle of its execution (if I receive an on_cancel_job request)?
3. How can I correctly handle an exception raised by the process?
Other approaches to handling these requirements are welcome.
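For reference, the ProcessPoolExecutor approach mentioned above, together with one way of surfacing a worker exception, could look roughly like this. It is a sketch under assumptions: compile_job is a hypothetical stand-in for the blocking clone/compile work, and run_job and main are illustrative names, not part of the original handler.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def compile_job(is_debug):
    """Hypothetical stand-in for the blocking clone/compile work."""
    if is_debug is None:
        raise ValueError("is_debug must be set")
    return "debug build" if is_debug else "release build"


async def run_job(executor, is_debug):
    """Run the blocking job in a worker process without stalling the loop."""
    loop = asyncio.get_running_loop()
    try:
        result = await loop.run_in_executor(executor, compile_job, is_debug)
        return ("SUCCESS", result)
    except ValueError as exc:
        # An exception raised in the worker process is re-raised here,
        # in the coroutine that awaits the result.
        return ("FAILED", str(exc))


async def main():
    with ProcessPoolExecutor(max_workers=1) as executor:
        return [await run_job(executor, True), await run_job(executor, None)]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

On cancellation: cancelling the awaiting task does not stop a call that is already executing in a pool worker. If a running job must be killed on on_cancel_job, one option is to run it in a dedicated multiprocessing.Process and call terminate() on it, at the cost of handling result passing yourself.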

Celery task schedule (Celery, Django and RabbitMQ)

I want to have a task that will execute every 5 minutes, but it will wait for last execution to finish and then start to count this 5 minutes. (This way I can also be sure that there is only one task running) The easiest way I found is to run django application manage.py shell and run this:
while True:
    result = task.delay()
    result.wait()
    sleep(5 * 60)  # wait 5 minutes after the previous run has finished
But for each task that I want to execute this way I have to run its own shell. Is there an easier way to do it? Maybe some kind of custom or django-celery scheduler?
Wow it's amazing how no one understands this person's question. They are asking not about running tasks periodically, but how to ensure that Celery does not run two instances of the same task simultaneously. I don't think there's a way to do this with Celery directly, but what you can do is have one of the tasks acquire a lock right when it begins, and if it fails, to try again in a few seconds (using retry). The task would release the lock right before it returns; you can make the lock auto-expire after a few minutes if it ever crashes or times out.
For the lock you can probably just use your database or something like Redis.
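The lock-and-retry idea can be sketched as follows. To keep the example runnable on its own, ExpiringLock is an in-memory stand-in written for this illustration; it is not shared between worker processes, so a real deployment would need a shared store such as the database or Redis mentioned above. run_exclusively is likewise a hypothetical helper.

```python
import time


class ExpiringLock:
    """In-memory stand-in for a shared lock store (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._expires_at = {}
        self._clock = clock

    def acquire(self, name, ttl_seconds):
        now = self._clock()
        expires_at = self._expires_at.get(name)
        if expires_at is not None and expires_at > now:
            return False  # someone else holds an unexpired lock
        # Free, or left behind by a crashed holder and now expired.
        self._expires_at[name] = now + ttl_seconds
        return True

    def release(self, name):
        self._expires_at.pop(name, None)


def run_exclusively(lock, name, work, ttl_seconds=300):
    """Run work() only if the lock is free; return (ran, result)."""
    if not lock.acquire(name, ttl_seconds):
        # A bound Celery task would typically call
        # self.retry(countdown=...) here instead of returning.
        return (False, None)
    try:
        return (True, work())
    finally:
        # Release right before returning, as described above.
        lock.release(name)


lock = ExpiringLock()
print(run_exclusively(lock, "my_task", lambda: "done"))
```

The expiry time should comfortably exceed the longest expected run; otherwise the lock can lapse while the task is still working and a second instance may start.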
You may be interested in this simpler method, which requires no changes to the Celery conf.
import datetime
import celery.decorators

@celery.decorators.periodic_task(run_every=datetime.timedelta(minutes=5))
def my_task():
    # Insert fun-stuff here
    pass
All you need to do is specify in the Celery conf which task you want to run periodically and at which interval.
Example: Run the tasks.add task every 30 seconds
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    "runs-every-30-seconds": {
        "task": "tasks.add",
        "schedule": timedelta(seconds=30),
        "args": (16, 16)
    },
}
Remember that you have to run celery in beat mode with the -B option
manage celeryd -B
You can also use the crontab style instead of a time interval; check out this:
http://ask.github.com/celery/userguide/periodic-tasks.html
If you are using django-celery, remember that you can also use the Django database as the scheduler for periodic tasks; this way you can easily add new periodic tasks through the django-celery admin panel.
To do that, set the celerybeat scheduler in settings.py like this:
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
To expand on @MauroRocco's post, from http://docs.celeryproject.org/en/v2.2.4/userguide/periodic-tasks.html
Using a timedelta for the schedule means the task will be executed 30 seconds after celerybeat starts, and then every 30 seconds after the last run. A crontab like schedule also exists, see the section on Crontab schedules.
So this will indeed achieve the goal you want.
Because celery.decorators is deprecated, you can use the periodic_task decorator like this:
from datetime import timedelta
from celery.task.base import periodic_task

@periodic_task(run_every=timedelta(seconds=5))
def my_background_process():
    # insert code
    pass
Add that task to a separate queue, and then use a separate worker for that queue with the concurrency option set to 1.
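As a rough sketch of that setup, the routing might be configured as below. The task name tasks.long_task, the queue name singleton and the app name proj are placeholders, and the lowercase task_routes setting assumes a newer Celery version (older releases, like the ones used elsewhere on this page, spell it CELERY_ROUTES).

```python
# Route the long-running task to its own queue.
# "tasks.long_task" and "singleton" are hypothetical names.
task_routes = {
    "tasks.long_task": {"queue": "singleton"},
}

# A dedicated worker for that queue would then be started with
# a single worker process, for example:
#   celery -A proj worker -Q singleton --concurrency=1
```

With one process consuming that queue, tasks on it run one at a time; other queues can keep their own workers with higher concurrency.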