Most elegant way to execute CPU-bound operations in asyncio application? - python-multiprocessing

I am trying to develop part of a system that has the following requirements:
send a health status to a remote server (every X seconds)
receive requests for executing/canceling CPU-bound jobs (for example: clone a git repo, compile it using conan, etc.)
I am using the socketio.AsyncClient to handle these requirements.
class CompileJobHandler(socketio.AsyncClientNamespace):
    def __init__(self, namespace_val):
        super().__init__(namespace_val)
        # some init variables

    async def _clone_git_repo(self, git_repo: str):
        # clone repo and return its instance
        return repo

    async def on_availability_check(self, data):
        # the health status
        await self.emit('availability_check', "all good")

    async def on_cancel_job(self, data):
        # cancel the current job

    def _reset_job(self):
        # reset job logic

    def _reset_to_specific_commit(self, repo: git.Repo, commit_hash: str):
        # reset to a specific commit

    def _compile(self, is_debug):
        # compile logic - might be CPU intensive

    async def on_execute_job(self, data):
        # **request to execute the job (compile, in our case)**
        try:
            repo = await self._clone_git_repo(job_details.git_repo)
            self._reset_to_specific_commit(repo, job_details.commit_hash)
            self._compile(job_details.is_debug)
            await self.emit('execute_job_response',
                            self._prepare_response("SUCCESS", "compiled successfully"))
        except Exception as e:
            await self.emit('execute_job_response',
                            self._prepare_response(e.args[0], e.args[1]))
        finally:
            self._reset_job()
The problem with the code above is that when an execute_job message arrives, blocking code runs and stalls the whole asyncio event loop.
To solve this problem, I used a ProcessPoolExecutor with the asyncio event loop, as shown here: https://stackoverflow.com/questions/49978320/asyncio-run-in-executor-using-processpoolexecutor
After using it, the clone/compile functions are executed in another process, so that almost achieves my goals.
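For reference, a minimal sketch of that approach (the ProcessPoolExecutor setup and the compile_job helper here are illustrative, not my exact code):

import asyncio
from concurrent.futures import ProcessPoolExecutor

# Worker-side function: must be a top-level, importable function because it
# is pickled and executed in the child process.
def compile_job(git_repo: str, commit_hash: str, is_debug: bool) -> str:
    # clone, reset to the commit, run the compile... (CPU-bound work)
    return "compiled successfully"

executor = ProcessPoolExecutor(max_workers=1)

async def on_execute_job(data):
    loop = asyncio.get_running_loop()
    # The event loop stays responsive (health checks keep running) while the
    # CPU-bound work happens in the child process.  Exceptions raised in the
    # worker are re-raised here when the future is awaited.
    result = await loop.run_in_executor(
        executor, compile_job, data['git_repo'], data['commit_hash'], data['is_debug'])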
The questions I have are:
How can I design the code of the process more elegantly? (Right now I have some static functions, and I don't like it...)
One approach is to keep it like that; another is to pre-initialize an object (let's call it CompileExecuter), create an instance of this type before starting the process, and then let the process use it.
How can I stop the process in the middle of its execution (if I receive an on_cancel_job request)?
How can I handle exceptions raised by the process correctly?
Other approaches to handling these requirements are welcome.

Related

Why is Python object id different after the Process starts but the pid remains the same?

"""
import time
from multiprocessing import Process, freeze_support
class FileUploadManager(Process):
"""
WorkerObject which uploads files in background process
"""
def __init__(self):
"""
Worker class to upload files in a separate background process.
"""
super().__init__()
self.daemon = True
self.upload_size = 0
self.upload_queue = set()
self.pending_uploads = set()
self.completed_uploads = set()
self.status_info = {'STOPPED'}
print(f"Initial ID: {id(self)}")
def run(self):
try:
print("STARTING NEW PROCESS...\n")
if 'STARTED' in self.status_info:
print("Upload Manager - Already Running!")
return True
self.status_info.add('STARTED')
print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
# Upload files
while True:
print("File Upload Queue Empty.")
time.sleep(10)
except Exception as e:
print(f"{repr(e)} - Cannot run upload process.")
if __name__ == '__main__':
upload_manager = FileUploadManager()
print(f"Object ID: {id(upload_manager)}")
upload_manager.start()
print(f"Process Info: {upload_manager.status_info}, ID After: {id(upload_manager)}")
while 'STARTED' not in upload_manager.status_info:
print(f"Not Started! Process Info: {upload_manager.status_info}")
time.sleep(7)
"""
OUTPUT
Initial ID: 2894698869712
Object ID: 2894698869712
Process Info: {'STOPPED'}, ID After: 2894698869712
Not Started! Process Info: {'STOPPED'}
STARTING NEW PROCESS...
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
File Upload Queue Empty.
Not Started! Process Info: {'STOPPED'}
File Upload Queue Empty.
Why does the Process object have the same id and attribute values before and after it has started, but a different id when the run method starts?
Initial ID: 2894698869712
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
Process Info: {'STOPPED'}, ID After: 2894698869712
I fixed your indentation, and I also removed everything from your script that was not actually being used. It is now a minimal, reproducible example that anyone can run. In the future, please adhere to the site guidelines, and please proofread your questions. It will save everybody's time and you will get better answers.
I would also like to point out that the question in your title is not at all the same as the question asked in your text. At no point do you retrieve the process ID, which is an operating system value. You are printing out the ID of the object, which is a value that has meaning only within the Python runtime environment.
import time
from multiprocessing import Process
# Removed freeze_support since it was unused

class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """
    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__(daemon=True)
        # The next line probably does not work as intended, so
        # I commented it out. The docs say that the daemon
        # flag must be set by a keyword-only argument
        # self.daemon = True
        # I removed a bunch of unused variables for this test program
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return  # Removed True return value (it was unused)
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(1.0)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}",
          f"ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(0.7)
Your question is why the id of upload_manager is the same before and after it is started. Simple answer: because it's the same object. It does not become a different object just because you called one of its methods. That would not make any sense.
I suppose you might be wondering why the ID of the FileUploadManager object is different when you print it out from its "run" method. It's the same simple answer: because it's a different object. Your script actually creates two instances of FileUploadManager, although it's not obvious. In Python, each Process has its own memory space. When you start a secondary Process (upload_manager.start()), Python makes a second instance of FileUploadManager to execute in this new Process. The two instances are completely separate and "know" nothing about each other.
You did not mention it, but your script actually does not terminate. It runs forever, stuck in the loop while 'STARTED' not in upload_manager.status_info. That's because 'STARTED' was added to self.status_info in the secondary Process. That Process is working with a different instance of FileUploadManager. The changes you make there do not get automatically reflected in the first instance, which lives in the main Process. Therefore the first instance of FileUploadManager never changes, and the loop never exits.
This all makes perfect sense once you realize that each Process works with its own separate objects. If you need to pass data from one Process to another, that can be done with Pipes, Queues, Managers and shared variables. That is documented in the Concurrent Execution section of the standard library.
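As a rough illustration (the use of an Event here is my own choice of primitive, not something from your original script), sharing the "started" flag between the two Processes could look like this:

import os
import time
from multiprocessing import Process, Event

class FileUploadManager(Process):
    def __init__(self):
        super().__init__(daemon=True)
        # An Event is a shared synchronization primitive, so both the parent
        # and the child process see the same flag - unlike a plain attribute.
        self.started = Event()

    def run(self):
        print(f"Worker PID: {os.getpid()}")   # the OS-level process id differs
        self.started.set()
        while True:
            print("File Upload Queue Empty.")
            time.sleep(1.0)

if __name__ == '__main__':
    print(f"Main PID: {os.getpid()}")
    upload_manager = FileUploadManager()
    upload_manager.start()
    upload_manager.started.wait()             # blocks until run() sets the flag
    print("Worker process has started.")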

Python rq-scheduler: enqueue a failed job after some interval

I am using Python RQ to execute a job in the background. The job calls a third-party REST API and stores the response in the database (refer to the code below).
@classmethod
def fetch_resource(cls, resource_id):
    import requests
    clsmgr = cls(resource_id)
    clsmgr.__sign_headers()
    res = requests.get(url=f'http://api.demo-resource.com/{resource_id}', headers=clsmgr._headers)
    if not res.ok:
        raise MyThirdPartyAPIException(res)
    ....
The third-party API has a rate limit of around 7 requests/minute. I have created a retry handler to gracefully handle the 429 Too Many Requests HTTP status code and re-queue the job after a minute (the time unit changes based on the rate limit). To re-queue the job after some interval I am using rq-scheduler.
Please find the handler code below:
def retry_failed_job(job, exc_type, exc_value, traceback):
    if isinstance(exc_value, MyThirdPartyAPIException) and exc_value.status_code == 429:
        import datetime as dt
        sch = Scheduler(connection=Redis())
        # sch.enqueue_in(dt.timedelta(seconds=60), job.func_name, *job.args, **job.kwargs)
I am facing issues re-queueing the failed job back into the task queue. I cannot directly call sch.enqueue_in(dt.timedelta(seconds=60), job) in the handler code (as per the docs, enqueue_in expects the function whose call should be delayed, not a Job instance). How can I re-queue the job function with all of its args and kwargs?
Ah, the following statement does the trick:
sch.enqueue_in(dt.timedelta(seconds=60), job.func, *job.args, **job.kwargs)
The question is still open; let me know if anyone has a better approach.
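For completeness, a sketch of the whole exception handler with that fix applied (the return values follow RQ's exception-handler convention; the imports and the Redis connection details are assumptions on my side):

import datetime as dt

from redis import Redis
from rq_scheduler import Scheduler

def retry_failed_job(job, exc_type, exc_value, traceback):
    # MyThirdPartyAPIException is the exception raised in fetch_resource above.
    if isinstance(exc_value, MyThirdPartyAPIException) and exc_value.status_code == 429:
        sch = Scheduler(connection=Redis())
        # Re-schedule the original function with its original arguments.
        sch.enqueue_in(dt.timedelta(seconds=60), job.func, *job.args, **job.kwargs)
        return False  # stop the handler chain so the job is not marked as failed
    return True       # let the next handler (or the default one) deal with it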

how to structure beginning and ending synchronous calls using trio?

My ask is for structured trio pseudo-code (actual trio function-calls, but dummy worker-does-work-here fill-in) so I can understand and try out good flow-control practices for switching between synchronous and asynchronous processes.
I want to do the following...
load a file of json-data into a data-dict
aside: the data-dict looks like { 'key_a': {info_dict_a}, 'key_b': {info_dict_b} }
have each of n-workers...
access that data-dict to find the next record-to-process info-dict
prepare some data from the record-being-processed and post the data to a url
process the post-response to update a 'response' key in the record-being-processed info-dict
update the data-dict with the key's info-dict
overwrite the original file of json-data with the updated data-dict
Aside: I know there are other ways I could achieve my overall goal than the clunky repeated rewrite of a json file -- but I'm not asking for that input; I really would like to understand trio well enough to be able to use it for this flow.
So, the processes that I want to be synchronous:
the get next record-to-process info-dict
the updating of the data-dict
the overwriting of the original file of json-data with the updated data-dict
New to trio, I have working code here ... which I believe is getting the next record-to-process synchronously (via a trio.Semaphore() technique). But I'm pretty sure I'm not saving the file synchronously.
Learning Go a few years ago, I felt I grokked the approaches to interweaving synchronous and asynchronous calls -- but am not there yet with trio. Thanks in advance.
Here is how I would write the (pseudo-)code:
import json
import trio

async def process_file(input_file):
    # load the file synchronously
    with open(input_file) as fd:
        data = json.load(fd)

    # iterate over your dict asynchronously
    async with trio.open_nursery() as nursery:
        for key, sub in data.items():
            if sub['updated'] is None:
                sub['updated'] = 'in_progress'
                nursery.start_soon(post_update, {key: sub})

    # save your result json synchronously
    save_file(data, input_file)
trio guarantees that once you exit the async with block, every task you spawned is complete, so you can safely save your file: no more updates will occur.
I also removed the grab_next_entry function, because it seems to me that this function will iterate over the same keys (incrementally) at each call (giving O(n!) complexity), while you could simplify it by just iterating over your dict once (dropping the complexity to O(n)).
You don't need the Semaphore either, except if you want to limit the number of parallel post_update calls. But trio offers a built-in mechanism for this as well, thanks to its CapacityLimiter, which you would use like this:
limit = trio.CapacityLimiter(10)

async with trio.open_nursery() as nursery:
    async with limit:
        for x in z:
            nursery.start_soon(func, x)
UPDATE thanks to @njsmith's comment
So, in order to limit the number of concurrent post_update calls, you'll rewrite it like this:
async def post_update(data, limit):
    async with limit:
        ...
And then you can rewrite the previous loop like that:
limit = trio.CapacityLimiter(10)

# iterate over your dict asynchronously
async with trio.open_nursery() as nursery:
    for key, sub in data.items():
        if sub['updated'] is None:
            sub['updated'] = 'in_progress'
            nursery.start_soon(post_update, {key: sub}, limit)
This way, we spawn n tasks for the n entries in your data-dict, but if there are more than 10 tasks running concurrently, then the extra ones will have to wait for the limit to be released (at the end of the async with limit block).
This code uses channels to multiplex requests to and from a pool of workers. I found the additional requirement (in your code comments) that the post-response rate is throttled, so read_entries sleeps after each send.
from random import random
import time, asks, trio

snd_input, rcv_input = trio.open_memory_channel(0)
snd_output, rcv_output = trio.open_memory_channel(0)

async def read_entries():
    async with snd_input:
        for key_entry in range(10):
            print("reading", key_entry)
            await snd_input.send(key_entry)
            await trio.sleep(1)

async def work(n):
    async for key_entry in rcv_input:
        print(f"w{n} {time.monotonic()} posting", key_entry)
        r = await asks.post(f"https://httpbin.org/delay/{5 * random()}")
        await snd_output.send((r.status_code, key_entry))

async def save_entries():
    async for entry in rcv_output:
        print("saving", entry)

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(read_entries)
        nursery.start_soon(save_entries)
        async with snd_output:
            async with trio.open_nursery() as workers:
                for n in range(3):
                    workers.start_soon(work, n)

trio.run(main)

How to wrap asyncio with iterator

I have the following simplified code:
async def asynchronous_function(*args, **kwds):
    statement = await prepare(query)
    async with conn.transaction():
        async for record in statement.cursor():
            ??? yield record ???
...

class Foo:
    def __iter__(self):
        records = ??? asynchronous_function ???
        yield from records

...

x = Foo()
for record in x:
    ...
I don't know how to fill in the ??? above. I want to yield the record data, but it's really not obvious how to wrap asyncio code.
While it is true that asyncio is intended to be used across the board, sometimes it is simply impossible to immediately convert a large piece of software (with all its dependencies) to async. Fortunately there are ways to combine legacy synchronous code with newly written asyncio portions. A straightforward way to do so is by running the event loop in a dedicated thread, and using asyncio.run_coroutine_threadsafe to submit tasks to it.
With those low-level tools you can write a generic adapter to turn any asynchronous iterator into a synchronous one. For example:
import asyncio, threading, queue

# create an asyncio loop that runs in the background to
# serve our asyncio needs
loop = asyncio.get_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def wrap_async_iter(ait):
    """Wrap an asynchronous iterator into a synchronous one"""
    q = queue.Queue()
    _END = object()

    def yield_queue_items():
        while True:
            next_item = q.get()
            if next_item is _END:
                break
            yield next_item
        # After observing _END we know the aiter_to_queue coroutine has
        # completed.  Invoke result() for side effect - if an exception
        # was raised by the async iterator, it will be propagated here.
        async_result.result()

    async def aiter_to_queue():
        try:
            async for item in ait:
                q.put(item)
        finally:
            q.put(_END)

    async_result = asyncio.run_coroutine_threadsafe(aiter_to_queue(), loop)
    return yield_queue_items()
Then your code just needs to call wrap_async_iter to wrap an async iter into a sync one:
async def mock_records():
    for i in range(3):
        yield i
        await asyncio.sleep(1)

for record in wrap_async_iter(mock_records()):
    print(record)
In your case Foo.__iter__ would use yield from wrap_async_iter(asynchronous_function(...)).
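To make that concrete, a minimal sketch of the Foo class from the question wired up this way (the real arguments to asynchronous_function are elided here, as in the original):

class Foo:
    def __iter__(self):
        # asynchronous_function() returns an async iterator; wrap_async_iter
        # feeds it through a queue so a plain `for` loop can consume it.
        yield from wrap_async_iter(asynchronous_function())

x = Foo()
for record in x:
    print(record)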
If you want to receive all records from the async generator, you can use async for or, for brevity, an asynchronous comprehension:
async def asynchronous_function(*args, **kwds):
    # ...
    yield record

async def aget_records():
    records = [
        record
        async for record
        in asynchronous_function()
    ]
    return records
If you want to get result from asynchronous function synchronously (i.e. blocking), you can just run this function in asyncio loop:
def get_records():
    records = asyncio.run(aget_records())
    return records
Note, however, that once you run a coroutine this way, blocking on the event loop, you lose the ability to run that coroutine concurrently (i.e. in parallel) with other coroutines, and thus lose the related benefits.
As Vincent already pointed out in the comments, asyncio is not a magic wand that makes code faster; it's an instrument that can sometimes be used to run different I/O tasks concurrently with low overhead.
You may be interested in reading this answer to see main idea behind asyncio.

Celery: Task Singleton?

I have a task that I need to run asynchronously from the web page that triggered it. This task runs rather long, and as the web page could be getting a lot of these requests, I'd like celery to only run one instance of this task at a given time.
Is there any way I can do this in Celery natively? I'm tempted to create a database table that holds this state for all the tasks to communicate with, but it feels hacky.
You can probably create a dedicated worker for that task configured with CELERYD_CONCURRENCY=1; then all tasks on that worker will run sequentially, one at a time.
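For example (the queue name, app name, and task path here are just illustrative; task_routes is the Celery 4+ lowercase name for CELERY_ROUTES), you could route the task to its own queue and start a dedicated worker for it with a concurrency of one:

# celeryconfig.py - send the long-running task to its own queue
task_routes = {
    'myapp.tasks.long_running_task': {'queue': 'singleton'},
}

# Start a worker that consumes only that queue with a single process,
# so at most one instance of the task runs at any given time:
#
#   celery -A myapp worker -Q singleton --concurrency=1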
You can use memcached/Redis for that.
There is an example on the official Celery site: http://docs.celeryproject.org/en/latest/tutorials/task-cookbook.html
And if you prefer Redis (this is a Django implementation, but you can easily modify it for your needs):
from celery import Task
from django.core.cache import cache
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

class SingletonTask(Task):
    def __call__(self, *args, **kwargs):
        lock = cache.lock(self.name)
        if not lock.acquire(blocking=False):
            logger.info("{} failed to lock".format(self.name))
            return
        try:
            super(SingletonTask, self).__call__(*args, **kwargs)
        except Exception as e:
            lock.release()
            raise e
        lock.release()
And then use it as a base task:

@shared_task(base=SingletonTask)
def test_task():
    from time import sleep
    sleep(10)
This implementation is non-blocking. If you want the next task to wait for the previous one, change blocking=False to blocking=True and add a timeout.
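A rough sketch of that blocking variant (the parameter names follow the redis-py Lock API that django-redis returns from cache.lock; treat the exact values and names as assumptions):

class BlockingSingletonTask(Task):
    def __call__(self, *args, **kwargs):
        # timeout: how long Redis keeps the lock before expiring it.
        lock = cache.lock(self.name, timeout=60 * 10)
        # blocking=True makes the call wait for the previous task to finish;
        # blocking_timeout gives up waiting after that many seconds.
        if not lock.acquire(blocking=True, blocking_timeout=60):
            logger.info("{} could not acquire the lock in time".format(self.name))
            return
        try:
            super(BlockingSingletonTask, self).__call__(*args, **kwargs)
        finally:
            lock.release()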