I am trying to call an API 10 times asynchronously through the cast method of GenServer. Can someone guide me on how to collect the responses of the 10 API calls and consolidate them into one list of tuples?
defmodule DataMonitor.RuleReceiver do
  use GenServer

  require Logger

  alias DataMonitor.{
    ProcessRuleSet
  }

  def init(state) do
    {:ok, state}
  end

  def start_link do
    GenServer.start_link(__MODULE__, [])
  end

  def process_rule_set(receiver_pid, {rule_set, company_id, auth_headers}) do
    GenServer.cast(receiver_pid, {rule_set, company_id, auth_headers})
  end

  def handle_cast({rule_set, company_id, auth_headers}, state) do
    result = ProcessRuleSet.process_rule_set(rule_set, company_id, auth_headers)
    {:noreply, state}
  end
end
Sender/Caller module
defmodule DataMonitor.RuleSender do
  def perform() do
    Enum.each(rule_sets, fn rule_set ->
      {:ok, pid} = RuleReceiver.start_link()
      RuleReceiver.process_rule_set(pid, {rule_set, company_id, auth_headers})
    end)
  end
end
through GenServer.cast/2 [...] how I can collect the responses
You cannot, per se. Cast requests are asynchronous and return no response. The only way to collect the data from casts would be to introduce a store (like an Agent, ETS, or whatever) and store the values directly from the casts. This solution has an obvious drawback: its workflow would be nondeterministic; one cannot assume that all 10 responses have been processed and stored at any given time. That said, under some circumstances casts might even be lost and go unprocessed, and you have no way to get notified about that. I have never run into such a case, but it is considered legitimate.
So, in this particular case, you probably should just use GenServer.call/2 instead of cast, and collect responses directly in the iteration with Enum.map/2:
def perform() do
  Enum.map(rule_sets, fn rule_set ->
    {:ok, pid} = RuleReceiver.start_link()
    {pid, RuleReceiver.process_rule_set_call(...)}
  end)
end
As a result, you'll have 10 tuples of {pid, response}. I am not sure why you create 10 GenServers here, or why you would not use named GenServers to avoid dealing with pids, but that is obviously out of scope here.
Related
My ask is for structured trio pseudo-code (actual trio function calls, but with dummy worker-does-work-here fill-ins) so I can understand and try out good flow-control practices for switching between synchronous and asynchronous processes.
I want to do the following...
load a file of json-data into a data-dict
aside: the data-dict looks like { 'key_a': {info_dict_a}, 'key_b': {info_dict_b} }
have each of n-workers...
access that data-dict to find the next record-to-process info-dict
prepare some data from the record-being-processed and post the data to a url
process the post-response to update a 'response' key in the record-being-processed info-dict
update the data-dict with the key's info-dict
overwrite the original file of json-data with the updated data-dict
Aside: I know there are other ways I could achieve my overall goal than the clunky repeated rewrite of a json file -- but I'm not asking for that input; I really would like to understand trio well enough to be able to use it for this flow.
So, the processes that I want to be synchronous:
getting the next record-to-process info-dict
the updating of the data-dict
the overwriting of the original file of json-data with the updated data-dict
New to trio, I have working code here, which I believe gets the next record-to-process synchronously (using a trio.Semaphore() technique). But I'm pretty sure I'm not saving the file synchronously.
Learning Go a few years ago, I felt I grokked the approaches to interweaving synchronous and asynchronous calls -- but am not there yet with trio. Thanks in advance.
Here is how I would write the (pseudo-)code:
import json

import trio

async def process_file(input_file):
    # load the file synchronously
    with open(input_file) as fd:
        data = json.load(fd)

    # iterate over your dict asynchronously
    async with trio.open_nursery() as nursery:
        for key, sub in data.items():
            if sub['updated'] is None:
                sub['updated'] = 'in_progress'
                nursery.start_soon(post_update, {key: sub})

    # save your result json synchronously
    save_file(data, input_file)
trio guarantees that once you exit the async with block, every task you spawned is complete, so you can safely save your file because no more updates will occur.
I also removed the grab_next_entry function, because it seems to me that this function will iterate over the same keys (incrementally) at each call, giving O(n²) complexity, while you could simplify it by just iterating over your dict once (dropping the complexity to O(n)).
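For completeness, here is a minimal sketch of what the post_update and save_file helpers assumed above could look like; the asks HTTP client, the placeholder URL, and the 'response' key are my assumptions, not part of the original answer:

import json

import asks

POST_URL = "https://example.com/update"  # placeholder endpoint

async def post_update(entry):
    # entry is a one-item dict {key: info_dict}; post it and record the response
    (key, sub), = entry.items()
    r = await asks.post(POST_URL, data=json.dumps({key: sub}))
    sub['response'] = r.status_code
    sub['updated'] = 'done'

def save_file(data, input_file):
    # plain synchronous overwrite of the original json file
    with open(input_file, 'w') as fd:
        json.dump(data, fd, indent=2)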
You don't need the Semaphore either, except if you want to limit the number of parallel post_update calls. But trio offers a built-in mechanism for this as well, thanks to its CapacityLimiter, which you would use like this:
limit = trio.CapacityLimiter(10)

async with trio.open_nursery() as nursery:
    async with limit:
        for x in z:
            nursery.start_soon(func, x)
UPDATE thanks to @njsmith's comment
So, in order to limit the number of concurrent post_update calls, you'll rewrite it like this:
async def post_update(data, limit):
    async with limit:
        ...
And then you can rewrite the previous loop like this:
limit = trio.CapacityLimiter(10)

# iterate over your dict asynchronously
async with trio.open_nursery() as nursery:
    for key, sub in data.items():
        if sub['updated'] is None:
            sub['updated'] = 'in_progress'
            nursery.start_soon(post_update, {key: sub}, limit)
This way, we spawn n tasks for the n entries in your data-dict, but if there are more than 10 tasks running concurrently, then the extra ones will have to wait for the limit to be released (at the end of the async with limit block).
This code uses channels to multiplex requests to and from a pool of workers. I found the additional requirement (in your code comments) that the posting rate should be throttled, so read_entries sleeps after each send.
from random import random
import time, asks, trio

snd_input, rcv_input = trio.open_memory_channel(0)
snd_output, rcv_output = trio.open_memory_channel(0)

async def read_entries():
    async with snd_input:
        for key_entry in range(10):
            print("reading", key_entry)
            await snd_input.send(key_entry)
            await trio.sleep(1)

async def work(n):
    async for key_entry in rcv_input:
        print(f"w{n} {time.monotonic()} posting", key_entry)
        r = await asks.post(f"https://httpbin.org/delay/{5 * random()}")
        await snd_output.send((r.status_code, key_entry))

async def save_entries():
    async for entry in rcv_output:
        print("saving", entry)

async def main():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(read_entries)
        nursery.start_soon(save_entries)
        async with snd_output:
            async with trio.open_nursery() as workers:
                for n in range(3):
                    workers.start_soon(work, n)

trio.run(main)
I am using the taskqueue API to send multiple emails in small groups with Mailgun. My code looks more or less like this:
class CpMsg(ndb.Model):
    group = ndb.KeyProperty()
    sent = ndb.BooleanProperty()
    # Other properties


def send_mail(messages):
    """Sends a request to mailgun's API"""
    # Some code
    pass


class MailTask(TaskHandler):
    def post(self):
        p_key = utils.key_from_string(self.request.get('p'))

        msgs = CpMsg.query(
            CpMsg.group==p_key,
            CpMsg.sent==False).fetch(BATCH_SIZE)

        if msgs:
            send_mail(msgs)
            for msg in msgs:
                msg.sent = True
            ndb.put_multi(msgs)

        # Call the task again in COOLDOWN seconds
The code above has been working fine, but according to the docs, the taskqueue API guarantees that a task is delivered at least once, so tasks should be idempotent. Now, most of the time this would be the case with the above code, since it only fetches messages whose 'sent' property is False. The problem is that non-ancestor ndb queries are only eventually consistent, which means that if the task is executed twice in quick succession, the query may return stale results and include messages that were just sent.
I thought of including an ancestor for the messages, but since the sent emails will be in the thousands I'm worried that may mean having large entity groups, which have a limited write throughput.
Should I use an ancestor to make the queries? Or maybe there is a way to configure mailgun to avoid sending the same email twice? Should I just accept the risk that in some rare cases a few emails may be sent more than once?
One possible approach to avoid the eventual-consistency hurdle is to make the query a keys_only one, then iterate through the message keys and get the actual messages by key lookup (which is strongly consistent), check whether msg.sent is True, and skip sending those messages in that case. Something along these lines:
msg_keys = CpMsg.query(
    CpMsg.group==p_key,
    CpMsg.sent==False).fetch(BATCH_SIZE, keys_only=True)
if not msg_keys:
    return

msgs = ndb.get_multi(msg_keys)

msgs_to_send = []
for msg in msgs:
    if not msg.sent:
        msgs_to_send.append(msg)

if msgs_to_send:
    send_mail(msgs_to_send)
    for msg in msgs_to_send:
        msg.sent = True
    ndb.put_multi(msgs_to_send)
You'd also have to make your post call transactional (with the @ndb.transactional() decorator).
This should address the duplicates caused by the query's eventual consistency. However, there is still room for duplicates caused by transaction retries due to datastore contention (or any other reason), as the send_mail() call isn't idempotent. Sending one message at a time (maybe using the task queue) could reduce the chance of that happening. See also GAE/P: Transaction safety with API calls
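As a rough illustration only (not the original code): since ndb only allows ancestor queries inside transactions, one way to arrange this is to keep the keys-only query outside the transaction and wrap the strongly consistent check-and-mark step in a transactional helper; with messages in separate entity groups this needs a cross-group transaction, which caps a batch at 25 entity groups. mark_and_send is a hypothetical helper that post() would call with the keys fetched by the keys_only query:

@ndb.transactional(xg=True)  # cross-group transaction; at most 25 entity groups per call
def mark_and_send(msg_keys):
    msgs = ndb.get_multi(msg_keys)  # key lookups are strongly consistent
    msgs_to_send = [msg for msg in msgs if not msg.sent]
    if msgs_to_send:
        send_mail(msgs_to_send)  # still not idempotent if the transaction retries
        for msg in msgs_to_send:
            msg.sent = True
        ndb.put_multi(msgs_to_send)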
I have the following simplified code:
async def asynchronous_function(*args, **kwds):
    statement = await prepare(query)
    async with conn.transaction():
        async for record in statement.cursor():
            ??? yield record ???

...

class Foo:
    def __iter__(self):
        records = ??? asynchronous_function ???
        yield from records

...

x = Foo()
for record in x:
    ...
I don't know how to fill in the ??? above. I want to yield the record data, but it's really not obvious how to wrap asyncio code.
While it is true that asyncio is intended to be used across the board, sometimes it is simply impossible to immediately convert a large piece of software (with all its dependencies) to async. Fortunately there are ways to combine legacy synchronous code with newly written asyncio portions. A straightforward way to do so is by running the event loop in a dedicated thread, and using asyncio.run_coroutine_threadsafe to submit tasks to it.
With those low-level tools you can write a generic adapter to turn any asynchronous iterator into a synchronous one. For example:
import asyncio, threading, queue

# create an asyncio loop that runs in the background to
# serve our asyncio needs
loop = asyncio.get_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

def wrap_async_iter(ait):
    """Wrap an asynchronous iterator into a synchronous one"""
    q = queue.Queue()
    _END = object()

    def yield_queue_items():
        while True:
            next_item = q.get()
            if next_item is _END:
                break
            yield next_item
        # After observing _END we know the aiter_to_queue coroutine has
        # completed. Invoke result() for side effect - if an exception
        # was raised by the async iterator, it will be propagated here.
        async_result.result()

    async def aiter_to_queue():
        try:
            async for item in ait:
                q.put(item)
        finally:
            q.put(_END)

    async_result = asyncio.run_coroutine_threadsafe(aiter_to_queue(), loop)
    return yield_queue_items()
Then your code just needs to call wrap_async_iter to wrap an async iter into a sync one:
async def mock_records():
    for i in range(3):
        yield i
        await asyncio.sleep(1)

for record in wrap_async_iter(mock_records()):
    print(record)
In your case Foo.__iter__ would use yield from wrap_async_iter(asynchronous_function(...)).
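In other words, a minimal sketch (constructor and arguments omitted):

class Foo:
    def __iter__(self):
        # delegate to the background event loop through the sync adapter
        yield from wrap_async_iter(asynchronous_function())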
If you want to receive all records from an async generator, you can use async for or, for brevity, an asynchronous comprehension:
async def asynchronous_function(*args, **kwds):
    # ...
    yield record

async def aget_records():
    records = [
        record
        async for record
        in asynchronous_function()
    ]
    return records
If you want to get the result of an asynchronous function synchronously (i.e. blocking), you can just run this function in an asyncio event loop:
def get_records():
    records = asyncio.run(aget_records())
    return records
Note, however, that once you run a coroutine in the event loop this way, you lose the ability to run it concurrently (i.e. in parallel) with other coroutines and thus receive the related benefits.
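For contrast, a rough sketch of keeping that ability by composing coroutines inside one event loop; other_io_work is a hypothetical coroutine, not part of the original answer:

async def main():
    # both coroutines make progress concurrently in the same event loop
    records, _ = await asyncio.gather(aget_records(), other_io_work())
    return records

records = asyncio.run(main())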
As Vincent already pointed out in the comments, asyncio is not a magic wand that makes code faster; it's an instrument that can sometimes be used to run different I/O tasks concurrently with low overhead.
You may be interested in reading this answer to see main idea behind asyncio.
Consider the following minimal example:
defmodule Foo do
  def bar() do
    n = IO.read(:line) |> String.trim() |> String.to_integer()

    for _ <- 0..n - 1 do
      IO.read(:line) |> IO.write()
    end
  end
end
import ExUnit.CaptureIO
capture_io("2\nabc\ndef", Foo.bar)
I did look into the documentation, and it puts no limitations on ExUnit.CaptureIO usage, but the code listed above hangs, waiting for the first line of input, as if it hadn't been fed. Have I missed something?
If it matters, I'm using Elixir 1.7.3.
The second argument to capture_io needs to be a function to run with capturing enabled. Here, you're passing in the result of running Foo.bar. That hangs forever, as it is expecting input from stdio, which never comes. Long story short, you need to pass it as a function:
capture_io("2\nabc\ndef", &Foo.bar/0)
because Foo.bar is the same as Foo.bar().
Part of the implementation of inlineCallbacks is this:
if isinstance(result, Deferred):
    # a deferred was yielded, get the result.
    def gotResult(r):
        if waiting[0]:
            waiting[0] = False
            waiting[1] = r
        else:
            _inlineCallbacks(r, g, deferred)

    result.addBoth(gotResult)
    if waiting[0]:
        # Haven't called back yet, set flag so that we get reinvoked
        # and return from the loop
        waiting[0] = False
        return deferred

    result = waiting[1]
    # Reset waiting to initial values for next loop. gotResult uses
    # waiting, but this isn't a problem because gotResult is only
    # executed once, and if it hasn't been executed yet, the return
    # branch above would have been taken.
    waiting[0] = True
    waiting[1] = None
As shown, if in an inlineCallbacks-decorated function I make a call like this:
@inlineCallbacks
def myfunction(a, b):
    c = callsomething(a)
    yield twisted.internet.defer.succeed(None)
    print callsomething2(b, c)
This yield will return to the function immediately (that is: it won't be re-scheduled, but will immediately continue from the yield). This contrasts with Tornado's tornado.gen.moment (which is nothing more than an already-resolved Future with a result of None), which makes the yielder re-schedule itself, regardless of whether the future is already resolved or not.
How can I get behavior like the one Tornado provides when yielding a dummy future like moment?
The equivalent might be something like yielding a Deferred that doesn't fire until "soon". reactor.callLater(0, ...) is generally accepted as a way to create a timed event that doesn't run now but will run pretty soon. You can easily get a Deferred that fires based on this using twisted.internet.task.deferLater(reactor, 0, lambda: None).
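For example, a minimal sketch of that idea, reusing the skeleton from the question (callsomething and callsomething2 are the question's placeholders):

from twisted.internet import reactor, task
from twisted.internet.defer import inlineCallbacks

@inlineCallbacks
def myfunction(a, b):
    c = callsomething(a)
    # this Deferred only fires on a later reactor iteration, so the
    # generator is re-scheduled instead of continuing inline
    yield task.deferLater(reactor, 0, lambda: None)
    print(callsomething2(b, c))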
You may want to look at alternate scheduling tools instead, though (in both Twisted and Tornado). This kind of re-scheduling trick generally only works in small, simple applications. Its effectiveness diminishes the more tasks concurrently employ it.
Consider whether something like twisted.internet.task.cooperate might provide a better solution instead.
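For instance, a rough sketch of the cooperate approach, where do_one_chunk is a placeholder for one unit of work (possibly returning a Deferred):

from twisted.internet import task

def work_chunks():
    for i in range(10):
        # each iteration is one unit of work; if it yields a Deferred,
        # cooperate waits for it before resuming the iterator
        yield do_one_chunk(i)

coop_task = task.cooperate(work_chunks())
d = coop_task.whenDone()  # Deferred that fires when the iterator is exhausted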