Design pattern to best implement batch api requests that happen transparently to the calling layer - kotlin

I have a batch processor that I want to refactor so that it is expressed in a 1-to-1 fashion based on input, both to increase readability and to allow further optimization later on. The issue is that there is a service that should be called in batches to reduce HTTP overhead, so mixing the 1-to-1 code with the batch code is a bit tricky, and we may not want to call the service with every input. Results can be sent out eagerly one-by-one, but order must be maintained, so something like a flow doesn't seem to work.
So, ideally the batch processor would look something like this:
class Processor<A, B> {
val service: Service<A, B>
val scope: CoroutineScope
fun processBatch(input: List<A>) {
input.map {
Pair(it, scope.async { service.call(it) })
}.map {
(a, b) ->
runBlocking { b.await().let { /** handle result, do something with a if result is null, etc **/ } }
}
}
}
The desire is to perform all of the service logic in such a way that it is executing in the background, automatically splitting the inputs for the service into batches, executing them asynchronously, and somehow mapping the result of the batch call into the suspended call.
Here is a hacky implementation:
class Service<A, B> {
val inputContainer: MutableList<A>
val outputs: MutableList<B>
val runCalled = AtomicBoolean(false)
val batchSize: Int
suspend fun call(input: A): B? {
// some prefiltering logic that returns a null early
val index = inputContainer.size
inputContainer.add(input) // add to overall list for later batching
run() // the first caller triggers all batch calls
return outputs[index]
}
fun run() {
val batchOutputs = mutableListOf<Deferred<List<B?>>>()
if (!runCalled.getAndSet(true)) {
inputContainer.chunked(batchSize).forEach {
batchOutputs.add(scope.async { batchCall(it) })
}
runBlocking {
batchOutputs.map {
val res = it.await()
outputs.addAll(res)
}
}
}
}
suspend fun batchCall(input: List<A>): List<B?> {
// batch API call, etc
}
}
Something like this could work but there are several concerns:
1. All API calls go out at once. Ideally this would be batching and executing in the background while other inputs are being scheduled, but it is not.
2. Processing of the service result for the first input cannot resume until all results have been returned. Ideally we could process a result as soon as its service call has returned, while other calls continue in the background.
3. Containers of intermediate results seem hacky and prone to bugs. Cleanup logic is also needed, which introduces more hacky bits into the rest of the code.
I can think of several optimizations to address 1 and 2, but I imagine they would make the concerns related to 3 worse. This seems like a fairly common call pattern and I would expect there to be a library or much simpler design pattern to accomplish this, but I haven't been able to find anything. Any guidance is appreciated.

You're on the right track by using Deferred. The solution I would use is:
When the caller makes a request, create a CompletableDeferred
Using a channel, pass this CompletableDeferred to the service for later completion
Have the caller suspend until the service completes the CompletableDeferred
It might look something like this:
val requestChannel = Channel<Pair<Request, CompletableDeferred<Result>>>()
suspend fun doRequest(request: Request): Result {
val result = CompletableDeferred<Result>()
requestChannel.send(Pair(request, result))
return result.await()
}
fun run() = scope.launch {
while(isActive) {
val (requests, deferreds) = getBatch(batchSize).unzip()
val results = batchCall(requests)
(results zip deferreds).forEach { (result, deferred) ->
deferred.complete(result)
}
}
}
suspend fun getBatch(batchSize: Int) = buildList {
repeat(batchSize) {
add(requestChannel.receive())
}
}
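One gap in the loop above is that getBatch suspends until a full batchSize requests have arrived, so a final partial batch would wait indefinitely. Below is a minimal sketch of one way around that, assuming it is acceptable to send whatever is already queued once at least one request is available. The names Batcher, call and close are hypothetical, and batchCall is assumed to return exactly one result per input, in the same order.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical wrapper: callers use call() one input at a time,
// a single worker coroutine groups the inputs into batches.
class Batcher<A, B>(
    scope: CoroutineScope,
    private val batchSize: Int,
    private val batchCall: suspend (List<A>) -> List<B>, // assumed: one result per input, same order
) {
    private val requests = Channel<Pair<A, CompletableDeferred<B>>>(Channel.UNLIMITED)

    init {
        scope.launch {
            // Suspends until a request arrives; the loop ends once close() is called.
            for (first in requests) {
                val batch = mutableListOf(first)
                // Take whatever is already queued, without waiting for a full batch.
                while (batch.size < batchSize) {
                    batch.add(requests.tryReceive().getOrNull() ?: break)
                }
                try {
                    val results = batchCall(batch.map { it.first })
                    batch.zip(results).forEach { (request, result) -> request.second.complete(result) }
                } catch (e: Exception) {
                    // Fail every waiting caller instead of leaving it suspended forever.
                    batch.forEach { it.second.completeExceptionally(e) }
                    if (e is CancellationException) throw e
                }
            }
        }
    }

    suspend fun call(input: A): B {
        val deferred = CompletableDeferred<B>()
        requests.send(input to deferred)
        return deferred.await()
    }

    fun close() {
        requests.close()
    }
}

fun main() = runBlocking {
    val batcher = Batcher<Int, String>(this, batchSize = 3) { inputs ->
        delay(10) // stands in for one HTTP round trip
        inputs.map { "result-$it" }
    }
    // Start all calls, then await them in input order so output order is preserved.
    val results = (1..5).map { async { batcher.call(it) } }.awaitAll()
    println(results) // [result-1, result-2, result-3, result-4, result-5]
    batcher.close()
}
```

Because the caller awaits the deferreds in input order, results can be handled eagerly one by one while later batches are still in flight. How the requests end up grouped depends on scheduling, so the sketch makes no promise about batch boundaries beyond the batchSize limit.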

Related

Non-blocking way to lazy-compute a value

I feel like I'm trying to reinvent the wheel, but I can't find what I should be using instead. Also, my wheel looks square-ish.
I want to get the result of some long-running operation (download/async API/compute/etc) in a thread-safe, cached way, that is:
1. if the long-running op has already completed, just return the cached value
2. if the long-running op is running, just wait for it to complete
3. otherwise start the long-running op and wait
I'd like all of the above waiting to be done in non-blocking suspend funcs. Am I missing some obvious doodad in the standard toolset?
I'd also like the cached value to be resettable (invalidate the result, so that the next get() starts over).
Just launching a coroutine will create a bunch of parallel long-running ops if called multiple times before it's done (fails case #2).
Both Flow and Deferred can be used to do that, I think, but both need a bunch of logic around them (like deciding when to start the op vs when to just wait).
So far I found this relatively simple way:
class CachedComputation<T>(private val compute: suspend () -> T) {
private var cache = GlobalScope.async(Dispatchers.Unconfined, start = CoroutineStart.LAZY) { compute() }
suspend fun get(): T {
return cache.await()
}
fun reset() {
cache = GlobalScope.async(Dispatchers.Unconfined, start = CoroutineStart.LAZY) { compute() }
}
}
It might need some synchronisation between get() and reset(). But the main problem with that is the usage of GlobalScope. Passing a Scope to get() is ugly and still starts a new coroutine, even though it's already a suspend func. I'd rather have it confined to the context of the caller.
I can solve it with a more complicated class:
class CachedComputation<T>(private val compute: suspend () -> T) {
private var future: CompletableDeferred<T>? = null
/**
* Return the result of computation once it's available.
* If the result is ready, return it immediately.
* If the result is being computed, wait for it to finish and return.
* If the computation is not running, start it and wait for result.
*/
suspend fun get(): T {
val (needToCompute, f) = newOrCurrentFuture()
return if (needToCompute) {
compute().also { f.complete(it) }
} else {
f.await()
}
}
/**
* If the computation hasn't started yet, create, save and return a new deferred result
* and indicate the need to start the computation.
* Otherwise return the current deferred result that can be awaited on, and indicate
* there's no need to start a new computation.
*/
@Synchronized
private fun newOrCurrentFuture(): Pair<Boolean, CompletableDeferred<T>> {
val currentFuture = future
return if (currentFuture != null)
Pair(false, currentFuture)
else
Pair(true, newFuture())
}
/**
* Create a new deferred result and save it in the class field
*/
private fun newFuture(): CompletableDeferred<T> {
return CompletableDeferred<T>()
.also { future = it }
}
@Synchronized
fun reset() {
future = null
}
}
But it seems unnecessarily complicated.
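For comparison, here is a minimal sketch of a simpler variant that guards the cached value with a coroutine Mutex, so the computation runs in the caller's context and concurrent callers suspend on the lock instead of blocking. The class name MutexCachedComputation is hypothetical, and the sketch assumes a non-null result type and that reset() may wait for an in-flight computation to finish.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical alternative: the Mutex serializes callers without blocking threads.
class MutexCachedComputation<T : Any>(private val compute: suspend () -> T) {
    private val mutex = Mutex()
    private var cached: T? = null

    // The first caller computes while holding the lock; the others suspend
    // on the lock and then find the cached value.
    suspend fun get(): T = mutex.withLock {
        cached ?: compute().also { cached = it }
    }

    // Waits for any running computation, then drops the cached value.
    suspend fun reset() = mutex.withLock { cached = null }
}

fun main() = runBlocking {
    var runs = 0
    val cache = MutexCachedComputation { delay(50); ++runs }
    val values = List(3) { async { cache.get() } }.awaitAll()
    println("$values after $runs run(s)") // [1, 1, 1] after 1 run(s)
    cache.reset()
    println("${cache.get()} after $runs run(s)") // 2 after 2 run(s)
}
```

If compute throws, the exception reaches only the caller that ran it and the next caller retries. Note that Mutex is not reentrant, so compute must not call get() on the same instance.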

Processing and aggregating data from multiple servers efficiently

Summary
My goal is to process and aggregate data from multiple servers efficiently while handling possible errors. For that, I
have a sequential version that I want to speed up. As I am using Kotlin, coroutines seem the way to go for this
asynchronous task. However, I'm quite new to this, and can't figure out how to do this idiomatically. None of my attempts
satisfied my requirements completely.
Here is the sequential version of the core function that I am currently using:
suspend fun readDataFromServers(): Set<String> = coroutineScope {
listOfServers
// step 1: read data from servers while logging errors
.mapNotNull { url ->
runCatching { makeRequestTo(url) }
.onFailure { println("err while accessing $url: $it") }
.getOrNull()
}
// step 2: do some element-wise post-processing
.map { process(it) }
// step 3: aggregate data
.toSet()
}
Background
In my use case, there are numServers I want to read data from. Each of them usually answers within successDuration,
but the connection attempt may fail after timeoutDuration with probability failProb and throw an IOException. As
downtimes are a common thing in my system, I do not need to retry anything, but only log it for the record. Hence,
the makeRequestTo function can be modelled as follows:
suspend fun makeRequestTo(url: String) =
if (random.nextFloat() > failProb) {
delay(successDuration)
"{Some response from $url}"
} else {
delay(timeoutDuration)
throw IOException("Connection to $url timed out")
}
Attempts
All these attempts can be tried out in the Kotlin playground. I don't know how long this link stays alive; maybe I'll need to upload this as a gist, but I liked that people can execute the code directly.
Async
I tried using async {makeRequestTo(it)} after listOfServers and awaiting the results in the following mapNotNull,
similar to this post. While this collapses the communication time to timeoutDuration, all following processing steps
have to wait for that long before they can continue. Hence, some composition of Deferreds was required here, which is
discouraged in Kotlin (or at least should be avoided in favor of suspending functions).
suspend fun readDataFromServersAsync(): Set<String> = supervisorScope {
listOfServers
.map { async { makeRequestTo(it) } }
.mapNotNull { kotlin.runCatching { it.await() }.onFailure { println("err: $it") }.getOrNull() }
.map { process(it) }
.toSet()
}
Loops
Using normal loops like below fulfills the functional requirements, but feels a bit more complex than it should be.
In particular, the part where shared state must be synchronized makes me distrust this code and any future
modifications to it.
val results = mutableSetOf<String>()
val mutex = Mutex()
val logger = CoroutineExceptionHandler { _, exception -> println("err: $exception") }
for (server in listOfServers) {
launch(logger) {
val response = makeRequestTo(server)
val processed = process(response)
mutex.withLock {
results.add(processed)
}
}
}
return@supervisorScope results
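A middle ground between the two attempts, sketched below under some assumptions, is to move the per-element processing inside each async block. Each response is then processed as soon as it arrives, and no shared mutable state is needed. The makeRequestTo and process functions here are deterministic stand-ins invented for this sketch, not the ones from the question, and the server list is passed as a parameter.

```kotlin
import kotlinx.coroutines.*
import java.io.IOException

// Stand-in for the real request: hosts containing "down" time out (assumption for this sketch).
suspend fun makeRequestTo(url: String): String {
    if ("down" in url) {
        delay(100)
        throw IOException("Connection to $url timed out")
    }
    delay(20)
    return "{Some response from $url}"
}

// Stand-in for the element-wise post-processing step.
fun process(response: String): String = response.uppercase()

suspend fun readDataFromServers(servers: List<String>): Set<String> = coroutineScope {
    servers
        .map { url ->
            async {
                // Request and processing run together in one coroutine per server.
                runCatching { process(makeRequestTo(url)) }
                    .onFailure { println("err while accessing $url: $it") }
                    .getOrNull()
            }
        }
        .awaitAll()      // failures were already turned into nulls above
        .filterNotNull()
        .toSet()
}

fun main() = runBlocking {
    println(readDataFromServers(listOf("a.example", "down.example", "b.example")))
}
```

Since every failure is caught inside its own async block, a plain coroutineScope is enough here. One caveat: runCatching also catches CancellationException, so code that relies on prompt cancellation may want to rethrow it.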

Parallelly consuming a long sequence in Kotlin

I have a function generating a very long sequence of work items. Generating these items is fast, but there are too many in total to store a list of them in memory. Processing the items produces no results, just side effects.
I would like to process these items across multiple threads. One solution is to have a thread read from the generator and write to a concurrent bounded queue, and a number of executor threads polling for work from the bounded queue, but this is a lot of things to set up.
Is there anything in the standard library that would help me do that?
I had initially tried
items.map { async(executor) { process(it) } }.forEach { it.await() }
But, as pointed out in how to implement parallel mapping for sequences in kotlin, this doesn't work for reasons that are obvious in retrospect.
Is there a quick way to do this (possibly with an external library), or is manually setting up a bounded queue in the middle my best option?
You can look at coroutines combined with channels.
If all work items can be emitted on demand through a producer channel, you can receive each item as it becomes available and process it with a pool of threads.
An example:
sealed class Stream {
object End: Stream()
class Item(val data: Long): Stream()
}
val produceCtx = newSingleThreadContext("producer")
// A dummy producer that send one million Longs on its own thread
val producer = CoroutineScope(produceCtx).produce {
for (i in (0 until 1000000L)) send(Stream.Item(i))
send(Stream.End)
}
val workCtx = newFixedThreadPoolContext(4, "work")
val workers = Channel<Unit>(4)
repeat(4) { workers.trySend(Unit) } // trySend replaces the deprecated offer
for(_nothing in workers) { // launch 4 times then wait for a task to finish
launch(workCtx) {
when (val item = producer.receive()) {
Stream.End -> workers.close()
is Stream.Item -> {
workFunction(item.data) // Actual work here
workers.trySend(Unit) // Notify to launch a new task (trySend replaces the deprecated offer)
}
}
}
}
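The same producer/worker idea can be sketched more compactly with a bounded channel that several worker coroutines read from (fan-out). This is a minimal sketch, not a drop-in replacement: forEachParallel is a hypothetical helper name, and the workers run on Dispatchers.Default rather than a dedicated thread pool.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper: consumes the sequence lazily with a fixed number of workers.
suspend fun <T> Sequence<T>.forEachParallel(workers: Int, action: suspend (T) -> Unit) = coroutineScope {
    // Bounded buffer: the producer suspends when the workers fall behind,
    // so only a handful of items are held in memory at any time.
    val queue = Channel<T>(capacity = workers)
    launch {
        for (item in this@forEachParallel) queue.send(item)
        queue.close() // lets the worker loops below finish
    }
    repeat(workers) {
        launch(Dispatchers.Default) {
            for (item in queue) action(item) // each item goes to exactly one worker
        }
    }
}

fun main() = runBlocking {
    val sum = AtomicLong()
    generateSequence(1L) { if (it < 100_000) it + 1 else null }
        .forEachParallel(workers = 4) { sum.addAndGet(it) }
    println(sum.get()) // 5000050000
}
```

The enclosing coroutineScope returns only after the producer and all workers have finished, and a failure in any worker cancels the rest.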
Your magic word would be .asSequence():
items
.asSequence() // Creates lazy executable sequence
.forEach { launch { executor.process(it) } } // If you don't need the value afterwards, use 'launch', a.k.a. "fire and forget"
"but there are too many in total to store a list of them in memory"
Then don't map to list and don't collect the values, no matter if you work with Kotlin or Java.
As long as you are on the JVM, you can write yourself an extension function, that works the sequence in chunks and spawns futures for all entries in a chunk. Something like this:
import java.util.concurrent.Executors
@Suppress("UNCHECKED_CAST")
fun <T, R> Sequence<T>.mapParallel(action: (value: T) -> R?): Sequence<R?> {
// Leave one core free, but always use at least one thread (chunked(0) would throw)
val numThreads = (Runtime.getRuntime().availableProcessors() - 1).coerceAtLeast(1)
return this
.chunked(numThreads)
.map { chunk ->
val threadPool = Executors.newFixedThreadPool(numThreads)
try {
return@map chunk
.map {
// CAUTION -> needs to be written like this
// otherwise the submit(Runnable) overload is called
// which always returns an empty Future!!!
val callable: () -> R? = { action(it) }
threadPool.submit(callable)
}
} finally {
threadPool.shutdown()
}
}
.flatten()
.map { future -> future.get() }
}
You can then just use it like:
items
.mapParallel { /* process an item */ }
.forEach { /* handle the result */ }
As long as the workload per item is similar, this parallelizes well; if one item in a chunk takes much longer than the others, the remaining threads sit idle until it finishes.

How To await a function call?

So I have some asynchronous operations happening. I can create a lambda, call a function and have the resulting value passed to it. But what I want is not to receive the result of the operation as a parameter; I want to return it.
As an example, I have a class A with some listeners; if there is a result, all listeners are notified. So basically the asyncFunction should return a result if there is one, and otherwise be suspended.
object A {
val listeners = mutableListOf<(Int) -> Unit>()
fun onResult(value: Int) {
listeners.forEach { it(value) }
}
}
fun asyncFunction(): Deferred<Int> {
return async {
A.listeners.add({ result ->
})
return result
}
}
What I'm thinking right now (maybe I'm completely on the wrong track) is to have something like a Deferred, to which I can send the result and which then returns it. Is there something like that? Can I implement a Deferred myself?
class A {
private val awaiter: ??? // can this be a Deferred ?
fun onResult(result: Int) {
awaiter.putResult(result)
}
fun awaitResult(): Int {
return awaiter.await()
}
}
val a = A()
launch {
val result = a.awaitResult()
}
launch {
a.onResult(42)
}
I do know that this can be handled with callbacks, but it would be cleaner and easier to have it this way.
I hope there is a nice and clean solution I'm just missing.
Your asyncFunction should in fact be a suspendable function:
suspend fun suspendFunction(): Int =
suspendCoroutine { cont -> A.listeners.add { cont.resume(it) } }
Note that it returns the Int result and suspends until it's available.
However, this is just a fix for your immediate problem. It will still malfunction in many ways:
the listener's purpose is served as soon as it gets the first result, but it stays in the listener list forever, resulting in a memory leak
if the result arrived before you called suspendFunction, it will miss it and hang.
You can keep improving it manually (it's a good way to learn) or switch to a solid solution provided by the standard library. The library solution is CompletableDeferred:
object A {
val result = CompletableDeferred<Int>()
fun provideResult(r: Int) {
result.complete(r)
}
}
suspend fun suspendFunction(): Int = A.result.await()
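To tie this back to the shape sketched in the question, here is a small runnable example, assuming a single result per instance. The class name ResultAwaiter is hypothetical. It also shows that a result delivered before anyone awaits is not lost.

```kotlin
import kotlinx.coroutines.*

// Hypothetical class mirroring the question's A, backed by a CompletableDeferred.
class ResultAwaiter {
    private val awaiter = CompletableDeferred<Int>()

    // Returns false if a result was already delivered; the first value wins.
    fun onResult(result: Int): Boolean = awaiter.complete(result)

    // Suspends until a result is available, or returns at once if it already is.
    suspend fun awaitResult(): Int = awaiter.await()
}

fun main() = runBlocking {
    val late = ResultAwaiter()
    val waiter = async { late.awaitResult() }     // suspends, nothing delivered yet
    launch { delay(10); late.onResult(42) }
    println(waiter.await())                       // 42

    val early = ResultAwaiter()
    early.onResult(7)                             // delivered before anyone waits
    println(early.awaitResult())                  // 7
    println(early.onResult(8))                    // false
}
```

A CompletableDeferred holds exactly one value, so for a stream of results a Channel or a Flow would be a better fit.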

Return value only of the faster coroutine

How can I run multiple coroutines in parallel and return only the value of the one that finishes first?
Real-life scenario: I have two data sources, a database and an API service. I don't care where the data originates, I just need it fast. How can I query both the database and the API service and cancel the other request as soon as one of them finishes?
In the RxJava world this would be equivalent to the amb operator. How can I achieve similar behaviour using coroutines?
I came up with following implementation:
suspend fun getFaster(): Int = coroutineScope {
select<Int> {
async { getFromServer() }.onAwait { it }
async { getFromDB() }.onAwait { it }
}.also {
coroutineContext.cancelChildren()
}
}
The coroutineScope acts as a parent to all async calls performed within. After the select finishes we can just cancel the rest.
You can use select to write your own amb operator. Something like this:
suspend fun <T> amb(vararg jobs: Deferred<T>): T = select {
fun cancelAll() = jobs.forEach { it.cancel() }
for (deferred in jobs) {
deferred.onAwait {
cancelAll()
it
}
}
}
You can read more about select expressions in the Kotlin coroutines documentation.
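Putting the two ideas together, here is a minimal runnable sketch that races any number of suspending sources inside a coroutineScope and cancels the losers. The helper name fastestOf and the delays standing in for the server and database calls are assumptions made for this sketch.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.selects.select

// Hypothetical helper: starts every source, returns the first result, cancels the rest.
suspend fun <T> fastestOf(vararg sources: suspend () -> T): T = coroutineScope {
    val racers = sources.map { async { it() } }
    select<T> {
        // Register one clause per racer; the first one to complete is selected.
        racers.forEach { deferred -> deferred.onAwait { it } }
    }.also {
        coroutineContext.cancelChildren() // stop the slower sources
    }
}

fun main() = runBlocking {
    val winner = fastestOf(
        { delay(200); "server" }, // stands in for getFromServer()
        { delay(20); "db" },      // stands in for getFromDB()
    )
    println(winner) // db
}
```

If the source that finishes first fails, the exception propagates to the caller, so "fastest" here means first to complete rather than first to succeed.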