Processing and aggregating data from multiple servers efficiently

Processing and aggregating data from multiple servers efficiently - kotlin

Summary
My goal is to process and aggregate data from multiple servers efficiently while handling possible errors. For that, I
have a sequential version that I want to speed up. As I am using Kotlin, coroutines seem the way to go for this
asynchronous task. However, I'm quite new to this, and can't figure out how to do this idiomatic. None of my attempts
satisfied my requirements completely.
Here is the sequential version of the core function that I am currently using:
suspend fun readDataFromServers(): Set<String> = coroutineScope {
listOfServers
// step 1: read data from servers while logging errors
.mapNotNull { url ->
runCatching { makeRequestTo(url) }
.onFailure { println("err while accessing $url: $it") }
.getOrNull()
}
// step 2: do some element-wise post-processing
.map { process(it) }
// step 3: aggregate data
.toSet()
}
Background
In my use case, there are numServers I want to read data from. Each of them usually answers within successDuration,
but the connection attempt may fail after timeoutDuration with probability failProb and throw an IOException. As
downtimes are a common thing in my system, I do not need to retry anything, but only log it for the record. Hence,
the makeRequestTo function can be modelled as follows:
suspend fun makeRequestTo(url: String) =
if (random.nextFloat() > failProb) {
delay(successDuration)
"{Some response from $url}"
} else {
delay(timeoutDuration)
throw IOException("Connection to $url timed out")
}
Attempts
All these attempts can be tried out in the Kotlin playground. I don't know how long this link stays alive; maybe I'll need to upload this as a gist, but I liked that people can execute the code directly.
Async
I tried using async {makeRequestTo(it)} after listOfServers and awaiting the results in the following mapNotNull
similar
to this post
. While this collapses the communication time to timeoutDuration, all following processing steps have to wait for that
long before they can continue. Hence, some composition of Deferreds was required here, which is discouraged in
Kotlin (or at least should be avoided in favor of suspending
functions).
suspend fun readDataFromServersAsync(): Set<String> = supervisorScope {
listOfServers
.map { async { makeRequestTo(it) } }
.mapNotNull { kotlin.runCatching { it.await() }.onFailure { println("err: $it") }.getOrNull() }
.map { process(it) }
.toSet()
}
Loops
Using normal loops like below fulfills the functional requirements, but feels a bit more complex than it should be.
Especially the part where shared state must be synchronized makes me to not trust this code and any future modifications
to it.
val results = mutableSetOf<String>()
val mutex = Mutex()
val logger = CoroutineExceptionHandler { _, exception -> println("err: $exception") }
for (server in listOfServers) {
launch(logger) {
val response = makeRequestTo(server)
val processed = process(response)
mutex.withLock {
results.add(processed)
}
}
}
return#supervisorScope results

Related

Design pattern to best implement batch api requests that happen transparently to the calling layer

I have a batch processor that I want to refactor to be expressed a 1-to-1 fashion based on input to increase readability, and for further optimization later on. The issue is that there is a service that should be called in batches to reduce HTTP overhead, so mixing the 1-to-1 code with the batch code is a bit tricky, and we may not want to call the service with every input. Results can be sent out eagerly one-by-one, but order must be maintained, so something like a flow doesn't seem to work.
So, ideally the batch processor would look something like this:
class Processor<A, B> {
val service: Service<A, B>
val scope: CoroutineScope
fun processBatch(input: List<A>) {
input.map {
Pair(it, scope.async { service.call(it) })
}.map {
(a, b) ->
runBlocking { b.await().let { /** handle result, do something with a if result is null, etc **/ } }
}
}
}
The desire is to perform all of the service logic in such a way that it is executing in the background, automatically splitting the inputs for the service into batches, executing them asynchronously, and somehow mapping the result of the batch call into the suspended call.
Here is a hacky implementation:
class Service<A, B> {
val inputContainer: MutableList<A>
val outputs: MutableList<B>
val runCalled = AtomicBoolean(false)
val batchSize: Int
suspended fun call(input: A): B? {
// some prefiltering logic that returns a null early
val index = inputContainer.size
inputContainer.add(a) // add to overall list for later batching
return suspend {
run()
outputs[index]
}
}
fun run() {
val batchOutputs = mutableListOf<Deferred<List<B?>>>()
if (!runCalled.getAndSet(true)) {
inputs.chunked(batchSize).forEach {
batchOutputs.add(scope.async { batchCall(it) })
}
runBlocking {
batchOutputs.map {
val res = result.await()
outputs.addAll(res)
}
}
}
}
suspended fun batchCall(input: List<A>): List<B?> {
// batch API call, etc
}
}
Something like this could work but there are several concerns:
All API calls go out at once. Ideally this would be batching and executing in the background while other inputs are being scheduled, but this is not .
Processing of the service result for the first input cannot resume until all results have been returned. Ideally we could process the result if the service call has returned, while other results continue to be performed in the background.
Containers of intermediate results seem hacky and prone to bugs. Cleanup logic is also needed, which introduces more hacky bits into the rest of the code
I can think of several optimizations to the address 1 and 2, but I imagine concerns related to 3 would be worse. This seems like a fairly common call pattern and I would expect there to be a library or much simpler design pattern to accomplish this, but I haven't been able to find anything. Any guidance is appreciated.

You're on the right track by using Deferred. The solution I would use is:
When the caller makes a request, create a CompletableDeferred
Using a channel, pass this CompletableDeferred to the service for later completion
Have the caller suspend until the service completes the CompletableDeferred
It might look something like this:
val requestChannel = Channel<Pair<Request, CompletableDeferred<Result>>()
suspend fun doRequest(request: Request): Result {
val result = CompletableDeferred<Result>()
requestChannel.send(Pair(request, result))
return result.await()
}
fun run() = scope.launch {
while(isActive) {
val (requests, deferreds) = getBatch(batchSize).unzip()
val results = batchCall(requests)
(results zip deferreds).forEach { (result, deferred) ->
deferred.complete(result)
}
}
}
suspend fun getBatch(batchSize: Int) = buildList {
repeat(batchSize) {
add(requestChannel.receive())
}
}

Explain the difference in scan + posting to participating source in RxJava vs Kotlin Coroutines

I am porting a piece of code from Rx to Coroutines and came across the behavior I can't wrap my head around.
Background: imagine that you have a stream of values each one gets associated with an action-lambda to be executed later. There's also an "external" stream to which lambda can post and it's merged in the resulting stream.
I reduced the original code to this simpler (but still tricky) version:
val shared = PublishSubject.create<String>()
Observable
.merge(
subject.map { it to { } },
Observable.just("item1", "item2")
.map { it to { shared.onNext(it.toUpperCase()) } }
)
.scan("*" to { }) { accumulator, value ->
(accumulator.first + "_" + value.first) to value.second
}
.subscribe {
it.second()
println("got ${it.first}")
}
This prints
got *
got *_item1
got *_item1_ITEM1
got *_item1_ITEM1_item2
got *_item1_ITEM1_item2_ITEM2
Next I have this coroutines + flow based version.
The notable difference is that it has suspend modifier added to the lambda (to be able to call shared.emit().
runBlocking {
val shared = MutableSharedFlow<String>()
merge(
shared.map { it to suspend {} },
flowOf("item1", "item2").map { it to suspend { shared.emit(it.toUpperCase()) } }
)
.scan("*" to suspend { }) { accumulator, value ->
(accumulator.first + "_" + value.first) to value.second
}
.collect {
it.second()
println("got ${it.first}")
}
}
This prints
got *
got *_item1
got *_item1_item2
got *_item1_item1_ITEM1
got *_item1_item2_ITEM1_ITEM2
Notice, that in the Rx version uppercased ITEM emissions were interspersed with lowercase ones, while in the coroutines version they come last.
Questions I'd like to ask:
Why does this happen? Is it due to the suspending lambda? Would be grateful for step-by-step explanation if there is something complex going on
Does Rx have some internal buffer which allows it to behave as it does?
Can similar behavior be achieved with Flow and if so, how?

Combining kotlin flow results

I'm wandering if there is a clean way to launch a series of flows in Kotlin and then, after their resolution, perform further operations based on whether they succeeded or not
For example's sake I need to read all integers from a DB (returning them into a flow), check if they are even or odd against an external API (also returning a flow), and then remove the odd ones from the DB
In code it would be something like this
fun findEven() {
db.readIntegers()
.map { listOfInt ->
listOfInt.asFlow()
.flatMapMerge { singleInt ->
httpClient.apiCallToCheckForOddity(singleInt)
.catch {
// API failure when number is even
}
.map {
// API success when number is odd
db.remove(singleInt).collect()
}
}.collect()
}.collect()
}
But the problem I see with this code is the access to the DB deleting entries done in parallel, and I think a better solution would be to run all API calls and somewhere collect all that failed and all that succeeded, so to be able to do a bulk insertion in the DB only once instead of having multiple coroutines do that on their own

In my opinion, it's kind of an anti-pattern to produce side effects in map, filter, etc. A side effect like removing items from a database should be a separate step (collect in the case of a Flow, and forEach in the case of a List) for clarity.
The nested flow is also kind of convoluted, since you can directly modify the list as a List.
I think you can do it like this, assuming the API can only check one item at a time.
suspend fun findEven() {
db.readIntegers()
.map { listOfInt ->
listOfInt.filter { singleInt ->
runCatching {
httpClient.apiCallToCheckForOddity(singleInt)
}.isSuccess
}
}
.collect { listOfOddInt ->
db.removeAll(listOfOddInt)
}
}
Parallel version, if the API call returns the parameter. (By the way, Kotlin APIs should not throw exceptions on non-programmer errors).
suspend fun findEven() {
db.readIntegers()
.map { listOfInt ->
coroutineScope {
listOfInt.map { singleInt ->
async {
runCatching {
httpClient.apiCallToCheckForOddity(singleInt)
}
}
}.awaitAll()
.mapNotNull(Result<Int>::getOrNull)
}
}
.collect { listOfOddInt ->
db.removeAll(listOfOddInt)
}
}

JobCancellationException StandaloneCoroutine was cancelled

Since we are using Coroutines (1.3.5 used) we have a lot of crash : JobCancellationException - StandaloneCoroutine was cancelled.
I read a lot of thread about theses problems and I tried a lot of solution in production but crashes always occurs.
In all our viewmodels we are using the viewmodelscope so it's ok.
But in our data layer we need to launch a tracking events which are fire and forget task. In first step we used a GlobalScope.launch. I was thinking the CancelletationException was due to this global scope so I removed it and create an extension in the data layer with using a SupervisorJob and a CoroutineExceptionHandler:
private val appScope = CoroutineScope(Dispatchers.Default + SupervisorJob())
private val coroutineExceptionHandler by lazy { CoroutineExceptionHandler { _, throwable -> logw("Error occurred inside Coroutine.", throwable) } }
fun launchOnApp(block: suspend CoroutineScope.() -> Unit) {
appScope.launch(coroutineExceptionHandler) { block() }
}
But I always saw crashes with this code. Do I need to use cancelAndJoin method? Which strategy I can use with a clean archi and this kind of work please?
Thanks in advance

You can build an extension utility that catches the cancellation exception, and do what you want with it:
fun CoroutineScope.safeLaunch(block: suspend CoroutineScope.() -> Unit): Job {
return this.launch {
try {
block()
} catch (ce: CancellationException) {
// You can ignore or log this exception
} catch (e: Exception) {
// Here it's better to at least log the exception
Log.e("TAG","Coroutine error", e)
}
}
}
And you can use the extension with a coroutine scope of your choice, for example the global scope:
GlobalScope.safeLaunch{
// here goes my suspend functions and stuff
}
or any viewmodel scope:
myViewModel.viewModelScope.safeLaunch{
// here goes my suspend functions and stuff
}

I recommend not to use GlobalScope for the following reasons:
This is the description in CoroutineScope.kt :
This is a delicate API. It is easy to accidentally create resource or memory leaks when GlobalScope is used. A coroutine launched in GlobalScope is not subject to the principle of structured concurrency, so if it hangs or gets delayed due to a problem (e.g. due to a slow network), it will stay working and consuming resources.
There are limited circumstances under which GlobalScope can be legitimately and safely used, such as top-level background processes that must stay active for the whole duration of the application's lifetime. Because of that, any use of GlobalScope requires an explicit opt-in with #OptIn(DelicateCoroutinesApi::class)
// A global coroutine to log statistics every second, must be always active
#OptIn(DelicateCoroutinesApi::class)
val globalScopeReporter = GlobalScope.launch {
while (true) {
delay(1000)
logStatistics()
}
}
If you don't mind the job being canceled you can just ignore it.
To manage tasks that have been canceled or should not be undone, you need to know where your code is coming from and improve it.
var job: Job? = null
fun requestJob(from:String) {
Log.send("test : from = $from")
if (job != null) {
job?.cancel()
Log.d("test", "test : canceled")
}
job = GlobalScope.launch {
(0..10).forEach { i ->
delay(1000)
Log.d("test", "test : job $i")
}
}.apply {
invokeOnCompletion {
Log.d("test", "test : from = $from, reason = ${it?.message ?: "completed"}")
job = null
}
}
}

Parallelly consuming a long sequence in Kotlin

I have a function generating a very long sequence of work items. Generating these items is fast, but there are too many in total to store a list of them in memory. Processing the items produces no results, just side effects.
I would like to process these items across multiple threads. One solution is to have a thread read from the generator and write to a concurrent bounded queue, and a number of executor threads polling for work from the bounded queue, but this is a lot of things to set up.
Is there anything in the standard library that would help me do that?
I had initially tried
items.map { async(executor) process(it) }.forEach { it.await() }
But, as pointed out in how to implement parallel mapping for sequences in kotlin, this doesn't work for reasons that are obvious in retrospect.
Is there a quick way to do this (possibly with an external library), or is manually setting up a bounded queue in the middle my best option?

You can look at coroutines combined with channels.
If all work items can be emmited on demand with producer channel. Then it's possible to await for each items and process it with a pool of threads.
An example :
sealed class Stream {
object End: Stream()
class Item(val data: Long): Stream()
}
val produceCtx = newSingleThreadContext("producer")
// A dummy producer that send one million Longs on its own thread
val producer = CoroutineScope(produceCtx).produce {
for (i in (0 until 1000000L)) send(Stream.Item(i))
send(Stream.End)
}
val workCtx = newFixedThreadPoolContext(4, "work")
val workers = Channel<Unit>(4)
repeat(4) { workers.offer(Unit) }
for(_nothing in workers) { // launch 4 times then wait for a task to finish
launch(workCtx) {
when (val item = producer.receive()) {
Stream.End -> workers.close()
is Stream.Item -> {
workFunction(item.data) // Actual work here
workers.offer(Unit) // Notify to launch a new task
}
}
}
}

Your magic word would be .asSequence():
items
.asSequence() // Creates lazy executable sequence
.forEach { launch { executor.process(it) } } // If you don't need the value aftrwards, use 'launch', a.k.a. "fire and forget"
but there are too many in total to store a list of them in memory
Then don't map to list and don't collect the values, no matter if you work with Kotlin or Java.

As long as you are on the JVM, you can write yourself an extension function, that works the sequence in chunks and spawns futures for all entries in a chunk. Something like this:
#Suppress("UNCHECKED_CAST")
fun <T, R> Sequence<T>.mapParallel(action: (value: T) -> R?): Sequence<R?> {
val numThreads = Runtime.getRuntime().availableProcessors() - 1
return this
.chunked(numThreads)
.map { chunk ->
val threadPool = Executors.newFixedThreadPool(numThreads)
try {
return#map chunk
.map {
// CAUTION -> needs to be written like this
// otherwise the submit(Runnable) overload is called
// which always returns an empty Future!!!
val callable: () -> R? = { action(it) }
threadPool.submit(callable)
}
} finally {
threadPool.shutdown()
}
}
.flatten()
.map { future -> future.get() }
}
You can then just use it like:
items
.mapParallel { /* process an item */ }
.forEach { /* handle the result */ }
As long as workload per item is similar, this gives a good parallel processing.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Processing and aggregating data from multiple servers efficiently - kotlin

Related

Design pattern to best implement batch api requests that happen transparently to the calling layer

Explain the difference in scan + posting to participating source in RxJava vs Kotlin Coroutines

Combining kotlin flow results

JobCancellationException StandaloneCoroutine was cancelled

Parallelly consuming a long sequence in Kotlin

Categories

Resources