Parallelly consuming a long sequence in Kotlin

I have a function generating a very long sequence of work items. Generating these items is fast, but there are too many in total to store a list of them in memory. Processing the items produces no results, just side effects.
I would like to process these items across multiple threads. One solution is to have a thread read from the generator and write to a concurrent bounded queue, and a number of executor threads polling for work from the bounded queue, but this is a lot of things to set up.
Is there anything in the standard library that would help me do that?
I had initially tried
items.map { async(executor) { process(it) } }.forEach { it.await() }
But, as pointed out in how to implement parallel mapping for sequences in kotlin, this doesn't work for reasons that are obvious in retrospect.
Is there a quick way to do this (possibly with an external library), or is manually setting up a bounded queue in the middle my best option?

You can look at coroutines combined with channels.
If all work items can be emitted on demand through a producer channel, then you can receive each item as it arrives and process it on a pool of threads.
An example:
sealed class Stream {
    object End : Stream()
    class Item(val data: Long) : Stream()
}
val produceCtx = newSingleThreadContext("producer")
// A dummy producer that sends one million Longs on its own thread
val producer = CoroutineScope(produceCtx).produce<Stream> {
    for (i in 0 until 1000000L) send(Stream.Item(i))
    send(Stream.End)
}
val workCtx = newFixedThreadPoolContext(4, "work")
val workers = Channel<Unit>(4)
repeat(4) { workers.trySend(Unit) }
for (_nothing in workers) { // launch 4 times, then wait for a task to finish
    launch(workCtx) {
        when (val item = producer.receive()) {
            Stream.End -> workers.close()
            is Stream.Item -> {
                workFunction(item.data) // Actual work here
                workers.trySend(Unit) // Notify to launch a new task
            }
        }
    }
}
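A more compact variant of the same idea, as a sketch rather than a drop-in replacement: a bounded Channel sits between one producer coroutine and a fixed number of worker coroutines, so the generator suspends whenever the workers fall behind. The names consumeInParallel, workers and capacity are made up for this illustration, and it assumes kotlinx.coroutines is on the classpath.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical helper: fans the items of a lazy sequence out to `workers` coroutines.
fun <T> consumeInParallel(
    items: Sequence<T>,
    workers: Int = 4,
    capacity: Int = 16,
    work: (T) -> Unit,
) = runBlocking {
    // Bounded buffer: send() suspends once `capacity` items are waiting.
    val channel = Channel<T>(capacity)

    // Producer: walks the sequence lazily, then closes the channel.
    launch {
        for (item in items) channel.send(item)
        channel.close()
    }

    // Workers: each loop ends when the channel is closed and drained.
    repeat(workers) {
        launch(Dispatchers.Default) {
            for (item in channel) work(item)
        }
    }
    // runBlocking returns only after the producer and all workers have finished.
}
```

Closing the channel replaces the explicit End marker: iterating a closed channel simply stops. If work throws, the surrounding scope is cancelled and the exception is rethrown from consumeInParallel.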

Your magic word would be .asSequence():
items
    .asSequence() // Creates a lazily evaluated sequence
    .forEach { launch { executor.process(it) } } // If you don't need the value afterwards, use 'launch', a.k.a. "fire and forget"
"but there are too many in total to store a list of them in memory"
Then don't map to a list and don't collect the values, regardless of whether you work with Kotlin or Java.

As long as you are on the JVM, you can write yourself an extension function that works through the sequence in chunks and spawns futures for all entries in a chunk. Something like this:
import java.util.concurrent.Executors
@Suppress("UNCHECKED_CAST")
fun <T, R> Sequence<T>.mapParallel(action: (value: T) -> R?): Sequence<R?> {
    // Leave one core free, but never drop below one thread (chunked(0) would throw)
    val numThreads = (Runtime.getRuntime().availableProcessors() - 1).coerceAtLeast(1)
    return this
        .chunked(numThreads)
        .map { chunk ->
            val threadPool = Executors.newFixedThreadPool(numThreads)
            try {
                return@map chunk
                    .map {
                        // CAUTION -> needs to be written like this
                        // otherwise the submit(Runnable) overload is called
                        // which always returns an empty Future!!!
                        val callable: () -> R? = { action(it) }
                        threadPool.submit(callable)
                    }
            } finally {
                threadPool.shutdown()
            }
        }
        .flatten()
        .map { future -> future.get() }
}
You can then just use it like:
items
.mapParallel { /* process an item */ }
.forEach { /* handle the result */ }
As long as the workload per item is similar, this parallelizes well; each chunk has to finish before the next one starts, so one slow item holds up its whole chunk.
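If item workloads vary a lot, one alternative (a sketch, not part of the answer above) is a single reused pool plus a Semaphore that caps the number of tasks in flight, so the generator blocks instead of filling memory. forEachParallel, threads and maxInFlight are hypothetical names; only java.util.concurrent is used.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.Semaphore
import java.util.concurrent.TimeUnit

// Hypothetical helper: runs `action` for every item on a fixed pool,
// never holding more than `maxInFlight` submitted-but-unfinished tasks.
fun <T> Sequence<T>.forEachParallel(
    threads: Int = 4,
    maxInFlight: Int = threads * 2,
    action: (T) -> Unit,
) {
    val pool = Executors.newFixedThreadPool(threads)
    val permits = Semaphore(maxInFlight)
    try {
        for (item in this) {
            permits.acquire() // blocks the generator while too many tasks are pending
            pool.execute {
                try {
                    action(item)
                } finally {
                    permits.release() // free the slot even if action throws
                }
            }
        }
    } finally {
        pool.shutdown()
        // Wait for the tasks that are still running.
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS)
    }
}
```

Unlike the chunked version, a slow item only occupies one thread while the others keep pulling new work. An exception thrown by action is not propagated to the caller here; it ends up in the worker thread's uncaught-exception handler.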

Related

Kotlin - How To Collect X Values From a Flow?

Let's say I have a flow that is constantly sending updates, like the following:
val locationFlow = MutableStateFlow<Location?>(null)
I have a use-case where after a particular event occurs, I want to collect X values from the flow and continue, so something like what I have below. I know that collect is a terminal operator, so I don't think the logic I have below works, but how could I do this in this case? I'd like to collect X items, save them, and then send them to another function for processing/handling.
fun onEventOccurred() {
launch {
val locations = mutableListOf<Location?>()
locationFlow.collect {
//collect only X locations
locations.add(it)
}
saveLocations(locations)
}
}
Is there a pre-existing Kotlin function for something like this? I'd like to collect from the flow X times, save the items to a list, and pass that list to another function.
It doesn't matter that collect is terminal. The upstream StateFlow will keep behaving normally because StateFlows don't care what their collectors are doing. You can use the take function to get a specific number of items, and you can use toList() (another terminal function) to concisely copy them into a list once they're all ready.
fun onEventOccurred() {
    launch {
        saveLocations(locationFlow.take(5).toList())
    }
}
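One detail worth knowing: a StateFlow first emits its current value, so with a null initial value the list from take(5) can start with null. A minimal sketch that skips nulls before counting; collectNonNull and demoCollect are hypothetical names, and a plain flowOf stands in for the real location flow.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.filterNotNull
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.take
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

// Hypothetical helper: suspends until `count` non-null values have arrived.
suspend fun <T : Any> collectNonNull(source: Flow<T?>, count: Int): List<T> =
    source
        .filterNotNull() // drop the null placeholder values
        .take(count)     // cancels the upstream collection after `count` items
        .toList()

// Example: two non-null values from a flow that also carries nulls.
fun demoCollect(): List<String> = runBlocking {
    collectNonNull(flowOf(null, "a", null, "b", "c"), count = 2)
}
```

With a real StateFlow the call simply suspends until enough distinct non-null values have been set.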
If I understood correctly your use case, you want to:
discard elements until a specific one is sent – actually, after re-reading your question I don't think this is the case.. I'm leaving it in the example just FYI
when that happens, you want to collect X items for further processing
Assuming that's correct, you can use a combination of dropWhile and take, like so:
fun main() = runBlocking {
val messages = flow {
repeat(10) {
println(it)
delay(500)
emit(it)
}
}
messages
.dropWhile { it < 5 }
.take(3)
.collect { println(it) } // prints 5, 6, 7
}
You can even have more complex logic, i.e. discard any number that's less than 5, and then take the first 10 even numbers:
fun main() = runBlocking {
val messages = flow {
repeat(100) {
delay(500)
emit(it)
}
}
messages
.dropWhile { it < 5 }
.filter { it % 2 == 0}
.take(10)
.collect { println(it) } // prints even numbers, 6 to 24
}

Processing and aggregating data from multiple servers efficiently

Summary
My goal is to process and aggregate data from multiple servers efficiently while handling possible errors. For that, I have a sequential version that I want to speed up. As I am using Kotlin, coroutines seem the way to go for this asynchronous task. However, I'm quite new to this and can't figure out how to do this idiomatically. None of my attempts satisfied my requirements completely.
Here is the sequential version of the core function that I am currently using:
suspend fun readDataFromServers(): Set<String> = coroutineScope {
listOfServers
// step 1: read data from servers while logging errors
.mapNotNull { url ->
runCatching { makeRequestTo(url) }
.onFailure { println("err while accessing $url: $it") }
.getOrNull()
}
// step 2: do some element-wise post-processing
.map { process(it) }
// step 3: aggregate data
.toSet()
}
Background
In my use case, there are numServers I want to read data from. Each of them usually answers within successDuration,
but the connection attempt may fail after timeoutDuration with probability failProb and throw an IOException. As
downtimes are a common thing in my system, I do not need to retry anything, but only log it for the record. Hence,
the makeRequestTo function can be modelled as follows:
suspend fun makeRequestTo(url: String) =
if (random.nextFloat() > failProb) {
delay(successDuration)
"{Some response from $url}"
} else {
delay(timeoutDuration)
throw IOException("Connection to $url timed out")
}
Attempts
All these attempts can be tried out in the Kotlin playground. I don't know how long this link stays alive; maybe I'll need to upload this as a gist, but I liked that people can execute the code directly.
Async
I tried using async { makeRequestTo(it) } after listOfServers and awaiting the results in the following mapNotNull, similar to this post. While this collapses the communication time to timeoutDuration, all following processing steps have to wait that long before they can continue. Hence, some composition of Deferreds was required here, which is discouraged in Kotlin (or at least should be avoided in favor of suspending functions).
suspend fun readDataFromServersAsync(): Set<String> = supervisorScope {
    listOfServers
        .map { async { makeRequestTo(it) } }
        .mapNotNull { kotlin.runCatching { it.await() }.onFailure { println("err: $it") }.getOrNull() }
        .map { process(it) }
        .toSet()
}
Loops
Using normal loops like below fulfills the functional requirements, but feels a bit more complex than it should be. Especially the part where shared state must be synchronized makes me distrust this code and any future modifications to it.
val results = mutableSetOf<String>()
val mutex = Mutex()
val logger = CoroutineExceptionHandler { _, exception -> println("err: $exception") }
for (server in listOfServers) {
    launch(logger) {
        val response = makeRequestTo(server)
        val processed = process(response)
        mutex.withLock {
            results.add(processed)
        }
    }
}
return@supervisorScope results
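One possible shape that avoids both the shared state and the late processing, as a sketch under the assumptions of the question: do the request, the error logging and the post-processing inside each async block, so every response is processed as soon as it arrives. readAll and its request and process parameters are hypothetical stand-ins for makeRequestTo and process.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// Hypothetical helper: one coroutine per server; failures are logged and dropped.
suspend fun readAll(
    servers: List<String>,
    request: suspend (String) -> String,
    process: (String) -> String,
): Set<String> = coroutineScope {
    servers
        .map { url ->
            async {
                // Steps 1 and 2 run per server, so no result waits for a slow neighbour.
                // Note: runCatching also swallows CancellationException.
                runCatching { process(request(url)) }
                    .onFailure { println("err while accessing $url: $it") }
                    .getOrNull()
            }
        }
        .awaitAll()      // total time is roughly that of the slowest request
        .filterNotNull() // drop the failed servers
        .toSet()         // step 3: aggregate
}

// Example with a fake request function in which server "b" is down.
fun demoReadAll(): Set<String> = runBlocking {
    readAll(
        servers = listOf("a", "b", "c"),
        request = { url -> if (url == "b") error("timed out") else url.uppercase() },
        process = { "[$it]" },
    )
}
```

Because each async block returns a value instead of writing to a shared set, no Mutex is needed.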

Design pattern to best implement batch api requests that happen transparently to the calling layer

I have a batch processor that I want to refactor so that it is expressed in a 1-to-1 fashion based on input, to increase readability and to allow further optimization later on. The issue is that there is a service that should be called in batches to reduce HTTP overhead, so mixing the 1-to-1 code with the batch code is a bit tricky, and we may not want to call the service with every input. Results can be sent out eagerly one-by-one, but order must be maintained, so something like a flow doesn't seem to work.
So, ideally the batch processor would look something like this:
class Processor<A, B> {
val service: Service<A, B>
val scope: CoroutineScope
fun processBatch(input: List<A>) {
input.map {
Pair(it, scope.async { service.call(it) })
}.map {
(a, b) ->
runBlocking { b.await().let { /** handle result, do something with a if result is null, etc **/ } }
}
}
}
The desire is to perform all of the service logic in such a way that it is executing in the background, automatically splitting the inputs for the service into batches, executing them asynchronously, and somehow mapping the result of the batch call into the suspended call.
Here is a hacky implementation:
class Service<A, B> {
    val inputContainer: MutableList<A>
    val outputs: MutableList<B?>
    val runCalled = AtomicBoolean(false)
    val batchSize: Int
    suspend fun call(input: A): B? {
        // some prefiltering logic that returns a null early
        val index = inputContainer.size
        inputContainer.add(input) // add to overall list for later batching
        run()
        return outputs[index]
    }
    fun run() {
        val batchOutputs = mutableListOf<Deferred<List<B?>>>()
        if (!runCalled.getAndSet(true)) {
            inputContainer.chunked(batchSize).forEach {
                batchOutputs.add(scope.async { batchCall(it) })
            }
            runBlocking {
                batchOutputs.map {
                    val res = it.await()
                    outputs.addAll(res)
                }
            }
        }
    }
    suspend fun batchCall(input: List<A>): List<B?> {
        // batch API call, etc
    }
}
Something like this could work but there are several concerns:
1. All API calls go out at once. Ideally this would be batching and executing in the background while other inputs are being scheduled, but it is not.
2. Processing of the service result for the first input cannot resume until all results have been returned. Ideally we could process the result as soon as the service call has returned, while other calls continue in the background.
3. Containers of intermediate results seem hacky and prone to bugs. Cleanup logic is also needed, which introduces more hacky bits into the rest of the code.
I can think of several optimizations to address 1 and 2, but I imagine concerns related to 3 would be worse. This seems like a fairly common call pattern and I would expect there to be a library or much simpler design pattern to accomplish this, but I haven't been able to find anything. Any guidance is appreciated.
You're on the right track by using Deferred. The solution I would use is:
When the caller makes a request, create a CompletableDeferred
Using a channel, pass this CompletableDeferred to the service for later completion
Have the caller suspend until the service completes the CompletableDeferred
It might look something like this:
val requestChannel = Channel<Pair<Request, CompletableDeferred<Result>>>()
suspend fun doRequest(request: Request): Result {
    val result = CompletableDeferred<Result>()
    requestChannel.send(Pair(request, result))
    return result.await()
}
fun run() = scope.launch {
    while (isActive) {
        val (requests, deferreds) = getBatch(batchSize).unzip()
        val results = batchCall(requests)
        (results zip deferreds).forEach { (result, deferred) ->
            deferred.complete(result)
        }
    }
}
suspend fun getBatch(batchSize: Int) = buildList {
    repeat(batchSize) {
        add(requestChannel.receive())
    }
}
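Note that getBatch as written waits until batchSize requests have arrived, so a final partial batch would never be sent. A possible variation, sketched with a hypothetical receiveBatch extension: suspend for the first element only, then take whatever is already buffered without waiting. This assumes the channel has a buffer (for example Channel(capacity = 64)) so that callers can queue up between batches.

```kotlin
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.ReceiveChannel

// Hypothetical helper: returns between 1 and maxSize elements.
suspend fun <T : Any> ReceiveChannel<T>.receiveBatch(maxSize: Int): List<T> {
    // Suspend until at least one element is available.
    val batch = mutableListOf(receive())
    while (batch.size < maxSize) {
        // tryReceive never suspends; a failed result means nothing is buffered right now.
        val next = tryReceive().getOrNull() ?: break
        batch += next
    }
    return batch
}
```

Under light load this sends small batches immediately, and under heavy load it fills batches up to maxSize. A time-based flush would be another option, but it is not shown here.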

Combining kotlin flow results

I'm wondering if there is a clean way to launch a series of flows in Kotlin and then, after they complete, perform further operations based on whether they succeeded or not.
For example's sake, I need to read all integers from a DB (which returns them as a flow), check whether they are even or odd against an external API (which also returns a flow), and then remove the odd ones from the DB.
In code it would be something like this:
fun findEven() {
db.readIntegers()
.map { listOfInt ->
listOfInt.asFlow()
.flatMapMerge { singleInt ->
httpClient.apiCallToCheckForOddity(singleInt)
.catch {
// API failure when number is even
}
.map {
// API success when number is odd
db.remove(singleInt).collect()
}
}.collect()
}.collect()
}
But the problem I see with this code is the access to the DB deleting entries done in parallel, and I think a better solution would be to run all API calls and somewhere collect all that failed and all that succeeded, so to be able to do a bulk insertion in the DB only once instead of having multiple coroutines do that on their own
In my opinion, it's kind of an anti-pattern to produce side effects in map, filter, etc. A side effect like removing items from a database should be a separate step (collect in the case of a Flow, and forEach in the case of a List) for clarity.
The nested flow is also kind of convoluted, since you can directly modify the list as a List.
I think you can do it like this, assuming the API can only check one item at a time.
suspend fun findEven() {
    db.readIntegers()
        .map { listOfInt ->
            listOfInt.filter { singleInt ->
                runCatching {
                    httpClient.apiCallToCheckForOddity(singleInt)
                }.isSuccess
            }
        }
        .collect { listOfOddInt ->
            db.removeAll(listOfOddInt)
        }
}
Parallel version, if the API call returns the parameter. (By the way, Kotlin APIs should not throw exceptions on non-programmer errors).
suspend fun findEven() {
    db.readIntegers()
        .map { listOfInt ->
            coroutineScope {
                listOfInt.map { singleInt ->
                    async {
                        runCatching {
                            httpClient.apiCallToCheckForOddity(singleInt)
                        }
                    }
                }.awaitAll()
                    .mapNotNull(Result<Int>::getOrNull)
            }
        }
        .collect { listOfOddInt ->
            db.removeAll(listOfOddInt)
        }
}
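If the API call does not return its parameter, or if you also want the failures, a small generic helper can run the checks concurrently and split the inputs by outcome. This is only a sketch: partitionByProbe and probe are hypothetical names, with probe standing in for the API call that throws for even numbers.

```kotlin
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// Hypothetical helper: returns (items whose probe succeeded, items whose probe threw).
suspend fun <T> partitionByProbe(
    items: List<T>,
    probe: suspend (T) -> Unit,
): Pair<List<T>, List<T>> = coroutineScope {
    val outcomes = items
        .map { item -> async { item to runCatching { probe(item) }.isSuccess } }
        .awaitAll() // keeps the input order
    val (ok, failed) = outcomes.partition { it.second }
    ok.map { it.first } to failed.map { it.first }
}

// Example: the probe fails for even numbers, as the API in the question does.
fun demoPartition(): Pair<List<Int>, List<Int>> = runBlocking {
    partitionByProbe(listOf(1, 2, 3, 4)) { if (it % 2 == 0) error("even") }
}
```

The first list can then go into a single bulk removal, and the second is available for logging or a retry.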

Observe many times from same Observable (RxAndroidBle)

I'm using the RxAndroidBle library with RxJava2 to read from a BLE Characteristic. I think this question is just an RxJava question, but including the detail that I'm using RxAndroidBle in case that is useful.
I get a connection, and then use it to call readCharacteristic(), which itself returns a Single<ByteArray>. At this point, I don't want to get just the one ByteArray, though. I need to read from this characteristic several times, because the BLE device is set up to let me get a small file back, and characteristics can only send 20 bytes back at a time, hence my need to read repeatedly.
Is it possible to modify this code so that the switchMap() below returns an Observable that will emit many ByteArrays, instead of just the single one?
I'm new to RxJava.
val connection: Observable<RxBleConnection> = selectedDevice.record.bleDevice.establishConnection(false, Timeout(30, TimeUnit.SECONDS))
return connection
.subscribeOn(Schedulers.io())
.switchMap {
// I want to get an Observable that can read multiple times here.
it.readCharacteristic(serverCertCharacteristicUUID).toObservable()
}
.doOnNext {
Timber.e("Got Certificate bytes")
}
.map {
String(it as ByteArray)
}
.doOnNext {
Timber.e("Got certificate: $it")
}
.singleOrError()
To repeat a read multiple times until a specific value is emitted one needs to change this part:
// I want to get an Observable that can read multiple times here.
it.readCharacteristic(serverCertCharacteristicUUID).toObservable()
to something like what the RxJava author suggested in the first answer that Google gives for the phrase "rxjava single repeat":
// this will repeat until a `checkRepeatIf` returns false
Observable.defer {
val successValue = AtomicReference<ByteArray>()
connection.readCharacteristic(serverCertCharacteristicUUID)
.doOnSuccess { successValue.lazySet(it) }
.repeatWhen { completes -> completes.takeWhile { checkRepeatIf(successValue.get()) } }
}
I was able to get this working by sending a signal to stop both the connectionObservable, and the read on the Bluetooth characteristic. Of note is that you need to call toObservable() AFTER repeat() or this doesn't work, although I don't know why exactly.
override fun readMultipartCharacteristic(macAddress: String): Single<String> {
val CERTIFICATE_TERMINATOR = 0x30.toByte()
val device = bluetoothService.getBleDevice(macAddress)
if (connectionObservable == null || !device.connectionState.equals(RxBleConnection.RxBleConnectionState.CONNECTED)) {
connectionObservable = device.establishConnection(false, Timeout(30, TimeUnit.SECONDS))
}
val stop: PublishSubject<Unit> = PublishSubject.create()
return connectionObservable!!
.subscribeOn(Schedulers.io())
.takeUntil(stop)
.switchMap {
it.readCharacteristic(UUID("my-uuid"))
.repeat()
.toObservable()
.takeUntil(stop)
}
.collectInto(ByteArrayOutputStream(), { buffer, byteArray ->
// Watch for the signal of the end of the stream
if (byteArray.size == 1 && byteArray.get(0).equals(CERTIFICATE_TERMINATOR)) {
stop.onComplete()
} else {
buffer.write(byteArray)
}
})
.map {
String(it.toByteArray())
}
}
You can use the notification to buffer your data.
device.establishConnection(false)
.flatMap(rxBleConnection -> rxBleConnection.setupNotification(characteristicUuid))
.flatMap(notificationObservable -> notificationObservable) // <-- Notification has been set up, now observe value changes.
.subscribe(
bytes -> {
// Given characteristic has changed, here is the value.
},
throwable -> {
// Handle an error here.
}
);