How to propagate closing to a chain of flows in Kotlin

I am using Kotlin and I wanted to stream over a possibly huge result set using flows. I found some explanations around the web:
Callbacks and Kotlin Flows
Use Flow for asynchronous data streams
I implemented it and it works fine. I also needed to batch the results before sending them to an external service, so I implemented a chunked operation on flows. Something like this:
fun <T> Flow<T>.chunked(chunkSize: Int): Flow<List<T>> {
    return callbackFlow {
        val listOfResult = mutableListOf<T>()
        this@chunked.collect {
            listOfResult.add(it)
            if (listOfResult.size == chunkSize) {
                trySendBlocking(listOfResult.toList())
                listOfResult.clear()
            }
        }
        if (listOfResult.isNotEmpty()) {
            trySendBlocking(listOfResult)
        }
        close()
    }
}
To be sure that everything was working fine, I created some integration tests:
first flow + chunked to consume all rows: passed
using the first flow (the one created from the JDBC repository) and applying the take operator just to consume the first x items: passed
using the first flow + chunked operator + take operator: it hangs forever
So the last test showed that there was something wrong in the implementation.
I investigated a lot without finding anything useful but, dumping the threads, I found a coroutine thread blocked in the trySendBlocking call on the first flow, the one created in the JDBC repository.
I am wondering how the chunked operator is supposed to propagate the closing to the upstream flow, since it seems this part is missing.
In both cases I am propagating the end of data downstream with a close() call, but I took a look at the take operator and saw that it triggers the closing upstream with an emitAbort(...).
Should I do something similar in the callbackFlow{...}?
After a bit of investigation, I was able to avoid the deadlock by adding a timeout on the trySendBlocking inside the repository, but I didn't like that. In the end, I realized that I could cast the original flow (in the chunked operator) to a SendChannel and close it if the downstream flow is closed:
trySendBlocking(listOfResult.toList()).onSuccess {
    LOGGER.debug("Sent")
}.onFailure {
    LOGGER.warn("An error occurred sending data.", it)
}.onClosed {
    LOGGER.info("Channel has been closed")
    (originalFlow as SendChannel<*>).close(it)
}
Is this the correct way of closing flows backwards? Any hint to solve this issue?
Thanks!

You shouldn't use trySendBlocking; use send instead. You should never use a blocking function in a coroutine without wrapping it in withContext with a dispatcher that can handle blocking code (e.g. Dispatchers.IO). But when there's a suspend function alternative, use that instead, in this case send().
Also, callbackFlow is more convoluted than necessary for transforming a flow. You should use the standard flow builder instead (and so you'll use emit() instead of send()).
fun <T> Flow<T>.chunked(chunkSize: Int): Flow<List<T>> = flow {
    val listOfResult = mutableListOf<T>()
    // collect from the upstream flow (the extension receiver)
    collect {
        listOfResult.add(it)
        if (listOfResult.size == chunkSize) {
            emit(listOfResult.toList())
            listOfResult.clear()
        }
    }
    // emit the final partial chunk, if any
    if (listOfResult.isNotEmpty()) {
        emit(listOfResult)
    }
}
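For illustration, a minimal sketch of the previously hanging scenario (the main function and the number range here are made up for the example, assuming kotlinx-coroutines): with the flow-based chunked, the cancellation triggered by take propagates to the upstream collection automatically, so the collection completes instead of hanging.
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    (1..100).asFlow()
        .chunked(10)
        .take(2)                 // cancels the upstream collection after two chunks
        .collect { println(it) } // prints the first two chunks, then completes
}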

Related

How can you parallelize the processing of File InputStreams in Kotlin / Arrow?

I am processing large Files, having a list of them:
val originalFiles: List<File>
I need to read the InputStream of each file, process it, and write it to another processedFile. For the sake of simplicity let's assume I just read the original InputStream and write it to the destination file output stream.
I would like to process the originalFiles in parallel.
The first straightforward way would be to use parallelStream():
override suspend fun processFiles(workdir: File): Either<Throwable, Unit> = either {
    val originalFiles = ...
    originalFiles.parallelStream().forEach {
        val destination = File("${workdir.absolutePath}/${it.name}-processed.txt")
        logger.info("Starting merge {}", destination)
        FileOutputStream(destination).use { faos ->
            IOUtils.copy(it.inputStream(), faos)
        }
        logger.info("Finished processing {}", destination)
    }
}
However, given that I'm working with coroutines and Arrow, I get a compile warning Possibly blocking call in non-blocking context could lead to thread starvation.
Is there a proper (non-blocking) way to work with Input/OutputStreams with coroutines/suspend functions?
Is there a better way to parallelize the List processing with coroutines/Arrow?
Your best bet would be to use parTraverse (going to be renamed to parMap in 2.x.x). This function comes from Arrow Fx Coroutines; there are also Flow#parMap and Flow#parMapUnordered that you can use instead.
You also need to make sure that FileOutputStream is closed correctly, also in the face of cancellation, and I would recommend using Resource for that.
The Possibly blocking call in non-blocking context could lead to thread starvation warning will disappear by invoking it on Dispatchers.IO.
suspend fun processFiles(workdir: File) {
    val originalFiles: List<File> = emptyList<File>()
    originalFiles.parTraverse(Dispatchers.IO) {
        val destination = File("${workdir.absolutePath}/${it.name}-processed.txt")
        logger.info("Starting merge {}", destination)
        FileOutputStream(destination).use { faos ->
            IOUtils.copy(it.inputStream(), faos)
        }
        logger.info("Finished processing {}", destination)
    }
}
So to summarize the answers to your question:
Is there a proper (non-blocking) way to work with Input/OutputStreams with coroutines/suspend functions?
Execute them using suspend + Dispatchers.IO.
Is there a better way to parallelize the List processing with coroutines/Arrow?
Leverage parTraverse to parallelize the List transformations with Kotlin coroutines. Optionally, use parTraverseN if you also want to limit the number of parallel operations.
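As a sketch of that last point (assuming Arrow Fx Coroutines 1.x, where parTraverseN takes a coroutine context and a concurrency limit; the exact signature may vary between versions, and processFilesLimited is a made-up name), the file processing above could be capped at four concurrent copies:
import arrow.fx.coroutines.parTraverseN
import kotlinx.coroutines.Dispatchers
import java.io.File

// At most 4 files are processed concurrently, each on Dispatchers.IO.
// The stdlib copyTo replaces IOUtils.copy to keep the sketch self-contained.
suspend fun processFilesLimited(workdir: File, originalFiles: List<File>) {
    originalFiles.parTraverseN(Dispatchers.IO, 4) { file ->
        val destination = File("${workdir.absolutePath}/${file.name}-processed.txt")
        file.inputStream().use { input ->
            destination.outputStream().use { output -> input.copyTo(output) }
        }
    }
}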

Emitting flow values asynchronously with Kotlin's Flow

I am building a simple Spring service with Kotlin and WebFlux.
I have an endpoint which returns a flow. The flow contains elements which take a long time to compute, which is simulated by a delay.
It is constructed like this:
suspend fun latest(): Flow<Message> {
    println("generating messages")
    return flow {
        for (i in 0..20) {
            println("generating $i")
            if (i % 2 == 0) delay(1000)
            else delay(200)
            println("generated message $i")
            emit(generateMessage(i))
        }
        println("messages generated")
    }
}
My expectation was that it would return Message1 followed by Message3, Message5... and then Message0 because of the different delays the individual generation takes.
But in reality the flow contains the elements in order.
I guess I am missing something important about coroutines and Flow. I tried different things to achieve what I want with coroutines, but I can't figure out how.
Solution
As pointed out by Marko Topolnik and William Reed, using channelFlow works.
fun latest(): Flow<Message> {
    println("generating numbers")
    return channelFlow {
        for (i in 0..20) {
            launch {
                send(generateMessage(i))
            }
        }
    }
}
suspend fun generateMessage(i: Int): Message {
    println("generating $i")
    val time = measureTimeMillis {
        if (i % 2 == 0) delay(1000)
        else delay(500)
    }
    println("generated message $i in ${time}ms")
    return Message(UUID.randomUUID(), "This is Message $i")
}
When run, the results are as expected:
generating numbers
generating 2
generating 0
generating 1
generating 6
...
generated message 5 in 501ms
generated message 9 in 501ms
generated message 13 in 501ms
generated message 15 in 505ms
generated message 4 in 1004ms
...
Once you go concurrent with the computation of each element, your first problem will be to figure out when all the computation is done.
You have to know in advance how many items to expect. So it seems natural to me to construct a plain List<Deferred<Message>> and then await on all the deferreds before returning the entire thing. You aren't getting any mileage from the flow in your case, since flow is all about doing things synchronously, inside the flow collection.
You can also use channelFlow in combination with a known count of messages to expect, and then terminate the flow based on that. The advantage would be that Spring can start collecting the flow earlier.
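For reference, a minimal sketch of that List<Deferred<Message>> idea (reusing the generateMessage from the question; latestAsList is a made-up name): start every computation concurrently, then await them all before returning.
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// All 21 messages are computed concurrently; awaitAll() suspends until each is done.
suspend fun latestAsList(): List<Message> = coroutineScope {
    (0..20).map { i -> async { generateMessage(i) } }.awaitAll()
}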
EDIT
Actually, the problem of the count isn't present: the flow will automatically wait for all the child coroutines you launched to complete.
Your current approach uses a single coroutine for the entire function, including the for loop. That means that any call to a suspend fun, e.g. delay, will suspend that entire coroutine until it completes. It does free up the thread to go do other stuff, but the current coroutine is blocked.
It's hard to say what the right solution is based on your simplified example. If you truly did want a new coroutine for each iteration of the for loop, you could launch it there, but it isn't clear from the information given that this is the right solution.

Blocking operation in coroutines

I am working on a WebFlux application using coroutines. Its main purpose is to be a "backend for frontend" for a mobile app. Most requests are handled by fetching and merging data from microservices. I am currently working on adding a database to this service. I thought I understood this concept in coroutines. I tried adding this code:
suspend fun fetchData(): String {
    return withContext(Dispatchers.IO) {
        Thread.sleep(10000) // fetch data from database
        ""
    }
}
I was surprised that this code, used on one of the endpoints, slowed down the response time of every endpoint. Endpoints unrelated to this part of the code were affected. My guess is that I am using the wrong thread pool for this operation. I also tried the Reactor approach but got the same result:
suspend fun fetchData(): String {
    return Mono.fromCallable {
        Thread.sleep(10000) // fetch data from database
        ""
    }.subscribeOn(Schedulers.boundedElastic()).awaitSingle()
}
What am I doing wrong? Why does the main thread seem to get blocked?

How does Kotlin flow created with BroadcastChannel.asFlow() context preservation work?

Here is an example to illustrate my confusion:
fun main() = runBlocking(Dispatchers.Default + CoroutineName("Main")) {
    val broadcaster = BroadcastChannel<Int>(Channel.BUFFERED)
    val flow = withContext(CoroutineName("InitialFlowCreation")) {
        broadcaster.asFlow()
            .map {
                println("first mapping in context: $coroutineContext")
                it * 10
            }
            .broadcastIn(CoroutineScope(Dispatchers.Default + CoroutineName("BroadcastIn")))
            .asFlow()
    }
    val updatedFlow = withContext(CoroutineName("UpdatedFlowCreation")) {
        flow.map {
            println("second mapping in context: $coroutineContext")
            it * 10
        }
            .flowOn(Dispatchers.Default + CoroutineName("FlowOn"))
    }
    launch(CoroutineName("Collector")) {
        updatedFlow.collect {
            println("Collecting $it in context: $coroutineContext")
        }
    }
    delay(1_000)
    launch(CoroutineName("OriginalBroadcast")) {
        for (i in 1..10) {
            broadcaster.send(i)
            println("Sent original broadcast from: $coroutineContext")
            delay(1_000)
        }
    }
    return@runBlocking
}
This produces the following output (truncated):
Sent original broadcast from: [CoroutineName(OriginalBroadcast), StandaloneCoroutine{Active}#3a14b06a, DefaultDispatcher]
first mapping in context: [CoroutineName(InitialFlowCreation), UndispatchedCoroutine{Completed}#40202c08, DefaultDispatcher]
second mapping in context: [CoroutineName(UpdatedFlowCreation), UndispatchedCoroutine{Completed}#6cf04ddc, DefaultDispatcher]
Collecting 100 in context: [CoroutineName(Collector), StandaloneCoroutine{Active}#6ac9d4b5, DefaultDispatcher]
The documentation states things in various places that cause me to be confused by this result.
In Flow we have "Use channelFlow if the collection and emission of a flow are to be separated into multiple coroutines. It encapsulates all the context preservation work and allows you to focus on your domain-specific problem, rather than invariant implementation details. It is possible to use any combination of coroutine builders from within channelFlow." I know I'm not actually using the channelFlow function, but a ChannelFlow is being created internally when we call broadcastIn, so the same principles should apply.
I thought the first invocation of map would run in the "OriginalBroadcast" context and the second in either the "BroadcastIn" context or the "Collector" context, but instead they both run in the context where they are called. I don't understand why this is happening: shouldn't map run in the context where the flow is broadcast, or the context where it is finally collected, not the context where map is called? Also, the call to flowOn has no effect. What context preservation work is being encapsulated here?
Also, am I correct that in a chain of flow.broadcastIn(...).asFlow().map{...}.broadcastIn(...).asFlow() the two BroadcastChannels created will not be fused? Trying to make sure I'm not missing something.
I guess what I'm really looking for is comprehensive documentation of the situations in which Channels are fused, how they are fused, and in what context the operators called between ChannelFlow operators will run.
The context preservation only applies to operations on flows, e.g. the code in the flow { ... } builder works in the same context that calls collect(). The context is not preserved when operating via channels, by the very nature of channels. Channels are communication primitives designed for communication between different coroutines.
It means that when you call broadcaster.send in one coroutine, it will be received in another coroutine, the one that collects from the corresponding flow.
The documentation on channelFlow simply means that you don't have to worry about context preservation violation, which is non-trivial to ensure if you were to write such a primitive yourself.
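A small sketch of the rule described above (assuming kotlinx-coroutines; the main function here is made up for the example): the body of a flow { } builder runs in whichever coroutine ends up calling collect(), so the name printed below is the collector's.
import kotlinx.coroutines.CoroutineName
import kotlinx.coroutines.currentCoroutineContext
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking(CoroutineName("Collector")) {
    val f = flow {
        // Runs in the collector's context, not where the flow was built.
        println("emitting in ${currentCoroutineContext()[CoroutineName]}")
        emit(1)
    }
    f.collect { println("collected $it") } // the emit above prints CoroutineName(Collector)
}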

Transforming a Spring Webflux Mono to an Either, preferably without blocking?

I'm using Kotlin and Arrow with the WebClient from spring-webflux. What I'd like to do is transform a Mono instance into an Either.
The Either instance is created by calling Either.right(..) when the WebClient response is successful, or Either.left(..) when the WebClient returns an error.
What I'm looking for is a method on Mono similar to Either.fold(..), where I can map over both the successful and the erroneous result and return a different type than a Mono. Something like this (pseudo-code which doesn't work):
val either: Either<Throwable, ClientResponse> =
    webClient().post().exchange()
        .fold({ throwable -> Either.left(throwable) },
              { response -> Either.right(response) })
How should one go about?
There is no fold method on Mono, but you can achieve the same using two methods: map and onErrorResume. Note that the result is still a Mono (of an Either), not a bare Either. It would go something like this:
val either: Mono<Either<Throwable, ClientResponse>> =
    webClient().post()
        .exchange()
        .map { Either.right(it) }
        .onErrorResume { Either.left(it).toMono() }
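If blocking must be avoided entirely, one non-blocking sketch (assuming kotlinx-coroutines-reactor for awaitSingle and Arrow's suspending Either.catch; exchangeEither is a made-up name) is to await the Mono inside a suspend function:
import arrow.core.Either
import kotlinx.coroutines.reactor.awaitSingle
import org.springframework.web.reactive.function.client.ClientResponse

// Any exception raised while awaiting the Mono is captured as a Left.
suspend fun exchangeEither(): Either<Throwable, ClientResponse> =
    Either.catch { webClient().post().exchange().awaitSingle() }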
I'm not really familiar with that Arrow library nor the typical use case for it, so I'll use Java snippets to make my point here.
First, I'd like to point out that this type seems to be blocking and not lazy (unlike Mono). Translating a Mono to that type means you'll make your code blocking, and you shouldn't do that, for example, in the middle of a Controller handler, or you will block your whole server.
This is more or less the equivalent of this:
Mono<ClientResponse> response = webClient.get().uri("/").exchange();
// blocking; returns the response or throws an exception
ClientResponse blockingResponse = response.block();
That being said, I think you should be able to convert a Mono to that type by either calling block() on it inside a try/catch block, or by turning it into a CompletableFuture first, like:
Mono<ClientResponse> response = webClient.get().uri("/").exchange();
Either<Throwable, ClientResponse> either = response
        .toFuture()
        .handle((resp, t) -> t != null ? Either.left(t) : Either.right(resp))
        .get();
There might be better ways to do that (especially with inline functions), but they would all involve blocking on the Mono at some point.