How can you parallelize the processing of File InputStreams in Kotlin / Arrow?

I am processing large Files, having a list of them:
val originalFiles: List<File>
I need to read the InputStream of each file, process it, and write it to another processedFile. For the sake of simplicity let's assume I just read the original InputStream and write it to the destination file output stream.
I would like to process the originalFiles in parallel.
The first straightforward way would be to use parallelStream():
override suspend fun processFiles(workdir: File): Either<Throwable, Unit> = either {
    val originalFiles = ...
    originalFiles.parallelStream().forEach {
        val destination = File("${workdir.absolutePath}/${it.name}-processed.txt")
        logger.info("Starting merge {}", destination)
        FileOutputStream(destination).use { faos ->
            IOUtils.copy(it.inputStream(), faos)
        }
        logger.info("Finished processing {}", destination)
    }
}
However, given that I'm working with coroutines and Arrow, I get a compile warning Possibly blocking call in non-blocking context could lead to thread starvation.
Is there a proper (non-blocking) way to work with Input/OutputStreams with coroutines/suspend functions?
Is there a better way to parallelize the List processing with coroutines/Arrow?

Your best bet would be to use parTraverse (going to be renamed to parMap in 2.x.x). This function comes from Arrow Fx Coroutines; there are also Flow#parMap and Flow#parMapUnordered that you can use instead.
You also need to make sure that FileOutputStream is closed correctly, even in the face of cancellation, and I would recommend using Resource for that.
The Possibly blocking call in non-blocking context could lead to thread starvation warning will disappear once you invoke the blocking calls on Dispatchers.IO.
suspend fun processFiles(workdir: File) {
    val originalFiles: List<File> = emptyList<File>()
    originalFiles.parTraverse(Dispatchers.IO) {
        val destination = File("${workdir.absolutePath}/${it.name}-processed.txt")
        logger.info("Starting merge {}", destination)
        it.inputStream().use { fis ->
            FileOutputStream(destination).use { faos ->
                IOUtils.copy(fis, faos)
            }
        }
        logger.info("Finished processing {}", destination)
    }
}
So to summarize the answers to your questions:
Is there a proper (non-blocking) way to work with Input/OutputStreams with coroutines/suspend functions?
Execute them using suspend + Dispatchers.IO.
Is there a better way to parallelize the List processing with coroutines/Arrow?
Leverage parTraverse to parallelise the List transformations in Kotlin Coroutines. Optionally, use parTraverseN if you also want to limit the number of parallel operations. A sketch combining parTraverse with Resource follows.
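Here is a minimal sketch of that combination, assuming Arrow Fx Coroutines 1.x (Resource.fromCloseable plus parTraverse; in 2.x the rough equivalents are the resource DSL and parMap). The function name processFilesSafely is mine, not from the question:

import arrow.fx.coroutines.Resource
import arrow.fx.coroutines.fromCloseable
import arrow.fx.coroutines.parTraverse
import kotlinx.coroutines.Dispatchers
import java.io.File
import java.io.FileOutputStream

// Each destination stream is acquired as a Resource, so it is closed even if
// the surrounding coroutine is cancelled mid-copy.
suspend fun processFilesSafely(workdir: File, originalFiles: List<File>): List<File> =
    originalFiles.parTraverse(Dispatchers.IO) { file ->
        val destination = File(workdir, "${file.name}-processed.txt")
        Resource.fromCloseable { FileOutputStream(destination) }.use { out ->
            file.inputStream().use { input -> input.copyTo(out) }
        }
        destination
    }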

Related

How to propagate closing to a chain of flows in Kotlin

I am using Kotlin and I wanted to stream over a possibly huge result set using flows. I found some explanations around the web:
Callbacks and Kotlin Flows
Use Flow for asynchronous data streams
I implemented it and it works fine. I also needed to batch the results before sending them to an external service, so I implemented a chunked operation on flows. Something like this:
fun <T> Flow<T>.chunked(chunkSize: Int): Flow<List<T>> {
    return callbackFlow {
        val listOfResult = mutableListOf<T>()
        this@chunked.collect {
            listOfResult.add(it)
            if (listOfResult.size == chunkSize) {
                trySendBlocking(listOfResult.toList())
                listOfResult.clear()
            }
        }
        if (listOfResult.isNotEmpty()) {
            trySendBlocking(listOfResult)
        }
        close()
    }
}
To be sure that everything was working fine, I created some integration tests:
first flow + chunked, consuming all rows: passed
first flow (the one created from the JDBC repository) + the take operator, considering just the first x items: passed
first flow + chunked operator + take operator: hangs forever
So the last test showed that there was something wrong in the implementation.
I investigated a lot without finding anything useful but, dumping the threads, I found a coroutine thread blocked in the trySendBlocking call on the first flow, the one created in the JDBC repository.
I am wondering how the chunked operator is supposed to propagate the closing to the upstream flow, since it seems this part is missing.
In both cases I am propagating the end of data downstream with a close() call, but I took a look at the take operator and saw that it triggers the closing backwards with an emitAbort(...).
Should I do something similar in the callbackFlow{...}?
After a bit of investigation, I was able to avoid the blocking by adding a timeout on the trySendBlocking inside the repository, but I didn't like that. In the end, I realized that I could cast the original flow (in the chunked operator) to a SendChannel and close it if the downstream flow is closed:
trySendBlocking(listOfResult.toList()).onSuccess {
    LOGGER.debug("Sent")
}.onFailure {
    LOGGER.warn("An error occurred sending data.", it)
}.onClosed {
    LOGGER.info("Channel has been closed")
    (originalFlow as SendChannel<*>).close(it)
}
Is this the correct way of closing flows backwards? Any hint to solve this issue?
Thanks!
You shouldn't use trySendBlocking instead of send. You should never call a blocking function in a coroutine without wrapping it in withContext with a dispatcher that can handle blocking code (e.g. Dispatchers.IO). But when there's a suspend function alternative, use that instead, in this case send().
Also, callbackFlow is more convoluted than necessary for transforming a flow. You should use the standard flow builder instead (and so you'll use emit() instead of send()).
fun <T> Flow<T>.chunked(chunkSize: Int): Flow<List<T>> = flow {
    val listOfResult = mutableListOf<T>()
    collect {
        listOfResult.add(it)
        if (listOfResult.size == chunkSize) {
            emit(listOfResult.toList())
            listOfResult.clear()
        }
    }
    if (listOfResult.isNotEmpty()) {
        emit(listOfResult)
    }
}
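As a quick sanity check (my own example, not from the answer): with the flow-based chunked, emit() is a cooperative suspension point, so take() cancels the upstream collection instead of hanging:

import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    (1..100).asFlow()
        .chunked(10)             // Flow<List<Int>>
        .take(2)                 // cancels the upstream collect after two chunks
        .collect { println(it) } // prints the first two chunks, then completes
}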

Should I always add withContext(Dispatchers.IO) in a suspend function when I pull data from a remote server?

I'm learning Coroutines of Kotlin.
The following content is from the article https://developer.android.com/kotlin/coroutines.
Important: Using suspend doesn't tell Kotlin to run a function on a background thread. It's normal for suspend functions to operate on the main thread. It's also common to launch coroutines on the main thread. You should always use withContext() inside a suspend function when you need main-safety, such as when reading from or writing to disk, performing network operations, or running CPU-intensive operations.
Pulling data from a remote server normally takes a long time, so I need to run "the pull data function" on a background thread in order not to freeze the main UI.
Should I always add withContext(Dispatchers.IO) in a suspend function when I use suspend to pull data from a remote server?
BTW, Code A is from the project https://github.com/googlecodelabs/kotlin-coroutines.
But I can't find the keyword withContext() anywhere in the project. Why?
Code A
fun refreshTitle() = launchDataLoad {
    repository.refreshTitle()
}

private fun launchDataLoad(block: suspend () -> Unit): Unit {
    viewModelScope.launch {
        try {
            _spinner.value = true
            block()
        } catch (error: TitleRefreshError) {
            _snackBar.value = error.message
        } finally {
            _spinner.value = false
        }
    }
}
Should I always add withContext(Dispatchers.IO) in a suspend function when I use suspend to pull data from a remote server?
It depends. If you use a library like Retrofit 2.6.0 that has native support for suspend, the dispatcher is already Dispatchers.IO (or whatever the library deems more appropriate).
If the call to pull data from a remote server is blocking, you need to make sure to run it on Dispatchers.IO yourself with withContext(Dispatchers.IO) so that it doesn't block the main thread.
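A minimal sketch of that pattern; the URL-based fetch is my own illustration, not from the answer:

import java.net.URL
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.withContext

// URL.readText() is a blocking call; shifting it to Dispatchers.IO keeps the
// suspend function main-safe.
suspend fun fetchTitle(url: String): String = withContext(Dispatchers.IO) {
    URL(url).readText()
}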
I can't find the keyword withContext() in the project. Why?
Because the project uses Retrofit, the switch to Dispatchers.IO happens under the hood:
https://github.com/googlecodelabs/kotlin-coroutines/blob/master/coroutines-codelab/finished_code/src/main/java/com/example/android/kotlincoroutines/main/MainNetwork.kt
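For reference, a suspend-based Retrofit interface looks roughly like the sketch below (treat the details as illustrative). With Retrofit 2.6+ the call itself runs off the main thread, so no withContext is needed at the call site:

import retrofit2.http.GET

interface MainNetwork {
    @GET("next_title.json")
    suspend fun fetchNextTitle(): String
}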

How does Kotlin flow created with BroadcastChannel.asFlow() context preservation work?

Here is an example to illustrate my confusion:
fun main() = runBlocking(Dispatchers.Default + CoroutineName("Main")) {
    val broadcaster = BroadcastChannel<Int>(Channel.BUFFERED)
    val flow = withContext(CoroutineName("InitialFlowCreation")) {
        broadcaster.asFlow()
            .map {
                println("first mapping in context: $coroutineContext")
                it * 10
            }
            .broadcastIn(CoroutineScope(Dispatchers.Default + CoroutineName("BroadcastIn")))
            .asFlow()
    }
    val updatedFlow = withContext(CoroutineName("UpdatedFlowCreation")) {
        flow.map {
            println("second mapping in context: $coroutineContext")
            it * 10
        }
            .flowOn(Dispatchers.Default + CoroutineName("FlowOn"))
    }
    launch(CoroutineName("Collector")) {
        updatedFlow.collect {
            println("Collecting $it in context: $coroutineContext")
        }
    }
    delay(1_000)
    launch(CoroutineName("OriginalBroadcast")) {
        for (i in 1..10) {
            broadcaster.send(i)
            println("Sent original broadcast from: $coroutineContext")
            delay(1_000)
        }
    }
    return@runBlocking
}
This produces the following output (truncated):
Sent original broadcast from: [CoroutineName(OriginalBroadcast), StandaloneCoroutine{Active}#3a14b06a, DefaultDispatcher]
first mapping in context: [CoroutineName(InitialFlowCreation), UndispatchedCoroutine{Completed}#40202c08, DefaultDispatcher]
second mapping in context: [CoroutineName(UpdatedFlowCreation), UndispatchedCoroutine{Completed}#6cf04ddc, DefaultDispatcher]
Collecting 100 in context: [CoroutineName(Collector), StandaloneCoroutine{Active}#6ac9d4b5, DefaultDispatcher]
The documentation states things in various places that cause me to be confused by this result.
In Flow we have "Use channelFlow if the collection and emission of a flow are to be separated into multiple coroutines. It encapsulates all the context preservation work and allows you to focus on your domain-specific problem, rather than invariant implementation details. It is possible to use any combination of coroutine builders from within channelFlow." I know I'm not actually using the channelFlow function, but a ChannelFlow is being created internally when we call broadcastIn, so the same principles should apply.
I thought the first invocation of map would run in the "OriginalBroadcast" context and the second would run in either the "BroadcastIn" context or the "Collector" context, but instead both run in the context where they are called. I don't understand why this is happening. Shouldn't the context of map be the context where the flow is broadcast, or the context where it is finally collected, not the context where map is called? Also, the call to flowOn has no effect. What context preservation work is being encapsulated here?
Also, am I correct that in a chain of flow.broadcastIn(...).asFlow().map{...}.broadcastIn(...).asFlow() the two BroadcastChannels created will not be fused? I'm trying to make sure I'm not missing something.
I guess what I'm really looking for is comprehensive documentation of the situations in which Channels are fused, how they are fused, and what context the operators called between ChannelFlow operators will run in.
The context preservation only applies to operations on flows, e.g. the code in the flow { ... } builder works in the same context that calls collect(). The context is not preserved when operating via channels, by their very nature: channels are communication primitives designed for communication between different coroutines.
It means that when you call broadcaster.send in one coroutine it will be received in another coroutine, in a coroutine that collects from the corresponding flow.
The documentation on channelFlow simply means that you don't have to worry about context preservation violation, which is non-trivial to ensure if you were to write such a primitive yourself.
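A small illustration of that invariant (my own example, not from the answer): the body of a flow { ... } builder runs in the collector's context, and flowOn is the sanctioned way to change the upstream context:

import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

fun numbers(): Flow<Int> = flow {
    // Runs in the collector's context unless flowOn overrides it.
    println("emitting in ${currentCoroutineContext()[CoroutineName]}")
    emit(1)
}

fun main() = runBlocking {
    withContext(CoroutineName("Collector")) {
        numbers().collect { } // prints "emitting in CoroutineName(Collector)"
    }
    numbers()
        .flowOn(CoroutineName("Upstream"))
        .collect { }          // prints "emitting in CoroutineName(Upstream)"
}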

Kotlin - How to read from file asynchronously?

Is there any Kotlin-idiomatic way to read a file's contents asynchronously? I couldn't find anything in the documentation.
At least as of Java 7 (which is where Android is stuck), there isn't any API that would tap into the low-level async file IO support (like io_uring). There is a class called AsynchronousFileChannel, but, as its docs state,
An AsynchronousFileChannel is associated with a thread pool to which tasks are submitted to handle I/O events and dispatch to completion handlers that consume the results of I/O operations on the channel.
That makes it no better than the following, bog-standard Kotlin idiom:
launch {
    val contents = withContext(Dispatchers.IO) {
        FileInputStream("filename.txt").use { it.readBytes() }
    }
    processContents(contents)
}
go_on_with_other_stuff_while_file_is_loading()
This uses Kotlin's own dedicated IO thread pool and unblocks the UI thread. If you're on Android, that is your actual concern, anyway.
Java NIO Asynchronous Channel is the tool you want.
Check out this AsynchronousFileChannel.aRead extension function from the coroutines examples:
suspend fun AsynchronousFileChannel.aRead(buf: ByteBuffer): Int =
    suspendCoroutine { cont ->
        read(buf, 0L, Unit, object : CompletionHandler<Int, Unit> {
            override fun completed(bytesRead: Int, attachment: Unit) {
                cont.resume(bytesRead)
            }

            override fun failed(exception: Throwable, attachment: Unit) {
                cont.resumeWithException(exception)
            }
        })
    }
You just open an AsynchronousFileChannel and then call this aRead() in a coroutine:
val channel = AsynchronousFileChannel.open(Paths.get(fileName))
try {
    val buf = ByteBuffer.allocate(4096)
    val bytesRead = channel.aRead(buf)
} finally {
    channel.close()
}
It's an essential function; I don't know why it is not part of the coroutines core library.
javasync/RxIo uses a Java NIO Asynchronous Channel to provide a non-blocking API for reading and writing file contents asynchronously, including a Kotlin-idiomatic way. Here are two examples: one reading/writing in bulk through coroutines, and another iterating lines through an asynchronous Kotlin Flow:
suspend fun copyNio(from: String, to: String) {
    val data = Path(from).readText() // suspension point
    Path(to).writeText(data)         // suspension point
}

suspend fun printLinesFrom(filename: String) {
    Path(filename)
        .lines()   // Flow<String>
        .onEach(::println)
        .collect() // suspends until the flow completes
}
Disclaimer: I am the author and main contributor of javasync/RxIo.

Concurrent S3 File Upload via Kotlin Coroutines

I need to upload many files to S3; it would take hours to complete that job sequentially. That's exactly the kind of job Kotlin's new coroutines excel at, so I wanted to give them a first try instead of fiddling around again with some Thread-based execution service.
Here is my (simplified) code:
fun upload(superTiles: Map<Int, Map<Int, SuperTile>>) = runBlocking {
    val s3 = AmazonS3ClientBuilder.standard().withRegion("eu-west-1").build()
    for ((x, ys) in superTiles) {
        val jobs = mutableListOf<Deferred<Any>>()
        for ((y, superTile) in ys) {
            val job = async(CommonPool) {
                uploadTile(s3, x, y, superTile)
            }
            jobs.add(job)
        }
        jobs.map { it.await() }
    }
}
suspend fun uploadTile(s3: AmazonS3, x: Int, y: Int, superTile: SuperTile) {
    val json: String = "{}"
    val key = "$s3Prefix/x4/$z/$x/$y.json"
    s3.putObject(PutObjectRequest("my_bucket", key, ByteArrayInputStream(json.toByteArray()), metadata))
}
The problem: the code is still very slow, and logging reveals that requests are still executed sequentially: a job finishes before the next one is created. Only in very few cases (1 out of 10) do I see jobs running concurrently.
Why does the code not run much faster / concurrently? What can I do about it?
Kotlin coroutines excel when you work with asynchronous APIs, while the AmazonS3.putObject API that you are using is an old-school blocking, synchronous API, so you get only as many concurrent uploads as there are threads in the CommonPool that you are using. There is no value in marking your uploadTile function with the suspend modifier, because it does not use any suspending functions in its body.
The first step to getting more throughput in your upload task is to start using an asynchronous API. I'd suggest looking at Amazon S3 TransferManager for that purpose. See if that gets your problem solved first.
Kotlin coroutines are designed to help you combine your async APIs into easy-to-use logical workflows. For example, it is straightforward to adapt the asynchronous API of TransferManager for use with coroutines by writing the following extension function:
suspend fun Upload.await(): UploadResult = suspendCancellableCoroutine { cont ->
    addProgressListener {
        if (isDone) {
            // we know it should not actually wait when done
            try { cont.resume(waitForUploadResult()) }
            catch (e: Throwable) { cont.resumeWithException(e) }
        }
    }
    cont.invokeOnCompletion { abort() }
}
This extension enables you to write very fluent code that works with TransferManager, and you can rewrite your uploadTile function to use TransferManager instead of the blocking AmazonS3 interface:
suspend fun uploadTile(tm: TransferManager, x: Int, y: Int, superTile: SuperTile) {
    val json: String = "{}"
    val key = "$s3Prefix/x4/$z/$x/$y.json"
    tm.upload(PutObjectRequest("my_bucket", key, ByteArrayInputStream(json.toByteArray()), metadata))
        .await()
}
Notice how this new version of uploadTile uses the suspending function await that was defined above.
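To complete the picture, here is a sketch (mine, under the same assumptions as the question's code: SuperTile, s3Prefix, z, and metadata are defined elsewhere) of the upload driver rewritten against TransferManager. The uploads now overlap, because nothing in the loop blocks a thread:

import com.amazonaws.services.s3.transfer.TransferManagerBuilder
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

suspend fun upload(superTiles: Map<Int, Map<Int, SuperTile>>) = coroutineScope {
    val tm = TransferManagerBuilder.standard().build()
    try {
        superTiles.flatMap { (x, ys) ->
            ys.map { (y, superTile) ->
                async { uploadTile(tm, x, y, superTile) } // starts immediately, does not block
            }
        }.awaitAll() // suspends until every upload has finished
    } finally {
        tm.shutdownNow() // also releases the underlying S3 client
    }
}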