Compare to sets of files with coroutines in Kotlin - kotlin

I have written a function that scans files (pictures) from two Lists and check if a file is in both lists.
The code below is working as expected, but for large sets it takes some time. So I tried to do this in parallel with coroutines. But in sets of 100 sample files the programm was always slower than without coroutines.
The code:
private fun doJob() {
val filesToCompare = File("C:\\Users\\Tobias\\Desktop\\Test").walk().filter { it.isFile }.toList()
val allFiles = File("\\\\myserver\\Photos\\photo").walk().filter { it.isFile }.toList()
println("Files to scan: ${filesToCompare.size}")
filesToCompare.forEach { file ->
var multipleDuplicate = 0
var s = "This file is a duplicate"
s += "\n${file.absolutePath}"
allFiles.forEach { possibleDuplicate ->
if (file != possibleDuplicate) { //only needed when both lists are the same
// Files that have the same name or contains the name, so not every file gets byte comparison
if (possibleDuplicate.nameWithoutExtension.contains(file.nameWithoutExtension)) {
try {
if (Files.mismatch(file.toPath(), possibleDuplicate.toPath()) == -1L) {
s += "\n${possibleDuplicate.absolutePath}"
i++
multipleDuplicate++
println(s)
}
} catch (e: Exception) {
println(e.message)
}
}
}
}
if (multipleDuplicate > 1) {
println("This file has $multipleDuplicate duplicate(s)")
}
}
println("Files scanned: ${filesToCompare.size}")
println("Total number of duplicates found: $i")
}
How have I tried to add the coroutines?
I wrapped the code inside the first forEach in launch{...} the idea was that for each file a coroutine starts and the second loop is done concurrently. I expected the program to run faster but in fact it was about the same time or slower.
How can I achieve this code to run in parallel faster?

Running each inner loop in a coroutine seems to be a decent approach. The problem might lie in the dispatcher you were using. If you used runBlocking and launch without context argument, you were using a single thread to run all your coroutines.
Since there is mostly blocking IO here, you could instead use Dispatchers.IO to launch your coroutines, so your coroutines are dispatched on multiple threads. The parallelism should be automatically limited to 64, but if your memory can't handle that, you can also use Dispatchers.IO.limitedParallelism(n) to reduce the number of threads.

Related

what is the difference between two programs

Below is the piece of code which I was trying out. there are two blocks of code. the first one creates a million threads/coroutines asynchronously in map and then later adds them up. the second block creates the million coroutines and adds them in the for loop and then prints in out.
According to me the first one should be faster as it creates them in parallel . the second one should be slow as you are creating million in one coroutine.
if my question makes sense at all I would like to get some understanding of this.
fun main() {
//first block
val elapsed1 = measureTimeMillis {
val deferred = (1..1000000).map { n->
GlobalScope.async{
n
}
}
runBlocking {
val sum = deferred.sumByDouble { it.await().toDouble() }
println(sum)
}
}
//second block
val elapsed2 = measureTimeMillis {
val def2 = GlobalScope.async{
var j = 0.0;
for(i in 1..1000000) {
j+=i
}
j
}
runBlocking {
val s = def2.await()
println(s)
}
}
println("elapsed1 $elapsed1")
println("elapsed2 $elapsed2")
}
Please note that, for above small block of code, it is very difficult determine which one is faster than another.
Also, the result of above execution may vary from system to system.
Hence, we cannot confirm which block of code is faster just by running above steps.
Also, I request you to please refer to the following article to get more clarity around this area:
What is microbenchmarking?

Kotlin Coroutines - unlimited stream to fan out batches

I'm looking to implement a pipeline for processing an infinite stream of messages. I'm new to coroutines and trying to follow along with the docs but I'm not confident I'm doing the right thing.
My infinite stream is of batches of records and I'd like to fan out the processing of each record to a coroutine, wait for a batch to finish (to log stats and stuff) before continuing to the next batch.
-> process [record] \
source -> [records] -> process [record] -> [log batch stats]
-> process [record] /
|------------------- while(true) -------------------|
What I had planned is to have 2 Channels, one for the infinite stream, and one for the intermediate records that will fill up and empty on each batch.
runBlocking {
val infinite: Channel<List<Record>> = produce { send(source.getBatch()) }
val records = Channel<Record>(Channel.Factory.UNLIMITED)
while(true) {
infinite.receive().forEach { records.send(it) }
while(!records.isEmpty()) {
launch { process(records.receive()) }
}
// ??? Wait for jobs?
logBatchStats()
}
}
From googling, it seems that waiting for jobs is discouraged, plus I wasn't sure if calling .map on a channel will actually receive messages to convert them to jobs:
records.map { record -> launch { process(record) } }
yields a Channel<Job>. It seems I can call .toList() on it to collapse it, but then I need to join the jobs? Again, google suggested to do that by having a parent job, but I'm not really sure how to do that with launch.
Anyway, very much a n00b to this.
Thanks for the help.
I don't see a reason to have two channels. You could directly iterate over the list of records. And you should use async instead of launch. Then you can use await or even better awaitAll for the list of results.
val infinite: ReceiveChannel<List<Record>> = produce { ... }
while(true) {
val resultsDeferred = infinite.receive().map {
async {
process(it)
}
}
val results = resultsDeferred.awaitAll()
logBatchStats()
}

Launch a number of coroutines and join them all with timeout (without cancelling)

I need to launch a number of jobs which will return a result.
In the main code (which is not a coroutine), after launching the jobs I need to wait for them all to complete their task OR for a given timeout to expire, whichever comes first.
If I exit from the wait because all the jobs completed before the timeout, that's great, I will collect their results.
But if some of the jobs are taking longer that the timeout, my main function needs to wake as soon as the timeout expires, inspect which jobs did finish in time (if any) and which ones are still running, and work from there, without cancelling the jobs that are still running.
How would you code this kind of wait?
The solution follows directly from the question. First, we'll design a suspending function for the task. Let's see our requirements:
if some of the jobs are taking longer that the timeout... without cancelling the jobs that are still running.
It means that the jobs we launch have to be standalone (not children), so we'll opt-out of structured concurrency and use GlobalScope to launch them, manually collecting all the jobs. We use async coroutine builder because we plan to collect their results of some type R later:
val jobs: List<Deferred<R>> = List(numberOfJobs) {
GlobalScope.async { /* our code that produces R */ }
}
after launching the jobs I need to wait for them all to complete their task OR for a given timeout to expire, whichever comes first.
Let's wait for all of them and do this waiting with timeout:
withTimeoutOrNull(timeoutMillis) { jobs.joinAll() }
We use joinAll (as opposed to awaitAll) to avoid exception if one of the jobs fail and withTimeoutOrNull to avoid exception on timeout.
my main function needs to wake as soon as the timeout expires, inspect which jobs did finish in time (if any) and which ones are still running
jobs.map { deferred -> /* ... inspect results */ }
In the main code (which is not a coroutine) ...
Since our main code is not a coroutine it has to wait in a blocking way, so we bridge the code we wrote using runBlocking. Putting it all together:
fun awaitResultsWithTimeoutBlocking(
timeoutMillis: Long,
numberOfJobs: Int
) = runBlocking {
val jobs: List<Deferred<R>> = List(numberOfJobs) {
GlobalScope.async { /* our code that produces R */ }
}
withTimeoutOrNull(timeoutMillis) { jobs.joinAll() }
jobs.map { deferred -> /* ... inspect results */ }
}
P.S. I would not recommend deploying this kind of solution in any kind of a serious production environment, since letting your background jobs running (leak) after timeout will invariably badly bite you later on. Do so only if you throughly understand all the deficiencies and risks of such an approach.
You can try to work with whileSelect and the onTimeout clause. But you still have to overcome the problem that your main code is not a coroutine. The next lines are an example of whileSelect statement. The function returns a Deferred with a list of results evaluated in the timeout period and another list of Deferreds of the unfinished results.
fun CoroutineScope.runWithTimeout(timeoutMs: Int): Deferred<Pair<List<Int>, List<Deferred<Int>>>> = async {
val deferredList = (1..100).mapTo(mutableListOf()) {
async {
val random = Random.nextInt(0, 100)
delay(random.toLong())
random
}
}
val finished = mutableListOf<Int>()
val endTime = System.currentTimeMillis() + timeoutMs
whileSelect {
var waitTime = endTime - System.currentTimeMillis()
onTimeout(waitTime) {
false
}
deferredList.toList().forEach { deferred ->
deferred.onAwait { random ->
deferredList.remove(deferred)
finished.add(random)
true
}
}
}
finished.toList() to deferredList.toList()
}
In your main code you can use the discouraged method runBlocking to access the Deferrred.
fun main() = runBlocking<Unit> {
val deferredResult = runWithTimeout(75)
val (finished, pending) = deferredResult.await()
println("Finished: ${finished.size} vs Pending: ${pending.size}")
}
Here is the solution I came up with. Pairing each job with a state (among other info):
private enum class State { WAIT, DONE, ... }
private data class MyJob(
val job: Deferred<...>,
var state: State = State.WAIT,
...
)
and writing an explicit loop:
// wait until either all jobs complete, or a timeout is reached
val waitJob = launch { delay(TIMEOUT_MS) }
while (waitJob.isActive && myJobs.any { it.state == State.WAIT }) {
select<Unit> {
waitJob.onJoin {}
myJobs.filter { it.state == State.WAIT }.forEach {
it.job.onJoin {}
}
}
// mark any finished jobs as DONE to exclude them from the next loop
myJobs.filter { !it.job.isActive }.forEach {
it.state = State.DONE
}
}
The initial state is called WAIT (instead of RUN) because it doesn't necessarily mean that the job is still running, only that my loop has not yet taken it into account.
I'm interested to know if this is idiomatic enough, or if there are better ways to code this kind of behaviour.

Unpredictable coroutines execution order?

This is what I thought:
When using coroutines you go piling up async ops and once you are done with synchronous op..call them in FIFO order..but that's not always true
In this example you get what I expected:
fun main() = runBlocking {
launch {
println("1")
}
launch {
println("2")
}
println("0")
}
Also here(with nested launch):
fun main() = runBlocking {
launch {
println("1")
}
launch {
launch {
println("3")
}
println("2")
}
println("0")
}
Now in this example with a scope builder and creating another "pile"(not the real term) the order changes but still..you get as expected
fun main() = runBlocking {
launch {
println("2")
}
// replacing launch
coroutineScope {
println("0")
}
println("1")
}
Finally..the reason of this question..example 2 with scope builder:
fun main() = runBlocking {
launch {
println("3")
}
coroutineScope {
launch {
println("1")
}
println("0")
}
println("2")
}
I get this:
0
3
1
2
Why??
Was my assumption wrong and that's not how coroutines work
If so..then how should I determine the correct order when coding
edited: I've tried running the same code on different machines and different platforms but always got the same result..also tried more complicated nesting to prove non-variability of results
And digging the documentation found that coroutines are just code transformation(as I initially thought)
Remember that even if the like to call them 'light-weight' threads they run in a single 'real' thread(note: without newSingleThreadContext)
Thus I chose to believe execution order is pre-established at compile-time and not decided at runtime
After all..I still can't anticipate the order..and that's what I want
Don't assume coroutines will be run in a specific order, the runtime will decide what's best to run when and in what order. What you may be interested in that will help is the kotlinx.coroutines documentation. It does a great job of explaining how they work and also provides some handy abstractions to help managing coroutines make more sense. I personally recommend checking out channels, jobs, and Deferred (async/await).
For example, if I wanted things done in a certain order by number, I'd use channels to ensure things arrived in the order I wanted.
runBlocking {
val channel = Channel<Int>()
launch {
for (x in 0..5) channel.send(x * x)
channel.close()
}
for (msg in channel) {
// Pretend we're doing some work with channel results
println("Message: $msg")
}
}
Hopefully that can give you more context or what coroutines are and what they're good for.

Kotlin: withContext() vs Async-await

I have been reading kotlin docs, and if I understood correctly the two Kotlin functions work as follows :
withContext(context): switches the context of the current coroutine, when the given block executes, the coroutine switches back to previous context.
async(context): Starts a new coroutine in the given context and if we call .await() on the returned Deferred task, it will suspends the calling coroutine and resume when the block executing inside the spawned coroutine returns.
Now for the following two versions of code :
Version1:
launch(){
block1()
val returned = async(context){
block2()
}.await()
block3()
}
Version2:
launch(){
block1()
val returned = withContext(context){
block2()
}
block3()
}
In both versions block1(), block3() execute in default context(commonpool?) where as block2() executes in the given context.
The overall execution is synchronous with block1() -> block2() -> block3() order.
Only difference I see is that version1 creates another coroutine, where as version2 executes only one coroutine while switching context.
My questions are :
Isn't it always better to use withContext rather than async-await as it is functionally similar, but doesn't create another coroutine. Large numbers of coroutines, although lightweight, could still be a problem in demanding applications.
Is there a case async-await is more preferable to withContext?
Update:
Kotlin 1.2.50 now has a code inspection where it can convert async(ctx) { }.await() to withContext(ctx) { }.
Large number of coroutines, though lightweight, could still be a problem in demanding applications
I'd like to dispel this myth of "too many coroutines" being a problem by quantifying their actual cost.
First, we should disentangle the coroutine itself from the coroutine context to which it is attached. This is how you create just a coroutine with minimum overhead:
GlobalScope.launch(Dispatchers.Unconfined) {
suspendCoroutine<Unit> {
continuations.add(it)
}
}
The value of this expression is a Job holding a suspended coroutine. To retain the continuation, we added it to a list in the wider scope.
I benchmarked this code and concluded that it allocates 140 bytes and takes 100 nanoseconds to complete. So that's how lightweight a coroutine is.
For reproducibility, this is the code I used:
fun measureMemoryOfLaunch() {
val continuations = ContinuationList()
val jobs = (1..10_000).mapTo(JobList()) {
GlobalScope.launch(Dispatchers.Unconfined) {
suspendCoroutine<Unit> {
continuations.add(it)
}
}
}
(1..500).forEach {
Thread.sleep(1000)
println(it)
}
println(jobs.onEach { it.cancel() }.filter { it.isActive})
}
class JobList : ArrayList<Job>()
class ContinuationList : ArrayList<Continuation<Unit>>()
This code starts a bunch of coroutines and then sleeps so you have time to analyze the heap with a monitoring tool like VisualVM. I created the specialized classes JobList and ContinuationList because this makes it easier to analyze the heap dump.
To get a more complete story, I used the code below to also measure the cost of withContext() and async-await:
import kotlinx.coroutines.*
import java.util.concurrent.Executors
import kotlin.coroutines.suspendCoroutine
import kotlin.system.measureTimeMillis
const val JOBS_PER_BATCH = 100_000
var blackHoleCount = 0
val threadPool = Executors.newSingleThreadExecutor()!!
val ThreadPool = threadPool.asCoroutineDispatcher()
fun main(args: Array<String>) {
try {
measure("just launch", justLaunch)
measure("launch and withContext", launchAndWithContext)
measure("launch and async", launchAndAsync)
println("Black hole value: $blackHoleCount")
} finally {
threadPool.shutdown()
}
}
fun measure(name: String, block: (Int) -> Job) {
print("Measuring $name, warmup ")
(1..1_000_000).forEach { block(it).cancel() }
println("done.")
System.gc()
System.gc()
val tookOnAverage = (1..20).map { _ ->
System.gc()
System.gc()
var jobs: List<Job> = emptyList()
measureTimeMillis {
jobs = (1..JOBS_PER_BATCH).map(block)
}.also { _ ->
blackHoleCount += jobs.onEach { it.cancel() }.count()
}
}.average()
println("$name took ${tookOnAverage * 1_000_000 / JOBS_PER_BATCH} nanoseconds")
}
fun measureMemory(name:String, block: (Int) -> Job) {
println(name)
val jobs = (1..JOBS_PER_BATCH).map(block)
(1..500).forEach {
Thread.sleep(1000)
println(it)
}
println(jobs.onEach { it.cancel() }.filter { it.isActive})
}
val justLaunch: (i: Int) -> Job = {
GlobalScope.launch(Dispatchers.Unconfined) {
suspendCoroutine<Unit> {}
}
}
val launchAndWithContext: (i: Int) -> Job = {
GlobalScope.launch(Dispatchers.Unconfined) {
withContext(ThreadPool) {
suspendCoroutine<Unit> {}
}
}
}
val launchAndAsync: (i: Int) -> Job = {
GlobalScope.launch(Dispatchers.Unconfined) {
async(ThreadPool) {
suspendCoroutine<Unit> {}
}.await()
}
}
This is the typical output I get from the above code:
Just launch: 140 nanoseconds
launch and withContext : 520 nanoseconds
launch and async-await: 1100 nanoseconds
Yes, async-await takes about twice as long as withContext, but it's still just a microsecond. You'd have to launch them in a tight loop, doing almost nothing besides, for that to become "a problem" in your app.
Using measureMemory() I found the following memory cost per call:
Just launch: 88 bytes
withContext(): 512 bytes
async-await: 652 bytes
The cost of async-await is exactly 140 bytes higher than withContext, the number we got as the memory weight of one coroutine. This is just a fraction of the complete cost of setting up the CommonPool context.
If performance/memory impact was the only criterion to decide between withContext and async-await, the conclusion would have to be that there's no relevant difference between them in 99% of real use cases.
The real reason is that withContext() a simpler and more direct API, especially in terms of exception handling:
An exception that isn't handled within async { ... } causes its parent job to get cancelled. This happens regardless of how you handle exceptions from the matching await(). If you haven't prepared a coroutineScope for it, it may bring down your entire application.
An exception not handled within withContext { ... } simply gets thrown by the withContext call, you handle it just like any other.
withContext also happens to be optimized, leveraging the fact that you're suspending the parent coroutine and awaiting on the child, but that's just an added bonus.
async-await should be reserved for those cases where you actually want concurrency, so that you launch several coroutines in the background and only then await on them. In short:
async-await-async-await — don't do that, use withContext-withContext
async-async-await-await — that's the way to use it.
Isn't it always better to use withContext rather than asynch-await as it is funcationally similar, but doesn't create another coroutine. Large numebrs coroutines, though lightweight could still be a problem in demanding applications
Is there a case asynch-await is more preferable to withContext
You should use async/await when you want to execute multiple tasks concurrently, for example:
runBlocking {
val deferredResults = arrayListOf<Deferred<String>>()
deferredResults += async {
delay(1, TimeUnit.SECONDS)
"1"
}
deferredResults += async {
delay(1, TimeUnit.SECONDS)
"2"
}
deferredResults += async {
delay(1, TimeUnit.SECONDS)
"3"
}
//wait for all results (at this point tasks are running)
val results = deferredResults.map { it.await() }
//Or val results = deferredResults.awaitAll()
println(results)
}
If you don't need to run multiple tasks concurrently you can use withContext.
When in doubt, remember this like a rule of thumb:
If multiple tasks have to happen in parallel and the final result depends on completion of all of them, then use async.
For returning the result of a single task, use withContext.