Custom function to update data in a pending Kotlin channel buffer

I have an UNLIMITED size buffered channel where senders are much faster than receivers. I would like to update the buffer by removing old data and replacing it with newer data (if the receiver has not yet consumed it).
Here is my code
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch

data class Item(val id: Int, val value: Int)

val testData = listOf(
    Item(1, 10),
    Item(2, 24),
    Item(3, 12),
    Item(1, 17), // This one should replace the Item(1, 10) if it's not yet consumed
    Item(4, 16),
    Item(2, 32), // This one should replace the Item(2, 24) if it's not yet consumed
)
suspend fun main(): Unit = coroutineScope {
    val channel = Channel<Item>(Channel.UNLIMITED)

    launch {
        for (item in testData) {
            delay(50)
            println("Producing item $item")
            channel.send(item)
        }
    }

    // As you can see, the sender quickly sends all the testData, and the items wait
    // in the buffer to be consumed by the receiver.
    // I would like to do some checks whenever a new item is added to the buffer:
    // if (itemInBuffer.id == newItem.id && itemInBuffer.value < newItem.value) then replace it with newItem

    launch {
        for (item in channel) {
            delay(5000)
            println(item.toString())
        }
    }
}
Is there any built-in Kotlin function which takes some custom condition and removes items from the buffer? I saw there is a function called distinctUntilChangedBy in Flow which removes duplicate data based on a custom key selector. Is there anything similar available for Channel, or is it possible to achieve this with channelFlow? (Note: in my real code the events come from network calls, so I'm not sure channelFlow would be suitable there.)

This isn't as simple as it sounds. We can't access the channel's queue to modify its contents, and even if we could, it wouldn't be easy to find an item with the same id - we would have to iterate over the whole queue. distinctUntilChangedBy() is a very different case, because it only compares against the last item - it doesn't look through a whole queue.
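To make the contrast concrete, this is roughly how distinctUntilChangedBy behaves - it only suppresses an item whose key matches the immediately preceding emission (a minimal sketch reusing the Item class from the question):

import kotlinx.coroutines.flow.*

suspend fun demo() {
    flowOf(Item(1, 10), Item(1, 17), Item(2, 24), Item(1, 30))
        .distinctUntilChangedBy { it.id }
        // Item(1, 17) is dropped: same key as the previous emission.
        // Item(1, 30) passes: the previous key was 2.
        .collect { println(it) }
}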
I think our best bet is to not use the queue provided by the channel, but instead to store the data ourselves in a map and only provide send and receive functionality on top of it. I implemented this as a flow-like operator and made it generic, so it can be used in other similar cases:
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlinx.coroutines.selects.select

context(CoroutineScope)
fun <T : Any, K> ReceiveChannel<T>.groupingReduce(keySelector: (T) -> K, reduce: (T, T) -> T): ReceiveChannel<T> = produce {
    val items = mutableMapOf<K, T>()
    while (!isClosedForReceive) {
        select<Unit> {
            if (items.isNotEmpty()) {
                val (key, item) = items.entries.first()
                onSend(item) {
                    items.remove(key)
                }
            }
            onReceiveCatching { result ->
                val item = result.getOrElse { return@onReceiveCatching }
                items.merge(keySelector(item), item, reduce)
            }
        }
    }
    items.values.forEach { send(it) }
}
It keeps the data in a map and tries to send and receive at the same time, whichever finishes first. If it receives an item whose key is already in the map, it lets the caller merge the two values. It sends items in the order in which their keys first appeared in the source channel, so a new value for the same key doesn't push that item back to the last position in the queue.
This is how we can use it with the example you provided. I modified the example a little, because your version is confusing to me: it consumes (1, 10) before producing (1, 17), so the example is actually incorrect. Also, the producer and consumer don't really run at the same time, so launching them concurrently and adding delays doesn't change much:
suspend fun main(): Unit = coroutineScope {
    val channel = Channel<Item>(Channel.UNLIMITED)
    val channel2 = channel.groupingReduce(
        keySelector = { it.id },
        reduce = { it1, it2 -> if (it1.value > it2.value) it1 else it2 }
    )

    for (item in testData) {
        println("Producing item $item")
        channel.send(item)
    }
    channel.close()

    // Needed because with `UNLIMITED` sending is almost immediate,
    // so consuming would otherwise start at the same time as producing.
    delay(100)

    for (item in channel2) {
        println(item.toString())
    }
}
I created another example where the producer and consumer actually run concurrently. Items are produced every 100ms and consumed every 200ms, with an initial delay of 50ms.
suspend fun main(): Unit = coroutineScope {
    val channel = Channel<Item>(Channel.UNLIMITED)
    val channel2 = channel.groupingReduce(
        keySelector = { it.id },
        reduce = { it1, it2 -> if (it1.value > it2.value) it1 else it2 }
    )

    launch {
        delay(50)
        for (item in channel2) {
            println(item.toString())
            delay(200)
        }
    }

    launch {
        listOf(
            Item(1, 10),
            // consume: 1, 10
            Item(2, 20),
            Item(1, 30),
            // consume: 2, 20
            Item(3, 40),
            Item(1, 50),
            // consume: 1, 50
            Item(4, 60),
            Item(1, 70),
            // consume: 3, 40
            Item(5, 80),
            // consume: 4, 60
            // consume: 1, 70
            // consume: 5, 80
        ).forEach {
            channel.send(it)
            delay(100)
        }
        channel.close()
    }
}
Maybe there is a better way to solve this. Also, to be honest, I'm not 100% sure this code is correct - maybe I missed some corner case around channel closing, cancellation or failures. Additionally, I'm not sure whether select { onSend() } guarantees that if the code block has not been executed, then the item has not been sent. When we cancel send(), we don't have a guarantee that the item was not delivered; it may be the same in this case.


Kotlin - How To Collect X Values From a Flow?

Let's say I have a flow that is constantly sending updates, like the following:

val locationFlow = MutableStateFlow<Location?>(null)
I have a use-case where after a particular event occurs, I want to collect X values from the flow and continue, so something like what I have below. I know that collect is a terminal operator, so I don't think the logic I have below works, but how could I do this in this case? I'd like to collect X items, save them, and then send them to another function for processing/handling.
fun onEventOccurred() {
    launch {
        val locations = mutableListOf<Location?>()
        locationFlow.collect {
            // collect only X locations
            locations.add(it)
        }
        saveLocations(locations)
    }
}
Is there a pre-existing Kotlin function for something like this? I'd like to collect from the flow X times, save the items to a list, and pass that list to another function.
It doesn't matter that collect is terminal. The upstream StateFlow will keep behaving normally, because a StateFlow doesn't care what its collectors are doing. You can use the take function to get a specific number of items, and toList() (another terminal operator) to concisely copy them into a list once they're all ready.
fun onEventOccurred() {
    launch {
        saveLocations(locationFlow.take(5).toList())
    }
}
If I understood your use case correctly, you want to:
discard elements until a specific one is sent - actually, after re-reading your question I don't think this is the case; I'm leaving it in the example just FYI
when that happens, collect X items for further processing
Assuming that's correct, you can use a combination of dropWhile and take, like so:
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    val messages = flow {
        repeat(10) {
            println(it)
            delay(500)
            emit(it)
        }
    }
    messages
        .dropWhile { it < 5 }
        .take(3)
        .collect { println(it) } // prints 5, 6, 7
}
You can even have more complex logic, e.g. discard any number less than 5, and then take the first 10 even numbers:
fun main() = runBlocking {
    val messages = flow {
        repeat(100) {
            delay(500)
            emit(it)
        }
    }
    messages
        .dropWhile { it < 5 }
        .filter { it % 2 == 0 }
        .take(10)
        .collect { println(it) } // prints even numbers, 6 to 24
}

Parallelly consuming a long sequence in Kotlin

I have a function generating a very long sequence of work items. Generating these items is fast, but there are too many in total to store a list of them in memory. Processing the items produces no results, just side effects.
I would like to process these items across multiple threads. One solution is to have a thread read from the generator and write to a concurrent bounded queue, and a number of executor threads polling for work from the bounded queue, but this is a lot of things to set up.
Is there anything in the standard library that would help me do that?
I had initially tried
items.map { async(executor) { process(it) } }.forEach { it.await() }
But, as pointed out in how to implement parallel mapping for sequences in kotlin, this doesn't work, for reasons that are obvious in retrospect (see the sketch below).
Is there a quick way to do this (possibly with an external library), or is manually setting up a bounded queue in the middle my best option?
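(For reference, laziness is the culprit: a Sequence is evaluated on demand, so each async is created only when forEach pulls the next element, and it is awaited before the following one even starts. A minimal sketch demonstrating this, assuming kotlinx.coroutines:)

import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val items = generateSequence(1) { it + 1 }.take(3)
    val time = measureTimeMillis {
        items
            .map { async { delay(100); println("processed $it") } }
            .forEach { it.await() } // each async is created only when forEach pulls it,
                                    // then awaited immediately -> strictly sequential
    }
    println("took ${time}ms") // ~300 ms, not ~100 ms
}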
You can look at coroutines combined with channels. If all work items can be emitted on demand from a producer channel, then it's possible to receive each item and process it with a pool of threads. An example:
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.produce

sealed class Stream {
    object End : Stream()
    class Item(val data: Long) : Stream()
}

val produceCtx = newSingleThreadContext("producer")
// A dummy producer that sends one million Longs on its own thread
val producer = CoroutineScope(produceCtx).produce {
    for (i in (0 until 1000000L)) send(Stream.Item(i))
    send(Stream.End)
}

val workCtx = newFixedThreadPoolContext(4, "work")
val workers = Channel<Unit>(4)

// Note: the code below needs to run inside a CoroutineScope (e.g. runBlocking).
repeat(4) { workers.offer(Unit) }
for (_nothing in workers) { // launch 4 times, then wait for a task to finish
    launch(workCtx) {
        when (val item = producer.receive()) {
            Stream.End -> workers.close()
            is Stream.Item -> {
                workFunction(item.data) // Actual work here
                workers.offer(Unit) // Notify to launch a new task
            }
        }
    }
}
Your magic word would be .asSequence():

items
    .asSequence() // Creates a lazy sequence
    .forEach { launch { executor.process(it) } } // If you don't need the value afterwards, use 'launch', a.k.a. "fire and forget"
but there are too many in total to store a list of them in memory
Then don't map to a list and don't collect the values, no matter whether you work with Kotlin or Java.
As long as you are on the JVM, you can write yourself an extension function, that works the sequence in chunks and spawns futures for all entries in a chunk. Something like this:
#Suppress("UNCHECKED_CAST")
fun <T, R> Sequence<T>.mapParallel(action: (value: T) -> R?): Sequence<R?> {
val numThreads = Runtime.getRuntime().availableProcessors() - 1
return this
.chunked(numThreads)
.map { chunk ->
val threadPool = Executors.newFixedThreadPool(numThreads)
try {
return#map chunk
.map {
// CAUTION -> needs to be written like this
// otherwise the submit(Runnable) overload is called
// which always returns an empty Future!!!
val callable: () -> R? = { action(it) }
threadPool.submit(callable)
}
} finally {
threadPool.shutdown()
}
}
.flatten()
.map { future -> future.get() }
}
You can then just use it like:
items
    .mapParallel { /* process an item */ }
    .forEach { /* handle the result */ }
As long as the workload per item is similar, this gives good parallel processing.
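For comparison, a pure-coroutines alternative is to bound the number of in-flight items with a Semaphore. A minimal sketch, assuming kotlinx.coroutines; forEachParallel is a made-up helper name, not a library function:

import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Semaphore

suspend fun <T> Sequence<T>.forEachParallel(
    concurrency: Int,
    action: suspend (T) -> Unit
): Unit = coroutineScope {
    val semaphore = Semaphore(concurrency)
    forEach { item ->
        semaphore.acquire() // suspends while `concurrency` items are in flight
        launch {
            try {
                action(item)
            } finally {
                semaphore.release()
            }
        }
    }
}

Usage would then be something like items.forEachParallel(8) { process(it) } from a suspending context; the sequence is consumed lazily, so no list is ever materialized.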

GroupBy operator for Kotlin Flow

I am trying to switch from RxJava to Kotlin Flow. Flow is really impressive, but is there any operator similar to RxJava's groupBy in Kotlin Flow right now?
As of Kotlin Coroutines 1.3, the standard library doesn't seem to provide this operator. However, since the design of Flow is such that all operators are extension functions, there is no fundamental distinction between the standard library providing it and you writing your own.
With that in mind, here are some of my ideas on how to approach it.
1. Collect Each Group to a List
If you just need a list of all items for each key, use this simple implementation that emits pairs of (K, List<T>):
import kotlinx.coroutines.flow.*

fun <T, K> Flow<T>.groupToList(getKey: (T) -> K): Flow<Pair<K, List<T>>> = flow {
    val storage = mutableMapOf<K, MutableList<T>>()
    collect { t -> storage.getOrPut(getKey(t)) { mutableListOf() } += t }
    storage.forEach { (k, ts) -> emit(k to ts) }
}
For this example:
suspend fun main() {
    val input = 1..10
    input.asFlow()
        .groupToList { it % 2 }
        .collect { println(it) }
}
it prints
(1, [1, 3, 5, 7, 9])
(0, [2, 4, 6, 8, 10])
2.a Emit a Flow for Each Group
If you need the full RxJava semantics where you transform the input flow into many output flows (one per distinct key), things get more involved.
Whenever you see a new key in the input, you must emit a new inner flow to the downstream and then, asynchronously, keep pushing more data into it whenever you encounter the same key again.
Here's an implementation that does this:
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.SendChannel
import kotlinx.coroutines.flow.*

fun <T, K> Flow<T>.groupBy(getKey: (T) -> K): Flow<Pair<K, Flow<T>>> = flow {
    val storage = mutableMapOf<K, SendChannel<T>>()
    try {
        collect { t ->
            val key = getKey(t)
            storage.getOrPut(key) {
                Channel<T>(32).also { emit(key to it.consumeAsFlow()) }
            }.send(t)
        }
    } finally {
        storage.values.forEach { chan -> chan.close() }
    }
}
It sets up a Channel for each key and exposes the channel to the downstream as a flow.
2.b Concurrently Collect and Reduce Grouped Flows
Since groupBy keeps emitting the data to the inner flows after emitting the flows themselves to the downstream, you have to be very careful with how you collect them.
You must collect all the inner flows concurrently, with no upper limit on the level of concurrency. Otherwise the channels of the flows that are queued for later collection will eventually block the sender and you'll end up with a deadlock.
Here is a function that does this properly:
import kotlinx.coroutines.*

fun <T, K, R> Flow<Pair<K, Flow<T>>>.reducePerKey(
    reduce: suspend Flow<T>.() -> R
): Flow<Pair<K, R>> = flow {
    coroutineScope {
        this@reducePerKey
            .map { (key, flow) -> key to async { flow.reduce() } }
            .toList()
            .forEach { (key, deferred) -> emit(key to deferred.await()) }
    }
}
The map stage launches a coroutine for each inner flow it receives. The coroutine reduces it to the final result.
toList() is a terminal operation that collects the entire upstream flow, launching all the async coroutines in the process. The coroutines start consuming the inner flows even while we're still collecting the main flow. This is essential to prevent a deadlock.
Finally, after all the coroutines have been launched, we start a forEach loop that waits for and emits the final results as they become available.
You can implement almost the same behavior in terms of flatMapMerge:
fun <T, K, R> Flow<Pair<K, Flow<T>>>.reducePerKey(
    reduce: suspend Flow<T>.() -> R
): Flow<Pair<K, R>> = flatMapMerge(Int.MAX_VALUE) { (key, flow) ->
    flow { emit(key to flow.reduce()) }
}
The difference is in the ordering: whereas the first implementation respects the order of appearance of keys in the input, this one doesn't. Both perform similarly.
3. Example
This example groups and sums 40 million integers:
suspend fun main() {
    val input = 1..40_000_000
    input.asFlow()
        .groupBy { it % 100 }
        .reducePerKey { sum { it.toLong() } }
        .collect { println(it) }
}

suspend fun <T> Flow<T>.sum(toLong: suspend (T) -> Long): Long {
    var sum = 0L
    collect { sum += toLong(it) }
    return sum
}
I can successfully run this with -Xmx64m. On my 4-core laptop I'm getting about 4 million items per second.
It is simple to redefine the first solution in terms of the new one like this:
fun <T, K> Flow<T>.groupToList(getKey: (T) -> K): Flow<Pair<K, List<T>>> =
    groupBy(getKey).reducePerKey { toList() }
Not yet, but you can have a look at this library: https://github.com/akarnokd/kotlin-flow-extensions
In my project, I was able to achieve this without blocking by using Flux.groupBy:
https://projectreactor.io/docs/core/release/api/reactor/core/publisher/Flux.html#groupBy-java.util.function.Function-
I did this while converting results obtained with Flux to Flow.
This may be an inappropriate answer for the situation in question, but I share it as an example.
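For illustration, a minimal sketch of that conversion (my assumption of the shape, using Reactor plus the kotlinx-coroutines-reactive asFlow() bridge, not the answerer's exact code):

import kotlinx.coroutines.reactive.asFlow
import reactor.core.publisher.Flux

fun groupedViaReactor() =
    Flux.range(1, 10)
        .groupBy { it % 2 }                             // Flux<GroupedFlux<Int, Int>>
        .map { group -> group.key() to group.asFlow() } // expose each group as a Flow
        .asFlow()                                       // Flow<Pair<Int, Flow<Int>>>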

Kotlin coroutines progress counter

I'm making thousands of HTTP requests using async/await and would like to have a progress indicator. I've added one in a naive way, but noticed that the counter value never reaches the total when all requests are done. So I've created a simple test and, sure enough, it doesn't work as expected:
fun main(args: Array<String>) {
    var i = 0
    val range = (1..100000)
    range.map {
        launch {
            ++i
        }
    }
    println("$i ${range.count()}")
}
The output is something like this, where the first number always changes:
98800 100000
I'm probably missing some important detail about concurrency/synchronization in JVM/Kotlin, but don't know where to start. Any tips?
UPDATE: I ended up using channels as Marko suggested:
/**
 * Asynchronously fetches stats for all symbols and sends a total number of requests
 * to the `counter` channel each time a request completes. For example:
 *
 * val counterActor = actor<Int>(UI) {
 *     var counter = 0
 *     for (total in channel) {
 *         progressLabel.text = "${++counter} / $total"
 *     }
 * }
 */
suspend fun getAssetStatsWithProgress(counter: SendChannel<Int>): Map<String, AssetStats> {
    val symbolMap = getSymbols()?.let { it.map { it.symbol to it }.toMap() } ?: emptyMap()
    val total = symbolMap.size
    return symbolMap.map { async { getAssetStats(it.key) } }
        .mapNotNull { it.await().also { counter.send(total) } }
        .map { it.symbol to it }
        .toMap()
}
The explanation of what exactly makes your approach fail is secondary: the primary thing is fixing the approach.
Instead of async-await or launch, for this communication pattern you should have an actor to which all the HTTP jobs send their status. This will automatically handle all your concurrency issues.
Here's some sample code, taken from the link you provided in the comment and adapted to your use case. Instead of some third party asking it for the counter value and updating the GUI with it, the actor runs in the UI context and updates the GUI itself:
import kotlinx.coroutines.experimental.*
import kotlinx.coroutines.experimental.channels.*
import kotlin.system.*
import kotlin.coroutines.experimental.*

object IncCounter

fun counterActor() = actor<IncCounter>(UI) {
    var counter = 0
    for (msg in channel) {
        updateView(++counter)
    }
}

fun main(args: Array<String>) = runBlocking {
    val counter = counterActor()
    massiveRun(CommonPool) {
        counter.send(IncCounter)
    }
    counter.close()
    println("View state: $viewState")
}

// Everything below is mock code that supports the example code above:

val UI = newSingleThreadContext("UI")

fun updateView(newVal: Int) {
    viewState = newVal
}

var viewState = 0

suspend fun massiveRun(context: CoroutineContext, action: suspend () -> Unit) {
    val numCoroutines = 1000
    val repeatActionCount = 1000
    val time = measureTimeMillis {
        val jobs = List(numCoroutines) {
            launch(context) {
                repeat(repeatActionCount) { action() }
            }
        }
        jobs.forEach { it.join() }
    }
    println("Completed ${numCoroutines * repeatActionCount} actions in $time ms")
}
Running it prints
Completed 1000000 actions in 2189 ms
View state: 1000000
You're losing writes because ++i is not an atomic operation: the value has to be read, incremented, and then written back, and you have multiple threads reading and writing i at the same time. (If you don't provide launch with a context, it uses a thread pool by default.)
You lose 1 from your count every time two threads read the same value, because both will then write that value plus one.
Synchronizing in some way, for example by using an AtomicInteger, solves this:

import java.util.concurrent.atomic.AtomicInteger

fun main(args: Array<String>) {
    val i = AtomicInteger(0)
    val range = (1..100000)
    range.map {
        launch {
            i.incrementAndGet()
        }
    }
    println("$i ${range.count()}") // 100000 100000
}
There's also no guarantee that these background threads will be done with their work by the time you print the result and your program ends - you can test this easily by adding just a very small delay inside launch, a couple of milliseconds. With that, it's a good idea to wrap all of this in a runBlocking call, which keeps the main thread alive and waits for all the coroutines to finish:
fun main(args: Array<String>) = runBlocking {
    val i = AtomicInteger(0)
    val range = (1..100000)
    val jobs: List<Job> = range.map {
        launch {
            i.incrementAndGet()
        }
    }
    jobs.forEach { it.join() }
    println("$i ${range.count()}") // 100000 100000
}
Have you read Coroutines basics? There's the exact same problem as yours:
val c = AtomicInteger()
for (i in 1..1_000_000)
    launch {
        c.addAndGet(i)
    }
println(c.get())
This example completes in less than a second for me, but it prints some arbitrary number, because some coroutines don't finish before main() prints the result.
Because launch is not blocking, there's no guarantee all of the coroutines will finish before println. You need to use async, store the Deferred objects, and await them all to finish.
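A minimal sketch combining that advice with the atomic counter from the answer above (my own example, assuming a current kotlinx.coroutines version):

import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicInteger

fun main() = runBlocking {
    val i = AtomicInteger(0)
    val deferreds = (1..100_000).map {
        async(Dispatchers.Default) { i.incrementAndGet() }
    }
    deferreds.awaitAll() // suspends until every coroutine has completed
    println(i.get())     // reliably prints 100000
}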

How to merge 2 separate streams, buffer populated data from them, and subscribe to it after some amount of time

I am trying to test a situation like this:
I have 2 classes which just extend the same Parent.
I am creating an Observable from the list of items for each of the classes:
val listSomeClass1 = ArrayList<SomeClass1>()
val listSomeClass2 = ArrayList<SomeClass2>()

fun populateJust1() {
    listSomeClass1.add(SomeClass1("23", 23))
    listSomeClass1.add(SomeClass1("24", 24))
    listSomeClass1.add(SomeClass1("25", 25))
}

fun populateJust2() {
    listSomeClass2.add(SomeClass2(23.00))
    listSomeClass2.add(SomeClass2(24.00))
    listSomeClass2.add(SomeClass2(25.00))
}

populateJust1()
populateJust2()
Now I can create 2 observables:
val someClass1Observable = Observable.fromIterable(listSomeClass1)
val someClass2Observable = Observable.fromIterable(listSomeClass2)
And here, I want to merge the emissions from them, buffer them, and subscribe to the result after 10 seconds:
Observable.merge(someClass1Observable, someClass2Observable)
    .buffer(10, TimeUnit.SECONDS)
    .doOnSubscribe { Log.v("parentObservable", "STARTED") }
    .subscribe { t: MutableList<Parent> ->
        Log.v("parentObservable", "onNext")
        t.forEach { Log.v("onNext", it.toString()) }
    }
However, the observable does not start after 10 seconds as I expected; it starts immediately with the data already available.
How can I gather 2 separate streams so that after 10 seconds I can get the gathered data?
I must point out that I don't want to use any Subject.
UPDATE
I've done something like this:
val list1 = listOf(SomeClass1("1", 1), SomeClass1("2", 2), SomeClass1("3", 3))
val list2 = listOf(SomeClass2(5.00), SomeClass2(4.00), SomeClass2(6.00))

val someClass1Observable = Observable
    .fromIterable(list1)
    .zipWith(Observable.interval(2, TimeUnit.SECONDS),
        BiFunction { item: SomeClass1, _: Long -> item })

val someClass2Observable = Observable
    .fromIterable(list2)
    .zipWith(Observable.interval(1, TimeUnit.SECONDS),
        BiFunction { item: SomeClass2, _: Long -> item })

someClass1Observable.subscribe {
    Log.v("someClass1", it.toString())
}
someClass2Observable.subscribe {
    Log.v("someClass2", it.toString())
}

Observable.merge(someClass1Observable, someClass2Observable)
    .buffer(10, TimeUnit.SECONDS)
    .delay(10, TimeUnit.SECONDS)
    .doOnSubscribe { Log.v("parentObservable", "STARTED") }
    .subscribe { t: MutableList<Parent> ->
        Log.v("parentObservable", "onNext")
        t.forEach { Log.v("onNext", it.toString()) }
    }

Thread.sleep(13000)

someClass1Observable.subscribe {
    Log.v("someClass1", it.toString())
}
someClass2Observable.subscribe {
    Log.v("someClass2", it.toString())
}
Here, I just want to simulate 2 infinite streams of SomeClass1 and SomeClass2 Observables, and the same for the merged Observable.
Again, I want to be able to merge those 2 streams, buffer the populated data, and do something with it after 10 seconds. If after 10 seconds those 2 streams populate more data, the merged Observable should drop the previous buffer, buffer the new data, and again emit after 10 seconds, and so on, infinitely. However, my code is not working as I expected; what changes do I need to make to achieve what I described?
I think you're looking for the delay operator
http://reactivex.io/documentation/operators/delay.html
Delay
shift the emissions from an Observable forward in time by a particular amount
So something like:
.delay(10, TimeUnit.SECONDS)
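For illustration, a minimal self-contained sketch of the operator's effect (assuming RxJava 2; the values are made up):

import io.reactivex.Observable
import java.util.concurrent.TimeUnit

fun main() {
    Observable.just(1, 2, 3)
        .delay(10, TimeUnit.SECONDS)       // shifts every emission 10 seconds forward
        .blockingSubscribe { println(it) } // nothing is printed until the delay elapses
}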