KafkaConsumer: `seekToEnd()` does not make consumer consume from latest offset - kotlin

I have the following code:
class Consumer(val consumer: KafkaConsumer<String, ConsumerRecord<String>>) {
    fun run() {
        consumer.seekToEnd(emptyList())
        val pollDuration = 30L // seconds
        while (true) {
            val records = consumer.poll(Duration.ofSeconds(pollDuration))
            // perform record analysis and commitSync()
        }
    }
}
The topic that the consumer is subscribed to continuously receives records. Occasionally, the consumer crashes during the processing step. When the consumer is then restarted, I want it to consume from the latest offset on the topic (i.e. ignore records that were published to the topic while the consumer was down). I thought the seekToEnd() method would ensure that. However, the method seems to have no effect at all: the consumer resumes from the offset at which it crashed.
What is the correct way to use seekToEnd()?
Edit: The consumer is created with the following configs
fun <T> buildConsumer(valueDeserializer: String): KafkaConsumer<String, T> {
    val props = setupConfig(valueDeserializer)
    Common.setupConsumerSecurityProtocol(props)
    return createConsumer(props)
}
fun setupConfig(valueDeserializer: String): Properties {
    // Configuration setup
    val props = Properties()
    props[ConsumerConfig.GROUP_ID_CONFIG] = config.applicationId
    props[ConsumerConfig.CLIENT_ID_CONFIG] = config.kafka.clientId
    props[ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG] = config.kafka.bootstrapServers
    props[AbstractKafkaSchemaSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG] = config.kafka.schemaRegistryUrl
    props[ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG] = config.kafka.stringDeserializer
    props[ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG] = valueDeserializer
    props[KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG] = "true"
    props[ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG] = config.kafka.maxPollIntervalMs
    props[ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG] = config.kafka.sessionTimeoutMs
    props[ConsumerConfig.ALLOW_AUTO_CREATE_TOPICS_CONFIG] = "false"
    props[ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG] = "false"
    props[ConsumerConfig.AUTO_OFFSET_RESET_CONFIG] = "latest"
    return props
}
fun <T> createConsumer(props: Properties): KafkaConsumer<String, T> {
    val consumer = KafkaConsumer<String, T>(props)
    consumer.subscribe(listOf(config.kafka.inputTopic))
    return consumer
}

I found a solution!
I needed to add a dummy poll as part of the consumer initialization. Since several Kafka methods are evaluated lazily, a dummy poll is needed to get partitions assigned to the consumer. Without the dummy poll, the consumer has no assigned partitions when seekToEnd() is called, so the call has no effect.
It is important that the dummy poll duration is long enough for the partitions to get assigned. For instance, with consumer.poll(Duration.ofSeconds(1)), the partitions were not yet assigned when the program moved on to the next method call (i.e. seekToEnd()).
Working code could look something like this:
class Consumer(val consumer: KafkaConsumer<String, ConsumerRecord<String>>) {
    fun run() {
        // Initialization
        val pollDuration = 30L // seconds
        consumer.poll(Duration.ofSeconds(pollDuration)) // Dummy poll to get assigned partitions
        // Seek to end and commit new offset
        consumer.seekToEnd(emptyList())
        consumer.commitSync()
        while (true) {
            val records = consumer.poll(Duration.ofSeconds(pollDuration))
            // perform record analysis and commitSync()
        }
    }
}
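An alternative to the dummy poll, sketched here under assumptions, is to do the seek inside a ConsumerRebalanceListener, which Kafka calls once partitions have actually been assigned. The names SeekToEndOnAssign and subscribeFromLatest are hypothetical, not part of the original code. Note that this seeks to the end on every assignment (including rebalances while running), which may or may not be what you want.

```kotlin
import org.apache.kafka.clients.consumer.Consumer
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener
import org.apache.kafka.common.TopicPartition

/** Hypothetical helper: seeks to the end of every partition as soon as it is assigned. */
class SeekToEndOnAssign(private val consumer: Consumer<*, *>) : ConsumerRebalanceListener {
    override fun onPartitionsAssigned(partitions: Collection<TopicPartition>) {
        // The partitions are known at this point, so the seek has something to act on.
        consumer.seekToEnd(partitions)
    }

    override fun onPartitionsRevoked(partitions: Collection<TopicPartition>) {
        // Nothing to do in this sketch; offsets are committed in the processing loop.
    }
}

/** Hypothetical replacement for the plain consumer.subscribe(listOf(topic)) call. */
fun subscribeFromLatest(consumer: Consumer<*, *>, topic: String) {
    consumer.subscribe(listOf(topic), SeekToEndOnAssign(consumer))
}
```

With this in place, createConsumer would call subscribeFromLatest(consumer, config.kafka.inputTopic) instead of subscribing directly, and run() would not need the initial dummy poll.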

The seekToEnd method needs to know the actual partitions (in Kafka terms, TopicPartition) from whose end you want your consumer to read.
I am not familiar with the Kotlin API, but if you check the JavaDocs for KafkaConsumer's seekToEnd method, you will see that it takes a collection of TopicPartitions.
According to those JavaDocs, an empty collection means "all currently assigned partitions". Since you call it with emptyList() before any partitions have been assigned, it has no impact at all, just like you observed.

Related

Thread-safe access to the same variable from different flows (Kotlin)

Is this code thread safe? Do I need a synchronized block or something like that? source1 and source2 are endless Kotlin Flows.
viewModelScope.launch {
    var listAll = mutableListOf<String>()
    var list1 = mutableListOf<String>()
    var list2 = mutableListOf<String>()
    launch {
        source1.getNames().collect { list ->
            list1 = list
            listAll = mutableListOf()
            listAll.addAll(list1)
            listAll.addAll(list2)
            // then consume listAll as StateFlow or return another flow with emit(listAll)
        }
    }
    launch {
        source2.getNames().collect { list ->
            list2 = list
            listAll = mutableListOf()
            listAll.addAll(list2)
            listAll.addAll(list1)
            // then consume listAll as StateFlow or return another flow with emit(listAll)
        }
    }
}
This code is not thread safe.
However, it is called from viewModelScope.launch, which runs on Dispatchers.Main by default. So your inner launch blocks run one at a time on the main thread, never in parallel. This means that listAll ends up holding the result produced by whichever block ran last.
To run the two blocks in parallel, you would use viewModelScope.launch(Dispatchers.Default).
Your code will probably throw a ConcurrentModificationException in that case.
To synchronize it, you may want to use Java's Collections.synchronizedList, which locks the list while one thread is operating on it, so that other threads cannot modify it at the same time.
Or synchronize manually using a Mutex:
val mutex = Mutex()
viewModelScope.launch(Dispatchers.Default) {
    launch {
        mutex.withLock {
            ... // Your code
        }
    }
    launch {
        mutex.withLock {
            ... // Your code
        }
    }
}
Read the official Kotlin guide to shared mutable state.
That said, I am struggling to imagine a real-life example in which you would actually use that code. You probably don't need asynchronous behavior, and you will be fine without two launch blocks. Otherwise, you should rethink your design to avoid the need for manual synchronization of two coroutines.
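One possible redesign, sketched here under assumptions, is to let the combine operator merge the latest value from each flow, so that no variable is shared between coroutines at all. The function name mergedNames is hypothetical, and the flowOf(...) sources merely stand in for source1.getNames() and source2.getNames(). Unlike the original code, this sketch always puts the first source's names first and emits nothing until both sources have produced a value.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.combine
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

/** Emits the concatenation of the latest list from each source; no shared mutable state. */
fun mergedNames(first: Flow<List<String>>, second: Flow<List<String>>): Flow<List<String>> =
    combine(first, second) { a, b -> a + b }

fun main() = runBlocking {
    // Finite stand-ins for the two endless sources.
    val merged = mergedNames(flowOf(listOf("a", "b")), flowOf(listOf("c"))).toList()
    println(merged)
}
```

In a ViewModel, the resulting flow could then be collected once, or converted with stateIn, instead of assigning to listAll from two collectors.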

Kotlin Coroutines - Asynchronously consume a sequence

I'm looking for a way to keep a Kotlin sequence that can produce values very quickly from outpacing slower async consumers of its values. In the following code, if the async handleValue(it) cannot keep up with the rate at which the sequence produces values, the rate imbalance leads to buffering of produced values and, eventually, out-of-memory errors.
getSequence().map {
    async {
        handleValue(it)
    }
}
I believe this is a classic producer/consumer "back-pressure" situation, and I'm trying to understand how to use Kotlin coroutines to deal with it.
Thanks for any suggestions :)
Kotlin channels and flows both buffer data dispatched by the producer until the consumer/collector is ready to consume it.
But channels have some concerns that have been addressed in flows; for instance, channels are considered hot streams:
The producer starts dispatching data whether or not a consumer is attached, which can introduce resource leaks.
As long as no consumer is attached to the producer, the producer stays stuck in a suspended state.
Flows, however, are cold streams; nothing is produced until there is something to consume it.
To handle your case with Flows:
GlobalScope.launch {
    flow {
        // Producer
        for (item in getSequence()) emit(item)
    }.map { handleValue(it) }
        .buffer(10) // Optionally specify the buffer size
        .collect { // Collector
        }
}
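To make the back-pressure visible, here is a minimal sketch (assuming kotlinx.coroutines 1.6 or newer) that records the order in which a fast sequence produces values and a slow collector handles them. The function name traceBackPressure is hypothetical, and delay(10) merely stands in for handleValue. Without a buffer() in the chain, the sequence is only advanced after the collector has finished with the previous value.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.asFlow
import kotlinx.coroutines.runBlocking

/** Runs a fast producer against a slow collector and returns the observed event order. */
fun traceBackPressure(): List<String> = runBlocking {
    val events = mutableListOf<String>()
    val fastSequence = sequence {
        for (i in 1..3) {
            events += "produce $i"
            yield(i)
        }
    }
    fastSequence.asFlow().collect { value ->
        delay(10) // slow consumer, stand-in for handleValue(value)
        events += "handle $value"
    }
    events
}

fun main() {
    traceBackPressure().forEach(::println)
}
```

Each "produce" line is followed by its matching "handle" line before the next value is produced, which is the behavior the question asks for.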
For my own reference, and to anyone else this may help, here's how I eventually solved this using Channels - https://kotlinlang.org/docs/channels.html#channel-basics
A producer coroutine:
// produce needs a CoroutineScope, so this is declared as an extension on it
fun CoroutineScope.itemChannel(): ReceiveChannel<MyItem> {
    return produce {
        while (moreItems()) {
            send(nextItem()) // <-- suspend until next 'receive()'
        }
    }
}
And a function to run multiple consumer coroutines, each reading off that channel:
fun itemConsumers() {
    runBlocking {
        val channel = itemChannel()
        repeat(numberOfConsumers) {
            launch {
                // Each iteration performs a suspending receive() and the loop
                // ends cleanly once the channel is closed.
                for (item in channel) {
                    // do stuff with item here...
                }
            }
        }
    }
}
The idea here is that the consumer receives off the channel within the coroutine, so the next receive() is not called until a consumer coroutine finishes handling the last item. This results in the desired back-pressure, as opposed to receiving from a sequence or flow in the main thread, and then passing the item into a coroutine to be consumed. In that scenario there is no back-pressure from the receiver, since the receive happens in a different coroutine than where the received item is consumed.

Can I build a Kotlin SharedFlow where the consumer dictates replay length?

Question
When instantiating a Kotlin MutableSharedFlow<T>, you can specify a replay length of n >= 0. All consumers will get n events replayed. Is there a good way to extend or wrap MutableSharedFlow so that the consumer dictates how many (if any) events he/she wants replayed?
Example desired consumer code
flow.collectWithReplay(count = 1) { event -> ... }
Count would of course have to be less than or equal to the upper bound decided by the flow instance.
Rationale
Sometimes you want to act differently on old and new events. An example is when the event contains one-time information that is irrelevant once it has been consumed (e.g. data for an error dialog). You may still want to know that the last state was an error, but since it is old you don't show a dialog again. You'd then call flow.replayCache.lastOrNull() to get the old event and then subscribe to new ones using .collectWithReplay(0).
Other times you don't want that distinction, and then it would be a hassle to do the two calls separately. .collectWithReplay(1) then yields less and prettier code.
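For reference, the built-in behavior described above can be sketched as follows. The function name replayDemo is hypothetical. It shows that the replay length is fixed when the flow is created, that replayCache can be read without subscribing, and that a late collector receives exactly the cached events.

```kotlin
import kotlinx.coroutines.flow.MutableSharedFlow
import kotlinx.coroutines.flow.take
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking

/** Returns the replay cache and what a late collector receives, for replay = 2. */
fun replayDemo(): Pair<List<String>, List<String>> = runBlocking {
    val events = MutableSharedFlow<String>(replay = 2)
    // There are no subscribers yet, so emit() returns immediately and
    // only updates the replay cache.
    events.emit("first")
    events.emit("second")
    events.emit("third")
    val cached = events.replayCache        // snapshot, readable without subscribing
    val replayed = events.take(2).toList() // a late collector is replayed the same events
    cached to replayed
}

fun main() {
    println(replayDemo())
}
```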
Solution attempted
I have made a solution using my own 1-element replay cache, which solves a special case for n=1. It would be trivial to extend to any n - that's not the point, but I dislike a couple of things about it:
a) It doesn't utilize the built in replay mechanism of SharedFlow
b) It's not thread-safe. collectWithReplay might lose an event emitted in between its line 1 and 2
c) Not sure if I lose any performance by losing inline on the collect method signature
open class FlowEventBus<T>() {
    private val _flow = MutableSharedFlow<T>(replay = 0)
    var latest: T? = null
        private set
    suspend fun emit(event: T) {
        latest = event
        _flow.emit(event) // suspends until all subscribers receive the event
    }
    /** Consumers who only want events occurring from now on subscribe here */
    suspend fun collect(action: suspend (value: T) -> Unit) = _flow.collect(action)
    /** Consumers who want the last emitted event as well as future events subscribe here */
    suspend fun collectWithReplay(action: suspend (value: T) -> Unit) {
        latest?.let { action(it) } // Replay any cached event
        _flow.collect(action) // Listen for new events
    }
}
Answer to main question
Here is the foundation for a solution based on the suggestion from @tenfour04:
val mainFlow = MutableSharedFlow<String>(10)
If consumers want a different replay value, they do this:
val flowForTwo = mainFlow.shareIn(threadPoolScope, SharingStarted.Eagerly, 2)
flowForTwo.collect { }
You'll be creating a new SharedFlow each time you do this though, so performance may suffer.
See working test
Variation: Event bus with zero or one replay
Here is a solution where the flow is wrapped in an event bus and the consumer may choose between a replay length of 0 or 1. This solution comes with some race condition quirks when you emit and collect very close together in time. Run and understand this failing unit test before using it in production. I don't know how to fix it, or if it's worth fixing. You might be better off just using a variation of my original idea.
/**
 * FlowEventBus where the consumer can decide between single replay or no replay when collecting.
 * Warning: It has some concurrency issues that are apparent when you run the tests
 */
class FlowEventBus<T> {
    private val threadPoolScope = CoroutineScope(Dispatchers.Default + SupervisorJob())
    private val eventsWithSingleReplay = MutableSharedFlow<T>(replay = 1) // private mutable shared flow
    private val eventsWithoutReplay = eventsWithSingleReplay.shareIn(threadPoolScope, SharingStarted.Eagerly, replay = 0)
    val latest: T?
        get() = eventsWithSingleReplay.replayCache.lastOrNull()
    /** Emit a new event */
    suspend fun emit(event: T) = eventsWithSingleReplay.emit(event)
    /** Consumers who only want events occurring from now on subscribe here */
    suspend fun collect(action: suspend (value: T) -> Unit) = eventsWithoutReplay.collect(action)
    /** Consumers who want the last emitted event as well as future events subscribe here */
    suspend fun collectWithReplay(action: suspend (value: T) -> Unit) {
        eventsWithSingleReplay.collect(action)
    }
}

Flow - pause/resume flow

In RxJava there is the valve operator, which lets you pause (and buffer) a flow and later resume it (emitting the buffered values as soon as it is resumed). It's part of the RxJava extensions (https://github.com/akarnokd/RxJavaExtensions/blob/3.x/src/main/java/hu/akarnokd/rxjava3/operators/FlowableValve.java).
Is there something like this for kotlin flows?
My use case is that I want to observe a flow inside an activity and never lose an event (like I would do it with LiveData e.g. which stops observing data if the activity is paused). So while the activity is paused I want the flow to buffer observed values until the activity is resumed and emit them all as soon as the activity is resumed.
So while the activity is created (until it is destroyed) I want to observe the flow BUT I only want to emit values while the activity is active and buffer the values while it is not active (but still created) until it gets active again.
Is there something to solve this or has anyone ever written something to solve this?
A combination of Lifecycle.launchWhenX and a SharedFlow should do the trick. Here's a simple example using a flow that emits a number every second.
// In your ViewModel
class MainViewModel : ViewModel() {
    val numbers = flow {
        var counter = 0
        while (true) {
            emit(counter++)
            delay(1_000L)
        }
    }
        .shareIn(
            scope = viewModelScope,
            started = SharingStarted.Lazily
        )
}
// In your Fragment.onViewCreated()
viewLifecycleOwner.lifecycleScope.launchWhenStarted {
    viewModel.numbers
        .collect { number ->
            Log.d("asdf", "number: $number")
        }
}
This works because Lifecycle.launchWhenStarted pauses the coroutine when the Lifecycle enters a stopped state, rather than cancelling it. When your Lifecycle comes back to a started state after pausing, it'll collect everything that happened while in the stopped state.
I know it is an ugly solution, but it works fine for me:
fun main() {
    val flow = MutableSharedFlow<String>(extraBufferCapacity = 50, onBufferOverflow = BufferOverflow.DROP_OLDEST)
    val isOpened = AtomicBoolean()
    val startTime = System.currentTimeMillis()
    GlobalScope.launch(Executors.newSingleThreadExecutor().asCoroutineDispatcher()) {
        flow
            .transform { value ->
                while (isOpened.get().not()) { } // busy-wait until the gate is opened
                emit(value)
            }
            .collect {
                println("${System.currentTimeMillis() - startTime}: $it")
            }
    }
    Thread.sleep(1000)
    flow.tryEmit("First")
    Thread.sleep(1000)
    isOpened.set(true)
    flow.tryEmit("Second")
    isOpened.set(false)
    Thread.sleep(1000)
    isOpened.set(true)
    flow.tryEmit("Third")
    Thread.sleep(2000)
}
So you can set isOpened to false when your activity is paused and to true when it is resumed.
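The same gate idea can be sketched without a busy-wait loop by suspending on a StateFlow<Boolean> until it becomes true. The operator name pausable and the gate variable are hypothetical. This sketch only holds values back; any buffering while paused still has to come from the upstream flow, for example a MutableSharedFlow with extraBufferCapacity as in the code above.

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.first
import kotlinx.coroutines.flow.flowOf
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.flow.transform
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.yield

/** Holds each value back until [isOpen] is true; suspends instead of spinning. */
fun <T> Flow<T>.pausable(isOpen: StateFlow<Boolean>): Flow<T> = transform { value ->
    isOpen.first { open -> open } // suspends while the gate is closed
    emit(value)
}

fun main() = runBlocking {
    val gate = MutableStateFlow(false)
    val received = mutableListOf<String>()
    val job = launch { flowOf("First", "Second").pausable(gate).toList(received) }
    yield()              // let the collector run until it suspends at the closed gate
    println(received)    // nothing has been delivered yet
    gate.value = true    // open the gate, e.g. from onResume()
    job.join()
    println(received)    // both values have now been delivered
}
```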
You can use lifecycleScope.launchWhenStarted
https://developer.android.com/kotlin/flow/stateflow-and-sharedflow#stateflow

How to inform a Flux that I have an item ready to publish?

I am trying to make a class that would take incoming user events, process them and then pass the result to whoever subscribed to it:
class EventProcessor
{
    val flux: Flux<Result>
    fun onUserEvent1(e : Event)
    {
        val result = process(e)
        // Notify flux that I have a new result
    }
    fun onUserEvent2(e : Event)
    {
        val result = process(e)
        // Notify flux that I have a new result
    }
    fun process(e : Event): Result
    {
        ...
    }
}
Then the client code can subscribe to EventProcessor::flux and get notified each time a user event has been successfully processed.
However, I do not know how to do this. I tried to construct the flux with the Flux::generate function like this:
class EventProcessor
{
    private var sink: SynchronousSink<Result>? = null
    val flux: Flux<Result> = Flux.generate{ sink = it }
    fun onUserEvent1(e : Event)
    {
        val result = process(e)
        sink?.next(result)
    }
    fun onUserEvent2(e : Event)
    {
        val result = process(e)
        sink?.next(result)
    }
    ....
}
But this does not work, since I am supposed to immediately call next on the SynchronousSink<Result> passed to me in Flux::generate. I cannot store the sink as in the example:
reactor.core.Exceptions$ErrorCallbackNotImplemented:
java.lang.IllegalStateException: The generator didn't call any of the
SynchronousSink method
I was also thinking about the Flux::merge and Flux::concat methods, but these are static and they create a new Flux. I just want to push things into the existing flux, such that whoever holds it, gets notified.
Based on my limited understanding of the reactive types, this is supposed to be a common use case. Yet I find it very difficult to actually implement it. This brings me to a suspicion that I am missing something crucial or that I am using the library in an odd way, in which it was not intended to be used. If this is the case, any advice is warmly welcome.