Does Flux.collectList() introduce blocking behavior - kotlin

Assume we have a Finder that returns multiple objects
fun findAll(): Flux<CarEntity> {
return carRepository.findAll()
}
Because we want to apply some logic to all cars and passengers at the same time, we convert it to a Mono through .collectList()
val carsMono = carsFinder.findAll().collectList()
val passengerMono = carsFinder.findAll().collectList()
return Mono.zip(carsMono, passengerMono)
In other words,
We have a list of entities of undefined length
We gather every item in a list until there is no more - how is this done without blocking the threat?

No collectList() is not a blocking operator but we need to be careful when we use this operator.
with this operator we wait till the upper stream has emitted all elements to us, and if this stream is never ending stream like kafkatemplate or a processor, the collect list will collect elements till it run in out of memory exception.
Or when we have a big stream like findAll from database this will collect a big list consuming a lot of ram memory or even giving out of memory exception.
If you know that you deal with small number of elements you are safe to go.
If you can avoid and process elements in stream than would be better

Related

How to print size of flow in kotlin

Hey I am new in kotlin flow. I am trying to print flow size. As we know that list has size() function. Do we have something similar function for flow.
val list = mutableListof(1,2,3)
println(list.size)
output
2
How do we get size value in flow?
dataMutableStateFlow.collectLatest { data ->
???
}
Thanks
A Flow doesn't know its size at any moment, because there is an unknown number of future values to be emitted. Also, Flows do not keep a record of how many values they have emitted in the past.
Sequences have the same problem. With both Flows and Sequences, you can only get the count by doing something terminal with them, something that iterates through it all.
The only way to get the size of the Flow is to do something that iterates through the entire Flow. For instance, you can call the suspend function count() on a Flow to get its size. The more complicated way to do it would be to create a count variable and then increment the count inside a collect call. However, counting the emissions of a Flow is only usable for finite cold Flows. Hot flows (SharedFlow and StateFlow) are never finite, and many cold Flows are also infinite.

Reactor - Stop source when first empty

I have a requirement like this.
Flux<Integer> s1 = .....;
s1.flatMap(value -> anotherSource.find(value));
I need a way to stop this s1 when anotherSource.find gives me first empty. how to do that?
Note:
One possible solution is to throw error then capture it to stop.
anotherSource.find(value).switchIfempty(Mono.error(..))
I am looking for better solution than this.
You won't find a specific operator for this, you'll have to combine operators to achieve it. (Note that doesn't make it a "hack" per-se, reactive frameworks are generally intended to be used in a way where you combine basic operators together to achieve your use-case.)
I would agree that using an error to achieve is far from ideal though as it potentially disrupts the flow of real errors in the reactive chain - so that should really be a last resort.
The approach I've generally taken in cases where I want the stream to stop based on an inner publisher is to materialise the inner stream, filter out the onComplete() signals and then re-add the onComplete() wherever appropriate (in this case, if it's empty.) You can then dematerialise the outer stream and it'll respond to the completed signal wherever you've injected it, stopping the stream:
s1.flatMap(
value ->
anotherSource
.find(value)
.materialize()
.filter(s -> !s.isOnComplete())
.defaultIfEmpty(Signal.complete()))
.dematerialize()
This has the advantage of preserving any error signals, while also not requiring another object or special value.

Kotlin: Why is Sequence more performant in this example?

Currently, I am looking into Kotlin and have a question about Sequences vs. Collections.
I read a blog post about this topic and there you can find this code snippets:
List implementation:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
measure {
list
.filter { it % 3 == 0 }
.average()
}
// 8644 ms
Sequence implementation:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
measure {
sequence
.filter { it % 3 == 0 }
.average()
}
// 822 ms
The point here is that the Sequence implementation is about 10x faster.
However, I do not really understand WHY that is. I know that with a Sequence, you do "lazy evaluation", but I cannot find any reason why that helps reducing the processing in this example.
However, here I know why a Sequence is generally faster:
val result = sequenceOf("a", "b", "c")
.map {
println("map: $it")
it.toUpperCase()
}
.any {
println("any: $it")
it.startsWith("B")
}
Because with a Sequence you process the data "vertically", when the first element starts with "B", you don't have to map for the rest of the elements. It makes sense here.
So, why is it also faster in the first example?
Let's look at what those two implementations are actually doing:
The List implementation first creates a List in memory with 50 million elements.  This will take a bare minimum of 200MB, since an integer takes 4 bytes.
(In fact, it's probably far more than that.  As Alexey Romanov pointed out, since it's a generic List implementation and not an IntList, it won't be storing the integers directly, but will be ‘boxing’ them — storing references to Int objects.  On the JVM, each reference could be 8 or 16 bytes, and each Int could take 16, giving 1–2GB.  Also, depending how the List gets created, it might start with a small array and keep creating larger and larger ones as the list grows, copying all the values across each time, using more memory still.)
Then it has to read all the values back from the list, filter them, and create another list in memory.
Finally, it has to read all those values back in again, to calculate the average.
The Sequence implementation, on the other hand, doesn't have to store anything!  It simply generates the values in order, and as it does each one it checks whether it's divisible by 3 and if so includes it in the average.
(That's pretty much how you'd do it if you were implementing it ‘by hand’.)
You can see that in addition to the divisibility checking and average calculation, the List implementation is doing a massive amount of memory access, which will take a lot of time.  That's the main reason it's far slower than the Sequence version, which doesn't!
Seeing this, you might ask why we don't use Sequences everywhere…  But this is a fairly extreme example.  Setting up and then iterating the Sequence has some overhead of its own, and for smallish lists that can outweigh the memory overhead.  So Sequences only have a clear advantage in cases when the lists are very large, are processed strictly in order, there are several intermediate steps, and/or many items are filtered out along the way (especially if the Sequence is infinite!).
In my experience, those conditions don't occur very often.  But this question shows how important it is to recognise them when they do!
Leveraging lazy-evaluation allows avoiding the creation of intermediate objects that are irrelevant from the point of the end goal.
Also, the benchmarking method used in the mentioned article is not super accurate. Try to repeat the experiment with JMH.
Initial code produces a list containing 50_000_000 objects:
val list = generateSequence(1) { it + 1 }
.take(50_000_000)
.toList()
then iterates through it and creates another list containing a subset of its elements:
.filter { it % 3 == 0 }
... and then proceeds with calculating the average:
.average()
Using sequences allows you to avoid doing all those intermediate steps. The below code doesn't produce 50_000_000 elements, it's just a representation of that 1...50_000_000 sequence:
val sequence = generateSequence(1) { it + 1 }
.take(50_000_000)
adding a filtering to it doesn't trigger the calculation itself as well but derives a new sequence from the existing one (3, 6, 9...):
.filter { it % 3 == 0 }
and eventually, a terminal operation is called that triggers the evaluation of the sequence and the actual calculation:
.average()
Some relevant reading:
Kotlin: Beware of Java Stream API Habits
Kotlin Collections API Performance Antipatterns

difference between toList().take(10) and take(10).toList() in kotlin

I am just trying the new kotlin language. I came across sequences which generate infinite list. I generated a sequence and tried printing the first 10 elements. But the below code didnt print anything:
fun main(args: Array<String>) {
val generatePrimeFrom2 = generateSequence(3){ it + 2 }
print(generatePrimeFrom2.toList().take(10))
}
But when I changed take(10).toList() in the print statement it work fine. Why is it so ?
This code worked fine for me:
fun main(args: Array<String>) {
val generatePrimeFrom2 = generateSequence(3){ it + 2 }
print(generatePrimeFrom2.take(10).toList())
}
The generateSequence function generates a sequence that is either infinite or finishes when the lambda passed to it returns null. In your case, it is { it + 2 }, which never returns null, so the sequence is infinite.
When you call .toList() on a sequence, it will try to collect all sequence elements and thus will never stop if the sequence is infinite (unless the index overflows or an out-of-memory error happens), so it does not print anything because it does not finish.
In the second case, on contrary, you limit the number of elements in the sequence with .take(10) before trying to collect its items. Then the .toList() call simply collects the 10 items and finishes.
It may become more clear if you check this Q&A about differences between Sequence<T> and Iterable<T>: (link)
Here is the hint -> generate infinite list. In the first solution you first want to create a list (wait infinity) then take first 10 elements.
On the second snippet, from infinite list you take only first 10 elements and change it to list
generatePrimeFrom2.toList() tries to compute/create an infinite-length list.
generatePrimeFrom2.toList().take(10) then takes the first 10 elements from the infinite-length list.
It does not print because it is calculating that infinite-length list.
Whereas, generatePrimeFrom2.take(10) only tries to compute the first 10 elements.
generatePrimeFrom2.take(10).toList() converts the first 10 elements to the list.
You know, generateSequence(3){ it + 2 } does not have the end. So it has the infinite length.
Sequences do not have actual values, calculated when they are needed, but Lists have to have actual values.
I came across sequences which generate infinite list.
This is not actually correct. The main point is that a sequence is not a list. It is a lazily evaluated construct and only the items you request will actually become "materialized", i.e., their memory allocated on the heap.
That's why it's not interchangeable to write
infiniteSeq.toList().take(10)
and
infiniteSeq.take(10).toList()
The former will try to instantiate infinitely many items—and, predictably, fail at it.

VB.NET, best practice for sorting concurrent results of threadpooling?

In short, I'm trying to "sort" incoming results of using threadpooling as they finish. I have a functional solution, but there's no way in the world it's the best way to do it (it's prone to huge pauses). So here I am! I'll try to hit the bullet points of what's going on/what needs to happen and then post my current solution.
The intent of the code is to get information about files in a directory, and then write that to a text file.
I have a list (Counter.ListOfFiles) that is a list of the file paths sorted in a particular way. This is the guide that dictates the order I need to write to the text file.
I'm using a threadpool to collect the information about each file, create a stringbuilder with all of the text ready to write to the text file. I then call a procedure(SyncUpdate, inlcluded below), send the stringbuilder(strBld) from that thread along with the name of the path of the file that particular thread just wrote to the stringbuilder about(Xref).
The procedure includes a synclock to hold all the other threads until it finds a thread passing the correct information. That "correct" information being when the xref passed by the thread matches the first item in my list (FirstListItem). When that happens, I write to the text file, delete the first item in the list and do it again with the next thread.
The way I'm using the monitor is probably not great, in fact I have little doubt I'm using it in an offensively wanton manner. Basically while the xref (from the thread) <> the first item in my list, I'm doing a pulseall for the monitor. I originally was using monitor.wait, but it would eventually just give up trying to sort through the list, even when using a pulse elsewhere. I may have just been coding something awkwardly. Either way, I don't think it's going to change anything.
Basically the problem comes down to the fact that the monitor will pulse through all of the items it has in the queue, when there's a good chance the item I am looking for probably got passed to it somewhere earlier in the queue or whatever and it's now going to sort through all of the items again before looping back around to find a criteria that matches. The result of this is that my code will hit one of these and take a huge amount of time to complete.
I'm open to believing I'm just using the wrong tool for the job, or just not using tool I have correctly. I would strongly prefer some sort of threaded solution (unsurprisingly, it's much faster!). I've been messing around a bit with the Parallel Task functionality today, and a lot of the stuff looks promising, but I have even less experience with that vs. threadpool, and you can see how I'm abusing that! Maybe something with queue? You get the idea. I am directionless. Anything someone could suggest would be much appreciated. Thanks! Let me know if you need any additional information.
Private Sub SyncUpdateResource(strBld As Object, Xref As String)
SyncLock (CType(strBld, StringBuilder))
Dim FirstListitem As String = counter.ListOfFiles.First
Do While Xref <> FirstListitem
FirstListitem = Counter.ListOfFiles.First
'This makes the code much faster for reasons I can only guess at.
Thread.Sleep(5)
Monitor.PulseAll(CType(strBld, StringBuilder))
Loop
Dim strVol As String = Form1.Volname
Dim strLFPPath As String = Form1.txtPathDir
My.Computer.FileSystem.WriteAllText(strLFPPath & "\" & strVol & ".txt", strBld.ToString, True)
Counter.ListOfFiles.Remove(Xref)
End SyncLock
End Sub
This is a pretty typical multiple producer, single consumer application. The only wrinkle is that you have to order the results before they're written to the output. That's not difficult to do. So let's let that requirement slide for a moment.
The easiest way in .NET to implement a producer/consumer relationship is with BlockingCollection, which is a thread-safe FIFO queue. Basically, you do this:
In your case, the producer threads get items, do whatever processing they need to, and then put the item onto the queue. There's no need for any explicit synchronization--the BlockingCollection class implementation does that for you.
Your consumer pulls things from the queue and outputs them. You can see a really simple example of this in my article Simple Multithreading, Part 2. (Scroll down to the third example if you're just interested in the code.) That example just uses one producer and one consumer, but you can have N producers if you want.
Your requirements have a little wrinkle in that the consumer can't just write items to the file as it gets them. It has to make sure that it's writing them in the proper order. As I said, that's not too difficult to do.
What you want is a priority queue of some sort onto which you can place an item if it comes in out of order. Your priority queue can be a sorted list or even just a sequential list if the number of items you expect to get out of order isn't very large. That is, if you typically have only a half dozen items at a time that could be out of order, then a sequential list could work just fine.
I'd use a heap because it performs well. The .NET Framework doesn't supply a heap, but I have a simple one that works well for jobs like this. See A Generic BinaryHeap Class.
So here's how I'd write the consumer (the code is in pseudo-C#, but you can probably convert it easily enough).
The assumption here is that you have a BlockingCollection called sharedQueue that contains the items. The producers put items on that queue. Consumers do this:
var heap = new BinaryHeap<ItemType>();
foreach (var item in sharedQueue.GetConsumingEnumerable())
{
if (item.SequenceKey == expectedSequenceKey)
{
// output this item
// then check the heap to see if other items need to be output
expectedSequenceKey = expectedSequenceKey + 1;
while (heap.Count > 0 && heap.Peek().SequenceKey == expectedSequenceKey)
{
var heapItem = heap.RemoveRoot();
// output heapItem
expectedSequenceKey = expectedSequenceKey + 1;
}
}
else
{
// item is out of order
// put it on the heap
heap.Insert(item);
}
}
// if the heap contains items after everything is processed,
// then some error occurred.
One glaring problem with this approach as written is that the heap could grow without bound if one of your consumers crashes or goes into an infinite loop. But then, your other approach probably would suffer from that as well. If you think that's an issue, you'll have to add some way to skip an item that you think won't ever be forthcoming. Or kill the program. Or something.
If you don't have a binary heap or don't want to use one, you can do the same thing with a SortedList<ItemType>. SortedList will be faster than List, but slower than BinaryHeap if the number of items in the list is even moderately large (a couple dozen). Fewer than that and it's probably a wash.
I know that's a lot of info. I'm happy to answer any questions you might have.