When to use Sequence over List in Kotlin?

Most Kotlin examples and real-world codebases I've seen perform operations over a regular list:
data class Person(val name: String, val age: Int)

fun main() {
    val people = listOf(Person("John", 29), Person("Jane", 31))
    people.filter { it.age > 30 }.map { it.name }
}
The same pipeline can be expressed lazily with a sequence:
people.asSequence().filter { it.age > 30 }.map { it.name }
What would be the real-world scenarios where it makes sense to use Sequence over List, or vice versa?

Intuition says sequences should be better for performance, as they process each item fully before moving to the next. Processing collections seems wasteful by comparison, since every step has to create an intermediate collection.
However, reality is quite different: both solutions have comparable performance, and I believe the potential differences are actually in favor of collections (as of Kotlin 1.8.x). There are several reasons for this:
Collection processing is fully inlined; sequence processing requires calling lambdas.
The implementation for collections is generally simpler, so there is less overhead.
In some cases, e.g. map(), we know the size of the resulting list upfront, so we can allocate space for it. Sequences must grow their result and copy the data as it expands.
Some of these problems could be addressed in the future by making it possible to inline sequence processing; then sequences should generally be superior in terms of performance. For now I would say collections are the default approach, and sequences are for specific cases, for example:
Generating items on demand: generators, loading from disk or the network, infinite sequences, etc.
If processing is resource-heavy (it involves I/O, large amounts of memory, etc.), we probably want to process a single item fully before moving to the next one.
If we use flat maps followed by, say, a filter, sequences let us avoid keeping all items in memory at once. Suppose we have a list of 1,000 items, each flat-maps to 1,000 items, and the filter keeps on average one item per 1,000. With sequences we hold only a few thousand items in memory at any given time; with collections we have to materialize a list of a million items (see the sketch after this list).
If we need to observe progress per item rather than per stage.
There are probably more examples like this. Generally speaking, if you see a reason to process items one by one, sequences allow exactly that.
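To make the flat-map example concrete, here is a minimal runnable sketch; the sizes come from the example above, while the keep-one-in-a-thousand predicate is an arbitrary stand-in:

fun main() {
    val items = List(1_000) { it }

    // Eager: flatMap materializes an intermediate list of 1,000,000
    // elements before filter even starts.
    val eager = items
        .flatMap { i -> List(1_000) { j -> i * 1_000 + j } }
        .filter { it % 1_000 == 0 }

    // Lazy: each element flows through flatMap and filter one at a time,
    // so only the ~1,000 surviving elements are ever accumulated.
    val lazy = items.asSequence()
        .flatMap { i -> (0 until 1_000).asSequence().map { j -> i * 1_000 + j } }
        .filter { it % 1_000 == 0 }
        .toList()

    println(eager == lazy) // true: same result, very different peak memory
}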

Related

OOP design for a data processing pipeline

I have a collection of instances of the same object, like so (in Python - my question is actually language-independent):
class shopping_cart:
    def __init__(self, newID, newproduct, currentdate, dollars):
        shopping_cart.customerID = newID
        shopping_cart.product = newproduct
        shopping_cart.date = currentdate
        shopping_cart.value = dollars
that models what each customer bought, when, and for how much money. Now, in the software I'm writing I need to compute some basic statistics about my customers, and for this I need things like the mean value of all items that were bought, or the mean value of what each individual customer bought. Currently the dataset is very small, so I do this by looping over all instances of my shopping_cart objects and extracting the data from each instance as I need it.
But the data will get huge soon enough, and then looping like that might simply be too slow to produce everyday statistics in time. For different operations I will also need my data organized in structures that offer speed for the range of operations I want to perform in the future (e.g. vectorized data, so that I can use fast algorithms on it).
Is there an OOP design that allows me to refactor the underlying data structures by separating the operations I need to perform on the data from the structure in which the data is stored? (I might have to rewrite my code and redesign my class, but I'd rather do it now, to support such encapsulation, than later, when I might have to go through a much bigger refactoring and rewrite the operations and the data structures together.)
I think your question mixes two different things.
One is decoupling your objects from the methods you want to apply to them. You will be interested in the Visitor pattern for that.
The other is about increasing performance when processing lots of objects. For this you can consider the Pipe and Filter (or Pipeline) pattern, where you partition the objects, process them in parallel execution pipelines, and group the results at the end.
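To sketch the decoupling idea in code (Kotlin here, to match the main question on this page; all names are invented for illustration): hide the storage behind an interface that exposes only the operations, so the backing structure can be replaced without touching the callers.

// Callers depend only on this interface, never on how carts are stored.
interface CartStatistics {
    fun meanValue(): Double
    fun meanValueFor(customerId: Int): Double
}

data class ShoppingCart(val customerId: Int, val product: String,
                        val date: String, val value: Double)

// Naive list-backed implementation; it could later be swapped for a
// columnar/vectorized store without changing the interface above.
class ListBackedStatistics(private val carts: List<ShoppingCart>) : CartStatistics {
    override fun meanValue() = carts.map { it.value }.average()
    override fun meanValueFor(customerId: Int) =
        carts.filter { it.customerId == customerId }.map { it.value }.average()
}

The loop still exists, but it now lives behind the interface, and that seam is exactly what makes the later rewrite cheap.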
As a footnote I think you meant
class shopping_cart:
    def __init__(self, newID, newproduct, currentdate, dollars):
        self.customerID = newID
        self.product = newproduct
        self.date = currentdate
        self.value = dollars
Otherwise you are setting class attributes, not instance attributes.

How do I count occurrences of a property value in a collection?

I have some data that I arrange into a collection of custom class objects.
Each object has a couple of properties aside from its unique name, which I will refer to as batch and exists.
There are many objects in my collection, but only a few possible values of batch (although the number of possibilities is not pre-defined).
What is the easiest way to count occurrences of each possible value of batch?
Ultimately I want to create a userform something like this (values are arbitrary, for illustration):
Batch A 25 parts (2 missing)
Batch B 17 parts
Batch C 16 parts (1 missing)
One of my ideas was to make a custom "batch" class with properties .count and .existcount, and to create a collection of those objects.
I want to know if there is a simpler, more straightforward way to count these values. Should I scrap the idea of a secondary collection and just write some loops and counter variables when I generate my userform?
You described well the two possibilities that you have:
Loop over your collection every time you need the count
Precompute the statistics, and access it when needed
This is a common choice one has to make, and I think here it is a tradeoff between performance and complexity.
Option 1, with a naive loop implementation, takes O(n) time, where n is the size of your collection, and unless your collection is static, you will have to recompute it every time you need your statistics. On the bright side, the naive loop is fairly trivial to write. Performance on frequent queries and/or large collections could suffer.
Option 2 is fast for retrieval, basically O(1). But every time your collection changes, you need to recompute your statistics. This recomputation can be incremental, i.e. you do not have to go through the whole collection, just the changed items, but that means you need to handle every kind of update (new item, deleted item, updated item), so it's a bit more complex than the naive loop. And if your collections are entirely new each time and you query them only once, you have little to gain here.
So it's up to you to decide where to make the tradeoff, according to the parameters of your problem.
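For comparison, option 1 reduces to a couple of grouping calls in a language with standard helpers (Kotlin here, matching the main question on this page; the Part class and the sample data are invented):

data class Part(val name: String, val batch: String, val exists: Boolean)

fun main() {
    val parts = listOf(
        Part("p1", "A", true), Part("p2", "A", false), Part("p3", "B", true)
    )
    // Two linear passes: total parts per batch, then missing parts per batch.
    val counts = parts.groupingBy { it.batch }.eachCount()
    val missing = parts.filter { !it.exists }.groupingBy { it.batch }.eachCount()
    for ((batch, n) in counts.toSortedMap()) {
        val m = missing[batch] ?: 0
        println("Batch $batch $n parts" + if (m > 0) " ($m missing)" else "")
    }
}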

How to represent a binary relation

I plan to make a class that represents a strict partially ordered set, and I assume the most natural way to model its interface is as a binary relation. This gives functions like:
bool test(elementA, elementB); //return true if elementA < elementB
void set(elementA, elementB); //declare that elementA < elementB
void clear(elementA, elementB); //forget that elementA < elementB
and possibly functions like:
void applyTransitivity(); //if test(a,b) and test(b, c), then set(a, c)
bool checkIrreflexivity(); //return true if for no a, a < a
bool checkAsymmetry(); //return true if for no a and b, a < b and b < a
The naive implementation would be to keep a list of pairs such that (a, b) indicates a < b. However, it's probably not optimal; test, for example, would take linear time. Perhaps it would be better done as a hash map of lists.
Ideally, though, the in-memory representation would by its nature keep applyTransitivity "in effect" at all times and not permit the creation of edges that cause reflexivity or symmetry. In other words, the degrees of freedom of the data structure would match the degrees of freedom of a strict poset. Is there a known way to do this? Or, more realistically, is there a means of checking for cycles and maintaining transitivity that is amortized and incremental with each call to set and clear, so that the cost of enforcing correctness stays low? Is there a working implementation?
Okay, let's talk about achieving bare-metal-scraping micro-efficiency, and you can choose how deep down that abyss you want to go. At this architectural level there are no data structures like hash maps and lists; there aren't even data types, just bits and bytes in memory.
As an aside, you'll also find a lot of info on representations by looking into common representations of DAGs. However, most of the common reps are designed more for convenience than for efficiency.
Here, we want the data for a to be fused with its adjacency data in a single memory block. So you want to store the 'list', so to speak, of items that have a relation to a in a's own memory block, so that we can potentially access a and all the elements related to it within a single cache line (bonus points if those related elements also fit in the same cache line, but laying that out optimally is an NP-hard problem).
You can do that by storing, say, 32-bit indices in a. We can model such objects like so, going a little higher level and using C for exemplary purposes:
struct Node
{
    // node data
    ...

    int links[]; // flexible array member: makes this a variable-length struct
};
This makes Node a variable-length structure whose size, and potentially even address, changes, so we need an extra level of indirection for stability and to avoid invalidation: an index to an index (if you control the memory allocator/array and it's purely contiguous), or an index to a pointer (or a reference in some languages).
That still makes your test function a linear-time search, but linear in the number of elements related to a, not in the total number of elements. Because we used a variable-length structure, a and its neighbor indices will potentially fit in a single cache line, and a is likely already in the cache just from making the query.
It's similar to your original idea of a hash map storing lists, but without the overhead of all those lists and without the hash lookup (which may be constant-time but is not nearly as fast as just reading the connections out of the same memory block as a). Most importantly, it's far more cache-friendly, and that's often the difference between a few cycles and hundreds.
Now, this means you still have to roll up your sleeves and check for things like cycles yourself. If you want a data structure that models the problem more directly and conveniently, you'll find a nicer fit in graph structures built around a formalization of a directed edge. However, those are much more convenient than they are efficient.
If you need the container to be generic, so that a can be any given type T, you can always wrap it (using C++ now):
template <class T>
struct Node
{
    T node_data;
    int links[1]; // VLS idiom: not necessarily actually storing 1 element
};
And still fuse all of this into one memory block. We need placement new here to preserve C++ object semantics, and we have to keep an eye on alignment.
Transitivity checks always involve a search of some sort (breadth-first or depth-first). I don't think there's any representation that avoids it, unless you want to memoize/cache a potentially massive explosion of transitive data.
At this point you should have something pretty fast, if you want to go this deep down the abyss and live with a solution that's really hard to maintain and understand. I've unfortunately found that this doesn't impress the ladies very much, as with the case of having a car that goes really fast, but it can make your software go really, really fast.
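As an aside, the variable-length-struct trick is C-specific, but the locality idea carries over to managed runtimes. A sketch in Kotlin (the language used for the other examples on this page): pack every node's neighbor indices into one flat IntArray with per-node offsets, so the linear scan in test walks one contiguous, cache-friendly block.

// CSR-style adjacency: the neighbors of node i occupy the contiguous
// range links[offsets[i] until offsets[i + 1]].
class Relation(private val offsets: IntArray, private val links: IntArray) {

    // test(a, b): linear in the number of elements related to a,
    // scanning a single contiguous range.
    fun test(a: Int, b: Int): Boolean {
        for (k in offsets[a] until offsets[a + 1]) {
            if (links[k] == b) return true
        }
        return false
    }
}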

Merge Sort in a data store?

I'm trying to make a "friend stream" for the project I'm working on. I have individual user streams saved in Redis ZSETs. Something like:
key : { stream_id : time }
user1-stream: { 1:9931112, 3:93291, 9:9181273, ...}
user2-stream: { 4:4239191, 2:92919, 7:3293021, ...}
user3-stream: { 8:3299213, 5:97313, 6:7919921, ...}
...
user4-friends: [1,2,3]
Right now, to make user4's friend stream, I would call:
ZUNIONSTORE user4-friend-stream, [user1-stream, user2-stream, user3-stream]
However, ZUNIONSTORE is slow when you try to merge ZSETs totaling more than one or two thousand elements.
I'd really love to have Redis do a merge sort on the ZSETs and limit the results to a few hundred elements. Are there any off-the-shelf data stores that will do what I want? If not, is there any kind of framework for developing Redis-like data stores?
I suppose I could just fork Redis and add the function I need, but I was hoping to avoid that.
People tend to think that a zset is just a skip list. This is wrong: it is a skip list (an ordered data structure) plus a non-ordered dictionary (implemented as a hash table). The semantics of a merge operation would also have to be defined; for instance, how would you merge non-disjoint zsets whose common items do not have the same score?
To implement a merge algorithm for ZUNIONSTORE, you would have to get the items in order (easy with the skip lists) and merge them while building the output (which happens to be a zset as well: a skip list plus a dictionary).
Because the cardinality of the result cannot be guessed at the start of the algorithm, I don't think this skip list + dictionary can be built in linear time. It will be O(n log n) at best. So the merge is linear, but building the output is not, which defeats the benefit of using a merge algorithm.
Now, if you want to implement a ZUNION (i.e. directly returning the result rather than building it as a zset) and limit the result to a given number of items, a merge algorithm makes sense.
RDBMSs supporting merge joins can typically do it (though usually not very efficiently, due to the cost of random I/O). I'm not aware of a NoSQL store supporting similar capabilities.
To implement it in Redis, you could try a server-side Lua script, but it may be complex, and I think it will only be efficient if the zsets are much larger than the limit passed to the zunion. In that case, the limit on the number of items will offset the overhead of running interpreted Lua code.
The last possibility is to implement it in C in the Redis source code, which is not that difficult. The drawback is the burden of maintaining a patch for the Redis versions you use. Redis itself provides no framework for this, and the idea of defining Redis plugins (isolated from the Redis source code) has generally been rejected by the author.
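For illustration, here is roughly what the limited merge looks like once the score-ordered member/score pairs have been fetched, sketched in Kotlin on the client side (this is not a Redis API; duplicate members across zsets and score aggregation are ignored for brevity, and the comparator can be flipped for newest-first):

import java.util.PriorityQueue

// K-way merge of streams sorted by score, stopping after `limit` items.
// Each pair is (memberId, score); each input list is sorted by score.
fun limitedUnion(streams: List<List<Pair<Long, Long>>>, limit: Int): List<Pair<Long, Long>> {
    // Heap entries are (streamIndex, positionInStream), ordered by the
    // score of the element they point at.
    val heap = PriorityQueue<Pair<Int, Int>>(compareBy { streams[it.first][it.second].second })
    for (i in streams.indices) if (streams[i].isNotEmpty()) heap.add(i to 0)

    val out = ArrayList<Pair<Long, Long>>(limit)
    while (out.size < limit && heap.isNotEmpty()) {
        val (s, p) = heap.poll()
        out.add(streams[s][p]) // emit the smallest remaining score
        if (p + 1 < streams[s].size) heap.add(s to p + 1)
    }
    return out // cost: O(limit * log k) for k input streams
}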

What is the cost of object reference in Scala?

Assume we build an object to represent some network (social, wireless, whatever). So we have some 'node' object to represent the KIND of network, different nodes might have different behaviors and so forth. The network has a MutableList of nodes.
But each node has neighbors, and these neighbors are also nodes. So somewhere there has to be a list, per node, of all the neighbors of that node - or such a list has to be generated on the fly whenever it is needed. If the list of neighbors is stored in the node objects, is it cheaper to store it (a) as a list of nodes, or (b) as a list of numbers that can be used to reference nodes out of the network?
Some code for clarity:
//approach (a)
class Network {
    val nodes = new MutableList[Node]
    // other stuff //
}
class Node {
    val neighbors = new MutableList[Node]
    // other stuff //
}

//approach (b)
class Network {
    val nodes = new MutableList[Node]
    val indexed_list = //(some function to get an indexed list of nodes)
    //other stuff//
}
class Node {
    val neighbors = new MutableList[Int]
    //other stuff//
}
Approach (a) seems like the easiest. My first question is whether this is costly in Scala 2.8, and the second is whether it breaks the DRY principle.
Short answer: premature optimization is the root of all evil, etc. Use the clean reference-based approach; when you have performance issues, there's no substitute for profiling and benchmarking.
Long answer: Scala uses exactly the same reference machinery as Java, so this is really a JVM question more than a Scala question. Formally, the JVM spec doesn't say a word about how references are implemented. In practice they tend to be word-sized or smaller pointers that either point to an object or index into a table that points to the object (the latter helps garbage collectors).
Either way, an array of refs is about the same size as an array of ints on a 32-bit VM, and about double on a 64-bit VM (unless compressed oops are used). That doubling might matter to you, or it might not.
If you go with the ref-based approach, each traversal from a node to a neighbor is one reference indirection. With the int-based approach, each traversal is a table lookup followed by a reference indirection, so the int approach is computationally more expensive. And that's assuming you put the ints into a collection that doesn't box them; if you do box the ints, it's pure craziness, because now you've got just as many references as the original AND a table lookup.
Anyway, if you go with the reference-based approach, the extra references can make a bit of extra work for the garbage collector. If the only references to nodes lie in one array, then the GC will scan it pretty damn fast; if they're scattered all over a graph, the GC will have to work harder to track them all down. That may or may not affect your needs.
From a cleanliness standpoint the ref-based approach is much nicer, so go with it and then profile to see where you're spending your time. That, or benchmark both approaches.
The question is: what kind of cost? Memory-wise, approach (b) would probably end up consuming more, since you have both the mutable lists and boxed integers in those lists, plus another global structure holding all the indices. It would probably also be slower, because you need several levels of indirection to reach a neighbour node.
One important note: as soon as you start storing integers in mutable lists, they undergo boxing, so you end up with a list of heap objects in both cases. To avoid this, and further conserve memory, in approach (b) you would have to keep a dynamically grown array of raw integers holding the indices of the neighbours.
Now, even if you modify approach (b) as suggested above and make sure the indexed list in the Network class is really an efficient structure (a direct lookup table or a hash table), you would still pay an indirection cost to find your Node, and memory consumption would still be higher. The only benefit I see is keeping some sort of table of weak references if you're concerned you might run out of memory, recreating a Node object whenever you need it and cannot find it in the indexed_list of weak references.
This is, of course, just a hypothesis; you would have to profile/benchmark your code to see the difference.
My suggestion would be to use something like an ArrayBuffer in Node and use it to store direct references to nodes.
If memory is a concern and you want approach (b) together with weak references, then I would further suggest rolling your own dynamically grown integer array for the neighbours, to avoid the boxing of ArrayBuffer[Int].
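To put the two layouts side by side, here is a sketch in Kotlin (matching the other examples on this page; the class names are illustrative): direct references cost one indirection per hop, while indices cost a table lookup plus an indirection, and a primitive IntArray is what avoids the boxing both answers warn about.

// (a) Direct references: one indirection per traversal, nicest to read.
class RefNode(val neighbors: MutableList<RefNode> = mutableListOf())

// (b) Indices into the network's node table: a table lookup plus an
// indirection per hop, but IntArray holds primitives, so no boxing.
class IdxNode(var neighbors: IntArray = IntArray(0))

class Network(val nodes: MutableList<IdxNode> = mutableListOf()) {
    fun neighborsOf(i: Int): List<IdxNode> = nodes[i].neighbors.map { nodes[it] }
}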