OOP design for a data processing pipeline

I have a collection of instances of the same object, like so (in Python - my question is actually language independent)
class shopping_cart:
    def __init__(self, newID, newproduct, currentdate, dollars):
        shopping_cart.customerID = newID
        shopping_cart.product = newproduct
        shopping_cart.date = currentdate
        shopping_cart.value = dollars
that models what each customer bought, when, and for how much money. Now, in the software that I'm writing I need to compute some basic statistics about my customers and for this I need to compute things like the mean value of all items that were bought - or the mean value each single customer bought. Currently the dataset is very small, so I do this by looping over all instances of my shopping_cart objects and extracting the data from each instance as I need it.
But the data will get huge soon enough, and then looping like that may simply be too slow to produce the everyday statistics in time. For different operations I will need my data organized in structures that offer speed for the range of operations I want to perform in the future (e.g. vectorized data, so that I can make use of fast algorithms operating on it).
Is there an OOP design that would let me refactor the underlying data structures later, by separating the operations I need to perform on the data from the structure in which the data is stored? (I might have to rewrite my code and redesign my class, but I'd rather do it now to support such encapsulation than later, when I might face a much bigger refactoring because the operations and the data structures have to be rewritten together.)

I think your question mixes two different things.
One is decoupling your objects from the methods you want to apply to them. You will be interested in the Visitor pattern for that.
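Since the question is language independent, here is a rough, hedged sketch of the Visitor idea in Java (the names ShoppingCart and MeanValueVisitor are illustrative, and the date field is omitted for brevity). The point is that each statistic lives in its own visitor class, so you can later change how the carts are stored without touching the statistics code:
interface CartVisitor {
    void visit(ShoppingCart cart);
}

class ShoppingCart {
    final int customerId;
    final String product;
    final double value;

    ShoppingCart(int customerId, String product, double value) {
        this.customerId = customerId;
        this.product = product;
        this.value = value;
    }

    void accept(CartVisitor visitor) {
        visitor.visit(this);
    }
}

// One concrete operation; new statistics are added as new visitor classes,
// not as new methods on ShoppingCart.
class MeanValueVisitor implements CartVisitor {
    private double total = 0;
    private int count = 0;

    public void visit(ShoppingCart cart) {
        total += cart.value;
        count++;
    }

    double mean() {
        return count == 0 ? 0 : total / count;
    }
}
Feeding the whole collection through a visitor is then a single loop that knows nothing about how the statistic is computed (carts stands for whatever collection of ShoppingCart instances you keep):
MeanValueVisitor meanVisitor = new MeanValueVisitor();
for (ShoppingCart cart : carts) {
    cart.accept(meanVisitor);
}
double meanValue = meanVisitor.mean();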
The other is about increasing performance when processing lots of objects. For this you can consider the Pipe and Filter (or Pipeline) pattern, where you partition the objects, process them in parallel execution pipelines, and combine the results at the end.
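As a hedged illustration of that partition / process / combine shape (not a full Pipe and Filter implementation), Java's parallel streams already split a collection into chunks, process the chunks on separate threads, and merge the partial results:
import java.util.List;

class CartStats {
    // Mean value of all carts: the list is split, chunks are summed in parallel,
    // and the partial results are combined.
    static double meanValue(List<ShoppingCart> carts) {
        return carts.parallelStream()
                .mapToDouble(cart -> cart.value)
                .average()
                .orElse(0.0);
    }
}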
As a footnote I think you meant
class shopping_cart:
    def __init__(self, newID, newproduct, currentdate, dollars):
        self.customerID = newID
        self.product = newproduct
        self.date = currentdate
        self.value = dollars
Otherwise you are setting class members, not instance members.

Related

When to use Sequence over List in Kotlin?

Most Kotlin examples and real-world codebases I've seen perform operations over a regular list.
data class Person(val name: String, val age: Int)
fun main() {
    val people = listOf(Person("John", 29), Person("Jane", 31))
    people.filter { it.age > 30 }.map { it.name }
}
The same pipeline can also be written lazily over a sequence:
people.asSequence().filter { it.age > 30 }.map { it.name }
What would be the real-world scenarios where it makes sense to use Sequence over List, or vice versa?
Intuition says sequences should be better for performance, as they process a single item fully before going on to the next one. Processing with collections looks like a huge waste of resources, since we have to create multiple intermediate collections along the way.
However, reality is quite different: both solutions have comparable performance, and I believe the potential differences are actually in favor of collections (Kotlin 1.8.x). There are several reasons for this:
Collection processing is fully inlined; sequence processing requires calling lambdas through function objects.
The implementation for collections is generally simpler, so there is less overhead.
In some cases, e.g. map(), we know the size of the resulting list upfront, so we can allocate space for it. With sequences, the result has to grow by copying the data.
Some of these problems could be addressed in the future by making it possible to inline sequence processing; then sequences should generally be superior in terms of performance. For now I would say collections are the default approach, and we use sequences in very specific cases, for example:
Generating items on demand: generators, loading from disk/network, infinite sequences, etc.
If processing is resource-heavy (it requires some I/O, big amounts of memory, etc.), we probably want to process a single item fully before going on to the next one.
If we use flat maps and then e.g. filter, sequences let us avoid keeping all items in memory at once. For example, say we have a list of 1000 items, each flat-maps to 1000 items, and we then filter, keeping on average only one item per 1000. Using sequences we only keep a few thousand items in memory at any given time; using collections, we have to create an intermediate list of a million items.
If we need to observe progress per item rather than per stage.
There are probably more examples like this. Generally speaking, if you see a reason to process items one-by-one, sequences allow exactly this.

Best way to lookup elements in GemFire Region

I have Regions in GemFire with a large number of records.
I need to look up elements in those Regions for validation purposes. The lookup happens for every item we scan; there can be more than 10,000 items.
What would be an efficient way to look up elements in Regions?
Please suggest.
Vikas-
There are several ways in which you can look up, or fetch multiple elements from a GemFire Region.
A GemFire Region indirectly implements java.util.Map, and so provides all the basic Map operations, such as get(key):value, in addition to several operations that are not available on Map, like getAll(Collection keys):Map.
get(key):value is not going to be the most "efficient" method for looking up multiple items at once, but getAll(..) allows you to pass in a Collection of keys for all the values you want returned. Of course, you have to know the keys of all the values you want in advance, so...
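A minimal sketch of that, assuming a Region named /Employees with String keys and Employee values (the names are illustrative):
Region<String, Employee> employees = cache.getRegion("/Employees");
Set<String> keys = new HashSet<>(Arrays.asList("emp-1", "emp-2", "emp-3")); // keys known up front
Map<String, Employee> values = employees.getAll(keys); // one call instead of one get(..) per key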
You can obtain GemFire's QueryService from the Region by calling region.getRegionService().getQueryService(). The QueryService allows you to write GemFire Queries with OQL (or Object Query Language). See GemFire's User Guide on Querying for more details.
The advantage of using OQL over getAll(keys) is, of course, you do not need to know the keys of all the values you might need to validate up front. If the validation logic is based on some criteria that matches the values that need to be evaluated, you can express this criteria in the OQL Query Predicate.
For example...
SELECT * FROM /People p WHERE p.age >= 21;
To call upon the GemFire QueryService to write the query above, you would...
Region people = cache.getRegion("/People");
...
QueryService queryService = people.getRegionService().getQueryService();
Query query = queryService.newQuery("SELECT * FROM /People p WHERE p.age >= $1");
SelectResults<Person> results = (SelectResults<Person>) query.execute(new Object[] { 21 });
// process (e.g. validate) the results
OQL Queries can be parameterized and arguments passed to the Query.execute(args:Object[]) method as shown above. When appropriate Indexes are added to your GemFire Regions, the performance of your Queries can improve dramatically. See the GemFire User Guide on creating Indexes.
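For example, a hedged sketch of creating an index on the field used in the predicate above (the index name is illustrative; createIndex throws checked exceptions, so wrap it in a try/catch or declare them):
QueryService queryService = cache.getQueryService();
// Index the age field referenced by the WHERE clause of the query above
queryService.createIndex("ageIdx", "p.age", "/People p");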
Finally, with GemFire PARTITION Regions especially, where your Region data is partitioned, or "sharded", and distributed across the nodes (GemFire Servers) in the cluster that host the Region of interest (e.g. /People), you can combine querying with GemFire's Function Execution service to query the data locally on each node, where the data actually exists (i.e. the shard/bucket of the PARTITION Region containing a subset of the data), rather than bringing the data to you. You can even encapsulate the "validation" logic in the GemFire Function you write.
You will need to use the RegionFunctionContext along with the PartitionRegionHelper to get the local data set of the Region to query. Read the Javadoc of PartitionRegionHelper as it shows the particular example you are looking for in this case.
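A hedged sketch of such a data-aware Function (assuming Apache Geode / GemFire 9+ style packages and APIs, a Person value type, and an age-based check standing in for your real validation logic):
import org.apache.geode.cache.Region;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.partition.PartitionRegionHelper;

public class ValidatePeopleFunction implements Function {

    public void execute(FunctionContext context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        // Only the buckets of /People hosted on this member, not the whole logical Region
        Region<Object, Person> localPeople = PartitionRegionHelper.getLocalDataForContext(rfc);

        int invalidCount = 0;
        for (Person person : localPeople.values()) {
            if (person.getAge() < 0) { // stand-in for your validation logic
                invalidCount++;
            }
        }
        // Each member returns its own partial result
        context.getResultSender().lastResult(invalidCount);
    }

    public String getId() {
        return "validatePeople";
    }
}
You would then invoke it with FunctionService.onRegion(peopleRegion).execute(new ValidatePeopleFunction()) and collect the per-member results from the returned ResultCollector.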
Spring Data GemFire can help with many of these concerns...
For Querying, you can use the SD Repository abstraction and extension provided in SDG.
For Function Execution, you can use SD GemFire's Function execution annotation support.
Be careful though: using the SD Repository abstraction inside a Function context will not automatically limit the query to the "local" data set of the PARTITION Region. SD Repositories always work on the entire data set of the "logical" Region, whose data is distributed across the nodes in the cluster in a partitioned (sharded) setup.
You should definitely familiarize yourself with GemFire Partitioned Regions.
In summary...
The approach you choose above really depends on several factors, such as, but not limited to:
How you organized the data in the first place (e.g. PARTITION vs. REPLICATE, which refers to the Region's DataPolicy).
How amenable your validation logic is to supplying "criteria" to, say, an OQL Query Predicate to "SELECT" only the Region data you want to validate. Additionally, efficiency might be further improved by applying appropriate Indexing.
How many nodes are in the cluster and how distributed your data is; in that case a Function might be the most advantageous approach, i.e. bring the logic to your data rather than the data to your logic. The latter means selecting the matching data on the nodes where it resides, which could involve several network hops to the nodes containing the data (depending on your topology and configuration, e.g. "single-hop access"), serializing the data to send it over the wire, thereby increasing the load on your network, and so on.
Depending on your use case, other factors to consider are your expiration/eviction policies (e.g. whether data has been overflowed to disk), the frequency of the needed validations based on how often the data changes, etc.
Most of the time, it is better to validate the data on the way in and catch errors early. Naturally, as data is updated, you may also need to perform subsequent validations, but that is no substitute for early (as possible) verifications where possible.
There are many factors to consider and the optimal approach is not always apparent, so test and make sure your optimizations and overall approach have the desired effect.
Hope this helps!
Regards,
-John
Set up the PDX serializer and use the query service to get your element. "Select element from /region where id=xxx". This will return your element field without deserializing the record. Make sure that id is indexed.
There are other ways to validate quickly if your inbound data is streaming rather than a client lookup, such as the Function Service.

How do I count occurrences of a property value in a collection?

I have some data that I arrange into a collection of custom class objects.
Each object has a couple of properties aside from its unique name, which I will refer to as batch and exists.
There are many objects in my collection, but only a few possible values of batch (although the number of possibilities is not pre-defined).
What is the easiest way to count occurrences of each possible value of batch?
Ultimately I want to create a userform something like this (values are arbitrary, for illustration):
Batch A 25 parts (2 missing)
Batch B 17 parts
Batch C 16 parts (1 missing)
One of my ideas was to make a custom "batch" class, which would have properties .count and .existcount and create a collection of those objects.
I want to know if there is a simpler, more straightforward way to count these values. Should I scrap the idea of a secondary collection and just create some loops and counter variables when I generate my userform?
You described well the two possibilities that you have:
Loop over your collection every time you need the count
Precompute the statistics, and access it when needed
This is a common choice one has to make, and here it is a tradeoff between performance and complexity.
Option 1, with a naive loop implementation, takes O(n) time, where n is the size of your collection. And, unless your collection is static, you will have to recompute it every time you need your statistics. On the bright side, the naive loop is fairly trivial to write; performance on frequent queries and/or large collections could suffer, though.
Option 2 is fast for retrieval, O(1) basically. But every time your collection changes, you need to update your statistics. This updating is incremental, i.e. you do not have to go through the whole collection, just over the changed items. It does mean you need to handle every kind of update (new item, deleted item, updated item), so it is a bit more complex than the naive loop. And if your collections are entirely new all the time, and you query them only once, you have little to gain here.
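A hedged sketch of option 2 (written in Java, since the idea is language independent; Item, getBatch() and isExists() are hypothetical stand-ins for your custom class):
import java.util.HashMap;
import java.util.Map;

class Item {
    private final String batch;
    private final boolean exists;

    Item(String batch, boolean exists) {
        this.batch = batch;
        this.exists = exists;
    }

    String getBatch() { return batch; }
    boolean isExists() { return exists; }
}

// Keeps per-batch counts up to date as items are added to or removed from the collection.
class BatchStats {
    private final Map<String, Integer> countByBatch = new HashMap<>();
    private final Map<String, Integer> missingByBatch = new HashMap<>();

    void onItemAdded(Item item) {
        countByBatch.merge(item.getBatch(), 1, Integer::sum);
        if (!item.isExists()) {
            missingByBatch.merge(item.getBatch(), 1, Integer::sum);
        }
    }

    void onItemRemoved(Item item) {
        countByBatch.merge(item.getBatch(), -1, Integer::sum);
        if (!item.isExists()) {
            missingByBatch.merge(item.getBatch(), -1, Integer::sum);
        }
    }

    int count(String batch) { return countByBatch.getOrDefault(batch, 0); }
    int missing(String batch) { return missingByBatch.getOrDefault(batch, 0); }
}
Updating an item is then a remove followed by an add, and the userform just reads count(batch) and missing(batch) in O(1).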
So it is up to you to decide where to make the tradeoff, according to the parameters of your problem.

How to represent a binary relation

I plan to make a class that represents a strict partially ordered set, and I assume the most natural way to model its interface is as a binary relation. This gives functions like:
bool test(elementA, elementB); //return true if elementA < elementB
void set(elementA, elementB); //declare that elementA < elementB
void clear(elementA, elementB); //forget that elementA < elementB
and possibly functions like:
void applyTransitivity(); //if test(a,b) and test(b, c), then set(a, c)
bool checkIrreflexivity(); //return true if for no a, a < a
bool checkAsymmetry(); //return true if for no a and b, a < b and b < a
The naive implementation would be to have a list of pairs such that (a, b) indicates a < b. However, it's probably not optimal. For example, test would be linear time. Perhaps it could be better done as a hash map of lists.
Ideally, though, the in-memory representation would by its nature enforce applyTransitivity to always be "in effect" and not permit the creation of edges that cause reflexivity or symmetry. In other words, the degrees of freedom of the data structure would represent exactly the degrees of freedom of a strict poset. Is there a known way to do this? Or, more realistically, is there a way of checking for cycles and maintaining transitivity that is amortized and incremental with each call to set and clear, so that the cost of enforcing correctness stays low? Is there a working implementation?
Okay, let's talk about achieving bare metal-scraping micro-efficiency, and you can choose how deep down that abyss you want to go. At this architectural level, there are no data structures like hash maps and lists, there aren't even data types, just bits and bytes in memory.
As an aside, you'll also find a lot of info on representations here by looking into common representations of DAGs. However, most of the common reps are designed more for convenience than efficiency.
Here, we want the data for a to be fused with that adjacency data into a single memory block. So you want to store the 'list', so to speak, of items that have a relation to a in a's own memory block so that we can potentially access a and all the elements related to a within a single cache line (bonus points if those related elements might also fit in the same cache line, but that's an NP-hard problem).
You can do that by storing, say, 32-bit indices in a. We can model such objects like so if we go a little higher level and use C for exemplary purposes:
struct Node
{
    // node data
    ...
    int links[]; // variable-length struct
};
This makes the Node a variable-length structure whose size and potentially even address changes, so we need an extra level of indirection to get stability and avoid invalidation, like an index to an index (if you control the memory allocator/array and it's purely contiguous), or an index to a pointer (or reference in some languages).
That makes your test function still involve a linear time search, but linear with respect to the number of elements related to a, not the number of elements total. Because we used a variable-length structure, a and its neighbor indices will potentially fit in a single cache line, and it's likely that a will already be in the cache just to make the query.
It's similar to the basic idea you had of the hash map storing lists, but without the explosion of lists overhead and without the hash lookup (which may be constant time but not nearly as fast as just accessing the connections to a from the same memory block). Most importantly, it's far more cache-friendly, and that's often going to make the difference between a few cycles and hundreds.
Now this means that you still have to roll up your sleeves and check for things like cycles yourself. If you want a data structure that more directly and conveniently models the problem, you'll find a nicer fit with graph data structures revolving around a formalization of a directed edge. However, those are much more convenient than they are efficient.
If you need the container to be generic and a can be any given type, T, then you can always wrap it (using C++ now):
template <class T>
struct Node
{
    T node_data;
    int links[1]; // VLS, not necessarily actually storing 1 element
};
And still fuse this all into one memory block this way. We need placement new here to preserve the C++ object semantics, and we may need to keep an eye on alignment.
Transitivity checks always involve a search of some sort (breadth-first or depth-first). I don't think there's any rep that avoids this unless you want to memoize/cache a potentially massive explosion of transitive data.
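To make that concrete, here is a hedged, language-agnostic illustration (written in Java rather than against the packed layout above): answering test(a, b) under transitivity is just a reachability search over the direct relations.
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class Reachability {
    // adjacency maps each element to the elements it is directly "less than"
    static boolean lessThan(Map<Integer, List<Integer>> adjacency, int a, int b) {
        Deque<Integer> frontier = new ArrayDeque<>(adjacency.getOrDefault(a, Collections.emptyList()));
        Set<Integer> visited = new HashSet<>();
        while (!frontier.isEmpty()) {
            int current = frontier.pop();
            if (current == b) {
                return true; // b reachable from a via at least one edge, so a < b
            }
            if (visited.add(current)) {
                frontier.addAll(adjacency.getOrDefault(current, Collections.emptyList()));
            }
        }
        return false;
    }
}
A cycle check (and hence the asymmetry/irreflexivity checks) falls out of the same search: the structure is invalid if any element can reach itself.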
At this point you should have something pretty fast if you want to go this deep down the abyss and have a solution that's really hard to maintain and understand. I've unfortunately found that this doesn't impress the ladies very much as with the case of having a car that goes really fast, but it can make your software go really, really fast.

Main-Memory Secondary-Memory Objects

I have a situation where I want to do some DB-related operations in a Java application (e.g. in Eclipse). I use MySQL as an RDBMS and Hibernate as an ORM provider.
I retrieve all records using embedded SQL in Java:
// Define connections ...etc
ResultSet result = myStmt.executeQuery("SELECT * FROM employees");
// iterator
I retrieve all records using Hibernate ORM / JPQL:
// Connections, EntityManager ...etc
List result = em.createQuery("SELECT emp FROM Employees emp").getResultList();
// iterator
I know that the RDBMS stores its data in secondary memory (disk). The question is: when I get both results back, where are the employees actually, in secondary memory (SM) or in main memory (MM)?
In the end I want to have two object populations for further testing, one operating on SM and one on MM. How is this possible?
Thanks
Frank
Your Java objects are real Java objects; they are in (to use your term) MM, at least for a while. The beauty of the Hibernate/JPA programming model is that while in MM you can pretty much treat the objects as if they were any other Java object, make a few changes to them, etc. Then, at some agreed time, Hibernate's persistence mechanism gets them back to SM (disk).
You will need to read up on the implications of Sessions and Transactions in order to understand when the transitions between MM and SM occur, and also very importantly, what happens if two users want to work with the same data at the same time.
Maybe start here
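A minimal, hedged sketch of that lifecycle with plain JPA (the Employee entity, its salary accessor, its Long id, and the persistence unit name "demo" are illustrative assumptions):
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class MmSmDemo {
    public static void main(String[] args) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("demo");
        EntityManager em = emf.createEntityManager();

        em.getTransaction().begin();
        Employee emp = em.find(Employee.class, 1L); // row loaded from SM into a managed object in MM
        emp.setSalary(emp.getSalary() * 1.1);       // change happens only in MM for now
        em.getTransaction().commit();               // flush on commit writes the change back to SM

        em.close();
        emf.close();
    }
}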
It is also possible to create objects in MM that are (at least for now) not related to any data on disk - these are "transient" objects - and also to "disconnect" (detach) data in memory from what's on disk.
My bottom line here is that Hibernate/JPA does remove much of the grunt work from persistence coding, but it cannot hide the complexity of scale: as your data volumes increase, your data model's complexity grows, and your users' actions contend for data, you need to do serious thinking. Hibernate allows you to achieve good things, but it can't do that thinking for you; you have to make careful choices as your problem domain gets more complex.