Parallelizing L2S Entity Retrieval - sql

Assuming a typical domain entity approach with SQL Server and a dbml/L2S DAL with a logic layer on top of that:
In situations where lazy loading is not an option, I have settled on a convention where getting a list of entities does not also get each item's child entities (no loading), but getting a single entity does (eager loading).
Since getting a single entity also gets children, it causes a cascading effect in which each child then gets its children too. This sounds bad, but as long as the model is not too deep, I usually don't see performance problems that outweigh the benefits of the ease of use.
So if I want to get a list in which each of the items is fully hydrated with children, I combine the GetList and GetItem methods. So I'll get a list and then loop through it getting each item with the full cascade. Even this is generally acceptable in many of the projects I've worked on - but I have recently encountered situations with larger models and/or more data in which it needs to be more efficient.
I've found that partitioning the loop and executing it on multiple threads yields excellent results. In my first experiment with a list of 50 items from one particular project, I did 5 threads of 10 items each and got a 3X improvement in time.
Of course, mileage will vary by project, but all else being equal this is clearly a big opportunity. Before I go further, though, I'd like to hear from others who have already been through this. What are some good approaches to parallelizing this type of thing?

Usually it is faster to make a single database call that returns a set of records.
This recordset can "hydrate" the top-level objects, then another recordset can load child objects. I'm not sure why your situation does not allow lazy loading, but this method is essentially lazy loading, and will almost certainly be faster than making multiple calls to the database that each return a single record.
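The two-recordset idea can be sketched as follows. This is a minimal illustration in Java, not the asker's L2S code: the ParentRow, ChildRow and Parent types are hypothetical, and the two lists stand in for the results of one parent query and one child query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical entity and row types, used only for illustration.
final class BatchHydration {
    record ParentRow(int id, String name) {}
    record ChildRow(int id, int parentId, String name) {}

    static final class Parent {
        final int id;
        final String name;
        final List<ChildRow> children = new ArrayList<>();
        Parent(int id, String name) { this.id = id; this.name = name; }
    }

    // Builds hydrated parents from two flat result sets:
    // one query for the parents, one query for all of their children.
    static List<Parent> hydrate(List<ParentRow> parentRows, List<ChildRow> childRows) {
        Map<Integer, Parent> byId = new LinkedHashMap<>();
        for (ParentRow row : parentRows) {
            byId.put(row.id(), new Parent(row.id(), row.name()));
        }
        for (ChildRow child : childRows) {
            Parent parent = byId.get(child.parentId());
            if (parent != null) {   // skip children whose parent is not in the list
                parent.children.add(child);
            }
        }
        return new ArrayList<>(byId.values());
    }
}
```

With this shape, the number of database round trips depends on the depth of the model rather than on the number of items in the list.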
You could make asynchronous calls to the database so that multiple queries are running in parallel. If you combine this with the first strategy for each "layer" of the model, and write a somewhat more complex hydration function based on multiple-record return sets, you should see that the database handles concurrent connections very well (which is why you see a performance gain from using multiple threads).
But you don't need to explicitly create threads - check out the asynchronous methods of a SqlCommand.
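As a rough sketch of running the per-layer queries concurrently without managing threads by hand: the answer refers to SqlCommand in .NET, whereas this example uses Java's CompletableFuture as a stand-in, and the two Supplier arguments are hypothetical placeholders for the real parent and child queries.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

final class ParallelLayers {
    // Result of loading two "layers" of the model concurrently.
    record Layers(List<String> parents, List<String> children) {}

    // Runs the two hypothetical queries concurrently and waits for both.
    static Layers loadBoth(Supplier<List<String>> parentQuery,
                           Supplier<List<String>> childQuery) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<List<String>> parents =
                    CompletableFuture.supplyAsync(parentQuery, pool);
            CompletableFuture<List<String>> children =
                    CompletableFuture.supplyAsync(childQuery, pool);
            // join() blocks until each query has finished.
            return new Layers(parents.join(), children.join());
        } finally {
            pool.shutdown();
        }
    }
}
```

A pool is still used underneath here; the point is only that the calling code expresses "start both, then wait for both" rather than partitioning a loop across threads itself.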


Synchronized collection that blocks on every method

I have a collection that is commonly used between different threads. In one thread I need to add items, remove items, retrieve items and iterate over the list of items. What I am looking for is a collection that blocks access to any of its read/write/remove methods whenever any of these methods are already being called. So if one thread retrieves an item, another thread has to wait until the reading has completed before it can remove an item from the collection.
Kotlin doesn't appear to provide this. However, I could create a wrapper class that provides the synchronization I'm looking for. Java does offer Collections.synchronizedList, but from what I read, it only blocks concurrent calls to the same method, meaning that no two threads can remove an item at the same time but one can remove while the other reads an item (which is what I am trying to avoid).
Are there any other solutions?
A wrapper such as the one returned by Collections.synchronizedList synchronizes calls to every method, using the wrapper itself as the lock. So one thread would be blocked from calling get(), say, while another thread is currently calling add(). (This is what the question seems to ask for.)
However, as the docs to that method point out, this does nothing to protect sequences of calls, such as you might use when iterating through a collection. If another thread changes the collection in between your calls to next(), then anything could happen. (This is what I think the question is really about!)
To handle that safely, your options include:
Manual synchronization. Surround each sequence of calls to the collection in a synchronized block that synchronises on the collection, e.g.:
    import java.util.Collections

    val list = Collections.synchronizedList(mutableListOf<String>())
    // …
    synchronized(list) {
        for (i in list) {
            // …
        }
    }
This is straightforward, and relatively easy to do if the collection is under your control. But if you miss any sequences, then you could get unexpected behaviour. Also, you'll need to keep your sequences short, to avoid holding the lock for an extended time and affecting performance.
Use a concurrent collection implementation which provides primitives letting you do all the processing you need in a single call, avoiding iteration and other sequences.
For maps, Java provides very good support with its ConcurrentMap interface, and high-performance implementations such as ConcurrentHashMap. These have methods allowing you to iterate, update single or multiple mappings, search, reduce, and many other whole-map operations in a single call, avoiding any concurrency problems.
For sets (as per this question) you can use a ConcurrentSkipListSet, or you can create one from a ConcurrentHashMap with newKeySet().
For lists (as per this question), there are fewer options. (I think concurrent lists are much less commonly needed.) If you don't need random access, ConcurrentLinkedQueue may suffice. Or if modification is much less common than iteration, CopyOnWriteArrayList could work.
There are many other concurrent classes in the java.util.concurrent package, so it's well worth looking through to see if any of those is a better match for your particular case.
If you have specialised requirements, you could write your own collection implementation which supports them. Obviously this is more work, and only worthwhile if none of the above approaches does what you want.
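To illustrate the second option above (a concurrent collection whose primitives do the whole job in one call), here is a small Java sketch. The class and method names are hypothetical; the calls used (merge, reduceValues, addIfAbsent) are standard java.util.concurrent methods.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

final class SingleCallOps {
    // Counts words using merge(), which updates one mapping atomically.
    static ConcurrentHashMap<String, Integer> countWords(List<String> words) {
        ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Sums all values in one whole-map call instead of iterating by hand.
    static int total(ConcurrentHashMap<String, Integer> counts) {
        Integer sum = counts.reduceValues(Long.MAX_VALUE, Integer::sum);
        return sum == null ? 0 : sum;   // reduceValues returns null for an empty map
    }

    // Adds an item only if it is not already present, as one atomic call.
    static boolean addOnce(CopyOnWriteArrayList<String> list, String item) {
        return list.addIfAbsent(item);
    }
}
```

Each of these replaces a check-then-act or iterate-then-update sequence that would otherwise need an external lock around it.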
In general, I think it's well worth stepping back and seeing whether iteration is really needed. In imperative languages all the way from FORTRAN through BASIC and C up to Java, the for loop has traditionally been the tool of choice (sometimes the only structure) for operating on collections of data, and for those of us who grew up on those languages, it's what we reach for instinctively. But the functional programming paradigm provides alternative tools, and so in languages like Kotlin which provide some of them, it's good to stop and ask ourselves “What am I ultimately trying to achieve here?” (Often what we want is actually to update all entries, or map to a new structure, or search for an element, or find the maximum, all of which have better approaches in Kotlin than low-level iteration.)
After all, if you can tell the compiler what you want to do, instead of how to do it, then your program is likely to be shorter and easier to read and maintain, freeing you to think about more important things!

Difference of principles behind CopyOnWriteArrayList and ConcurrentHashMap

In the advanced Java collections API, we have CopyOnWriteArrayList and ConcurrentHashMap, yet the underlying principles of these data structures are different. ConcurrentHashMap only locks the segment of the map on which the write operation is happening. This is how it prevents synchronization problems without affecting performance.
CopyOnWriteArrayList, on the other hand, prevents concurrency problems by making a duplicate of the original list. Why are these implementations so different? Is Java just testing to see which one works better?
A concurrent data structure is designed to ensure that any sequence of individual operations on the data structure will always leave it in a state which is consistent from its own point of view, but a "snapshot" formed by reading the data structure piece by piece may not necessarily represent any state which the data structure ever held. For example, if "Zachary" is renamed to "Adam" while one is reading out a collection of users, the renamed user might get read out as "Adam", "Zachary", both, or neither. Even if the collection was never in a state where the user existed under both names, or under neither, the enumeration might make it look like it was.
A copy-on-write collection is designed to let one take a snapshot of the collection's entire state and guarantee that there was some moment in time when the collection actually had that state. The result of every action, including snapshot requests, should be consistent with each action having been performed at some discrete moment in time between when the request was issued and when it was reported to have completed. If two requests are given before either completes, the selection of which action precedes the other is arbitrary, but there must be a globally consistent ordering.
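The snapshot guarantee can be seen in a small single-threaded Java sketch. The user names follow the example above; the class and method names are hypothetical. An iterator obtained from a CopyOnWriteArrayList keeps reading the array that existed when it was created, so later changes are not visible to it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SnapshotDemo {
    // Starts iterating, "renames" a user mid-iteration, and returns what the iterator saw.
    static List<String> iterateWhileRenaming() {
        CopyOnWriteArrayList<String> users =
                new CopyOnWriteArrayList<>(List.of("Mary", "Zachary"));
        Iterator<String> it = users.iterator();   // captures the current backing array
        users.remove("Zachary");                  // rename Zachary to Adam...
        users.add(0, "Adam");                     // ...after the iterator was created
        List<String> seen = new ArrayList<>();
        while (it.hasNext()) {
            seen.add(it.next());
        }
        return seen;   // the state before the rename: Mary, Zachary
    }
}
```

The iterator reports a state the list really held at one moment, never a mixture of "before" and "after".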

How to lazy-load child collections in a single step

I am working on a legacy application where NHibernate has been used without any apparent thought to efficiency. I am currently stepping through a method in which over 200 queries have been executed so far. This is mostly due to the N+1 problem.
Anyway, as I think about fixing this, my question is:
Given an entity with, say, 10 child collections, almost all of which will be accessed during a single operation, is there a way to lazy-load each child collection in a single call rather than have NHibernate lazy-load each individual child record as it is accessed (e.g., in a foreach loop, which is what's happening now)? It seems to me that eager loading all of these child collections at once would be a massive query and not very performant. But obviously this N+1 approach is wrong. How can I tell NHibernate to do a query that loads up the whole child collection on demand? This still gives me 11 queries, but it's better than 200 and, perhaps, better than the massive query I'd have to do if I eager loaded everything.
You can use the method NHibernateUtil.Initialize(yourObject.ChildCollection). This method forces proxies to load their data based on the fetching strategy that you set in your mappings. Source: NHibernate documentation

Is there a need to refactor a large data access layer?

I have a data access layer that abstracts the rest of the application away from the persistence technology. Right now the implementation is SQL Server, but that might change. Anyway, I find this main data access class getting larger and larger as my tables grow (about 40 tables now). The interface of this data access layer has a method for any question you might want to ask of the data:
    public interface IOrderRepository
    {
        Customer[] GetCustomerForOrder(int orderID);
        Order[] GetCustomerOrders(int customerID);
        Product[] GetProductList(int orderID);
        Order[] GetallCustomersOrders(int customerID);
        // etc . . .
    }
The implementation behind this is basic SQL stored procs running the appropriate queries and returning the results in typed collections.
This will keep growing and growing. It's pretty maintainable, as there isn't a real break of single responsibility, but the class is now over 2000 lines of code.
So the question is: due to sheer class size (and no real conceptual coupling), should this get broken down, and if so, on what dimension or level of abstraction?
Absolutely refactor. 2000 lines is huge.
I'd start by breaking it down by return type. Thus you would get one class for accessing Products, one for Orders, one for Customers and so on.
Within each of these classes, the set of columns selected should probably be the same, so it could be refactored into a single variable or method, as could the extraction of the SQL values into objects.
Also the actual call to the Stored Procedure, including logging and exception handling could and should go into a separate class.
BTW, you do have a violation of single responsibility. According to your description, your class currently has the following responsibilities:
- creating SQL statements for querying a table (about 40 times)
- hydrating the results of calls to stored procedures
- calling stored procedures
And I am assuming:
- logging
- exception handling
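A rough sketch of the breakdown suggested above, written in Java rather than the asker's C#. Every name here (ProcExecutor, mapRows, OrderRepository) is hypothetical and only illustrates how the responsibilities could be separated; rows are modelled as simple maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// All names here are hypothetical; they only illustrate the split of responsibilities.
final class RepositorySplit {
    // Responsibility 1: calling stored procedures
    // (logging and exception handling would also live behind this interface).
    interface ProcExecutor {
        List<Map<String, Object>> call(String procName, Object... args);
    }

    // Responsibility 2: turning rows into objects, shared by every query of one return type.
    static <T> List<T> mapRows(List<Map<String, Object>> rows,
                               Function<Map<String, Object>, T> mapper) {
        List<T> result = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            result.add(mapper.apply(row));
        }
        return result;
    }

    record Order(int id, int customerId) {}

    // One small repository per return type.
    static final class OrderRepository {
        private final ProcExecutor executor;
        OrderRepository(ProcExecutor executor) { this.executor = executor; }

        private static Order toOrder(Map<String, Object> row) {
            return new Order((Integer) row.get("id"), (Integer) row.get("customerId"));
        }

        List<Order> getCustomerOrders(int customerId) {
            return mapRows(executor.call("GetCustomerOrders", customerId),
                           OrderRepository::toOrder);
        }
    }
}
```

Because the executor is an interface, a repository can be exercised against a fake in tests, and swapping the persistence technology touches only the executor implementation.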
I think it should be refactored just because of the size. There are always lots of dimensions along which you can break it down. Since the breakdown is simply to make the code more manageable, don't choose too complex a dimension. Keep it simple, so that it is easy to guess in which class/interface a given function will be found.
This is a hard problem to crack. Firstly, break it into multiple files and classes; secondly, split the business objects from the technology object. You can write your business objects in terms of a database interface (which you write yourself), and then in the future, if you change DB, all you need to do is replace the technology object.
Sadly, you can't really escape from data-schema growth: you will get more stored procedures, more tables and more business objects. However, try your level best to alter existing tables rather than add new ones.
I suggest trying to form a workflow of coupling items together as resources. By this I mean not making physical dependencies but writing documentation that will let you relate all three types of items in your data layer. For example, you could start putting annotations in the comments of your business objects to specify which stored procedures and tables they depend on. You could do this for the stored procedures and even for the tables in SQL Server (the schema has a description field for tables). These tips should help you keep sight of the big picture.
Consider a generic DAO if your language accommodates them. You might also think about query by example to cut down on the number of calls required.
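For reference, a generic DAO might look roughly like this in Java. The interface shape and the in-memory implementation are assumptions made for illustration, not a prescribed design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class Daos {
    // Hypothetical generic DAO: one interface shared by every entity type.
    interface GenericDao<T, ID> {
        Optional<T> findById(ID id);
        List<T> findAll();
        void save(ID id, T entity);
    }

    // In-memory stand-in for a real persistence-backed implementation.
    static final class InMemoryDao<T, ID> implements GenericDao<T, ID> {
        private final Map<ID, T> store = new LinkedHashMap<>();

        @Override public Optional<T> findById(ID id) { return Optional.ofNullable(store.get(id)); }
        @Override public List<T> findAll() { return new ArrayList<>(store.values()); }
        @Override public void save(ID id, T entity) { store.put(id, entity); }
    }
}
```

The common operations are written once, and only queries that are genuinely specific to one entity need their own methods.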

methods: multiple parameters or structure?

I noticed by looking at sample code from Apple, that they tend to design methods that receive structures instead of multiple parameters. Why is that? As far as ease of use, I personally prefer the latter, but as far as performance goes, is there one better choice than the other?
    [pencil drawPoint:Point3Make(20, 40, 60)];
    [pencil drawPointAtX:20 Y:40 Z:60];
Don't muddle this question with concerns of performance. Don't make premature optimizations (until you know you have a problem), and when thinking about performance hot spots in your code, it's almost always in areas dealing with I/O (e.g., database, files). So, separate your question on message passing style from performance. You want to make the best design decision first, then optimize for performance only if needed.
With that being said, Apple does not recommend or prefer passing multiple parameters vs a structure/object. Generalizing this outside the scope of Objective-C, use individual parameters or objects when it makes sense in the particular scenario. In other words, there isn't a black and white answer that you can follow. Instead, use the following guidelines when deciding:
Pass objects/structures when it makes sense for the method to understand many/all members of the object
Pass objects/structures when you want to validate some rules on the relationship between the various members of the object. This allows you to ensure the consumer of your method constructs a valid object prior to calling your method (thus eliminating the need of the method to validate these conditions).
Pass individual arguments when it is clear the method makes sense and only needs certain elements rather than the entire object
Using a variation on your example, a paint method that takes two coordinates (X and Y) would benefit from taking a Point object rather than two variables, X and Y.
A method retrieveOrderByIdAndName would best be designed by taking the single id and name parameter rather than some container object.
Now, if there were some method to retrieve orders by many different criteria, it would make more sense to create a retrieveOrderByCriteria and pass it some criteria structure.
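The contrast between the last two cases can be sketched in Java. The method names follow the text; the Order and OrderCriteria types and the in-memory list of orders are assumptions made for the example.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class OrderSearch {
    record Order(int id, String name, String status) {}

    // Hypothetical criteria structure: an empty field means "do not filter on this".
    record OrderCriteria(Optional<String> name, Optional<String> status) {}

    // Few, clearly needed values: individual parameters read best.
    static Optional<Order> retrieveOrderByIdAndName(List<Order> orders, int id, String name) {
        return orders.stream()
                .filter(o -> o.id() == id && o.name().equals(name))
                .findFirst();
    }

    // Many optional filters: a single criteria object keeps the signature stable.
    static List<Order> retrieveOrderByCriteria(List<Order> orders, OrderCriteria criteria) {
        return orders.stream()
                .filter(o -> criteria.name().map(n -> n.equals(o.name())).orElse(true))
                .filter(o -> criteria.status().map(s -> s.equals(o.status())).orElse(true))
                .collect(Collectors.toList());
    }
}
```

Adding another filter later means adding a field to OrderCriteria, while existing callers of retrieveOrderByCriteria keep compiling against the same method signature.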
If you are passing the same set of parameters around it is useful to pass them in a structure because they belong together semantically.
The performance hit is probably negligible for a structure as simple as a three-component point. Use the readable/reusable solution and then profile your code if you think it is slow :)