is there a need to refactor a large data access layer

I have a data access layer that abstracts the rest of the application away from the persistence technology. Right now the implementation is SQL Server, but that might change. Anyway, I find this main data access class getting larger and larger as my tables grow (about 40 tables now). The interface of this data access layer covers any question you might want to ask of the data:
public interface IOrderRepository
{
    Customer[] GetCustomerForOrder(int orderID);
    Order[] GetCustomerOrders(int customerID);
    Product[] GetProductList(int orderID);
    Order[] GetallCustomersOrders(int customerID);
    // etc.
}
The implementation behind this is basic SQL stored procs running the appropriate queries and returning the results in typed collections.
This will keep growing and growing. It's pretty maintainable as there isn't a real break of single responsibility, but the class is now over 2000 lines of code.
So the question is: due to sheer class size (and no real conceptual coupling), should this get broken down, and if so, on what dimension or level of abstraction?

Absolutely refactor. 2000 lines is huge.
I'd start by breaking it down by return type. Thus you would get one class for accessing Products, one for Orders, one for Customers and so on.
For each of these classes, the set of columns selected should probably be the same, so that could get refactored into a single variable/method, as could the extraction of the SQL values into objects.
Also the actual call to the Stored Procedure, including logging and exception handling could and should go into a separate class.
BTW you do have a violation of single responsibility. According to your description, your class right now has the following responsibilities:
- creating SQL statements for querying a table (about 40 times)
- hydrating the results of calls to stored procedures
- calling stored procedures
And I am assuming:
- logging
- exception handling
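A minimal sketch of that split, written in Python rather than the question's C# purely for illustration. Every name here (`ProcedureExecutor`, `CustomerRepository`, `OrderRepository`, `fake_backend`) is hypothetical, and the in-memory backend only stands in for whatever driver actually calls the stored procedures:

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Row = Tuple  # one result row as returned by the backend
log = logging.getLogger("dal")


class ProcedureExecutor:
    """The one place that calls stored procedures, logs and handles errors."""

    def __init__(self, backend: Callable[[str, Sequence], List[Row]]):
        # Assumption: `backend` is any callable that runs a named procedure.
        self._backend = backend

    def call(self, name: str, *args) -> List[Row]:
        log.debug("calling %s%r", name, args)
        try:
            return self._backend(name, args)
        except Exception:
            log.exception("stored procedure %s failed", name)
            raise


@dataclass(frozen=True)
class Customer:
    id: int
    name: str


@dataclass(frozen=True)
class Order:
    id: int
    customer_id: int


class CustomerRepository:
    """Every query that returns Customer objects lives here."""

    def __init__(self, executor: ProcedureExecutor):
        self._executor = executor

    @staticmethod
    def _hydrate(row: Row) -> Customer:
        # One mapping function shared by all customer queries.
        return Customer(id=row[0], name=row[1])

    def for_order(self, order_id: int) -> List[Customer]:
        rows = self._executor.call("GetCustomerForOrder", order_id)
        return [self._hydrate(r) for r in rows]


class OrderRepository:
    """Every query that returns Order objects lives here."""

    def __init__(self, executor: ProcedureExecutor):
        self._executor = executor

    @staticmethod
    def _hydrate(row: Row) -> Order:
        return Order(id=row[0], customer_id=row[1])

    def for_customer(self, customer_id: int) -> List[Order]:
        rows = self._executor.call("GetCustomerOrders", customer_id)
        return [self._hydrate(r) for r in rows]


def fake_backend(name: str, args: Sequence) -> List[Row]:
    """In-memory stand-in for the database, used only by this sketch."""
    data = {
        ("GetCustomerForOrder", (10,)): [(1, "Ada")],
        ("GetCustomerOrders", (1,)): [(10, 1), (11, 1)],
    }
    return data.get((name, tuple(args)), [])


executor = ProcedureExecutor(fake_backend)
customers = CustomerRepository(executor).for_order(10)
```

Each repository now has one reason to change (its own return type), and swapping the persistence technology would only touch the executor.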

I think it should be refactored just because of the size. There are always lots of dimensions on which you can break it down. Since the breakdown is simply to make the code more manageable, don't choose too complex a dimension - keep it simple so that it is easy to guess in which class/interface a given function will be found.

This is a hard problem to crack. Firstly, break it into multiple files and classes; secondly, split the business objects from the technology object. You can write your business objects in terms of a database interface (which you write yourself), and then in the future if you change DB all you need to do is replace the technology object.
Sadly, you can't really escape from data-schema growth: you will get more stored procedures, more tables and more business objects. However, try your level best to alter existing tables rather than add new ones.
I suggest trying to form a workflow of coupling items together as resources. By this I mean not making physical dependencies but documentation that lets you relate all three types of items in your data layer -- e.g., you could start putting annotations in the comments of your business objects to specify which stored procedures and tables they depend on. You could do this for the stored procedures and even for the tables in SQL Server (the schema has a description field for tables). These tips should help you keep sight of the big picture.

Consider a generic DAO if your language accommodates them. You might also think about query by example to cut down on the number of calls required.
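To make both ideas concrete, here is a small sketch in Python using the standard-library `sqlite3` module as a stand-in database. The `GenericDao` class, the `orders` table and the rule "every non-None field of the example becomes a filter" are assumptions made for this illustration, not a reference to any particular framework:

```python
import sqlite3
from dataclasses import dataclass, fields
from typing import Generic, List, Optional, Type, TypeVar

T = TypeVar("T")


class GenericDao(Generic[T]):
    """One DAO class reused for every table; table and entity are passed in."""

    def __init__(self, conn: sqlite3.Connection, table: str, entity: Type[T]):
        self._conn = conn
        self._table = table  # developer-supplied name, never user input
        self._entity = entity
        self._columns = [f.name for f in fields(entity)]

    def find_by_example(self, example: T) -> List[T]:
        """Query by example: each non-None field of `example` is a filter."""
        criteria = {
            c: getattr(example, c)
            for c in self._columns
            if getattr(example, c) is not None
        }
        sql = f"SELECT {', '.join(self._columns)} FROM {self._table}"
        if criteria:
            sql += " WHERE " + " AND ".join(f"{c} = ?" for c in criteria)
        rows = self._conn.execute(sql, list(criteria.values())).fetchall()
        return [self._entity(*row) for row in rows]


@dataclass
class Order:
    id: Optional[int] = None
    customer_id: Optional[int] = None
    status: Optional[str] = None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 7, "open"), (2, 7, "shipped"), (3, 8, "open")],
)

orders = GenericDao(conn, "orders", Order)
open_for_7 = orders.find_by_example(Order(customer_id=7, status="open"))
```

One `find_by_example` method replaces a family of `GetXByY` methods, at the cost of less explicit method names on the interface.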

Related

Design / Architecture for many instances OOP (or another) implementation

We want to write an API (Python library) which provides information about a few systems in our company. We really aren't sure what the best OOP approach is to implement what we want, so I hope you'll have an idea.
The API will expose a series of tests for each system. Each system will be represented as a class (with properties and methods) and all systems will inherit from a base class (GenericSystem) which will contain basic, generic info regarding the system (i.e. dateOfCreation, authors, systemType, name, technology, owner, etc.). Each system has many instances and each instance has a unique ID. Data about each system instance is stored in different databases, so the API will be a place where all users can find info regarding those systems at once. These are the requirements:
1. We want each user to be able to create an instance of a system (SystemName class, for example) and to be able to get some info about it.
2. We want each user to be able to create multiple instances of a system (or of GenericSystem) and to be able to get info about all of them at once. (It must be efficient: one query only, not one for each instance.) So we thought that we may need to create a MultipleSystemNames class which will implement all those plural-approach methods. This is the most challenging requirement, it seems.
3. We want data to be populated and cached in the instance's properties and methods. So if I create a SystemName instance and call systemNameInstance.propertyName, it will run the needed queries and populate the data into propertyName. The next time the user reads this property, the data will be returned immediately.
4. Last one: a single-system class approach must be preserved. Each system must be represented as a sole system. We can later create a MultiSystem class if needed (for requirement 2), but in its most basic form, each system must be represented singly (I hope you understand what I mean).
The second and the fourth (2,4) requirements are the ones that we really struggle to figure out.
Should we use MultiSystemNames class for each class and also for GenericSystem (MultiGenericSystems)? We don't want to complicate the user and ourselves.
Do you know any OOP (or another) best practice clean and simplified way? Have we missed something?
I'm sorry if I added some unnecessary information but I really wanted to give you a feel about how we want things to be.
Whether or not you've read this far, thank you!

System and instance represent exactly the same thing but are used in different contexts. It doesn't matter how you store or retrieve them. So if you need a collection of System you just use a native collection data structure (e.g. List, Queue, Map in Java). The operations related to System/List must be decoupled from the POJOs. That means you implement them in services, repositories, etc.
How you store and retrieve the data must not have an impact on how you design your data structures. You achieve performance by applying different techniques and/or using proper technologies, e.g. caching, using key-value stores or NoSQL databases, denormalizing relational database tables and/or using indexes, etc.
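Since the question is about a Python library, here is one way that advice could look: a plain `System` data class, an ordinary list for "many systems", and a repository that fetches any number of instances with a single query and caches them. The `systems` table, its columns and the `SystemRepository` name are invented for this sketch, and `sqlite3` stands in for the real databases:

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class System:
    """Plain data holder; it knows nothing about how it was loaded."""
    id: int
    name: str
    owner: str


class SystemRepository:
    """Loads one or many systems; the plural case is still a single query."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self._cache: Dict[int, System] = {}
        self.queries_run = 0  # kept only so the sketch can show the count

    def get_many(self, ids: Iterable[int]) -> List[System]:
        wanted = list(dict.fromkeys(ids))  # de-duplicate, keep order
        missing = [i for i in wanted if i not in self._cache]
        if missing:
            marks = ", ".join("?" for _ in missing)
            rows = self._conn.execute(
                f"SELECT id, name, owner FROM systems WHERE id IN ({marks})",
                missing,
            ).fetchall()
            self.queries_run += 1
            for row in rows:
                self._cache[row[0]] = System(*row)
        return [self._cache[i] for i in wanted if i in self._cache]

    def get(self, system_id: int) -> System:
        found = self.get_many([system_id])
        if not found:
            raise KeyError(system_id)
        return found[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE systems (id INTEGER PRIMARY KEY, name TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO systems VALUES (?, ?, ?)",
    [(1, "billing", "ann"), (2, "search", "bob"), (3, "mail", "cy")],
)

repo = SystemRepository(conn)
all_three = repo.get_many([1, 2, 3])  # one query for three instances
single = repo.get(2)                  # served from the cache, no new query
```

The single-system view (requirement 4) and the plural view (requirement 2) share one class, so no `MultiSystemNames` type is needed in this arrangement.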

Class with a list of materials: best practice

I've created the custom class ZMaterial that can be instantiated passing an ID to the constructor which sets the properties for a single material using SELECTs and BAPIs. This class is basically used to READ and UPDATE a single material.
Now I need to create a service to return a list of materials. I already have the procedural code for it in a static method (for now actually a function module), but I would like to keep using a full OOP approach and instantiate a list of my custom material object. The first approach I found is to enhance the static method to instantiate a list of my single material object after the selects are executed and I have the data in internal tables, but it does not seem the most OOP.
The second option in my mind is to create a new class ZMaterialList with one property being a list of objects ZMaterial and then a constructor with the necessary input parameters for the database select. The problem I see with this option is that I create a full class just for the constructor.
What do you think is the best way to proceed?

Create a separate class to produce the list of materials. The single responsibility principle says each class should do exactly one thing. In all but the most simple cases, using a thing is a different responsibility than producing it.
Don’t make a ZMaterialList class. A list’s focus would be managing the list items, i.e. adding, removing, iterating, sorting etc. But you should be fine with a regular STANDARD TABLE OF REF TO ZMaterial.
Make a ZMaterialReader, -Repository, -Query or -Factory class or the like, depending on the precise way you want to produce the ZMaterials. Readers read by keys, repositories read and write, queries use varying sets of selection criteria, factories instantiate with possibly different sets of inputs.
You can well let that class use the original FUNCTION underneath. It’s good style to exploit what’s already there. Just make sure you trust that code, put it in a test harness, and keep it away from the rest of your OO code.
Extract all public interaction of ZMaterial to an interface and use only that interface. That allows you to offer alternative implementations of ZMaterial, ones that differ in the way they are produced or how they store their data.
Split single production from mass production. Reading MARA to retrieve a single material is okay. But you don’t want thousands of ZMaterials reading MARA individually - that wrecks performance.
Now you’ve got the interface, you could offer a second implementation of ZMaterial whose constructor receives all relevant data and relies on it already having been validated to avoid additional SELECTs.
You could also offer an implementation that doesn’t store its data at all but only stores pointers to rows in internal tables somewhere else. See the flyweight pattern for ideas.
If you expect mass updates on the materials, such as “reclassify all of these as B”, consider extracting these list-oriented operations to separate classes as well.
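The shape of that design can be sketched independently of ABAP. The Python below is only an illustration: `FakeTable` stands in for the database table, and `Material`, `SelfLoadingMaterial`, `PreloadedMaterial` and `MaterialReader` are hypothetical names chosen to mirror the points above (one interface, a self-reading implementation for single use, and a preloaded implementation for mass production):

```python
from typing import Dict, List, Protocol


class Material(Protocol):
    """The interface; client code depends only on this."""
    material_id: str

    def description(self) -> str: ...


class FakeTable:
    """Stand-in for the material table; counts how often it is read."""

    def __init__(self, rows: Dict[str, str]):
        self._rows = rows
        self.selects = 0

    def select(self, ids: List[str]) -> Dict[str, str]:
        self.selects += 1
        return {i: self._rows[i] for i in ids if i in self._rows}


class SelfLoadingMaterial:
    """Single production: the object reads its own row."""

    def __init__(self, table: FakeTable, material_id: str):
        self.material_id = material_id
        self._description = table.select([material_id])[material_id]

    def description(self) -> str:
        return self._description


class PreloadedMaterial:
    """Mass production: trusts data the reader has already fetched."""

    def __init__(self, material_id: str, description: str):
        self.material_id = material_id
        self._description = description

    def description(self) -> str:
        return self._description


class MaterialReader:
    """Produces lists of materials with one read for the whole list."""

    def __init__(self, table: FakeTable):
        self._table = table

    def read_many(self, ids: List[str]) -> List[Material]:
        rows = self._table.select(ids)
        return [PreloadedMaterial(i, d) for i, d in rows.items()]


table = FakeTable({"M1": "Bolt", "M2": "Nut", "M3": "Washer"})
materials = MaterialReader(table).read_many(["M1", "M2", "M3"])
```

The result is an ordinary list of objects behind the interface, so callers cannot tell (and need not care) which implementation they received.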

Separating code - modularising

When developing, how useful is it to create small classes to represent little data structures? For example, say as a simplified example, a program is using an array of strings to represent names of something, e.g. cars. Instead of just keeping this array inside a method or class, how useful is it to separate this and make it its own class? This way I am thinking that it can be responsible for itself and more actions can be performed on it - validation, etc. - which can all be kept separate. Also, it can be reused easily throughout the system. But then where does it stop? I.e. in the car example, you could then go on to create a car object etc. It really can be never-ending, can't it?

There are several guidelines I use to determine when I need to refactor a data structure into its own class:
Am I storing a lot of interrelated data? If you find yourself storing a couple of arrays, and manipulating them as a unit, it's probably best to store a single array containing objects.
Are these data structures exposed to other classes? If other classes are directly exposed to the data, it's probably best to encapsulate the data in its own class, which makes it easy to keep the conceptual and actual models separate.
Do I find myself frequently performing operations on the data? It might be fine to store an array of names, but if you start adding methods like validateName and checkName to the wrapping class, it might be a good idea to refactor and place those methods on a Name class itself.
Keep in mind: it's often a lot easier and cleaner to put a decent object model in place up front than to try and graft one on after the fact. You shouldn't do it arbitrarily, but as you're working through your program you should pay attention to when it becomes difficult to control the data structures you have--that's a good sign that you should refactor them, as needed.
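As a small illustration of the first and third guidelines, here is a sketch in Python using the question's car example. The `Car` class, its fields and the blank-name rule are made up for the example:

```python
from dataclasses import dataclass
from typing import List

# Before: two parallel lists that always have to be edited together.
car_names = ["Golf", "Civic"]
car_years = [2019, 2021]


@dataclass(frozen=True)
class Car:
    """After: the related values travel together as one object."""
    name: str
    year: int

    def __post_init__(self) -> None:
        # Validation lives next to the data it protects.
        if not self.name.strip():
            raise ValueError("car name must not be blank")


cars: List[Car] = [Car(n, y) for n, y in zip(car_names, car_years)]
newest = max(cars, key=lambda car: car.year)
```

Whether a further `Name` class is worth it depends on the third guideline: only once several operations on names start piling up elsewhere.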

It makes sense to do this as soon as you are repeating code to operate on the data structure.
Chris B. makes a great point about interrelated data. See the Extract Class refactoring example.

Parallelizing L2S Entity Retrieval

Assuming a typical domain entity approach with SQL Server and a dbml/L2S DAL with a logic layer on top of that:
In situations where lazy loading is not an option, I have settled on a convention where getting a list of entities does not also get each item's child entities (no loading), but getting a single entity does (eager loading).
Since getting a single entity also gets children, it causes a cascading effect in which each child then gets its children too. This sounds bad, but as long as the model is not too deep, I usually don't see performance problems that outweigh the benefits of the ease of use.
So if I want to get a list in which each of the items is fully hydrated with children, I combine the GetList and GetItem methods. So I'll get a list and then loop through it getting each item with the full cascade. Even this is generally acceptable in many of the projects I've worked on - but I have recently encountered situations with larger models and/or more data in which it needs to be more efficient.
I've found that partitioning the loop and executing it on multiple threads yields excellent results. In my first experiment with a list of 50 items from one particular project, I did 5 threads of 10 items each and got a 3X improvement in time.
Of course, the mileage will vary depending on the project, but all else being equal this is clearly a big opportunity. However, before I go further, I was wondering what others who have already been through this have done. What are some good approaches to parallelizing this type of thing?

Usually it is faster to make a single database call that returns a set of records.
This recordset can "hydrate" the top-level objects, then another recordset can load child objects. I'm not sure how your situation does not allow lazy-loading, but this method is essentially lazy-loading, and will surely be faster than making multiple calls to the database that returns a single record each time.
You could make asynchronous calls to the database so that multiple queries are running in parallel. If you combine this with the first strategy for each "layer" of the model, and write a somewhat more complex hydration function based on multiple-record return sets, you should see that the database handles concurrent connections very well (which is why you see a performance gain from using multiple threads).
But you don't need to explicitly create threads - check out the asynchronous methods of a SqlCommand.
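The "one recordset per layer" idea can be sketched as follows. This is Python with `sqlite3` rather than LINQ to SQL, and the `orders` / `order_lines` schema is invented for the example; the point is only that a list of fully hydrated parents costs two queries regardless of how many items it contains:

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class OrderLine:
    id: int
    order_id: int
    product: str


@dataclass
class Order:
    id: int
    customer: str
    lines: List[OrderLine] = field(default_factory=list)


def load_orders_with_lines(conn: sqlite3.Connection, customer: str) -> List[Order]:
    # Query 1: every parent in a single round trip.
    parents = conn.execute(
        "SELECT id, customer FROM orders WHERE customer = ? ORDER BY id",
        (customer,),
    ).fetchall()
    orders = {row[0]: Order(*row) for row in parents}
    if not orders:
        return []

    # Query 2: every child of those parents in a single round trip.
    marks = ", ".join("?" for _ in orders)
    children = conn.execute(
        "SELECT id, order_id, product FROM order_lines "
        f"WHERE order_id IN ({marks}) ORDER BY id",
        list(orders),
    ).fetchall()

    # Stitch children onto their parents in memory.
    for row in children:
        line = OrderLine(*row)
        orders[line.order_id].lines.append(line)
    return list(orders.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE order_lines (id INTEGER PRIMARY KEY, order_id INTEGER, product TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "ada"), (2, "ada"), (3, "bob")])
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [(1, 1, "bolt"), (2, 1, "nut"), (3, 3, "washer")],
)

ada_orders = load_orders_with_lines(conn, "ada")
```

A deeper model would repeat the second step once per level, so the number of queries grows with the depth of the model rather than with the number of rows.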

Where should I store virtual/calculated/complex object fields in my models?

I have models corresponding to database tables. For example, the House class has "color", "price", "square_feet", "real_estate_agent_id" columns.
It is very common for me to want to display the agent name when I display information about a house. As a result, my House class has the following fields:
class House {
    String color;
    Double price;
    Integer squareFeet;
    Integer realEstateAgentId;
    String realEstateAgentName;
}
I've been referring to realEstateAgentName as a virtual field, as it is pulled from a foreign table (join on real_estate_agent_id).
This doesn't feel right to me, as it mixes actual database columns with foreign object's properties. But it's quick, and in many cases it really works out well.
Other times I find myself doing something like this:
class House {
    String color;
    Double price;
    Integer squareFeet;
    Integer realEstateAgentId;
    RealEstateAgent realEstateAgent;
}
As you can see, I'm storing the actual object corresponding to the ID that is stored in the House table.
I tend to make the decision to store the entire object vs some key information associated with the ID (e.g. Name) depending on the likelihood I see of needing to access other information about the object it represents.
I have a few questions:
Of the two methods I've been mixing and matching, which is best? I'm leaning towards storing the id + the object, rather than pulling out just the properties from the foreign object that I think I may need. Of the two, this seems more "correct." But it's not perfect, because in many cases I don't have any need to hydrate the entire foreign object, and doing so would cause undue waste of resources or would not be feasible because of the amount of data or the number of joins that would be required when I don't have any use for all the info being brought in. Given that this is the case, it seems like a poor design choice because I will have lots of null fields that aren't really null in my database, but are so in memory simply because there was no need to populate them -- now I have to keep track of which ones I populated.
But is it best practice to store an ID alongside the object it represents? Should I even be storing the object as a property, or should it live externally in some map, with the ID being the key?
In an Object world it seems like the ID shouldn't even be stored as a property, with the foreign Object it represents being the logical replacement. But with everything being tightly coupled with a relational database it doesn't seem very feasible.
Is this frustrating impurity of my models/classes something I just have to live with, or are there patterns out there that address this by having some kind of fork or parent/child subclassing going on where one is a "pure" object while the other is flat like the database?
EDIT: I am looking for design suggestions here rather than specific ORM frameworks like Hibernate/nHibernate/etc. The particular language I'm working in does not have an ORM solution for my language version that I am satisfied with, and the examples were Java-esque but that's not what my source code is written in.

I can speak about Hibernate, because this is the ORM tool I am most familiar with. I believe that other ORM tools also support similar behaviour to some extent.
Hibernate solves your problem with lazy loading. You add your agent as a property to the house, and by default, when the house object is loaded, the agent is represented by a proxy object generated by Hibernate, which contains only the ID. If you query some other property of the agent, Hibernate loads the full object in the background:
class House {
    String color;
    Double price;
    Integer squareFeet;
    RealEstateAgent realEstateAgent;
    // getters, setters,...
}
House house = (House) session.load(House.class, 123L);
// at this point, house refers to a proxy object created by Hibernate
// in the background - no house or agent data has been loaded from DB
house.getId();
// house still refers to the proxy object
RealEstateAgent agent = house.getRealEstateAgent();
// house is now loaded, but agent not - it refers to a proxy object
String name = agent.getName(); // Now the agent data is loaded from DB
OTOH if you are sure that for a specific class you (almost) always need a specific property, you can specify eager loading in the ORM mapping for that property, in which case the property is loaded as soon as the containing object. In the mapping you can also specify whether you want a join query or a subselect query.
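Since the question asks for a design rather than a framework, the same idea can be hand-rolled. The Python sketch below is not how Hibernate is implemented; it only shows the pattern of keeping the ID always populated and fetching the related object on first access. The `load_agent` callback and all class names are assumptions for the example:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RealEstateAgent:
    id: int
    name: str


class House:
    """Holds the foreign key eagerly and the agent object lazily."""

    def __init__(
        self,
        color: str,
        price: float,
        real_estate_agent_id: int,
        load_agent: Callable[[int], RealEstateAgent],
    ):
        self.color = color
        self.price = price
        self.real_estate_agent_id = real_estate_agent_id  # always populated
        self._load_agent = load_agent  # hypothetical loader, e.g. a repository method
        self._agent: Optional[RealEstateAgent] = None

    @property
    def real_estate_agent(self) -> RealEstateAgent:
        # The first access triggers the load; later accesses reuse the result.
        if self._agent is None:
            self._agent = self._load_agent(self.real_estate_agent_id)
        return self._agent


def load_agent_from_store(agent_id: int) -> RealEstateAgent:
    """Stand-in for a database lookup."""
    return RealEstateAgent(agent_id, "Pat")


house = House("red", 250000.0, 42, load_agent_from_store)
agent_name = house.real_estate_agent.name  # the agent is loaded here, not earlier
```

With this arrangement there is no need to track which fields were populated: the object either has not been asked for yet or is fully loaded.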

LINQ to SQL uses ID + Object and it works out well. I prefer that model as it's most flexible. Hibernate can do the same. One issue you will face is deep loading: when do you actually load the object and not just the ID? Both LINQ to SQL and Hibernate have lazy loading and give you control over this issue.
The Entity Framework however looks to give you this complete control where you can decide just how the data appears regardless the physical underpinnings. It has not been fully realized yet however.
There's really no impurity going on here. The problem is you're trying to represent an abstraction of data that is relational in an object-oriented fashion. To get around the pains of developing like this, larger-scale projects are moving to Domain Driven Design, where the underlying data is abstracted out into logical groupings of Repositories. Thinking of tables as classes can be problematic for large-scale solutions.
Just my 2 cents.

Hibernate, the most popular ORM tool in the Java ecosystem, usually allows you to do this:
class House {
    String color;
    Double price;
    Integer squareFeet;
    RealEstateAgent realEstateAgent;
}
This translates to a DB-table that looks like this: house(id, color, price, squareFeet, real_estate_agent_id)
If you need to print the name of the agent you just traverse the object graph:
house.getRealEstateAgent().getName()
Through lazy loading, this is done quite efficiently. I wouldn't worry about the fact that an extra query trip to the database may have to be done until your stress tests prove this to be a problem.
Edit after your edit:
All the solutions out there have dealt with the paradigm mismatch (between the OO and Relational worlds) in a similar fashion. The designs have been made, the problem is solved. And yes, it remains a pain in the butt to deal with as an application developer but I suppose it is just the way it is as long as we want to use relational databases and object oriented persistence together.
