Which is better, filtering results at db or application? - data-access-layer

A simple question. There are cases where I fetch data and then process it in my BLL. But I realized that the same processing/filtering can be done in my stored procedure, with only the filtered results returned to the BLL.
Which is better, processing at DB or processing in BLL? And Why?
Consider this scenario: I want to check whether a product exists in my db and, if it exists, add it to the order (example taken from the answer by Nour Sabony below). Now, I can do this check in my BLL, or I can do it in the stored procedure as well. If I combine things into one procedure, I reduce the whole operation to one db call. Is that better?

At the database level.
Databases are optimized for querying, so why go to all the effort of returning a large dataset and then filtering it in your application afterwards?

As a general rule: Anything that reasonably belongs to the db's responsibilities and can be done at your db, do it at your db.

Well, the fastest answer is: at the database.
But you may consider something like Linq2Sql,
I mean writing an expression at the presentation layer and having it parsed into a SQL statement at the data access layer.
Of course, there are some situations where the BLL should get some data from the DAL, process it, and then hand it back to the DAL.
Take an example:
a PutOrder(Order value) procedure, which should check the availability of the ordered product.
public void PutOrder(Order _order)
{
    // One extra db call per order line, just to check availability
    foreach (OrderDetail _orderDetail in _order.Details)
    {
        int count = dalOrder.GetProductCount(_orderDetail.Product.ProductID);
        if (count == 0)
            throw new Exception(string.Format("Product {0} is not available", _orderDetail.Product.Name));
    }
    dalOrder.PutOrder(_order);
}
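To the question's point about combining things into one procedure: the availability check and the insert can indeed be merged into a single stored procedure, reducing check-plus-insert to one round trip (and removing the race between the check and the insert). A minimal sketch, where usp_AddDetailIfAvailable is a hypothetical proc that raises an error when the product is out of stock:

using System.Data;
using System.Data.SqlClient;

public void PutOrderDetail(int orderId, int productId, int quantity)
{
    // One db call: the proc both checks availability and inserts the detail.
    using (var conn = new SqlConnection(connectionString)) // connectionString assumed defined elsewhere
    using (var cmd = new SqlCommand("usp_AddDetailIfAvailable", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@OrderID", orderId);
        cmd.Parameters.AddWithValue("@ProductID", productId);
        cmd.Parameters.AddWithValue("@Quantity", quantity);
        conn.Open();
        cmd.ExecuteNonQuery(); // an out-of-stock product surfaces here as a SqlException
    }
}

Whether that is "better" is the usual trade-off: fewer round trips and an atomic check, at the cost of moving a business rule into the proc.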
But if you are making a browse view, it is not a good idea (from a performance viewpoint) to bring all the data from the DAL and then choose what to display.
Maybe the following could help:
public List<Product> SearchProducts(Criteria _criteria)
{
    // Translate the criteria into SQL and let the database do the filtering
    string sql = Parser.Parse(_criteria);
    // code to pass the sql statement to a database procedure
    // and return the corresponding data
    return dalProduct.Search(sql); // hypothetical DAL call
}

At the database.

I have consistently been taught that data should be handled in the data layer whenever possible. Filtering data is what a DB is specifically there to do and is optimized to do. Therefore that is where it should occur.
This also prevents the duplication of effort. If the data filtering occurs in a data layer then other code/systems could benefit from the same stored procedure instead of duplicating the work in your BLL.
I have found that people like to put the primary logic in the section that they are most comfortable working in. Regardless of a developer's strengths, data processing should go in the data layer whenever possible.

To throw out a differing viewpoint: it also really depends on your data-abstraction model. If you have a large second-level cache that sits on top of your database, and the majority (or the entirety) of the data you will be filtering on is already in that cache, then yes, stay inside the application and use the cached data.
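In that case the filtering is plain LINQ to Objects over the in-memory collection; no SQL is issued at all. A minimal sketch, assuming a hypothetical cache accessor and Product shape:

using System.Collections.Generic;
using System.Linq;

// Filtering against a second-level cache: everything happens in memory.
// Cache.Get and the Product properties are hypothetical placeholders.
IEnumerable<Product> products = Cache.Get<IEnumerable<Product>>("all-products");
List<Product> inStock = products
    .Where(p => p.UnitsInStock > 0)
    .ToList();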

The difference is not very important when you consider tables that will never contain more than a few dozen rows.
Filtering at the db becomes the obvious choice when you consider tables that will grow to thousands, hundreds of thousands or millions of rows. It would make zero sense to transfer that much data to filter for a subset of records.

I asked a similar question in an SSRS class earlier this year when I wanted to know the "best practice" for having a stored procedure retrieve queried data instead of executing T-SQL in the report's dataset. The response I received was: allow the database engine to do the "heavy lifting." That is, do all the selecting, grouping, and sorting in the stored procedure, and allow SSRS to concentrate on delivering the results.
The same is true for your scenario. By all means, my vote would be to do all the filtering in your stored procedure. Call the sproc from your application and let the database do the work.

If you filter at the database, then only the data that is required by the application is actually sent on the wire between the database and the application server. For this reason, I suggest that it is better to filter at the database.

Related

How to handle a big result set without loading it into memory

Using NHibernate and getNamedQuery, how can I handle a huge result set without loading all the rows into a list?
Let's suppose I need to apply some logic to each row, but I don't need the row after using it.
I don't want to use pagination, because I can't change the source (a stored procedure).
When you say you don't want to use pagination, do you consider using setMaxResults() and setFirstResult() pagination?
Query query = session.getNamedQuery("storedProc");
query.setFirstResult(fromIndex);
query.setMaxResults(pageSize);
I'm not really sure you have any other option with Hibernate. There are all sorts of ways you can incorporate partitioning to split the work, but that's only after you've loaded the data that you want to partition.
Assuming you're doing some sort of logic against a row and storing in the DB, and seeing as you're already using stored procs, your best bet may be another stored proc. Alternatively, if you want to keep your logic outside of your DB, then you should be working off of tables rather than data from a stored proc.

What is the best practice for sorting data with paging, on the business tier or the database tier?

It might be a frequently asked question; however, so far I couldn't find a convincing answer.
In my project, I need to do paging for a set of around 20,000+ records, which is a joined result from multiple tables, and it needs to be sorted differently in different scenarios.
Currently, there are 2 options in front of me:
1. Do it using a stored procedure on the database tier, i.e. WHERE dl.[row_number] BETWEEN @index*@size+1 AND @index*@size+@size.
The problem with doing it this way is that you have to write a stored proc for each sort order separately.
2. Do it on the business logic tier, and do the paging and sorting over the result (i.e. Skip(), Take()).
But that is not ideal either, since you may end up retrieving 20,000 records when only 10 of them are used.
Is there any standard best practice for this? Thanks in advance.
Keep this logic at the database layer.
If using the stored proc, extend it to also take the sort column(s) so you can return the right result set.
If this is a multiuser application, there will be scalability issues unless you fetch the result for a single display page and commit the transaction.
For example, a paged display on a web site should perform a new query each time the user navigates to the next page. The alternative is to keep the full 20K result in the web session context, which will not scale well.
There is a slight annoyance with the SQL language for sorting, as the column name (or column index) for sorting cannot be parameterized.
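A common workaround for that annoyance is to whitelist the sortable columns in code and splice only the whitelisted column name into the SQL, while the page bounds remain real parameters; one query then covers every sort order instead of one stored proc per ordering. A sketch (ADO.NET and SQL Server assumed; the table and column names are hypothetical):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public class ProductPager
{
    // Only these keys can influence ORDER BY; anything else is rejected,
    // so user input is never concatenated into the SQL.
    static readonly Dictionary<string, string> SortColumns =
        new Dictionary<string, string>
        {
            { "name",  "ProductName" },
            { "price", "UnitPrice" },
            { "date",  "CreatedOn" },
        };

    public SqlCommand BuildPageQuery(string sortKey, int pageIndex, int pageSize)
    {
        string column;
        if (!SortColumns.TryGetValue(sortKey, out column))
            throw new ArgumentException("Unknown sort column: " + sortKey);

        // ROW_NUMBER paging in the style of the question above.
        string sql =
            "SELECT * FROM (" +
            "SELECT *, ROW_NUMBER() OVER (ORDER BY " + column + ") AS [row_number] " +
            "FROM ProductListing) dl " +
            "WHERE dl.[row_number] BETWEEN @index*@size+1 AND @index*@size+@size";

        var cmd = new SqlCommand(sql);
        cmd.Parameters.AddWithValue("@index", pageIndex);
        cmd.Parameters.AddWithValue("@size", pageSize);
        return cmd;
    }
}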

Avoiding duplicating SQL code?

I know many ways to avoid duplicating code (in my case PHP). However, I am developing a rather big application that does some calculations on the database with the data it finds, and I have noticed the need to use the same pieces of SQL in several places.
I don't like the idea of copying and pasting the same thing over and over again. What is a good way to do this? Should I use stored procedures? I could almost calculate some of the stuff in PHP, except that most of the time the queries are calculating values based on data not returned by the query, and it seems stupid to return extra data to PHP just so that it could do its calculations. Sometimes that may be okay, but right now it does not feel so.
What should I do?
For example, in many of my SQL queries I am calculating something similar to this:
...
(SELECT SUM(IT.amount) FROM InvoiceTransaction IT WHERE IT.invoiceId = I.id) AS total
...
FROM Invoice I
...
Note that I'm at home now so I'm writing this off the top of my head.
I think you have 2 solutions:
1. If the SQL returns a small amount of data, I would simply wrap the SQL invocation in a method call and call it, parameterising as necessary (see the sketch below).
2. If the SQL handles a lot of data, I would keep that data in the database and use a stored procedure. You can then call that stored procedure without duplicating the code (but wrap the stored proc call in a function and call it - i.e. as in option 1).
I wouldn't necessarily shy away from stored procedures. But I would advise keeping business logic out of them (keep it in the application itself) and making sure you have sufficient unit testing around them.
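For option 1, the wrapper can be as small as one method per recurring query: the SQL lives in exactly one place and every caller goes through the method. A sketch in C# to match the rest of this thread (the same idea applies directly in PHP); the query is the invoice total from the question, and connectionString is assumed to be defined elsewhere:

using System;
using System.Data.SqlClient;

public decimal GetInvoiceTotal(int invoiceId)
{
    // The recurring SQL is written once here, never pasted into callers.
    const string sql =
        "SELECT SUM(IT.amount) FROM InvoiceTransaction IT " +
        "WHERE IT.invoiceId = @invoiceId";

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@invoiceId", invoiceId);
        conn.Open();
        object result = cmd.ExecuteScalar();
        return result == DBNull.Value ? 0m : (decimal)result;
    }
}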
I do not prefer stored procedures, especially not for the sake of refactoring. You should consider writing a function that returns the records you need, and putting your SQL queries inside that function so you can call it instead of putting your SQL everywhere.
I think we would need an example of a query. Stored procs might be a good option. An alternative might be to use views. One advantage of having your queries in views or stored procs is that you can often use the database to see where your tables are used. The disadvantage is that you are locking yourself into one database; however, you are probably doing that anyway.

Date ranges in views - is this normal?

I recently started working at a company with an enormous "enterprisey" application. At my last job, I designed the database, but here we have a whole Database Architecture department that I'm not part of.
One of the stranger things in their database is that they have a bunch of views which, instead of having the user provide the date ranges they want to see, join with a (global temporary) table "TMP_PARM_RANG" holding a start and end date. Every time the main app starts processing a request, the first thing it does is "DELETE FROM TMP_PARM_RANG;" followed by an insert into it.
This seems like a bizarre way of doing things, and not very safe, but everybody else here seems ok with it. Is this normal, or is my uneasiness valid?
Update: I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems. Also, there are literally dozens, if not hundreds, of views that all depend on TMP_PARM_RANG.
Do I understand this correctly?
There is a view like this:
SELECT * FROM some_table, tmp_parm_rang
WHERE some_table.date_column BETWEEN tmp_parm_rang.start_date AND tmp_parm_rang.end_date;
Then in some frontend a user inputs a date range, and the application does the following:
1. Deletes all existing rows from TMP_PARM_RANG
2. Inserts a new row into TMP_PARM_RANG with the user's values
3. Selects all rows from the view
I wonder if the changes to TMP_PARM_RANG are committed or rolled back, and if so when? Is it a temporary table or a normal table? Basically, depending on the answers to these questions, the process may not be safe for multiple users to execute in parallel. One hopes that if this were the case they would have already discovered that and addressed it, but who knows?
Even if it is done in a thread-safe way, making changes to the database for simple query operations doesn't make a lot of sense. These DELETEs and INSERTs are generating redo/undo (or whatever the equivalent is in a non-Oracle database) which is completely unnecessary.
A simple and more normal way of accomplishing the same goal would be to execute this query, binding the user's inputs to the query parameters:
SELECT * FROM some_table WHERE some_table.date_column BETWEEN ? AND ?;
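For instance, from .NET the binding side would look like the following (ADO.NET and SQL Server assumed; the table and column names are the placeholders from the view above, startDate/endDate are the user's inputs, and connectionString is assumed to be defined elsewhere):

using System;
using System.Data.SqlClient;

// Bind the user's dates as real parameters instead of staging them
// in TMP_PARM_RANG first; no DELETE/INSERT, no redo/undo.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT * FROM some_table WHERE date_column BETWEEN @start AND @end", conn))
{
    cmd.Parameters.AddWithValue("@start", startDate);
    cmd.Parameters.AddWithValue("@end", endDate);
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // consume each row
        }
    }
}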
If the database is Oracle, it's possibly a global temporary table; every session sees its own version of the table, and inserts/deletes won't affect other users.
There must be some business reason for this table. I've seen views with dates hardcoded that were actually partitioned views, and they were using the dates as the partitioning field. I've also seen joining on a table like this when dealing with daylight saving time - imagine a view that returns all activity which occurred during DST. But none of those things would ever delete from and insert into the table... that's just odd.
So either there is a deeper reason for this that needs to be dug out, or it's just something that seemed like a good idea at the time, and why it was done that way has been lost as tribal knowledge.
Personally, I'm guessing that it would be a pretty strange occurrence. And from what you are saying, two methods calling the process at the same time could get very interesting.
Typically date ranges are done as filters on a view, and not driven by outside values stored in other tables.
The only justification I could see for this is a multi-step process that is only executed once at a time, where the dates are needed for multiple operations across multiple stored procedures.
I suppose it would let them support multiple ranges. For example, they could return all dates between 1/1/2008 and 1/1/2009 AND between 1/1/2006 and 1/1/2007, to compare 2006 data to 2008 data. You couldn't do that with a single pair of bound parameters. Also, I don't know how Oracle does its query-plan caching for views, but perhaps it has something to do with that? With the date columns being checked as part of the view, the server could cache a plan that always assumes the dates will be checked.
Just throwing out some guesses here :)
Also, you wrote:
"I should mention that they use transactions and per-client locks, so it is guarded against most concurrency problems."
While that may guard against data consistency problems due to concurrency, it hurts when it comes to performance problems due to concurrency.
Do they also add one - in the application - to generate the next unique value for the primary key?
It seems that the concept of shared state eludes these folks, or the reason for the shared state eludes us.
That sounds like a pretty weird algorithm to me. I wonder how it handles concurrency - is it wrapped in a transaction?
Sounds to me like someone just wasn't sure how to write their WHERE clause.
The views are probably used as temp tables. In SQL Server we can use a table variable or a temp table (# / ##) for this purpose. Although creating lots of views is not recommended by experts, I have created plenty of them for my SSRS projects, because the tables I am working with do not reference one another (no FKs, seriously!). I have to work around deficiencies in the database design; that's why I am using views a lot.
With the global temporary table (GTT) approach that you say is being used here, the method is certainly safe with regard to a multiuser system, so no problem there. If this is Oracle, then I'd want to check that the system either uses an appropriate level of dynamic sampling so that the GTT is joined appropriately, or makes a call to DBMS_STATS to supply statistics on the GTT.

LINQ to SQL updates

Does anyone have an idea how to run the following statement with LINQ?
UPDATE FileEntity SET DateDeleted = GETDATE() WHERE ID IN (1,2,3)
I've come to both love and hate LINQ, but so far there's been little that hasn't worked out well. The obvious solution, which I want to avoid, is to enumerate all the file entities and set them manually.
foreach (var file in db.FileEntities.Where(x => ids.Contains(x.ID)))
{
    file.DateDeleted = DateTime.Now;
}
db.SubmitChanges();
The problem with the above code, apart from the sizable overhead, is that each entity has a Data field which can be rather large, so for a large update a lot of data crosses the database connection for no particular reason. (The LINQ solution is to delay-load the Data property, but this wouldn't be necessary if there were some way to just update the field with LINQ to SQL.)
I'm thinking some query expression provider thing that would result in the above T-SQL...
LINQ cannot perform updates in the store - it is language-integrated query, not update. Most (maybe even all) OR mappers will generate a select statement to fetch the data, modify it in memory, and perform the update using a separate update statement. Smart OR mappers will fetch only the primary key until additional data is required, but then they usually fetch all the rest, because it would be much too expensive to fetch a single attribute at a time.
If you really care about this optimization, use a stored procedure or a hand-written SQL statement. If you want more compact code, you can use the following.
db.FileEntities.
    Where(x => ids.Contains(x.ID)).
    ToList().
    ForEach(x => x.DateDeleted = DateTime.Now);
db.SubmitChanges();
I don't like this because I find it less readable, but some prefer such a solution. (Note that it is only more compact, not cheaper: the entities are still fetched and updated one by one.)
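For the hand-written SQL route mentioned above, LINQ to SQL's DataContext.ExecuteCommand can issue the UPDATE in a single statement without materializing any entities; its {0}, {1}, ... placeholders are bound as parameters. A sketch, assuming ids is the collection of IDs from the question:

using System.Linq;

// Build one placeholder per id ("{0}, {1}, ...") and let ExecuteCommand
// bind the values, so only the UPDATE itself crosses the connection.
var placeholders = string.Join(", ",
    ids.Select((id, i) => "{" + i + "}").ToArray());
db.ExecuteCommand(
    "UPDATE FileEntity SET DateDeleted = GETDATE() WHERE ID IN (" + placeholders + ")",
    ids.Cast<object>().ToArray());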
LINQ to SQL is an ORM, like any other, and as such it was not designed to handle bulk updates/inserts/deletes. The general idea with L2S, EF, NHibernate, LLBLGen and the rest is to handle the mapping of relational data to your object graphs for you, eliminating the need to manage a large library of stored procs which, ultimately, limit your flexibility and adaptability.
When it comes to bulk updates, those are best left to the thing that does them best... the database server. L2S and EF both provide the ability to map stored procedures to your model, which allows your stored procs to be somewhat entity-oriented. Since you're using L2S, just write a proc that takes the set of identities as input and executes the SQL statement at the beginning of your question. Drag that stored proc onto your L2S model, and call it.
It's the best solution for the problem at hand, which is a bulk update. As with reporting, object graphs and object-relational mapping are not the best solution for bulk processes.