What is the best practice of sorting data with paging, on Business Tier or Database Tier? - sql

It might be a frequently asked question, however so far i couldn't find a convincing answer.
In my project, I need to do paging for a set of around 20,000+ records which is a joined result from multiple tables, and it need to be sorted differently in different scenarios.
Currently, there are 2 options in front of me:
1, do it by using store procedure on database tier ie. where dl.[row_number] between #index*#size+1 and #index*#size+#size.
The problem of doing it is, you will have to write Store Procs for each sorting seperately.
2, do it on Business Logic tier, and do paging and sorting above the result. (ie. skip(), take())
But it is not ideal neither, since you may end up retrieving 20,000 records, but only 10 of them is used
Is there any standard best practice available for this? thanks in advance

keep this logic at the database layer.
if using the stored proc, then extend it to also include the sort column(s) so you can return the right set

If this is a multiuser application, there will be scalability issues unless you fetch the result for a single display page and commit the transaction.
For example, a paged display on a web site should perform a new query each time the user navigates to the next page. The alternative is to keep the full 20K result in the web session context, which will not scale well.
There is a slight annoyance with the SQL language for sorting, as the column name (or column index) for sorting cannot be parameterized.

Related

Storing Search results for Future Use

I have the following scenario where the search returns a list of userid values (1,2,3,4,5,6... etc.) If the search were to be run again, the results are guaranteed to change given some time. However I need to stored the instance of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
search_id unsigned int FK, PK (clustered index)
user_id unsigned int FK
This is an unacceptable approach as this table has grown onto millions of records. I've considered partitioning the table, but either I will have numerous partitions (1000s).
I've optimized the existing tables that search results expired unless they're used elsewhere, so all the search results are referenced elsewhere.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking to efficiently store the search result information, such that it can be efficiently accessed later without being burdened by the number of records.
EDIT: Thank you for the answers, I don't have any problems running the searches themselves, but the result set for the search gets used in this case for recipient lists, which will be used over and over again, the purpose of storing is exactly to have a snapshot of the data at the given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptable quickly.
If you can't make the queries faster using better SQL and/or indexes etc, I recommend using lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: It was caching query results from the production relational database in the session object in an attempt top speed up queries, but it was a mess, and wasn't much faster anyway - before my time, a "senior" java developer (whose name started with Jam.. and ended with .illiams) who was actually a moron decided it was a good idea.
I put in Solr (a java-tailored lucene implementation) and kept Solr up to date with the relational database (using work queues) and the web queries are now just a few milliseconds.
Is there a reason why you need to store every search? Surely you would want the most up to date information available for the user ?
I'll admit first, this isn't a great solution.
Setup another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
Benefits:
Every search is neatly packed into it's own table, [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will have a table for every x user * y searches a user can store.
This could get very silly very quickly unless there is management involved to expire results or the user can only have 1 cached search result set.
Not pretty, but I can't think of another way.

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me, I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. they use indexes to quickly find what you are after. think phone book how effective is the index? would it be better to have a different book for each last name?
This will not give you anything performance wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. however if you split the table up, it becomes impossible/really really hard to get any info that spans multiple users, like search all users for a certain widget, count of all widgets of a certain type, etc. you need to have every query be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted. Can you delete the rows at a later time, with less database activity. is the delete slow because it is being blocked by other activity. look into those before you break up the table.
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables
dynamically as a means of partitioning?
No. (smile)

Which is better, filtering results at db or application?

A simple question. There are cases when I fetch the data and then process it in my BLL. But I realized that the same processing/filtering can be done in my stored procedure and the filtered results returned to BLL.
Which is better, processing at DB or processing in BLL? And Why?
consider the scenario, I want to check whether a product exists in my db and if it exists add that to the order (Example taken from answer by Nour Sabony below)now i can do this checking at my BLL or I an do this at the stored procedure as well. If I combine things to one procedure, i reduce the whole operation to one db call. Is that better?
At the database level.
Databases should be optimized for querying, so why go through all the effort to return a large dataset, and then do filtering in your application afterwards?
As a general rule: Anything that reasonably belongs to the db's responsibilities and can be done at your db, do it at your db.
well, the fatest answer is at database,
but you may consider something like Linq2Sql,
I mean to write an expression at the presentation layer, and it will be parsed as Sql Statement at Data Access Layer.
of course there are some situation BLL should get some data from DAL ,process it ,and then return it to DAL.
take an example :
PutOrder(Order value) procedure , which should check for the availability of the ordered product.
public void PutOrder(Order _order)
{
foreach (OrderDetail _orderDetail in _order.Details)
{
int count = dalOrder.GetProductCount(_orderDetail.Product.ProductID);
if (count == 0)
throw new Exception (string.Format("Product {0} is not available",_orderDetail.Product.Name));
}
dalOrder.PutOrder(_order);
}
but if you are making a Browse view, it is not a good idea (from a performance viewpoint) to bring all the data from Dal and then choose what to display in the Browse view.
may be the following could help:
public List<Product> SearchProduts(Criteria _criteria)
{
string sql = Parser.Parse(_criteria);
///code to pass the sql statement to Database procedure and get the corresponding data.
}
at the database.
I have consistently been taught that data should be handled in the data layer whenever possible. Filtering data is what a DB is specifically there to do and is optimized to do. Therefore that is where it should occur.
This also prevents the duplication of effort. If the data filtering occurs in a data layer then other code/systems could benefit from the same stored procedure instead of duplicating the work in your BLL.
I have found that people like to put the primary logic in the section that they are most comfortable working in. Regardless of a developer's strengths, data processing should go in the data layer whenever possible.
To throw a differing view point out there it also really depends upon your data abstraction model if you have a large 2nd level cache that sits on top of your database and the majority (or entirety) of your data you will be filtering on is in the cache then yes stay inside the application and use the cached data.
The difference is not very important when you consider tables that will never contain more than a few dozen rows.
Filtering at the db becomes the obvious choice when you consider tables that will grow to thousands, hundreds of thousands or millions of rows. It would make zero sense to transfer that much data to filter for a subset of records.
I asked a similar question in a SSRS class earlier this year when I wanted to know "best practice" for having a stored procedure retrieve queried data instead of executing TSQL in the report's data set. The response I received was: Allow the database engine to do the "heavy lifting." That is, do all the selecting, grouping, sorting in the stored procedure and allow SSRS to concentrate on delivering the results.
The same is true for your scenario. By all means, my vote would be to do all the filtering in your stored procedure. Call the sproc from your application and let the database do the work.
If you filter at the database, then only the data that is required by the application is actually sent on the wire between the database and the application server. For this reason, I suggest that it is better to filter at the database.

When to try and tune the SQL or just summarize data in a table?

I have an EMPLOYEE table in a SQL Server 2008 database which stores information for employees (~80,000+) many times for each year. For instance, there could by 10 different instances of each employees data for different years.
I'm reporting on this data via a web app, and wanted to report mostly with queries directly against the EMPLOYEE table, using functions to get information that needed to be computed or derived for reporting purposes.
These functions sometimes have to refer to an EMPLOYEE_DETAIL table which has 100,000+ rows for each year - so now that I'm starting to write some reporting-type queries, some take around 5-10 seconds to run, which is a bit too slow.
My question is, in a situation like this, should I try and tune functions and such so I
can always query the data directly for reporting (real-time), or is a better approach to summarize the data I need in a static table via a procedure or saved query, and use that for any reporting?
I guess any changes in reporting needs could be reflected in the "summarizing mechanism" I use...but I'm torn on what to do here...
Before refactoring your functions I would suggest you take a look at your indexes. You would be amazed at how much of a difference well constructed indexes can make. Also, index maintenance will probably require less effort than a "summarizing mechanism"
Personally, I'd use the following approach:
If it's possible to tune the function, for example, by adding an index specifically suited to the needs of your query or by using a different clustered index on your tables, then tune it. Life is so much easier if you do not have to deal with redundancy.
If you feel that you have reached the point where optimization is no longer possible (fetching a few thousand fragmented pages from disk will take some time, no matter what you do), it might be better to store some data redundantly rather than completely restructuring the way you store your data. If you take this route, be very careful to avoid inconsistencies.
SQL Server, for example, allows you to use indexed views, which store summary data (i.e. the result of some view) redundantly for quick access, but also automatically take care of updating that data. Of course, there is a performance penalty when modifying the underlying tables, so you'll have to check if that fits your needs.
Ohterwise, if the data does not have to be up-to-date, periodic recalculation of the summaries (at night, when nobody is working) might be the way to go.
Should I try and tune functions and
such so I can always query the data
directly for reporting (real-time), or
is a better approach to summarize the
data I need in a static table via a
procedure or saved query, and use that
for any reporting?
From the description of your data and queries (historic data for up to 10 years, aggregate queries for computed values) this looks like an OLAP business inteligence type data store, whre the is more important to look at historic trends and old read-only data rather than see the current churn and last to the second update that occured. As such the best solution would be to setup an SQL Analysis Services server and query that instead of the relational database.
This is a generic response, without knowing the details of your specifics. Your data size (~80k-800k employee records, ~100k -1 mil detail records) is well within the capabilities of SQL Server relational engine to give sub second responses on aggregates and business inteligence type queries, specially if you add in something like indexed views for some problem aggregates. But what the relational engine (SQL Server) can do will pale in comparison with what the analytical engine (SQL Server Analysis Services) can.
My question is, in a situation like this, should I try and tune functions and such so I
can always query the data directly for reporting (real-time), or is a better approach to summarize the data I need in a static table via a procedure or saved query, and use that for any reporting?
You can summarize the data in chunks of day, month etc, aggregate these chunks in your reports and invalidate them if some data in the past changes (to correct the errors etc.)
What is your client happy with, in terms of real time reporting & performance?
Having said that, it might be worthwhile to tune your query/indexes.
I'd be surprised if you can't improve performance by modifying your indexes.
Check indexes, rework functions, buy more hardware, do anything before you try the static table route.
100,000 rows per year (presumably around 1 million total) is nothing. If those queries are taking 5-10 seconds to run then there is either a problem with your query or a problem with your indexes (or both). I'd put money on your perf issues being the result of one or more table scans or index scans.
When you start to close on the billion-row mark, that's when you often need to start denormalizing, and only in a heavy transactional environment where you can't afford to index more aggressively.
There are, of course, always exceptions, but when you're working with databases it's preferable to look for major optimizations before you start complicating your architecture and schema with partitions and triggers and so on.

Sorting based on calculation with nhibernate - best practice

I need to do paging with the sort order based on a calculation. The calculation is similar to something like reddit's hotness algorithm in that its dependant on time - time since post creation.
I'm wondering what the best practice for this would be. Whether to have this sort as a SQL function, or to run an update once an hour to calculate the whole table.
The table has hundreds of thousands of rows. And I'm using nhibernate, so this could cause problems for the scheduled full calcution.
Any advice?
It most likely will depend a lot on the load on your server. A few assumptions for my answer:
Your calculation is most likely not simple, but will take into account a variety of factors, including time elapsed since post
You are expecting at least reasonable growth in your site, meaning new data will be added to your table.
I would suggest your best bet would be to calculate and store your ranking value, and as Nuno G mentioned retrieve using an ordered clause. As you note there are likely to be some implications, two of which would be:
Scheduling Updates
Ensuring access to the table
As far as scheduling goes you may be able to look at some ways of intelligently recalculating your value. For example, you may be able to identify when a calculation is likely to be altered (for example, if a dependant record is updated you might fire a trigger, adding the ID of your table to a queue for recalculation). You may also do the update in ranges, rather then in the full table.
You will also want to minimise any locking of your table whilst you are recalculating. There are a number of ways to do this, including setting your isolation levels (using MS SQL terminonlogy). If you are really worried you could even perform your calculation externally (eg. in a temp table) and then simply run an update of the values to your main table.
As a final note I would recommend looking into the paging options available to you - if you are talking about thousands of records make sure that your mechanism determines the page you need on the SQL server so that you are not returning the thousands of rows to your application, as this will slow things down for you.
If you can perform the calculation using SQL, try use Hibernate to load the sorted collection by executing a SQLQuery, where your query includes a 'ORDER BY' expression.