To aggregate or not to aggregate, that is the database schema design question

To aggregate or not to aggregate, that is the database schema design question - sql

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand aggregation tables have given me significantly faster queries but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping entirely back to the raw data table or combining more fine grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work that you'd think and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying from the raw data can be punishingly slow but lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise the application code requires fewer updates. I suspect that if I was smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data but that's by no means a panacea.
Is there anyway I can have the best of both worlds?

We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
You have a lot of flexibility for
querying any way you want. Before we
had to build specific aggregates,
but now one cube answers all our
questions.
Storage in a cube is far smaller
than the detailed data.
Building and processing the cubes
takes less time and produces less
load on the database servers than
the aggregates did.
Some CONS:
There is a learning curve around
building cubes and learning MDX.
We had to create some tools to
automate working with the cubes.
UPDATE:
Since you're using MySql, you could take a look at Pentaho Mondrian, which is an open source OLAP solution that supports MySql. I've never used it though, so I don't know if it will work for you or not. Would be interested in knowing if it works for you though.

It helps to pick a good primary key (ie [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range-condition on used_date.
But as the table grows, you can reduce your table-size by aggregating to a table like [user_id, used_date]. For every range where the time-of-day doesn't matter you can then use that table. An other way to reduce the table-size is archiving old data that you don't (allow) querying anymore.

I always lean towards raw data. Once aggregated, you can't go back.
Nothing to do with deletion - unless there's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming that the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.

Long history question, for currently, I found this useful, answered by microstrategy engineer
BTW, another already have solutions like (cube.dev/dremio) you don't have to do by yourself.

Related

Data profiling of columns for big table (SQL Server)

I have table with over 40 million records. I need to make data profiling, including Nulls count, Distinct Values, Zeros and Blancs, %Numeric, %Date, Needs to be Trimmed, etc.
The examples that I was able to find are always including implementation of the task using cursors. For big table such solution is performance killer.
I would be happy if I receive suggestions and examples which give better performance alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so I base my question only on understanding that I got from documentation.

As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but a couple things you can look into are as follows:
Columnstore Indexes - These can be helpful for analytical querying against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT) etc. This is because of the efficiencies in compression that can be achieved up and down the column for analytics. This is useful if you need a "real time" based on answer every time you query against the data.
You can periodically stage and update the results in a data warehouse type table. You basically can store the results to those aggregations in it's own table and periodically update it with either a SQL Agent Job (this isn't necessarily a real time solution) or use triggers to automatically update your data warehouse table (which will be closer to a real time solution but can be performance heavy if not implemented in a lean manner).
OLAP Cubes - This is more of an automated way to the above solution and has better maintainability but is also more advanced of a solution. This is a methodology for building out an actual OLAP based data warehouse.
In terms of difficulty of implementation and based on the size of your data (which isn't anything too huge) my recommendation would be to start with columnstore indexes and see how that helps your queries. I've had much success using them for analytical querying. Otherwise my remaining recommendations are in order of difficulty as well.

Three SQL tables or one?

I have a choice of creating three tables with identical structure but different content or one table with all of the data and one additional column that distinguishes the data. Each table will have about 10,000 rows in it, and it will be used exclusively for looking up data. The key design criteria is speed of lookup, so which is faster: three tables with 10K rows each or one table with 30K rows, or is there no substantive difference? Note: all columns that will be used as query parameters will have indices.

There should be no substantial difference between 10k or 30k rows in any modern RDBMS in terms of lookup time. In any case not enough difference to warrant the de-normalization. Indexed qualifier column is a common approach for such a design.
The only time you may consider de-normalizing if your update pattern affects a limited set of data that you can put in a "short" table (say, today's messages in social network) with few(er) indexes for fast inserts/updates and there is a background process transferring the stabilized updates to a large, fully indexed table. The case were you really win during write operations will be a dramatic one though, with very particular and unfortunate requirements. RDBMS engines are sophisticated enough to handle most of the simple scenarios in very efficient way. 30k or rows does not sound like a candidate.
If still in doubt, it is very easy to write a test to check on your particular database / system setup. I think if you post your findings here with real data, it will be a useful info for everyone in your steps.

Apart from the speed issue, which the other posters have covered and I agree with, you should also take into consideration the business model that your are replicating in your database, as this may affect the maintenance cost of your solution.
If is it possible that the 3 'things' may turn into 4, and you have chosen the separate table path, then you will have to add another table. Whereas if you choose the discriminator path then it is as simple as coming up with a new discriminator.
However, if you choose the discriminator path and then new requirements dictate that one of 'things' has more data to store then you are going to have to add extra columns to your table which have no relevance to the other 'things'.
I cannot say which is the right way to go, as only you know your business model.

When to try and tune the SQL or just summarize data in a table?

I have an EMPLOYEE table in a SQL Server 2008 database which stores information for employees (~80,000+) many times for each year. For instance, there could by 10 different instances of each employees data for different years.
I'm reporting on this data via a web app, and wanted to report mostly with queries directly against the EMPLOYEE table, using functions to get information that needed to be computed or derived for reporting purposes.
These functions sometimes have to refer to an EMPLOYEE_DETAIL table which has 100,000+ rows for each year - so now that I'm starting to write some reporting-type queries, some take around 5-10 seconds to run, which is a bit too slow.
My question is, in a situation like this, should I try and tune functions and such so I
can always query the data directly for reporting (real-time), or is a better approach to summarize the data I need in a static table via a procedure or saved query, and use that for any reporting?
I guess any changes in reporting needs could be reflected in the "summarizing mechanism" I use...but I'm torn on what to do here...

Before refactoring your functions I would suggest you take a look at your indexes. You would be amazed at how much of a difference well constructed indexes can make. Also, index maintenance will probably require less effort than a "summarizing mechanism"

Personally, I'd use the following approach:
If it's possible to tune the function, for example, by adding an index specifically suited to the needs of your query or by using a different clustered index on your tables, then tune it. Life is so much easier if you do not have to deal with redundancy.
If you feel that you have reached the point where optimization is no longer possible (fetching a few thousand fragmented pages from disk will take some time, no matter what you do), it might be better to store some data redundantly rather than completely restructuring the way you store your data. If you take this route, be very careful to avoid inconsistencies.
SQL Server, for example, allows you to use indexed views, which store summary data (i.e. the result of some view) redundantly for quick access, but also automatically take care of updating that data. Of course, there is a performance penalty when modifying the underlying tables, so you'll have to check if that fits your needs.
Ohterwise, if the data does not have to be up-to-date, periodic recalculation of the summaries (at night, when nobody is working) might be the way to go.

Should I try and tune functions and
such so I can always query the data
directly for reporting (real-time), or
is a better approach to summarize the
data I need in a static table via a
procedure or saved query, and use that
for any reporting?
From the description of your data and queries (historic data for up to 10 years, aggregate queries for computed values) this looks like an OLAP business inteligence type data store, whre the is more important to look at historic trends and old read-only data rather than see the current churn and last to the second update that occured. As such the best solution would be to setup an SQL Analysis Services server and query that instead of the relational database.
This is a generic response, without knowing the details of your specifics. Your data size (~80k-800k employee records, ~100k -1 mil detail records) is well within the capabilities of SQL Server relational engine to give sub second responses on aggregates and business inteligence type queries, specially if you add in something like indexed views for some problem aggregates. But what the relational engine (SQL Server) can do will pale in comparison with what the analytical engine (SQL Server Analysis Services) can.

My question is, in a situation like this, should I try and tune functions and such so I
can always query the data directly for reporting (real-time), or is a better approach to summarize the data I need in a static table via a procedure or saved query, and use that for any reporting?
You can summarize the data in chunks of day, month etc, aggregate these chunks in your reports and invalidate them if some data in the past changes (to correct the errors etc.)

What is your client happy with, in terms of real time reporting & performance?
Having said that, it might be worthwhile to tune your query/indexes.
I'd be surprised if you can't improve performance by modifying your indexes.

Check indexes, rework functions, buy more hardware, do anything before you try the static table route.

100,000 rows per year (presumably around 1 million total) is nothing. If those queries are taking 5-10 seconds to run then there is either a problem with your query or a problem with your indexes (or both). I'd put money on your perf issues being the result of one or more table scans or index scans.
When you start to close on the billion-row mark, that's when you often need to start denormalizing, and only in a heavy transactional environment where you can't afford to index more aggressively.
There are, of course, always exceptions, but when you're working with databases it's preferable to look for major optimizations before you start complicating your architecture and schema with partitions and triggers and so on.

What is table partitioning?

In which case we should use table partitioning?

An example may help.
We collected data on a daily basis from a set of 124 grocery stores. Each days data was completely distinct from every other days. We partitioned the data on the date. This allowed us to have faster
searches because oracle can use partitioned indexes and quickly eliminate all of the non-relevant days.
This also allows for much easier backup operations because you can work in just the new partitions.
Also after 5 years of data we needed to get rid of an entire days data. You can "drop" or eliminate an entire partition at a time instead of deleting rows. So getting rid of old data was a snap.
So... They are good for large sets of data and very good for improving performance in some cases.

Partitioning enables tables and indexes or index-organized tables to be subdivided into smaller manageable pieces and these each small piece is called a "partition".
For more info: Partitioning in Oracle. What? Why? When? Who? Where? How?

When you want to break a table down into smaller tables (based on some logical breakdown) to improve performance. So now the user can refer to tables as one table name or to the individual partitions within.

Table partitioning consist in a technique adopted by some database management systems to deal with large databases. Instead of a single table storage location, they split your table in several files for quicker queries.
If you have a table which will store large ammounts of data (I mean REALLY large ammounts, like millions of records) table partitioning will be a good option.

i) It is the subdivision of a single table into multiple segments, called partitions, each of which holds a subset of values.
ii) You should use it when you have read the documentation, run some test cases, fully understood the advantages and disadvantages of it, and found that it will be of benefit.
You are not going to get a complete answer on a forum. Go and read the documentation and come back when you have a manageable problem.

How do you optimize tables for specific queries?

What are the patterns you use to determine the frequent queries?
How do you select the optimization factors?
What are the types of changes one can make?

This is a nice question, if rather broad (and none the worse for that).
If I understand you, then you're asking how to attack the problem of optimisation starting from scratch.
The first question to ask is: "is there a performance problem?"
If there is no problem, then you're done. This is often the case. Nice.
On the other hand...
Determine Frequent Queries
Logging will get you your frequent queries.
If you're using some kind of data access layer, then it might be simple to add code to log all queries.
It is also a good idea to log when the query was executed and how long each query takes. This can give you an idea of where the problems are.
Also, ask the users which bits annoy them. If a slow response doesn't annoy the user, then it doesn't matter.
Select the optimization factors?
(I may be misunderstanding this part of the question)
You're looking for any patterns in the queries / response times.
These will typically be queries over large tables or queries which join many tables in a single query. ... but if you log response times, you can be guided by those.
Types of changes one can make?
You're specifically asking about optimising tables.
Here are some of the things you can look for:
Denormalisation. This brings several tables together into one wider table, so in stead of your query joining several tables together, you can just read one table. This is a very common and powerful technique. NB. I advise keeping the original normalised tables and building the denormalised table in addition - this way, you're not throwing anything away. How you keep it up to date is another question. You might use triggers on the underlying tables, or run a refresh process periodically.
Normalisation. This is not often considered to be an optimisation process, but it is in 2 cases:
updates. Normalisation makes updates much faster because each update is the smallest it can be (you are updating the smallest - in terms of columns and rows - possible table. This is almost the very definition of normalisation.
Querying a denormalised table to get information which exists on a much smaller (fewer rows) table may be causing a problem. In this case, store the normalised table as well as the denormalised one (see above).
Horizontal partitionning. This means making tables smaller by putting some rows in another, identical table. A common use case is to have all of this month's rows in table ThisMonthSales, and all older rows in table OldSales, where both tables have an identical schema. If most queries are for recent data, this strategy can mean that 99% of all queries are only looking at 1% of the data - a huge performance win.
Vertical partitionning. This is Chopping fields off a table and putting them in a new table which is joinned back to the main table by the primary key. This can be useful for very wide tables (e.g. with dozens of fields), and may possibly help if tables are sparsely populated.
Indeces. I'm not sure if your quesion covers these, but there are plenty of other answers on SO concerning the use of indeces. A good way to find a case for an index is: find a slow query. look at the query plan and find a table scan. Index fields on that table so as to remove the table scan. I can write more on this if required - leave a comment.
You might also like my post on this.

That's difficult to answer without knowing which system you're talking about.
In Oracle, for example, the Enterprise Manager lets you see which queries took up the most time, lets you compare different execution profiles, and lets you analyze queries over a block of time so that you don't add an index that's going to help one query at the expense of every other one you run.

Your question is a bit vague. Which DB platform?
If we are talking about SQL Server:
Use the Dynamic Management Views. Use SQL Profiler. Install the SP2 and the performance dashboard reports.
After determining the most costly queries (i.e. number of times run x cost one one query), examine their execution plans, and look at the sizes of the tables involved, and whether they are predominately Read or Write, or a mixture of both.
If the system is under your full control (apps. and DB) you can often re-write queries that are badly formed (quite a common occurrance), such as deep correlated sub-queries which can often be re-written as derived table joins with a little thought. Otherwise, you options are to create covering non-clustered indexes and ensure that statistics are kept up to date.

For MySQL there is a feature called log slow queries
The rest is based on what kind of data you have and how it is setup.

In SQL server you can use trace to find out how your query is performing. Use ctrl + k or l
For example if u see full table scan happening in a table with large number of records then it probably is not a good query.
A more specific question will definitely fetch you better answers.

If your table is predominantly read, place a clustered index on the table.

My experience is with mainly DB2 and a smattering of Oracle in the early days.
If your DBMS is any good, it will have the ability to collect stats on specific queries and explain the plan it used for extracting the data.
For example, if you have a table (x) with two columns (date and diskusage) and only have an index on date, the query:
select diskusage from x where date = '2008-01-01'
will be very efficient since it can use the index. On the other hand, the query
select date from x where diskusage > 90
would not be so efficient. In the former case, the "explain plan" would tell you that it could use the index. In the latter, it would have said that it had to do a table scan to get the rows (that's basically looking at every row to see if it matches).
Really intelligent DBMS' may also explain what you should do to improve the performance (add an index on diskusage in this case).
As to how to see what queries are being run, you can either collect that from the DBMS (if it allows it) or force everyone to do their queries through stored procedures so that the DBA control what the queries are - that's their job, keeping the DB running efficiently.

indices on PKs and FKs and one thing that always helps PARTITIONING...

1. What are the patterns you use to determine the frequent queries?
Depends on what level you are dealing with the database. If you're a DBA or a have access to the tools, db's like Oracle allow you to run jobs and generate stats/reports over a specified period of time. If you're a developer writing an application against a db, you can just do performance profiling within your app.
2. How do you select the optimization factors?
I try and get a general feel for how the table is being used and the data it contains. I go about with the following questions.
Is it going to be updated a ton and on what fields do updates occur?
Does it have columns with low cardinality?
Is it worth indexing? (tables that are very small can be slowed down if accessed by an index)
How much maintenance/headache is it worth to have it run faster?
Ratio of updates/inserts vs queries?
etc.
3. What are the types of changes one can make?
-- If using Oracle, keep statistics up to date! =)
-- Normalization/De-Normalization either one can improve performance depending on the usage of the table. I almost always normalize and then only if I can in no other practical way make the query faster will de-normalize. A nice way to denormalize for queries and when your situation allows it is to keep the real tables normalized and create a denormalized "table" with a materialized view.
-- Index judiciously. Too many can be bad on many levels. BitMap indexes are great in Oracle as long as you're not updating the column frequently and that column has a low cardinality.
-- Using Index organized tables.
-- Partitioned and sub-partitioned tables and indexes
-- Use stored procedures to reduce round trips by applications, increase security, and enable query optimization without affecting users.
-- Pin tables in memory if appropriate (accessed a lot and fairly small)
-- Device partitioning between index and table database files.
..... the list goes on. =)
Hope this is helpful for you.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas