Is it acceptable to dynamically generate the total of the contents of a field using up to 10k records instead of storing the total in a table?
I have some reasons to prefer on-demand generation of the total, but how bad is the performance price on an average home PC? (There would be some ORM-managed joins involved in computing the total.)
Let me know if I'm leaving out any info important to deciding the answer.
EDIT: This is a stand-alone program on a user's PC.
If you have appropriate indexing in place, it won't be too bad to do on demand calculations. The reason that I mention indexing is that you haven't specified whether the total is on all the values in a column, or on a subset - if it's a subset, then the fields that make up the filter may need to be indexed, so as to avoid table scans.
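As a rough illustration, here is a minimal sketch of that idea with invented names (an Orders table with an Amount value column and a Status filter column), assuming the total is over a filtered subset:

-- index on the filter column so the subset can be found without a table scan
CREATE INDEX IX_Orders_Status ON Orders (Status);

-- on-demand total over the filtered subset
SELECT SUM(Amount) AS Total
FROM Orders
WHERE Status = 'Completed';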
Usually it is totally acceptable and even recommended to recalculate values. If you start storing calculated values, you'll face some overhead ensuring that they are always up to date, usually using triggers.
That said, if your specific calculation query turns out to take a lot of time, you might need to go that route, but only do that if you actually hit a performance problem, not upfront.
Using a SQL query you can quickly and inexpensively compute the total with an aggregate function such as SUM (or COUNT, if what you actually want is the number of records).
It is better to generate the total than to store it as a record, the same way you would store a person's birth date and calculate their age from it rather than storing the age.
How often, and by how many users, will this total be requested? And how often is the data the total depends on updated?
Maybe all you need is to run this big query once a day (or just once), save the result somewhere in the DB, and then update it whenever the data your total is built from changes.
You "could" calculate the total with SQL (I am assuming you do not want total number of records ... the price total or whatever it is). SQL is quite good at mathematics when it gets told to do so :) No storing of total.
But, as it is all run on the client machine, I think my preference would be to total using C#. Then the business rules for calculating the total are out of the DB/SQL. By that I mean if you had a complex calculation for total that reuired adding say 5% to orders below £50 and the "business" changed it to add 10% to orders below £50 it is done in your "business logic" code rather than in your storage medium (in this case SQL).
Kindness,
Dan
I think that it should not take long, probably less than a second, to generate a sum from 8000-10000 records. Even on a single PC the query plan for this query should be dominated by a single table scan, which will generate mostly sequential I/O.
Proper indexing should make any joins reasonably efficient unless the schema is deeply flawed and unless you have (say) large blob fields in the table the total data volume for the rows should not be very large at all. If you still have performance issues going through an O/R mapper, consider re-casting the functionality as a report where you can control the SQL.
So I'm looking into data warehousing and partitioning, and am very curious as to what scale makes the most sense for partitioning data on a key (for instance, SaleDate).
Tutorials often mention that you're trying to break it down into logical chunks so as to make updating the data less likely to cause service disruptions.
So let's say I'm a medium scale company working in a given US state. I do a lot of work in relation to SaleDate, often tens of thousands of transactions a day (with requisite transaction details, 4-50 each?), and have about 5 years of data. I would like to query and build trend information off of that; for instance:
On a yearly basis to know what items are becoming less popular over time.
On a monthly basis to see what items get popular at a certain time of year (ice in summer)
On a weekly basis to see how well my individual stores are doing
On a daily basis to observe theft trends or something
Now my business unit also wants to query that data, but I'd like to be able to keep it responsive.
How do I know that it would be best to partition on Year, Month, Week, Day, etc for this data set? Is it just whatever I actually observe as providing the best response time by testing out each scenario? Or is there some kind of scale that I can use to understand where my partitions would be the most efficient?
Edit: I, personally, am using Sql Server 2012. But I'm curious as to how others view this question in relation to the core concept rather than the implementation (Unless this isn't one of those cases where you can do so).
Things to consider:
What type of database are you using? Really important, different strategies for Oracle vs SQLServer vs IBM, etc.
Sample queries and run times. Partition usage depends on the conditions in your where clause: what are you filtering on?
Does it make sense to create/use aggregate tables? Seems like a monthly aggregate would save you some time.
Lots of options based on the hardware and storage options available to you, need more details to make a more specific recommendation.
Here is an example: an MS SQL 2012 database with 7 million records a day, with the ambition to grow the database to 6 years of data for trend analyses.
The partitions are based on the YearWeek column, expressed as an integer (after 201453 comes 201501). So each partition holds one week of transaction data.
This makes for a maximum of 320 partitions, which stays well below the maximum of 1,000 partitions within a scheme. The maximum size of one partition in one table is now approx. 10 GB, which makes it much easier to handle than the 3 TB size of the total.
A new file in the partition scheme is used for each new year. The 500 GB data files are a convenient size for backup and deletion.
When calculating data for one month, the four processors work in parallel, each handling one partition.
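For reference, a minimal T-SQL sketch of this kind of YearWeek partitioning (names such as pfYearWeek, psYearWeek and fact_transaction are invented, and the simplified scheme maps everything to PRIMARY rather than one file per year):

-- partition function with one boundary value per YearWeek
CREATE PARTITION FUNCTION pfYearWeek (int)
    AS RANGE RIGHT FOR VALUES (201401, 201402, 201403 /* ... one per week ... */);

-- simplified scheme: map every partition to the PRIMARY filegroup
CREATE PARTITION SCHEME psYearWeek
    AS PARTITION pfYearWeek ALL TO ([PRIMARY]);

CREATE TABLE fact_transaction (
    YearWeek  int           NOT NULL,  -- e.g. 201453, then 201501
    SaleDate  date          NOT NULL,
    Amount    decimal(18,2) NOT NULL
) ON psYearWeek (YearWeek);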
I'm having some performance problems where a SQL query calculating the average of a column is progressively getting slower as the number of records grows. Is there an index type that I can add to the column that will allow for faster average calculations?
The DB in question is PostgreSQL and I'm aware that the particular index type might not be available, but I'm also interested in the theoretical answer: whether this is even possible without some sort of caching solution.
To be more specific, the data in question is essentially a log with this sort of definition:
CREATE TABLE log (
    duration integer,
    time     date,
    event    text
);
I'm doing queries like
SELECT avg(duration) FROM log WHERE event = 'finished';                        -- average time to completion
SELECT avg(duration) FROM log WHERE event = 'finished' AND time > $yesterday;  -- average for today
The second one is always fairly fast since it has a more restrictive WHERE clause, but the total-average query is the type that is causing the problem. I understand that I could cache the values, using OLAP or something; my question is whether there is a way I can do this entirely with DB-side optimisations such as indices.
The performance of calculating an average will always get slower the more records you have, as it always has to use the values from every record in the result.
An index can still help, if the index contains less data than the table itself. Creating an index for the field that you want the average for generally isn't helpful as you don't want to do a lookup, you just want to get to all the data as efficiently as possible. Typically you would add the field as an output field in an index that is already used by the query.
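For the log table from the question, a hedged sketch of that idea (the index names are invented; whether PostgreSQL can actually satisfy the query with an index-only scan also depends on the visibility map being reasonably up to date):

-- PostgreSQL 11+: add duration as a non-key INCLUDE column
CREATE INDEX idx_log_event_inc_duration ON log (event) INCLUDE (duration);

-- older versions: add it as a trailing key column instead
CREATE INDEX idx_log_event_duration ON log (event, duration);

-- the query can then be answered from the index without visiting the heap
SELECT avg(duration) FROM log WHERE event = 'finished';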
It depends on what you are doing. If you aren't filtering the data then, beyond having the clustered index in order, how else is the database to calculate an average of the column?
There are systems which perform online analytical processing (OLAP) that will do things like keeping running sums and averages for the information you wish to examine. It all depends on what you are doing and your definition of "slow".
If you have a web based program for instance, perhaps you can generate an average once a minute and then cache it, serving the cached value out to users over and over again.
Speeding up aggregates is usually done by keeping additional tables.
Assuming a sizeable table detail(id, dimA, dimB, dimC, value), if you would like the performance of AVG (or other aggregate functions) to be nearly constant time regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
The size of this table depends only on the number of distinct values of dimA. Furthermore, this table could make sense in your design anyway, as it can hold the domain of values available for dimA in detail, along with other attributes related to those domain values; you might/should already have such a table.
This table is only helpful if you will analyze by dimA only; once you need AVG(value) by dimA and dimB it becomes useless. So, you need to know by which attributes you will want to do fast analysis. The number of rows required for keeping aggregates on multiple attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow pretty quickly.
Maintaining this table increases the costs of updates (incl. inserts and deletes), but there are further optimizations that you can employ...
For example, let us assume that the system predominantly does inserts and only occasionally updates and deletes.
Let's further assume that you want to analyze by dimA only and that ids are increasing. Then having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without a big impact on the system.
This is because you could have triggers that do not update the aggregates on every insert, but, let's say, on every 100 inserts.
This way you can still get accurate aggregates from this table and the details table with
SELECT a.dimA, (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM detail d
INNER JOIN dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA
The above query with proper indexes would get one row from dimA_agg and fewer than 100 rows from detail, so it would perform in near constant time (on the order of log_fanout(n) for the index seeks) and would not require an update to dimA_agg for every insert (reducing the update penalty).
The value of 100 was just given as an example; you should find the optimal value yourself (or even make it configurable, though triggers alone will not be enough in that case).
Triggers handling deletes and updates must fire on each operation, but you can still check whether the id of the record being deleted or updated is already reflected in the stats, to avoid unnecessary updates (this will save some I/O).
Note: The analysis is done for a domain with discrete attributes; when dealing with time series the situation gets more complicated - you have to decide the granularity of the domain in which you want to keep the summary.
EDIT
There are also materialized views.
Just a guess, but indexes won't help much since an average must read all the records (in any order); indexes are useful to find subsets of rows, but if you have to iterate over all rows with no special ordering, indexes won't help.
This might not be what you're looking for, but if your table has some way to order the data (e.g. by date), then you can just do incremental computations and store the results.
For example, if your data has a date column, you could compute the average for all records up to Date1, then store that average along with Date1 and the number of records you averaged. The next time you compute, you restrict your query to Date1..Date2, combine its average and record count with the stored ones, and update the last date queried. You then have all the information you need to compute the new (weighted) average.
When doing this, it would obviously be helpful to have an index on the date, or whatever column(s) you are using for the ordering.
I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.
I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.
Should I:
start pre-calculating aggregates in the database (since the data is static)
move away from postgres and use something else?
The only problem with 1. is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.
If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.
You are trying to solve an OLAP (On-Line Analytical Process) data base structure problem with an OLTP (On-Line Transactional Process) database structure.
You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the online transaction processing system at all.
The only caveat is that the aggregate data will always be one day behind.
Yes
Possibly. Presumably there are a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you would use Indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views
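For reference, a minimal T-SQL sketch of what such an indexed view might look like (the table and column names here are invented, and Amount is assumed to be NOT NULL; SQL Server requires SCHEMABINDING and COUNT_BIG(*) in a grouped indexed view):

CREATE VIEW dbo.vTotalsByKey
WITH SCHEMABINDING
AS
SELECT ForeignKeyId,
       SUM(Amount)  AS TotalAmount,   -- Amount assumed NOT NULL
       COUNT_BIG(*) AS RowCnt
FROM dbo.Facts
GROUP BY ForeignKeyId;
GO

-- the unique clustered index is what materializes (and auto-maintains) the aggregate
CREATE UNIQUE CLUSTERED INDEX IX_vTotalsByKey ON dbo.vTotalsByKey (ForeignKeyId);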
If you store the aggregates in an intermediate object (something like MyAggregatedResult), you could consider a caching proxy:
class ResultsProxy {
    MyAggregatedResult calculateResult(Object param1, Object param2) {
        // retrieve from cache
        // if not found, calculate, store in cache, and return the result
    }
}
There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be kept in memory) and memory management (removing old cache items when a memory limit is reached, etc.).
If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).
Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.
Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.
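A minimal PostgreSQL-flavoured sketch of that idea (the gl and account_balance tables, their columns, and the trigger names are invented for illustration; a real ledger trigger would also have to handle updates and deletes):

-- gl(account_id, post_date, amount); account_balance(account_id, month_end, balance)
CREATE OR REPLACE FUNCTION apply_gl_delta() RETURNS trigger AS $$
BEGIN
    -- fold the new entry's amount into every balance row at or after its posting date
    UPDATE account_balance
       SET balance = balance + NEW.amount
     WHERE account_id = NEW.account_id
       AND month_end >= NEW.post_date;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_gl_apply_delta
AFTER INSERT ON gl
FOR EACH ROW EXECUTE PROCEDURE apply_gl_delta();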
For that data volume you shouldn't have to move off Postgres.
I'd look at tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that as well.
More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.
If you have a set of common aggregates you can calculate them beforehand (say, once a week) in a separate table and/or columns, and users get them fast.
But I'd look at the tuning route too: revise your indexing strategy. Since your database is read-only, you don't need to worry about index update overhead.
Revise your database configuration as well; maybe you can squeeze some performance out of it. Default configurations are normally aimed at easing the life of first-time users and quickly fall short with large databases.
Maybe even some denormalization can speed things up once you have revised your indexing and database configuration and still need more performance, but try it as a last resort.
Oracle supports a concept called Query Rewrite. The idea is this:
When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table as you always did but now instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.
Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.
You could maintain a bunch of extra tables that hold those aggregations and then tell the users to change their queries to use a different table. In Oracle, you'd build those as materialized views. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will rewrite the query to use an appropriate level of aggregation. The key is that this happens WITHOUT changing the query at all.
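A hedged Oracle sketch of the idea (daily_sales, sale_date and amount are assumed names; the MV log / fast-refresh details are omitted here, and query rewrite also has to be enabled for the session or instance):

CREATE MATERIALIZED VIEW mv_monthly_sales
    BUILD IMMEDIATE
    REFRESH ON DEMAND
    ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS row_cnt
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');

-- users keep querying the base table; the optimizer can transparently
-- rewrite this to read mv_monthly_sales instead:
SELECT TRUNC(sale_date, 'MM') AS sale_month, SUM(amount)
FROM daily_sales
GROUP BY TRUNC(sale_date, 'MM');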
Maybe other DB's support that... but this is clearly what you are looking for.
I'm looking into using Table Storage for storing some transactional data, however, I need to support some very high level reporting over it, basically totals per day / month.
A couple of options I have thought of:
Use a partition / row key structure and dynamically perform sum
e.g. 20101101_ITEMID_XXXXXXXX (x = guid or time, to make unique)
then I would query for a month's data using a portion of the row key (ITEMID_201011), and compute a total on the "Cost" property of the entity.
How would the query limit of 1000 records be managed by this though? (i.e. if there are more than 1000 transactions for the day, totaling would be hard)
Use another record to store the total for the day, and update this as new records are added
e.g. row key "20101101_ITEMID_TOTAL"
then query this record for the day's, month's, or year's totals.
What is the best way to do this? Is there a 'best practice' for this type of requirement using table storage?
I'm not sure what the best practice is, but I can comment that we have a similar situation with AzureWatch and are definitely using pre-aggregated values in tables.
Mostly for performance reasons -- table storage is not instantaneous even if you query by a single partition key and a range of row keys. The time it takes to download the records is somewhat significant, and depending on the number of records it might spike the CPU, because the data needs to be de-serialized into objects. If you have to make multiple trips to table storage because of the 1000-record limit, you'll be paying more as well.
Some other thoughts to consider:
Will your aggregated totals ever change? If not, this is another nudge toward pre-aggregation.
Will you need to keep aggregated values after raw data is gone or will you ever need to purge raw data? If yes, then it is another nudge toward pre-aggregation
I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding up the row counts of the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.
I've got ideas based around having multiple tables for year, supplier (of the source data) etc - or even having one database for each year, with the most recent 2 years in one DB (which would also contain the stored procs for managing all this.)
Any and all help, ideas, suggestions very, deeply, much appreciated,
Matt.
Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying the missing indexes); you might find that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.
Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition with the 1% commonly accessed and another with the 99% of the data.
This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.
On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, so you would have queries like
SELECT * FROM megaTable WHERE datetime > dateTimeToEpoch('2010-01-23')
This size saving will probably have a performance cost if you need to do complex datetime queries. Although, on cubes there is the standard technique of storing, instead of an epoch, an int in YYYYMMDD format.
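A hedged T-SQL sketch of the YYYYMMDD variant (DateToIntKey and the date_key column are invented stand-ins for the dateTimeToEpoch-style helper mentioned above):

-- style 112 formats a datetime as yyyymmdd, which converts cleanly to int
CREATE FUNCTION dbo.DateToIntKey (@d datetime)
RETURNS int
AS
BEGIN
    RETURN CONVERT(int, CONVERT(char(8), @d, 112));
END;

-- range queries then compare plain integers, e.g.
-- SELECT * FROM megaTable WHERE date_key > dbo.DateToIntKey('2010-01-23');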
What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL 2005 can handle it without much pain.
By the way, do not do tables per year, tables per supplier or other things like this. If you have to store a similar set of items, you need one and only one table. Setting up multiple tables to store the same type of items will cause problems, like:
Queries would be extremely difficult to write, and performance will be decreased if you have to query from multiple tables.
The database design will be very difficult to understand (especially since it's not something natural to store the same type of items in different places).
You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.
It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted at 2011-01-01 00:00:00.001, will a new table be created? Will you check on each insert whether you must create a new table? How would that affect performance? Can you test it easily?
If there is a real, visible separation between "recent" and "old" data (for example you have to use daily the data saved the last month only, and you have to keep everything older, but you do not use it), you can build a system with two SQL servers (installed on different machines). The first, highly available server, will serve to handle recent data. The second, less available and optimized for writing, will store everything else. Then, on schedule, a program will move old data from the first one to the second.
With such a small tuple size (2 ints, 1 datetime, 1 decimal) I think you will be fine having a single table with all the results in it. SQL server 2005 does not limit the number of rows in a table.
If you go down this road and run into performance problems, then it is time to look at alternatives. Until then, I would plow ahead.
EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is about 21 bytes (two 4-byte ints, an 8-byte datetime and a 5-byte decimal), so 200,000,000 rows come to roughly 4.2 billion bytes, which means you can store the entire table in about 4 GB of memory. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super fast queries after a slower warm-up time while the cache is populated.