How to properly design a SQL database to use an aggregate value?

In the following example
Order
-------
ID (int)
CreatedAt (smalldatetime)
....
OrderItems
-------
ID (int)
OrderID (int)
Quantity (int)
UnitPrice (decimal)
CalculationUnit (int)
TotalItemPrice (decimal)
....
I have a big dilemma about where and how I should keep track of TotalOrderPrice; my main concerns are speed and data consistency.
a) TotalOrderPrice could be stored in the Order table and updated on each OrderItem change for the relevant OrderID.
Could this lead to data inconsistency since data is "duplicated"?
b) I could have a view that holds summed TotalItemPrice values, such as
OrderTotal
------------
ID (int)
TotalOrderPrice (decimal)
Could this be a potential issue when scaling the application?
c) Or I could leave the original design as it is and calculate TotalOrderPrice in the business logic.
Could this slow down performance, since all order items would have to be retrieved in order to get the total order price?
I know there are no silver bullets, but since I don't have a large amount of data for testing, I just want to check the facts and see what the proper reasoning would be to find a solution here.

I would recommend against maintaining a computed column that needs to be updated frequently, and instead computing the order total in a query on demand, when your application needs it. You can use a query like the following, which should run reasonably fast:
-- Order is a reserved word, hence the brackets.
SELECT t1.ID, t2.OrderTotalPrice
FROM [Order] t1
INNER JOIN
(
    SELECT OrderID, SUM(TotalItemPrice) AS OrderTotalPrice
    FROM OrderItems
    GROUP BY OrderID
) t2
    ON t1.ID = t2.OrderID
This avoids the problem of having to maintain a computed column, which makes managing your database much easier. A strong argument against a computed column is that it doesn't really save the database any work. Rather, it always needs to be maintained, whereas computing a column on demand only needs to be done when you actually need it.
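If you only need the total for a single order at a time, the same aggregate can simply be filtered, so only that order's items are read (a sketch using the question's schema; @OrderID is an illustrative parameter):

SELECT o.ID, SUM(oi.TotalItemPrice) AS OrderTotalPrice
FROM [Order] o
INNER JOIN OrderItems oi ON oi.OrderID = o.ID
WHERE o.ID = @OrderID
GROUP BY o.ID;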

An order won't contain millions of line items, so speed shouldn't be a problem you must worry about.
Your table OrderItems contains Quantity, UnitPrice and TotalItemPrice. This already seems redundant: isn't TotalItemPrice always Quantity x UnitPrice? That holds provided UnitPrice is already the gross price to pay (and not the net price to which VAT must be added to get TotalItemPrice), and provided any item discount is already included. If there were another column such as item_discount_percent, for instance, we might get a result with too many decimal places, e.g. 105.987002. Does the order then contain 105.98 or 105.99? We would want to store that value in TotalItemPrice to make this unambiguous (and to make sure a new software version would still print exactly the same order). So keep this column only if some calculation may lead to prices with more than two decimal places.
As to your question about TotalOrderPrice, we can apply the same reasoning: if the price is just the sum of all the order's TotalItemPrice values, don't store it. If there is some calculation that can lead to too many decimal places (e.g. an order_discount_percent), you should probably store the (rounded or truncated) value.
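To make the rounding point concrete, here is a sketch; item_discount_percent is the hypothetical column from the example above, not part of the original schema:

-- The raw product can have more than two decimal places; the value actually
-- charged has to be fixed by an explicit ROUND (or truncation), and that is
-- the value worth persisting in TotalItemPrice.
SELECT Quantity,
       UnitPrice,
       Quantity * UnitPrice * (1 - item_discount_percent / 100.0)           AS RawTotal,
       ROUND(Quantity * UnitPrice * (1 - item_discount_percent / 100.0), 2) AS TotalItemPrice
FROM OrderItems;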

I would consider what the access patterns are for the data, as that is what determines the relevant pros and cons.
How often will you need to:
Place a predicate on the total order amount (compute intensive unless you store the total).
Order by the total order amount (compute intensive unless you store the total).
Modify the total order amount (compute intensive and possibly a cause of error if you store the total).
If orders are never modified after creation and you frequently place predicates on the total or order by it, then I'd be confident about storing the total in the order table.
If orders are frequently modified but you very rarely need to place predicates on the total or order by it, then I would be confident in not storing the total.
The correct approach for you depends strongly on where the balance lies between those two extremes, and the risk you're willing to adopt in either poor performance or incorrect data.
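For illustration, here is what the two extremes look like in SQL, assuming option (a)'s stored TotalOrderPrice column for the first query (the 1000 threshold is arbitrary):

-- With a stored (and indexable) total, a predicate or ORDER BY is cheap:
SELECT ID
FROM [Order]
WHERE TotalOrderPrice > 1000
ORDER BY TotalOrderPrice DESC;

-- Without it, the same question has to aggregate every order's items first:
SELECT o.ID, SUM(oi.TotalItemPrice) AS TotalOrderPrice
FROM [Order] o
INNER JOIN OrderItems oi ON oi.OrderID = o.ID
GROUP BY o.ID
HAVING SUM(oi.TotalItemPrice) > 1000
ORDER BY TotalOrderPrice DESC;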

Here is my second answer, which is very different from my first one...
Usually you avoid redundancy in databases because it can lead to inconsistencies. What would you do, for instance, if you saw one day that an order's stored TotalOrderPrice doesn't match what you calculate from its items? To avoid such inconsistencies, we avoid redundancy.
In a data warehouse, however, you invite redundancy in order to have faster access to the data. That means you could have an order system containing the pure Order and OrderItems tables, and a data warehouse system that gets updated at intervals and has an Order table with a column for the TotalOrderPrice.
Thinking this further: does an order ever change in your system? If not, then why not store what you print, i.e. store the TotalOrderPrice redundantly? (You can use database mechanisms to prevent orders from being partially deleted or updated, to make this even safer.) If the stored TotalOrderPrice later really doesn't match what you calculate from the items, that even indicates a problem with your software at the time the order was written. So having stored the TotalOrderPrice suddenly becomes an advantage, probably giving us the chance to detect such errors and make corrections in our accounting.
Having said this: usually an order gets written and not changed afterwards. As no changes are going to apply, you can easily store the TotalOrderPrice in the orders table and have both advantages: seeing later what order price you sent/printed, and retrieving the prices faster.
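If you do store it, a periodic consistency check is cheap to write and catches exactly the kind of error described above (a sketch assuming a TotalOrderPrice column has been added to Order):

-- Any row returned means the stored total no longer matches the items,
-- i.e. the software mis-calculated at the time the order was written.
SELECT o.ID, o.TotalOrderPrice, SUM(oi.TotalItemPrice) AS CalculatedTotal
FROM [Order] o
INNER JOIN OrderItems oi ON oi.OrderID = o.ID
GROUP BY o.ID, o.TotalOrderPrice
HAVING o.TotalOrderPrice <> SUM(oi.TotalItemPrice);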

In general, my take is that you should avoid breaking the rules of normalisation until you need to. That means avoiding data redundancy in order to avoid update anomalies, and calculating things on the fly. I've seen a lot of terrible databases created because a developer worried that one day the database might not cope with the application load; in truth, in a well-designed, well-indexed, and well-maintained database this is rare. RDBMSes are a very good tool for dealing with large amounts of normalised data in transactional systems, if your database is designed and maintained correctly.
This doesn't mean you need to do the calculations in your application logic, though - in fact I'd avoid that. Instead, create a view (looking like the query Tim Biegeleisen suggested in his answer) that does the calculations. If sometime down the road you find that this doesn't scale well, you can swap the view for a pre-calculated table (plus whatever populates it) - this minimises the disruption to your application if that change is ever needed. If the table is populated via a stored procedure, then you might not need to change your front-end application logic at all to switch from calculating on the fly to pre-calculated values.
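A minimal sketch of such a view, using the question's schema (the view name is arbitrary):

CREATE VIEW OrderTotals AS
SELECT o.ID AS OrderID, SUM(oi.TotalItemPrice) AS TotalOrderPrice
FROM [Order] o
INNER JOIN OrderItems oi ON oi.OrderID = o.ID
GROUP BY o.ID;

-- The application then just selects from it, e.g.:
-- SELECT TotalOrderPrice FROM OrderTotals WHERE OrderID = @OrderID;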

Related

Is it wise to store total values into the SQL database? [duplicate]

I wanted to ask if it is a good idea to store total values in an SQL database. By total values I mean things like the total cost of an order, or the total quantity.
As an example, I'll have a table called orders which will hold information like when it was created, who made the order, etc. Orders will relate to the table order_items, which will hold the id of the item in the order, the quantity of the item in the order, and the price at which it was sold (not the item's original price).
In this situation, should I calculate the total cost of the order once and store it in the orders table, or should I calculate it every time I retrieve the order with the corresponding items?
There are pros and cons to each choice. If you choose to store the computed totals, you'll have to update them every time you update the details they depend on. To do that reliably you'll need to enforce it close to the database, in a trigger for example, rather than relying on business logic to get it right all the time.
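A sketch of such a trigger in SQL Server syntax; the orders.total_cost column and the item column names (quantity, price) are assumptions, and statement-level recalculation is just one of several ways to write it:

CREATE TRIGGER trg_order_items_total
ON order_items
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Recompute the stored total for every order touched by this statement.
    UPDATE o
    SET total_cost = ISNULL(t.order_total, 0)
    FROM orders AS o
    LEFT JOIN (SELECT order_id, SUM(quantity * price) AS order_total
               FROM order_items
               GROUP BY order_id) AS t ON t.order_id = o.id
    WHERE o.id IN (SELECT order_id FROM inserted
                   UNION
                   SELECT order_id FROM deleted);
END;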
On the other hand if you don't store the computed value you'll have to compute it every time you want access to it. A view will make that trivial.
The decision to store vs. recompute comes down to how frequently the data changes vs. how often it is accessed. If it changes infrequently, storing the value may be fine, but if it changes frequently it's better to recompute it each time you access it. Similarly, if the data is accessed infrequently, recomputing the values is fine, while if you access it frequently, precomputing may be the better option.
Depending on the database you are using, you may be able to get the best of both worlds, by creating a materialized view.
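In PostgreSQL, for example, that looks roughly like this (column names assumed from the question's description):

CREATE MATERIALIZED VIEW order_totals AS
SELECT order_id, SUM(quantity * price) AS total_cost
FROM order_items
GROUP BY order_id;

-- Re-run this whenever the precomputed figures are allowed to refresh.
REFRESH MATERIALIZED VIEW order_totals;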
For the particular case of an order, the total amount is traditionally stored with the order. The order items then contain the prices of each item in the order.
Why? Simple: certain amounts are available only on the order-level. For instance:
Shipping charges
Taxes (the taxes on the item levels may not add up correctly to the order level)
Order-level discounts (such as discounts based on the overall order size or bundle discounts)
In addition, orders are usually put into the database once, with no or very rare changes on the items (the calculations may not be made until the order is finalized).
Other types of entities might have other considerations. This addresses the specific issue of whether totals should be stored for orders.

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on Stack Overflow has had a similar situation and has a recommendation for either option for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks
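For reference, the join-based approach described above might look roughly like this (the concrete table and column names are placeholders, not from the question):

-- Latest processing date in B for every entry in A; entries never processed
-- come back with NULL.
SELECT a.id, MAX(b.process_date) AS last_process_date
FROM table_a AS a
LEFT JOIN table_b AS b ON b.a_id = a.id
GROUP BY a.id;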
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on Table B's foreign key to A and its date column. If the historical table is big, introduce an auto-increment PK for Table B and have a separate table that maps the B PK id to the A PK id.
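That index might look like this (placeholder names); with the foreign key first and the date second, the per-id latest-date lookup can be answered from the index alone:

CREATE INDEX ix_table_b_aid_date ON table_b (a_id, process_date);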
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (cols: project_id, description, etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description, etc.). Whenever there is a new update to the project, we need to find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and getting MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table in which each row is a parameter, and another table that logs each parameter's value historically as a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table holds a list of measures (29K recs) and the historical parameter value table has the value for each parameter every hour - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history, so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being the last record in the parameter value table. While this may look like duplication of data, from a performance standpoint it makes perfect sense, because:
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.

How to handle an extremely large database table size?

In this scenario, every sales order is going to have at least 400-500 products associated with it. Every time a sales order is generated, the cost and price of those products will be saved in the SalesOrderProduct table. This will cause the SalesOrderProduct table to become extremely large in a short period of time. What's the best way to handle the size of this table?
Are you sure there is a problem?
If you have millions of rows, no sweat. A SQL database will chew that stuff up.
If you have billions of rows, you might want a key-value store instead of a SQL database. Especially for archival information like past orders which is write-once read-never (and analyze-rarely). If you can't switch from SQL, you can use a clustered database.
But before you do anything, be sure there's an issue - test the performance with a good, realistic workload. See if it'll handle your needs for the near future. Don't solve problems which aren't there.
Final note: for this particular database schema, you can eliminate the SalesOrderProduct table by keeping track of historical costs/prices for products. Then you can use the order date to backfigure the costs/prices of all ordered products, eliminating the need for that join table.
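A sketch of that alternative: the per-order product rows keep only the quantity, while cost and price come from a history table keyed by effective date (all table and column names here are hypothetical):

SELECT so.ID AS SalesOrderID,
       sol.ProductID,
       sol.Quantity,
       ph.Price,
       ph.Cost
FROM SalesOrder so
JOIN SalesOrderLine sol ON sol.SalesOrderID = so.ID
JOIN ProductPriceHistory ph
  ON ph.ProductID = sol.ProductID
 AND so.OrderDate >= ph.EffectiveFrom
 AND (so.OrderDate < ph.EffectiveTo OR ph.EffectiveTo IS NULL);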

Recommendations for best SQL Performance updating and/or calculating stockonhand totals

Apologies for the length of this question.
I have a section of our database design which I am worried may begin to cause problems. It is not at that stage yet, but obviously don't want to wait until it is to resolve the issue. But before I start testing various scenarios, I would appreciate input from anyone who has experience with such a problem.
The situation is stock control and maintaining the StockOnHand value.
It would be possible to maintain a table holding the stock control figures, which can be updated whenever an order is entered, either manually or by using a database trigger.
Alternatively you can get SQL to calculate the quantities by reading and summing the actual sales values.
The program is installed on several sites some of which are using MS-SQL 2005 and some 2008.
My problem is complicated because the same design needs to cope with several scenarios,
such as :
1) Cash/Sale Point Of Sale Environment. Sale is entered and stock is reduced in one transaction. No amendments can be made to this transaction.
2) Order/Routing/Confirmation
In this environment, the order is created and can be placed on hold, released, routed, amended, delivered, and invoiced. At any stage until it is invoiced, the order can be amended. (I mention this because any database triggers may be invoked many times and have to determine whether the changes should affect the stock on hand figures.)
3) Different businesses have different ideas of when their StockOnHand should be reduced. For example, some consider the stock as sold once they approve an order (as they have committed to sell the goods and hence the stock should not be sold to another person). Others do not consider the stock as sold until they have routed it, and some others only when it has been delivered or collected.
4) There can be a large variance in the number of transactions per product. For example, one system has four or five products which are sold several thousand times per month, so asking SQL to perform a sum on those transactions means reading tens of thousands of transactions per year. Whereas, on the same system, there are several thousand other products whose sales amount to fewer than a thousand transactions per year per product.
5) Historical information is important. For that reason, our system does not delete or archive transactions and has several years worth of transactions.
6) The system must have the ability to warn operators if they do not have the required stock when the order is entered (which quite often is in real time, e.g. a telephone order).
Note that this is only required for some products. (But I don't think it would be practical to sum the quantity across tens of thousands of transactions in real time.)
7) Average cost price. Some products can be priced based on the average cost of the items in stock. The way this is implemented is that the average cost price is re-calculated on every goods-in transaction, something like newAverageCostPrice = ((oldAverageCostPrice * oldStockOnHand) + newCostValue) / newStockOnHand. This means the stock on hand must be known at every goods-in if the product uses average cost.
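As a single statement, that recalculation might look roughly like this (ProductStock and its columns are hypothetical names; @qty_in and @cost_value describe the goods-in):

-- Column references on the right-hand side see the pre-update values, so this
-- applies newAvg = (oldAvg * oldSOH + newCostValue) / newSOH in one step.
UPDATE ProductStock
SET AverageCostPrice = ((AverageCostPrice * StockOnHand) + @cost_value)
                       / (StockOnHand + @qty_in),
    StockOnHand      = StockOnHand + @qty_in
WHERE ProductID = @product_id
  AND LocationID = @location_id;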
The way the system is currently implemented is twofold.
We have a table which holds the StockOnHand for each product and location. Whenever a sale is updated, this table is updated via the business layer of our application (C#).
This only provides the current stock on hand figures.
If you need to run a stock valuation for a particular date, the figure is calculated by summing the quantities on the lines involved. This also requires a join between the sales line and sale header tables, as the quantity and product are stored in the line table while the date and status are only held in the header table.
However, there are downsides to this method, such as:
Running the stock valuation report is slow (but not unacceptably slow), and I am not happy with it. (It works, and monitoring the server does not show it being overloaded, but it has the potential to cause problems and hence requires regular monitoring.)
The logic of the code updating the StockOnHand table is complicated.
This table is being updated frequently. In a lot of cases this is unnecessary, as the information does not need to be checked. For example, if 90% of your business is selling 4 or 5 products, you don't really need a computer to tell you you're out of stock.
Database triggers.
I have never implemented complicated triggers before, so I am wary of this.
For example, as stated before, we need configuration options to determine the conditions under which the stock figures should be updated. This is currently read once and cached in our program. To do this inside a trigger would presumably mean reading this information on every trigger invocation. Does this have a big impact on performance?
Also, we may need a trigger on both the sale header and the sale line. (This could mean that an amendment to the sale header would be forced to read the lines and update the StockOnHand for the relevant products, and then later, when the lines are saved, another database trigger would amend the StockOnHand table again, which may be inefficient.)
Another alternative would be to only update the StockOnHand table whenever the transaction is invoiced (which means no further amendments can be made), and to provide a function that calculates the stock on hand figure based on a union of this table and the un-invoiced transactions which affect stock.
Any advice would be greatly appreciated.
First off, I would strongly recommend you add StockOnHand, ReservedStock and SoldStock columns to your table.
A cash sale would immediately subtract the sale from StockOnHand and add it to SoldStock; for an order you would leave StockOnHand alone and merely add the sale to ReservedStock; when the stock is finally invoiced, you subtract the sale from StockOnHand and ReservedStock and add it to SoldStock.
The business users can then choose whether the figure they work with is StockOnHand itself or StockOnHand - ReservedStock.
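A sketch of those transitions (the ProductStock table and its columns are assumed; @qty and @product_id describe the sale being processed):

-- Cash sale: stock leaves immediately.
UPDATE ProductStock
SET StockOnHand = StockOnHand - @qty, SoldStock = SoldStock + @qty
WHERE ProductID = @product_id;

-- Order entered: reserve only.
UPDATE ProductStock
SET ReservedStock = ReservedStock + @qty
WHERE ProductID = @product_id;

-- Order invoiced: move from reserved to sold and reduce on-hand stock.
UPDATE ProductStock
SET StockOnHand = StockOnHand - @qty,
    ReservedStock = ReservedStock - @qty,
    SoldStock = SoldStock + @qty
WHERE ProductID = @product_id;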
Using a maintained StockOnHand figure will reduce your query times massively, versus the small risk that the figure can go out of kilter if you mess up your program logic.
If your customers are lucky enough to experience update contention when maintaining the StockOnHand figure (i.e. they are likely to process more than five sales a second at peak times), then you can consider the following scheme:
Overnight, calculate the StockOnHand figure by totalling deliveries minus sales (or whatever applies).
When a sale is confirmed, insert a row into a "Today's Sales" table.
When you need to query stock on hand, total up today's sales and subtract them from the start-of-day figure.
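The stock-on-hand query under that scheme might look like this (StartOfDayStock and TodaysSales are hypothetical names):

SELECT s.ProductID,
       s.StockOnHand - ISNULL(t.QtySoldToday, 0) AS CurrentStockOnHand
FROM StartOfDayStock AS s
LEFT JOIN (SELECT ProductID, SUM(Quantity) AS QtySoldToday
           FROM TodaysSales
           GROUP BY ProductID) AS t ON t.ProductID = s.ProductID;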
You could also place a "Stock Check Threshold" on each product: if you start the day with 10,000 widgets you can set the threshold to 100, so that if someone orders fewer than 100 you don't bother checking the stock, and if someone orders over 100 you check the stock and recalculate a new, lower threshold.
Could you create a view (or views) to represent your stock on hand? This would take the responsibility for doing the calculations out of synchronous triggers, which slow down your transactions. Using multiple views could satisfy the requirement that "different businesses have different ideas of when their StockOnHand should be reduced". Assuming you can meet the stringent requirements, creating an indexed view could further improve your performance.
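A sketch of such an indexed view in SQL Server syntax; the StockTransactions table (with signed, NOT NULL quantities) is an assumption, and the stringent requirements mentioned above (SCHEMABINDING, COUNT_BIG(*), no outer joins, etc.) apply:

CREATE VIEW dbo.vStockOnHand
WITH SCHEMABINDING
AS
SELECT ProductID,
       SUM(Quantity) AS StockOnHand,   -- receipts positive, sales negative
       COUNT_BIG(*)  AS TransactionCount
FROM dbo.StockTransactions
GROUP BY ProductID;
GO
-- The unique clustered index is what materialises the view's data.
CREATE UNIQUE CLUSTERED INDEX ix_vStockOnHand ON dbo.vStockOnHand (ProductID);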
Just some ideas off the top of my head:
Instead of a trigger (and persisted SOH data), you could use a computed column (e.g. SOH per product per store). However, the performance impact of evaluating this would likely be abysmal unless there are far more writes to your source tables than reads from your computed column. (The trade-off assumes the only reason you calculate the SOH is so that you can read it again. If you update the source data for the calculation much more often than you actually need to read it, then the computed column might make sense, since it is evaluated just in time, only when needed. This would be unusual though - reads are usually more frequent than writes in most systems.)
I'm guessing that the reason you are looking at triggers is that the source tables for the SOH figures are updated from a large number of procs / code paths, and you want to prevent something being overlooked (as opposed to calling a recalculation sproc from every applicable point where the source data is modified)?
IMHO, placing complicated logic in DB triggers is not advised, as this will adversely affect the performance of high-volume inserts / updates, and triggers aren't great for maintainability.
Does the SOH calculation need to be real time? If not, you could implement a mechanism to queue requests for recalculation (e.g. by using a trigger to flag that a product / location balance is dirty) and then run a recalculation service every few minutes for near real-time figures. Mission-critical calculations (e.g. financial, like your #6) could still detect that an SOH calculation is dirty and force a recalculation before doing a transaction.
Re: 3 - ouch. I would recommend that internally you agree on a consistent (and industry-acceptable) set of terminology (Stock On Hand, Stock Committed, Stock in Transit, Shrinkage, etc.) and then try to convince your customers to conform to a standard. But that is in the ideal world, of course!

C# with SQL database schema question

Is it acceptable to dynamically generate the total of the contents of a field using up to 10k records instead of storing the total in a table?
I have some reasons to prefer on-demand generation of a total, but how bad is the performance price on an average home PC? (There would be some joins, ORM-managed, involved in figuring the total.)
Let me know if I'm leaving out any info important to deciding the answer.
EDIT: This is a stand-alone program on a user's PC.
If you have appropriate indexing in place, it won't be too bad to do on demand calculations. The reason that I mention indexing is that you haven't specified whether the total is on all the values in a column, or on a subset - if it's a subset, then the fields that make up the filter may need to be indexed, so as to avoid table scans.
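As a rough sketch with generic placeholder names: if the total is over a filtered subset, an index on the filter column keeps the on-demand SUM from scanning the whole table:

CREATE INDEX ix_items_parent ON items (parent_id);

-- On-demand total for one parent row; over ~10k rows this is effectively instant.
SELECT SUM(quantity * unit_price) AS total
FROM items
WHERE parent_id = @parent_id;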
Usually it is totally acceptable and even recommended to recalculate values. If you start storing calculated values, you'll face some overhead ensuring that they are always up to date, usually using triggers.
That said, if your specific calculation query turns out to take a lot of time, you might need to go that route, but only do that if you actually hit a performance problem, not upfront.
Using a SQL query you can quickly and inexpensively get the total number of records, e.g. using the MAX function on a sequential ID column.
It is better to generate the total than to keep it as a stored value, in the same way you would keep a person's birth date and determine their age, rather than keeping their age.
It depends on how often, and by how many users, this total value must be retrieved, and how often the data the total depends on is updated.
Maybe the only thing you need is to run this big query once a day (or just once), save the result somewhere in the db, and then update it whenever the data your total depends on changes.
You "could" calculate the total with SQL (I am assuming you do not want total number of records ... the price total or whatever it is). SQL is quite good at mathematics when it gets told to do so :) No storing of total.
But, as it is all run on the client machine, I think my preference would be to total using C#. Then the business rules for calculating the total are out of the DB/SQL. By that I mean if you had a complex calculation for total that reuired adding say 5% to orders below £50 and the "business" changed it to add 10% to orders below £50 it is done in your "business logic" code rather than in your storage medium (in this case SQL).
Kindness,
Dan
I think that it should not take long, probably less than a second, to generate a sum from 8000-10000 records. Even on a single PC the query plan for this query should be dominated by a single table scan, which will generate mostly sequential I/O.
Proper indexing should make any joins reasonably efficient, and unless the schema is deeply flawed or you have (say) large blob fields in the table, the total data volume for the rows should not be very large at all. If you still have performance issues going through an O/R mapper, consider re-casting the functionality as a report where you can control the SQL.