First, I'd like to start by saying that I'm not just trying to have someone create my table schema for me. I've spent some time weighing the two possibilities in my design, and I wanted to get some advice before I go and run wild with my current idea.
Here is my current schema, I will put a ? next to columns I'm considering using.
Key:
table_name
----------
col1 | col2 | col3
tax_zone
---------
tax_zone_id | tax_rate | description
sales_order
-----------
sales_order_id | tax_zone_id (FK) or tax_rate (?)
sales_order_item
----------------
sales_order_item_id | sales_order_id (FK) | selling_price | amount_tax or tax_rate (?)
So, if it wasn't already clear, the dilemma is whether I should store the tax data in the individual rows of an order, or use a join to pull in the tax_zone information and then do something in my query like (tz.tax_rate * so.order_amount) AS order_total.
At present I'm leaning towards the method I just described, but there is a problem with it that I can't seem to figure out how to remedy. Tax rates for specific zones are subject to change. This means that if a tax rate changes for a zone and I'm using a foreign key reference, the change in the rate will be reflected in past orders that were placed under a different rate. This is an issue because I'm currently using this table to store both orders that have been processed and orders that are still open, so if someone were to re-print a past order, its total amount would have changed.
My problem with storing the specific rate or tax amount is that every time someone edits an order, I would have to update that row with the new values.
In the process of writing this, I'm starting to move towards the latter idea being the better of the two.
Perhaps someone can just answer the following questions so I can go and research the topic myself some more.
Is this a known problem in database modeling?
Are there any well known "authorities" on the subject that have published a book / article?
Any help is much appreciated, thanks!
Well, versioning and history is a well known problem in database modelling. Your solution is very common.
For a simple enumeration like VAT rates, a simple "foreign key tax_id referencing taxtable(id)" will do. The tax table should never be updated; once a tax_id is entered, it should stay there forever. If the tax rates change at the end of the year, a new record should be entered into the tax table, even if records with the new value already exist.
The best search phrase for search engines is probably "temporal database".
UPDATE:
http://www.faapartners.com/downloads/overige-publicaties/presentatie-over-tijd-in-databases/at_download/file
In the situation you describe, you will eventually have to store the tax rate in the orders table, because you will need the rate at which the order was closed.
Therefore the cleanest solution has to be to recalculate the stored tax rate each time an order is updated, unless it is closed. You could use a trigger to do this.
(Ben's answer popped up as I was writing this - seems we disagree, which is probably not helpful :-)
Two points. First, you're going to have to store the tax rate somewhere, or you won't be able to apply it to sales_order or anywhere else. Secondly, the tax rate can change over time, so you don't want existing orders to change each time it does.
So you have two options.
1. Store the tax rate in a reference table and update each order with the correct tax rate at the time of entry into the table.
2. Calculate everything every time you access it.
Personally I would go for option 1, BUT with a start time as part of the primary key in the reference table, because if you ever do need to change the tax rate you may need to know what the correct rate was at the time the order was placed.
In the following example
Order
-------
ID (int)
CreatedAt (smalldatetime)
....
OrderItems
-------
ID (int)
OrderID (int)
Quantity (int)
UnitPrice (decimal)
CalculationUnit (int)
TotalItemPrice (decimal)
....
I have a big dilemma about where and how I should keep track of TotalOrderPrice, and my main concerns are speed and data consistency.
a) TotalOrderPrice could be stored in the Orders table and updated on each OrderItem change for the relevant OrderID.
Could this lead to data inconsistency since data is "duplicated"?
b) I could have a view that holds the summed TotalItemPrice values, such as
OrderTotal
------------
ID (int)
TotalOrderprice (decimal)
Could this be a potential issue when scaling the application?
c) Or I could leave the original design as it is and calculate OrderTotalPrice inside the business logic.
Could this slow down performance, since all order items would have to be retrieved in order to get the total order price?
I know there are no silver bullets, but since I don't have a large amount of data for testing, I just want to check the facts and see what the proper reasoning would be to find a solution here.
I would recommend against maintaining a computed column that needs to be updated frequently, and instead computing the order total in a query on demand, when your application needs it. You can use a query like the following, which should run reasonably fast:
SELECT t1.ID, t2.OrderTotalPrice
FROM [Order] t1   -- ORDER is a reserved word, so the table name needs brackets/quotes
INNER JOIN
(
    SELECT OrderID, SUM(TotalItemPrice) AS OrderTotalPrice
    FROM OrderItems
    GROUP BY OrderID
) t2
ON t1.ID = t2.OrderID
This avoids the problem of having to maintain a computed column, which makes managing your database much easier. A strong argument against a computed column is that it doesn't really save the database any work. Rather, it always needs to be maintained, whereas computing a column on demand only needs to be done when you actually need it.
An order won't contain millions of positions, so speed shouldn't be a problem you must worry about.
Your table OrderItems contains Quantity, UnitPrice and TotalItemPrice. This already seems redundant: isn't TotalItemPrice always Quantity x UnitPrice? That holds provided UnitPrice is already the gross price to pay (and not the net price, where VAT must be added to get TotalItemPrice), and provided any item discount is already included. If there were another column, item_discount_percent for instance, we might get a result with too many digits, e.g. 105.987002. Does the order then contain 105.98 or 105.99? We would want to store that value in TotalItemPrice to make this unambiguous (and to make sure a new software version would still print the exact same order). So keep this column only if some calculation may lead to prices with more than two decimal places.
As to your question about a TotalOrderPrice, we can apply the same thinking: if the price is just the sum of all the order's TotalItemPrice values, then don't store it. If there is some calculation leading to too many decimal places (e.g. an order_discount_percent), you should probably store the (rounded/truncated) value.
I would consider what the access patterns are for the data, as that is what determines the relevant pros and cons.
How often will you need to:
Place a predicate on the total order amount (compute intensive unless you store the total).
Order by the total order amount (compute intensive unless you store the total).
Modify the total order amount (compute intensive and possibly a cause of error if you store the total).
If orders are never modified after creation and you frequently place predicates on the total or order by it, then I'd be confident about storing the total in the order table.
If orders are frequently modified but you very rarely need to place predicates on the total or order by it, then I would be confident in not storing the total.
The correct approach for you depends strongly on where the balance lies between those two extremes, and the risk you're willing to adopt in either poor performance or incorrect data.
Here is my second answer, which is very different from my first one...
Usually you avoid redundancy in databases, because it can lead to inconsistencies. What would you do for instance, if you saw some day that an order's stored TotalOrderPrice doesn't match what you calculate from the positions? To avoid such inconsistencies, we avoid redundancies.
In a data warehouse, however, you invite redundancy in order to get faster access to the data. That means you could have an order system containing the pure Order and OrderItems tables, and a data warehouse system that is updated at intervals and has an Order table with a TotalOrderPrice column.
Taking this further: does an order ever change in your system? If not, then why not store what you print, i.e. store the TotalOrderPrice redundantly? (You can use database mechanisms to prevent orders from being partially deleted or updated, to make this even safer.) If the stored TotalOrderPrice later really doesn't match what you calculate from the positions, that even indicates a problem with your software at the time the order was written. So having stored the TotalOrderPrice suddenly becomes an advantage, probably giving us the chance to detect such errors and make corrections in our accounting.
Having said this: usually an order gets written and not changed afterwards. As no changes are going to apply, you can safely store the TotalOrderPrice in the orders table and get both advantages: seeing later what order price you sent/printed, and retrieving the prices faster.
In general, my take is that you should avoid breaking the rules of normalisation until you need to. That means avoiding data redundancy in order to avoid update anomalies, and calculating things on the fly. I've seen a lot of terrible databases created because a developer worried that one day the database might not cope with the application load; in truth, in a well-designed, well-indexed, and well-maintained database this is rare. RDBMSes are a very good tool for dealing with large amounts of normalised data in transactional systems, if your database is designed and maintained correctly.
This doesn't mean you need to do the calculations in your application logic, though - and in fact I'd avoid that. Instead, make a view (looking like the query Tim Biegeleisen suggested in his answer) that does the calculations. If sometime down the road you find that this doesn't scale well, you can change the table and the view, plus whatever is populating this table - this minimises the disruption to your application if this change is needed. If the table is populated via a stored procedure then you might not need to make any changes to your front end application logic at all in order to switch from calculating on the fly to pre-calculated.
Which of the following scenarios will a) provide better performance and b) be more reliable/accurate? I've simplified the process and tables used. I would provide code/workings, but it's fairly simple stuff. I'm using MS-SQL 2008, but I would assume the question is platform independent.
1) An item is removed from stock (the stock item has a unique ID), and a trigger is fired which updates [tblSold]: if the ID doesn't exist, it creates a record with a value of 1; if it does exist, it adds 1 to the current value. The details of the sale are recorded elsewhere.
When stock availability is requested, it's calculated from this table based on the item ID.
2) When stock availability is requested it simply sums the quantity in [tblSales] based on the ID.
Stock availability will be heavily requested and for obvious reasons can't ever be wrong.
I'm going to play devil's advocate to the previous answer and suggest using a query - here are my reasons.
SQL is designed for reads; a well maintained database will have no problem with hundreds of millions of rows of data. If your data is well indexed and maintained, performance shouldn't be an issue.
Triggers can be hard to trace, they're a little less explicit and update information in the background - if you forget about them they can be a nightmare. A minor point but one which has annoyed me many times in the past!
The most important point, if you use a query (assuming it's right) your data can never get out of sync and can be regenerated easily. A running count would make this very difficult.
Ultimately this is a design decision which everyone will have a different view on. At the end of the day it will come down to your preferences and design.
I would go with the first approach. There is no reason to sum rows when you can just read one value from the database, and the trigger would do no harm, because you will not be selling items as often as you request the quantity.
I have recently been given the assignment of modelling a database fit to store stock prices for over 140 companies. The data will be collected every 15 min for 8.5 h each day from all these companies. The problem I'm facing right now is how to set up the database to achieve fast search/fetch given this data.
One solution would be to store everything in one table with the following columns:
| Company name | Price | Date | Etc... |
Or I could create a table for each company and just store the price and the date when the data was collected (and other parameters not known atm).
What are your thoughts about these kinds of solutions? I hope the problem was explained in sufficient detail; if not, please let me know.
Any other solution would be greatly appreciated!
I take it you're concerned about performance given the large number of records you're likely to generate - 140 companies * 4 data points / hour * 8.5 hours * 250 trading days / year means you're looking at around 1.2 million data points per year.
Modern relational database systems can easily handle that number of records in a single table - subject to some important considerations - and I don't see an issue with storing 100 years of data points.
So, yes, your initial design is probably the best:
Company name | Price | Date | Etc... |
Create indexes on Company name and date; that will allow you to answer questions like:
what was the highest share price for company x
what was the share price for company x on date y
on date y, what was the highest share price
To help prevent performance problems, I'd build a test database, and populate it with sample data (tools like dbMonster make this easy), and then build the queries you (think you) will run against the real system; use the tuning tools for your database system to optimize those queries and/or indices.
On top of what has already been said, I'd like to add the following: don't use "Company name" or something like "Ticker Symbol" as your primary key. As you're likely to find out, stock prices have two important characteristics that are often ignored:
some companies can be quoted on multiple stock exchanges, and therefore have different quote prices on each stock exchange.
some companies are quoted multiple times on the same stock exchange, but in different currencies.
As a result, a properly generic solution should use the (ISIN, currency, stock exchange) triplet as identifier for a quote.
The first, more important question is what the types and usage patterns of the queries executed against this table will be. Is this an Online Transactional Processing (OLTP) application, where the great majority of queries are against a single record, or at most a small set of records? Or is it an Online Analytical Processing (OLAP) application, where most queries need to read and process significantly large sets of data to generate aggregations and do analysis? These two very different types of systems should be modeled in different ways.
If it is the first type of app, (OLTP), your first option is a better one, but the usage patterns and types of queries would still be important to determine the types of indices to place on the table.
If it is an OLAP application (and a system storing billions of stock prices sounds more like an OLAP app), then the data structure might be better organized to store pre-aggregated data values, or even go all the way and use a multi-dimensional database like an OLAP cube, based on a star schema.
Put them into a single table. Modern DB engines can easily handle those volumes you specified.
rowid | StockCode | priceTimeInUTC | PriceCode | AskPrice | BidPrice | Volume
rowid: identity / unique identifier.
StockCode instead of company name: companies have multiple types of stocks.
PriceTimeInUTC standardizes every datetime into a single timezone. Also use datetime2 (more accurate).
PriceCode identifies what kind of price it is: Options/Futures/CommonStock, PreferredStock, etc.
AskPrice is the buying price.
BidPrice is the selling price.
Volume (for buy/sell) might be useful for you.
Separately, have a StockCode table and a PriceCode table.
That is a brute-force approach. The second you add searchable factors, it can change everything. A more flexible and elegant option is a star schema, which can scale to any amount of data. I am a private party working on this myself.
I don't know a good way to maintain sums depending on dates in a SQL database.
Take a database with two tables:
Client
clientID
name
overdueAmount
Invoice
clientID
invoiceID
amount
dueDate
paymentDate
I need to produce a list of the clients, ordered by overdue amount (the sum of the client's unpaid past-due invoices). On a big database it isn't possible to calculate this in real time.
The problem is maintaining an overdueAmount field on the client. The value of this field can change at midnight from one day to the next, even if nothing changed on the client's invoices.
This sum changes if an invoice is paid, if a new invoice is created whose due date is already past, if a due date is past today but wasn't yesterday...
The only solution I found is to recalculate every night this field on every client by summing the invoices respecting the conditions. But it's not efficient on very big databases.
I think it's a common problem and I would like to know if a best practice exists?
You should read about data warehousing; it will help you solve this problem. It sounds similar to what you just said:
"The only solution I found is to recalculate every night this field
on every client by summing the invoices respecting the conditions. But
it's not efficient on very big databases."
But there is more to it than that. When you read about it, try to forget about normalization. Its main intention is to 'show' data, not to 'manage' data. So it will feel weird at the beginning, but once you understand why we need data warehousing, it becomes very, very interesting.
This classic book can be a good start: http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Firstly, I'd like to understand what you mean by "very big databases" - most RDBMS systems running on decent hardware should be able to calculate this in real time for anything less than hundreds of millions of invoices. I speak from experience here.
Secondly, "best practice" is one of those expressions that mean very little - it's often used to present someone's opinion as being more meaningful than simply an opinion.
In my opinion, by far the best option is to calculate it on the fly.
If your database is so big that you really can't do this, I'd consider a nightly batch (as you describe). Nightly batch runs are a pain - especially for systems that need to be available 24/7, but they have the benefit of keeping all the logic in a single place.
If you want to avoid nightly batches, you can use triggers to populate an "unpaid_invoices" table. When you create a new invoice record, a trigger copies that invoice to the "unpaid_invoices" table; when you update the invoice with a payment, and the payment amount equals the outstanding amount, you delete from the unpaid_invoices table. By definition, the unpaid_invoices table should be far smaller than the total number of invoices; calculating the outstanding amount for a given customer on the fly should be okay.
However, triggers are nasty, evil things, with exotic failure modes that can stump the unsuspecting developer, so only consider this if you have a ninja SQL developer on hand. Absolutely make sure you have a SQL query which checks the validity of your unpaid_invoices table, and ideally schedule it as a regular task.
Apologies for the length of this question.
I have a section of our database design which I am worried may begin to cause problems. It is not at that stage yet, but I obviously don't want to wait until it is before resolving the issue. Before I start testing various scenarios, I would appreciate input from anyone who has experience with such a problem.
Situation is Stock Control and maintaining the StockOnHand value.
It would be possible to maintain a table holding the stock control figures, updated whenever an order is entered, either manually or by a database trigger.
Alternatively you can get SQL to calculate the quantities by reading and summing the actual sales values.
The program is installed on several sites some of which are using MS-SQL 2005 and some 2008.
My problem is complicated because the same design needs to cope with several scenarios, such as:
1) Cash/Sale Point Of Sale Environment. Sale is entered and stock is reduced in one transaction. No amendments can be made to this transaction.
2) Order/Routing/Confirmation
In this environment, the order is created and can be placed on hold, released, routed, amended, delivered, and invoiced; at any stage until it is invoiced the order can be amended. (I mention this because any database triggers may be invoked lots of times and have to determine whether changes should affect the stock on hand figures.)
3) Different businesses have different ideas of when their StockOnHand should be reduced. For example, some consider the stock sold once they approve an order (as they have committed to sell the goods, and hence should not sell them to another person). Others do not consider the stock sold until they have routed it, and some others only when it has been delivered or collected.
4) There can be a large variance in the number of transactions per product. For example, one system has four or five products which are sold several thousand times per month, so asking SQL to sum those transactions means reading tens of thousands of transactions per year. Whereas, on the same system, there are several thousand other products with fewer than a thousand transactions per year per product.
5) Historical information is important. For that reason, our system does not delete or archive transactions and has several years worth of transactions.
6) The system must have the ability to warn operators if they do not have the required stock when the order is entered ( which quite often is in real time, eg telephone order).
Note that this is only required for some products. (But I don't think it would be practical to sum the quantity across tens of thousands of transactions in real time.)
7) Average cost price. Some products can be priced based on the average cost of the items in stock. The way this is implemented is that the average cost price is re-calculated for every goods-in transaction, something like newAverageCostPrice = ((oldAverageCostPrice * oldStockOnHand) + newCostValue) / newStockOnHand. This means the stock on hand must be known at every goods-in if the product uses average cost.
The way the system is currently implemented is twofold.
We have a table which holds the StockOnHand for each product and location. Whenever a sale is updated, this table is updated via the business layer of our application (C#)
This only provides the current stock on hand figures.
If you need to run a stock valuation for a particular date, the figure is calculated by summing the quantities on the lines involved. This also requires a join between the sales line and sales header tables, as the quantity and product are stored in the line table while the date and status are held only in the header table.
However, there are downsides to this method, such as:
Running the stock valuation report is slow (not unacceptably slow, but I am not happy with it). It works, and monitoring the server does not show it being overloaded, but it has the potential to cause problems and hence requires regular monitoring.
The logic of the code updating the StockOnHand table is complicated.
This table is updated frequently. In a lot of cases this is unnecessary, as the information does not need to be checked. For example, if 90% of your business is selling 4 or 5 products, you don't really need a computer to tell you that you are out of stock.
Database triggers.
I have never implemented complicated triggers before, so I am wary of this.
For example, as stated before, we need configuration options to determine the conditions under which the stock figures should be updated. This is currently read once and cached in our program. Doing this inside a trigger would presumably mean reading this information on every trigger invocation. Does this have a big impact on performance?
Also, we may need a trigger on both the sale header and the sale line. (This could mean that an amendment to the sale header would force a read of the lines and an update of the StockOnHand for the relevant products, and then later, when the lines are saved, another database trigger would amend the StockOnHand table again, which may be inefficient.)
Another alternative would be to update the StockOnHand table only when the transaction is invoiced (at which point no further amendments can be made), and to provide a function that calculates the StockOnHand figure from a union of this table and the un-invoiced transactions which affect stock.
Any advice would be greatly appreciated.
First off, I would strongly recommend you add "StockOnHand", "ReservedStock" and "SoldStock" to your table.
A cash sale would immediately subtract the sale from StockOnHand and add it to SoldStock. For an order, you would leave StockOnHand alone and merely add the sale to ReservedStock; when the stock is finally invoiced, you subtract the sale from StockOnHand and ReservedStock and add it to SoldStock.
The business users can then choose whether StockOnHand is just that or StockOnHand - ReservedStock.
Using a maintained StockOnHand figure will reduce your query times massively, versus the small risk that the figure can go out of kilter if you mess up your program logic.
If your customers are lucky enough to experience update contention when maintaining the StockOnHand figure (i.e. they are likely to process more than five sales a second at peak times), then you can consider the following scheme:
Overnight, calculate the StockOnHand figure by summing deliveries minus sales (or whatever applies).
When a sale is confirmed, insert a row into a "Todays Sales" table.
When you need to query stock on hand, total up today's sales and subtract that from the start-of-day figure, as in the sketch below.
You could also place a "Stock Check Threshold" on each product. If you start the day with 10,000 widgets, you can set the threshold to 100: if someone orders fewer than 100, don't bother checking the stock; if someone orders over 100, check the stock and recalculate a new, lower threshold.
Could you create a view (or views) to represent your stock on hand? This would take the responsibility for doing the calculations out of synchronous triggers, which slow down your transactions. Using multiple views could satisfy the requirement that "different businesses have different ideas of when their StockOnHand should be reduced". Assuming you can meet the stringent requirements, creating an indexed view could further improve your performance.
Just some ideas off the top of my head:
Instead of a trigger (and persisted SOH data), you could use a computed column (e.g. SOH per product per store). However, the performance impact of evaluating this would likely be abysmal unless there are many more writes to your source tables than reads of your computed column. (The trade-off assumes the only reason you calculate the SOH is so that you can read it again. If you update the source data for the calculation much more often than you actually need to read the result, then the computed column might make sense, since it is evaluated just-in-time, only when needed. This would be unusual, though - reads are usually more frequent than writes in most systems.)
I'm guessing that the reason you are looking at triggers is that the source tables for the SOH figures are updated from a large number of procs / pieces of code, and you want to prevent oversights (as opposed to calling a recalc sproc from every applicable point where the source data is modified)?
IMHO, placing complicated logic in DB triggers is not advised, as this will adversely affect the performance of high-volume inserts/updates, and triggers aren't great for maintainability.
Does the SOH calculation need to be real-time? If not, you could implement a mechanism to queue requests for recalculation (e.g. using a trigger to mark a product/location balance as dirty) and then run a recalculation service every few minutes for near real-time figures. Mission-critical calculations (e.g. financial ones, like your #6) could still detect that an SOH figure is dirty and force a recalc before doing a transaction.
Re: 3 - ouch. I would recommend that internally you agree on a consistent (and industry-accepted) set of terminology (Stock In Hand, Stock Committed, Stock In Transit, Shrinkage, etc.) and then try to convince your customers to conform to a standard. But that is in the ideal world, of course!