Let's say I have a database of millions of widgets with a price attribute. Widgets belong to suppliers, and I sell widgets to customers by first buying them from suppliers and then selling them to the customer. With this basic setup, if a customer asks me for every widget less than $50, it's trivial to list them.
However, I mark up the price of widgets from individual suppliers differently. So I may mark up widgets from Supplier A by 10%, and I may mark up widgets from Supplier B by a flat rate of $5. In a database, these markups would be stored in a join table with my ID, the supplier ID, a markup type (flat, percentage), and a markup rate. On top of this, suppliers may add their own markups when they sell to me (these markups would be in the same join table with the supplier's ID, my ID, and a markup type/rate).
So if I want to sell a $45 widget from Supplier A, it might get marked up by the supplier's 10% markup (to $49.50), and then my own $10 flat markup (to $59.50). This widget would not show up in the client's search for widgets costing less than $50. However, it's possible that an $80 widget could get marked down to $45 by the time it reaches the client, and should be returned in results. These markups are subject to change, and let's assume I'm one of hundreds of people in this system selling widgets to customers through suppliers, all with their own markup relationships in that markup table.
Is there any precedent for performing calculations like this quickly across millions of objects? I realize this is a huge, non-trivial problem, but I'm curious how one would start addressing a problem like this.
Add columns to your database and store the computed results, updating them whenever the related records change. You cannot calculate these values on the fly for millions of records.
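A minimal sketch of what that could look like (all table and column names here are invented for illustration):

    -- One precomputed row per (seller, widget), refreshed whenever
    -- a related row in the markup table changes.
    CREATE TABLE seller_widget_prices (
        seller_id   INT            NOT NULL,
        widget_id   INT            NOT NULL,
        final_price DECIMAL(10, 2) NOT NULL,  -- base price with all markups applied
        PRIMARY KEY (seller_id, widget_id)
    );

    CREATE INDEX ix_seller_price ON seller_widget_prices (seller_id, final_price);

    -- The customer-facing search then becomes a plain indexed range scan:
    SELECT widget_id
    FROM   seller_widget_prices
    WHERE  seller_id = 42 AND final_price < 50.00;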
Is there any precedent for performing calculations like this quickly across millions of objects?
Standard. Seriously. Data warehouses, risk projections, stuff like that - your problem is small. Precalculate all combinations, store them in a proper higher-level database server, finished.
It is not huge - seriously. It is only huge for a small server, but once you get a calculation grid going... it is quite trivial. Millions of objects? Calculate 100,000 objects per minute per machine, so 10 million objects are 100 machine-minutes. And you don't have THAT many changes.
I wanted to ask if it is a good idea to store total values in an SQL database. By total values I mean things like the total cost of an order, or the total quantity.
As an example, I'll have a table called orders which will hold information like when the order was created, who made it, and so on. Orders will relate to the table order_items, which will hold the id of each item in the order, the quantity of the item in the order, and the price at which it was sold (not the item's original price).
In this situation, should I calculate the total cost of the order once and store it in the orders table, or should I calculate it every time I retrieve the order with the corresponding items?
There are pros and cons to each choice. If you choose to store the computed totals, you'll have to update them every time you update the details they depend on. To do that reliably you'll need to enforce it close to the database, in a trigger for example, rather than relying on business logic to get it right all the time.
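A rough sketch of such a trigger in SQL Server syntax (orders.total_cost is an assumed column; the other names are adapted from the question):

    -- Keep orders.total_cost in sync whenever order_items rows change.
    CREATE TRIGGER trg_order_items_total
    ON order_items
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE o
        SET    total_cost = (SELECT COALESCE(SUM(i.quantity * i.price), 0)
                             FROM   order_items i
                             WHERE  i.order_id = o.id)
        FROM   orders o
        WHERE  o.id IN (SELECT order_id FROM inserted
                        UNION
                        SELECT order_id FROM deleted);
    END;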
On the other hand, if you don't store the computed value, you'll have to compute it every time you want to access it. A view will make that trivial.
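For example, a view along these lines (using the order_items columns assumed above):

    CREATE VIEW order_totals AS
    SELECT   order_id,
             SUM(quantity * price) AS total_cost
    FROM     order_items
    GROUP BY order_id;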
The decision to store vs. recompute comes down to how frequently the data changes vs. how often it is accessed. If it changes infrequently, storing the value may be fine, but if it changes frequently, it's better to recompute it each time you access it. Similarly, if the data is accessed infrequently, recomputing the value is fine, while if you access it frequently, precomputing may be the better option.
Depending on the database you are using, you may be able to get the best of both worlds, by creating a materialized view.
For the particular case of an order, the total amount is traditionally stored with the order. The order items then contain the prices of each item in the order.
Why? Simple: certain amounts are available only at the order level. For instance:
Shipping charges
Taxes (the taxes at the item level may not add up exactly to the order-level tax)
Order-level discounts (such as discounts based on the overall order size or bundle discounts)
In addition, orders are usually put into the database once, with no or very rare changes to the items (the calculations may not be made until the order is finalized).
Other types of entities might have other considerations. This addresses the specific issue of whether totals should be stored for orders.
I have a requirement where the user wants an "All" option in a few fields.
1. Sites: around 20 records (includes an "All" option).
2. Cost Centers: dependent on Sites; around 540 records in total across all Sites. Sites may have different numbers of Cost Centers (includes an "All" option).
3. Employees: dependent on Cost Centers; around 29,000 records in total. Each Cost Center may include a different number of Employees (includes an "All" option).
4. Processes: independent of all of the above; around 20 records (includes an "All" option).
Now Sites, Cost Centers, Employees, and Processes each have a dropdown with "All" alongside the other options.
How would I design the database table, considering the scenarios below?
The user selects the following:
Sites : Riyadh
Cost Centers : MA - Medical
Employees : All
Processes : Travel Request and Authorization
The user has gone for All in Cost Centers:
Sites : Jeddah
Cost Centers : All
Employees : All
Processes : All
Likewise there are a few other combinations. Also, how should the user see the inserted records so that he/she can easily navigate to a record and update/delete it? Right now I was thinking of inserting a single record for the "All" option. For example:
The user selects:
Sites : Riyadh
Cost Centers : Nursing
Employees : All
Processes : All
This would insert just one row in the database table.
However, the user also has a requirement that if there are 200 Employees under the selected Cost Center, he may want to apply the record to only 70 of them; with a single "All" row that means more work for him.
How would the user edit the inserted records afterwards? And how should the view of all records be rendered so that editing a particular record is easy for the user?
Don't model the "All" in your data, or you will have to deal with people mis-assigning an employee to a cost center named "All" under a site named "All". You don't want that!
Sites have cost centers, cost centers have employees, there are processes and (I assume) employees may be assigned to them, thus implying a table that links employees to processes. Only store REAL data.
Then be smart in your queries, so that if the user selects ALL for a given drop-down they get ALL matching records, and inserted data must meet proper referential integrity: a cost center must belong to a valid site, and an employee must belong to a cost center and may have one or more processes they are linked to.
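One common way to handle the "All" selection in queries, sketched here with hypothetical table and parameter names, is to pass NULL for "All" and make each filter optional:

    -- Inside a stored procedure: @site_id, @cost_center_id, @employee_id
    -- are NULL when the user picks "All" for that drop-down.
    SELECT e.*
    FROM   employees e
           JOIN cost_centers c ON c.id = e.cost_center_id
    WHERE  (@site_id        IS NULL OR c.site_id        = @site_id)
      AND  (@cost_center_id IS NULL OR e.cost_center_id = @cost_center_id)
      AND  (@employee_id    IS NULL OR e.id             = @employee_id);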
but putting in "All" placeholders? You're opening yourself up for a world of hurt managing pseudo-relationships versus real relationships if you go down that route.
Actually you have two relationships between Sites and Cost Centers (I'm narrowing it down to only those two entities). Both are optional, and one of them must be defined.
The first relation is the (unproblematic) zero-to-one relationship from Site to Cost Center (covering the case where the cost center is known and assigned for the Site).
The second relationship covers the case where no cost center is assigned and the cost must be "somehow allocated". The "ALL" may mean that each cost center (say) receives an equal share.
This split into two relationships makes the database design cleaner, but it does not address the main problem, which is querying the relation.
The problem manifests as an OR condition in the join predicates (chasing both paths), which can lead to sub-optimal performance.
So this is the touchstone of your design: collect the main queries and check how they perform on sample data.
One possible approach to attacking the performance problems would be to define materialized views that expand the ALL relationship to every Cost Center (as proposed by @Michael), refreshed whenever a new Cost Center is defined, so you do not need to handle such changes manually.
Sometimes creating a separate table would produce much more work; should I split it anyway?
For example: in my project I have a table of customers. Each customer has his own special price for each product (there are only 5 products, and more products are not planned for the future), and each customer also has unique days of the week when the company delivers the products to him.
Many operations, like changing the days/prices for a customer or displaying the days & prices of all customers, would be much easier if the days & product prices were columns in the customers table rather than separate tables. So is it wrong to create only one big customers table in such a case? What are the drawbacks?
UPDATE: They just informed me that after a year or so there's a chance they'll add more products, though they say their business won't exceed 20-30 products in any event.
I still can't understand why, in a case where products' prices have no relation to each other (each customer has his own special price), adding rows to a Products table is better than adding columns to the Customers table.
The only benefit I could think of is that a customer who has only 5 products won't have to 'carry' 20 nullable product columns (which saves space on the server). I don't have much experience, so maybe I'm missing the obvious?
Clearly, just saying that one should always normalize is not pragmatic. No advice is always true.
If you can say with certainty that 5 "items" will be enough for a long time I think it is perfectly fine to just store them as columns if it saves you work.
If your prediction fails and a 6th item needs to be stored, you can add a new column. As long as the number of columns doesn't get out of hand (and with very high probability it won't), this should not be a problem.
Just be careful with such tactics, as the ability of many programmers to predict the future turns out to be very limited.
In the end only one thing counts: Delivering the requested solution at the lowest cost. Purity of code is not a goal.
Normalization is all about data integrity (consistency), nothing else; it is not about hard, easy, fast, slow, efficient, or other murky attributes. The current design almost certainly allows data anomalies. If not right now, then the moment you try to track price changes, invoices, orders, etc., it is a dead end.
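For concreteness, a sketch of what the normalized alternative could look like (all names invented for illustration):

    -- One row per (customer, product) instead of one column per product.
    CREATE TABLE customer_product_prices (
        customer_id INT            NOT NULL REFERENCES customers(id),
        product_id  INT            NOT NULL REFERENCES products(id),
        price       DECIMAL(10, 2) NOT NULL,  -- this customer's special price
        PRIMARY KEY (customer_id, product_id)
    );

Adding a 6th (or a 26th) product is then a new row, not a schema change, and a customer with only 5 products simply has 5 rows.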
This is my first time designing tables in an SQL database, and I have no idea how much server CPU this would use or whether this is a viable way of coding.
I have to create a bidding site where the gist is every time someone bids (where bids have to be bought separately at 50 cents per bid) the final price goes up by 1 cent, 2 cents, or 5 cents.
The trouble I'm facing is that I have to make a database table to keep track of each item's bid history, and it seems like I have to create an individual table for each item (3 things need to be tracked apart from the item id: the bidder, the bid time, and the cents at which it was bid on).
I'm fairly inexperienced in this and am willing to go back to the drawing board to brainstorm another table design, but I was wondering if creating thousands of tables on a daily basis (assuming the site will be somewhat successful), one for each new item being listed, is something that's alright. I'm probably overestimating site traffic, and it might be more in the range of just a few hundred tables per day, but I want to prepare for the worst.
I would go back to the drawing board. Creating new tables for what is essentially the same thing is poor design. Have you heard of the DRY (Don't Repeat Yourself) principle?
Why do you think you need one table per item?
You could design a table structure to hold your items and their bid history with 2-3 tables for all items together... depending on the metadata it could be useful to have another 1-2 tables... but always NOT per item, rather per "information type" (like "item history", "item metadata").
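A minimal sketch of that shared structure (names assumed; items is a hypothetical table of listed items):

    -- One shared bid_history table for every item, instead of a table per item.
    CREATE TABLE bid_history (
        bid_id    INT      NOT NULL PRIMARY KEY,
        item_id   INT      NOT NULL REFERENCES items(id),
        bidder_id INT      NOT NULL,
        bid_time  DATETIME NOT NULL,
        cents     INT      NOT NULL  -- price in cents at which the item was bid on
    );

    -- The history for one item is then a simple indexed lookup:
    CREATE INDEX ix_bid_history_item ON bid_history (item_id, bid_time);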
Apologies for the length of this question.
I have a section of our database design which I am worried may begin to cause problems. It is not at that stage yet, but I obviously don't want to wait until it is before resolving the issue. So before I start testing various scenarios, I would appreciate input from anyone who has experience with such a problem.
The situation is stock control and maintaining the StockOnHand value.
It would be possible to maintain a table holding the stock control figures, which can be updated whenever an order is entered, either manually or by using a database trigger.
Alternatively you can get SQL to calculate the quantities by reading and summing the actual sales values.
The program is installed on several sites, some of which are using MS-SQL 2005 and some 2008.
My problem is complicated because the same design needs to cope with several scenarios, such as:
1) Cash/Sale Point of Sale Environment. A sale is entered and stock is reduced in one transaction. No amendments can be made to this transaction.
2) Order/Routing/Confirmation
In this environment, the order is created and can be placed on hold, released, routed, amended, delivered, and invoiced. At any stage until it is invoiced, the order can be amended. (I mention this because any database triggers may be invoked lots of times and have to determine whether the changes should affect the stock on hand figures.)
3) Different businesses have different ideas of when their StockOnHand should be reduced. For example, some consider the stock as sold once they approve an order (as they have committed to sell the goods, and hence the stock should not be sold to another person). Others do not consider the stock sold until they have routed it, and some others only when it has been delivered or collected.
4) There can be a large variance in the number of transactions per product. For example, one system has four or five products which are sold several thousand times per month, so asking SQL to perform a sum over those transactions means reading tens of thousands of transactions per year. Whereas, on the same system, there are several thousand other products whose sales amount to less than a thousand transactions per year per product.
5) Historical information is important. For that reason, our system does not delete or archive transactions and has several years' worth of transactions.
6) The system must have the ability to warn operators if they do not have the required stock when the order is entered (which quite often happens in real time, e.g. a telephone order).
Note that this is only required for some products. (But I don't think it would be practical to sum the quantity across tens of thousands of transactions in real time.)
7) Average Cost Price. Some products can be priced based on the average cost of the items in stock. The way this is implemented is that the average cost price is re-calculated for every goods-in transaction, something like newAverageCostPrice = ((oldAverageCostPrice * oldStockOnHand) + newCostValue) / newStockOnHand. This means the stock on hand must be known at every goods-in if the product uses average cost.
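To make the formula concrete (numbers invented for illustration): with 100 units on hand at an average cost of $2.00, a goods-in of 50 units costing $150.00 in total gives newAverageCostPrice = ((2.00 * 100) + 150.00) / 150 = $2.33.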
The way the system is currently implemented is twofold.
We have a table which holds the StockOnHand for each product and location. Whenever a sale is updated, this table is updated via the business layer of our application (C#).
This only provides the current stock on hand figures.
If you need to run a Stock Valuation for a particular date, the figure is calculated by summing the quantities on the lines involved. This also requires a join between the sales line and sales header tables, as the quantity and product are stored in the line table while the date and status are held only in the header table.
However, there are downsides to this method, such as:
Running the stock valuation report is slow (though not unacceptably slow), and I am not happy with it. (It works, and monitoring the server does not show it being overloaded, but it has the potential to cause problems and hence requires regular monitoring.)
The logic of the code updating the StockOnHand table is complicated.
This table is being updated frequently. In a lot of cases this is unnecessary, as the information does not need to be checked. For example, if 90% of your business is selling 4 or 5 products, you don't really need a computer to tell you that you are out of stock.
Database triggers.
I have never implemented complicated triggers before, so am wary of this.
For example, as stated before, we need configuration options to determine the conditions under which the stock figures should be updated. This is currently read once and cached in our program. To do this inside a trigger would presumably mean reading this information on every trigger invocation. Does this have a big impact on performance?
Also, we may need a trigger on both the sale header and the sale lines. (This could mean that an amendment to the sale header would be forced to read the lines and update the StockOnHand for the relevant products, and then later, when the lines are saved, another database trigger would amend the StockOnHand table again, which may be inefficient.)
Another alternative would be to update the StockOnHand table only when the transaction is invoiced (which means no further amendments can be made), and to provide a function that calculates the stock on hand figure based on a union of this table and the un-invoiced transactions which affect stock.
Any advice would be greatly appreciated.
First off, I would strongly recommend you add "StockOnHand", "ReservedStock" and "SoldStock" to your table.
A cash sale would immediately subtract the sale from "StockOnHand" and add it to "SoldStock". For an order you would leave "StockOnHand" alone and merely add the sale to "ReservedStock"; when the stock is finally invoiced, you subtract the sale from both "StockOnHand" and "ReservedStock" and add it to "SoldStock".
The business users can then choose whether StockOnHand is just that or StockOnHand - ReservedStock.
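A rough sketch of the three movements (product_stock, its columns, and the @-parameters are all assumed names):

    -- Cash sale: goods leave immediately.
    UPDATE product_stock
    SET    StockOnHand = StockOnHand - @qty,
           SoldStock   = SoldStock   + @qty
    WHERE  product_id = @product_id;

    -- Order placed: nothing has physically moved yet.
    UPDATE product_stock
    SET    ReservedStock = ReservedStock + @qty
    WHERE  product_id = @product_id;

    -- Order invoiced: the reservation becomes a sale.
    UPDATE product_stock
    SET    StockOnHand   = StockOnHand   - @qty,
           ReservedStock = ReservedStock - @qty,
           SoldStock     = SoldStock     + @qty
    WHERE  product_id = @product_id;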
Using a maintained StockOnHand figure will reduce your query times massively, versus the small risk that the figure can go out of kilter if you mess up your program logic.
If your customers are lucky enough to experience update contention when maintaining the StockOnHand figure (i.e. are they likely to process more than five sales a second at peak times?) then you can consider the following scheme:
Overnight, calculate the StockOnHand figure by summing deliveries minus sales, or whatever.
When a sale is confirmed, insert a row into a "Todays Sales" table.
When you need to query stock on hand, total up today's sales and subtract the total from the start-of-day figure, as sketched below.
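A sketch of that query (table and column names assumed):

    -- Available stock right now = start-of-day figure minus today's sales.
    SELECT s.product_id,
           s.start_of_day_qty - COALESCE(SUM(t.qty), 0) AS stock_on_hand
    FROM   start_of_day_stock s
           LEFT JOIN todays_sales t ON t.product_id = s.product_id
    GROUP BY s.product_id, s.start_of_day_qty;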
You could also place a "Stock Check Threshold" on each product, so if you start the day with 10,000 widgets you can set the CheckThreshold to 100: if someone is ordering fewer than 100, don't bother checking the stock; if someone orders over 100, check the stock and recalculate a new, lower threshold.
Could you create a view (or views) to represent your stock on hand? This would take the responsibility for doing the calculations out of synchronous triggers, which slow down your transactions. Using multiple views could satisfy the requirement that "different businesses have different ideas of when their StockOnHand should be reduced." Assuming you can meet the stringent requirements, creating an indexed view could further improve performance.
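On SQL Server, an indexed view might look roughly like this (names assumed; note the SCHEMABINDING and COUNT_BIG requirements, and the summed quantity column must be declared NOT NULL):

    CREATE VIEW dbo.v_stock_on_hand
    WITH SCHEMABINDING
    AS
    SELECT   product_id,
             location_id,
             SUM(quantity) AS stock_on_hand,  -- quantity assumed NOT NULL
             COUNT_BIG(*)  AS row_count       -- required for indexed views with GROUP BY
    FROM     dbo.stock_movements
    GROUP BY product_id, location_id;

    -- The unique clustered index is what makes the view "indexed" (materialized).
    CREATE UNIQUE CLUSTERED INDEX ix_v_soh
        ON dbo.v_stock_on_hand (product_id, location_id);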
Just some ideas off the top of my head:
Instead of a trigger (and persistent SOH data), you could use a computed column (e.g. SOH per product per store). However, the performance impact of evaluating this would likely be abysmal unless there are far more writes to your source tables than reads from the computed column. (The trade-off assumes the only reason you calculate the SOH is so that you can read it again. If you update the source data for the calculation much more often than you actually need to read it, then the computed column might make sense, since it is evaluated just-in-time, only when needed. This would be unusual though; reads are usually more frequent than writes in most systems.)
I'm guessing that the reason you are looking at triggers is that the source tables for the SOH figures are updated from a large number of procs / pieces of code, and you want to prevent oversights (as opposed to calling a recalc SPROC from every applicable point where the source data is modified)?
IMHO placing complicated logic in DB triggers is not advised, as this will adversely affect the performance of high-volume inserts/updates, and triggers aren't great for maintainability.
Does the SOH calculation need to be real-time? If not, you could implement a mechanism to queue requests for recalculation (e.g. by using a trigger to flag that a product/location balance is dirty) and then run a recalculation service every few minutes for near-real-time figures. Mission-critical calculations (e.g. financial, like your #6) could still detect that an SOH calc is dirty and force a recalc before doing a transaction.
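A sketch of that dirty-flag mechanism (all names assumed): the trigger stays trivially cheap because it only flags the balance, and a background job does the real work.

    CREATE TRIGGER trg_mark_soh_dirty
    ON sale_lines
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE ps
        SET    is_dirty = 1
        FROM   product_stock ps
        WHERE  ps.product_id IN (SELECT product_id FROM inserted
                                 UNION
                                 SELECT product_id FROM deleted);
    END;

    -- A recalculation service run every few minutes then picks up rows
    -- WHERE is_dirty = 1, recomputes them from the transaction history,
    -- and clears the flag.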
Re 3: Ouch. I would recommend that internally you agree on a consistent (and industry-acceptable) set of terminology (Stock In Hand, Stock Committed, Stock In Transit, Shrinkage, etc.) and then try to convince your customers to conform to a standard. But that is the ideal world, of course!