What is the best practice database design for transactions aggregation?

What is the best practice database design for transactions aggregation? - sql

I am designing a database which will hold transaction level data. It will work the same way as a bank account - debits/credits to an Account Number.
What is the best / most efficient way of obtaining the aggregation of these transactions.
I was thinking about using a summary table and then adding these to a list of today's transactions in order to derive how much each account has (i.e their balance).
I want this to be scalable (ie. 1 billion transactions) so don't want to have to perform database hits to the main fact table as it will need to find all the debits/credits associated with a desired account number scanning potentially a billion rows.
Thanks, any help or resources would be awesome.

(Have been working in Banks for almost 10years. Here is how it is actually done).
TLDR: your idea is good.
Every now and then you store the balance somewhere else ("carry forward balance"). E.g. every month or so (or aver a given number of transactions). To calculate the actual balance (or any balance in the past) you accumulate all relevant transactions going back in time until the most recent balance you kept ("carry forward balance"), which you need to add, of course.
The "current" balance is not kept anywhere. Just alone for the locking problems you would have if you'd update this balance all the time. (In real banks you'll hit some bank internal accounts with almost every single transactions. There are plenty of bank internal accounts to be able to get the figures required by law. These accounts are hit very often and thus would cause locking issues when you'd like to update them with every transaction. Instead every transactions is just insert — even the carry forward balances are just inserts).
Also in real banks you have many use cases which make this approach more favourable:
Being able to get back dated balances at any time - Being able to get balances based on different dates for any time (e.g. value date vs. transaction date).
Reversals/Cancellations are a fun of it's own. Imagine to reverse a transaction from two weeks ago and still keep all of the above going.
You see, this is a long story. However, the answer to your question is: yes, you cannot accumulate an ever increasing number of transactions, you need to keep intermediate balances to limit the number to accumulate if needed. Hitting the main table for a limited number of rows, should be no issue.
Make sure your main query uses an Index-Only Scan.

Do an Object Oriented Design, Create table for objects example Account, Transaction etc. Here's a good website for your reference. But there's a lot more on the web discussing OODBMS. The reference I gave is just my basis when I started doing an OODBMS.

Related

transactions and balance

I work on contracting Company database " sql server " . I'm lost what’s the best solutions to calculate their customers balance accounts.
Balance table: create table for balance and another for transactions. So my application add any transactions to transactions table and calculate the balance according to balance table value.
Calculate balance using query: so I'll create transactions table only.
Note: the records may be up to 2 million records for each year, so I think they will need to backup it every year or some thing like that.
any new ideas or comments ?!

I would have a transactions table and a balances table as well, if I were you. Let's consider for example that you have 1 000 000 users. If a user has 20 transactions on average, then getting balance from a transaction table would be roughly 20x slower than getting balance from a balances table. Also, it is better to have something than not having that something.
So, I would choose to create a balances table without thinking twice.

Comments on your 2 ways:
Good solution if you have much more queries than updates (100 times or more). So, you add new transaction, recalculate balance and store it. You can do it in one transaction but it can take a lot of time and block user action. So, you can do it later (for example, update balances onces a minute/hour/day). Pros: fast reading. Cons: possible difference between balance value and sum of transactions or increasing user action time
Good solution if you have much more updates than reads (for example, trading system with a lot of transactions). Updating current balance can take time and may be worthless, because another transaction has already came :) so, you can calculate balance at runtime, on demand. Pros: always actual balance. Cons: calculating balance can take time.
As you see, it depends on your payload profile (reads or writes). I'll advice you to begin with second variant - it's easy to implement and good DB indexies can help you to get sum very fast (2 millions per year - not so much as it looks). But, it's up to you.

Definitely you must have a separate Balance table beside transaction table. Otherwise during read balance your performance will be slower day by day as transaction increasing and transactions will be costly as other users may lock the transaction table to read balance at the same time.

This question would seem to have a lot of opinion, and I was tempted to close it.
But, in any environment where I've been where customers have "balances", a critical part of the business is knowing the current balance for each customer. This means having a historical transaction table, a current balance amount, and an auditing process to ensure that the two are aligned.
The current balance would be maintained whenever the database is changed. The "standard" method is to use triggers. My preferred method is to encapsulate data changes in stored procedures, and have the logic for the summarization in the same procedures used to modify the transaction data.

Running total - trigger or query?

Which of the following scenarios will a) provide better performance and b) be more reliable/accurate. I've simplified the process and tables used. I would provide code/working but it's fairly simple stuff. I'm using MS-SQL2008 but I would assume the question is platform independent.
1) An item is removed from stock (the stock item has a unique ID), a trigger is fired which updates [tblSold], if the ID doesn't exist it creates a record and adds a value of 1, if it does exist it adds 1 to the current value. The details of the sale are recorded elsewhere.
When stock availability is requested its calculated from this table based on the item ID.
2) When stock availability is requested it simply sums the quantity in [tblSales] based on the ID.
Stock availability will be heavily requested and for obvious reasons can't ever be wrong.

I'm going to play devil's to advocate the previous answer and suggest using a query - here are my reasons.
SQL is designed for reads, a well maintained database will have no problem with hundreds of millions of rows of data. If your data is well indexed and maintained performance shouldn't be an issue.
Triggers can be hard to trace, they're a little less explicit and update information in the background - if you forget about them they can be a nightmare. A minor point but one which has annoyed me many times in the past!
The most important point, if you use a query (assuming it's right) your data can never get out of sync and can be regenerated easily. A running count would make this very difficult.
Ultimately this is a design decision which everyone will have a different view on. At the end of the day it will come down to your preferences and design.

I would go with first approach, there is no reason to count rows, when you can have just read one value from database, trigger would not do any bad, because you will not be selling items so often as you request quantity.

Efficient database model for points system?

I'm trying to add a very simple points system to a site. In my SQL database, there is a table for awarding points, since admins can increase a user's points by any amount, along with a rationale for the points increase. So, simplified a little, this table contains the # of points awarded, rationale and userid for each time an admin awards points. So far so good.
However, there are 10 usergroups on the site that compete for highest total points. The number of points for a single usergroup can easily hit 15 000 total, as there are already more than 10 000 members of the site (admittedly, most are inactive). I want to have a leaderboard to show the competing usergroups and their total scores, but I'm worried that when implementing the system, summing the points will take too long to do each time. Here's the question: at what level (if any) should I save the points aggregate in the database? Should I have a field in the user table for total points per user and sum those up on the fly for the usergroup leaderboard? Or should I have an aggregate field for each usergroup that I update each times points are added to a single user? Before actually implementing the system, I'd like to have a good idea of how long it will take to sum these up on the fly, since a bad implementation will affect thousands of users, and I don't have much practical experience with large databases.

It depends on your hardware but summing thousands of rows should be no problem. In general though, you should avoid summing all the user scores except when you absolutely need to. I would recommend adding in a rolup table that will store the total score for each group and then run a cron nightly that will validate the total scores (basically do the summation and then store the absolutely correct values).
I suggest adding in your table that logs points awarded and reason for the award. Also, store the summed scores per user separately and update it at the same time your insert into the logging table and another table with the total score per group. That should work well at your activity level. You could also do asynchronous updates of the total group scores if it is too contentious but it should be fine.

Honestly, your aggregates will still likely compute in less than a second with less than 10k rows -you should leave your operational database atomic and just store each point transaction and compute the aggregates when you query. If you really wanted to, you could precompute your aggregates into a materialized view, but I really don't think you'd need to.
You could create a materialized view with the clause
refresh fast start with sydate
next sysdate + 1/24
-To have it refresh hourly.
You wouldn't have real-time aggregate data (it could be off by an hour), but it could increase the performance of your aggregate queries by quite a bit if the data gets huge. As your data is now, I wouldn't even bother with it though -- your performance should be alright.
edit: not sure why I'm being down-voted. I think this is a better solution than storing aggregates in tables.

Best way to calculate sum depending on dates with SQL

I don't know a good way to maintain sums depending on dates in a SQL database.
Take a database with two tables:
Client
clientID
name
overdueAmount
Invoice
clientID
invoiceID
amount
dueDate
paymentDate
I need to propose a list of the clients and order it by overdue amount (sum of not paid past invoices of the client). On big database it isn't possible to calculate it in real time.
The problem is the maintenance of an overdue amount field on the client. The amount of this field can change at midnight from one day to the other even if nothing changed on the invoices of the client.
This sum changes if the invoice is paid, a new invoice is created and due date is past, a due date is now past and wasn't yesterday...
The only solution I found is to recalculate every night this field on every client by summing the invoices respecting the conditions. But it's not efficient on very big databases.
I think it's a common problem and I would like to know if a best practice exists?

You should read about data warehousing. It will help you to solve this problem. It looks similar as what you just said
"The only solution I found is to recalculate every night this field
on every client by summing the invoices respecting the conditions. But
it's not efficient on very big databases."
But it has something more than that. When you read it, try to forget about normalization. Its main intention is for 'show' data, not 'manage' data. So, you would feel weird at beginning but if you understand 'why we need data warehousing', it will be very very interesting.
This is a book that can be a good start http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247 , classic one.

Firstly, I'd like to understand what you mean by "very big databases" - most RDBMS systems running on decent hardware should be able to calculate this in real time for anything less than hundreds of millions of invoices. I speak from experience here.
Secondly, "best practice" is one of those expressions that mean very little - it's often used to present someone's opinion as being more meaningful than simply an opinion.
In my opinion, by far the best option is to calculate it on the fly.
If your database is so big that you really can't do this, I'd consider a nightly batch (as you describe). Nightly batch runs are a pain - especially for systems that need to be available 24/7, but they have the benefit of keeping all the logic in a single place.
If you want to avoid nightly batches, you can use triggers to populate an "unpaid_invoices" table. When you create a new invoice record, a trigger copies that invoice to the "unpaid_invoices" table; when you update the invoice with a payment, and the payment amount equals the outstanding amount, you delete from the unpaid_invoices table. By definition, the unpaid_invoices table should be far smaller than the total number of invoices; calculating the outstanding amount for a given customer on the fly should be okay.
However, triggers are nasty, evil things, with exotic failure modes that can stump the unsuspecting developer, so only consider this if you have a ninja SQL developer on hand. Absolutely make sure you have a SQL query which checks the validity of your unpaid_invoices table, and ideally schedule it as a regular task.

Recommendations for best SQL Performance updating and/or calculating stockonhand totals

Apologies for the length of this question.
I have a section of our database design which I am worried may begin to cause problems. It is not at that stage yet, but obviously don't want to wait until it is to resolve the issue. But before I start testing various scenarios, I would appreciate input from anyone who has experience with such a problem.
Situation is Stock Control and maintaining the StockOnHand value.
It would be possible to maintain a table hold the stock control figures which can be updated whenever a order is entered either manually or by using a database trigger.
Alternatively you can get SQL to calculate the quantities by reading and summing the actual sales values.
The program is installed on several sites some of which are using MS-SQL 2005 and some 2008.
My problem is complicated because the same design needs to cope with several scenarios,
such as :
1) Cash/Sale Point Of Sale Environment. Sale is entered and stock is reduced in one transaction. No amendments can be made to this transaction.
2) Order/Routing/Confirmation
In this environment, the order is created and can be placed on hold, released, routed, amended, delivered, and invoiced. And at any stage until it is invoiced the order can be amended. (I mention this because any database triggers may be invoked lots of time and has to determine if changes should affect the stock on hand figures)
3) Different business have a different ideas of when their StockOnHand should be reduced. For example, some consider the stock as sold once they approve an order (as they have committed to sell the goods and hence it should not be sold to another person). Others do not consider the stock as sold until they have routed it and some others only when it has been delivered or collected.
4) There can be a large variance in number of transactions per product. For example, one system has four or five products which are sold several thousand times per month, so asking SQL to perform a sum on those transactions is reading ten of thousands of transactions per year, Whereas, on the same system, there are several thousand other products where sales would only less than a thousand transactions per year per product.
5) Historical information is important. For that reason, our system does not delete or archive transactions and has several years worth of transactions.
6) The system must have the ability to warn operators if they do not have the required stock when the order is entered ( which quite often is in real time, eg telephone order).
Note that this only required for some products. (But I don't think it would be practical to sum the quantity across ten of thousands of transactions in real time).
7) Average Cost Price. Some products can be priced based on the average cost of the items in stock. The way this is implemented is that the Average Cost price is re-calculated for every goods in transaction, something like newAverageCostPrice = (((oldAverageCostPrice * oldStockOnHand) + newCostValue) / newStockOnHand) . This means the stock On Hand must be known for every goods in if the product is using AverageCost.
The way the system is currently implemented is two fold.
We have a table which holds the StockOnHand for each product and location. Whenever a sale is updated, this table is updated via the business layer of our application (C#)
This only provides the current stock on hand figures.
If you need to run a Stock Valuation for a particular date, this figure is calculated by performing a sum of the quantitys on the lines involved. This also requires a join between the sales line and the sale header tables as the quantity and product are stored in the line file and the date and status are only held in the header table.
However, there are downsides to this method, such as.
Running the stock valuation report is slow, (but not unacceptably slow), but I am not happy with it. (It works and monitoring the server does not show it overloading it, but it has the potential to cause problems and hence requires regular monitoring)
The logic of the code updating the StockOnHand table is complicated.
This table is being updated frequently. In a lot of cases this is un-necessary as the information does not need to be checked. For example, if 90% of your business is selling 4 or 5 products, you don't really need a computer to tell you are out of stock.
Database trigggers.
I have never implemented complicated triggers before, so am wary of this.
For example, as stated before we need configuration options to determine the conditions when the stock figures should be updated. This is currently read once and cached in our program. To do this inside a trigger would persumably mean reading this information for every trigger. Does this have a big impact on performance.
Also we may need a trigger on the sale header and the sale line. (This could mean that an amendment to the sale header would be forced to read the lines and update the stockonhand for the relevant products, and then later on the lines are saved and another database trigger would amend the stockonahand table again which may be in-efficient.
Another alternative would be to only update the StockOnHand table whenever the transaction is invoiced (which means no further amendments can be done) and to provide a function to calculate the stockonhand figure based on a union of this table and the un-invoiced transactions which affect stock.
Any advice would be greatly appreciated.

First of I would strongly recommend you add "StockOnHand", "ReservedStock" and "SoldStock"
to your table.
A cash sale would immediatly Subtract the sale from "StockOnHand" and add it to "SoldStock", for an order you would leave "StockOnHand" alone and merely add the sale to ReservedStock, when the stock is finally invoiced you substract the sale from StockOnHand and Reserved stock and add it to "SoldStock".
The business users can then choose whether StockOnHand is just that or StockOnHand - ReservedStock.
Using a maintaind StockOnHand figure will reduce your query times massively, versus the small risk that the figure can go out of kilter if you mess up your program logic.
If your customers are so lucky enough to experience update contention when maintaining the StockOnHand figure (i.e. are they likely to process more than five sales a second at peak times) then you can consisider the following scheme:-
Overnight calculate the StockOnHand figure by counting deliveries - sales or whatever.
When a sale is confirmed insert a row to a "Todays Sales" table.
When you need to query stock on hand total up todays sale and subtract it from the start of day figure.
You could also place a "Stock Check Threshold" on each product so if you start the day with 10,000 widgets you can set the CheckThreshold to 100 if someone is ordering less than 100 than dont bother checking the stock. If someone orders over 100 then check the stock and recalculate a new lower threshold.

Could you create a view (or views) to respresent your stock on hand? This would take the responsibility for doing the calculations out of synchronous triggers which slow down your transactions. Using multiple views could satisfy the requirement "Different business have a different ideas of when their StockOnHand should be reduced." Assuming you can meet the stringent requirements, creating an indexed view could further improve your performance.

Just some ideas off the top of my head:
Instead of a trigger (and persistent SOH data), you could use a computed column (e.g. SOH per product per store). However, the performance impact of evaluating this would likely be abysmal unless there are >> more writes to your source tables than reads from your computed column. (The trade off is that is assuming the only reason you calculate the SOH is so that you can read it again. If you update the source data for the calc much more often than you actually need to read it, then the computed col might make sense - since it is JIT evaluation only when needed. This would be unusual though - reads are usually more frequent than writes in most Systems)
I'm guessing that the reason you are looking at triggers is because the source tables for the SOH figures are updated from a large number of procs / code in order to prevent oversight (as opposed to a calling a recalc SPROC from every applicable point where the source data has been modified?)
IMHO placing complicated in DB triggers is not advised, as this will adversely affect the performance of high volume inserts / updates, and triggers aren't great for maintainability.
Does the SOH calculation need to be real time? If not, you could implement a mechanism to queue requests for recalculation (e.g. by using a trigger to indicate that a product / location balance is dirty) and then run a recalculation service every few minutes for near real-time. Mission critical calculations (e.g. financial - like your #6) could still however detect that a SOH calc is dirty and then force a recalc before doing a transaction.
Re : 3 - Ouch. Would recommend that internally you agree on a consistent (and industry acceptable) set of terminology (Stock In Hand, Stock Committed, Stock in Transit, Shrinkage etc etc) and then try to convince your customers to conform to a standard. But that is in the ideal world of course!

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas