SQL SUM expression and Lock - sql

I am having trouble finding the right SQL solution.
Current situation:
My database contains a table of bank transactions (credits and debits).
Credit transactions are stored as positive amounts (+), and
debit transactions as negative amounts (-).
The application that uses the DB is a multiuser web app, so the Transactions table contains many rows that refer to different users.
Some web app actions need to check the current balance of the logged-in user from the Transactions table and then save a debit transaction (the price of the action).
I am thinking about the architecture of this mechanism and have some questions:
Is it a good idea to calculate the balance as a SUM of the user's credits and debits each time the user requests it? I know it may be inefficient for the DB. Should I save a snapshot somewhere instead?
How do I ensure data consistency when one user checks the "balance" as a SUM of credit/debit transactions and another user saves a debit transaction at the same time (because he/she was faster)? I am considering a pessimistic lock, but what should I lock? I know that locking with an aggregate (SUM) may be impossible in PostgreSQL (the database I use).
Sorry for my English, I hope my problem is understandable. :)

I would consider EITHER:
Storing a balance on the account record, along with the date for which the balance is accurate.
Getting the current balance is a matter of reading the account balance, and then including any transactions since that date.
You can have a scheduled job that recalculates and timestamps that balance at an hour past midnight.
OR (and this is my preferred solution):
Every time a transaction or batch of transactions is loaded, lock the relevant account records and update them with the values from the insert as part of the same transaction.
This has the advantage of serialising access to the account, which makes it safe to decide from the stored balance whether a transaction can go ahead or not.
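A minimal sketch of the second option, written in Python with the standard-library sqlite3 module so that it runs anywhere. The table and column names (accounts, transactions, balance) and the use of integer minor units are assumptions for illustration, not something the question specifies. In PostgreSQL the UPDATE takes a row-level lock on the account row until commit, so concurrent postings to the same account queue up behind each other; SELECT ... FOR UPDATE is the explicit way to take the same lock. SQLite has no row locks, so BEGIN IMMEDIATE stands in for it here.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode,
# so the BEGIN / COMMIT / ROLLBACK statements below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL          -- minor units (assumption)
    );
    CREATE TABLE transactions (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        amount     INTEGER NOT NULL       -- credit > 0, debit < 0
    );
    INSERT INTO accounts (id, balance) VALUES (1, 100);
""")


def post_transaction(conn, account_id, amount):
    """Record a credit/debit and adjust the stored balance in one transaction.

    Returns False (and changes nothing) if the account does not exist
    or the debit would make the balance negative.
    """
    conn.execute("BEGIN IMMEDIATE")  # SQLite stand-in for a row lock
    try:
        # The balance check and the update happen in a single statement,
        # so no other writer can slip in between them.
        cur = conn.execute(
            "UPDATE accounts SET balance = balance + ? "
            "WHERE id = ? AND balance + ? >= 0",
            (amount, account_id, amount),
        )
        if cur.rowcount != 1:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
            (account_id, amount),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


ok = post_transaction(conn, 1, -30)        # allowed: 100 - 30 = 70
rejected = post_transaction(conn, 1, -80)  # refused: 70 - 80 < 0
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

Because the balance lives on the account row, checking it never needs a SUM over the transactions, which sidesteps the question of locking an aggregate.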

If you want to avoid storing the balance on the user account, which could also perform better, the approach I would experiment with is:
Each transaction would be related to only one account.
Each transaction would have the account balance after that transaction.
Therefore, the last transaction for that account would have the current balance.
Ex.:
TransactionId | AccountId | Datetime | Amount | Balance
1 | 1 | 7/11/16 | 0 | 0
2 | 1 | 7/11/16 | 500 | 500
3 | 1 | 7/11/16 | -20 | 480
4 | 1 | 8/11/16 | 50 | 530
5 | 1 | 8/11/16 | -200 | 330
This way you would be able to get the account balance (last transaction with that accountId) and you would be able to provide a better view into the balance change over time.
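A small sketch of this running-balance layout, again using Python's sqlite3 so it is runnable. The helper names (current_balance, add_transaction) are hypothetical, the Datetime column is omitted for brevity, and the amounts are the ones from the example table. The read of the previous balance and the insert of the new row have to be serialised per account; BEGIN IMMEDIATE does that in SQLite, while in PostgreSQL you would need a lock on some per-account row or a stricter isolation level (an assumption about your setup, not part of the answer above).

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.execute("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL,
        amount         INTEGER NOT NULL,
        balance        INTEGER NOT NULL   -- account balance after this row
    )
""")


def current_balance(conn, account_id):
    """Balance stored on the account's most recent transaction (0 if none)."""
    row = conn.execute(
        "SELECT balance FROM transactions WHERE account_id = ? "
        "ORDER BY transaction_id DESC LIMIT 1",
        (account_id,),
    ).fetchone()
    return row[0] if row else 0


def add_transaction(conn, account_id, amount):
    """Append a transaction carrying the new running balance."""
    conn.execute("BEGIN IMMEDIATE")  # serialise read-then-insert
    try:
        new_balance = current_balance(conn, account_id) + amount
        conn.execute(
            "INSERT INTO transactions (account_id, amount, balance) "
            "VALUES (?, ?, ?)",
            (account_id, amount, new_balance),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return new_balance


# Replay the amounts from the example table for account 1.
for amount in (0, 500, -20, 50, -200):
    add_transaction(conn, 1, amount)

history = [
    row[0]
    for row in conn.execute(
        "SELECT balance FROM transactions WHERE account_id = 1 "
        "ORDER BY transaction_id"
    )
]
```

After the replay, current_balance(conn, 1) reads a single row instead of summing the whole history.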

Related

Update historical results from wrong recurrent query

Hi, I am wondering how to go about the following situation: I have a recurring query (on a PostgreSQL 12 instance) that runs daily and inserts customer volumes for 6 different metrics into a table, creating a snapshot of the dataset over time:
26/03/2021| Active Linked Customers | 10
26/03/2021| Customers where all accounts have positive consent | 20
26/03/2021| Customer with accounts with a mixture of positive consent and > 90 days | 30
26/03/2021| Customers who have all accounts > 30 days | 40
26/03/2021| Customers who have ever connected >=1 account | 50
26/03/2021| Customer record created but customer never connected an account | 5
25/03/2021| Active Linked Customers | 7
25/03/2021| Customers where all accounts have positive consent | 9
25/03/2021| Customer with accounts with a mixture of positive consent and > 90 days | 30
25/03/2021| Customers who have all accounts > 30 days| 40
25/03/2021| Customers who have ever connected >=1 account | 50
25/03/2021| Customer record created but customer never connected an account | 4
...
I have now realised that I need to fix the WHERE clause of the query that has been generating all the results for a couple of months now. The fix is to exclude accounts that have already been deleted from the system; I just need to add the following clause:
where deleted_date is NULL
The query is built on 6 different CTEs that feed the table above on a daily basis. The fix for future results is straightforward, but how do I "fix" the historical data and exclude deleted accounts from the results? Is there a way to reverse-engineer the results and "go back in time" to tweak the query and rerun it for the last couple of months? Or even some clever data analysis to get around this? Any ideas / suggestions?

Calculating user's balance with multiple cost tables

We're building a SaaS offering where a user can incur costs from various types of transactions for example:
Making phone calls
Sending SMS messages
Storing audio recordings
We have built our system to store the costs of each service, for example the call_audit table looks like:
Date       | Call ID | Our Cost | User Cost | Currency | Duration | User ID
-----------+---------+----------+-----------+----------+----------+---------
2018-01-02 | sm_123  | 0.01     | 0.02      | USD      | 72       | us_1
The sms_audit table looks like:
Date       | SMS ID | Our Cost | User Cost | Currency | User ID
-----------+--------+----------+-----------+----------+---------
2018-01-02 | sm_123 | 0.01     | 0.02      | USD      | us_1
Then there is a payment_audit table with user payments and refunds:
Date       | User ID | Amount | Currency | Type
-----------+---------+--------+----------+--------
2018-01-02 | us_1    | 12     | USD      | CHARGE
2018-01-02 | us_1    | -2     | USD      | REFUND
We also have a user table with a balance column which we decrement when the user incurs a call, sms cost or refund. We increment it when the user pays into their account (CHARGE as above).
But going forward I'm thinking we need something more resilient than a single balance figure which gets updated in code.
One improvement is to update the balance figure with triggers instead of in code.
Another approach would be to calculate the user's total costs and payments across multiple tables and sum the lot. As the tables grow to many 1000s of transactions I can imagine this becoming a slow computation.
Another approach we thought of was to have a balance_transactions table with a debit, credit and running balance column. This of course incurs transitive dependencies between rows which isn't great if seeking a nicely normalized DB. It also means we're duplicating data, but in the real world is this an acceptable trade off?
You can avoid duplicating the data by using materialized views. Note that updating the balance in any way (by the application, by triggers, or as partial running balances) already duplicates the data. As such, you should have validation procedures that run regularly and alert on discrepancies. Such procedures have to do all the calculations anyway, so they might as well populate the materialized view.
However, the right solution depends on how often you need this data. If, for example, you fetch all customer balances monthly for invoicing, just don't duplicate them. But if you show the balance after each customer operation, e.g. in a transaction confirmation (such as a PDF generated and e-mailed to the customer), you might want to keep the running balance exactly as it was presented to the customer, since the customer then holds evidence of that balance.
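The validation idea can be sketched as a query that recomputes each balance from the audit tables and lists users whose stored figure disagrees. This uses Python's sqlite3 purely to make the example runnable. The column names, the storage of money as integer cents, and the second user us_2 with a deliberately wrong balance are all assumptions made for the illustration. The sign convention follows the question: charges add to the balance, while refunds (already negative in payment_audit) and call/SMS costs reduce it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY, balance_cents INTEGER NOT NULL);
    CREATE TABLE call_audit (call_id TEXT, user_id TEXT, user_cost_cents INTEGER);
    CREATE TABLE sms_audit (sms_id TEXT, user_id TEXT, user_cost_cents INTEGER);
    CREATE TABLE payment_audit (user_id TEXT, amount_cents INTEGER, type TEXT);

    -- us_1 mirrors the sample rows; us_2 has a stored balance with no history.
    INSERT INTO users VALUES ('us_1', 996), ('us_2', 500);
    INSERT INTO call_audit VALUES ('sm_123', 'us_1', 2);
    INSERT INTO sms_audit VALUES ('sm_123', 'us_1', 2);
    INSERT INTO payment_audit VALUES ('us_1', 1200, 'CHARGE'),
                                     ('us_1', -200, 'REFUND');
""")

# Recompute every balance from the audit tables and keep only mismatches.
DISCREPANCY_QUERY = """
WITH ledger AS (
    SELECT user_id, amount_cents AS delta FROM payment_audit
    UNION ALL
    SELECT user_id, -user_cost_cents FROM call_audit
    UNION ALL
    SELECT user_id, -user_cost_cents FROM sms_audit
),
computed AS (
    SELECT user_id, SUM(delta) AS balance_cents
    FROM ledger
    GROUP BY user_id
)
SELECT u.user_id,
       u.balance_cents                AS stored,
       COALESCE(c.balance_cents, 0)   AS recomputed
FROM users AS u
LEFT JOIN computed AS c ON c.user_id = u.user_id
WHERE u.balance_cents <> COALESCE(c.balance_cents, 0)
ORDER BY u.user_id
"""

discrepancies = conn.execute(DISCREPANCY_QUERY).fetchall()
```

For us_1 the recomputed value is 1200 - 200 - 2 - 2 = 996 cents, which matches the stored balance, so only us_2 is reported. The "computed" part of the query is also what a materialized view of balances would contain.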

Calculating interest using SQL

I am using PostgreSQL, and have a table for a billing cycle and another for payments made in a billing cycle.
I am trying to figure out how to calculate interest based on how much amount was left after each billing cycle's last payment date. Problem is that every time a repayment is made, the interest has to be calculated on the amount remaining after that.
My thoughts on building this query are like this. Build data for all dates from last pay date of the billing cycle to today. Using partitioning, get the remaining amount for the first date. For second date, use amount from previous row and add interest to it, and then calculate interest on this one.
Unfortunately I am stuck just at the thought and can't figure out how to make this into a query!
Here's some sample data to make things easier to understand.
Billing Cycles:
id | ends_at
-----+---------------------
1 | 2017-11-30
2 | 2017-11-30
Payments:
amount | billing_cycle_id | type | created_at
-----------+------------------+---------+----------------------------
6000.0000 | 1 | payment | 2017-11-15 18:40:22.151713
2000.0000 | 1 |repayment| 2017-11-19 11:45:15.6167
2000.0000 | 1 |repayment| 2017-12-02 11:46:40.757897
As we can see, the user made a repayment on the 19th, so the amount that accrues interest after the end date (30 Nov 2017) is only 4000. So, from the 30th to the 2nd, interest is calculated daily on 4000. However, from the 2nd, interest needs to be calculated only on the remaining 2000 (plus the interest already accrued, as in the table below).
Interest Calculations(Today being 2017-12-04):
date | amount | interest
------------+---------+----------
2017-12-01 | 4000 | 100 // First day of pending dues.
2017-12-02 | 2100 | 52.5 // Second day of pending dues.
2017-12-03 | 2152.5 | 53.8125 // Third day of pending dues.
2017-12-04 |2206.3125| // Fourth's day interest will be added tomorrow
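The sample figures can be reproduced with a short day-by-day loop, shown here in plain Python rather than SQL to make the arithmetic explicit. The daily rate of 2.5% is not stated anywhere in the question; it is inferred from the first row (100 / 4000) and should be treated as an assumption. The function name interest_schedule is hypothetical.

```python
from datetime import date, timedelta
from decimal import Decimal

DAILY_RATE = Decimal("0.025")  # assumption: inferred from 100 / 4000 in the sample


def interest_schedule(opening, start, today, repayments):
    """Return (day, amount, interest) rows from start to today inclusive.

    Each day any repayment made that day is deducted first, then interest is
    charged on what remains and carried into the next day's amount. The last
    row has interest None because it is only added the following day.
    """
    rows = []
    amount = opening
    day = start
    while day <= today:
        amount -= repayments.get(day, Decimal("0"))
        interest = amount * DAILY_RATE if day < today else None
        rows.append((day, amount, interest))
        if interest is not None:
            amount += interest  # compound into tomorrow's amount
        day += timedelta(days=1)
    return rows


rows = interest_schedule(
    opening=Decimal("4000"),  # 6000 payment minus the 2000 repaid on 2017-11-19
    start=date(2017, 12, 1),
    today=date(2017, 12, 4),
    repayments={date(2017, 12, 2): Decimal("2000")},
)
for day, amount, interest in rows:
    print(day, amount, interest)
```

The same recurrence (previous amount plus previous interest, minus repayments on that day) is what a recursive CTE over a generated date series would have to express in PostgreSQL.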
Your data is too sparse. It doesn't make sense to have to write this query, because it will only get more complicated over time. What happens when interest rates change?
The table itself (or a secondary table, depending on how you want to structure it) could hold a running balance that you update every time a deposit or withdrawal is made. (I suggest making this table append-only.) Otherwise you're making both the calculation and the accounting far harder on yourself than they should be. Even as you've presented the problem here, there's not enough information to do the calculation (the interest rate is missing). When that's the case, your stored procedure is going to be too complicated. Complicated means bugs, and people get irritated about bugs when you're talking about their money.

Run a query to check consistency in SQL Server

I need some help with a SQL query and logic in general. (Using MSSQL Server)
I need to check the consistency of payments at certain retailers over a period of three months.
So I've got a table with all my transactions and the following columns:
TransactionID , AccountNumber , Retailer, Date .... (few other irrelevant ones)
Now one AccountNumber could have many TransactionIDs. (One account could decide to make several payments during one month.)
I have 4 unique retailers' ids, let's call them (101,102,103,104)
Now for consistency I want to get the following data:
The count of transactions where there was only one payment per account for the month at each retailer.
So I'd have:
| # Payments For Month | Retailer | Number of Transactions
| 1 Payment | 101 | 5000
...
But I also want to see how many transactions there were from accounts that made payments at multiple retailers
So I'd want something like:
| 2 Payments | 102 & 104 | 20
Which would mean that an account made 20 payments at retailer 102 & 104.
I don't as much care about how many accounts, more the amount of transactions.
I also want it broken down by month, but I've decided to do a separate query for each month.
I've imported the data into a local DB on my personal laptop so I could go crazy, so I'll be able to try any method.
The goal of this query is to check the consistency of payments by people (accounts) at certain retailers. How many transactions do they loyally make at one retailer every month, how many transactions are there where they've gone to two retailers? or three? or all four?
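One possible shape for such a report, sketched with Python's sqlite3 so it can be run as-is: first reduce each account to one row for the month (its payment count plus a 0/1 flag per retailer), then group those rows by the combination of flags. The column names, the sample rows, and the dates are made up for illustration, and the retailer ids are the four placeholders from the question. The SQL uses only CTEs, CASE and GROUP BY, which SQL Server also supports, but it has not been run against SQL Server here.

```python
import sqlite3

RETAILERS = (101, 102, 103, 104)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        account_number TEXT,
        retailer       INTEGER,
        tx_date        TEXT              -- ISO dates, hypothetical sample data
    );
    INSERT INTO transactions (account_number, retailer, tx_date) VALUES
        ('A', 101, '2020-01-03'),
        ('E', 101, '2020-01-04'),
        ('B', 101, '2020-01-05'),
        ('B', 101, '2020-01-20'),
        ('C', 102, '2020-01-07'),
        ('C', 104, '2020-01-09'),
        ('C', 104, '2020-01-21'),
        ('D', 103, '2020-02-01');
""")

QUERY = """
WITH per_account AS (
    SELECT account_number,
           COUNT(*) AS n_tx,
           MAX(CASE WHEN retailer = 101 THEN 1 ELSE 0 END) AS r101,
           MAX(CASE WHEN retailer = 102 THEN 1 ELSE 0 END) AS r102,
           MAX(CASE WHEN retailer = 103 THEN 1 ELSE 0 END) AS r103,
           MAX(CASE WHEN retailer = 104 THEN 1 ELSE 0 END) AS r104
    FROM transactions
    WHERE tx_date >= ? AND tx_date < ?
    GROUP BY account_number
)
SELECT r101, r102, r103, r104, n_tx, SUM(n_tx) AS transactions
FROM per_account
GROUP BY r101, r102, r103, r104, n_tx
ORDER BY r101 DESC, r102 DESC, r103 DESC, r104 DESC, n_tx
"""


def retailer_mix(conn, month_start, next_month_start):
    """Rows of (payments per account, retailers used, total transactions)."""
    report = []
    for r101, r102, r103, r104, n_tx, total in conn.execute(
        QUERY, (month_start, next_month_start)
    ):
        used = zip(RETAILERS, (r101, r102, r103, r104))
        label = " & ".join(str(r) for r, flag in used if flag)
        report.append((n_tx, label, total))
    return report


january = retailer_mix(conn, "2020-01-01", "2020-02-01")
```

With the sample rows, accounts A and E each made one payment at 101, B made two at 101, and C made three across 102 and 104, so the report has one line for each of those three patterns.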

Designing a scalable points leaderboard system using SQL Server

I'm looking for suggestions for scaling a points leaderboard system. I already have a working version using a very normalized strategy. This first version was essentially a table which looked something like this.
UserPoints - PK: (UserId,Date)
+------------+--------+---------------------+
| UserId | Points | Date |
+------------+--------+---------------------+
| 1 | 10 | 2011-03-17 07:16:36 |
| 2 | 35 | 2011-03-17 08:09:26 |
| 3 | 40 | 2011-03-17 08:05:36 |
| 1 | 65 | 2011-03-17 09:01:37 |
| 2 | 16 | 2011-03-17 10:12:35 |
| 3 | 64 | 2011-03-17 12:51:33 |
| 1 | 300 | 2011-03-17 12:19:21 |
| 2 | 1200 | 2011-03-17 13:24:13 |
| 3 | 510 | 2011-03-17 17:29:32 |
+------------+--------+---------------------+
I then have a stored procedure which basically does a GROUP BY on UserId and sums the Points. I can also pass @StartDate and @EndDate parameters to create a leaderboard for a specific time period. For example, time windows for Top Users for the Day / Week / Month / Lifetime.
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I'm working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I'm starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
Unfortunately I don't have access to 'Jobs' since I'm using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am - Friday 6pm every week. But that's down the road and why I'm trying to not get too fancy with the aggregation. I'm willing to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can't store them denormalized because I need these windows to be TimeZone convertible. The system is multi-tenant and therefore all data is stored as UTC. The problem is that a week starts at different hours for different customers. Aggregating the sums together will cause some points to fall into the wrong buckets.
Here are a few thoughts:
Sticking with SQL Azure: you can have another table, PointsTotals. Every time you add a row to your UserPoints table, also increment the TotalPoints value for the given UserId in PointsTotals (or insert a new row if the user doesn't have one to increment). Now you always have totals computed for each UserId.
Going with Azure Table Storage: create a UserPoints table, with the PartitionKey being UserId. This keeps all of a user's points rows together, where you'd easily be able to sum them. And you can borrow the idea from suggestion #1, creating a separate PointsTotals table, with PartitionKey being UserId and RowKey probably being the total points.
If it were my problem, I'd ignore the timestamps and store the user and points totals by day.
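The first suggestion can be sketched as follows, using Python's sqlite3 in place of SQL Azure so the example is runnable. UserPoints and PointsTotals are the table names from the discussion; the helper add_points is hypothetical, and the rows are the sample data from the question. Both writes happen in one transaction so the total can never drift from the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit transactions
conn.executescript("""
    CREATE TABLE UserPoints (
        UserId INTEGER NOT NULL,
        Points INTEGER NOT NULL,
        Date   TEXT    NOT NULL,
        PRIMARY KEY (UserId, Date)
    );
    CREATE TABLE PointsTotals (
        UserId      INTEGER PRIMARY KEY,
        TotalPoints INTEGER NOT NULL
    );
""")


def add_points(conn, user_id, points, when):
    """Insert a points row and bump the user's running total atomically."""
    conn.execute("BEGIN")
    try:
        conn.execute(
            "INSERT INTO UserPoints (UserId, Points, Date) VALUES (?, ?, ?)",
            (user_id, points, when),
        )
        cur = conn.execute(
            "UPDATE PointsTotals SET TotalPoints = TotalPoints + ? "
            "WHERE UserId = ?",
            (points, user_id),
        )
        if cur.rowcount == 0:  # first points for this user
            conn.execute(
                "INSERT INTO PointsTotals (UserId, TotalPoints) VALUES (?, ?)",
                (user_id, points),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


SAMPLE = [
    (1, 10, "2011-03-17 07:16:36"), (2, 35, "2011-03-17 08:09:26"),
    (3, 40, "2011-03-17 08:05:36"), (1, 65, "2011-03-17 09:01:37"),
    (2, 16, "2011-03-17 10:12:35"), (3, 64, "2011-03-17 12:51:33"),
    (1, 300, "2011-03-17 12:19:21"), (2, 1200, "2011-03-17 13:24:13"),
    (3, 510, "2011-03-17 17:29:32"),
]
for user_id, points, when in SAMPLE:
    add_points(conn, user_id, points, when)

leaderboard = conn.execute(
    "SELECT UserId, TotalPoints FROM PointsTotals ORDER BY TotalPoints DESC"
).fetchall()
```

This only covers the lifetime leaderboard. Day, week, or month windows would still need either the detail rows or a similar totals table keyed by user and time bucket.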
I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer's current TimeZone setting. I realized an extra benefit of this is that I can 'purge' old leaderboard round data after a few months without affecting the lifetime total of points.