Update historical results from wrong recurrent query - sql

Hi, I am wondering how to handle the following situation. I have a recurring query (PostgreSQL 12 instance) that runs daily and inserts customer volumes for 6 different metrics into a table, creating a snapshot of the dataset over time:
26/03/2021| Active Linked Customers | 10
26/03/2021| Customers where all accounts have positive consent | 20
26/03/2021| Customer with accounts with a mixture of positive consent and > 90 days | 30
26/03/2021| Customers who have all accounts > 30 days | 40
26/03/2021| Customers who have ever connected >=1 account | 50
26/03/2021| Customer record created but customer never connected an account | 5
25/03/2021| Active Linked Customers | 7
25/03/2021| Customers where all accounts have positive consent | 9
25/03/2021| Customer with accounts with a mixture of positive consent and > 90 days | 30
25/03/2021| Customers who have all accounts > 30 days | 40
25/03/2021| Customers who have ever connected >=1 account | 50
25/03/2021| Customer record created but customer never connected an account | 4
...
I have now realised that I need to fix a WHERE clause in the query that has been generating all results for a couple of months now. The fix excludes accounts that have already been deleted from the system; I just need to add the following clause:
where deleted_date is not NULL
The query is built on 6 different CTEs that feed the table above on a daily basis. The fix for future results is straightforward, but how do I "fix" the historical data and exclude deleted accounts from the results? Is there a way to reverse engineer the results and "go back in time" to tweak the query and rerun it for the last couple of months? Or even some clever data analysis to get around this? Any ideas / suggestions?

Related

SQL Database Design for monthly issued coupons

I'm struggling with a database schema for a problem I'm having.
Let's say I own a business that sells monthly services (cleaning) to different companies.
However, I give companies monthly saveable 'coupons' that each act as a reduction (of 5 dollars), based on their number of users.
Example:
It's April 2018.
Company XYZ has to pay 1,000 dollars for their monthly cleaning services by my business.
XYZ has 5 employees, so they will have 5 coupons for the month of April.
HOWEVER, since coupons can be saved (for a period of 2 months), company XYZ will use not only the coupons of April but also those of March (since they didn't use any that month, and the February coupons are already used up).
Result:
10 coupons are used on their April invoice (5 of March, 5 of April):
total amount to pay: 950 dollars
My thing is that I want to automate this. With one click on the button, my system will have to check:
How many users there are
If there are any unused coupons from last 2 months (and use those first if they exist)
Apply coupons to their invoice.
I want to design this in a database first, but I'm struggling:
This is my design
Company
CompanyID
Name
User
UserID
CompanyID
UserID
Now I'm struggling with the coupon design: how can I develop this so that I can automate my problem?
I will need to save coupons per company per month.
My idea is to do it like this:
Company_Month_Coupon
CompanyID
Coupon_Count
Month
I wasn't sure if I could do this in one table, and I'm not so sure about the following problem:
what if my program user decides to cancel an invoice, how would my system know from which month the coupons came?
What design would be advised for a coupon-sharing system?
Any advice on tackling this problem would be greatly appreciated.
I would go with your idea and have 2 more tables: Invoices and Invoices_UsedCoupons
Invoices:
ID (Primary key)
CompanyID
Month
Status (to set a cancelled status on your invoice if you don't want to delete from the DB)
Invoices_UsedCoupons:
InvoiceId (foreign key to Invoices table)
Coupon_Count
Month (this field is for the used coupons from Company_Month_Coupon table)
The reasons for this:
We should still store the issued coupons (in your Company_Month_Coupon table) because for each month, the number of employees may change. It means that you have to keep track of the issued coupons whenever the number of employees changes.
With Invoices and Invoices_UsedCoupons table, you could easily calculate the actual used coupons & the remaining coupons.
"what if my program user decides to cancel an invoice, how would my system know from which month the coupons came?"
All the information is available in Invoices and Invoices_UsedCoupons tables. If you want to reclaim coupons after cancelling the invoice, it's also easy to do.
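The calculation described above can be sketched as follows. This is only a minimal illustration, assuming the table and column names from this answer; the column types, the 'issued' / 'cancelled' status values and the sample rows are assumptions, and SQLite (via Python's sqlite3 module) stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Table names follow the answer; types and status values are assumptions.
conn.executescript("""
CREATE TABLE Company_Month_Coupon (
    CompanyID INTEGER, Month TEXT, Coupon_Count INTEGER,
    PRIMARY KEY (CompanyID, Month));
CREATE TABLE Invoices (
    ID INTEGER PRIMARY KEY, CompanyID INTEGER, Month TEXT, Status TEXT);
CREATE TABLE Invoices_UsedCoupons (
    InvoiceId INTEGER REFERENCES Invoices(ID), Month TEXT, Coupon_Count INTEGER);

-- Company 1 is issued 5 coupons in each of three months.
INSERT INTO Company_Month_Coupon VALUES
    (1, '2018-02', 5), (1, '2018-03', 5), (1, '2018-04', 5);
-- The February invoice uses February's coupons; the April invoice
-- uses the saved March coupons plus April's.
INSERT INTO Invoices VALUES (1, 1, '2018-02', 'issued'), (2, 1, '2018-04', 'issued');
INSERT INTO Invoices_UsedCoupons VALUES
    (1, '2018-02', 5), (2, '2018-03', 5), (2, '2018-04', 5);
""")

def remaining_coupons(conn, company_id):
    """Issued minus used coupons per month, ignoring cancelled invoices."""
    rows = conn.execute("""
        SELECT c.Month,
               c.Coupon_Count - COALESCE(SUM(u.Coupon_Count), 0) AS remaining
        FROM Company_Month_Coupon AS c
        LEFT JOIN (
            SELECT i.CompanyID, uc.Month, uc.Coupon_Count
            FROM Invoices_UsedCoupons AS uc
            JOIN Invoices AS i ON i.ID = uc.InvoiceId
            WHERE i.Status <> 'cancelled'
        ) AS u ON u.CompanyID = c.CompanyID AND u.Month = c.Month
        WHERE c.CompanyID = ?
        GROUP BY c.Month, c.Coupon_Count
        ORDER BY c.Month
    """, (company_id,)).fetchall()
    return dict(rows)

active = remaining_coupons(conn, 1)

# Cancelling the April invoice reclaims the coupons it used, per source month.
conn.execute("UPDATE Invoices SET Status = 'cancelled' WHERE ID = 2")
after_cancel = remaining_coupons(conn, 1)
```

Because each Invoices_UsedCoupons row records the month the coupons came from, cancelling an invoice needs no extra bookkeeping: its rows simply stop counting as used.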
"I will need to save coupons per company per month."
Maybe you can do the opposite: don't store the coupons that can be used in the database, only those that were actually used, for example in a table "used_coupons".
The idea is that coupons are granted by default, so it makes no sense to store them; only the used coupons need to be saved.
At checkout you need to find out how many users are in the company and how many "used coupons" were saved in the last two months.
If X coupons are returned, delete the latest X coupons from the "used_coupons" table.

Calculating interest using SQL

I am using PostgreSQL, and have a table for a billing cycle and another for payments made in a billing cycle.
I am trying to figure out how to calculate interest based on the amount left after each billing cycle's last payment date. The problem is that every time a repayment is made, the interest has to be calculated on the amount remaining after it.
My thoughts on building this query are like this. Build data for all dates from last pay date of the billing cycle to today. Using partitioning, get the remaining amount for the first date. For second date, use amount from previous row and add interest to it, and then calculate interest on this one.
Unfortunately I am stuck just at the thought and can't figure out how to make this into a query!
Here's some sample data to make things easier to understand.
Billing Cycles:
id | ends_at
-----+---------------------
1 | 2017-11-30
2 | 2017-11-30
Payments:
amount | billing_cycle_id | type | created_at
-----------+------------------+---------+----------------------------
6000.0000 | 1 | payment | 2017-11-15 18:40:22.151713
2000.0000 | 1 |repayment| 2017-11-19 11:45:15.6167
2000.0000 | 1 |repayment| 2017-12-02 11:46:40.757897
As the data shows, the user made a repayment on the 19th, so the amount subject to interest after the end date (30th Nov 2017) is only 4000. So, from the 30th to the 2nd, interest will be calculated daily on 4000. However, from the 2nd, interest needs to be calculated on 2000 only.
Interest Calculations (today being 2017-12-04):
date | amount | interest
------------+---------+----------
2017-12-01 | 4000 | 100 // First day of pending dues.
2017-12-02 | 2100 | 52.5 // Second day of pending dues.
2017-12-03 | 2152.5 | 53.8125 // Third day of pending dues.
2017-12-04 | 2206.3125 | // Fourth day's interest will be added tomorrow
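The arithmetic behind the table above can be checked with a short sketch. The question never states the interest rate; the 2.5% daily rate below is an assumption inferred from the first row (100 / 4000), and the function name `accrue` is made up for illustration. A repayment is subtracted on the day it is made, before that day's interest is computed, which is what the sample rows imply.

```python
from datetime import date, timedelta
from decimal import Decimal

# Assumption: daily rate inferred from the sample (100 interest on 4000).
DAILY_RATE = Decimal("0.025")

def accrue(opening_balance, first_day, last_day, repayments):
    """Return (day, amount, interest) rows with daily compounding.

    `repayments` maps a date to the amount repaid on that date.
    The interest for `last_day` is None because it is only added tomorrow.
    """
    rows = []
    amount = opening_balance
    day = first_day
    while day <= last_day:
        # Apply any repayment made on this day before computing interest.
        amount -= repayments.get(day, Decimal("0"))
        interest = amount * DAILY_RATE if day < last_day else None
        rows.append((day, amount, interest))
        if interest is not None:
            amount += interest  # compound: tomorrow's base includes today's interest
        day += timedelta(days=1)
    return rows

rows = accrue(
    Decimal("4000"),
    date(2017, 12, 1),
    date(2017, 12, 4),
    {date(2017, 12, 2): Decimal("2000")},
)
for day, amount, interest in rows:
    print(day, amount, interest)
```

In PostgreSQL the same day-by-day loop would typically be expressed with generate_series for the dates and a recursive CTE for the carried-forward amount, since each row depends on the previous one.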
Your data is too sparse. It doesn't make any sense to need to write this query, because over time the query will get significantly more complicated. What happens when interest rates change over time?
The table itself (or a secondary table, depending on how you want to structure it) could have a running balance you add every time a deposit / withdrawal is made. (I suggest this table be add-only) Otherwise you're making both the calculation and accounting far harder on yourself than it should be. Even with the way you've presented the problem here, there's not enough information to do the calculation. (interest rate is missing) When that's the case, your stored procedure is going to be too complicated. Complicated means bugs, and people get irritated about bugs when you're talking about their money.

SQL SUM expression and Lock

I have a problem finding the right SQL solution.
Current situation:
My database contains a table with bank transactions (credit and debit).
Credit transactions are stored as positive amounts (+), and
debit transactions as negative amounts (-).
The application that uses the DB is a multiuser webapp, so the Transactions table contains many rows, which reference different users.
Some webapp actions need to check the current balance of the logged-in user, using the Transactions table, and save a debit transaction (the action price).
I am thinking about the architecture of this mechanism and have some questions:
Is it a good idea to calculate the balance as a SUM of the Transactions credits and debits each time the user requests it? I know it may be inefficient for the DB. Should I save a snapshot somewhere?
How to ensure data consistency when one user checks the "balance" as a SUM of credit/debit transactions, and another user at the same time saves a debit transaction (because he/she was faster)? I am thinking about a pessimistic lock, but what should I lock? I know that locking with an aggregate (SUM) may be impossible in PostgreSQL (the database I use).
Sorry for my English, I hope my problem is understandable. :)
I would consider EITHER:
Storing a balance on the account record, along with the date for which the balance is accurate.
Getting the current balance is a matter of reading the account balance, and then including any transactions since that date.
You can have a scheduled job that recalculates and timestamps that balance at an hour past midnight.
OR (and this is my preferred solution):
Every time a transaction or batch of transactions is loaded, lock the relevant account records and update them with the values from the insert as part of the same transaction.
This has the advantage of serialising access to the account, which can then help with determining whether a transaction can go ahead or not because of decisions based on the balance calculation.
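A minimal sketch of that preferred option, with assumed table and column names (`accounts`, `transactions`, `balance`) and a hypothetical `debit` helper. In PostgreSQL the account row would usually be locked with SELECT ... FOR UPDATE inside the transaction; SQLite, used here so the example runs on its own, has no row locks, so the sketch folds the balance check into a single conditional UPDATE, which is atomic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one balance per account plus the transaction history.
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    account_id INTEGER REFERENCES accounts(id),
    amount INTEGER NOT NULL);
INSERT INTO accounts (id, balance) VALUES (1, 100);
""")

def debit(conn, account_id, price):
    """Record a debit only if the stored balance covers it; return True on success."""
    with conn:  # one database transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (price, account_id, price),
        )
        if cur.rowcount == 0:
            return False  # insufficient funds (or unknown account): nothing written
        conn.execute(
            "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
            (account_id, -price),
        )
        return True

first = debit(conn, 1, 60)   # succeeds, balance drops to 40
second = debit(conn, 1, 60)  # refused, balance would go negative
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
```

The balance update and the transaction insert commit together, so the stored balance never disagrees with the transaction history.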
If you want to avoid storing the balance on the user account, an approach that could perform better, and that I would experiment with, is:
Each transaction would be related to only one account.
Each transaction would have the account balance after that transaction.
Therefore, the last transaction for that account would have the current balance.
Ex.:
TransactionId | AccountId | Datetime | Amount | Balance
1 | 1 | 7/11/16 | 0 | 0
2 | 1 | 7/11/16 | 500 | 500
3 | 1 | 7/11/16 | -20 | 480
4 | 1 | 8/11/16 | 50 | 530
5 | 1 | 8/11/16 | -200 | 330
This way you would be able to get the account balance (last transaction with that accountId) and you would be able to provide a better view into the balance change over time.
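As a rough sketch of this layout (column names adapted from the example table, the Datetime column left out for brevity, and the helper names `current_balance` and `add_transaction` invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    amount INTEGER NOT NULL,
    balance INTEGER NOT NULL)
""")

def current_balance(conn, account_id):
    """The balance stored on the account's most recent transaction (0 if none)."""
    row = conn.execute(
        "SELECT balance FROM transactions WHERE account_id = ? "
        "ORDER BY transaction_id DESC LIMIT 1",
        (account_id,),
    ).fetchone()
    return row[0] if row else 0

def add_transaction(conn, account_id, amount):
    """Append a transaction carrying the running balance after it."""
    with conn:
        balance = current_balance(conn, account_id) + amount
        conn.execute(
            "INSERT INTO transactions (account_id, amount, balance) VALUES (?, ?, ?)",
            (account_id, amount, balance),
        )
    return balance

# Replay the amounts from the example table for account 1.
for amount in (0, 500, -20, 50, -200):
    add_transaction(conn, 1, amount)

final_balance = current_balance(conn, 1)  # 330, as in the last row of the table
```

With several concurrent writers, the read-then-insert step still has to be serialised per account (for example by locking the account row first), otherwise two sessions could both build on the same previous balance.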

Run a query to check consistency in SQL Server

I need some help with a SQL query and logic in general. (Using MSSQL Server)
I need to check the consistency of payments at certain retailers over a period of three months.
So I've got a table with all my transactions and the following columns:
TransactionID , AccountNumber , Retailer, Date .... (few other irrelevant ones)
Now one AccountNumber could have many TransactionIDs. (One account could decide to make several payments during one month.)
I have 4 unique retailers' ids, let's call them (101,102,103,104)
Now for consistency I want to get the following data:
The count of transactions where there was only one payment per account for the month at each retailer.
So I'd have:
| # Payments For Month | Retailer | Number of Transactions
| 1 Payment | 101 | 5000
...
But I also want to see how many transactions there were from accounts that made payments at multiple retailers
So I'd want something like:
| 2 Payments | 102 & 104 | 20
Which would mean that an account made 20 payments at retailer 102 & 104.
I care less about the number of accounts than about the number of transactions.
I also want it broken down by month, but I've decided to do a separate query for each month.
I've imported the data into a local DB on my personal laptop so I could go crazy, so I'll be able to try any method.
The goal of this query is to check the consistency of payments by people (accounts) at certain retailers. How many transactions do they loyally make at one retailer every month, how many transactions are there where they've gone to two retailers? or three? or all four?

Designing a scalable points leaderboard system using SQL Server

I'm looking for suggestions for scaling a points leaderboard system. I already have a working version using a very normalized strategy. This first version was essentially a table which looked something like this.
UserPoints - PK: (UserId,Date)
+------------+--------+---------------------+
| UserId | Points | Date |
+------------+--------+---------------------+
| 1 | 10 | 2011-03-17 07:16:36 |
| 2 | 35 | 2011-03-17 08:09:26 |
| 3 | 40 | 2011-03-17 08:05:36 |
| 1 | 65 | 2011-03-17 09:01:37 |
| 2 | 16 | 2011-03-17 10:12:35 |
| 3 | 64 | 2011-03-17 12:51:33 |
| 1 | 300 | 2011-03-17 12:19:21 |
| 2 | 1200 | 2011-03-17 13:24:13 |
| 3 | 510 | 2011-03-17 17:29:32 |
+------------+--------+---------------------+
I then have a stored procedure which basically does a GROUP BY UserId and sums the Points. I can also pass @StartDate and @EndDate parameters to create a leaderboard for a specific time period, for example time windows for Top Users for the Day / Week / Month / Lifetime.
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I'm working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I'm starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
Unfortunately I don't have access to 'Jobs' since I'm using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am - Friday 6pm every week. But that's down the road and why I'm trying to not get too fancy with the aggregation. I'm willing to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can't store them denormalized because I need these windows to be TimeZone convertible. The system is multi-tenant and therefore all data is stored as UTC. The problem is that a week starts at different hours for different customers. Aggregating the sums together will cause some points to fall into the wrong buckets.
Here are a few thoughts:
Sticking with SQL Azure: you can have another table, PointsTotals. Every time you add a row to your UserPoints table, also increment the TotalPoints value for a given UserId in PointsTotals (or insert a new row if they don't have a row to increment). Now you always have totals computed for each UserId.
Going with Azure Table Storage: Create a UserPoints table, with Partition Key being userId. This keeps all of a user's points rows together, where you'd easily be able to sum them. And... you can borrow the idea from suggestion #1, creating a separate PointsTotals table, with PartitionKey being UserId and RowKey probably being the total points.
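The first suggestion (a PointsTotals table kept in step with UserPoints) could look roughly like this. It is only a sketch: SQLite's INSERT ... ON CONFLICT upsert (available from SQLite 3.24) is used so the example runs on its own, whereas on SQL Azure the same "increment or insert" step would more likely be a MERGE or an UPDATE followed by a conditional INSERT. The `add_points` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserPoints (
    UserId INTEGER, Points INTEGER, Date TEXT,
    PRIMARY KEY (UserId, Date));
CREATE TABLE PointsTotals (
    UserId INTEGER PRIMARY KEY, TotalPoints INTEGER NOT NULL);
""")

def add_points(conn, user_id, points, when):
    """Insert the detail row and bump the user's running total in one transaction."""
    with conn:
        conn.execute(
            "INSERT INTO UserPoints (UserId, Points, Date) VALUES (?, ?, ?)",
            (user_id, points, when),
        )
        # Increment the existing total, or create the row on the user's first points.
        conn.execute(
            """
            INSERT INTO PointsTotals (UserId, TotalPoints) VALUES (?, ?)
            ON CONFLICT (UserId) DO UPDATE
            SET TotalPoints = TotalPoints + excluded.TotalPoints
            """,
            (user_id, points),
        )

# The sample rows from the question.
sample = [
    (1, 10, "2011-03-17 07:16:36"), (2, 35, "2011-03-17 08:09:26"),
    (3, 40, "2011-03-17 08:05:36"), (1, 65, "2011-03-17 09:01:37"),
    (2, 16, "2011-03-17 10:12:35"), (3, 64, "2011-03-17 12:51:33"),
    (1, 300, "2011-03-17 12:19:21"), (2, 1200, "2011-03-17 13:24:13"),
    (3, 510, "2011-03-17 17:29:32"),
]
for user_id, points, when in sample:
    add_points(conn, user_id, points, when)

leaderboard = conn.execute(
    "SELECT UserId, TotalPoints FROM PointsTotals ORDER BY TotalPoints DESC"
).fetchall()
```

This only covers the lifetime leaderboard; time-windowed boards would still need the detail rows or one totals row per user per window.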
If it were my problem, I'd ignore the timestamps and store the user and points totals by day
I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer's current TimeZone setting. I realized an extra benefit of this is that I can 'purge' old leaderboard round data after a few months without affecting the lifetime total of points.