How to calculate performance growth across varying timeframes? - sql

I have a problem that I'm wondering if you could help me with, please.
I have a table of customers and their transactions which looks a bit like this:
account_number|till_transaction_date|sales
--------------|---------------------|-----
A |2021-11-01 |10
A |2021-12-05 |25
A |2022-01-10 |40
B |2021-09-05 |15
B |2021-10-15 |10
B |2021-11-20 |20
As you can see, customer A's sales are growing at a faster rate than customer B's.
I also have a table which shows the date each customer joined a certain club (all customers from the previous table are also in this one):
account_number|join_date
--------------|----------
A |2022-01-05
B |2021-12-01
What I want to do is to quantify the sales growth of each customer in the period before they joined. So in my example, we would be looking at customer A's first two transactions, but all three of customer B's.
If it helps, the end goal of all of this is to identify which customers were showing the highest growth in performance before they joined the club.
Does anyone know what would be the best way to go about this please? Any advice would be appreciated!
EDIT: Just to clarify, the best plan I could come up with is to select from the first table all transactions before each customer's join_date, and use this to get a rolling X-day sales total per customer.
SUM(SALES) OVER (PARTITION BY ACCOUNT_NUMBER ORDER BY DAYS(TILL_TRANSACTION_DATE) RANGE BETWEEN X PRECEDING AND CURRENT ROW)
So if I set it to a 35-day total the results would look like this:
account_number|till_transaction_date|sales|35_day_total
--------------|---------------------|-----|------------
A |2021-11-01 |10 |10
A |2021-12-05 |25 |35
B |2021-09-05 |15 |15
B |2021-10-15 |10 |10
B |2021-11-20 |20 |20
I could then potentially calculate for each customer the difference between the min and max of their totals.
The problem is that this isn't a reliable method of measuring how much a customer's performance is improving - e.g. if a customer makes an expensive one-off purchase at the right time it could skew their result massively.
Theoretically, I could get the difference between each X-day total and the previous one, then take the mean of the differences, and this would provide a more accurate way of determining sales growth, but I'm not sure how to do this.
I'm not sure how to improve this plan, or even if I'm barking up the wrong tree entirely and there's a better methodology I've just not thought of.
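One alternative that is less sensitive to a single one-off purchase than a min/max of rolling totals is to fit a least-squares slope of sales against date, per customer, over the pre-join transactions only. This is a minimal sketch using the question's sample data; the club table name (club_members) is my own, and SQLite via Python stands in for whatever engine is actually in use:

```python
import sqlite3

# Sample data from the question; SQLite stands in for the real engine.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE transactions(account_number TEXT, till_transaction_date TEXT, sales REAL);
INSERT INTO transactions VALUES
  ('A', '2021-11-01', 10), ('A', '2021-12-05', 25), ('A', '2022-01-10', 40),
  ('B', '2021-09-05', 15), ('B', '2021-10-15', 10), ('B', '2021-11-20', 20);
CREATE TABLE club_members(account_number TEXT, join_date TEXT);
INSERT INTO club_members VALUES ('A', '2022-01-05'), ('B', '2021-12-01');
""")

# Least-squares slope of sales vs. day number, per customer, pre-join only.
# The 2021-01-01 epoch is arbitrary; the slope does not depend on it.
# A customer with a single pre-join transaction gets NULL (zero denominator).
query = """
WITH pre_join AS (
  SELECT t.account_number,
         julianday(t.till_transaction_date) - julianday('2021-01-01') AS x,
         t.sales AS y
  FROM transactions t
  JOIN club_members c ON c.account_number = t.account_number
  WHERE t.till_transaction_date < c.join_date
)
SELECT account_number,
       (COUNT(*) * SUM(x * y) - SUM(x) * SUM(y))
         / (COUNT(*) * SUM(x * x) - SUM(x) * SUM(x)) AS sales_growth_per_day
FROM pre_join
GROUP BY account_number
ORDER BY sales_growth_per_day DESC;
"""
slopes = {acct: slope for acct, slope in con.execute(query)}
print(slopes)  # customer A's pre-join slope comes out higher than customer B's
```

On this data, A's pre-join sales rise about 0.44/day versus roughly 0.06/day for B, which matches the intuition in the question. PostgreSQL and some other engines have regr_slope() built in, which would replace the manual formula.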

Related

Calculating interest using SQL

I am using PostgreSQL, and have a table for a billing cycle and another for payments made in a billing cycle.
I am trying to figure out how to calculate interest based on how much amount was left after each billing cycle's last payment date. Problem is that every time a repayment is made, the interest has to be calculated on the amount remaining after that.
My thoughts on building this query are as follows: build rows for all dates from the last pay date of the billing cycle to today. Using partitioning, get the remaining amount for the first date. For the second date, take the amount from the previous row plus its interest, then calculate interest on the result, and so on.
Unfortunately I am stuck just at the thought and can't figure out how to make this into a query!
Here's some sample data to make things easier to understand.
Billing Cycles:
id | ends_at
-----+---------------------
1 | 2017-11-30
2 | 2017-11-30
Payments:
amount | billing_cycle_id | type | created_at
-----------+------------------+---------+----------------------------
6000.0000 | 1 | payment | 2017-11-15 18:40:22.151713
2000.0000 | 1 |repayment| 2017-11-19 11:45:15.6167
2000.0000 | 1 |repayment| 2017-12-02 11:46:40.757897
So, as you can see, the user made a repayment on the 19th, which means the amount due for interest after the end date (30th Nov 2017) is only 4000. From the 30th to the 2nd, interest is calculated daily on 4000. However, from the 2nd onward, interest needs to be calculated on 2000 only.
Interest Calculations(Today being 2017-12-04):
date | amount | interest
------------+---------+----------
2017-12-01 | 4000 | 100 // First day of pending dues.
2017-12-02 | 2100 | 52.5 // Second day of pending dues.
2017-12-03 | 2152.5 | 53.8125 // Third day of pending dues.
2017-12-04 |2206.3125| // Fourth's day interest will be added tomorrow
Your data is too sparse. It doesn't make any sense to need to write this query, because over time the query will get significantly more complicated. What happens when interest rates change over time?
The table itself (or a secondary table, depending on how you want to structure it) could have a running balance you add every time a deposit / withdrawal is made. (I suggest this table be add-only) Otherwise you're making both the calculation and accounting far harder on yourself than it should be. Even with the way you've presented the problem here, there's not enough information to do the calculation. (interest rate is missing) When that's the case, your stored procedure is going to be too complicated. Complicated means bugs, and people get irritated about bugs when you're talking about their money.
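If you do still need to compute it in SQL, the day-by-day carry-forward described in the question maps more naturally onto a recursive CTE than onto window partitioning. This is a sketch against the sample data only: it assumes the 2.5% daily rate implied by the numbers (100 / 4000), hardcodes the opening balance and date range, and runs SQLite through Python for the sake of a runnable example (PostgreSQL's WITH RECURSIVE works the same way, but you'd write d + interval '1 day' instead of DATE(d, '+1 day')):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments(amount REAL, billing_cycle_id INT, type TEXT, created_at TEXT);
INSERT INTO payments VALUES
  (6000, 1, 'payment',   '2017-11-15 18:40:22'),
  (2000, 1, 'repayment', '2017-11-19 11:45:15'),
  (2000, 1, 'repayment', '2017-12-02 11:46:40');
""")

# Each recursive step carries yesterday's balance forward, adds yesterday's
# interest, and subtracts any repayment made on the new row's date.
query = """
WITH RECURSIVE daily(d, amount) AS (
  SELECT '2017-12-01', 4000.0            -- balance left after the cycle ends
  UNION ALL
  SELECT DATE(d, '+1 day'),
         amount + amount * 0.025         -- previous balance plus its interest
           - COALESCE((SELECT SUM(p.amount) FROM payments p
                       WHERE p.type = 'repayment'
                         AND DATE(p.created_at) = DATE(d, '+1 day')), 0)
  FROM daily
  WHERE d < '2017-12-04'
)
SELECT d, amount, amount * 0.025 AS interest FROM daily;
"""
rows = list(con.execute(query))
for r in rows:
    print(r)
```

This reproduces the table in the question: 4000 / 100 on the 1st, 2100 / 52.5 on the 2nd, 2152.5 / 53.8125 on the 3rd, 2206.3125 on the 4th. It doesn't address the answer's (valid) point about changing interest rates; a running-balance table remains the sturdier design.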

HR Cube in SSAS

I have to design a cube for student attendance. We have four statuses (Present, Absent, Late, On Vacation). The cube has to tell me the number of students who are not present over a span of time (day, month, year, etc.) and that number as a percentage of the total.
I built a fact table like this:
City ID | Class ID | Student ID | Attendance Date | Attendance State | Total Students number
--------------------------------------------------------------------------------------------
1 | 1 | 1 | 2016-01-01 | ABSENT | 20
But in my SSRS project I couldn't use this to get the correct numbers. I have to filter by date, city and attendance status.
For example, I must know that on date X there are 12 students not present, which corresponds to 11% of the total.
Any suggestions for a good structure to achieve this?
I assume this is homework.
Your fact table is wrong.
Don't store aggregated data (Total Students) in the fact as it can make calculations difficult.
Don't store text values like 'Absent' in the fact table. Attributes belong in the dimension.
Reading homework for you:
Difference between a Fact and Dimension and how they work together
What is the grain of a Fact and how does that affect aggregations and calculations.
There is a wealth of information at the Kimball Group's pages. Start with the lower-numbered tips, as they get more advanced as you move on.

Date role playing dimension. Compare to main date dimension

I'm using Mondrian, Pentaho and Saiku.
Take, for example, a simple warehouse for orders, simplified to show only the interesting parts.
There is an order fact table with columns: date, customer id and money amount.
|date |customer id|amount|
|2015-04-01| 1| 50|
|2015-04-02| 1| 20|
|2015-04-02| 2| 20|
There is a dimension table for customers with: customer id, name and first order date:
|customer id|name |first order date|
| 1|Joe |2015-04-01 |
| 2|Charles|2015-04-02 |
The date for the first order of the customers is a role playing dimension.
I want to be able to have these two measures in a Mondrian cube:
Grouping by date and by first order date, give me the money amount. This one is OK with this data model
For every month/week, give me the money spent by customers whose first order ever was in this month/week.
Since this is mainly a modelling or schema issue, I thought it might be easier without writing out the schema here, but I can add it to the question if necessary.
The second measure is hard because it must compare the first-order date against the main date dimension. I'm trying to solve this with calculated measures and MemberToStr, but I cannot find the way. Any ideas about how to proceed?
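I can't speak to the Mondrian schema or MDX without seeing them, but it may help to pin down the relational query the second measure has to reduce to. Outside Mondrian it is just a join that keeps only orders whose date falls in the same month as the customer's first order. A sketch over the question's sample data, with assumed table names (orders, dim_customer) and SQLite via Python for the sake of something runnable:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders(order_date TEXT, customer_id INT, amount REAL);
INSERT INTO orders VALUES
  ('2015-04-01', 1, 50), ('2015-04-02', 1, 20), ('2015-04-02', 2, 20);
CREATE TABLE dim_customer(customer_id INT, name TEXT, first_order_date TEXT);
INSERT INTO dim_customer VALUES (1, 'Joe', '2015-04-01'), (2, 'Charles', '2015-04-02');
""")

# Per month: money spent by customers whose first order ever was in that month.
# Swap '%Y-%m' for '%Y-%W' to get the weekly version of the measure.
rows = list(con.execute("""
SELECT strftime('%Y-%m', o.order_date) AS month,
       SUM(o.amount)                   AS new_customer_amount
FROM orders o
JOIN dim_customer c ON c.customer_id = o.customer_id
WHERE strftime('%Y-%m', o.order_date) = strftime('%Y-%m', c.first_order_date)
GROUP BY month
"""))
print(rows)
```

On this data every order lands in April 2015 and both customers first ordered in April, so the whole 90 counts as new-customer revenue. Whatever the cube-side answer turns out to be, it has to produce this result.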

Weekly Sales Total in Fact Table using SSIS

I have 3 sales data tables that I have converted into 4 Dimension tables and 1 Fact table.
I have everything populated properly, and want "daily-sales" and "weekly-sales" columns in my Fact table.
My original 3 tables are as follows:
Order (order# (PK), product-id (FK), order-date, total-order, customer# (FK))
domains are numeric, numeric, Date, Currency,numeric
Product (product-id (PK), prod-name, unit-cost, manufacture-name)
domains are numeric, nvarchar, money, nvarchar.
Customer (customer# (PK), cust-name, address, telephone#)
domains are Number, nvarchar, nvarchar, nvarchar
The Star Schema of the data warehouse is linked here:
So I have only 10 records in each table(tiny!), and am just testing concept now. "daily-order" in the Fact table is easily translated from "total-order" in the Order table. My difficulty is getting the weekly totals. I used a derived column and expression "DATEPART("wk",[order-date])" to create my week column in the time dimension.
So I guess my question is, how do I get the weekly-sales from this point? My first guess is to use another derived column after the end of my Lookup sequences used to load my Fact table. My second guess is..... ask for help on stackoverflow. Any help would be appreciated, I will supply any info needed! Thanks in advance!
For the record I tried creating the derived column as I previously described, but could not figure out syntax that would be accepted....
EDIT: I want to list the weekly-sales column at the product level, and as I used DATEPART to derive the week column I do not need the last 7 days, just a total for each week. My sales data stretches across only 2 consecutive weeks, so my fact table should list one total 7 times and a second total 3 times. I can derive this either from the original Order table or from the tables I have in my DW environment; preferably the latter. One note: I am mainly restricted to SSIS and its tools (there is an "Execute SQL Task" tool, so I can still use a query).
Within your data flow for Fact-sales, it would be challenging to fill the weekly-order amount as you are loading daily data. You would need to use a Multicast Transformation and add an Aggregate to it to sum your value by product, customer, order and week. That's easy. The problem is how you are going to tie that data back to the actual detail-level information in the data flow. It can be done (Merge Join), but don't.
As ElectricLlama is pointing out in the comments, what you are about to do is mix your level of granularity. If the weekly amount wasn't stored on the table, then you can add all the amounts, sliced by whatever dimensions you want. Once you put that weekly level data at the same level, your users will only be able to summarize the daily columns. The weekly values would not be additive, probably.
I say probably because you could fake it by accumulating the weekly values but only storing them on an arbitrary record: the first day of the week, the third, the last, whatever. That solves the additive problem, but now users have to remember which day of the week holds the summarized value for the week. Plus, you'd be generating a fake order record for that summary row unless an order happened to be placed on the exact day of the week you picked to solve the additive-value problem.
The right, unasked-for, solution is to remove the weekly-order column from the fact table and beef up your date dimension so people can derive the weekly amounts. If you aren't bringing the data out of the warehouse and into an analytics engine, you could precompute the weekly aggregates and materialize them to a reporting table/view.
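That last option could look something like the following: keep the fact at the daily grain and expose the weekly totals through a view. This is a sketch only, run in SQLite via Python for illustration; strftime('%Y-%W') stands in for however your date dimension actually labels weeks, and the table/column names are mine:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Daily-grain fact, using the source data from the scenario below.
CREATE TABLE fact_sales(salesdate TEXT, customer INT, product INT,
                        ordernum INT, amount REAL);
INSERT INTO fact_sales VALUES
  ('2013-01-01', 1, 20, 77, 30),
  ('2013-01-01', 1, 30, 77, 10),
  ('2013-01-02', 1, 20, 88, 1),
  ('2013-01-02', 1, 20, 99, 19);

-- Weekly totals are derived, never stored on the daily rows, so they
-- remain additive across whatever slice the user picks.
CREATE VIEW weekly_sales AS
SELECT strftime('%Y-%W', salesdate) AS sales_week,
       customer, product,
       SUM(amount) AS weekly_amount
FROM fact_sales
GROUP BY sales_week, customer, product;
""")
rows = list(con.execute("SELECT * FROM weekly_sales ORDER BY product"))
print(rows)
```

This yields 50 for product 20 and 10 for product 30 in that week, the same numbers as the weekly-order column in the scenario below, but without ever mixing grains inside the fact table.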
Source data
Let's look at a simple scenario. A customer places 3 orders, two on the same day for the same product (maybe one was expedited shipping, the other standard).
SalesDate |Customer|Product|Order|Amount
2013-01-01|1 |20 |77 |30
2013-01-01|1 |30 |77 |10
2013-01-02|1 |20 |88 |1
2013-01-02|1 |20 |99 |19
Fact table
Your user is going to point their tool of choice at the table and you'll be getting a call because the numbers aren't right. Or you'll get a call later because they restocked with 150 units of product 20 when they only needed to reorder 50. I wouldn't want to field either of those calls.
SalesDate |Customer|Product|Order|daily-order|weekly-order
2013-01-01|1 |20 |77 |30 |50
2013-01-01|1 |30 |77 |10 |10
2013-01-02|1 |20 |88 |1 |50
2013-01-02|1 |20 |99 |19 |50

Is there a set based solution for this problem?

We have a table set up as follows:
|ID|EmployeeID|Date |Category |Hours|
|1 |1 |1/1/2010 |Vacation Earned|2.0 |
|2 |2 |2/12/2010|Vacation Earned|3.0 |
|3 |1 |2/4/2010 |Vacation Used |1.0 |
|4 |2 |5/18/2010|Vacation Earned|2.0 |
|5 |2 |7/23/2010|Vacation Used |4.0 |
The business rules are:
Vacation balance is calculated by vacation earned minus vacation used.
Vacation used is always applied against the oldest vacation earned amount first.
We need to return the rows for Vacation Earned that have not been offset by vacation used. If vacation used has only offset part of a vacation earned record, we need to return that record showing the difference. For example, using the above table, the result set would look like:
|ID|EmployeeID|Date |Category |Hours|
|1 |1 |1/1/2010 |Vacation Earned|1.0 |
|4 |2 |5/18/2010|Vacation Earned|1.0 |
Note that record 2 was eliminated because it was completely offset by used time, but records 1 and 4 were only partially used, so they were calculated and returned as such.
The only way we have thought of to do this is to get all of the vacation earned records in a temporary table. Then, get the total vacation used and loop through the temporary table, deleting the oldest record and subtracting that value from the total vacation used until the total vacation used is zero. We could clean it up for when the remaining vacation used is only part of the oldest vacation earned record. This would leave us with just the outstanding vacation earned records.
This works, but it is very inefficient and performs poorly. Also, the performance will just degrade over time as more and more records are added.
Are there any suggestions for a better solution, preferable set based? If not, we'll just have to go with this.
EDIT: This is a vendor database. We cannot modify the table structure in any way.
The following should do it..
(but as others mention, the best solution would be to adjust remaining vacations as they are spent..)
select
    id, employeeid, date, category,
    case
        when earned_so_far + hours - total_spent > hours then hours
        else earned_so_far + hours - total_spent
    end as hours
from
(
    select
        id, employeeid, date, category, hours,
        (
            select isnull(sum(hours), 0)
            from vacations
            where category = 'Vacation Earned'
              and date < v.date
              and employeeid = v.employeeid
        ) as earned_so_far,
        (
            select isnull(sum(hours), 0)
            from vacations
            where category = 'Vacation Used'
              and employeeid = v.employeeid
        ) as total_spent
    from vacations v
    where category = 'Vacation Earned'
) earned
where earned_so_far + hours > total_spent
The logic is
calculate for each earned row, the hours earned so far
calculate the total hours used for this user
select the record if earned_so_far + hours of this record - total_spent > 0
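As a sanity check, here is that query run against the question's sample rows, using SQLite through Python (ISO dates so that string comparison orders correctly, and COALESCE in place of T-SQL's isnull):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vacations(id INT, employeeid INT, date TEXT, category TEXT, hours REAL);
INSERT INTO vacations VALUES
  (1, 1, '2010-01-01', 'Vacation Earned', 2.0),
  (2, 2, '2010-02-12', 'Vacation Earned', 3.0),
  (3, 1, '2010-02-04', 'Vacation Used',   1.0),
  (4, 2, '2010-05-18', 'Vacation Earned', 2.0),
  (5, 2, '2010-07-23', 'Vacation Used',   4.0);
""")

# Same structure as the answer's query: for each earned row, hours earned
# before it (earned_so_far) and total hours used (total_spent) decide how
# much of the row survives FIFO offsetting.
rows = list(con.execute("""
SELECT id, employeeid, date, category,
       CASE WHEN earned_so_far + hours - total_spent > hours
            THEN hours
            ELSE earned_so_far + hours - total_spent
       END AS hours
FROM (
  SELECT id, employeeid, date, category, hours,
         (SELECT COALESCE(SUM(hours), 0) FROM vacations
          WHERE category = 'Vacation Earned'
            AND date < v.date
            AND employeeid = v.employeeid) AS earned_so_far,
         (SELECT COALESCE(SUM(hours), 0) FROM vacations
          WHERE category = 'Vacation Used'
            AND employeeid = v.employeeid) AS total_spent
  FROM vacations v
  WHERE category = 'Vacation Earned'
) earned
WHERE earned_so_far + hours > total_spent
ORDER BY id
"""))
print(rows)
```

This returns exactly the result set the question asks for: record 1 reduced to 1.0 hour, record 2 eliminated, and record 4 reduced to 1.0 hour. Note the correlated subqueries scan the table once per earned row, so on a large vendor table an index on (employeeid, category, date) would matter.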
In thinking about the problem, it occurred to me that the only reason you need to care about when vacation is earned is if it expires. And if that's the case, the simplest solution is to add 'vacation expired' records to the table, such that the amount of vacation remaining for an employee is always just sum(vacation earned) - (sum(vacation expired) + sum(vacation used)). You can even show the exact records you want by using the last vacation-expired record as a starting point for the query.
But I'm guessing that's not an option. To address the problem as asked, keep in mind that whenever you find yourself using a temporary table, try putting that data into a CTE (common table expression) instead. Unfortunately I have a meeting right now and so I don't have time to write the query (maybe later, it sounds like fun), but this should get you started.
I find your whole result set confusing and inaccurate, and I can see employees saying, "No, I earned 2 hours on Jan 25th, not 1." It is not true that they earned 1 hour on that date; the 2 hours were only partially offset, and you will have no end of problems if you choose to display it this way. I'd look at a different way to present the information. Typically you either present a list of all leave actions (earned, expired and used) with a total at the bottom, or you present a summary of hours available for use and hours used.
In over 30 years in the workforce, having been under many different timekeeping systems (as well as having studied even more when I was a management analyst), I have never seen anyone want to display timekeeping information this way. I'm thinking there is a reason. If this is a requirement, I'd suggest pushing back on it and explaining how confusing it will be to read the data this way, as well as how difficult it is to build a well-performing solution. I would not accept this as a requirement without trying to convince the client that it is a poor idea.
As time passes and records are added, performance will get worse and worse unless you do something about it, such as:
Purge old rows once they're "cancelled out" (e.g. a vacation-earned row that has been fully offset by vacation-used or expiry rows, along with the vacation-used rows that marked it as "expended")
Add a column that flags whether a row has been "cancelled out", and incorporate this column into your indexes
Tracking how the data changes in this fashion seems an argument to modify your table structures (have several tables, not just one), but that's outside the scope of your current problem.
As for the query itself, I'd build two aggregates, do some subtraction, make that a subquery, then join it on some clever use of one of the ranking functions. Smells like a correlated subquery in there somewhere, too. I may try and hash this out later (I'm short on time), but I bet someone beats me to it.
I'd suggest modifying the table to keep track of Balance in its own column. That way, you only need to grab the most recent record to know where the employee stands.
That way, you can satisfy the simple case ("How much vacation time do I have"), while still being able to do the awkward rollup you're looking for in your "Which bits of vacation time don't line up with other bits" report, which I'd hope is something you don't need very often.