SQL Retention based on cohort and period - sql

I have already seen all the related posts, but none have been able to help me.
I have the following fields:
Where:
SOLD_AT is the date of each transaction
CUSTOMER_ID is a unique ID for each customer
COHORT is the date (Year-Month) of the first purchase of the user in that row
ORDER_MONTH is the date (Year-Month) of the purchase in that row
PERIOD_NUMBER is the difference in months between COHORT and ORDER_MONTH
N_CUSTOMERS is the number of customers in each PERIOD_NUMBER in each COHORT
In case it is useful, I have the queries I used to obtain these fields, but I think including them would only add noise, since the definition of each field is more useful.
What I need to do, and have not been able to, is add an additional field with the retention for each period number of each cohort (not a pivot table with the period numbers of each cohort as columns).
Specifically, I need the retention of each period number to be the number of users in that period divided by the number of users in the previous period, like this:
To do this in Python, I simply do:
cohort_pivot = df_cohort.pivot_table(index='cohort',
                                     columns='period_number',
                                     values='n_customers')
cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix1 = cohort_pivot.divide(cohort_size, axis=0)
and I can then unpivot and take out the retention for each period of each cohort to create an additional column with this value.
One of the answers I tried, because it was the closest thing I found, was the accepted answer in this post. However, I cannot know in advance how many period numbers or historical months I will have, since the code has to be dynamic for any company that is loaded. (For example, in DBT, which is the tool I'm using, you can create dynamic pivot tables instead of static ones that require this information up front, but as I say, I need to create the field, not the pivot table.)
Any ideas will be more than welcome, thank you very much
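One possible way to get this field without pivoting is a window function partitioned by cohort, which does not need to know how many periods exist. The following is only a sketch under assumptions: it runs the SQL through Python's built-in sqlite3 module so it can be tried anywhere, and the table name cohort_counts and the sample numbers are made up for illustration. It computes both readings of "retention": relative to period 0 (what the pandas snippet does) and relative to the previous period (what the text describes).

```python
import sqlite3

# Hypothetical table holding one row per (cohort, period_number).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort_counts (cohort TEXT, period_number INTEGER, n_customers INTEGER)"
)
conn.executemany(
    "INSERT INTO cohort_counts VALUES (?, ?, ?)",
    [
        ("2023-01", 0, 100), ("2023-01", 1, 50), ("2023-01", 2, 25),
        ("2023-02", 0, 80), ("2023-02", 1, 40),
    ],
)

RETENTION_SQL = """
SELECT cohort,
       period_number,
       n_customers,
       -- share of the cohort's period-0 size (same as the pandas divide)
       n_customers * 1.0 / FIRST_VALUE(n_customers)
           OVER (PARTITION BY cohort ORDER BY period_number) AS retention_vs_first,
       -- share of the previous period; NULL for period 0
       n_customers * 1.0 / LAG(n_customers)
           OVER (PARTITION BY cohort ORDER BY period_number) AS retention_vs_previous
FROM cohort_counts
ORDER BY cohort, period_number
"""

retention_rows = conn.execute(RETENTION_SQL).fetchall()
for row in retention_rows:
    print(row)
```

FIRST_VALUE relies on every cohort having a period 0 row, which follows from COHORT being the month of the first purchase. Window functions need SQLite 3.25 or newer; most warehouses that DBT targets support the same two functions, though the exact syntax should be checked for the engine in use.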

Related

Query to find average stock ... with a twist

We are trying to calculate average stock from a movements table in a single SQL statement.
So far we have had no problem with what we thought was a standard approach: since we don’t have daily stock, instead of adding up the daily stock and dividing by the number of days, we simply add up (movement * remaining days):
select sum(quantity*(END_DATE-move_date))/(END_DATE-START_DATE)
from move_table
where move_date<=END_DATE
This is a simplified example, in real life we already take care of the initial stock at the starting date. Let’s say there are no movements prior to start_date.
Quantity sign depends on move type (sale, purchase, inventory, etc).
Of course this is done grouping by product, warehouse, ... but you get the idea.
It works as expected and the calculation is fine.
But (there is always a “but”), our customer doesn’t want to count days on which there is no stock (everything sold out). So he doesn’t like
Sum of (daily_stock) / number_of_days (which is what we calculate using different math)
Instead, he would like
Sum of (daily stock) / number_of_days_in_which_stock_is_not_zero
For sure we can do this in any programming language without much effort, but I was wondering how to do it using plain sql ... and wasn’t able to come up with a solution.
Any suggestion?
Consider creating a new table called something like Stock_EndOfDay_History that has the following columns.
stock#
date
stock_count_eod
This table would get a new row for each stock item at the start of a new day for the prior day. Rows could then be purged from this table once the applicable date value went outside the date window of interest.
To get the "number_of_days_in_which_stock_is_not_zero", use this.
SELECT COUNT(*) AS 'Not_Zero_Stock_Days' FROM Stock_EndOfDay_History
WHERE stock# = <stock#_value>
AND stock_count_eod > 0
AND <date_window_clause>
Other approaches might just add a new column to the existing stock table to maintain a cumulative count of "number_of_days_in_which_stock_is_not_zero". But inevitably, questions will be asked about how the non-zero stock days count was calculated. The new table approach will answer those questions better than the new column approach.
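If a history table is not an option, the same figure can probably be derived from the movements alone with window functions: a running sum gives the stock after each movement, and the gap to the next movement gives the number of days that stock level lasted. The sketch below is a hedged illustration, not a tested production query. It uses Python's sqlite3 module, stores dates as plain day numbers (move_day) to keep the arithmetic obvious, and assumes one product, at most one movement per day, and a first movement on the start day. All sample values are invented.

```python
import sqlite3

START_DAY, END_DAY = 0, 10  # hypothetical reporting window, as day numbers

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE move_table (move_day INTEGER, quantity INTEGER)")
# +10 on day 0, sold out on day 4, +5 on day 6
conn.executemany("INSERT INTO move_table VALUES (?, ?)", [(0, 10), (4, -10), (6, 5)])

AVG_STOCK_SQL = """
WITH running AS (
    SELECT
        -- stock level after this movement
        SUM(quantity) OVER (ORDER BY move_day) AS stock,
        -- days until the next movement (or until the end of the window)
        COALESCE(LEAD(move_day) OVER (ORDER BY move_day), :end_day) - move_day AS days
    FROM move_table
    WHERE move_day < :end_day
)
SELECT
    SUM(stock * days) * 1.0 / (:end_day - :start_day) AS avg_all_days,
    SUM(stock * days) * 1.0
        / NULLIF(SUM(CASE WHEN stock <> 0 THEN days ELSE 0 END), 0) AS avg_nonzero_days
FROM running
"""

avg_all_days, avg_nonzero_days = conn.execute(
    AVG_STOCK_SQL, {"start_day": START_DAY, "end_day": END_DAY}
).fetchone()
print(avg_all_days, avg_nonzero_days)  # 6.0 7.5
```

With this data the stock is 10 for four days, 0 for two days and 5 for four days, so the plain average is 60 / 10 and the average over non-zero days is 60 / 8. In the real query the window functions would also need a PARTITION BY on product, warehouse and so on.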

SQL GROUPING SETS averages with multiple many-to-many dimensions

I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,2,1,2.0
,08/05/2016,0,1,0.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows, the average purchases per user per day is a simple division of the two preceding columns (Total_Purchases / Total_Unique_Users), because each of these rows covers a single date only. For the final three rows, that division does not give the desired result, because the metric has to be averaged over each day in turn to get the overall per-day amount.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!
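One way to read the last metric is as a two-level aggregation: first compute purchases per user for each day within the group, then average those daily ratios. The sketch below illustrates that reading for the per-platform rows and the grand-total row only. It is an assumption-laden example: it uses Python's sqlite3 module (SQLite has no GROUPING SETS, so each grouping is a separate query), and the table name activity and the column name user_id are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (user_id INTEGER, platform TEXT, dt TEXT,"
    " activity_flag INTEGER, total_purchases INTEGER)"
)
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?, ?, ?)",
    [
        (1, "iOS", "05/05/2016", 1, 1), (1, "Android", "05/05/2016", 1, 2),
        (2, "iOS", "05/05/2016", 1, 0), (2, "Android", "05/05/2016", 1, 2),
        (3, "iOS", "05/05/2016", 1, 1), (3, "Android", "06/05/2016", 1, 3),
        (1, "iOS", "06/05/2016", 1, 2), (4, "Android", "06/05/2016", 1, 2),
        (1, "Android", "06/05/2016", 1, 0), (3, "iOS", "07/05/2016", 1, 2),
        (2, "iOS", "08/05/2016", 1, 0),
    ],
)

# Per platform: daily ratio within the platform, then the mean over days.
PER_PLATFORM_SQL = """
SELECT platform, AVG(day_avg)
FROM (
    SELECT platform, dt,
           SUM(total_purchases) * 1.0 / COUNT(DISTINCT user_id) AS day_avg
    FROM activity
    GROUP BY platform, dt
) AS per_day
GROUP BY platform
"""
per_platform = dict(conn.execute(PER_PLATFORM_SQL).fetchall())

# Grand total: daily ratio across all platforms, then the mean over days.
OVERALL_SQL = """
SELECT AVG(day_avg)
FROM (
    SELECT dt, SUM(total_purchases) * 1.0 / COUNT(DISTINCT user_id) AS day_avg
    FROM activity
    GROUP BY dt
) AS per_day
"""
overall = conn.execute(OVERALL_SQL).fetchone()[0]

print({k: round(v, 1) for k, v in per_platform.items()}, round(overall, 1))
```

Rounded to one decimal, this gives 1.8 for Android, 1.2 for iOS and 1.6 overall, matching the last three rows of the desired result. On an engine with GROUPING SETS, the inner per-day query would differ per grouping set, so a UNION ALL of such queries may be the simplest way to combine them.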

Filter PowerPivot based on multiple Date Criteria

I am trying to apply some Time Intelligence functions in my PowerPivot workbook concerning projects and money received for them. I have three relevant tables; Matters, Payments, and a Date Table.
Each matter has a creationDate and a closureDate (from a linked table). Likewise, each payment has a date. I have reporting set up decently, but am now trying to use Time Intelligence to filter this a bit more clearly.
How can I set up a PowerPivot pivot table so that the only Matters shown are those which existed within the selected period? For example, if I select 2014 in a slicer, I don't want to show a matter created in 2015, or one which was closed in 2013. The matter should have been active during the period specified.
Is this possible?
You want to show all the matters EXCEPT those where the CreationDate is after the upper limit of the date range you are looking at or the ClosureDate is before the lower limit of the date range you are looking at.
Assuming you have a data structure like this, where the left-hand table is the Matters and the right-hand one is the Payments:
If you have a calculated field called [Total Payments] that just adds up all the payments in the Payments table, a formula similar to this would work:-
[Payment in Range]:=IF(OR(MIN(Matters[Creation Date])>MAX('Reporting Dates'[Date]),MAX(Matters[Closure Date])<MIN('Reporting Dates'[Date])),BLANK(),[Total Payments])
Here is the result with one month selected in the timeline:
Or with one year selected in the year slicer:
NOTE: in my example, I have used a disconnected date table.
Also, you will see that the Grand Total adds up all the payments because it takes the lowest of all the creation dates and the highest of all the closure dates to determine whether to show a total payment value. If it is important that the Grand Total shows correctly, then an additional measure is required:
[Fixed Totals Payment in Range]:=IF(COUNTROWS(VALUES(Matters[Matter]))=1,[Payment in Range],SUMX(VALUES(Matters[Matter]),[Payment in Range]))
Replace the [Payment in Range] in your pivot table with this new measure and the totals will show correctly, however, this will only work if Matters[Matter] is used as one of the fields in the pivot table.
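The rule at the start of this answer is an interval-overlap test, and the Grand Total problem follows from applying that test to the minimum creation date and maximum closure date of all matters at once. The short Python sketch below only illustrates that logic outside of DAX. The matters, dates and payment totals are hypothetical.

```python
from datetime import date


def is_active(creation, closure, range_start, range_end):
    """True when [creation, closure] overlaps [range_start, range_end]."""
    return not (creation > range_end or closure < range_start)


# Hypothetical matters: (creation date, closure date, total payments)
matters = {
    "A": (date(2013, 3, 1), date(2013, 11, 30), 100),
    "B": (date(2013, 6, 1), date(2014, 2, 28), 200),
    "C": (date(2015, 1, 15), date(2015, 9, 30), 400),
}
range_start, range_end = date(2014, 1, 1), date(2014, 12, 31)

# Evaluating the rule per matter and summing (what the SUMX measure does).
per_matter_total = sum(
    paid
    for creation, closure, paid in matters.values()
    if is_active(creation, closure, range_start, range_end)
)

# Evaluating the rule once over all matters (what the plain measure does
# on the Grand Total row): min creation and max closure span the range.
earliest = min(creation for creation, _, _ in matters.values())
latest = max(closure for _, closure, _ in matters.values())
naive_total = (
    sum(paid for _, _, paid in matters.values())
    if is_active(earliest, latest, range_start, range_end)
    else 0
)

print(per_matter_total, naive_total)  # 200 700
```

Only matter B was active in 2014, so the per-matter sum is 200, while the single test over all matters lets every payment through.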
Use filters and the CALCULATE function.
So, if you're summing payments, it would look like this:
Payments 2014:= CALCULATE( SUM([Payments]), DateTable[Year]=2014)
The SUM function takes the entirety of payments, and the filter argument restricts it to payments within 2014, based on the data connected to your date table.

minimum date among several columns

I'm trying to get a count of the number of policies issued per month. This is close to returning the correct information:
SELECT count(policy_no), left(issue_date,6)
FROM table_a
WHERE indicator = 'fln'
GROUP BY left(issue_date,6)
The indicator narrows it down to the types of policies I want. The only problem is that there is an entry with an identical policy number every year as the policy renews. I need to count only the lowest issue date for each policy, not every policy every time. If a policy was issued in November 2010, I want it counted once, not once each for November 2010, 2011, 2012, etc. The issue dates are in the format yyyymmdd. Only year and month are relevant.
I'm sure this is an easy one for the more experienced among you, but I haven't been able to piece it together from other questions on this forum. Any help would be appreciated!
Something like this will get what you want:
SELECT LEFT(FirstIssued, 6) AS YYMM, COUNT(DISTINCT Policy_No) AS NumPolicies
FROM
(
SELECT Policy_No, MIN(issue_date) AS FirstIssued
FROM table_a
WHERE indicator = 'fln'
GROUP BY Policy_No
) A
GROUP BY LEFT(FirstIssued,6)
The key is to first find the min date for each policy, before aggregating the counts. Note that the only months you will have appear are those with at least one policy, so if you would prefer to have 0s you need to add in a date generator.
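To check the idea on a small data set, the query can be run through Python's sqlite3 module. This is a sketch with invented rows. SQLite has no LEFT function, so substr(first_issued, 1, 6) stands in for LEFT(FirstIssued, 6), and issue_date is assumed to be stored as text in yyyymmdd form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (policy_no TEXT, issue_date TEXT, indicator TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [
        ("A", "20101115", "fln"), ("A", "20111115", "fln"), ("A", "20121115", "fln"),
        ("B", "20101120", "fln"),
        ("C", "20110103", "fln"), ("C", "20120103", "fln"),
        ("D", "20101105", "xyz"),  # other indicator, must be ignored
    ],
)

FIRST_ISSUE_SQL = """
SELECT substr(first_issued, 1, 6) AS yyyymm, COUNT(*) AS num_policies
FROM (
    -- one row per policy, carrying its earliest issue date
    SELECT policy_no, MIN(issue_date) AS first_issued
    FROM table_a
    WHERE indicator = 'fln'
    GROUP BY policy_no
) AS first_issue
GROUP BY substr(first_issued, 1, 6)
ORDER BY yyyymm
"""

policies_per_month = conn.execute(FIRST_ISSUE_SQL).fetchall()
print(policies_per_month)  # [('201011', 2), ('201101', 1)]
```

Policies A and B are counted once in November 2010 and C once in January 2011, and the renewals do not add further rows. Because the inner query already returns one row per policy, COUNT(*) and COUNT(DISTINCT policy_no) give the same result here.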

Aggregating 15-minute data into weekly values

I'm currently working on a project in which I want to aggregate data (resolution = 15 minutes) to weekly values.
I have 4 weeks and the view should include a value for each week AND every station.
My dataset includes more than 50 stations.
What I have is this:
select name, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name
order by name
But it only displays the avg value of all weeks. What I need is avg values for each week and each station.
Thanks for your help!
The problem is that when you do a 'GROUP BY' on just name you then flatten the weeks and you can only perform aggregate functions on them.
Your best option is to do a GROUP BY on both name and week so something like:
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name, week
order by name
PS - It's not entirely clear whether you need one set of results for stations and one for weeks, or a set of results for every week at every station (which is what this answer provides). If you require the former, then separate queries are the way to go.
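The effect of grouping by both columns can be seen on a few made-up rows. The sketch below runs the query through Python's sqlite3 module. The station names and values are hypothetical, week is kept as text as in the question, and week is added to the ORDER BY only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, week TEXT, parameter1 REAL, parameter2 REAL)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("station_a", "29", 1.0, 10.0), ("station_a", "29", 3.0, 20.0),
        ("station_a", "30", 5.0, 30.0),
        ("station_b", "29", 10.0, 40.0), ("station_b", "29", 20.0, 60.0),
    ],
)

WEEKLY_SQL = """
SELECT name, week, AVG(parameter1), AVG(parameter2)
FROM data
WHERE week IN ('29', '30', '31', '32')
GROUP BY name, week
ORDER BY name, week
"""

weekly_rows = conn.execute(WEEKLY_SQL).fetchall()
for row in weekly_rows:
    print(row)
```

Each station now gets one row per week that has data. Weeks with no readings for a station simply do not appear.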