Calculating growth and retention rate with sql - google-bigquery

I wrote a query to calculate the growth rate of new, retained and returning students. The code below returns a result similar to this:
Row visit_month student_type numberofstd growth
1 2013 new 574 null
2 2014 new 220 -62%
3 2014 retained 442 245%
4 2015 new 199 -10%
5 2015 retained 533 21%
6 2016 new 214 8%
7 2016 retained 590 11%
8 2016 returning 1 -100%
This is the query I have tried:
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2)
SELECT visit_month,
student_type,
Count(distinct studentid) as numberofstd,
ROUND(100 * (COUNT(student_type) - LAG(COUNT(student_type), 1) OVER (ORDER BY student_type)) / LAG(COUNT(student_type), 1) OVER (ORDER BY student_type),0) || '%' AS growth
FROM student_categorized
group by 1,2
order by 1,2
The query above calculates the retention, new and returning rate based on the figures of the last session student_type category.
I am looking for a way to calculate these figures based on the total number of students in each visit_month and not from each category. Is there a way I can achieve this?
I am trying to get a table similar to this
Row visit_month student_type totalstd numberofstd growth
1 2013 new 574 574 null
2 2014 new 662 220 62%
3 2014 retained 662 442 22%
4 2015 new 732 199 10%
5 2015 retained 732 533 21%
6 2016 new 804 214 8%
7 2016 retained 804 590 11%
8 2016 returning 804 1 100%
Note:
The totalstd is the total number of students in each session, obtained as new + retained + returning.
The growth calculation was assumed.
Please help!
Thank you.

I do not have your source data, so I am relying on the query you shared and its output.
I wrote some extra code to produce the desired result. Note that I could not run it in BigQuery because I did not have the data, so I have tried to catch possible errors by reading through the query myself. The queries between the lines of asterisks are unchanged and were copied from your code. Below is the code (a mix of yours and the extra bits I wrote):
#*****************************************************************
with visit_log AS (
SELECT studentid,
cast(substr(session, 1, 4) as numeric) as visit_month,
FROM abaresult
GROUP BY 1,
2
ORDER BY 1,
2),
time_lapse_2 AS (
SELECT studentid,
Visit_month,
lag(visit_month, 1) over (partition BY studentid ORDER BY studentid, visit_month) lag
FROM visit_log),
time_diff_calculated_2 AS (
SELECT studentid,
visit_month,
lag,
visit_month - lag AS time_diff
FROM time_lapse_2),
student_categorized AS (
SELECT studentid,
visit_month,
CASE
WHEN time_diff=1 THEN 'retained'
WHEN time_diff>1 THEN 'returning'
WHEN time_diff IS NULL THEN 'new'
END AS student_type,
FROM time_diff_calculated_2),
#**************************************************************
#Code I added (continues the same WITH clause, so no second WITH keyword)
#each unique visit_month will have its count
total_stud AS (
SELECT visit_month, count(distinct studentid) as totalstd FROM visit_log
GROUP BY 1
ORDER BY visit_month
),
#After you have your student_categorized temp table, create another one
#It will have the count of the number of students per visit_month per student_type
number_std_monthType AS (
SELECT visit_month,student_type, Count(distinct studentid) as numberofstd from student_categorized
GROUP BY 1, 2
),
#You will have one row per combination of visit_month and student_type
student_categorized2 AS(
SELECT DISTINCT visit_month,student_type FROM student_categorized
GROUP BY 1,2
),
#before calculation, create the table with all the necessary data
#you have the desired table without the growth
#notice that I used two keys to join t1 and t3 so the results are correct
final_t AS (
SELECT t1.visit_month,
t1.student_type,
t2.totalstd as totalstd,
t3.numberofstd
FROM student_categorized2 t1
LEFT JOIN total_stud AS t2 ON t1.visit_month = t2.visit_month
LEFT JOIN number_std_monthType t3 ON (t1.visit_month = t3.visit_month and t1.student_type = t3.student_type)
)
#Now all the necessary values to calculate growth are in the temp table final_t
SELECT visit_month, student_type, totalstd, numberofstd,
ROUND(100 * (totalstd - LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC)) / LAG(totalstd) OVER (PARTITION BY visit_month ORDER BY visit_month ASC)) || '%' AS growth
FROM final_t
Notice that I used LEFT JOIN in order to have the proper counts in the final table, since each count was calculated in a different temp table. Also, I did not use your final SELECT statement.
If you have any issues with the code, do not hesitate to ask.
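As an alternative to joining separate count tables, the per-year total can also be attached with a window function, SUM(...) OVER (PARTITION BY visit_month). The following is a minimal sketch and not the query from the answer: it runs the SQL through Python's sqlite3 module (SQLite 3.25 or later is assumed for window functions) on a tiny made-up visit_log table. The nine sample rows and the share_pct column name are assumptions for illustration, and SQLite's dialect differs from BigQuery's in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_log (studentid INTEGER, visit_month INTEGER)")
# Hypothetical sample: one row per student per year attended.
conn.executemany(
    "INSERT INTO visit_log VALUES (?, ?)",
    [(1, 2013), (2, 2013), (3, 2013),
     (1, 2014), (2, 2014), (4, 2014),
     (1, 2015), (3, 2015), (5, 2015)],
)

QUERY = """
WITH lagged AS (
    SELECT studentid,
           visit_month,
           LAG(visit_month) OVER (
               PARTITION BY studentid ORDER BY visit_month) AS prev_month
    FROM visit_log
),
categorized AS (
    SELECT studentid,
           visit_month,
           CASE WHEN prev_month IS NULL THEN 'new'
                WHEN visit_month - prev_month = 1 THEN 'retained'
                ELSE 'returning'
           END AS student_type
    FROM lagged
),
counts AS (
    SELECT visit_month,
           student_type,
           COUNT(DISTINCT studentid) AS numberofstd
    FROM categorized
    GROUP BY visit_month, student_type
)
SELECT visit_month,
       student_type,
       -- total of all categories in the same year
       SUM(numberofstd) OVER (PARTITION BY visit_month) AS totalstd,
       numberofstd,
       -- each category as a percentage of that year's total
       ROUND(100.0 * numberofstd
             / SUM(numberofstd) OVER (PARTITION BY visit_month)) AS share_pct
FROM counts
ORDER BY visit_month, student_type
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Whether "growth" should mean this share of the year's total or a change against the previous year's total is left open in the question, so the last column here is only one possible reading.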

Related

Progressive Select Query in Oracle Database

I want to write a select query that selects distinct rows of data progressively.
To explain with an example:
Say I have 5000 accounts selected for loan repayment, ordered by outstanding amount in descending order (the 1st account has the highest outstanding amount, the 5000th the lowest).
I want to select 1000 unique accounts 5 times such that the total outstanding amount is similar in all 5 cases.
I have tried a few methods, such as selecting rownums based on odd/even, but that only works for up to 2 distributions. I was expecting something more like an A.P. (arithmetic progression) in maths that selects data progressively.
A naïve method of splitting sets into (for example) 5 bins, numbered 0 to 4, is to give each row a unique sequential numeric index and then, in order of size, assign the first 10 rows to bins 0,1,2,3,4,4,3,2,1,0 and then repeat for additional sets of 10 rows:
WITH indexed_values (value, rn) AS (
SELECT value,
ROW_NUMBER() OVER (ORDER BY value DESC) - 1
FROM table_name
),
assign_bins (value, rn, bin) AS (
SELECT value,
rn,
CASE WHEN MOD(rn, 2 * 5) >= 5
THEN 5 - MOD(rn, 5) - 1
ELSE MOD(rn, 5)
END
FROM indexed_values
)
SELECT bin,
COUNT(*) AS num_values,
SUM(value) AS bin_size
FROM assign_bins
GROUP BY bin
Which, for some random data:
CREATE TABLE table_name ( value ) AS
SELECT FLOOR(DBMS_RANDOM.VALUE(1, 1000001)) FROM DUAL CONNECT BY LEVEL <= 1000;
May output:
BIN NUM_VALUES BIN_SIZE
0 200 100012502
1 200 100004633
2 200 99980342
3 200 99976774
4 200 100005756
It will not get the bins to have equal values but it is relatively simple and will get a close approximation if your values are approximately evenly distributed.
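The bin assignment in the CASE expression above can be checked outside the database. The sketch below mirrors it in plain Python; the function name serpentine_bin and the sample amounts 5000, 4999, ..., 1 are assumptions chosen to match the 5000-account example in the question, not real data.

```python
from collections import defaultdict

BINS = 5


def serpentine_bin(rn, bins=BINS):
    """Mirror of the SQL CASE: rn is the 0-based rank by descending value."""
    if rn % (2 * bins) >= bins:
        # second half of each block of 2*bins rows runs backwards: 4,3,2,1,0
        return bins - rn % bins - 1
    # first half runs forwards: 0,1,2,3,4
    return rn % bins


first_ten = [serpentine_bin(rn) for rn in range(10)]
print(first_ten)

# Hypothetical data: 5000 accounts with evenly spaced outstanding amounts.
amounts = sorted(range(1, 5001), reverse=True)

bin_count = defaultdict(int)
bin_total = defaultdict(int)
for rn, amount in enumerate(amounts):
    b = serpentine_bin(rn)
    bin_count[b] += 1
    bin_total[b] += amount

for b in sorted(bin_total):
    print(b, bin_count[b], bin_total[b])
```

With evenly spaced amounts like these, every bin gets 1000 accounts and the same total, because each forward/backward pair of rows in a block of ten sums to the same value. Real, skewed data will only be approximately balanced.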
If you want to select values from a certain bin then:
WITH indexed_values (value, rn) AS (
SELECT value,
ROW_NUMBER() OVER (ORDER BY value DESC) - 1
FROM table_name
),
assign_bins (value, rn, bin) AS (
SELECT value,
rn,
CASE WHEN MOD(rn, 2 * 5) >= 5
THEN 5 - MOD(rn, 5) - 1
ELSE MOD(rn, 5)
END
FROM indexed_values
)
SELECT value
FROM assign_bins
WHERE bin = 0

SQL query to get average sum of other rows and store in current rows

I have a table like this, and I want a query that stores, for each row, the average of the other rows' points.
USER_ID POINTS
------------- --------
a14e43e4f851 134
1e86e5adedbf 40
3c66730edf69 149
32e24082f97b 67
b33e3100a7be 124
274ee414ad8f 85
bdeef25fc797 172
For example - for user_id = a14e43e4f851, the average sum of points should be
avg(40+149+67+124+85+172) .
PS - not taken the points (134) in calculation for user a14e43e4f851.
Output should look like this --
USER_ID POINTS AVG
------------- ------- ------
a14e43e4f851 134 106 which is avg(40+149+67+124+85+172)
1e86e5adedbf 40 avg(134+149+67+124+85+172)
3c66730edf69 149 avg(134+40+67+124+85+172)
32e24082f97b 67 avg(134+40+149+124+85+172)
b33e3100a7be 124 ...
274ee414ad8f 85 ...
bdeef25fc797 172 ...
You could use a correlated subquery:
select t.*,
(select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id) as average
from mytable t
An alternative uses window functions:
select t.*,
(sum(points) over() - points) / nullif(count(*) over() - 1, 0) as average
from mytable t
Note: avg conflicts with a language keyword, so I use average instead.
If you wanted an update statement:
update mytable t
set t.average = (
select avg(t1.points) from mytable t1 where t1.user_id <> t.user_id
)
However, I would not recommend actually storing this value; this is derived information, that can easily be computed on the fly whenever needed, using the first statement. If you are going to run the query often, you could create a view:
create view myview as
select t.*,
(sum(points) over() - points) / nullif(count(*) over() - 1, 0) as average
from mytable t
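The correlated subquery and the window-function form should give the same "average of everyone else" for each row. A minimal sketch that checks this on the sample data from the question, using Python's sqlite3 module (SQLite 3.25 or later assumed for window functions). The multiplication by 1.0 is an addition for SQLite, which otherwise performs integer division on integer columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (user_id TEXT PRIMARY KEY, points INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("a14e43e4f851", 134), ("1e86e5adedbf", 40), ("3c66730edf69", 149),
     ("32e24082f97b", 67), ("b33e3100a7be", 124), ("274ee414ad8f", 85),
     ("bdeef25fc797", 172)],
)

# Form 1: correlated subquery, re-aggregates the other rows for every row.
correlated = dict(conn.execute("""
    SELECT t.user_id,
           (SELECT AVG(t1.points) FROM mytable t1 WHERE t1.user_id <> t.user_id)
    FROM mytable t
""").fetchall())

# Form 2: window functions, total and row count are computed over all rows,
# then the current row's points are subtracted out.
windowed = dict(conn.execute("""
    SELECT user_id,
           (SUM(points) OVER () - points) * 1.0
               / NULLIF(COUNT(*) OVER () - 1, 0)
    FROM mytable
""").fetchall())

for user_id in correlated:
    print(user_id, round(correlated[user_id], 2), round(windowed[user_id], 2))
```

For a14e43e4f851 both forms give (40+149+67+124+85+172)/6, which rounds to the 106 shown in the question.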
I'm assuming user_id is the PK.
WITH q AS (SELECT sum(points) AS s, count(*) AS n FROM mytable)
-- assumes a dialect that supports UPDATE ... FROM (e.g. PostgreSQL)
UPDATE mytable SET average = (q.s-points)/(q.n-1) FROM q;
The idea is that:
the average of all users' scores is sum(score)/count(*)
the sum of all users' scores except this one equals the sum of all scores minus this user's score
the average of all other users' scores, excluding this one, is (sum(score)-score_for_this_user)/(count(*)-1)
Nice thing is it only has to calculate the sum() and count() once.
To handle the case where there is only one row in the table:
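WITH q AS (SELECT sum(points) AS s, NULLIF(count(*),1) AS n FROM mytable)
-- assumes a dialect that supports UPDATE ... FROM (e.g. PostgreSQL)
UPDATE mytable SET average = (q.s-points)/(q.n-1) FROM q;
This makes the count NULL instead of 1, so the divisor is NULL rather than 0 and the updated average is NULL too.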

Hive query to select only records in certain percentile

I have table with two columns - ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want a Hive query that selects only the records at or above the 75th percentile. Here that should be only the last record:
id tot_dur
126 8
This is what I have, but it is hard for me to understand the use of OVER() and PARTITION BY, which, from what I have researched, are what I should use. Before I get the tot_dur column, I need to sum the duration column and group by id. I am not sure whether percentile is the correct function, because I also found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id
If I've got you right, this is what you want:
with data as (select stack(4,
123, 1,
124, 2,
125, 5,
126, 8) as (id, tot_dur))
-----------------------------------------------------------------------------
select data.id, data.tot_dur
from data
join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;
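To see why only id 126 survives the filter, it helps to compute the threshold by hand. The sketch below is plain Python and assumes the percentile is obtained by linear interpolation between the two nearest ranks, at position p * (n - 1) in the sorted values. That is my understanding of how Hive's exact percentile behaves for integer columns, but treat it as an assumption; the helper name percentile is only for illustration.

```python
def percentile(values, p):
    """Percentile by linear interpolation at position p * (n - 1)."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lower = int(pos)                        # rank just below the position
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower                      # how far between the two ranks
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


data = [(123, 1), (124, 2), (125, 5), (126, 8)]

threshold = percentile([tot_dur for _, tot_dur in data], 0.75)
selected = [(id_, tot_dur) for id_, tot_dur in data if tot_dur >= threshold]

print(threshold)
print(selected)
```

Under that assumption the threshold falls between 5 and 8, so the row with tot_dur = 5 is excluded and only the last record remains.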

SQL get value from next row

I'm looking for an SQL way to get the value from the next row.
The data I have looks like:
CUST PROD From_Qty Disc_Pct
23 Brush 1 0
23 Brush 13 1
23 Brush 52 4
77 Paint 1 0
77 Paint 22 7
What I need to end up with is this (I want to create the To_Qty column):
CUST PROD From_Qty To_Qty Disc_Pct
23 Brush 1 12 0
23 Brush 13 51 1 #13 is 12+1
23 Brush 52 99999 4 #52 is 51+1
77 Paint 1 21 0 #1 is 99999+1
77 Paint 22 99999 7 #22 is 21+1
I've got 100K+ rows to do this to, and it has to be SQL because my ETL application allows SQL but not stored procedures etc.
How can I get the value from the next row so I can create To_Qty?
SELECT *,
LEAD([From_Qty], 1, 100000) OVER (PARTITION BY [CUST] ORDER BY [From_Qty]) - 1 AS To_Qty
FROM myTable
LEAD() gets you the next value based on the order of [From_Qty]. PARTITION BY [CUST] resets the window whenever [CUST] changes.
Or you can use a CTE and ROW_NUMBER:
WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY [CUST] ORDER BY [From_Qty]) Rn
FROM myTable
)
SELECT t1.*,
ISNULL(t2.From_Qty - 1, 99999) To_Qty
FROM cte t1
LEFT JOIN cte t2 ON t1.Cust = t2.Cust AND t1.Rn + 1 = t2.Rn
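Both approaches above (LEAD with a default, and ROW_NUMBER with a self-join) can be compared on the sample rows from the question. This is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or later assumed), so the SQL uses SQLite spellings: no square brackets around identifiers and COALESCE in place of ISNULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE myTable (CUST INTEGER, PROD TEXT, From_Qty INTEGER, Disc_Pct INTEGER)"
)
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?, ?)",
    [(23, "Brush", 1, 0), (23, "Brush", 13, 1), (23, "Brush", 52, 4),
     (77, "Paint", 1, 0), (77, "Paint", 22, 7)],
)

# LEAD: next From_Qty within the customer, defaulting to 100000 on the last row,
# minus 1 so the last range ends at 99999.
with_lead = conn.execute("""
    SELECT CUST, PROD, From_Qty,
           LEAD(From_Qty, 1, 100000)
               OVER (PARTITION BY CUST ORDER BY From_Qty) - 1 AS To_Qty,
           Disc_Pct
    FROM myTable
    ORDER BY CUST, From_Qty
""").fetchall()

# ROW_NUMBER + self-join: row n is joined to row n + 1 of the same customer.
with_join = conn.execute("""
    WITH cte AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CUST ORDER BY From_Qty) AS Rn
        FROM myTable
    )
    SELECT t1.CUST, t1.PROD, t1.From_Qty,
           COALESCE(t2.From_Qty - 1, 99999) AS To_Qty,
           t1.Disc_Pct
    FROM cte t1
    LEFT JOIN cte t2 ON t1.CUST = t2.CUST AND t1.Rn + 1 = t2.Rn
    ORDER BY t1.CUST, t1.From_Qty
""").fetchall()

for row in with_lead:
    print(row)
```

Both queries return the same five rows, with To_Qty values 12, 51, 99999, 21 and 99999.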
SELECT
CUST,
PROD,
FROM_QTY ,
COALESCE(MIN(FROM_QTY) OVER (PARTITION BY CUST, PROD ORDER BY FROM_QTY DESC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) , 100000)-1 AS TO_QTY,
DISC_PCT
FROM <tablename>
ORDER BY CUST, PROD, FROM_QTY
If you are running SQL Server 2012 or later, you can use the LAG and LEAD functions to access prior or subsequent rows along with the current row.
You can use the LEAD and FIRST_VALUE analytic functions to generate the result you mentioned. LEAD() retrieves the next value within the customer group, and FIRST_VALUE() gives the first value within the customer group.
For example, for CUST=23, LEAD returns 13 and FIRST_VALUE returns 1, so TO_QTY = LEAD - FIRST_VALUE = 13 - 1 = 12. Note that this relies on the first FROM_QTY of each customer being 1, as in your sample data. In the same way, the formula below computes the value for all 100K rows in your table.
SELECT CUST,
PROD,
FROM_QTY,
CASE WHEN LEAD( FROM_QTY,1 ) OVER ( PARTITION BY CUST ORDER BY FROM_QTY ) IS NOT NULL
THEN
LEAD( FROM_QTY,1 ) OVER ( PARTITION BY CUST ORDER BY FROM_QTY ) -
FIRST_VALUE( FROM_QTY ) OVER ( PARTITION BY CUST ORDER BY FROM_QTY )
ELSE 99999
END AS TO_QTY,
DISC_PCT
FROM Yourtable;
Insert the data into a temp table with the same columns but an id auto increment field added. Insert them ordered, I'm assuming by cust, prod, then from_qty.
Now you can run an update statement on the temp table.
UPDATE #mytable
SET To_Qty = (SELECT From_Qty - 1 FROM #mytable AS next WHERE next.indexfield = #mytable.indexfield + 1 AND next.cust = #mytable.cust and next.prod = #mytable.prod)
and then another one to do the 99999 with a not exists clause.
Then insert the data back to your new or modified table.
declare @Table table(CUST int, PROD varchar(50), From_Qty int, Disc_Pct int)
insert into @Table values
(23, 'Brush', 1, 0)
,(23, 'Brush', 13, 1)
,(23, 'Brush', 52, 4)
,(77, 'Paint', 1, 0)
,(77, 'Paint', 22, 7)
SELECT CUST, Prod, From_qty,
LEAD(From_Qty,1,100000) OVER(PARTITION BY cust ORDER BY from_qty)-1 AS To_Qty,
Disc_Pct
FROM @Table

Reset counter during output with row_number

I have the following SQL Query part:
SELECT car,
grouper,
yearout,
rn
FROM (SELECT Row_number() OVER(partition BY grouper ORDER BY yearout) AS rn,
car,
grouper,
yearout
FROM (SELECT DISTINCT res.sn AS car,
res.groupin AS groupin,
Year(res.yearin) AS yearin
FROM (inner stuff)) res
WHERE Year(res.yearin) BETWEEN '2015' AND '2030'
ORDER BY res.groupin) temp) ALL
WHERE ALL.rn <= (SELECT taken
FROM (SELECT p1.nos - p2.gone AS taken,
yearout,
group_key
FROM (SELECT count(sn) AS nos,
group_key
FROM d_fm_cars
GROUP BY group_key) p1
JOIN (SELECT taken AS gone,
group_key_all,
yearout
FROM settings_keys) p2
ON p1.group_key = p2.group_key_all) counterin
WHERE counterin.yearout = ALL.yearout
AND counterin.group_key_all = ALL.grouper)
The values for taken are like
5
15
for grouper 1 for the years 2015 and 2016. The problem is that the result I get from the query just runs from 1 to 15, instead of running from 1 to 5 and then again from 1 to 15.
The available cars per year are
2015 3
2016 10
2017 25
so the output must run over the years limit and put 2 cars from 2015 into the 2015 output.
When partitioning by grouper and yearout, that is no longer the case. But when yearout is left out of the partition, the counter just runs through and does not reset at 5.
Any idea on how to fix this?
Thanks for any help you can provide.