snowflake ntile of engagement rate over month - sql

I'm trying to write a query that ranks a customer's engagement rate by decile within a month, against all other customers.
I tried:
ntile(10) over (partition by rec_month order by engagement_rate) as decile
but I don't think that is getting me what I need. It appears to just be splitting the vhosts up into 10 equally sized groups. I want percentiles.
I also tried:
ntile(10) over (partition by rec_month, vhost order by engagement_rate) as decile
But that is only calculating it within customer (vhost) within the month.
How do I calculate the engagement_rate decile against all other customers (Vhosts) within the month?

At first I suspected you wanted a rank function, like DENSE_RANK or RANK (which is sparse, and the one I'd use), turned into a percentage by dividing by COUNT(*) and then truncated into deciles with trunc(v/10)*10. But then I suspected you might want the output of the PERCENT_RANK function. The Snowflake documentation example is not as clarifying as I would hope, so here is one to show it solves your problem:
select column1,
       column2,
       round(percent_rank() over (order by column2), 3) as p_rank,
       trunc(p_rank * 10) * 10 as decile
from values ('a',1),('b',2),('c',3),('d',4),('e',5),('f',6),('g',7);
gives
COLUMN1  COLUMN2  P_RANK  DECILE
a        1        0       0
b        2        0.167   10
c        3        0.333   30
d        4        0.5     50
e        5        0.667   60
f        6        0.833   80
g        7        1       100
But maybe you want to use NTILE instead of truncating the percentage. The ROUND is just there to make the output above less verbose.
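Applied to your case, it would look something like this (a sketch; the table name engagement is assumed, and the columns are the ones from your question):
select
    rec_month,
    vhost,
    engagement_rate,
    percent_rank() over (partition by rec_month order by engagement_rate) as p_rank,
    trunc(p_rank * 10) * 10 as decile  -- Snowflake allows reusing the p_rank alias here
from engagement;
That ranks each vhost's engagement_rate against all other vhosts in the same rec_month, which is what your first attempt was close to, just with PERCENT_RANK instead of NTILE.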

Related

Partition rows based on percent value of the data range

What would the select query be if I want to rank and partition rows based on a percent range of the partitioning column?
For example, let's say I have the below table structure (the column 'Percent_Rank' needs to be populated).
I want to rank the rows based on the order of score, but only within +/- 10% of the current row's amount. That is, for the first row the amount is 2.3, and +/-10% of 2.3 is 2.07 - 2.53. So while ranking the first row, I should rank based on the score and consider only those rows whose amount is in the range 2.07 - 2.53 (in this case IDs 1, 5 and 11). Based on this logic the percentile rank for the first row will be 0.5. Similarly, perform the same steps for each row.
The question is how I can do this with PERCENT_RANK() or RANK() or NTILE() with a partition clause as part of a select query. The original table does not have the last column; that is the column that needs to be populated. I need the percentile ranking of each row based on the score, within the 10% range of amount.
PRODUCT_ID  Amount  Score  Percent_Rank
1           2.3     45     0.5
2           2.7     30     0
3           2.0     40     0.5
4           2.6     50     1
5           2.2     35     0
6           5.1     25     0
7           4.8     40     1
8           6.1     60     0
9           22.1    70     0.33
10          8.2     20     0
11          2.1     50     1
12          22.2    60     0
13          22.3    80     1
14          22.4    75     0.66
I tried using PERCENT_RANK() OVER (PARTITION BY ...), but it's not considering the range. I cannot use RANGE UNBOUNDED PRECEDING AND FOLLOWING in the frame clause because I need the range to be within 10% of the amount in the current row.
You may try PERCENT_RANK() with a self join, as follows:
SELECT PRODUCT_ID, Amount, Score, Percent_Rank
FROM
(
  SELECT A.PRODUCT_ID, A.Amount, A.Score, B.PRODUCT_ID AS B_ID,
         PERCENT_RANK() OVER (PARTITION BY A.PRODUCT_ID ORDER BY B.Score) AS Percent_Rank
  FROM table_name A
  JOIN table_name B
    ON B.Amount BETWEEN A.Amount - A.Amount * 0.1 AND A.Amount + A.Amount * 0.1
) T
WHERE PRODUCT_ID = B_ID  -- keep only the row where B is the current row itself (safe even if amounts repeat)
See a demo.
I think that you can just nest your percent_rank in a subquery once you have calculated the bucket number based on equally spaced scores.
The trickiest part of this example is actually getting the fixed-width buckets. It might be simpler if we could use width_bucket(), but some databases don't support that, so I had to compute the buckets manually (in the 'bucketed' inline table).
Here is the example. I used postgres to create the mockup test table, because it has a very nice generate_series(), but the actual example SQL should run on any database.
create table product_scores as (
  select
    product_id,
    score
  from
    generate_series(1,2) product_id,
    generate_series(1,50) score);
This created a table with two product ids and 50 scores for each one.
with ranges as (
  select
    product_id,
    (max(score) - min(score)) * (1 + 1e-10) as range,
    min(score) as minscore
  from product_scores
  group by product_id),
bucketed as (
  select
    ranges.product_id,
    score,
    floor((score - minscore) * 10.0 / range) as bucket
  from ranges
  inner join product_scores
    on ranges.product_id = product_scores.product_id)
select
  product_id,
  score,
  bucket,
  percent_rank() over (partition by product_id, bucket order by score)
from bucketed;
No, the 1e-10 is not a joke. Unfortunately, roundoff error would otherwise assign the highest value to a bucket all by itself, so we expand the range by a tiny amount. Once we have a workable range, we can calculate the bucket easily enough.
Then, having the bucket number, you can do the percent_rank() as usual, as shown.
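For databases that do support width_bucket() (Postgres does), here is a sketch of the simpler version. LEAST() folds the maximum value, which width_bucket() would otherwise place in bucket count+1, back into the top bucket, replacing the 1e-10 trick:
with ranges as (
  select product_id, min(score) as min_s, max(score) as max_s
  from product_scores
  group by product_id)
select
  p.product_id,
  p.score,
  least(width_bucket(p.score, r.min_s, r.max_s, 10), 10) as bucket,
  percent_rank() over (
    partition by p.product_id,
                 least(width_bucket(p.score, r.min_s, r.max_s, 10), 10)
    order by p.score) as pct
from product_scores p
join ranges r using (product_id);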

Running calculations on different percent amounts of data - SQL

I'm brainstorming on ways to find trends over a dataset containing transaction amounts that spans a year.
I'd like to run an average over the top 25% of observations and the bottom 75% of observations, and vice versa.
If the entire dataset contains 1000 observations, I'd like to run:
An average of the top 25% and then, separately, an average of the bottom 75%, and then the average of those two results.
Inversely, the top 75% average, then the bottom 25%, then the average of the two.
For the overall average I have: avg(transaction_amount)
I am aware that for the sectioning averages to be useful, I will have to order the data by date, which I have already accounted for in my SQL code:
select avg(transaction_amount)
from example.table
order by transaction_date
I am now struggling to find a way to split the data into the 25% and 75% slices based on the number of observations.
Thanks.
If you're using MSSQL, it's pretty trivial, depending on exactly the output you're looking for: take the TOP 25 PERCENT of rows in a subquery, then average them.
SELECT AVG(sub.transaction_amount) AS avg_amt
FROM (
    SELECT TOP 25 PERCENT transaction_amount
    FROM example.table
    ORDER BY transaction_date
) AS sub;
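The other slices work the same way; for example, the bottom 75% just flips the sort (a sketch, assuming "top" and "bottom" are defined by transaction_date as in your question):
SELECT AVG(sub.transaction_amount) AS avg_amt
FROM (
    SELECT TOP 75 PERCENT transaction_amount
    FROM example.table
    ORDER BY transaction_date DESC
) AS sub;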
Use PERCENT_RANK in order to see which percentage block a row belongs to. Then use this to group your data:
with data as
(
select t.*, percent_rank() over (order by transaction_amount) as pr
from example.table t
)
select
case when pr <= 0.75 then '0-75%' else '75-100%' end as percent,
avg(transaction_amount) as avg,
avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.75 then '0-75%' else '75-100%' end
union all
select
case when pr <= 0.25 then '0-25%' else '25-100%' end as percent,
avg(transaction_amount) as avg,
avg(avg(transaction_amount)) over () as avg_of_avg
from data
group by case when pr <= 0.25 then '0-25%' else '25-100%' end;

SQL- calculate ratio and get max ratio with corresponding user and date details

I have a table with user, date, and a column each for messages sent and messages received:
I want to get the max of messages_sent/messages_recieved by date, and the user for that ratio. So this is the output I expect:
Andrew Lean 10/2/2020 10
Andrew Harp 10/1/2020 6
This is my query:
SELECT
ds.date, ds.user_name, max(ds.ratio) from
(select a.user_name, a.date, a.message_sent/ a.message_received as ratio
from messages a
group by a.user_name, a.date) ds
group by ds.date
But the output I get is:
Andrew Lean 10/2/2020 10
Jalinn Kim 10/1/2020 6
In the above output 6 is the correct max ratio for the date grouped but the user is wrong. What am I doing wrong?
With a recent version of most databases, you could do something like this.
This assumes, as in your data, there's one row per user per day. If you have more rows per user per day, you'll need to provide a little more detail about how to combine them or which to ignore. You might want to SUM them; it's tough to know.
WITH cte AS (
select a.user_name, a.date
, a.message_sent / a.message_received AS ratio
, ROW_NUMBER() OVER (PARTITION BY a.date ORDER BY a.message_sent / a.message_received DESC) as rn
from messages a
)
SELECT t.user_name, t.date, t.ratio
FROM cte AS t
WHERE t.rn = 1
;
Note: There's no attempt to handle ties, where more than one user has the same ratio. We could use RANK (or other methods) for that, if your database supports it.
Here, I am just calculating the ratio for each row in the first CTE.
In the second part, I am getting the maximum of the ratios calculated in the first part at the date level. This assumes each user has one row per date.
The max() at the date level ensures we always get the highest ratio per date.
There could be ties between the ratios; for those we can use ROW_NUMBER() or RANK() to set a rank for each row based on whatever tie-breaking criteria we like, and then filter on the rank generated.
with data as (
  select
    date,
    user_id,
    messages_sent / messages_recieved as ratio
  from [table name]
)
select
  date,
  max(ratio) as highest_ratio_per_date
from data
group by 1;
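To also bring back the user for that maximum, here is a sketch of the rank-and-filter step described above (same assumed [table name] and columns):
with data as (
  select
    date,
    user_id,
    messages_sent / messages_recieved as ratio,
    rank() over (
      partition by date
      order by messages_sent / messages_recieved desc) as rnk  -- rank 1 = highest ratio; ties share it
  from [table name]
)
select
  date,
  user_id,
  ratio as highest_ratio_per_date
from data
where rnk = 1;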

Moving Average in SQL Server

I have the below table with values:
Row_ID FQFY Average
1 2018-Q1 70%
2 2018-Q2 60%
3 2018-Q3 50%
4 2018-Q4 90%
5 2019-Q1 70%
6 2019-Q2 80%
7 2019-Q3 20%
8 2019-Q4 NULL
9 2020-Q1 30%
Starting from the 4th row, I have a requirement to calculate the moving average of the preceding 4 row values. And if there is any NULL value, the requirement is to ignore that NULL when doing the average.
Can someone please help me with the code in SQL Server?
Use AVG with an appropriate window frame:
SELECT *, AVG(Average) OVER (ORDER BY FQFY ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) rollingAverage
FROM yourTable;
Regarding the NULL requirement, AVG by default already ignores NULL values.
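If you literally want the result only from the 4th row onward (rows 1 through 3 return NULL instead of a shorter-window average), here is a sketch using the frame's row count:
SELECT *,
       CASE WHEN COUNT(*) OVER (ORDER BY FQFY
                                ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) = 4
            THEN AVG(Average) OVER (ORDER BY FQFY
                                    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
       END AS rollingAverage
FROM yourTable;
COUNT(*) counts rows in the frame regardless of NULLs, so the NULL in 2019-Q4 still counts toward the 4-row window while AVG ignores it.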

Rank based on sequence of dates

I have data as below:
Heading Date
A 2009-02-01
B 2009-02-03
c 2009-02-05
d 2009-02-06
e 2009-02-08
I need the rank as below:
Heading Date Rank
A 2009-02-01 1
B 2009-02-03 2
c 2009-02-05 1
d 2009-02-06 2
e 2009-02-07 3
I need the rank based on the date: if the dates are continuous, the rank should be 1, 2, 3, etc. If there is any break in the dates, the rank should start over with 1, 2, ...
Can anyone help me with this?
SELECT heading, thedate
,row_number() OVER (PARTITION BY grp ORDER BY thedate) AS rn
FROM (
SELECT *, thedate - (row_number() OVER (ORDER BY thedate))::int AS grp
FROM demo
) sub;
While you speak of "rank" you seem to want the result of the window function row_number().
Form groups of consecutive days (same date in grp) in subquery sub.
Number rows with another row_number() call, this time partitioned by grp.
One subquery is the bare minimum here, since window functions cannot be nested.
SQL Fiddle.
Note that I went with the second version of your contradictory sample data, and the result is as @mu suggested in his comment.
I'm also assuming that there are no duplicate dates. You'd have to aggregate first in that case.
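If duplicate dates can occur, one option (a sketch) is to swap the inner row_number() for dense_rank(), which keeps grp constant across rows sharing a date; the numbering among the duplicates themselves is then arbitrary:
SELECT heading, thedate
      ,row_number() OVER (PARTITION BY grp ORDER BY thedate) AS rn
FROM (
   SELECT *, thedate - (dense_rank() OVER (ORDER BY thedate))::int AS grp
   FROM demo
   ) sub;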
Hi, this is not a correct answer, I am still trying. It is interesting. :) I am posting what I got so far: sqlfiddle
SELECT
  rank() over (order by thedate asc) as rank,
  heading, thedate
FROM demo
ORDER BY rank asc;
Now I am trying to detect the breaks in dates. I don't know how yet, but maybe these links are useful:
SQL — computing end dates from a given start date with arbitrary breaks
How to rank in postgres query
I will update if I get anything.
Edit:
I got this for MySQL; I am posting it because it may be helpful. Check "Emulate Row_Number()" here:
Given a table with two columns i and j, generate a resultset that has
a derived sequential row_number column taking the values 1,2,3,... for
a defined ordering of j which resets to 1 when the value of i changes
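For reference, the classic pre-8.0 MySQL emulation uses user variables (a sketch; t, i and j are the names from the quote above, and on MySQL 8+ you would just use ROW_NUMBER() OVER (PARTITION BY i ORDER BY j)):
SELECT
  i,
  j,
  @rn := IF(@prev = i, @rn + 1, 1) AS rn,  -- reset to 1 when i changes
  @prev := i AS prev_i                     -- remember i for the next row
FROM t
CROSS JOIN (SELECT @rn := 0, @prev := NULL) vars
ORDER BY i, j;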
Bangalore BLR - Bagmane Tech Park 2013-10-11 Data Centre 0
Bangalore BLR - Bagmane Tech Park 2013-10-11 BMS 0
Bangalore BLR - Bagmane Tech Park 2013-10-12 BMS 0
Bangalore BLR - Bagmane Tech Park 2013-10-15 BMS 3
I have data like this: if the last column is zero, the rank should be based on all columns. If the date is continuous,
like 2013-10-11, 2013-10-12, the rank should be 1, 2, ...
If there is any break in the dates, like 2013-10-11, 2013-10-12 and then 2013-10-15, the rank should start again from 1 for 2013-10-15.