Percentile using SQL

I have 3 columns in my data set:
Monetary
Recency
Frequency
I want to create 3 more columns, M_Q, R_Q, and F_Q, containing the percentile of each value of Monetary, Recency, and Frequency, using SQL.
Thank you in advance.
Customer_ID Frequency Recency Monetary R_Q F_Q M_Q
112 1 39 7.05 0.398 0.789 0.85873
143 1 23 0.1833 0.232 0.7895 0.1501
164 1 52 0.416 0.508 0.789 0.295
123 1 118 1.1 0.98 0.789 0.52

The function you are looking for is the ANSI standard function ntile():
select t.*,
ntile(100) over (order by monetary) as percentile_monetary,
ntile(100) over (order by recency) as percentile_recency,
ntile(100) over (order by frequency) as percentile_frequency
from t;
This is available in most databases.
Alternatively, you can calculate the percentile using rank() and count(). Depending on how you want to handle ties and whether you want values from 1-100 or 0-100, the following should be a good starting point:
select t.*,
(1 + rank_monetary * 100.0 / cnt) as percentile_monetary,
(1 + rank_recency * 100.0 / cnt) as percentile_recency,
(1 + rank_frequency * 100.0 / cnt) as percentile_frequency
from (select t.*,
count(*) over () as cnt,
rank() over (order by monetary) - 1 as rank_monetary,
rank() over (order by recency) - 1 as rank_recency,
rank() over (order by frequency) - 1 as rank_frequency
from t
) t;
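The expected R_Q, F_Q and M_Q columns in the question are fractions between 0 and 1 rather than bucket numbers, so the standard cume_dist() window function may be closer to what is wanted. Here is a minimal sketch that runs it against the four sample rows; the in-memory SQLite database driven from Python is only an assumption made to keep the example runnable, and the table name t is taken from the answer above. With only four rows the fractions will not match the question's values, which come from the full data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (customer_id INTEGER, frequency INTEGER, "
    "recency INTEGER, monetary REAL)"
)
# The four sample rows from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(112, 1, 39, 7.05), (143, 1, 23, 0.1833),
     (164, 1, 52, 0.416), (123, 1, 118, 1.1)],
)

# cume_dist() = (rows with a value <= the current value) / (total rows)
rows = conn.execute("""
    SELECT customer_id,
           cume_dist() OVER (ORDER BY recency)   AS r_q,
           cume_dist() OVER (ORDER BY frequency) AS f_q,
           cume_dist() OVER (ORDER BY monetary)  AS m_q
    FROM t
    ORDER BY customer_id
""").fetchall()

for row in rows:
    print(row)
```

Tied values share the same fraction: every row has frequency 1 here, so f_q is 1.0 throughout. If you want ties to start from 0 instead, percent_rank() is the variant to try.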

Related

How to calculate compound running total in SQL

I have a table like this:
Year, DividendYield
1950, .1
1951, .2
1952, .3
I now want to calculate the total running shares. In other words, if the dividend is re-invested in new shares, it will look now like this:
Original Number of Shares purchased Jan 1, 1950 is 1
1950, .1, 1.1 -- yield of .1 reinvested in new shares results in .1 new shares, totaling 1.1
1951, .2, 1.32 -- (1.1 (Prior Year Total shares) * .2 (dividend yield) + 1.1 = 1.32)
1953, .3, 1.716 -- (1.32 * .3 + 1.32 = 1.716)
The closest I have been able to come up with is this:
declare @startingShares int = 1
; with cte_data as (
Select *,
@startingShares * DividendYield as NewShares,
(@startingShares * DividendYield) + @startingShares as TotalShares from DividendTest
)
select *, Sum(TotalShares) over (order by id) as RunningTotal from cte_data
But only the first row is correct.
Id Year DividendYield NewShares TotalShares RunningTotal
1 1950 0.10 0.10 1.10 1.10
2 1951 0.20 0.20 1.20 2.30
3 1953 0.30 0.30 1.30 3.60
How do I do this with SQL? I was trying not to resort to a loop to process this.
You want a cumulative multiplication. I think a recursive CTE is actually the simplest solution:
with tt as (
select t.*, row_number() over (order by year) as seqnum
from t
),
cte as (
select tt.year, convert(float, tt.yield) as yield, tt.seqnum
from tt
where seqnum = 1
union all
select tt.year, (tt.yield + 1) * (cte.yield + 1) - 1, tt.seqnum
from cte join
tt
on tt.seqnum = cte.seqnum + 1
)
select cte.*
from cte;
You can also phrase this using logs and exponents:
select t.*,
exp(sum(log(1 + yield)) over (order by year)) - 1
from t;
This should be fine for most purposes, but I find that for longer series this introduces numerical errors more quickly than the recursive CTE.
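To make the arithmetic concrete, here is a small Python sketch (not part of the original answers) that computes the running total both ways for the three sample yields: as a direct cumulative product, which is what the recursive CTE does, and as exp(sum(log(1 + yield))). The variable names are made up for illustration. Note that both SQL queries above return the cumulative yield (total minus 1), while the question asks for total shares, so the sketch keeps the totals.

```python
import math
from itertools import accumulate

# Dividend yields for 1950, 1951, 1952 from the question.
yields = [0.1, 0.2, 0.3]

# Direct cumulative product of (1 + yield), starting from 1 share.
product_totals = list(accumulate((1 + y for y in yields), lambda a, b: a * b))

# Same quantity via logs: a running sum of log(1 + yield), then exp().
log_totals = []
running_log = 0.0
for y in yields:
    running_log += math.log(1 + y)
    log_totals.append(math.exp(running_log))

for year, p, l in zip((1950, 1951, 1952), product_totals, log_totals):
    print(year, round(p, 6), round(l, 6))
```

Both columns agree to within floating-point rounding on a series this short; any drift from the log form only becomes visible over much longer series.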

Find the longest streak of perfect scores per player

I have the following result from a SELECT query with ORDER BY player_id ASC, time ASC in a PostgreSQL database:
player_id points time
395 0 2018-06-01 17:55:23.982413-04
395 100 2018-06-30 11:05:21.8679-04
395 0 2018-07-15 21:56:25.420837-04
395 100 2018-07-28 19:47:13.84652-04
395 0 2018-11-27 17:09:59.384-05
395 100 2018-12-02 08:56:06.83033-05
399 0 2018-05-15 15:28:22.782945-04
399 100 2018-06-10 12:11:18.041521-04
454 0 2018-07-10 18:53:24.236363-04
675 0 2018-08-07 20:59:15.510936-04
696 0 2018-08-07 19:09:07.126876-04
756 100 2018-08-15 08:21:11.300871-04
756 100 2018-08-15 16:43:08.698862-04
756 0 2018-08-15 17:22:49.755721-04
756 100 2018-10-07 15:30:49.27374-04
756 0 2018-10-07 15:35:00.975252-04
756 0 2018-11-27 19:04:06.456982-05
756 100 2018-12-02 19:24:20.880022-05
756 100 2018-12-04 19:57:48.961111-05
I'm trying to find each player's longest streak where points = 100, with the tiebreaker being whichever streak began most recently. I also need to determine the time at which that player's longest streak began. The expected result would be:
player_id longest_streak time_began
395 1 2018-12-02 08:56:06.83033-05
399 1 2018-06-10 12:11:18.041521-04
756 2 2018-12-02 19:24:20.880022-05
A gaps-and-islands problem indeed.
Assuming:
"Streaks" are not interrupted by rows from other players.
All columns are defined NOT NULL. (Else you have to do more.)
This should be simplest and fastest as it only needs two fast row_number() window functions:
SELECT DISTINCT ON (player_id)
player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
SELECT player_id, points, ts
, row_number() OVER (PARTITION BY player_id ORDER BY ts)
- row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp -- omit "points" after WHERE points = 100
ORDER BY player_id, seq_len DESC, time_began DESC;
Using the column name ts instead of time, which is a reserved word in standard SQL. It's allowed in Postgres, but with limitations and it's still a bad idea to use it as identifier.
The "trick" is to subtract row numbers so that consecutive rows fall in the same group (grp) per (player_id, points). Then filter the ones with 100 points, aggregate per group and return only the longest, most recent result per player.
Basic explanation for the technique:
Select longest continuous sequence
We can use GROUP BY and DISTINCT ON in the same SELECT; GROUP BY is applied before DISTINCT ON. Consider the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
About DISTINCT ON:
Select first row in each GROUP BY group?
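To see why subtracting the two row numbers works, here is a sketch in Python that replays the same steps by hand for players 395 and 756. It is only an illustration of the technique, not a replacement for the query; the function name longest_streaks is made up, and the timestamps are shortened to whole seconds with the UTC offsets dropped.

```python
from collections import defaultdict

# (player_id, points, ts) for players 395 and 756 from the question.
rows = [
    (395, 0, "2018-06-01 17:55:23"), (395, 100, "2018-06-30 11:05:21"),
    (395, 0, "2018-07-15 21:56:25"), (395, 100, "2018-07-28 19:47:13"),
    (395, 0, "2018-11-27 17:09:59"), (395, 100, "2018-12-02 08:56:06"),
    (756, 100, "2018-08-15 08:21:11"), (756, 100, "2018-08-15 16:43:08"),
    (756, 0, "2018-08-15 17:22:49"), (756, 100, "2018-10-07 15:30:49"),
    (756, 0, "2018-10-07 15:35:00"), (756, 0, "2018-11-27 19:04:06"),
    (756, 100, "2018-12-02 19:24:20"), (756, 100, "2018-12-04 19:57:48"),
]

def longest_streaks(rows):
    rn_player = defaultdict(int)  # row_number() per player_id
    rn_points = defaultdict(int)  # row_number() per (player_id, points)
    islands = defaultdict(list)   # (player_id, grp) -> ts of the 100-point rows
    for player, points, ts in sorted(rows, key=lambda r: (r[0], r[2])):
        rn_player[player] += 1
        rn_points[(player, points)] += 1
        # Both counters advance together inside a run of equal points,
        # so the difference stays constant until the run is interrupted.
        grp = rn_player[player] - rn_points[(player, points)]
        if points == 100:
            islands[(player, grp)].append(ts)
    best = {}
    for (player, _grp), stamps in islands.items():
        candidate = (len(stamps), min(stamps))  # (seq_len, time_began)
        # Tuple comparison: longer streak first, then the later start.
        if player not in best or candidate > best[player]:
            best[player] = candidate
    return best

print(longest_streaks(rows))
```

For player 756 the 100-point rows get grp values 0, 0, 1, 3, 3, which gives three islands; the two islands of length 2 tie, and the later one wins, matching the expected result in the question.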
This is a gaps-and-islands problem. You can use a conditional aggregate SUM as a window function to get a gap number,
then use the MAX and COUNT window functions again.
Query 1:
WITH CTE AS (
SELECT *,
SUM(CASE WHEN points = 100 THEN 1 END) OVER(PARTITION BY player_id ORDER BY time) -
SUM(1) OVER(ORDER BY time) RN
FROM T
)
SELECT player_id,
MAX(longest_streak) longest_streak,
MAX(cnt) longest_streak
FROM (
SELECT player_id,
MAX(time) OVER(PARTITION BY rn,player_id) longest_streak,
COUNT(*) OVER(PARTITION BY rn,player_id) cnt
FROM CTE
WHERE points > 0
) t1
GROUP BY player_id
Results:
| player_id | longest_streak | longest_streak |
|-----------|-----------------------------|----------------|
| 756 | 2018-12-04T19:57:48.961111Z | 2 |
| 399 | 2018-06-10T12:11:18.041521Z | 1 |
| 395 | 2018-12-02T08:56:06.83033Z | 1 |
One way to do this is to count the rows between the previous and next non-100 results. To get the lengths of the streaks:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
)
select s.*,
coalesce(next_seqnum, cnt + 1) - coalesce(prev_seqnum, 0) - 1 as length
from (select s.*,
max(seqnum) filter (where score <> 100) over (partition by player_id order by time) as prev_seqnum,
min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc) as next_seqnum
from s
) s
where score = 100;
You can then incorporate the other conditions:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
),
streaks as (
select s.*,
(next_seqnum - prev_seqnum - 1) as length
from (select s.*,
coalesce(max(seqnum) filter (where score <> 100) over (partition by player_id order by time), 0) as prev_seqnum,
coalesce(min(seqnum) filter (where score <> 100) over (partition by player_id order by time desc), cnt + 1) as next_seqnum
from s
) s
where score = 100
)
select distinct on (s.player_id) s.player_id, s.length as longest_streak, s.time as time_began
from streaks s
where s.seqnum = s.prev_seqnum + 1 -- keep only the first row of each streak
order by s.player_id, s.length desc, s.time desc;
Here is my answer
select
user_id,
non_streak,
streak,
ifnull(non_streak,streak) strk,
max(time) time
from (
Select
user_id,time,
points,
lag(points) over (partition by user_id order by time) prev_point,
case when points + lag(points) over (partition by user_id order by time) = 100 then 1 end as non_streak,
case when points + lag(points) over (partition by user_id order by time) > 100 then 1 end as streak
From players
) t where ifnull(non_streak,streak) is not null
group by 1,2,3
order by 1,2

Oracle Order By Seconds

I have a table like the following:
TIME Quantity
200918 122
200919 333
200919 500
181222 32
181223 43
The output I would like for the above data is:
Time Quantity
200919 955
181223 75
Essentially I want to group based on a tolerance of one second and sum up the quantity taking the latest time. Any pointers?
Thanks
You can use lag() and a cumulative sum to assign group numbers, then aggregate by group:
select max(time), sum(quantity)
from (select t.*,
sum(case when prev_time < time - 1 then 1 else 0 end) over (order by time) as grp
from (select t.*, lag(time) over (order by time) as prev_time
from t
) t
)
group by grp;
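Here is a runnable sketch of the same idea against the sample rows, using an in-memory SQLite database from Python purely so it can be executed anywhere (the question itself is about Oracle). The column is renamed ts and the aliases x, y and latest_ts are made up for the example. A new group starts whenever the gap to the previous row is more than one unit, and the running sum of those "gap" flags becomes the group number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts INTEGER, quantity INTEGER)")
# Sample rows from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(200918, 122), (200919, 333), (200919, 500), (181222, 32), (181223, 43)],
)

grouped = conn.execute("""
    SELECT MAX(ts) AS latest_ts, SUM(quantity) AS total_quantity
    FROM (SELECT x.*,
                 -- 1 when this row is more than one unit after the previous
                 -- row; the running sum of these flags is the group number
                 SUM(CASE WHEN prev_ts < ts - 1 THEN 1 ELSE 0 END)
                     OVER (ORDER BY ts) AS grp
          FROM (SELECT t.*, LAG(ts) OVER (ORDER BY ts) AS prev_ts
                FROM t) AS x) AS y
    GROUP BY grp
    ORDER BY latest_ts DESC
""").fetchall()

print(grouped)
```

One caveat: like the query above, this treats the time as a plain number. If the values are really HHMMSS, then 205959 and 210000 are one second apart but differ by 41 as integers, so you would need to convert to seconds (or a real timestamp) before comparing.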

How can I group this in SQL

I have this selection
Trip Sequence Shipment Place
=================================
102 10 4798 Amsterdam
102 20 4823 Utrecht
102 30 4831 Utrecht
102 40 4830 Rotterdam
102 50 4790 Rotterdam
102 60 4840 Utrecht
102 70 4810 Amsterdam
I want this grouped like this:
Trip Group Place
==========================
102 1 Amsterdam
102 2 Utrecht
102 3 Rotterdam
102 4 Utrecht
102 5 Amsterdam
How can I achieve this in SQL server?
The answer of @Giorgos Betos is great. Follow-up question: How can I assign these group numbers to the rows of the original table?
Try this:
SELECT Trip, Place,
ROW_NUMBER() OVER (ORDER BY MinSequence) AS [Group]
FROM (
SELECT Trip, Place, MIN(Sequence) AS MinSequence
FROM (
SELECT Trip, Place, Sequence,
ROW_NUMBER() OVER (ORDER BY Sequence) -
ROW_NUMBER() OVER (PARTITION BY Trip, Place
ORDER BY Sequence) AS grp
FROM mytable) AS x
GROUP BY Trip, Place, grp) AS t
Edit:
To get the ranking numbers in the original, ungrouped, table you can use DENSE_RANK:
SELECT Trip, Place, Sequence,
DENSE_RANK() OVER (ORDER BY grp) AS [Group]
FROM (
SELECT Trip, Place, Sequence,
MIN(Sequence) OVER (PARTITION BY Trip, Place, grp) AS grp
FROM (
SELECT Trip, Place, Sequence,
ROW_NUMBER() OVER (ORDER BY Sequence) -
ROW_NUMBER() OVER (PARTITION BY Trip, Place
ORDER BY Sequence) AS grp
FROM mytable) AS t) AS x
ORDER BY Sequence
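The queries above number the groups across the whole table, which is fine for a single trip. If the table holds several trips, you would probably want the numbering to restart per trip. Here is a sketch of that variation, run on an in-memory SQLite database from Python so it can be tried anywhere; the rows for trip 103 are made up for illustration, the Shipment column is left out because it is not needed, and the aliases grp_start and grp_no are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (Trip INTEGER, Sequence INTEGER, Place TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [(102, 10, "Amsterdam"), (102, 20, "Utrecht"), (102, 30, "Utrecht"),
     (102, 40, "Rotterdam"), (102, 50, "Rotterdam"), (102, 60, "Utrecht"),
     (102, 70, "Amsterdam"),
     # Hypothetical second trip, not in the question.
     (103, 10, "Utrecht"), (103, 20, "Utrecht"), (103, 30, "Amsterdam")],
)

rows = conn.execute("""
    SELECT Trip, Sequence, Place,
           DENSE_RANK() OVER (PARTITION BY Trip ORDER BY grp_start) AS grp_no
    FROM (
        SELECT Trip, Sequence, Place,
               -- first Sequence of each island, used to order the islands
               MIN(Sequence) OVER (PARTITION BY Trip, Place, grp) AS grp_start
        FROM (
            SELECT Trip, Sequence, Place,
                   ROW_NUMBER() OVER (PARTITION BY Trip ORDER BY Sequence)
                 - ROW_NUMBER() OVER (PARTITION BY Trip, Place ORDER BY Sequence) AS grp
            FROM mytable
        )
    )
    ORDER BY Trip, Sequence
""").fetchall()

for row in rows:
    print(row)
```

The only change from the DENSE_RANK query is the added PARTITION BY Trip in the first ROW_NUMBER() and in DENSE_RANK(), so each trip gets its own 1, 2, 3, ... numbering.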

SQL percentile function and joining 2 queries:

I am trying to retrieve the max, min and the 90th percentile from a table of results.
I want the 90th percentile for duration based on the timestamp_ in asc order.
My Table looks like this:
TIMESTAMP_ DURATION
24/01/2000 12:04:45.120 454
26/10/2000 12:13:49.440 301
06/01/2001 15:12:05.760 245
23/01/2001 10:56:55.680 462
16/02/2001 12:10:39.360 376
19/04/2001 09:22:45.120 53
13/05/2001 12:36:34.560 330
30/05/2001 14:47:45.600 796
07/08/2001 08:51:47.520 471
25/08/2001 14:24:08.640 821
I have 2 queries to retrieve this info, but is there a simpler solution using one query? Here are my queries:
Select MIN(DURATION), MAX(DURATION)
From t
;
Select DURATION as nthPercentile from t
Where TIMESTAMP_ =
(
Select
Percentile_disc(0.90) within group (order by TIMESTAMP_) AS nth
From t
)
Thanks
Here is one approach:
Select MAX(case when t.TIMESTAMP_ = tsum.nth then t.DURATION end) as nthPercentile,
MAX(MAXDUR) as MAXDUR, MAX(MINDUR) as MINDUR
from (Select MIN(DURATION) as MINDUR, MAX(DURATION) as MAXDUR,
Percentile_disc(0.90) within group (order by TIMESTAMP_) AS nth
from t
) tsum cross join
t;
You can use analytics:
SELECT duration, max_duration, min_duration
FROM (SELECT duration, ts,
Percentile_disc(0.90) within GROUP(ORDER BY ts) OVER() nth,
MAX(duration) OVER() max_duration,
MIN(duration) OVER() min_duration
FROM t)
WHERE ts = nth;
DURATION MAX_DURATION MIN_DURATION
---------- ------------ ------------
471 821 53
However, I'm not really sure that this is the result you want. Why would you want to order by timestamp? The resulting duration is the duration of the 90th percentile row based on timestamp, not duration.
A more straightforward and logical result may be what you need:
SELECT Percentile_disc(0.90) WITHIN GROUP(ORDER BY duration) nth,
MAX(duration) max_duration,
MIN(duration) min_duration
FROM t;
NTH MAX_DURATION MIN_DURATION
---------- ------------ ------------
796 821 53
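To show where the two different answers (471 and 796) come from, here is a small Python sketch of what PERCENTILE_DISC does: sort the rows, then return the first one whose cumulative fraction reaches the requested percentile. The helper name percentile_disc and the tuple layout are made up for the example, and because the sample table is already listed in ascending timestamp order, the row position stands in for TIMESTAMP_.

```python
# Durations in the order they appear in the question's table.
durations = [454, 301, 245, 462, 376, 53, 330, 796, 471, 821]
rows = list(enumerate(durations))  # (position_in_time_order, duration)

def percentile_disc(p, rows, key):
    """Return the first row, in key order, whose cumulative fraction is >= p."""
    ordered = sorted(rows, key=key)
    n = len(ordered)
    for k, row in enumerate(ordered, start=1):
        if k / n >= p:
            return row
    return None

# Ordered by timestamp: picks the 9th row in time, whatever its duration is.
by_time = percentile_disc(0.90, rows, key=lambda r: r[0])
# Ordered by duration: picks the 9th smallest duration.
by_duration = percentile_disc(0.90, rows, key=lambda r: r[1])

print(by_time[1], by_duration[1])
```

Ordering by timestamp just selects a row by its position in time, so its duration (471) says nothing about the distribution of durations; ordering by duration gives 796, which is the 90th percentile in the usual sense.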