How to output the 25th, 50th, and 75th percentiles in a single Teradata query? - sql

I got stuck on something similar a few hours ago and worked out some less messy code that outputs the 25th, 50th, and 75th percentiles in a single Teradata query. It can be extended to produce a five-number summary. For the minimum and maximum, change the static values (0.01 and 0.99 in the code) according to your population estimate.
Someone had asked for an elegant approach somewhere, so I'm sharing mine.
Here's the code:
SELECT MAX(PER_MIN) AS PER_MIN,
MAX(PER_25) AS PER_25,
MAX(PER_50) AS PER_50,
MAX(PER_75) AS PER_75,
MAX(PER_MAX) AS PER_MAX
FROM (SELECT CASE WHEN ROW_NUMBER() OVER(ORDER BY DURATION_MACRO_CURR ASC) = CAST(COUNT(*) OVER() * 0.01 AS INT) THEN DURATION_MACRO_CURR END AS PER_MIN,
CASE WHEN ROW_NUMBER() OVER(ORDER BY DURATION_MACRO_CURR ASC) = CAST(COUNT(*) OVER() * 0.25 AS INT) THEN DURATION_MACRO_CURR END AS PER_25,
CASE WHEN ROW_NUMBER() OVER(ORDER BY DURATION_MACRO_CURR ASC) = CAST(COUNT(*) OVER() * 0.50 AS INT) THEN DURATION_MACRO_CURR END AS PER_50,
CASE WHEN ROW_NUMBER() OVER(ORDER BY DURATION_MACRO_CURR ASC) = CAST(COUNT(*) OVER() * 0.75 AS INT) THEN DURATION_MACRO_CURR END AS PER_75,
CASE WHEN ROW_NUMBER() OVER(ORDER BY DURATION_MACRO_CURR ASC) = CAST(COUNT(*) OVER() * 0.99 AS INT) THEN DURATION_MACRO_CURR END AS PER_MAX
FROM PROD_EXP_DL_CVM.PROD_CVM
WHERE PW_END_DATE = '2016-10-18'
) BASE
Here's the desired output:

I would do this using conditional aggregation:
select min(DURATION_MACRO_CURR) as min_val,
min(case when seqnum / 0.25 >= cnt then DURATION_MACRO_CURR end) as percentile_25,
min(case when seqnum / 0.50 >= cnt then DURATION_MACRO_CURR end) as percentile_50,
min(case when seqnum / 0.75 >= cnt then DURATION_MACRO_CURR end) as percentile_75,
max(DURATION_MACRO_CURR) as max_val
from (select pc.*,
row_number() over (order by DURATION_MACRO_CURR) as seqnum,
count(*) over () as cnt
from PROD_EXP_DL_CVM.PROD_CVM pc
where pc.PW_END_DATE = '2016-10-18'
) pc;
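To check the arithmetic outside the database, the rule used in the query above (take the smallest value whose row number divided by p reaches the row count) can be sketched in Python. This is only an illustration of that row-number rule; the function name and the sample durations are hypothetical and do not come from the original table.

```python
def percentile_by_rownum(values, p):
    """Smallest value whose 1-based row number r satisfies r / p >= count.

    Mirrors min(case when seqnum / p >= cnt then value end) from the query.
    """
    ordered = sorted(values)
    cnt = len(ordered)
    for seqnum, value in enumerate(ordered, start=1):
        if seqnum / p >= cnt:
            return value
    return None  # only reached for an empty input


# Hypothetical sample durations, not taken from the original table.
durations = [12, 7, 3, 25, 18, 9, 30, 15]
summary = {
    "min_val": min(durations),
    "p25": percentile_by_rownum(durations, 0.25),
    "p50": percentile_by_rownum(durations, 0.50),
    "p75": percentile_by_rownum(durations, 0.75),
    "max_val": max(durations),
}
print(summary)
```

With 8 rows, the 25th, 50th, and 75th percentiles land on rows 2, 4, and 6 of the sorted data.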

Related

SQL calculation with previous row + current row

I want to make a calculation based on the Excel file. I managed to obtain the first two records with LAG (as you can see in the second screenshot), but I'm out of ideas for how to proceed and need help. I just need each row of the Calculation column to use the previous row's value, and I want it calculated automatically over all the dates. I also tried applying LAG to the calculation manually, but the result was one more row of data instead of NULL. This is a headache.
LAG(Data ingested, 1) OVER ( ORDER BY DATE ASC ) AS LAG
You seem to want cumulative sums:
select t.*,
(sum(reconciliation + aves - microa) over (order by date) -
first_value(aves - microa) over (order by date)
) as calculation
from CalcTable t;
Here is a SQL Fiddle.
EDIT:
Based on your comment, you just need to define a group:
select t.*,
(sum(reconciliation + aves - microa) over (partition by grp order by date) -
first_value(aves - microa) over (partition by grp order by date)
) as calculation
from (select t.*,
count(nullif(reconciliation, 0)) over (order by date) as grp
from CalcTable t
) t
order by date;
In my opinion this can be solved with a "gaps and islands" approach. Whenever Reconciliation > 0, mark a gap. SUM(gap) OVER (ORDER BY [Date]) turns the gap markers into island group numbers. In the outer query, the sum_over column (which corresponds to Calculation) is a cumulative sum partitioned by those island groups.
with
gap_cte as (
select *, case when [Reconciliation]>0 then 1 else 0 end gap
from CalcTable),
grp_cte as (
select *, sum(gap) over (order by [Date]) grp
from gap_cte)
select *, sum([Reconciliation]+
(case when gap=1 then 0 else Aves end)-
(case when gap=1 then 0 else Microa end))
over (partition by grp order by [Date]) sum_over
from grp_cte;
[EDIT]
The CASE expression could be CROSS APPLY'ed instead:
with
grp_cte as (
select c.*, v.gap, sum(v.gap) over (order by [Date]) grp
from #CalcTable c
cross apply (values (case when [Reconciliation]>0 then 1 else 0 end)) v(gap))
select *, sum([Reconciliation]+
(case when gap=1 then 0 else Aves end)-
(case when gap=1 then 0 else Microa end))
over (partition by grp order by [Date]) sum_over
from grp_cte;
Here is a fiddle
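The gaps-and-islands steps from the queries above can be traced in a short Python sketch: flag a gap where Reconciliation > 0, turn the running count of gaps into a group number, then keep a running sum that restarts with each group. The function name and sample rows are hypothetical, and the sketch assumes the rows are already sorted by date with no duplicate dates.

```python
from itertools import accumulate


def running_calculation(rows):
    """rows: list of (date, reconciliation, aves, microa), sorted by date.

    Returns (date, grp, sum_over) tuples, following the CTE logic:
    gap = 1 when reconciliation > 0, grp = running sum of gap,
    sum_over = running sum within grp.
    """
    gaps = [1 if rec > 0 else 0 for _, rec, _, _ in rows]
    grps = list(accumulate(gaps))  # sum(gap) over (order by date)
    result = []
    running, current_grp = 0, None
    for (date, rec, aves, microa), gap, grp in zip(rows, gaps, grps):
        if grp != current_grp:  # new island: restart the running sum
            running, current_grp = 0, grp
        # On a gap row only the reconciliation value counts.
        running += rec + (0 if gap else aves) - (0 if gap else microa)
        result.append((date, grp, running))
    return result


# Hypothetical sample data (the original spreadsheet is not available).
sample = [
    ("2021-01-01", 100, 10, 5),
    ("2021-01-02", 0, 20, 5),
    ("2021-01-03", 0, 10, 0),
    ("2021-01-04", 50, 7, 3),
    ("2021-01-05", 0, 5, 10),
]
for row in running_calculation(sample):
    print(row)
```

The running sum restarts at 50 on 2021-01-04 because that row has a non-zero reconciliation and therefore opens a new group.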

Calculate percent changes in contiguous ranges in Postgresql

I need to calculate the percent change in price over contiguous ranges. For example, if the price starts moving up or down and I have a sequence of increasing or decreasing values, I need to take the first and last values of that sequence and calculate the change.
I'm using the lag window function to calculate the direction. My problem is that I can't generate a unique rank for each sequence, which I need in order to calculate the percent changes.
I tried combinations of RANK, ROW_NUMBER, etc., with no luck.
Here's my query:
WITH partitioned AS (
SELECT
*,
lag(price, 1) over(ORDER BY time) AS lag_price
FROM prices
),
sequenced AS (
SELECT
*,
CASE
WHEN price > lag_price THEN 'up'
WHEN price < lag_price THEN 'down'
ELSE 'equal'
END
AS direction
FROM partitioned
),
ranked AS (
SELECT
*,
-- Here is the problem:
-- I need to calculate a unique rnk value for each sequence
DENSE_RANK() OVER ( PARTITION BY direction ORDER BY time) + ROW_NUMBER() OVER ( ORDER BY time DESC) AS rnk
-- DENSE_RANK() OVER ( PARTITION BY seq ORDER BY time),
-- ROW_NUMBER() OVER ( ORDER BY seq, time DESC),
-- ROW_NUMBER() OVER ( ORDER BY seq),
-- RANK() OVER ( ORDER BY seq)
FROM sequenced
),
changed AS (
SELECT *,
FIRST_VALUE(price) OVER(PARTITION BY rnk ) first_price,
LAST_VALUE(price) OVER(PARTITION BY rnk ) last_price,
(LAST_VALUE(price) OVER(PARTITION BY rnk ) / FIRST_VALUE(price) OVER(PARTITION BY rnk ) - 1) * 100 AS percent_change
FROM ranked
)
SELECT
*
FROM changed
ORDER BY time DESC;
and SQLFiddle with sample data
If anyone is interested, here's a solution from another forum:
with ct1 as /* detecting direction: up, down, equal */
(
select
price, time,
case
when lag(price) over (order by time) < price then 'up'
when lag(price) over (order by time) > price then 'down'
else 'equal'
end as dir
from
prices
)
, ct2 as /* setting reset points */
(
select
price, time, dir,
case
when coalesce(lag(dir) over (order by time), 'none') <> dir
then 1 else 0
end as rst
from
ct1
)
, ct3 as /* making groups */
(
select
price, time, dir,
sum(rst) over (order by time) as grp
from
ct2
)
select /* calculates min, max price per group */
price, time, dir,
min(price) over (partition by grp) as min_price,
max(price) over (partition by grp) as max_price
from
ct3
order by
time desc;
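The grouping steps in the query above (direction, reset points, running group number) can be followed in a small Python sketch. The query itself returns the minimum and maximum price per group; the first-to-last percent change computed here is an added step based on what the question asks for, not something the posted query does. The function name and sample prices are hypothetical. Note that, with this grouping, a run starts at the first row that moved, not at the turning point before it.

```python
from itertools import groupby


def runs_with_change(prices):
    """prices: list of (time, price) sorted by time.

    Returns (grp, direction, first_price, last_price, percent_change).
    """
    # ct1: direction relative to the previous row ('equal' for the first row).
    dirs, prev = [], None
    for _, price in prices:
        if prev is None or price == prev:
            dirs.append("equal")
        elif price > prev:
            dirs.append("up")
        else:
            dirs.append("down")
        prev = price

    # ct2 + ct3: a reset wherever the direction changes, then a running sum.
    grp, grps, prev_dir = 0, [], None
    for d in dirs:
        if d != prev_dir:
            grp += 1
        grps.append(grp)
        prev_dir = d

    # Added step: first and last price of each group, and the change between them.
    out = []
    for g, members in groupby(zip(grps, dirs, prices), key=lambda m: m[0]):
        members = list(members)
        first, last = members[0][2][1], members[-1][2][1]
        out.append((g, members[0][1], first, last, round((last / first - 1) * 100, 2)))
    return out


# Hypothetical sample data.
sample = [(1, 10.0), (2, 12.0), (3, 15.0), (4, 13.0), (5, 11.0), (6, 11.0)]
for row in runs_with_change(sample):
    print(row)
```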

NTILE() in BigQuery for non-uniform buckets

I'm trying to perform RFM segmentation on the Google Merchandise Store sample dataset in BigQuery. In my SQL query, NTILE(5) divides the rows into 5 buckets based on row ordering and returns the bucket number assigned to each row, so the buckets are of equal size. I would like to find out how to create buckets of different sizes instead. For example, bucket 1 contains the bottom 10% of records, bucket 2 contains the next 20%, and so on. Thank you!
#standardSQL
SELECT
fullVisitorId,
NTILE(5) OVER (ORDER BY last_order_date) AS rfm_recency,
NTILE(5) OVER (ORDER BY count_order) AS rfm_frequency,
NTILE(5) OVER (ORDER BY avg_amount) AS rfm_monetary
FROM (
SELECT
fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE
_table_suffix BETWEEN "101"
AND "801"
AND totals.totalTransactionRevenue IS NOT NULL
GROUP BY
fullVisitorId )
You can use row_number() and count(*) to define your own buckets:
SELECT fullVisitorId,
(CASE WHEN seqnum_r <= 0.1 * cnt THEN 1
WHEN seqnum_r <= 0.3 * cnt THEN 2
ELSE 3
END) as bin_r,
. . .
FROM (SELECT fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
(AVG(totals.totalTransactionRevenue) / 1000000) AS avg_amount,
COUNT(*) OVER () as cnt,
ROW_NUMBER() OVER (ORDER BY MAX(date)) as seqnum_r,
ROW_NUMBER() OVER (ORDER BY COUNT(*)) as seqnum_f,
ROW_NUMBER() OVER (ORDER BY AVG(totals.totalTransactionRevenue)) as seqnum_m
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE _table_suffix BETWEEN "101" AND "801" AND
totals.totalTransactionRevenue IS NOT NULL
GROUP BY fullVisitorId
) rfm
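The CASE expression above compares each row number with a share of the total count. A minimal Python sketch of the same idea, assuming cumulative cutoffs of 10% and 30% as in the answer, looks like this. The function name and sample values are hypothetical.

```python
def bucket_by_share(values, cutoffs):
    """Assign 1-based bucket numbers from cumulative share cutoffs.

    cutoffs such as [0.1, 0.3] mean: bottom 10% -> bucket 1,
    next 20% -> bucket 2, everything else -> bucket 3.
    Buckets are returned in the original order of `values`.
    """
    cnt = len(values)
    # Indices in ascending value order, like ROW_NUMBER() OVER (ORDER BY ...).
    order = sorted(range(cnt), key=lambda i: values[i])
    buckets = [0] * cnt
    for seqnum, idx in enumerate(order, start=1):
        bucket = len(cutoffs) + 1  # the ELSE branch
        for b, share in enumerate(cutoffs, start=1):
            if seqnum <= share * cnt:
                bucket = b
                break
        buckets[idx] = bucket
    return buckets


# Hypothetical sample: 10 values, so the buckets hold 1, 2 and 7 rows.
values = [50, 10, 90, 30, 70, 20, 100, 40, 80, 60]
print(bucket_by_share(values, [0.1, 0.3]))
```

Because the cutoffs are cumulative, adding more WHEN branches (for example 0.6 and 0.8) yields more buckets without changing the earlier ones.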
The following is for BigQuery Standard SQL and assumes your initial query works for you. The SQL UDF NON_UNIFORM_BUCKET() does the trick:
#standardSQL
CREATE TEMP FUNCTION NON_UNIFORM_BUCKET(i INT64) AS (
CASE
WHEN i = 1 THEN 1
WHEN i IN (2, 3) THEN 2
WHEN i IN (4, 5, 6) THEN 3
WHEN i = 7 THEN 4
ELSE 5
END
);
SELECT
fullVisitorId,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY last_order_date)) AS rfm_recency,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY count_order)) AS rfm_frequency,
NON_UNIFORM_BUCKET(NTILE(10) OVER (ORDER BY avg_amount)) AS rfm_monetary
FROM (
SELECT
fullVisitorId,
MAX(date) AS last_order_date,
COUNT(*) AS count_order,
AVG(totals.totalTransactionRevenue)/1000000 AS avg_amount
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170*`
WHERE
_table_suffix BETWEEN "101"
AND "801"
AND totals.totalTransactionRevenue IS NOT NULL
GROUP BY
fullVisitorId )
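To see which shares of rows the UDF above produces, here is a rough Python sketch that computes deciles the way NTILE(10) usually does (tile sizes differ by at most one, larger tiles first) and then applies the same mapping. The helper names are hypothetical, and the 20-row input is only an example.

```python
from collections import Counter


def ntile(n_rows, n_tiles):
    """Tile number for each 1-based row position in sort order."""
    base, extra = divmod(n_rows, n_tiles)
    tiles = []
    for tile in range(1, n_tiles + 1):
        size = base + (1 if tile <= extra else 0)  # remainder goes to the first tiles
        tiles.extend([tile] * size)
    return tiles


def non_uniform_bucket(i):
    """Same mapping as the NON_UNIFORM_BUCKET SQL UDF."""
    if i == 1:
        return 1
    if i in (2, 3):
        return 2
    if i in (4, 5, 6):
        return 3
    if i == 7:
        return 4
    return 5


# 20 rows -> each decile holds 2 rows.
buckets = [non_uniform_bucket(t) for t in ntile(20, 10)]
bucket_sizes = dict(sorted(Counter(buckets).items()))
print(bucket_sizes)
```

With this mapping the five buckets cover 10%, 20%, 30%, 10%, and 30% of the rows; changing the IN lists changes the shares in steps of 10%.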

Obtaining multiple percentiles (percentile_cont equivalent) in one pass within Teradata

I understand that we can rewrite percentile_cont within Teradata as:
SELECT
part_col
,data_col
+ ((MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col)
* (((COUNT(*) OVER (PARTITION BY part_col) - 1) * x) MOD 1)) AS percentile_cont
FROM tab
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col)
= CAST((COUNT(*) OVER (PARTITION BY part_col) - 1) * x AS INT) + 1;
See this very helpful discussion for more information.
Given that replacing x with 0.90 returns the 90th percentile, is there an elegant way of extending this to return multiple percentiles in one pass?
For example, say I want to return the 25th, 50th, and 75th percentiles in one pass. Is this possible? It seems I would need multiple QUALIFY clauses. Similarly, if I want the equivalent of grouping by multiple columns, is that akin to passing more columns in the PARTITION BY?
-- SQL:2008 Equivalent pseudo-code
SELECT
part_col_a
,part_col_b
,PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_col) AS p25
,PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_col) AS p50
,PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_col) AS p75
FROM tab
GROUP BY
part_col_a
,part_col_b
You should read my blog post in full; the final query does exactly what you want :-)
SELECT part_col
,MIN(pc25) OVER (PARTITION BY part_col) AS quartile_1
,MIN(pc50) OVER (PARTITION BY part_col) AS quartile_2
,MIN(pc75) OVER (PARTITION BY part_col) AS quartile_3
FROM
(
SELECT
part_col
,COUNT(*) OVER (PARTITION BY part_col) - 1 AS N
,ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY data_col) - 1 AS rowno
,MIN(data_col) OVER (PARTITION BY part_col ORDER BY data_col ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) - data_col AS diff
,CASE
WHEN rowno = CAST(N * 0.25 AS INT)
THEN data_col +(((N * 0.25) MOD 1) * diff)
END AS pc25
,CASE
WHEN rowno = CAST(N * 0.50 AS INT)
THEN data_col +(((N * 0.50) MOD 1) * diff)
END AS pc50
,CASE
WHEN rowno = CAST(N * 0.75 AS INT)
THEN data_col +(((N * 0.75) MOD 1) * diff)
END AS pc75
FROM tab
QUALIFY rowno = CAST(N * 0.25 AS INT)
OR rowno = CAST(N * 0.50 AS INT)
OR rowno = CAST(N * 0.75 AS INT)
) AS dt
QUALIFY ROW_NUMBER() OVER (PARTITION BY part_col ORDER BY part_col) = 1
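The interpolation used in the query above can be checked with a short Python sketch: N is the row count minus one, the zero-based row CAST(N * p AS INT) is the lower neighbour, and (N * p) MOD 1 is the weight applied to the difference to the next row. The function name and sample data are hypothetical. The guard for the last row is an addition of this sketch, since the SQL has no following row to read there.

```python
def percentile_cont(values, p):
    """Linear-interpolation percentile for one partition, 0 <= p <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered) - 1      # N in the query: row count minus one
    pos = n * p
    rowno = int(pos)          # CAST(N * p AS INT): truncation
    frac = pos - rowno        # (N * p) MOD 1
    if rowno >= n:            # added guard: the last row has no following row
        return float(ordered[-1])
    diff = ordered[rowno + 1] - ordered[rowno]
    return ordered[rowno] + frac * diff


# Hypothetical sample partition.
data = [40, 10, 30, 20]
quartiles = [percentile_cont(data, p) for p in (0.25, 0.50, 0.75)]
print(quartiles)
```

For the four sample values, N is 3, so the 25th percentile sits 0.75 of the way from 10 to 20.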

Doing a comparison using the previous row?

I'm trying to work out an efficient way of comparing two rows in SQL Server 2008. I need to write a query which finds all rows in the Movement table which have Speed < 10 N consecutive times.
The structure of the table is:
EventTime
Speed
If the data were:
2012-02-05 13:56:36.980, 2
2012-02-05 13:57:36.980, 11
2012-02-05 13:57:46.980, 2
2012-02-05 13:59:36.980, 2
2012-02-05 14:06:36.980, 22
2012-02-05 15:56:36.980, 2
Then it would return rows 3/4 (13:57:46.980 / 13:59:36.980) if I looked for 2 consecutive rows, and would return nothing if I looked for three consecutive rows. The order of the data is EventTime/DateTime only.
Any help you could give me would be great. I'm considering using cursors but they're usually pretty inefficient. Also, this table is approximately 10m rows in size, so the more efficient the better! :)
Thanks!
DECLARE
@n INT,
@speed_limit INT
SELECT
@n = 5,
@speed_limit = 10
;WITH
partitioned AS
(
SELECT
*,
CASE WHEN speed < @speed_limit THEN 1 ELSE 0 END AS PartitionID
FROM
Movement
)
,
sequenced AS
(
SELECT
ROW_NUMBER() OVER ( ORDER BY EventTime) AS MasterSeqID,
ROW_NUMBER() OVER (PARTITION BY PartitionID ORDER BY EventTime) AS PartIDSeqID,
*
FROM
partitioned
)
,
filter AS
(
SELECT
MasterSeqID - PartIDSeqID AS GroupID,
MIN(MasterSeqID) AS GroupFirstMastSeqID,
MAX(MasterSeqID) AS GroupFinalMastSeqID
FROM
sequenced
WHERE
PartitionID = 1
GROUP BY
MasterSeqID - PartIDSeqID
HAVING
COUNT(*) >= @n
)
)
SELECT
sequenced.*
FROM
filter
INNER JOIN
sequenced
ON sequenced.MasterSeqID >= filter.GroupFirstMastSeqID
AND sequenced.MasterSeqID <= filter.GroupFinalMastSeqID
Alternative final steps (inspired by @t-clausen-dk) that avoid an additional JOIN. I would test both to see which performs better.
,
filter AS
(
SELECT
MasterSeqID - PartIDSeqID AS GroupID,
COUNT(*) OVER (PARTITION BY MasterSeqID - PartIDSeqID) AS GroupSize,
*
FROM
sequenced
WHERE
PartitionID = 1
)
SELECT
*
FROM
filter
WHERE
GroupSize >= @n
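The key step in the queries above is that MasterSeqID - PartIDSeqID stays constant within a run of consecutive slow rows. A minimal Python sketch of that row-number-difference idea, run against the sample data from the question, might look like this. The function name is hypothetical, and the rows are assumed to be sorted by EventTime already.

```python
from collections import defaultdict


def slow_runs(rows, speed_limit, n):
    """rows: list of (event_time, speed) sorted by event_time.

    Returns the rows that belong to runs of at least n consecutive
    readings below speed_limit.
    """
    part_seq = {0: 0, 1: 0}  # row counter per PartitionID
    groups = defaultdict(list)
    for master_seq, (event_time, speed) in enumerate(rows, start=1):
        partition_id = 1 if speed < speed_limit else 0
        part_seq[partition_id] += 1
        if partition_id == 1:
            # Constant within a run of consecutive slow rows.
            group_id = master_seq - part_seq[1]
            groups[group_id].append((event_time, speed))
    return [row for g in sorted(groups) for row in groups[g] if len(groups[g]) >= n]


movement = [
    ("2012-02-05 13:56:36.980", 2),
    ("2012-02-05 13:57:36.980", 11),
    ("2012-02-05 13:57:46.980", 2),
    ("2012-02-05 13:59:36.980", 2),
    ("2012-02-05 14:06:36.980", 22),
    ("2012-02-05 15:56:36.980", 2),
]
print(slow_runs(movement, speed_limit=10, n=2))
```

With n=2 this returns rows 3 and 4, and with n=3 it returns nothing, matching the expectation stated in the question.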
declare @t table(EventTime datetime, Speed int)
insert @t values('2012-02-05 13:56:36.980', 2)
insert @t values('2012-02-05 13:57:36.980', 11)
insert @t values('2012-02-05 13:57:46.980', 2)
insert @t values('2012-02-05 13:59:36.980', 2)
insert @t values('2012-02-05 14:06:36.980', 22)
insert @t values('2012-02-05 15:56:36.980', 2)
declare @N int = 1
;with a as
(
select EventTime, Speed, row_number() over (order by EventTime) rn from @t
), b as
(
select EventTime, Speed, 1 grp, rn from a where rn = 1
union all
select a.EventTime, a.Speed, case when a.speed < 10 and b.speed < 10 then grp else grp + 1 end, a.rn
from a join b on a.rn = b.rn+1
), c as
(
select EventTime, Speed, count(*) over (partition by grp) cnt from b
)
select * from c
where cnt > @N
OPTION (MAXRECURSION 0) -- Thx Dems
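The recursive CTE above walks the rows in order and keeps the group number only while both the current and the previous speed are below 10. The same scan can be sketched in Python as a simple loop. The function name is hypothetical, and the strict comparison against n follows the query's cnt > @N.

```python
from collections import Counter


def runs_by_scan(rows, limit, n):
    """rows: list of (event_time, speed) sorted by event_time.

    Returns rows whose group holds more than n rows, where a group
    continues only while consecutive speeds are both below limit.
    """
    grp, grps, prev_speed = 1, [], None
    for _, speed in rows:
        # Start a new group unless this row and the previous one are both slow.
        if prev_speed is not None and not (speed < limit and prev_speed < limit):
            grp += 1
        grps.append(grp)
        prev_speed = speed
    counts = Counter(grps)  # count(*) over (partition by grp)
    return [row for row, g in zip(rows, grps) if counts[g] > n]


# Same sample data as the table variable in the answer.
events = [
    ("2012-02-05 13:56:36.980", 2),
    ("2012-02-05 13:57:36.980", 11),
    ("2012-02-05 13:57:46.980", 2),
    ("2012-02-05 13:59:36.980", 2),
    ("2012-02-05 14:06:36.980", 22),
    ("2012-02-05 15:56:36.980", 2),
]
print(runs_by_scan(events, limit=10, n=1))
```

A plain loop visits each row once, whereas the recursive CTE joins one row per recursion step, which is worth keeping in mind for a table of roughly 10 million rows.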
Almost the same idea as Dems, with a small difference:
select * from (
select eventtime, speed, rnk, new_rnk,
rnk - new_rnk,
max(rnk) over (partition by speed, new_rnk-rnk) -
min(rnk) over (partition by speed, new_rnk-rnk) + 1 as no_consec
from (
select eventtime, rnk, speed,
row_number() over (partition by speed order by eventtime) as new_rnk
from (
select eventtime, speed,
row_number() over (order by eventtime) as rnk
from a
) a
where a.speed < 5
)
order by eventtime
)
where no_consec >= 2;
5 is the speed limit and 2 is the minimum number of consecutive events.
I used numbers instead of dates to keep the table creation script simple.
SQLFIDDLE
EDIT:
To answer the comments, I've added three columns to the first inner query. To get only the first row of each group, add pos_in_group = 1 to the WHERE clause; the distance is then at your fingertips.
SQLFIDDLE
select eventtime, speed, min_date, max_date, pos_in_group
from (
select eventtime, speed, rnk, new_rnk,
rnk - new_rnk,
row_number() over (partition by speed, new_rnk-rnk order by eventtime) pos_in_group,
min(eventtime) over (partition by speed, new_rnk-rnk) min_date,
max(eventtime) over (partition by speed, new_rnk-rnk) max_date,
max(rnk) over (partition by speed, new_rnk-rnk) -
min(rnk) over (partition by speed, new_rnk-rnk) + 1 as no_consec
from (
select eventtime, rnk, speed,
row_number() over (partition by speed order by eventtime) as new_rnk
from (
select eventtime, speed,
row_number() over (order by eventtime) as rnk
from a
) a
where a.speed < 5
)
order by eventtime
)
where no_consec > 1;