SQL percentile function and joining 2 queries

I am trying to retrieve the max, min and the 90th percentile from a table of results.
I want the 90th percentile for duration based on the timestamp_ in asc order.
My Table looks like this:
TIMESTAMP_ DURATION
24/01/2000 12:04:45.120 454
26/10/2000 12:13:49.440 301
06/01/2001 15:12:05.760 245
23/01/2001 10:56:55.680 462
16/02/2001 12:10:39.360 376
19/04/2001 09:22:45.120 53
13/05/2001 12:36:34.560 330
30/05/2001 14:47:45.600 796
07/08/2001 08:51:47.520 471
25/08/2001 14:24:08.640 821
I have 2 queries to retrieve this info, but is there a simpler solution using one query? Here are my queries:
Select MIN(DURATION), MAX(DURATION)
From t
;
Select DURATION as nthPercentile from t
Where TIMESTAMP_ =
(
Select
Percentile_disc(0.90) within group (order by TIMESTAMP_) AS nth
From t
)
Thanks

Here is one approach:
Select MAX(case when t.TIMESTAMP_ = tsum.nth then t.DURATION end) as nthPercentile,
       MAX(tsum.MAXDUR) as MAXDUR, MAX(tsum.MINDUR) as MINDUR
from (Select MIN(DURATION) as MINDUR, MAX(DURATION) as MAXDUR,
             Percentile_disc(0.90) within group (order by TIMESTAMP_) AS nth
      from t
     ) tsum cross join
     t;
The derived table computes the three aggregates once; the cross join then makes them available on every row of t, so the conditional MAX can pick out the duration at the percentile timestamp.

You can use analytics:
SQL> SELECT duration, max_duration, min_duration
  2    FROM (SELECT duration, timestamp_ AS ts,
  3                 Percentile_disc(0.90) within GROUP(ORDER BY timestamp_) OVER() nth,
  4                 MAX(duration) OVER() max_duration,
  5                 MIN(duration) OVER() min_duration
  6            FROM t)
  7   WHERE ts = nth;
DURATION MAX_DURATION MIN_DURATION
---------- ------------ ------------
471 821 53
However, I'm not really sure that this is the result you want. Why would you order by timestamp? The resulting duration is the duration of the 90th-percentile row based on timestamp, not duration: with your ten sample rows, the 90th-percentile timestamp is 07/08/2001 08:51:47.520, so the query returns that row's duration, 471.
A more straightforward and logical result may be what you need:
SQL> SELECT Percentile_disc(0.90) WITHIN GROUP(ORDER BY duration) nth,
2 MAX(duration) max_duration,
3 MIN(duration) min_duration
4 FROM t;
NTH MAX_DURATION MIN_DURATION
---------- ------------ ------------
796 821 53
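(With the ten durations sorted ascending, percentile_disc(0.90) returns the smallest value whose cumulative distribution reaches 0.9, i.e. the 9th of 10 values: 796.)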

Related

View and complex query count distinct locations employee stayed in SQL

I have a view, view_1, which looks like this:
id Office Begin_dt Last_dt Days
1 Office1 2019-09-02 2019-09-08 6
1 Office2 2019-09-09 2019-09-30 21
1 Office1 2019-10-01 2019-10-31 30
5 Office3 2017-10-01 2017-10-16 15
5 Office2 2017-10-17 2017-10-30 13
5 Office2 2017-11-01 2017-11-30 30
I want to find the office where each employee stayed for the maximum time, and also the number of distinct office locations he stayed in.
Expected output
id Max_time_in_Office Days Distinct_office_locations
1 Office1 36 2
5 Office2 43 2
So id 1 spends 6 and 30 days, 36 days overall, in Office1; his maximum time is in Office1, and there are 2 distinct locations.
id 5 spends 13 and 30 days, 43 in total; his maximum time is in Office2, again with 2 distinct locations.
Code tried
select v.*
from (select v.id, v.office, sum(days) as Max_time_in_Office, count(Office) as Distinct_office_locations,
rank() over (partition by id order by sum(days) desc) as seqnum
from view_1 v
group by id, office
) v
where seqnum = 1;
Output obtained
id Max_time_in_Office Days Distinct_office_locations
1 Office1 36 1
5 Office2 43 1
So I am getting the wrong output. Can someone please help?
Close. You want a window function:
select v.*
from (select v.id, v.office, sum(days) as Max_time_in_Office,
count(*) over (partition by id) as Distinct_office_locations,
rank() over (partition by id order by sum(days) desc) as seqnum
from view_1 v
group by id, office
) v
where seqnum = 1;
Basically the window function is counting the number of rows returned after the aggregation -- and there is one row per office.
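To make that concrete, here is a minimal sketch (Postgres syntax assumed, with the sample data inlined as a hypothetical view_1) of what the aggregated subquery produces before the outer filter:
with view_1 (id, office, days) as (
    -- hypothetical inline copy of the view's data
    values (1, 'Office1', 6), (1, 'Office2', 21), (1, 'Office1', 30),
           (5, 'Office3', 15), (5, 'Office2', 13), (5, 'Office2', 30)
)
select id, office, sum(days) as Max_time_in_Office,
       count(*) over (partition by id) as Distinct_office_locations,
       rank() over (partition by id order by sum(days) desc) as seqnum
from view_1
group by id, office;
-- id 1 yields (Office1, 36, 2, 1) and (Office2, 21, 2, 2); after GROUP BY there
-- is one row per office, so the window count equals the distinct-office count,
-- and filtering seqnum = 1 keeps (Office1, 36, 2).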
You could use the apply operator to achieve that:
select V.Id,
T.Max_Time_Office,
T.Days,
Distinct_office_locations = count(distinct V.Office)
from view_1 V
Cross apply
(
Select top 1 Id,
Max_Time_Office = Office,
Days = sum(Days)
From view_1 VG
where V.Id = VG.Id
group by VG.Id, VG.Office
order by sum(Days) desc
) T
group by V.Id, T.Max_Time_Office, T.Days
Basically, you are getting the Office with most days in the order by sum(Days) desc inside the Cross apply, and using that in the outer expression. I then just did a count(distinct V.Office) to get the distinct offices.

Do partial row in BigQuery to get last data and order by id

I want to get the last id and its rank (based on ordering by date_update asc and then ordering again by id desc) and show the id and its rank. I run the query below:
SELECT id as data,
RANK() OVER (ORDER BY date_update) AS rank
FROM `test.sample`
ORDER BY id DESC
LIMIT 1
It works for other tables, but it didn't work for some tables with large data, giving this notice:
Resources exceeded during query execution: The query could not be executed in the allotted memory.
I have read Troubleshooting Error Big Query
and tried removing the ORDER BY, but it still won't run. What should I do?
sample data:
id date_update
22 2019-10-04
14 2019-10-01
24 2019-10-03
13 2019-10-02
process:
Rank() Over (Order by date_update)
id date_update rank
14 2019-10-01 1
13 2019-10-02 2
24 2019-10-03 3
22 2019-10-04 4
then, ordered by id desc based on the above:
id date_update rank
24 2019-10-03 3
22 2019-10-04 4
14 2019-10-01 1
13 2019-10-02 2
this is the expected result:
id rank
24 3
You can use the query below. It basically finds the row with the max ID (the latest ID), then queries the source table again using the date_update of that row as a filter.
WITH
`test.sample` AS
(
select 22 AS id, DATE('2019-10-04') as date_update union all
select 14 AS id, DATE('2019-10-01') as date_update union all
select 24 AS id, DATE('2019-10-03') as date_update union all
select 13 AS id, DATE('2019-10-02') as date_update
),
max_id_row AS
(
SELECT ARRAY_AGG(STRUCT(id, date_update) ORDER BY id DESC LIMIT 1)[OFFSET(0)] vals
FROM `test.sample`
)
SELECT m.vals.id, m.vals.date_update, COUNT(*) as rank
FROM `test.sample` as t
JOIN max_id_row as m
ON t.date_update <= m.vals.date_update
GROUP BY 1,2
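Here COUNT(*) counts every row whose date_update is on or before the max-id row's date (2019-10-03): three rows, which is exactly the RANK of id 24.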
Below is for BigQuery Standard SQL and should scale to whatever "large" data you have
#standardSQL
SELECT b.id, COUNT(1) + 1 AS `rank`
FROM `project.dataset.table` a
JOIN (
SELECT ARRAY_AGG(STRUCT(id, date_update) ORDER BY id DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
) b
ON a.date_update < b.date_update
GROUP BY id
Applied to the sample data in your question -
WITH `project.dataset.table` AS (
SELECT 22 id, DATE '2019-10-04' date_update UNION ALL
SELECT 14, '2019-10-01' UNION ALL
SELECT 24, '2019-10-03' UNION ALL
SELECT 13, '2019-10-02'
)
result is
Row id rank
1 24 3
The "trick" here is in changing focus from not scalable code with non or badly parallelized operations (RANK) to something that is as simple as COUNT'ing
So, your case (at least as it is presented in question's "process" section) can be rephrased as finding number of rows before the day with highest id - that simple - thus above simple query. Obviously adding "1" to that count gives you exactly what would RANK gave you if worked
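As a quick sanity check of that equivalence (a hypothetical standalone query on the same inlined sample data), RANK and a count of strictly preceding rows plus one agree on every row:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 22 id, DATE '2019-10-04' date_update UNION ALL
  SELECT 14, DATE '2019-10-01' UNION ALL
  SELECT 24, DATE '2019-10-03' UNION ALL
  SELECT 13, DATE '2019-10-02'
)
SELECT id,
  RANK() OVER (ORDER BY date_update) AS rank_value,
  -- count of rows strictly before this row, plus one
  COUNT(1) OVER (ORDER BY date_update ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) + 1 AS count_value
FROM `project.dataset.table`
-- rank_value = count_value on every row: 14->1, 13->2, 24->3, 22->4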

Find the longest streak of perfect scores per player

I have the following result from a SELECT query with ORDER BY player_id ASC, time ASC in a PostgreSQL database:
player_id points time
395 0 2018-06-01 17:55:23.982413-04
395 100 2018-06-30 11:05:21.8679-04
395 0 2018-07-15 21:56:25.420837-04
395 100 2018-07-28 19:47:13.84652-04
395 0 2018-11-27 17:09:59.384-05
395 100 2018-12-02 08:56:06.83033-05
399 0 2018-05-15 15:28:22.782945-04
399 100 2018-06-10 12:11:18.041521-04
454 0 2018-07-10 18:53:24.236363-04
675 0 2018-08-07 20:59:15.510936-04
696 0 2018-08-07 19:09:07.126876-04
756 100 2018-08-15 08:21:11.300871-04
756 100 2018-08-15 16:43:08.698862-04
756 0 2018-08-15 17:22:49.755721-04
756 100 2018-10-07 15:30:49.27374-04
756 0 2018-10-07 15:35:00.975252-04
756 0 2018-11-27 19:04:06.456982-05
756 100 2018-12-02 19:24:20.880022-05
756 100 2018-12-04 19:57:48.961111-05
I'm trying to find each player's longest streak where points = 100, with the tiebreaker being whichever streak began most recently. I also need to determine the time at which that player's longest streak began. The expected result would be:
player_id longest_streak time_began
395 1 2018-12-02 08:56:06.83033-05
399 1 2018-06-10 12:11:18.041521-04
756 2 2018-12-02 19:24:20.880022-05
A gaps-and-islands problem indeed.
Assuming:
"Streaks" are not interrupted by rows from other players.
All columns are defined NOT NULL. (Else you have to do more.)
This should be simplest and fastest as it only needs two fast row_number() window functions:
SELECT DISTINCT ON (player_id)
player_id, count(*) AS seq_len, min(ts) AS time_began
FROM (
SELECT player_id, points, ts
, row_number() OVER (PARTITION BY player_id ORDER BY ts)
- row_number() OVER (PARTITION BY player_id, points ORDER BY ts) AS grp
FROM tbl
) sub
WHERE points = 100
GROUP BY player_id, grp -- omit "points" after WHERE points = 100
ORDER BY player_id, seq_len DESC, time_began DESC;
db<>fiddle here
I'm using the column name ts instead of time, which is a reserved word in standard SQL. It's allowed in Postgres, but with limitations, and it's still a bad idea to use it as an identifier.
The "trick" is to subtract row numbers so that consecutive rows fall in the same group (grp) per (player_id, points). Then filter the ones with 100 points, aggregate per group and return only the longest, most recent result per player.
Basic explanation for the technique:
Select longest continuous sequence
We can use GROUP BY and DISTINCT ON in the same SELECT; GROUP BY is applied before DISTINCT ON. Consider the sequence of events in a SELECT query:
Best way to get result count before LIMIT was applied
About DISTINCT ON:
Select first row in each GROUP BY group?
This is a gaps-and-islands problem; you can use a conditional SUM aggregate as a window function to get a gap number,
then use the MAX and COUNT window functions again.
Query 1:
WITH CTE AS (
SELECT *,
SUM(CASE WHEN points = 100 THEN 1 END) OVER(PARTITION BY player_id ORDER BY time) -
SUM(1) OVER(PARTITION BY player_id ORDER BY time) RN
FROM T
)
SELECT player_id,
MAX(longest_streak) streak_end_time,
MAX(cnt) longest_streak
FROM (
SELECT player_id,
MAX(time) OVER(PARTITION BY rn,player_id) longest_streak,
COUNT(*) OVER(PARTITION BY rn,player_id) cnt
FROM CTE
WHERE points > 0
) t1
GROUP BY player_id
Results:
| player_id | streak_end_time | longest_streak |
|-----------|-----------------------------|----------------|
| 756 | 2018-12-04T19:57:48.961111Z | 2 |
| 399 | 2018-06-10T12:11:18.041521Z | 1 |
| 395 | 2018-12-02T08:56:06.83033Z | 1 |
One way to do this is to look at how many rows there are between the previous and next non-100 results. To get the lengths of the streaks:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
)
select s.*,
coalesce(next_seqnum, cnt + 1) - coalesce(prev_seqnum, 0) - 1 as length
from (select s.*,
max(seqnum) filter (where points <> 100) over (partition by player_id order by time) as prev_seqnum,
min(seqnum) filter (where points <> 100) over (partition by player_id order by time rows between current row and unbounded following) as next_seqnum
from s
) s
where points = 100;
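For player 756's first streak, for example, the two 100-point rows have seqnum 1 and 2; prev_seqnum is NULL (treated as 0 via coalesce) and next_seqnum is 3, so the length is 3 - 0 - 1 = 2.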
You can then incorporate the other conditions:
with s as (
select s.*,
row_number() over (partition by player_id order by time) as seqnum,
count(*) over (partition by player_id) as cnt
from scores s
),
streaks as (
select s.*,
next_seqnum - prev_seqnum - 1 as length,
max(next_seqnum - prev_seqnum - 1) over (partition by player_id) as max_length,
max(next_seqnum) over (partition by player_id) as max_next_seqnum
from (select s.*,
coalesce(max(seqnum) filter (where points <> 100) over (partition by player_id order by time), 0) as prev_seqnum,
coalesce(min(seqnum) filter (where points <> 100) over (partition by player_id order by time rows between current row and unbounded following), cnt + 1) as next_seqnum
from s
) s
where points = 100
)
select s.*
from streaks s
where length = max_length and
next_seqnum = max_next_seqnum;
Here is my answer:
select player_id,
       non_streak,
       streak,
       coalesce(non_streak, streak) strk,
       max(time) as time
from (
      select player_id, time, points,
             lag(points) over (partition by player_id order by time) prev_point,
             case when points + lag(points) over (partition by player_id order by time) = 100 then 1 end as non_streak,
             case when points + lag(points) over (partition by player_id order by time) > 100 then 1 end as streak
      from players
     ) t
where coalesce(non_streak, streak) is not null
group by 1, 2, 3
order by 1, 2;

Insert the table data based on grouping of two columns

I have an Oracle table in the following format, for example:
JLID Dcode SID TDT QTY
8295783 3119255 9842 3/5/2018 14
8269771 3119255 9842 3/6/2018 11
8302211 3119255 1126 3/1/2018 19
Here I have different SIDs for the same Dcode, and I need to get the SID with the maximum quantity: for SID 9842 it is (14+11)=25 and for SID 1126 it is 19, so the result should be for SID 9842. The query should return the following:
JLID Dcode START_DT END_DT SID
111 3119255 3/1/2018 3/31/2018 12:00 9842
The start date and end date should be calculated from TDT: the start date is the first day of the month and the end date is the last day of the month.
Can anyone please suggest some ideas?
It might be as simple as this:
SELECT Dcode, start_date, end_date, SID FROM (
SELECT Dcode, SID, TRUNC(start_date, 'MONTH') AS start_date
, LAST_DAY(end_date) AS end_date
, ROW_NUMBER() OVER ( PARTITION BY Dcode ORDER BY total_qty DESC ) AS rn
FROM (
SELECT Dcode, SID, MIN(TDT) AS start_date, MAX(TDT) AS end_date
, SUM(QTY) AS total_qty
FROM mytable
GROUP BY Dcode, SID
)
) WHERE rn = 1
In the innermost subquery I aggregate to get the range of dates and the total quantity for each combination of Dcode and SID. Then I use an analytic (window) function to get the row with the greatest total quantity. (You would want to use RANK() in place of ROW_NUMBER() in the event you want to return more than one value of SID with the same quantity.)
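To illustrate that parenthetical, here is a small standalone Oracle sketch with a hypothetical tie in total quantity:
SELECT sid, total_qty,
       ROW_NUMBER() OVER ( ORDER BY total_qty DESC ) AS rn,
       RANK() OVER ( ORDER BY total_qty DESC ) AS rnk
FROM (SELECT 9842 AS sid, 25 AS total_qty FROM dual
      UNION ALL
      SELECT 1126, 25 FROM dual);
-- ROW_NUMBER breaks the tie arbitrarily (rn = 1, 2), so WHERE rn = 1 keeps one row;
-- RANK yields rnk = 1, 1, so WHERE rnk = 1 would keep both tied SIDs.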
Here's one option which doesn't contain JLID = 111 in the final result as I have no idea where you took it from.
SQL> with test (jlid, dcode, sid, tdt, qty) as
2 (select 8295783, 3119255, 9842, date '2018-03-05', 14 from dual union
3 select 8269771, 3119255, 9842, date '2018-08-22', 11 from dual union
4 select 8302211, 3119255, 1126, date '2018-03-01', 19 from dual union
5 --
6 select 1234567, 1112223, 1000, date '2018-06-16', 88 from dual
7 )
8 select dcode,
9 min (trunc (tdt, 'mm')) start_dt, --> MIN
10 max (last_day (tdt)) end_dt, --> MAX
11 sid
12 from (select dcode,
13 sid,
14 tdt,
15 sqty,
16 rank () over (partition by dcode order by sqty desc) rnk
17 from (select dcode,
18 sid,
19 tdt,
20 sum (qty) over (partition by dcode, sid) sqty
21 from test))
22 where rnk = 1
23 group by dcode, sid; --> GROUP BY
DCODE START_DT END_DT SID
---------- ---------------- ---------------- ----------
1112223 01.06.2018 00:00 30.06.2018 00:00 1000
3119255 01.03.2018 00:00 31.08.2018 00:00 9842
SQL>

Duplicating records to fill gap between dates in Google BigQuery

So I've found similar resources that address how to do this in SQL, like this:
Duplicating records to fill gap between dates
I understand that BigQuery may not be the best place to do this, so I'm trying to see if it's at all possible. When trying to run some of the methods in the link above, I hit a wall, as some of the functions aren't supported within BigQuery.
If a table exists with data structured like so:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between:
MODIFY_DATE SKU STORE STOCK_ON_HAND
08/01/2016 00:00:00 1120010 21 100
08/02/2016 00:00:00 1120010 21 100
08/03/2016 00:00:00 1120010 21 100
08/04/2016 00:00:00 1120010 21 100
08/05/2016 00:00:00 1120010 21 75
08/06/2016 00:00:00 1120010 21 75
08/07/2016 00:00:00 1120010 21 40
I know I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?
How can I build a query within Google BigQuery that yields an output like the one below? A value at a given date is repeated until the next change for the dates in between
See example below
SELECT
MODIFY_DATE,
MAX(SKU_TEMP) OVER(PARTITION BY grp) AS SKU,
MAX(STORE_TEMP) OVER(PARTITION BY grp) AS STORE,
MAX(STOCK_ON_HAND_TEMP) OVER(PARTITION BY grp) AS STOCK_ON_HAND,
FROM (
SELECT
DAY AS MODIFY_DATE, SKU AS SKU_TEMP, STORE AS STORE_TEMP, STOCK_ON_HAND AS STOCK_ON_HAND_TEMP,
COUNT(SKU) OVER(ORDER BY DAY ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS grp,
FROM (
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
) AS DATES
LEFT JOIN (
SELECT DATE(MODIFY_DATE) AS MODIFY_DATE, SKU, STORE, STOCK_ON_HAND
FROM
(SELECT "2016-08-01" AS MODIFY_DATE, "1120010" AS SKU, 21 AS STORE, 75 AS STOCK_ON_HAND),
(SELECT "2016-08-05" AS MODIFY_DATE, "1120010" AS SKU, 22 AS STORE, 100 AS STOCK_ON_HAND),
(SELECT "2016-08-07" AS MODIFY_DATE, "1120011" AS SKU, 23 AS STORE, 40 AS STOCK_ON_HAND),
) AS TABLE_WITH_GAPS
ON TABLE_WITH_GAPS.MODIFY_DATE = DATES.DAY
)
ORDER BY MODIFY_DATE
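The grp column is a running count of non-NULL SKUs after the LEFT JOIN, so each gap day keeps the group number of the most recent real reading, and MAX(...) OVER(PARTITION BY grp) copies that reading's values onto every day in the group.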
I need to generate a table that has all the dates within a given range, but I'm having a hard time understanding if this can be done. Any ideas?
SELECT DATE(DATE_ADD(TIMESTAMP("2016-08-01"), pos - 1, "DAY")) AS DAY
FROM (
SELECT ROW_NUMBER() OVER() AS pos, *
FROM (FLATTEN((
SELECT SPLIT(RPAD('', 1 + DATEDIFF(TIMESTAMP("2016-08-07"), TIMESTAMP("2016-08-01")), '.'),'') AS h
FROM (SELECT NULL)),h
)))
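For what it's worth, if you can use BigQuery standard SQL rather than the legacy dialect shown above, the same date spine is much shorter with GENERATE_DATE_ARRAY:
#standardSQL
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-01', '2016-08-07')) AS day
-- one row per calendar day from 2016-08-01 through 2016-08-07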