How to use SQL PARTITION BY GROUPS? - sql

I'm working with PostgreSQL 12, but the question is standard SQL.
I have a table like this:
| timestamp | raw_value |
| ------------------------ | --------- |
| 2015-06-27T03:52:50.000Z | 0 |
| 2015-06-27T03:53:00.000Z | 0 |
| 2015-06-27T03:53:10.000Z | 1 |
| 2015-06-27T03:53:20.000Z | 1 |
| 2015-06-27T04:22:20.000Z | 1 |
| 2015-06-27T04:22:30.000Z | 0 |
| 2015-06-27T05:33:40.000Z | 1 |
| 2015-06-27T05:33:50.000Z | 1 |
I need to get the first and last timestamp of each group with raw_value = 1, i.e. needed result :
| start_time | end_time |
| ------------------------ | ------------------------ |
| 2015-06-27T03:53:10.000Z | 2015-06-27T04:22:20.000Z |
| 2015-06-27T05:33:40.000Z | 2015-06-27T05:33:50.000Z |
My best effort so far looks like this:
SELECT timestamp, raw_value, row_number() OVER w AS rn, first_value(timestamp) OVER w AS start_time, last_value(timestamp) OVER w AS end_time
FROM mytable
WINDOW w AS (PARTITION BY raw_value ORDER BY timestamp GROUPS CURRENT ROW )
ORDER BY timestamp;
Google doesn't turn up much about it, but according to the docs the "GROUPS" clause looks like exactly what I need. The end result is wrong, though: the window functions simply copy the value from the timestamp column:
| timestamp | raw_value | rn | start_time | end_time |
| ------------------------ | --------- | --- | ------------------------ | ------------------------ |
| 2015-06-27T03:52:50.000Z | 0 | 1 | 2015-06-27T03:52:50.000Z | 2015-06-27T03:52:50.000Z |
| 2015-06-27T03:53:00.000Z | 0 | 2 | 2015-06-27T03:53:00.000Z | 2015-06-27T03:53:00.000Z |
| 2015-06-27T03:53:10.000Z | 1 | 1 | 2015-06-27T03:53:10.000Z | 2015-06-27T03:53:10.000Z |
| 2015-06-27T03:53:20.000Z | 1 | 2 | 2015-06-27T03:53:20.000Z | 2015-06-27T03:53:20.000Z |
| 2015-06-27T04:22:20.000Z | 1 | 3 | 2015-06-27T04:22:20.000Z | 2015-06-27T04:22:20.000Z |
| 2015-06-27T04:22:30.000Z | 0 | 3 | 2015-06-27T04:22:30.000Z | 2015-06-27T04:22:30.000Z |
| 2015-06-27T05:33:40.000Z | 1 | 4 | 2015-06-27T05:33:40.000Z | 2015-06-27T05:33:40.000Z |
| 2015-06-27T05:33:50.000Z | 1 | 5 | 2015-06-27T05:33:50.000Z | 2015-06-27T05:33:50.000Z |
At line#6 I'd expect the row number to reset to 1, but it doesn't! I tried using BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING as well without luck.
I have created a DB Fiddle link for your convenience as well.
If there is any other way to achieve the same result in SQL (ok to be PG-specific) without window functions, I'd like to know.

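Identify the groups using the row_number() - sum() trick, then take the min and max time of each identified group. Within a run of consecutive raw_value = 1 rows, both row_number() and the running sum() grow by one per row, so their difference g stays constant for the whole run.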
with grp as (
select obt, raw_value
, row_number() over w - sum(raw_value) over w as g
from tm_series
window w as (order by obt)
)
select min(obt), max(obt)
from grp
where raw_value = 1
group by g;
DB fiddle here.
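(The GROUPS clause depends on window ordering and seems to have nothing in common with your problem. A GROUPS frame counts peer groups, i.e. rows with equal ORDER BY values; since every timestamp is unique here, GROUPS CURRENT ROW contains only the current row.)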
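As a rough check of the row_number() - sum() trick, here is a minimal sketch that runs the query above against the question's sample data. It uses Python's built-in sqlite3 module instead of PostgreSQL (an assumption made only to keep the example self-contained; it needs an SQLite build with window function support, 3.25 or later). The table and column names tm_series, obt and raw_value are taken from the answer.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm_series (obt TEXT, raw_value INTEGER)")
conn.executemany(
    "INSERT INTO tm_series VALUES (?, ?)",
    [
        ("2015-06-27T03:52:50.000Z", 0),
        ("2015-06-27T03:53:00.000Z", 0),
        ("2015-06-27T03:53:10.000Z", 1),
        ("2015-06-27T03:53:20.000Z", 1),
        ("2015-06-27T04:22:20.000Z", 1),
        ("2015-06-27T04:22:30.000Z", 0),
        ("2015-06-27T05:33:40.000Z", 1),
        ("2015-06-27T05:33:50.000Z", 1),
    ],
)

# g = row number minus running count of 1s; it is constant inside each run of 1s.
islands = conn.execute(
    """
    WITH grp AS (
        SELECT obt, raw_value,
               row_number() OVER w - sum(raw_value) OVER w AS g
        FROM tm_series
        WINDOW w AS (ORDER BY obt)
    )
    SELECT min(obt) AS start_time, max(obt) AS end_time
    FROM grp
    WHERE raw_value = 1
    GROUP BY g
    ORDER BY start_time
    """
).fetchall()

for start_time, end_time in islands:
    print(start_time, end_time)
```

Note that the trick relies on raw_value being exactly 0 or 1, because the running sum is used as a count of the 1 rows.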

Your updated fiddle here.
For a gaps-and-islands approach, first mark your transitions from raw_value = 0 to raw_value = 1:
with mark_changes as (
select obt, raw_value,
case
when raw_value = 0 then 0
when raw_value = lag(raw_value) over (order by obt) then 0
else 1
end as transition
from tm_series
),
Keep only the raw_value = 1 rows, and sum() the preceding transition markers to place each row into a group.
id_groups as (
select obt, raw_value,
sum(transition) over (order by obt) as grp_num
from mark_changes
where raw_value = 1
)
Use group by on these grp_num values to get your desired result.
select min(obt) as start_time,
max(obt) as end_time
from id_groups
group by grp_num
order by min(obt);
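The three fragments above can be assembled into a single statement. The following is a minimal sketch that does so and runs it on the question's sample data, again using Python's sqlite3 module in place of PostgreSQL (an assumption for the sake of a self-contained example). The extra grp_num output column is only there to make the grouping visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm_series (obt TEXT, raw_value INTEGER)")
conn.executemany(
    "INSERT INTO tm_series VALUES (?, ?)",
    [
        ("2015-06-27T03:52:50.000Z", 0),
        ("2015-06-27T03:53:00.000Z", 0),
        ("2015-06-27T03:53:10.000Z", 1),
        ("2015-06-27T03:53:20.000Z", 1),
        ("2015-06-27T04:22:20.000Z", 1),
        ("2015-06-27T04:22:30.000Z", 0),
        ("2015-06-27T05:33:40.000Z", 1),
        ("2015-06-27T05:33:50.000Z", 1),
    ],
)

query = """
WITH mark_changes AS (
    -- transition = 1 only on the first row of each run of 1s.
    -- If the very first row is a 1, lag() is NULL, the comparison is
    -- NULL, and the CASE falls through to ELSE 1.
    SELECT obt, raw_value,
           CASE
               WHEN raw_value = 0 THEN 0
               WHEN raw_value = lag(raw_value) OVER (ORDER BY obt) THEN 0
               ELSE 1
           END AS transition
    FROM tm_series
),
id_groups AS (
    -- Running sum of the markers numbers the islands 1, 2, 3, ...
    SELECT obt,
           sum(transition) OVER (ORDER BY obt) AS grp_num
    FROM mark_changes
    WHERE raw_value = 1
)
SELECT grp_num, min(obt) AS start_time, max(obt) AS end_time
FROM id_groups
GROUP BY grp_num
ORDER BY start_time
"""

groups = conn.execute(query).fetchall()
for grp_num, start_time, end_time in groups:
    print(grp_num, start_time, end_time)
```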

Related

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201125 | 1 | 0 |
-----------------------------------
| 4 | 20201114 | 2 | 32 |
-----------------------------------
| 5 | 20201116 | 2 | 0 |
-----------------------------------
| 6 | 20201120 | 2 | 23 |
-----------------------------------
However, from this I need a record for each user for each day. If a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:
| Row | date |user id | score |
-----------------------------------
| 1 | 20201120 | 1 | 26 |
-----------------------------------
| 2 | 20201121 | 1 | 14 |
-----------------------------------
| 3 | 20201122 | 1 | 14 |
-----------------------------------
| 4 | 20201123 | 1 | 14 |
-----------------------------------
| 5 | 20201124 | 1 | 14 |
-----------------------------------
| 6 | 20201125 | 1 | 0 |
-----------------------------------
| 7 | 20201114 | 2 | 32 |
-----------------------------------
| 8 | 20201115 | 2 | 32 |
-----------------------------------
| 9 | 20201116 | 2 | 0 |
-----------------------------------
| 10 | 20201117 | 2 | 0 |
-----------------------------------
| 11 | 20201118 | 2 | 0 |
-----------------------------------
| 12 | 20201119 | 2 | 0 |
-----------------------------------
| 13 | 20201120 | 2 | 23 |
-----------------------------------
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across following empty dates, but I really don't know how to add new rows for missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1500.
My end goal would be to show something like the average of the score per day. For background, because of our logic, if the score wasn't recorded in a specific day, this means that the user is still in the last score recorded which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
select user_id, format_date('%Y%m%d', day) date,
from (
select user_id, min(parse_date('%Y%m%d', date)) min_date, max(parse_date('%Y%m%d', date)) max_date
from `project.dataset.table`
group by user_id
) a, unnest(generate_date_array(min_date, max_date)) day
)
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
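Applied to the sample data from your question, this returns one row per user per day between that user's first and last date, with missing days carrying the last recorded score.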
One option uses generate_date_array() to create the series of dates of each user, then brings the table with a left join.
-- assumes mytable.date is a DATE column
select d.date, d.user_id,
    last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
    select r.user_id, date
    from (
        select user_id, min(date) as min_date, max(date) as max_date
        from mytable
        group by user_id
    ) r
    cross join unnest(generate_date_array(r.min_date, r.max_date, interval 1 day)) as date
) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with t as (
select t.*,
date_add(lead(date) over (partition by user_id order by date), interval -1 day) as next_date
from t
)
select row_number() over (order by t.user_id, dte) as id,
t.user_id, dte, t.score
from t cross join
unnest(generate_date_array(date,
coalesce(next_date, date),
interval 1 day
)
) dte;
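The idea behind this last query (each row covers the days from its own date up to the day before the user's next row) can be sketched outside BigQuery as well. The following plain-Python version is only an illustration of that logic on the question's sample data; the function name fill_daily is hypothetical, and it assumes at most one row per user per day.

```python
from datetime import date, timedelta
from itertools import groupby

# Sample data from the question as (user_id, date, score).
rows = [
    (1, date(2020, 11, 20), 26),
    (1, date(2020, 11, 21), 14),
    (1, date(2020, 11, 25), 0),
    (2, date(2020, 11, 14), 32),
    (2, date(2020, 11, 16), 0),
    (2, date(2020, 11, 20), 23),
]


def fill_daily(rows):
    """Repeat each row for every day until the day before the user's next row."""
    filled = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for user_id, user_rows in groupby(ordered, key=lambda r: r[0]):
        user_rows = list(user_rows)
        for i, (_, day, score) in enumerate(user_rows):
            if i + 1 < len(user_rows):
                # Equivalent of lead(date) - 1 day in the SQL above.
                last_day = user_rows[i + 1][1] - timedelta(days=1)
            else:
                # Last row of a user: covers only its own date.
                last_day = day
            while day <= last_day:
                filled.append((user_id, day, score))
                day += timedelta(days=1)
    return filled


filled = fill_daily(rows)
for user_id, day, score in filled:
    print(user_id, day.strftime("%Y%m%d"), score)
```

Because every source row is expanded independently, no join back to the table and no "last non-null value" lookup is needed, which is presumably why the answer considers this variant efficient.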

Combine PARTITION BY and GROUP BY

I have a (mssql) table like this:
+----+----------+---------+--------+--------+
| id | username | date | scoreA | scoreB |
+----+----------+---------+--------+--------+
| 1 | jim | 01/2020 | 100 | 0 |
| 2 | max | 01/2020 | 0 | 200 |
| 3 | jim | 01/2020 | 0 | 150 |
| 4 | max | 02/2020 | 150 | 0 |
| 5 | jim | 02/2020 | 0 | 300 |
| 6 | lee | 02/2020 | 100 | 0 |
| 7 | max | 02/2020 | 0 | 200 |
+----+----------+---------+--------+--------+
What I need is to get the best "combined" score per date. (By "combined" score I mean the best scoreA and the best scoreB per user and per date, summed.)
The result should look like this:
+----------+---------+--------------------------------------------+
| username | date | combined_score (max(scoreA) + max(scoreB)) |
+----------+---------+--------------------------------------------+
| jim | 01/2020 | 250 |
| max | 02/2020 | 350 |
+----------+---------+--------------------------------------------+
I came this far:
I can group the scores by user like this:
SELECT
username, (max(scoreA) + max(scoreB)) AS combined_score
FROM score_table
GROUP BY username
ORDER BY combined_score DESC
And I can get the best score per date with PARTITION BY like this:
SELECT *
FROM
(SELECT t.*, row_number() OVER (PARTITION BY date ORDER BY scoreA DESC) rn
FROM score_table t) as tmp
WHERE tmp.rn = 1
ORDER BY date
Is there a proper way to combine these statements and get the result I need? Thank you!
Btw. Don't care about possible ties!
You can combine window functions and aggregation functions like this:
SELECT s.*
FROM (SELECT username, date, (max(scoreA) + max(scoreB)) AS combined_score,
             ROW_NUMBER() OVER (PARTITION BY date ORDER BY max(scoreA) + max(scoreB) DESC) as seqnum
      FROM score_table
      GROUP BY username, date
     ) s
WHERE seqnum = 1
ORDER BY combined_score DESC;
Note that date needs to be part of the GROUP BY, and that the outer filter on seqnum = 1 keeps only the best row per date.
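To check the approach against the sample data, here is a small sketch using Python's sqlite3 module rather than SQL Server (an assumption made so the example is self-contained). For clarity it splits the work into two steps, a CTE that aggregates per user and date and an outer query that ranks the aggregated rows; this should be equivalent to the single-level form above. The CTE name per_user is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE score_table "
    "(id INTEGER, username TEXT, date TEXT, scoreA INTEGER, scoreB INTEGER)"
)
conn.executemany(
    "INSERT INTO score_table VALUES (?, ?, ?, ?, ?)",
    [
        (1, "jim", "01/2020", 100, 0),
        (2, "max", "01/2020", 0, 200),
        (3, "jim", "01/2020", 0, 150),
        (4, "max", "02/2020", 150, 0),
        (5, "jim", "02/2020", 0, 300),
        (6, "lee", "02/2020", 100, 0),
        (7, "max", "02/2020", 0, 200),
    ],
)

best = conn.execute(
    """
    WITH per_user AS (
        -- Step 1: one row per user and date with the combined score.
        SELECT username, date,
               max(scoreA) + max(scoreB) AS combined_score
        FROM score_table
        GROUP BY username, date
    )
    SELECT username, date, combined_score
    FROM (
        -- Step 2: rank the users within each date.
        SELECT p.*,
               row_number() OVER (PARTITION BY date
                                  ORDER BY combined_score DESC) AS seqnum
        FROM per_user p
    ) AS s
    WHERE seqnum = 1
    ORDER BY date
    """
).fetchall()

for username, score_date, combined_score in best:
    print(username, score_date, combined_score)
```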

SQL get max value with date smaller date

I have a table like this:
| colA | date | num |
| x | 1.7. | 2 |
| x | 3.7. | 1 |
| x | 4.7. | 3 |
| z | 1.8. | 0 | (edit)
| z | 2.8. | 1 |
| z | 5.8. | 2 |
And I want a result like this:
| colA | date | maxNum |
| x | 1.7. | null |
| x | 3.7. | 2 |
| x | 4.7. | 2 |
| z | 1.8. | null | (edit)
| z | 2.8. | 0 |
| z | 5.8. | 1 |
So for every row I want the max(num) over the rows of the same colA that have an earlier date.
Is this somehow possible with a simple query? It would be part of a bigger query needed for some calculations on big databases.
Edit: maxNum should be null if there is no value before a date in the group
Thanks in advance.
Use MAX..KEEP syntax.
select cola,
adate,
max(num) keep ( dense_rank first order by adate ) over (partition by cola ) maxnum,
case when adate = min(adate) over ( partition by cola )
then null
else max(num) keep ( dense_rank first order by adate ) over (partition by cola ) end maxnum_op
from input;
+------+-------+--------+-----------+
| COLA | ADATE | MAXNUM | MAXNUM_OP |
+------+-------+--------+-----------+
| x | 1.7 | 2 | |
| x | 3.7 | 2 | 2 |
| x | 4.7 | 2 | 2 |
| z | 2.8 | 1 | |
| z | 5.8 | 1 | 1 |
+------+-------+--------+-----------+
The MAXNUM_OP column shows the results you wanted, but you never explained why some of the values were supposed to be null. The MAXNUM column shows the results that I think you described in the text of your post.
You can use first_value and row_number analytical function as following:
Select cola,
date,
case when row_number() over (partition by cola order by date) > 1 then
first_value(num) over (partition by cola order by date)
end as maxnum
From your_table;
Cheers!!
One way is to use a subquery.
SELECT t1.cola,
t1.date,
(SELECT max(t2.num)
FROM elbat t2
WHERE t2.cola = t1.cola
AND t2.date < t1.date) maxnum
FROM elbat t1;
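The correlated subquery above can also be expressed as a window function whose frame ends one row before the current row. The sketch below is not taken from any of the answers; it assumes at most one row per colA and date, uses Python's sqlite3 module as a stand-in database, and invents the year 2020 because the question only gives day and month. The table name t and the column name d are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cola TEXT, d TEXT, num INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        # Year 2020 is an assumption; the question shows only day.month.
        ("x", "2020-07-01", 2),
        ("x", "2020-07-03", 1),
        ("x", "2020-07-04", 3),
        ("z", "2020-08-01", 0),
        ("z", "2020-08-02", 1),
        ("z", "2020-08-05", 2),
    ],
)

result = conn.execute(
    """
    SELECT cola, d,
           -- The frame covers all earlier rows of the partition and excludes
           -- the current row; for the first row it is empty, so max() is NULL.
           max(num) OVER (PARTITION BY cola ORDER BY d
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS maxnum
    FROM t
    ORDER BY cola, d
    """
).fetchall()

for cola, d, maxnum in result:
    print(cola, d, maxnum)
```

With duplicate dates inside one colA the ROWS frame would also see rows of the same date, in which case the subquery form with a strict `<` comparison is the safer choice.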

SQL Count In Range

How can I count rows that fall into ranges, where the ranges are configurable?
Something like this,
CAR_AVBL
+--------+-----------+
| CAR_ID | DATE_AVBL |
+--------+-----------+
| JJ01 | 1 |
| JJ02 | 1 |
| JJ03 | 3 |
| JJ04 | 10 |
| JJ05 | 13 |
| JJ06 | 4 |
| JJ07 | 10 |
| JJ08 | 1 |
| JJ09 | 23 |
| JJ10 | 11 |
| JJ11 | 20 |
| JJ12 | 3 |
| JJ13 | 19 |
| JJ14 | 22 |
| JJ15 | 7 |
+--------------------+
ZONE_CFG
+--------+------------+
| DATE | ZONE_DESCR |
+--------+------------+
| 15 | GREEN_ZONE |
| 25 | YELLOW_ZONE|
| 30 | RED_ZONE |
+--------+------------+
Table ZONE_CFG is configurable, so I cannot use static values for this.
The DATE column holds the maximum date for each zone.
The result I expect:
+------------+----------+
| ZONE_DESCR | AVBL_CAR |
+------------+----------+
| GREEN_ZONE | 11 |
| YELLOW_ZONE| 4 |
| RED_ZONE | 0 |
+------------+----------+
Please could someone help me with this
You can use LAG and group by as following:
SELECT
ZC.ZONE_DESCR,
COUNT(1) AS AVBL_CAR
FROM
CAR_AVBL CA
JOIN ( SELECT
ZONE_DESCR,
COALESCE(LAG(DATE) OVER(ORDER BY DATE) + 1, 0) AS START_DATE,
DATE AS END_DATE
FROM ZONE_CFG ) ZC
ON ( CA.DATE_AVBL BETWEEN ZC.START_DATE AND ZC.END_DATE )
GROUP BY
ZC.ZONE_DESCR;
Note: Don't use Oracle reserved keywords (DATE, in your case) as column names. Try changing it to something like DATE_ or DATE_START.
Cheers!!
If you want the zero counts as well, I might suggest a correlated subquery instead:
select z.*,
(select count(*)
from car_avbl c
where c.date_avbl > z.start_date and
c.date_avbl <= z.date
) as avbl_car
from (select z.*,
lag(date, 1, 0) over (order by date) as start_date
from zone_cfg z
) z;
In Oracle 12C, you can phrase this using a lateral join:
select z.*,
(c.cnt - lag(c.cnt, 1, 0) over (order by z.date)) as cnt
from zone_cfg z left join lateral
(select count(*) as cnt
from car_avbl c
where c.date_avbl <= z.date
) c
on 1=1
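As a sanity check of the range logic, here is a minimal sketch on the question's data using Python's sqlite3 module instead of Oracle (an assumption for the sake of a runnable example). It derives each zone's lower bound with lag() and uses a LEFT JOIN so that empty zones still show up with a count of 0. Following the naming advice above, the DATE column is renamed to max_date in this sketch; the CTE name zones is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_avbl (car_id TEXT, date_avbl INTEGER)")
conn.execute("CREATE TABLE zone_cfg (max_date INTEGER, zone_descr TEXT)")
conn.executemany(
    "INSERT INTO car_avbl VALUES (?, ?)",
    [
        ("JJ01", 1), ("JJ02", 1), ("JJ03", 3), ("JJ04", 10), ("JJ05", 13),
        ("JJ06", 4), ("JJ07", 10), ("JJ08", 1), ("JJ09", 23), ("JJ10", 11),
        ("JJ11", 20), ("JJ12", 3), ("JJ13", 19), ("JJ14", 22), ("JJ15", 7),
    ],
)
conn.executemany(
    "INSERT INTO zone_cfg VALUES (?, ?)",
    [(15, "GREEN_ZONE"), (25, "YELLOW_ZONE"), (30, "RED_ZONE")],
)

counts = conn.execute(
    """
    WITH zones AS (
        -- Lower bound of a zone = upper bound of the previous zone (0 for the first).
        SELECT zone_descr,
               max_date,
               lag(max_date, 1, 0) OVER (ORDER BY max_date) AS prev_max_date
        FROM zone_cfg
    )
    SELECT z.zone_descr, count(c.car_id) AS avbl_car
    FROM zones z
    LEFT JOIN car_avbl c
           ON c.date_avbl > z.prev_max_date
          AND c.date_avbl <= z.max_date
    GROUP BY z.zone_descr, z.max_date
    ORDER BY z.max_date
    """
).fetchall()

for zone_descr, avbl_car in counts:
    print(zone_descr, avbl_car)
```

Counting c.car_id rather than * matters here: for a zone without cars the LEFT JOIN produces one row of NULLs, which count(c.car_id) ignores.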

No rowid or key need most recent row

I am trying my hardest to get a list of the most recent rows by date in a DB2 file. The file has no unique id, so I am trying to get the entries by matching a set of columns. I need DESCGA most importantly as that changes often. When it does they keep another row for historical reasons.
SELECT B.COGA, B.COMSUBGA, B.ACCTGA, B.PRFXGA, B.DESCGA
FROM mylib.myfile B
WHERE
(
SELECT COUNT(*)
FROM
(
SELECT A.COGA,A.COMSUBGA,A.ACCTGA,A.PRFXGA,MAX(A.DATEGA) AS EDATE
FROM mylib.myfile A
GROUP BY A.COGA, A.COMSUBGA, A.ACCTGA, A.PRFXGA
) T
WHERE
(B.ACCTGA = T.ACCTGA AND
B.COGA = T.COGA AND
B.COMSUBGA = T.COMSUBGA AND
B.PRFXGA = T.PRFXGA AND
B.DATEGA = T.EDATE)
) > 1
This is what I am trying and so far I get 0 results.
If I remove
B.ACCTGA = T.ACCTGA AND
It will return results (of course wrong).
I am using ODBC in VS 2013 to structure this query.
I have a table with the following
| a | b | descri | date |
-----------------------------
| 1 | 0 | string | 20140102 |
| 2 | 1 | string | 20140103 |
| 1 | 1 | string | 20140101 |
| 1 | 1 | string | 20150101 |
| 1 | 0 | string | 20150102 |
| 2 | 1 | string | 20150103 |
| 1 | 1 | string | 20150103 |
and I need
| 1 | 0 | string | 20150102 |
| 2 | 1 | string | 20150103 |
| 1 | 1 | string | 20150103 |
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by a, b order by date desc) as seqnum
from mylib.myfile t
) t
where seqnum = 1;
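A quick way to try this on the simplified table from the question is the sketch below. It uses Python's sqlite3 module in place of DB2 (an assumption to keep it self-contained), and the table name myfile stands in for mylib.myfile.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myfile (a INTEGER, b INTEGER, descri TEXT, date INTEGER)")
conn.executemany(
    "INSERT INTO myfile VALUES (?, ?, ?, ?)",
    [
        (1, 0, "string", 20140102),
        (2, 1, "string", 20140103),
        (1, 1, "string", 20140101),
        (1, 1, "string", 20150101),
        (1, 0, "string", 20150102),
        (2, 1, "string", 20150103),
        (1, 1, "string", 20150103),
    ],
)

latest = conn.execute(
    """
    SELECT a, b, descri, date
    FROM (
        -- Number the rows of each (a, b) combination, newest first.
        SELECT t.*,
               row_number() OVER (PARTITION BY a, b ORDER BY date DESC) AS seqnum
        FROM myfile t
    ) AS t
    WHERE seqnum = 1
    ORDER BY a, b
    """
).fetchall()

for row in latest:
    print(row)
```

For the real file, the PARTITION BY list would presumably be the four matching columns from the question (COGA, COMSUBGA, ACCTGA, PRFXGA) with DATEGA as the ordering column.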