BigQuery array_concat_agg over a window function - google-bigquery

I wanted to use BigQuery's array aggregate function, array_concat_agg over a window function. It doesn't look like this is possible - and actually, I might have a second issue regarding the accessibility of my window function to an inner query.
Here is my current SQL:
WITH data AS (
SELECT 1 AS id, 1 AS iteration_recency, DATE("2022-07-09") AS start_date, DATE("2022-07-31") AS end_date
UNION ALL
SELECT 1 AS id, 2 AS iteration_recency, DATE("2022-08-01") AS start_date, DATE("2022-08-15") AS end_date
UNION ALL
SELECT 1 AS id, 3 AS iteration_recency, DATE("2022-07-01") AS start_date, DATE("2022-07-04") AS end_date
UNION ALL
SELECT 1 AS id, 4 AS iteration_recency, DATE("2022-07-25") AS start_date, DATE("2022-08-04") AS end_date
UNION ALL
SELECT 1 AS id, 5 AS iteration_recency, DATE("2022-07-01") AS start_date, DATE("2022-07-31") AS end_date
UNION ALL
SELECT 2 AS id, 1 AS iteration_recency, DATE("2022-08-01") AS start_date, DATE("2022-10-30") AS end_date
UNION ALL
SELECT 2 AS id, 2 AS iteration_recency, DATE("2022-07-05") AS start_date, DATE("2022-07-22") AS end_date
UNION ALL
SELECT 2 AS id, 3 AS iteration_recency, DATE("2022-08-06") AS start_date, DATE("2022-08-24") AS end_date
)
SELECT
id,
iteration_recency,
(
SELECT MIN(`dates` IN UNNEST(
ARRAY_CONCAT_AGG(GENERATE_DATE_ARRAY(`start_date`, `end_date`)) OVER `newer_iterations`
)
)
FROM UNNEST(GENERATE_DATE_ARRAY(`start_date`, `end_date`)) AS `dates`
) AS date_range_contained_in_more_recent_iterations
FROM data
WINDOW `newer_iterations` AS (
PARTITION BY id
ORDER BY iteration_recency ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
The purpose of this query is to determine whether a date range is fully represented by more recent iterations of the same id. Regarding the use case, you can imagine this would be used in some monitoring whereby when iteration 3 fails for some date range but that date range is covered by more recent iteration(s), it's not a problem. I can't do something clever with min/max because more recent iterations may have overlapped the failed date range but perhaps not completely covered it between them.
The slightly crazy MIN in UNNEST() stuff draws inspiration from this answer, which provides a neat way of working out whether all items from arrayA are in arrayB.
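As a standalone illustration of that trick (with hypothetical literal arrays standing in for arrayA and arrayB), MIN over the per-element membership booleans is false as soon as any element of arrayA is missing from arrayB:

```sql
-- Hypothetical arrays: is every element of arrayA present in arrayB?
-- MIN over BOOL yields false if any membership test fails
-- (BigQuery also offers LOGICAL_AND, which reads more clearly).
SELECT (
  SELECT MIN(a IN UNNEST(arrayB))
  FROM UNNEST(arrayA) AS a
) AS all_a_in_b
FROM (SELECT [1, 2] AS arrayA, [1, 2, 3] AS arrayB);
```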
Currently, I get the error: Unrecognized window alias newer_iterations at [24:76]
I was actually expecting the (paraphrased) error "the OVER clause is not supported for ARRAY_CONCAT_AGG", because according to the docs it is not supported - but it looks like I'm misunderstanding the availability of the outer window definition inside the inner query.
Maybe there's a way of doing this with a join, but the logical requirement of ARRAY_CONCAT_AGG operating over a frame of UNBOUNDED PRECEDING AND 1 PRECEDING seems unavoidable to me.
The result I was expecting was:
id  iteration_recency  date_range_contained_in_more_recent_iterations
1   1                  false
1   2                  false
1   3                  false
1   4                  true
1   5                  false
2   1                  false
2   2                  false
2   3                  true
Grateful for any pointers, thanks!

You might consider the approach below.
SELECT * EXCEPT(days, concat_days),
  NOT EXISTS (
    SELECT d FROM t.days d                    -- current iteration
    EXCEPT DISTINCT                           -- MINUS
    SELECT d FROM t.concat_days c, c.days d   -- more recent iterations
  ) AS date_range_contained_in_more_recent_iterations
FROM (
  SELECT *, ARRAY_AGG(STRUCT(days)) OVER w0 AS concat_days
  FROM data, UNNEST([STRUCT(GENERATE_DATE_ARRAY(start_date, end_date) AS days)])
  WINDOW w0 AS (PARTITION BY id ORDER BY iteration_recency ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
) AS t;
Query results
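The unstated idea in this answer is that the current iteration's date range is covered exactly when the set difference (current dates MINUS more-recent dates) is empty. A minimal standalone sketch of that test, with hypothetical literal date ranges:

```sql
-- Hypothetical ranges: NOT EXISTS (current EXCEPT DISTINCT others)
-- is true when every date in the first query also appears in the second.
SELECT NOT EXISTS (
  SELECT d FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2022-07-25', DATE '2022-07-28')) d
  EXCEPT DISTINCT
  SELECT d FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2022-07-20', DATE '2022-07-31')) d
) AS covered;
```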

I found an alternative which draws inspiration from the combined use of ARRAY_AGG and STRUCT in Jaytiger's answer:
WITH data AS (
SELECT *, ARRAY_AGG(STRUCT(dates)) OVER w0 AS concat_dates
FROM (
SELECT 1 AS id, 1 AS iteration_recency, GENERATE_DATE_ARRAY("2022-07-09", "2022-07-31") AS dates
UNION ALL
SELECT 1 AS id, 2 AS iteration_recency, GENERATE_DATE_ARRAY("2022-08-01", "2022-08-15") AS dates
UNION ALL
SELECT 1 AS id, 3 AS iteration_recency, GENERATE_DATE_ARRAY("2022-07-01", "2022-07-04") AS dates
UNION ALL
SELECT 1 AS id, 4 AS iteration_recency, GENERATE_DATE_ARRAY("2022-07-25", "2022-08-04") AS dates
UNION ALL
SELECT 1 AS id, 5 AS iteration_recency, GENERATE_DATE_ARRAY("2022-07-01", "2022-07-31") AS dates
UNION ALL
SELECT 2 AS id, 1 AS iteration_recency, GENERATE_DATE_ARRAY("2022-08-01", "2022-10-30") AS dates
UNION ALL
SELECT 2 AS id, 2 AS iteration_recency, GENERATE_DATE_ARRAY("2022-07-05", "2022-07-22") AS dates
UNION ALL
SELECT 2 AS id, 3 AS iteration_recency, GENERATE_DATE_ARRAY("2022-08-06", "2022-08-24") AS dates
)
WINDOW w0 AS (PARTITION BY id ORDER BY iteration_recency ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
)
SELECT
*, (
SELECT MIN(d IN UNNEST(concat_dates.dates))
FROM UNNEST(dates) AS d
)
FROM data

Related

How can I obtain the minimum date for a value that is equal to the maximum date?

I am trying to obtain the minimum date at which the value is equal to the value at its maximum date. So far, I'm able to obtain the value at its maximum date, but I can't seem to obtain the minimum date where that value remains the same.
Here is what I got so far and the query result:
select a.id, a.end_date, a.value
from database1 as a
inner join (
select id, max(end_date) as end_date
from database1
group by id
) as b on a.id = b.id and a.end_date = b.end_date
where value is not null
order by id, end_date
This result obtains the most recent record, but I'm looking for the minimum end date record where the value remains the same as the most recent.
In the following sample table, the row I'd like to obtain is the one where id = 3, as it has the minimum end date for which the value remains the same:
id  end_date  value
1   02/12/22  5
2   02/13/22  5
3   02/14/22  4
4   02/15/22  4
Another option, which approaches the problem somewhat as described for the sample data shown - get the value at the maximum date, then the minimum row that has that value:
select top(1) t.*
from (
select top(1) Max(end_date)d, [value]
from t
group by [value]
order by d desc
)d
join t on t.[value] = d.[value]
order by t.id;
DB<>Fiddle
I'm most likely overthinking this as a Gaps & Islands problem, but you can do:
select min(end_date) as first_date
from (
select *, sum(inc) over (order by end_date desc) as grp
from (
select *,
case when value <> lag(value) over (order by end_date desc) then 1 else 0 end as inc
from t
) x
) y
where grp = 0
Result:
first_date
----------
2022-02-14
See running example at SQL Fiddle.
with data as (
select *,
row_number() over (partition by value order by end_date) as rn,
last_value(value) over (order by end_date rows between unbounded preceding and unbounded following) as lv
from T
)
select * from data
where value = lv and rn = 1
This isn't looking strictly for streaks of consecutive days. Any date that happened to have the same value as on final date would be in contention.

SQL group rows into pairs

I'm trying to add some sort of unique identifier (uid) to partitions made of pairs of rows, i.e. generate some uid/tag for each two rows of (identifier1,identifier2) in a window partition with size = 2 rows.
So, for example, the first 2 rows for ID X would get uid A, the next two rows for the same ID would get uid B and, if there is only one single row left in the partition for ID X, it would get id C.
Here's what I'm trying to accomplish, the picture illustrates the table's structure, I manually added the expectedIdentifier to illustrate the goal:
This is my current SQL, ntile doesn't solve it because the partition size varies:
select
rowId
, ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
, *
from log;
Already tried ntile( (count(*) over partition...) / 2), but that doesn't work.
Generating the UID can be done with md5() or similar, but I'm having trouble tagging the rows as illustrated above (so I can md5 the generated tag/uid)
While count(*) is not supported within a Snowflake window function, count(1) is supported and can be used to create the unique identifier. Below is an example of an integer unique ID matching pairs of rows and handling "odd" row groups:
select
ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
,ceil(count(1) over( partition by firstIdentifier, secondIdentifier order by timestamp asc) / 2) as id
, *
from log;
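If an opaque uid is preferred over a sequential pair number, the pair number from above can be fed into md5() as the question suggests. This is only a sketch in Snowflake syntax, assuming the column names from the question:

```sql
-- Hypothetical sketch: hash the partition keys plus the pair number
-- so each pair of rows shares one opaque uid.
select *,
  md5(firstIdentifier || '-' || secondIdentifier || '-' ||
      ceil(row_number() over (partition by firstIdentifier, secondIdentifier
                              order by timestamp) / 2)) as uid
from log;
```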
select *,
  char(65 + (row_number() over (partition by firstidentifier, secondidentifier
                                order by timestamp) - 1) / 2) as expectedidentifier
from log
order by firstidentifier, timestamp
Here is the SQL Server version:
with log (firstidentifier,secondidentifier, timestamp)
as (
select 15396, 14460, 1 union all
select 15396, 14460, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1
)
select *, char(65 + (row_number() over(partition by
firstidentifier,secondidentifier order by timestamp)-1)/2)
expectedidentifier from log
order by firstidentifier,secondidentifier,timestamp

SQL query to count number of objects in each state on each day

Given a set of database records that record the date when an object enters a particular state, I would like to produce a query that shows how many objects are in each state on any particular date. The results will be used to produce trend reports showing how the number of objects in each state changes over time.
I have a table like the following that records the date when an object enters a particular state:
ObjID EntryDate State
----- ---------- -----
1 2014-11-01 A
1 2014-11-04 B
1 2014-11-06 C
2 2014-11-01 A
2 2014-11-03 B
2 2014-11-10 C
3 2014-11-03 B
3 2014-11-08 C
There are an arbitrary number of objects and states.
I need to produce a query that returns the number of objects in each state on each date. The result would look like the following:
Date State Count
---------- ----- -----
2014-11-01 A 2
2014-11-01 B 0
2014-11-01 C 0
2014-11-02 A 2
2014-11-02 B 0
2014-11-02 C 0
2014-11-03 A 1
2014-11-03 B 2
2014-11-03 C 0
2014-11-04 A 0
2014-11-04 B 3
2014-11-04 C 0
2014-11-05 A 0
2014-11-05 B 3
2014-11-05 C 0
2014-11-06 A 0
2014-11-06 B 2
2014-11-06 C 1
2014-11-07 A 0
2014-11-07 B 2
2014-11-07 C 1
2014-11-08 A 0
2014-11-08 B 1
2014-11-08 C 2
2014-11-09 A 0
2014-11-09 B 1
2014-11-09 C 2
2014-11-10 A 0
2014-11-10 B 0
2014-11-10 C 3
I'm working with an Oracle database.
I haven't been able to find an example that matches my case. The following questions look like they are asking for solutions to similar but different problems:
SQL Count Of Open Orders Each Day Between Two Dates
Mysql select count per category per day
Any help or hints that can be provided would be much appreciated.
SELECT EntryDate AS "Date", State, COUNT(DISTINCT ObjID) AS "Count"
FROM my_table
GROUP BY EntryDate, State
ORDER BY EntryDate, State;
I'm going to do a quick and dirty way to get numbers; you can choose your preferred method: recursive CTEs, connect by, or a numbers table. The following generates all combinations of the dates and states. It then uses a correlated subquery to count the number of objects in each state on each date:
with n as (
select rownum - 1 as n
from table t
),
dates as (
select mind + n.n as date
from (select min(date) as mind, max(date) as maxd from table) t
cross join n
where mind + n.n <= maxd
)
select d.date, s.state,
(select count(*)
from (select t2.*, lead(date) over (partition by ObjId order by date) as nextdate
from table t2
) t2
where d.date >= t2.date and (d.date < t2.nextdate or t2.nextdate is null) and
s.state = t2.state
) as counts
from dates d cross join
(select distinct state from table t) s
This query will list how many objects ENTERED a particular state on each day, assuming each object only changes state ONCE a day. If objects change state more than once a day, you would need to use count(distinct objid):
select entrydate, state, count(objid)
from my_table
group by entrydate, state
order by entrydate, state
However, you are asking how many objects ARE in a particular state on each day, thus you would need a very different query to show that. Since you only provide that particular table in your example, I'll work with that table only:
select alldatestates.entrydate, alldatestates.state, count(statesbyday.objid)
from
(
select alldates.entrydate, allstates.state
from (select distinct entrydate from mytable) alldates,
(select distinct state from mytable) allstates
) alldatestates
left join
(
select alldates.entrydate, allobjs.objid, (select min(state) as state from mytable t1
where t1.objid = allobjs.objid and
t1.entrydate = (select max(entrydate) from mytable t2
where t2.objid = t1.objid and
t2.entrydate <= alldates.entrydate)) as state
from (select distinct entrydate from mytable) alldates,
(select distinct objid from mytable) allobjs
) statesbyday
on alldatestates.entrydate = statesbyday.entrydate and alldatestates.state = statesbyday.state
group by alldatestates.entrydate, alldatestates.state
order by alldatestates.entrydate, alldatestates.state
Of course, this query will be much simpler if you have a table for all the possible states and another one for all the possible object ids.
Also, probably you could find a query simpler than that, but this one works. The downside is, it could very quickly become an optimizer's nightmare! :)
Since each state is not recorded on every date, you need to do a CROSS JOIN to get the unique states and then do a GROUP BY.
SELECT EntryDate,
C.State,
SUM(case when C.state = Table1.state then 1 else 0 end) as Count
FROM Table1
CROSS JOIN ( SELECT DISTINCT State FROM Table1) C
GROUP BY EntryDate, C.State
ORDER BY EntryDate
Try this query :
select EntryDate As Date, State, COUNT(ObjID) AS Count from table_name
GROUP BY EntryDate , State
ORDER BY State
You can try this with an analytic function as well:
select distinct
EntryDate as Date,
State,
count(distinct ObjID) OVER (PARTITION BY EntryDate, State) count
from table_name
order by 1;
Select EntryDate as Date, State, Count(Distinct ObjID) as Count From Table_1
Group by EntryDate, State
Working out of SQL SERVER because I'm more familiar, but here's what I've got so far:
fiddle example (SQL SERVER but the only difference should be the date functions I think...): http://sqlfiddle.com/#!3/8b9748/2
WITH zeroThruNine AS (SELECT 0 AS n UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9),
nums AS (SELECT 10*b.n+a.n AS n FROM zeroThruNine a, zeroThruNine b),
Dates AS (
SELECT DATEADD(d,n.n,(SELECT MIN(t.EntryDate) FROM #tbl t)) AS Date
FROM nums n
WHERE DATEADD(d,n.n,(SELECT MIN(t.EntryDate) FROM #tbl t))<=(SELECT MAX(t.EntryDate) FROM #tbl t)
), Data AS (
SELECT d.Date, t.ObjID, t.State, ROW_NUMBER() OVER (PARTITION BY t.ObjID, d.Date ORDER BY t.EntryDate DESC) as r
FROM Dates d, #tbl t
WHERE d.Date>=t.EntryDate
)
SELECT t.Date, t.State, COUNT(*)
FROM Data t
WHERE t.r=1
GROUP BY t.Date, t.State
ORDER BY t.Date, t.State
First, start off by making a numbers table (see http://web.archive.org/web/20150411042510/http://sqlserver2000.databases.aspfaq.com/why-should-i-consider-using-an-auxiliary-numbers-table.html for examples). There are different ways to create numbers tables in different databases, so the first two WITH expressions I've created just build a view of the numbers 0 through 99. I'm sure there are other ways, and you may need more than just 100 numbers (representing 100 dates between the first and last dates you provided).
So, once you get past the Dates CTE, the main part is the Data CTE.
It takes each date from the Dates CTE and pairs it with the rows of the #tbl table (your table) whose states were recorded on or before that date. It also numbers the states per objid in descending date order; that way, in the final query, we can just use WHERE t.r=1 to get the latest state for each objid per date.
One issue: this gets data for all dates, even those where nothing was recorded, but it doesn't return anything for zero-counts. If you wanted to, you could left join this result with a view of distinct states and take 0 when no join was made.
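A sketch of that zero-count fix, reusing the Dates and Data CTEs from the query above (the alias names here are mine):

```sql
-- Cross join every date with every distinct state, then left join the
-- latest-state rows so date/state pairs with no objects count as 0.
SELECT ds.Date, ds.State, COUNT(t.ObjID) AS [Count]
FROM (SELECT d.Date, s.State
      FROM Dates d CROSS JOIN (SELECT DISTINCT State FROM #tbl) s) ds
LEFT JOIN Data t
  ON t.Date = ds.Date AND t.State = ds.State AND t.r = 1
GROUP BY ds.Date, ds.State
ORDER BY ds.Date, ds.State
```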

Oracle SQL query : finding the last time a data was changed

I want to retrieve elapsed days since the last time the data of the specific column was changed, for example :
TABLE_X contains
ID PDATE DATA1 DATA2
A 10-Jan-2013 5 10
A 9-Jan-2013 5 10
A 8-Jan-2013 5 11
A 7-Jan-2013 5 11
A 6-Jan-2013 14 12
A 5-Jan-2013 14 12
B 10-Jan-2013 3 15
B 9-Jan-2013 3 15
B 8-Jan-2013 9 15
B 7-Jan-2013 9 15
B 6-Jan-2013 14 15
B 5-Jan-2013 14 8
I simplify the table for example purpose.
The result should be :
ID DATA1_LASTUPDATE DATA2_LASTUPDATE
A 4 2
B 2 5
which says,
- data1 of A last update is 4 days ago,
- data2 of A last update is 2 days ago,
- data1 of B last update is 2 days ago,
- data2 of B last update is 5 days ago.
Using the query below is OK, but it takes too long to complete when I apply it to the real table, which has lots of records, and add 2 more data columns to find their latest update days.
I use the LEAD function for this purpose.
Any other alternatives to speed up the query?
with qdata1 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data1,1,0) over (partition by ID order by pdate desc) - data1 as data1_diff
from table_x a
) a
where data1_diff <> 0
)
where rnum=1
),
qdata2 as
(
select ID, pdate from
(
select a.*, row_number() over (partition by ID order by pdate desc) rnum from
(
select a.*,
lead(data2,1,0) over (partition by ID order by pdate desc) - data2 as data2_diff
from table_x a
) a
where data2_diff <> 0
)
where rnum=1
)
select a.ID,
trunc(sysdate) - b.pdate data1_lastupdate,
trunc(sysdate) - c.pdate data2_lastupdate
from table_master a, qdata1 b, qdata2 c
where a.ID=b.ID(+)
and a.ID=c.ID(+)
Thanks a lot.
You can avoid the multiple hits on the table and the joins by doing both lag (or lead) calculations together:
with t as (
select id, pdate, data1, data2,
lag(data1) over (partition by id order by pdate) as lag_data1,
lag(data2) over (partition by id order by pdate) as lag_data2
from table_x
),
u as (
select t.*,
case when lag_data1 is null or lag_data1 != data1 then pdate end as pdate1,
case when lag_data2 is null or lag_data2 != data2 then pdate end as pdate2
from t
),
v as (
select u.*,
rank() over (partition by id order by pdate1 desc nulls last) as rn1,
rank() over (partition by id order by pdate2 desc nulls last) as rn2
from u
)
select v.id,
max(trunc(sysdate) - (case when rn1 = 1 then pdate1 end))
as data1_last_update,
max(trunc(sysdate) - (case when rn2 = 1 then pdate2 end))
as data2_last_update
from v
group by v.id
order by v.id;
I'm assuming that you meant your data to be for Jun-2014, not Jan-2013; and that you're comparing the most recent change dates with the current date. With the data adjusted to use 10-Jun-2014 etc., this gives:
ID DATA1_LAST_UPDATE DATA2_LAST_UPDATE
-- ----------------- -----------------
A 4 2
B 2 5
The first CTE (t) gets the actual table data and adds two extra columns, one for each of the data columns, using lag (which is the same as lead ordered by descending dates).
The second CTE (u) adds two date columns that are only set when the data columns are changed (or when they are first set, just in case they have never changed). So if a row has data1 the same as the previous row, its pdate1 will be blank. You could combine the first two by repeating the lag calculation but I've left it split out to make it a bit clearer.
The third CTE (v) assigns a ranking to those pdate columns such that the most recent is ranked first.
And the final query works out the difference from the current date to the highest-ranked (i.e. most recent) change for each of the data columns.
SQL Fiddle, including all the CTEs run individually so you can see what they are doing.
Your query wasn't returning the right results for me (maybe I missed something), but I also got the correct results with the below query (you can check this SQLFiddle demo):
with ranked as (
select ID,
data1,
data2,
rank() over(partition by id order by pdate desc) r
from table_x
)
select id,
sum(DATA1_LASTUPDATE) DATA1_LASTUPDATE,
sum(DATA2_LASTUPDATE) DATA2_LASTUPDATE
from (
-- here I get when data1 was updated
select id,
count(1) DATA1_LASTUPDATE,
0 DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data1 = data1)
and PRIOR r = r - 1
group by id
union
-- here I get when data2 was updated
select id,
0 DATA1_LASTUPDATE,
count(1) DATA2_LASTUPDATE
from ranked
start with r = 1
CONNECT BY (PRIOR data2 = data2)
and PRIOR r = r - 1
group by id
)
group by id

"Group" some rows together before sorting (Oracle)

I'm using Oracle Database 11g.
I have a query that selects, among other things, an ID and a date from a table. Basically, what I want to do is keep the rows that have the same ID together, and then sort those "groups" of rows by the most recent date in the "group".
So if my original result was this:
ID Date
3 11/26/11
1 1/5/12
2 6/3/13
2 10/15/13
1 7/5/13
The output I'm hoping for is:
ID Date
3 11/26/11 <-- (Using this date for "group" ID = 3)
1 1/5/12
1 7/5/13 <-- (Using this date for "group" ID = 1)
2 6/3/13
2 10/15/13 <-- (Using this date for "group" ID = 2)
Is there any way to do this?
One way to get this is by using analytic functions; I don't have an example of that handy.
This is another way to get the specified result, without using an analytic function (this is ordering first by the most_recent_date for each ID, then by ID, then by Date):
SELECT t.ID
, t.Date
FROM mytable t
JOIN ( SELECT s.ID
, MAX(s.Date) AS most_recent_date
FROM mytable s
WHERE s.Date IS NOT NULL
GROUP BY s.ID
) r
ON r.ID = t.ID
ORDER
BY r.most_recent_date
, t.ID
, t.Date
The "trick" here is to return "most_recent_date" for each ID, and then join that to each row. The result can be ordered by that first, then by whatever else.
You can use the MAX ... KEEP function with your aggregate to create your sort key:
with
sample_data as
(select 3 id, to_date('11/26/11','MM/DD/RR') date_col from dual union all
select 1, to_date('1/5/12','MM/DD/RR') date_col from dual union all
select 2, to_date('6/3/13','MM/DD/RR') date_col from dual union all
select 2, to_date('10/15/13','MM/DD/RR') date_col from dual union all
select 1, to_date('7/5/13','MM/DD/RR') date_col from dual)
select
id,
date_col,
-- For illustration purposes, does not need to be selected:
max(date_col) keep (dense_rank last order by date_col) over (partition by id) sort_key
from sample_data
order by max(date_col) keep (dense_rank last order by date_col) over (partition by id);
Here is the query using analytic functions:
select
id
, date_
, max(date_) over (partition by id) as max_date
from table_name
order by max_date, id
;