How to use analytic functions to find the next-most-recent timestamp in the same table - sql

I am currently using a self-join to calculate the next-most-recent timestamp for any given row:
SELECT t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN,
MAX(pt.TIMESTAMP_COLUMN) AS PREV_TIMESTAMP_COLUMN
FROM Table1 t
LEFT JOIN Table1 pt ON pt.COLUMN1 = t.COLUMN1
AND pt.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND pt.SOME_OTHER_COLUMN = SOME_LITERAL_VALUE
GROUP BY t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN
The problem is, I need to do this multiple times, for multiple comparisons, which will require multiple nested self-joins, which will be very ugly code, and probably very slow to execute.
How do you accomplish this same thing, but using analytic functions instead?
I started writing some code, but it looks wrong:
SELECT DISTINCT t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN,
MAX(CASE WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND t.SOME_OTHER_COLUMN = SOME_LITERAL_VALUE
THEN t.TIMESTAMP END) OVER
(PARTITION BY t.COLUMN1) AS PREV_TIMESTAMP_COLUMN1,
MAX(CASE WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND t.SOME_OTHER_COLUMN = SOME_OTHER_LITERAL_VALUE
THEN t.TIMESTAMP END) OVER
(PARTITION BY t.COLUMN1) AS PREV_TIMESTAMP_COLUMN2
FROM Table1 t
As soon as I saw WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN I thought "This can't be right ..."
I know there are many other ways of using analytic functions, such as ROWS UNBOUNDED PRECEDING, but I'm new to analytic functions, and I don't know how to implement those.
What's the best way to use analytic functions to accomplish this?

I think that you could do a conditional window max with a frame specification, as follows:
SELECT DISTINCT
COLUMN1,
SOME_OTHER_COLUMN,
TIMESTAMP_COLUMN,
MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
OVER(
PARTITION BY COLUMN1
ORDER BY TIMESTAMP_COLUMN
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) PREV_TIMESTAMP_COLUMN
FROM Table1 t
This will get you the greatest timestamp among previous records that have the same COLUMN1 and whose SOME_OTHER_COLUMN is equal to the desired literal value.
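If you need this for several different comparisons at once (as in the question's PREV_TIMESTAMP_COLUMN1 and PREV_TIMESTAMP_COLUMN2), the same conditional window max can simply be repeated per literal in a single pass over the table. A sketch, reusing the placeholder column and literal names from the question:
SELECT DISTINCT
COLUMN1,
SOME_OTHER_COLUMN,
TIMESTAMP_COLUMN,
-- most recent earlier timestamp among rows matching the first literal
MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
OVER(PARTITION BY COLUMN1
ORDER BY TIMESTAMP_COLUMN
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PREV_TIMESTAMP_COLUMN1,
-- most recent earlier timestamp among rows matching the second literal
MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_OTHER_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
OVER(PARTITION BY COLUMN1
ORDER BY TIMESTAMP_COLUMN
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PREV_TIMESTAMP_COLUMN2
FROM Table1 t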

Related

Correlated subquery - Group by in inner query and rewrite to window function

I am looking at this query:
select ID, date1
from table1 as t1
where date2 = (select max(date2)
from table1 as t2
where t1.ID = t2.ID
group by t2.ID)
First of all, I don't think the group by is necessary. Am I right? Second, is it generally more efficient to rewrite this as a window function?
Does this look right?
select ID, date1
from (select ID, date1, row_number() over(partition by ID) as row_num,
max(date2) over(partition by ID) as max_date
from table1)
where row_num = 1;
First of all I don't think the Group by is necessary. Am I right?
You are correct. That's a scalar subquery anyway: group by doesn't change the result since we are filtering on a single ID. Not using group by makes the intent clearer in my opinion.
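For reference, this is the original correlated subquery with the group by removed (same tables and columns as in the question):
select ID, date1
from table1 as t1
where date2 = (select max(t2.date2)
from table1 as t2
where t2.ID = t1.ID);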
The window function solution does not need max() - the filtering is sufficient, but the window function needs an order by clause (and the derived table needs an alias). So:
select ID, date1
from (
select ID, date1, row_number() over(partition by ID order by date2 desc) as row_num
from table1
) t1
where row_num = 1;
That's not exactly the same thing as the first query, because it does not allow ties, while the first query does. Use rank() if you want the exact same logic.
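A sketch of that rank() variant, which keeps ties on date2 just like the original query:
select ID, date1
from (
select ID, date1, rank() over(partition by ID order by date2 desc) as rnk
from table1
) t1
where rnk = 1;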
Which query performs better highly depends on your data and structure. The correlated subquery would take advantage of an index on (id, date2). Both solutions are canonical approaches to the problem, and you would probably need to test both solutions against your data to see which one is better in your context.

SQL: How to select a max record each day?

I found a lot of similar questions but none fits my case perfectly, and I have been struggling for hours to find a solution. My table is composed of the fields DAY, HOUR, EVENT1, EVENT2, EVENT3, so I have 24 rows per day. EVENT1, EVENT2, EVENT3 hold some values, and for each day I'd like to select only the row (i.e. the record) for which EVENT3 has the maximum value of that day (among the 24 hours). The final outcome will be one row per day.
One method uses a correlated subquery:
select t.*
from t
where t.event3 = (select max(t2.event3)
from t t2
where t2.day = t.day
);
In most databases, this has very good performance with an index on (day, event3).
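For example, assuming the table is named t with the column names from the question, the index could be created along these lines (the index name is illustrative and exact DDL syntax varies by database):
create index idx_t_day_event3 on t (day, event3);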
A more canonical solution uses row_number():
select t.*
from (select t.*,
row_number() over (partition by day order by event3 desc) as seqnum
from t
) t
where seqnum = 1;
Another option, aside from using correlated subqueries, is to write this as a left self-join, something like this:
SELECT t.*
FROM t
LEFT JOIN t AS t2 ON t.day = t2.day AND t2.event3 > t.event3
WHERE t2.day IS NULL
If you want to select an arbitrary matching row each day in the event of multiple rows with the same maximum event3, tack GROUP BY t.day on the end of that.
I'm not sure how performance of this is going to compare to Gordon Linoff's solutions, but they might get assembled into quite similar query plans by the RDBMS anyway.
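If you want to check, you can look at the plans yourself. A sketch using PostgreSQL/MySQL-style EXPLAIN (the keyword differs in other engines, e.g. EXPLAIN PLAN FOR in Oracle):
-- show the plan of the left self-join version; run the same for the row_number() version and compare
EXPLAIN
SELECT t.*
FROM t
LEFT JOIN t AS t2 ON t.day = t2.day AND t2.event3 > t.event3
WHERE t2.day IS NULL;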

Get minimum without using row number/window function in Bigquery

I have a table like the one shown below.
What I would like to do is get the minimum of each subject. Though I am able to do this with the row_number function, I would like to do it with a group by and min() approach, but it doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together; they differentiate the groups. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output, because time_1 has different values.
I expect my output to be as shown below.
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
After you aggregate the records into the array, you want its first element. The .* expands the record referred to by a into its component columns.
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM table a
) a
WHERE RANK = 1;
Are you looking for something like the below?
SELECT
A.subject_id,
A.id,
A.min_time,
A.max_time,
A.time_1,
A.value
FROM table A
INNER JOIN(
SELECT subject_id, MIN(value) Value
FROM table
GROUP BY subject_id
) B ON A.subject_id = B.subject_id
AND A.Value = B.Value
If you do not need to select the Time_1 column's value, the following query will work (as I can see, the values in the min_time and max_time columns are the same within a group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if you can apply it, is to use something like CAST(Time_1 AS DATE) on your time column. This considers only the date part, regardless of the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Check that the CAST ... AS DATE syntax in BigQuery
-- is as written here, or adjust it slightly.
Below is for BigQuery Standard SQL and is the most efficient way for cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

In SQL, how do I create a new column of values for each distinct value of another column?

Something like this: SQL How to create a value for a new column based on the count of an existing column by groups?
But I have more than two distinct values. I have a variable number n of distinct values, so I don't always know how many different counts I have.
And then, in the original table, I want each row with value '3', '4', etc. to carry the corresponding count, i.e. all the rows with '3' would have the same count, all the rows with '4' would have the same count, and so on.
edit: Also, how would I split the count by different dates, e.g. '2017-07-19', for each distinct value?
edit2: Here is how I did it, but now I need to split it by different dates.
edit3: This is how I split by dates.
#standardSQL
SELECT * FROM
(SELECT * FROM table1) main
LEFT JOIN (SELECT event_date, value, COUNT(value) AS count
FROM table1
GROUP BY event_date, value) sub ON main.value=sub.value
AND sub.event_date=SAFE_CAST(main.event_time AS DATE)
edit4: I wish PARTITION BY were documented better somewhere; there doesn't seem to be much detailed documentation on it for BigQuery.
#standardSQL
SELECT
*,
COUNT(*) OVER (PARTITION BY event_date, value) AS cnt
FROM table1;
The query that you give would be better written using window functions:
SELECT t1.*, COUNT(*) OVER (PARTITION BY value) as cnt
FROM table1 t1;
I am not sure if this answers your question.
If you have another column that you want to count as well, you can use conditional aggregation:
SELECT t1.*,
COUNT(*) OVER (PARTITION BY value) as cnt,
SUM(CASE WHEN datecol = '2017-07-19' THEN 1 ELSE 0 END) OVER (PARTITION BY value) as cnt_20170719
FROM table1 t1;

Parallelizable OVER EACH BY

I am hitting this obstacle again and again...
JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions
Is there a best practice or recommendation for how to use window functions (OVER()) with very large data sets that cannot be processed on a single node?
Fragmenting my data and running the same query with different filters can work, but it's very limiting, takes a lot of time (and manual labor), and is costly (running the same query on the same data set 30 times instead of once).
Referring to Jeremy's answer below...
It's better, but still doesn't work properly.
If I take my original query sample:
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true
)
group by title
Now works...
But
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A-Z]')=true
)
group each by title
Gives again the Resources Exceeded Error...
Window functions can now be executed in distributed fashion according to the PARTITION BY clause given inside OVER. If you supply a PARTITION BY with your window functions, your data will be processed in parallel similar to how JOIN EACH and GROUP EACH BY are processed.
In addition, you can use PARTITION BY on the output of JOIN EACH or GROUP EACH BY without serializing execution. Using the same keys for PARTITION BY as for JOIN EACH or GROUP EACH BY is particularly efficient, because the data will not need to be reshuffled between join/aggregation and window function execution.
Update: note Jeremy's comment with good news.
OVER() functions always need to run on the whole dataset as the last step of execution (they even run after the LIMIT clauses). Everything needs to fit in the last VM, unless it's parallelizable with a PARTITION clause.
When I run into this type of error, I try to filter out as much data as I can in earlier steps.
For example, this query doesn't run:
SELECT Year, Actor1Name, Actor2Name, c FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) c, RANK() OVER(PARTITION BY YEAR ORDER BY c DESC) rank
FROM
(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name),
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name),
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
)
WHERE rank=1
ORDER BY Year
But I can fix it easily with an earlier filter, in this case adding a "HAVING c > 100":
SELECT Year, Actor1Name, Actor2Name, c FROM (
SELECT Actor1Name, Actor2Name, Year, COUNT(*) c, RANK() OVER(PARTITION BY YEAR ORDER BY c DESC) rank
FROM
(SELECT Actor1Name, Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name < Actor2Name),
(SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year FROM [gdelt-bq:full.events] WHERE Actor1Name > Actor2Name),
WHERE Actor1Name IS NOT null
AND Actor2Name IS NOT null
GROUP EACH BY 1, 2, 3
HAVING c > 100
)
WHERE rank=1
ORDER BY Year
So what is happening here: Before applying RANK() OVER(), I'm getting rid of many of the combinations that won't matter when I'm looking for the top ones (as I'm filtering out everything with a count less than 100).
To give a more specific answer, it's always better if you can supply a query and sample data to review.