Correlated subquery - Group by in inner query and rewrite to window function - sql

I am looking at this query:
select ID, date1
from table1 as t1
where date2 = (select max(date2)
               from table1 as t2
               where t1.ID = t2.ID
               group by t2.ID)
First of all, I don't think the Group by is necessary. Am I right? Second, is it generally more efficient to rewrite this as a window function?
Does this look right?
select ID, date1
from (select ID, date1, row_number() over(partition by ID) as row_num,
             max(date2) over(partition by ID) as max_date
      from table1)
where row_num = 1;

First of all I don't think the Group by is necessary. Am I right?
You are correct. That's a scalar subquery anyway: group by doesn't change the result since we are filtering on a single ID. Not using group by makes the intent clearer in my opinion.
The window function solution does not need max() - the filtering is sufficient, but the window function needs an order by clause (and the derived table needs an alias). So:
select ID, date1
from (
    select ID, date1,
           row_number() over(partition by ID order by date2 desc) as row_num
    from table1
) t1
where row_num = 1;
That's not exactly the same thing as the first query, because it does not allow ties, while the first query does. Use rank() if you want the exact same logic.
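For reference, a minimal sketch of that ties-preserving version with rank() (same table and columns as above) could look like this:
select ID, date1
from (
    select ID, date1,
           -- rank() gives every row tied on the max date2 the value 1
           rank() over(partition by ID order by date2 desc) as rnk
    from table1
) t1
where rnk = 1;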
Which query performs better highly depends on your data and structure. The correlated subquery would take advantage of an index on (id, date2). Both solutions are canonical approaches to the problem, and you would probably need to test both solutions against your data to see which one is better in your context.
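As a rough sketch of the index mentioned above (the index name is just a placeholder):
create index ix_table1_id_date2 on table1 (ID, date2);
This would let the correlated subquery look up the max date2 per ID without scanning the whole table.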

Related

How to get unique records using group by?

Actual:
I have a table that looks like the one below, and I want a query that returns the record with max(ts) for each index.
index,ts,pearson,close
1,2018-01-01 00:00:00.0,-0.0732723,1.19985
1,2018-01-01 00:01:00.0,-0.0324333,1.18989
1,2018-01-01 00:02:00.0,-0.0737444,1.17985
2,2018-01-01 00:01:00.0,-0.0832523,1.18955
2,2018-01-01 00:02:00.0,-0.0624323,1.16919
2,2018-01-01 00:03:00.0,-0.0237494,1.17789
Expected:
index,ts,pearson,close
1,2018-01-01 00:02:00.0,-0.0737444,1.17985
2,2018-01-01 00:03:00.0,-0.0237494,1.17789
One canonical way of doing this uses ROW_NUMBER:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY "index" ORDER BY ts DESC) rn
FROM yourTable
)
SELECT "index", ts, pearson, close
FROM cte
WHERE rn = 1;
As a side note, please don't name your columns index, which is a reserved keyword in most flavors of SQL. I escaped index above using double quotes, though you might have to escape some other way depending on your actual database.
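For illustration, the same column escaped in a few common dialects (table and column names taken from the question):
SELECT "index", ts FROM yourTable;   -- standard SQL / PostgreSQL / Oracle
SELECT `index`, ts FROM yourTable;   -- MySQL / BigQuery
SELECT [index], ts FROM yourTable;   -- SQL Server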
If you want the last record by time for each index value, then a reasonable approach is:
select t.*
from t
where t.ts = (select max(t2.ts) from t t2 where t2.index = t.index);
In particular, this can use an index on (index, ts).
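A rough sketch of that index (the index name is just a placeholder, and the reserved column name is quoted as discussed above):
CREATE INDEX ix_t_index_ts ON t ("index", ts);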

How to use analytic functions to find the next-most-recent timestamp in the same table

I am currently using a self-join to calculate the next-most-recent timestamp for any given row:
SELECT t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN,
MAX(pt.TIMESTAMP_COLUMN) AS PREV_TIMESTAMP_COLUMN
FROM Table1 t
LEFT JOIN Table1 pt ON pt.COLUMN1 = t.COLUMN1
AND pt.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND pt.SOME_OTHER_COLUMN = SOME_LITERAL_VALUE
GROUP BY t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN
The problem is, I need to do this multiple times, for multiple comparisons, which will require multiple nested self-joins, which will be very ugly code, and probably very slow to execute.
How do you accomplish this same thing, but using analytic functions instead?
I started writing some code, but it looks wrong:
SELECT DISTINCT t.COLUMN1,
t.SOME_OTHER_COLUMN,
t.TIMESTAMP_COLUMN,
MAX(CASE WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND t.SOME_OTHER_COLUMN = SOME_LITERAL_VALUE
THEN t.TIMESTAMP END) OVER
(PARTITION BY t.COLUMN1) AS PREV_TIMESTAMP_COLUMN1,
MAX(CASE WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN
AND t.SOME_OTHER_COLUMN = SOME_OTHER_LITERAL_VALUE
THEN t.TIMESTAMP END) OVER
(PARTITION BY t.COLUMN1) AS PREV_TIMESTAMP_COLUMN2
FROM Table1 t
As soon as I saw WHEN t.TIMESTAMP_COLUMN < t.TIMESTAMP_COLUMN I thought "This can't be right ..."
I know there are many other ways of using analytic functions, such as ROWS UNBOUNDED PRECEDING, but I'm new to analytic functions, and I don't know how to implement those.
What's the best way to use analytic functions to accomplish this?
I think that you could do a conditional window max with a frame specification, as follows:
SELECT DISTINCT
       COLUMN1,
       SOME_OTHER_COLUMN,
       TIMESTAMP_COLUMN,
       MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
           OVER (
               PARTITION BY COLUMN1
               ORDER BY TIMESTAMP_COLUMN
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) PREV_TIMESTAMP_COLUMN
FROM Table1 t
This will get you the greatest timestamp among the previous records that have the same COLUMN1 and whose SOME_OTHER_COLUMN is equal to the desired literal value.
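Since the original concern was needing this for multiple comparisons, here is a sketch of how the same pattern extends to a second literal (using SOME_OTHER_LITERAL_VALUE from the question) without any extra joins:
SELECT DISTINCT
       COLUMN1,
       SOME_OTHER_COLUMN,
       TIMESTAMP_COLUMN,
       MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
           OVER (PARTITION BY COLUMN1 ORDER BY TIMESTAMP_COLUMN
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) PREV_TIMESTAMP_COLUMN1,
       MAX(CASE WHEN SOME_OTHER_COLUMN = 'SOME_OTHER_LITERAL_VALUE' THEN TIMESTAMP_COLUMN END)
           OVER (PARTITION BY COLUMN1 ORDER BY TIMESTAMP_COLUMN
                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) PREV_TIMESTAMP_COLUMN2
FROM Table1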

Get minimum without using row number/window function in Bigquery

I have a table like the one shown below.
What I would like to do is get the minimum value for each subject. Though I am able to do this with the row_number function, I would like to do it with a group by and min() approach, but that doesn't work.
row_number approach - works fine
SELECT * FROM (select subject_id,value,id,min_time,max_time,time_1,
row_number() OVER (PARTITION BY subject_id ORDER BY value) AS rank
from table A) WHERE RANK = 1
min() approach - doesn't work
select subject_id,id,min_time,max_time,time_1,min(value) from table A
GROUP BY SUBJECT_ID,id
As you can see, just the two columns (subject_id and id) are enough to group the items together; they differentiate the groups. But why am I not able to use the other columns in the select clause? If I use the other columns, I may not get the expected output because time_1 has different values.
I expect my output to be as shown below.
In BigQuery you can use aggregation for this:
SELECT ARRAY_AGG(a ORDER BY value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table A
GROUP BY SUBJECT_ID;
This uses ARRAY_AGG() to aggregate each record (the a in the argument list). ARRAY_AGG() allows you to order the result (by value) and to limit the size of the array. The latter is important for performance.
[SAFE_OFFSET(0)] then takes the first (and only) element of each array, and the .* expands the record referred to by a into its component columns.
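If you only need a subset of columns, a similar sketch (table and column names taken from the question) aggregates just those fields into a STRUCT:
SELECT subject_id,
       ARRAY_AGG(STRUCT(a.id, a.time_1, a.value) ORDER BY a.value LIMIT 1)[SAFE_OFFSET(0)].*
FROM table a
GROUP BY subject_id;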
I'm not sure why you don't want to use ROW_NUMBER(). If the problem is the lingering rank column, you can easily remove it:
SELECT a.* EXCEPT (rank)
FROM (SELECT a.*,
ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY value) AS rank
FROM A
) a
WHERE RANK = 1;
Are you looking for something like below?
SELECT A.subject_id,
       A.id,
       A.min_time,
       A.max_time,
       A.time_1,
       A.value
FROM table A
INNER JOIN (
    SELECT subject_id, MIN(value) Value
    FROM table
    GROUP BY subject_id
) B ON A.subject_id = B.subject_id
   AND A.Value = B.Value
If you do not need to select the Time_1 column's value, the following query will work (as far as I can see, the values in min_time and max_time are the same within a group):
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
--A.time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time
Finally, the best approach, if you can, is to apply something like CAST(Time_1 AS DATE) to your time column. This considers only the date part and ignores the time part. The query would be:
SELECT
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE) Time_1,
MIN(A.value)
FROM table A
GROUP BY
A.subject_id,A.id,A.min_time,A.max_time,
CAST(A.time_1 AS DATE)
-- Double-check the exact CAST(... AS DATE) syntax in BigQuery;
-- it may differ slightly from what is written here.
Below is for BigQuery Standard SQL and is the most efficient approach for cases like the one in your question:
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY value LIMIT 1)[OFFSET(0)]
FROM `project.dataset.table` t
GROUP BY subject_id
Using ROW_NUMBER is not efficient and in many cases leads to a "Resources exceeded" error.
Note: a self-join is also a very inefficient way of achieving your objective.
A bit late to the party, but here is a cte-based approach which made sense to me:
with mins as (
select subject_id, id, min(value) as min_value
from table
group by subject_id, id
)
select distinct t.subject_id, t.id, t.time_1, t.min_time, t.max_time, m.min_value
from table t
join mins m on m.subject_id = t.subject_id and m.id = t.id

Explain how this SELECT WHERE subquery works?

Here's the query:
SELECT ID, Name, EventTime, State
FROM mytable as mm Where EventTime IN
(Select MAX(EventTime) from mytable mt where mt.id=mm.id)
Here is the fiddle:
http://sqlfiddle.com/#!3/9630c0/5
It comes from this S.O. question:
Select distinct rows whilst grouping by max value
I would like to hear in plain english how it works. I'm missing some fundamental understanding of part of it.
I don't really understand what the aliases are doing in the mt.id=mm.id part. It selects rows where the id is equal to the id?
The mt.id=mm.id part makes it a correlated subquery, hence the subquery is re-evaluated for each ID.
The query, then, selects the most recent event for each ID.
It basically translates to "get me, for each id, the row(s) associated with the maximum EventTime."
You can also rewrite the code as
SELECT t1.ID, t1.Name, t1.EventTime, t1.State
FROM mytable as t1
INNER JOIN (
    select id, max(EventTime) as EventTime
    from mytable
    group by id
) as t2 on t1.id = t2.id and t1.EventTime = t2.EventTime
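For comparison, a minimal sketch of the same idea with a window function (which, like the IN version, keeps ties on the maximum EventTime):
SELECT ID, Name, EventTime, State
FROM (
    SELECT ID, Name, EventTime, State,
           MAX(EventTime) OVER (PARTITION BY ID) AS MaxEventTime
    FROM mytable
) t
WHERE EventTime = MaxEventTime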

Implement FIRST() in select and not in WHERE

I want to get first value in a field in Oracle when another corresponding field has max value.
Normally, we would do this using a query and a subquery: the subquery orders by a field, and the outer query filters with where rownum<=1.
But, I cannot do this because the table aliases persist only one level deep and this query is a part of another big query and I need to use some aliases from the outermost query.
Here's the query structure
select
    (
        select a --This should get first value of a after b's are sorted desc
        from
        (
            select a, b from table1 where table1.ID = t2.ID order by b desc
        )
        where rownum <= 1
    ) as "A",
    ID
from table2 t2
Now this is not gonna work, because the alias t2 won't be available in the innermost query.
A real-world analogy that comes to mind: I have a table containing records for all employees of a company, their salaries (including past salaries) and the date from which each salary was effective. So, for each employee, there will be multiple records. Now, I want to get the latest salary for every employee.
With SQL Server, I could have used SELECT TOP. But that's not available in Oracle, and since where clauses execute before order by, I cannot use where rownum<=1 and order by in the same query and expect correct results.
How do I do this?
Using your analogy of employees and their salaries, if I understand what you are trying to do, you could do something like this (haven't tested):
SELECT *
FROM (
SELECT employee_id,
salary,
effective_date,
ROW_NUMBER() OVER (PARTITION BY employee_id ORDER BY effective_date DESC) rowno
FROM employees
)
WHERE rowno=1
I would much rather see you connect the subquery up with a JOIN instead of embedding it in the SELECT. Cleaner SQL. Then you can use the windowing function that roartechs suggests.
Select t2.whatever, t1.a
From table2 t2
Inner Join (
Select tfirst.ID, tfirst.a
From (
Select ID, a,
ROW_NUMBER() Over (Partition BY ID ORDER BY b DESC) rownumber
FROM table1
) tfirst
WHERE tfirst.rownumber=1
) t1 on t2.ID=t1.ID