How to replace a DISTINCT ON with GROUP BY in PostgreSQL 9?

I have been using the DISTINCT ON clause and have decided to replace it with GROUP BY, mainly because it "is not part of the SQL standard and is sometimes considered bad style because of the potentially indeterminate nature of its results".
I am using DISTINCT ON in conjunction with ORDER BY to select the latest records in a history table, but it's not clear to me how to do the same with GROUP BY.
What would be a general approach for moving from one construct to the other?
An example would be:
SELECT
DISTINCT ON (f1, f2 ) *
FROM table
ORDER BY f1, f2, datefield DESC;
where I get the "latest" row for each (f1, f2) pair.

If you have a query like this:
select distinct on (col1) t.*
from table t
order by col1, col2
Then you would replace this with window functions, not a group by:
select t.*
from (select t.*,
row_number() over (partition by col1 order by col2) as seqnum
from table t
) t
where seqnum = 1;
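To see the rewrite applied to the question's own (f1, f2, datefield) example, here is a minimal sketch. It runs against SQLite through Python's built-in sqlite3 module rather than PostgreSQL, purely so that it is self-contained (an assumption; window functions need SQLite 3.25 or newer). The table name history, the payload column and the sample rows are made up for illustration.

```python
import sqlite3

def latest_per_pair(rows):
    """Return the newest row for each (f1, f2) pair using row_number()."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical history table; ISO date strings sort chronologically.
    conn.execute(
        "CREATE TABLE history (f1 TEXT, f2 TEXT, datefield TEXT, payload TEXT)"
    )
    conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT f1, f2, datefield, payload
        FROM (SELECT h.*,
                     row_number() OVER (PARTITION BY f1, f2
                                        ORDER BY datefield DESC) AS seqnum
              FROM history h) AS ranked
        WHERE seqnum = 1
        ORDER BY f1, f2
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

sample_rows = [
    ("a", "x", "2020-01-01", "old"),
    ("a", "x", "2020-03-01", "new"),
    ("a", "y", "2020-02-01", "only"),
    ("b", "x", "2019-12-31", "older"),
    ("b", "x", "2020-01-15", "newer"),
]

latest = latest_per_pair(sample_rows)
for row in latest:
    print(row)
```

The PARTITION BY columns play the role of the DISTINCT ON list, and the ORDER BY inside OVER carries the remaining sort keys (here datefield DESC, so the newest row gets seqnum 1).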

Related

how to select max(column) and a column in the same request teradata

I need to select the max of a column and the column itself in the same request using Teradata SQL Assistant.
I tried:
select distinct id, col1, max(col1) from tab where id='myId' group by col1,id;
I also tried:
SELECT DISTINCT a.id, a.col1 FROM tab a
INNER JOIN (SELECT max(a.col1) AS maxINT,id FROM tab GROUP BY id)x
ON a.id = x.id
WHERE a.I_INTNE_DOSS_FIN = 'myId' ;
The problem is that I get the value of col1 in both col1 and max(col1).
Any ideas, please?
Thanks in advance.
I think you want the row where col1 has the greatest value for each id.
In Teradata, you can do this with row_number() and qualify:
select *
from tab
qualify row_number() over(partition by id order by col1 desc) = 1
It seems like you want both the detail rows and an aggregate in the same SELECT. This is easy with windowed aggregates, probably something like:
select id, col1, max(col1) over ()
from tab
where id='myId'
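A small sketch of what the windowed aggregate returns, run against SQLite via Python's sqlite3 module instead of Teradata (an assumption made only so the example is runnable; it needs SQLite 3.25 or newer). The sample values and the max_col1 alias are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id TEXT, col1 INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?)",
    [("myId", 3), ("myId", 8), ("myId", 5), ("other", 99)],
)

# WHERE is applied before the window function, so max() OVER () only
# sees the rows for 'myId' and repeats their maximum on every row.
rows = conn.execute("""
    SELECT id, col1, max(col1) OVER () AS max_col1
    FROM tab
    WHERE id = 'myId'
    ORDER BY col1
""").fetchall()
conn.close()

for row in rows:
    print(row)
```

Every detail row keeps its own col1, while max_col1 is 8 on all three rows; the row for 'other' never reaches the window because the filter removes it first.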
I think you just want one row. If so:
select top (1) t.*
from tab t
where id = 'myId'
order by col1 desc;

sql query to get latest record for each id

I have one table. From it I need to get the latest "Date" for each "id". I wrote a query for one id, but I don't know how to apply it to multiple ids (I mean, for each id).
My query for one id is (say table name is tt):
select * from (
SELECT DISTINCT id ,date FROM tt
WHERE Trim(id) ='1000082'
ORDER BY date desc
) where rownum<=1;
If you have just two columns, aggregation is good enough:
select id, max(date) max_date
from mytable
group by id
If you have more columns and you want the entire row that has the latest date for each id, then one option uses a correlated subquery for filtering:
select t.*
from mytable t
where t.date = (select max(t1.date) from mytable t1 where t1.id = t.id)
Or you can use window functions, if your database supports them:
select *
from (select t.*, row_number() over(partition by id order by date desc) rn from mytable t) t
where rn = 1
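The last two options differ when several rows share the latest date for an id. Here is a rough sketch of that difference, using SQLite through Python's sqlite3 module (an assumption for the sake of a runnable example; SQLite 3.25 or newer is required). The note column and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, date TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        (1, "2021-01-01", "first"),
        (1, "2021-02-01", "tie A"),
        (1, "2021-02-01", "tie B"),  # same latest date as "tie A"
        (2, "2021-01-15", "single"),
    ],
)

# Correlated subquery: keeps every row whose date equals the per-id maximum.
subquery_rows = conn.execute("""
    SELECT t.id, t.date, t.note
    FROM mytable t
    WHERE t.date = (SELECT max(t1.date) FROM mytable t1 WHERE t1.id = t.id)
    ORDER BY t.id, t.note
""").fetchall()

# row_number(): keeps exactly one row per id; which of the tied rows
# is kept is not defined, so only id and date are selected here.
window_rows = conn.execute("""
    SELECT id, date
    FROM (SELECT t.*,
                 row_number() OVER (PARTITION BY id ORDER BY date DESC) AS rn
          FROM mytable t) AS ranked
    WHERE rn = 1
    ORDER BY id
""").fetchall()
conn.close()

print(len(subquery_rows), len(window_rows))
```

With this data the correlated subquery returns three rows (both tied rows for id 1) and the window version returns two. If exactly one row per id is required, row_number() is the safer choice, ideally with an extra ORDER BY column to make the pick deterministic.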

Scalable Solution to get latest row for each ID in BigQuery

I have quite a large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time DESC) as rn
FROM mytable) where rn=1
This results in a "Resources exceeded" error, which I guess is caused by the ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result can contain multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job and is scalable. Because it selects whole rows rather than named columns, you won't have to change the query when the schema changes.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't mind writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, I assume you have a table of unique IDs for which you need to find the latest collection_time. Here is another way to do it, using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Here is another solution, which could be more scalable because it avoids scanning the same table multiple times (which happens with both the self-join and the correlated subquery in the answers above). It only works with standard SQL (uncheck the "Use Legacy SQL" option):
SELECT
ID,
(SELECT AS STRUCT srow.*
FROM UNNEST(t.srows) srow
ORDER BY srow.collection_time DESC
LIMIT 1) AS latest
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

Multiple columns in OVER ORDER BY

Is there a way to specify multiple columns in the OVER ORDER BY clause?
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1)) AS ID FROM MyTable A
The above works fine, but trying to add a second column does not work.
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1, A.Col2)) AS ID FROM MyTable A
Incorrect syntax near ','.
The problem is the extra parentheses around the column name. These should all work:
-- The standard way
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1) AS ID FROM MyTable A
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1, A.Col2) AS ID FROM MyTable A
-- Works, but unnecessary
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1), (A.Col2)) AS ID FROM MyTable A
Also, when you ask an SQL question, you should always specify which database you are querying against.
No brackets.
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1, A.Col2) AS ID FROM MyTable A

Multiple rows match, but I only want one?

Sometimes I wish to perform a join in which I take the largest value of one column. To do this I have to use max() with a GROUP BY, which prevents me from retrieving the other columns of the row that held the max (because they are not contained in the GROUP BY or in an aggregate function).
To fix this, I join the max value back to the original data source to get the other columns. However, my problem is that this sometimes returns more than one row.
So, so far I have something like:
SELECT * FROM
(SELECT Col1, Max(Col2) FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2 FROM Table) tab2
ON tab1.Col2 = tab2.Col2
If the above query returns three rows (all of which match the largest value of Col2), I have a bit of a headache.
If there were an extra column, Col3, and of the rows returned by the above query I only wanted the one with, say, the minimum Col3 value, how would I do this?
If you are using SQL Server 2005 or later, you can do it like this:
CTE way
;WITH CTE
AS
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
)
SELECT
*
FROM
CTE
WHERE
CTE.RowNbr=1
Subquery way
SELECT
*
FROM
(
SELECT
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) AS RowNbr,
table.*
FROM
table
) AS T
WHERE
T.RowNbr=1
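As I understand it, it could be something like this:
SELECT tab2.* FROM
(SELECT Col1, Max(Col2) AS Col2 FROM Table GROUP BY Col1) tab1
JOIN
(SELECT Col1, Col2, Col3 FROM Table) tab2
ON tab1.Col1 = tab2.Col1 AND tab1.Col2 = tab2.Col2
WHERE tab2.Col3 = (SELECT min(t.Col3) FROM Table t
WHERE t.Col1 = tab2.Col1 AND t.Col2 = tab2.Col2)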
Assuming you are using SQL Server 2005 or later, you can make use of window functions here. I have chosen ROW_NUMBER(), but it is not the only option.
;WITH T AS
( SELECT *,
ROW_NUMBER() OVER(PARTITION BY Col1 ORDER BY Col2 DESC) [RowNumber]
FROM Table
)
SELECT *
FROM T
WHERE RowNumber = 1
The PARTITION BY within the OVER clause is equivalent to the GROUP BY in your subquery, and the ORDER BY determines the order in which the rows are numbered. In this case Col2 DESC starts with the highest value of Col2 (equivalent to your MAX).
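The question also asked for the row with the minimum Col3 among those tied on the largest Col2. One way to sketch that is to add Col3 as a second sort key inside OVER. The example below runs on SQLite through Python's sqlite3 module rather than SQL Server (an assumption so that it is self-contained; SQLite 3.25 or newer is required), and the demo table and its rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Col1 TEXT, Col2 INTEGER, Col3 INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [
        ("a", 10, 7),
        ("a", 10, 3),  # ties on the largest Col2, smallest Col3
        ("a", 10, 5),
        ("a", 4, 1),
        ("b", 2, 9),
    ],
)

# Col2 DESC picks the largest Col2 per Col1; Col3 ASC breaks ties
# among those rows in favour of the smallest Col3.
picked = conn.execute("""
    WITH T AS (
        SELECT Col1, Col2, Col3,
               ROW_NUMBER() OVER (PARTITION BY Col1
                                  ORDER BY Col2 DESC, Col3 ASC) AS RowNumber
        FROM demo
    )
    SELECT Col1, Col2, Col3
    FROM T
    WHERE RowNumber = 1
    ORDER BY Col1
""").fetchall()
conn.close()

print(picked)  # [('a', 10, 3), ('b', 2, 9)]
```

Because the tie-breaker lives in the same ORDER BY, no extra join or subquery is needed, and each Col1 still yields exactly one row.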