Multiple columns in OVER ORDER BY - sql

Is there a way to specify multiple columns in the OVER ORDER BY clause?
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1)) AS ID FROM MyTable A
The above works fine, but trying to add a second column does not work.
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1, A.Col2)) AS ID FROM MyTable A
Incorrect syntax near ','.

The problem is the extra parentheses around the column name. These should all work:
-- The standard way
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1) AS ID FROM MyTable A
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1, A.Col2) AS ID FROM MyTable A
-- Works, but unnecessary
SELECT ROW_NUMBER() OVER(ORDER BY (A.Col1), (A.Col2)) AS ID FROM MyTable A
Also, when you ask an SQL question, you should always specify which database you are querying against.

No brackets.
SELECT ROW_NUMBER() OVER(ORDER BY A.Col1, A.Col2) AS ID FROM MyTable A

Related

Creating column and filtering it in one select statement

Wondering if it is possible to creating a new column and filter on that column. The following is an example:
SELECT row_number() over (partition by ID order by date asc) row# FROM table1 where row# = 1
Thanks!
Some databases support a QUALIFY clause which you might be able to use:
SELECT *
FROM table1
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) = 1;
On SQL Server, you may use a TOP 1 WITH TIES trick:
SELECT TOP 1 WITH TIES *
FROM table1
ORDER BY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date);
More generally, you would have to use a subquery:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) rn
FROM table1 t
)
SELECT *
FROM cte
WHERE rn = 1;
The WHERE clause is evaluated before the SELECT so your column has to exist before you can use a WHERE clause. You could achieve this by making a subquery of the original query.
SELECT *
FROM
(
SELECT row_number() over (partition by ID order by date asc) row#
FROM table1
) a
WHERE a.row# = 1

Cannot recognize input near 'DELETE' 'FROM' 'CTE' in statement

I want to drop duplicates in mytable if there are identical value in col1.
WITH CTE AS
(
SELECT
*, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS RN
FROM
mytable
)
DELETE FROM CTE
WHERE RN <> 1
I got error:
Cannot recognize input near 'DELETE' 'FROM' 'CTE' in statement
I don't think Hive supports that syntax for DELETE. Try this:
DELETE FROM mytable t
WHERE t.id > (SELECT MIN(t2.id) -- some sort of unique id
FROM t t2
WHERE t2.id = t.id
);
If you have complete duplicates, then the above won't work. In the most recent versions of Hive you can use MERGE. In older versions:
create table temp_t as
select distinct t.*
from t;
truncate table t;
insert into t
select * from temp_t;
Of course, backup the table before trying this!
Alternative way: assuming you have UNIQUE ID Column.
Delete from MyTable where ID in
(SELECT ID FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS RN
FROM mytable) a where RN <> 1)

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field ID and another field as collection_time. I want to select latest record for each ID. Unfortunately combination of (ID, collection_time) time is not unique together in my data. I want just one of records with the maximum collection time. I have tried two solutions but none of them has worked for me:
First: using query
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in Resources exceeded error that I guess is because of ORDER BY in the query.
Second
Using join between table and latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
this solution does not work for me because (ID, collection_time) are not unique together so in JOIN result there would be multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable considering the fact that the schema keeps changing, you won't have to change this
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option - combine your both queries into one - first get all records with latest collection_time (using your second query) and then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, Considering you have a table with unique ID's for which you need to find latest collection_time. Here is another way to do it using Correlated Sub-Query. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which will happen with both self-join and correlated subquery in above answers). This solution only works with standard SQL (uncheck "Use Legacy SQL" option):
SELECT
ID,
(SELECT srow.*
FROM UNNEST(t.srows) srow
WHERE srow.collection_time = MAX(srow.collection_time))
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t

How to replace a DISTINCT ON with GROUP BY in PostgreSQL 9?

I have been using the DISTINCT ON predicate and have decided to replace it with GROUP BY, mainly because it "is not part of the SQL standard and is sometimes considered bad style because of the potentially indeterminate nature of its results".
I am using DISTINCT ON in conjunction with ORDER BY in order to select the latest records in a history table, but it's not clear to me how to do the same with the GROUP BY.
What could be a general approach in order to move from one construct to the other one?
An example could be
SELECT
DISTINCT ON (f1, f2 ) *
FROM table
ORDER BY f1, f2, datefield DESC;
where I get the "latest" pairs of (f1,f2).
If you have a query like this:
select distinct on (col1) t.*
from table t
order by col1, col2
Then you would replace this with window functions, not a group by:
select t.*
from (select t.*,
row_number() over (partition by col1 order by col2) as seqnum
from table t
) t
where seqnum = 1;

How to get DELETE FROM [QUERY] to work?

Hi I'm kinda new to mssql, I'm used to Oracle. I'm trying to delete a specific row from a subquery but mssql doens't really like subqueries.
Here is the query:
DELETE FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column1) row FROM randomtable) a
WHERE a.row = 1
Is there a way to get this to work?
In Oracle I could've get everything in a query because I can use rownum = 1.
You were nearly there
DELETE a
FROM (SELECT *,
ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column1) row
FROM randomtable) a
WHERE a.row = 1
Though I prefer the CTE syntax
WITH a
AS (SELECT *,
ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column1) row
FROM randomtable)
DELETE FROM a
WHERE a.row = 1
You didn't specify the table alias in what you wanted to delete from.
DELETE a FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY column1 ORDER BY column1) row FROM randomtable) a
WHERE a.row = 1
Write the alias DELETE a FROM ...