Creating column and filtering it in one select statement - sql

I'm wondering whether it is possible to create a new column and filter on that column in the same statement. The following is an example:
SELECT row_number() over (partition by ID order by date asc) row# FROM table1 where row# = 1
Thanks!

Some databases support a QUALIFY clause which you might be able to use:
SELECT *
FROM table1
QUALIFY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) = 1;
On SQL Server, you may use a TOP 1 WITH TIES trick:
SELECT TOP 1 WITH TIES *
FROM table1
ORDER BY ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date);
More generally, you would have to use a subquery:
WITH cte AS (
SELECT t.*, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY date) rn
FROM table1 t
)
SELECT *
FROM cte
WHERE rn = 1;

The WHERE clause is evaluated before the SELECT list, so the computed column does not exist yet when WHERE runs. You can work around this by wrapping the original query in a subquery and filtering in the outer query:
SELECT *
FROM
(
SELECT row_number() over (partition by ID order by date asc) row#
FROM table1
) a
WHERE a.row# = 1
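
To make the evaluation-order point concrete, here is a minimal sketch using Python's built-in sqlite3 module (needs SQLite 3.25+ for window functions). The sample rows and the alias rn are made up for illustration; the table name table1 follows the question. Other databases word the error differently, but the direct filter is generally rejected in the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, date TEXT);
    INSERT INTO table1 VALUES
        (1, '2024-01-05'), (1, '2024-01-02'),
        (2, '2024-03-01'), (2, '2024-02-10');
""")

# Attempt 1: filter on the window-function alias in the same SELECT.
# This is expected to be rejected, since the alias is not available to WHERE.
error_message = None
try:
    conn.execute("""
        SELECT id, date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
        FROM table1
        WHERE rn = 1
    """).fetchall()
except sqlite3.Error as exc:
    error_message = str(exc)
print("direct filter:", error_message)

# Attempt 2: compute the column in a subquery, then filter outside it.
rows = conn.execute("""
    SELECT id, date
    FROM (
        SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn
        FROM table1 t
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()
print(rows)
```

The second query returns the earliest row per id, because the window orders by date ascending.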

Related

sql query to get latest record for each id

I have one table. From it I need to get the latest "Date" for each "id". I wrote a query for one id, but I don't know how to apply it to multiple ids (that is, for each id).
My query for one id is (say table name is tt):
select * from (
SELECT DISTINCT id ,date FROM tt
WHERE Trim(id) ='1000082'
ORDER BY date desc
) where rownum<=1;
If you have just two columns, aggregation is good enough:
select id, max(date) max_date
from mytable
group by id
If you have more columns and you want the entire row that has the latest date for each id, then one option uses a correlated subquery for filtering:
select t.*
from mytable t
where t.date = (select max(t1.date) from mytable t1 where t1.id = t.id)
Or you can use window functions, if your database supports them:
select *
from (select t.*, row_number() over(partition by id order by date desc) rn from mytable t) t
where rn = 1
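
One difference between the last two options is worth seeing on data: the correlated subquery keeps every row that ties on the latest date, while row_number() keeps exactly one row per id. A small sketch with Python's sqlite3 module follows; the note column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, date TEXT, note TEXT);
    INSERT INTO mytable VALUES
        (1, '2024-01-01', 'a'),
        (1, '2024-02-01', 'b'),
        (1, '2024-02-01', 'c'),   -- ties with 'b' on the latest date
        (2, '2024-01-15', 'd');
""")

# Correlated subquery: returns all rows whose date equals the per-id maximum.
correlated = conn.execute("""
    SELECT t.id, t.date, t.note
    FROM mytable t
    WHERE t.date = (SELECT MAX(t1.date) FROM mytable t1 WHERE t1.id = t.id)
    ORDER BY t.id, t.note
""").fetchall()

# Window function: numbers rows per id, newest first, and keeps only the first.
windowed = conn.execute("""
    SELECT id, date, note
    FROM (
        SELECT t.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rn
        FROM mytable t
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(correlated)  # both tied rows for id 1 appear
print(windowed)    # one row per id; which tied row wins is arbitrary
```

If you need a deterministic winner among ties, add a tie-breaking column to the window's ORDER BY.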

Random records in Oracle table based on conditions

I have an Oracle table with the following columns
Table Structure
In a query I need to return all the records with CPER>=40 which is trivial. However, apart from CPER>=40 I need to list 5 random records for each CPID.
I have attached a sample list of records. However, in my table I have around 50,000 records.
I'd appreciate it if you can help.
Oracle solution:
with CTE as
(
select t1.*,
row_number() over(partition by CPID order by DBMS_RANDOM.VALUE) as rn -- random order assigned within each CPID
from MyTable t1
where CPER < 40
)
select *
from CTE
where rn <= 5 -- pick 5 at random per CPID
union all
select t2.*, null
from MyTable t2
where CPER >= 40
SQL Server:
with CTE as
(
select t1.*,
row_number() over(partition by CPID order by newid()) as rn -- random order assigned within each CPID
from MyTable t1
where CPER < 40
)
select *
from CTE
where rn <= 5 -- pick 5 at random per CPID
union all
select t2.*, null
from MyTable t2
where CPER >= 40
How about something like this...
SELECT *
FROM (SELECT CID,
CVAL,
CPID,
CPER,
Row_number() OVER (partition BY CPID ORDER BY CPID ASC ) AS RN
FROM Table) tmp
WHERE CPER>=40 OR RN <= 5
However, this is not random.
Assuming that you want five additional random records, you can do:
select t.*
from (select t.*,
row_number() over (partition by cpid,
(case when cper >= 40 then 1 else 2 end)
order by dbms_random.value
) as seqnum
from t
) t
where seqnum <= 5 or cper >= 40;
The row_number() is enumerating the rows for each cpid in two groups -- based on the cper value. The outer where is taking all cper values in the range you want as well as five from the other group.
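
Here is a rough way to check that behaviour outside Oracle, using Python's sqlite3 module with random() standing in for dbms_random.value. The generated rows are made up (90 rows, three cpid values); only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (cid INTEGER, cpid INTEGER, cper INTEGER)")

# Hypothetical data: cpid cycles through 0..2, cper is spread over 0..99.
rows = [(cid, cid % 3, (cid * 7) % 100) for cid in range(1, 91)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

picked = conn.execute("""
    SELECT cid, cpid, cper
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (
                   PARTITION BY cpid, (CASE WHEN cper >= 40 THEN 1 ELSE 2 END)
                   ORDER BY random()
               ) AS seqnum
        FROM t
    )
    WHERE seqnum <= 5 OR cper >= 40
""").fetchall()

# Count the randomly sampled rows (cper < 40) per cpid.
for cpid in (0, 1, 2):
    sampled = [r for r in picked if r[1] == cpid and r[2] < 40]
    print(cpid, len(sampled))
```

Every row with cper >= 40 comes back, plus at most five randomly chosen rows with cper < 40 for each cpid. Rerunning the query picks a different sample.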

fetch records from the column with the max date column

For the below table
ID DATE FIELD1 FIELD2 FIELD4
123456 01.07.2014 00:00:00 EMPLOYER GROUPS TMC SELECT CARE HMO
123789 01.07.2017 00:00:00 EMPLOYER GROUPS MQC SELECT CARE HMO
How do I select only the row that has max(date), i.e. 01.07.2017?
select *
from theTable
where Date = (select max(date) from theTable)
Should do it. Add a top 1 if multiple rows have the same date.
In Oracle 12+, you can use:
select t.*
from t
order by date desc
fetch first 1 row only;
In earlier versions, you can use a subquery:
select t.*
from (select t.*
from t
order by date desc
) t
where rownum = 1;
If you need more than one record with the same maximum date:
select t.*
from t
where t.date = (select max(t2.date) from t t2);
For tsql / SQL Server you can use the following:
DECLARE @max_date datetime;
SELECT @max_date = max(DATE) from table_name;
SELECT TOP 1 * FROM table_name WHERE DATE = @max_date;
The 'TOP 1' makes sure you only receive one row.
Which row that is will be arbitrary, though, unless you add an 'ORDER BY' clause to your query.
Alternatively, the dense_rank() function may be used:
select dt, ID
from
(
select dense_rank() over (order by "Date" desc) as dr,
"Date" as dt,
ID
from tab
)
where dr = 1;
or, if you're sure there is no tie on the "Date" column, even the row_number() function may be used:
select dt, ID
from
(
select row_number() over (order by "Date" desc) as rn,
"Date" as dt,
ID
from tab
)
where rn = 1;
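
The difference between the two only shows up when the maximum date is tied. A small sketch with Python's sqlite3 module follows; the third row (id 123999) is a hypothetical addition to force a tie, and the helper function latest is just for this demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab (id INTEGER, "Date" TEXT);
    INSERT INTO tab VALUES
        (123456, '2014-07-01'),
        (123789, '2017-07-01'),
        (123999, '2017-07-01');  -- hypothetical row that ties on the max date
""")

def latest(rank_fn):
    """Run the same query with either DENSE_RANK or ROW_NUMBER as the ranking function."""
    return conn.execute(f"""
        SELECT id, dt
        FROM (
            SELECT {rank_fn}() OVER (ORDER BY "Date" DESC) AS rk,
                   "Date" AS dt,
                   id
            FROM tab
        )
        WHERE rk = 1
        ORDER BY id
    """).fetchall()

with_ties = latest("DENSE_RANK")   # every row sharing the max date
single = latest("ROW_NUMBER")      # exactly one row, chosen arbitrarily among ties

print(with_ties)
print(single)
```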

How to ignore column in SQL Server

I have this query:
Select *
from
(Select
*,
ROW_NUMBER() OVER (PARTITION BY TID ORDER BY TID) AS RowNumber
from
MyTable
where
Eid = 'C1') as a
where
a.RowNumber = 1
and it displays these results:
Column1 Column2 RowNumber
------------------------------
Value1 value2 1
I want to ignore the RowNumber column in the select statement and I don't want to list all columns in select query (100+ columns and given is just an example).
How to do this in SQL Server?
Well, you would have to list all the columns in the outer select, if you use a subquery and row_number() to get a unique row.
An alternative method uses a correlated subquery, but requires having some unique column in the table. If you have one:
select t.*
from mytable t
where t.col = (select max(t2.col) from mytable t2 where t2.tid = t.tid and t2.eid = 'C1');
With the right indexes, this can have better performance than the row_number() version.
If you don't have a unique column, you can do:
select t.*
from (select distinct tid from mytable where eid = 'C1') tc cross apply
(select top 1 t.*
from mytable t
where t.tid = tc.tid and t.eid = 'C1'
) t;
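
To illustrate why the correlated-subquery form avoids the extra column, here is a sketch using Python's sqlite3 module (SQLite has no CROSS APPLY, so only the first variant is shown). The unique column col and the sample values are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (col INTEGER PRIMARY KEY, tid INTEGER, eid TEXT, column1 TEXT);
    INSERT INTO mytable VALUES
        (1, 10, 'C1', 'x'),
        (2, 10, 'C1', 'y'),
        (3, 20, 'C1', 'z'),
        (4, 20, 'C2', 'w');
""")

# No ROW_NUMBER() is computed, so t.* contains only the table's own columns.
cur = conn.execute("""
    SELECT t.*
    FROM mytable t
    WHERE t.col = (SELECT MAX(t2.col)
                   FROM mytable t2
                   WHERE t2.tid = t.tid AND t2.eid = 'C1')
    ORDER BY t.col
""")
columns = [d[0] for d in cur.description]
result = cur.fetchall()

print(columns)
print(result)
```

The result has one row per tid and no RowNumber column, without listing the columns by hand.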
Wrap your query as a subquery and select specific columns from it like so:
SELECT x.Column1, x.Column2
FROM
(
Select * from (Select *, ROW_NUMBER() OVER (PARTITION BY TID ORDER BY TID)
AS RowNumber from MyTable where Eid='C1') as a where a.RowNumber=1
) AS x
OR Change your original Select to:
Select a.[Column1], a.[Column2]
from
(
Select *, ROW_NUMBER() OVER (PARTITION BY TID ORDER BY TID)
AS RowNumber from MyTable where Eid='C1'
) as a
Where a.RowNumber=1
Replace * in your outer query with exactly the columns you want:
select x.Column1, x.Column2 FROM (
Select * from (Select *, ROW_NUMBER() OVER (PARTITION BY TID ORDER BY TID)
AS RowNumber from MyTable where Eid='C1') as a where a.RowNumber=1) AS x

Scalable Solution to get latest row for each ID in BigQuery

I have a quite large table with a field ID and another field collection_time. I want to select the latest record for each ID. Unfortunately, the combination (ID, collection_time) is not unique in my data. I want just one of the records with the maximum collection time. I have tried two solutions, but neither has worked for me:
First, using this query:
SELECT * FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY collection_time) as rn
FROM mytable) where rn=1
This results in a "Resources exceeded" error, which I guess is caused by the ORDER BY in the query.
Second, using a join between the table and the latest time:
(SELECT tab1.*
FROM mytable AS tab1
INNER JOIN EACH
(SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP EACH BY ID) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time)
This solution does not work for me because (ID, collection_time) is not unique, so the JOIN result contains multiple rows for each ID.
I am wondering if there is a workaround for the resourcesExceeded error, or a different query that would work in my case?
SELECT
agg.table.*
FROM (
SELECT
id,
ARRAY_AGG(STRUCT(table)
ORDER BY
collection_time DESC)[SAFE_OFFSET(0)] agg
FROM
`dataset.table` table
GROUP BY
id)
This will do the job for you and is scalable. Because it carries the whole row as a struct, you won't have to change the query when the schema changes.
Short and scalable version:
select array_agg(t order by collection_time desc limit 1)[offset(0)].*
from mytable t
group by t.id;
Quick and dirty option: combine both of your queries into one. First get all records with the latest collection_time (using your second query), then dedup them using your first query:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY tab1.ID) AS rn
FROM (
SELECT tab1.*
FROM mytable AS tab1
INNER JOIN (
SELECT ID, MAX(collection_time) AS second_time
FROM mytable GROUP BY ID
) AS tab2
ON tab1.ID=tab2.ID AND tab1.collection_time=tab2.second_time
)
)
WHERE rn = 1
And with Standard SQL (proposed by S.Mohsen sh)
WITH myTable AS (
SELECT 1 AS ID, 1 AS collection_time
),
tab1 AS (
SELECT ID,
MAX(collection_time) AS second_time
FROM myTable GROUP BY ID
),
tab2 AS (
SELECT * FROM myTable
),
joint AS (
SELECT tab2.*
FROM tab2 INNER JOIN tab1
ON tab2.ID=tab1.ID AND tab2.collection_time=tab1.second_time
)
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID) AS rn
FROM joint
)
WHERE rn=1
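
The join-then-dedup idea can be tried locally as well. The sketch below uses Python's sqlite3 module rather than BigQuery, so it says nothing about the resource limits; it only shows that the logic returns one row per ID even when the latest collection_time is duplicated. The payload column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, collection_time INTEGER, payload TEXT);
    INSERT INTO mytable VALUES
        (1, 100, 'old'),
        (1, 200, 'new-a'),
        (1, 200, 'new-b'),   -- duplicate of the latest time for id 1
        (2, 50,  'only');
""")

latest = conn.execute("""
    WITH tab1 AS (
        -- latest collection_time per id
        SELECT id, MAX(collection_time) AS second_time
        FROM mytable
        GROUP BY id
    ),
    joint AS (
        -- all rows at that latest time (may still contain duplicates)
        SELECT m.*
        FROM mytable m
        INNER JOIN tab1
            ON m.id = tab1.id AND m.collection_time = tab1.second_time
    )
    SELECT id, collection_time, payload
    FROM (
        -- dedup: number the surviving rows per id and keep the first
        SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS rn
        FROM joint
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(latest)
```

The window in the last step only has to number the few rows that survive the join, not sort every row of the table by collection_time.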
If you don't care about writing a piece of code for every column:
SELECT ID,
ARRAY_AGG(col1 ORDER BY collection_time DESC)[OFFSET(0)] AS col1,
ARRAY_AGG(col2 ORDER BY collection_time DESC)[OFFSET(0)] AS col2
FROM myTable
GROUP BY ID
I see no one has mentioned window functions with QUALIFY:
SELECT *, MAX(collection_time) OVER (PARTITION BY id) AS max_timestamp
FROM my_table
QUALIFY collection_time = max_timestamp
The window function adds a column max_timestamp that is accessible in the QUALIFY clause to filter on.
As per your comment, assuming you have a table with unique IDs for which you need to find the latest collection_time, here is another way to do it using a correlated subquery. Give it a try.
SELECT id,
(SELECT Max(collection_time)
FROM mytable B
WHERE A.id = B.id) AS Max_collection_time
FROM id_table A
Another solution, which could be more scalable since it avoids multiple scans of the same table (which happen with both the self-join and the correlated subquery in the answers above). This solution only works with standard SQL (uncheck the "Use Legacy SQL" option):
SELECT
ID,
(SELECT srow.*
FROM UNNEST(t.srows) srow
WHERE srow.collection_time = MAX(srow.collection_time))
FROM
(SELECT ID, ARRAY_AGG(STRUCT(col1, col2, col3, ...)) srows
FROM id_table
GROUP BY ID) t