SQL select youngest record - sql

I have a table. I want to run a SQL query that selects the youngest (most recently updated) record per ID, and I also need to output all other columns of that row. The real table has more than 500 columns.
Please note, I am using AWS Athena. The table has no indexes.
ID COL1 COL2 LAST_UPDATED
1 yyy ddd 01/01/2020
1 ccc eee 12/01/2020
2 xxx ddd 02/01/2020
2 vvv eee 19/01/2020
Desired result:
ID COL1 COL2 LAST_UPDATED
1 ccc eee 12/01/2020
2 vvv eee 19/01/2020

I found a solution that uses ROW_NUMBER() OVER (PARTITION BY ...):
SELECT *
FROM (
SELECT id, updated_at, ROW_NUMBER() OVER(PARTITION BY id ORDER BY updated_at desc) rn
from table t
)
where rn = 1

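Try the query below. It matches on both id and the maximum date, because comparing last_updated alone against the maxima of all ids can return rows that belong to a different id:
select a.*
from aws a
join (select id, max(last_updated) as max_updated from aws group by id) m
on a.id = m.id and a.last_updated = m.max_updated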
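The difference between matching on the date alone and matching on the (id, date) pair can be sketched with Python's built-in sqlite3 module standing in for Athena. The rows below are hypothetical (they are not the question's data) and were chosen so that an older row of id 1 shares its date with the newest row of id 2; dates are written as ISO strings so that they sort chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aws (id INTEGER, col1 TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO aws VALUES (?, ?, ?)",
    [
        (1, "a", "2020-01-19"),  # hypothetical: same date as id 2's newest row
        (1, "b", "2020-01-25"),
        (2, "c", "2020-01-02"),
        (2, "d", "2020-01-19"),
    ],
)

# Date-only match: any row whose date equals the maximum of ANY id qualifies.
date_only = conn.execute(
    """
    SELECT id, col1 FROM aws
    WHERE last_updated IN (SELECT MAX(last_updated) FROM aws GROUP BY id)
    ORDER BY id, col1
    """
).fetchall()

# Paired match: a row qualifies only against the maximum of its own id.
paired = conn.execute(
    """
    SELECT a.id, a.col1
    FROM aws a
    JOIN (SELECT id, MAX(last_updated) AS max_updated FROM aws GROUP BY id) m
      ON a.id = m.id AND a.last_updated = m.max_updated
    ORDER BY a.id
    """
).fetchall()

print(date_only)  # includes the stale row (1, 'a')
print(paired)     # one row per id
```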

A typical and efficient way in most databases is to use a correlated subquery:
select t.*
from t
where t.LAST_UPDATED = (select max(t2.LAST_UPDATED)
from t t2
where t2.id = t.id
);
For performance, you want an index on (id, LAST_UPDATED).
In a database that doesn't have indexes, use row_number() instead:
select t.*
from (select t.*, row_number() over (partition by id order by last_updated desc) as seqnum
from t
) t
where seqnum = 1;
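As a rough check of the row_number() approach, here is a minimal sketch that runs it against the question's sample rows. It uses Python's sqlite3 module in place of Athena (window functions need SQLite 3.25 or later), and it assumes the dates are DD/MM/YYYY and stores them as ISO strings so that ORDER BY sorts them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col1 TEXT, col2 TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "yyy", "ddd", "2020-01-01"),
        (1, "ccc", "eee", "2020-01-12"),
        (2, "xxx", "ddd", "2020-01-02"),
        (2, "vvv", "eee", "2020-01-19"),
    ],
)

# Number the rows of each id from newest to oldest, then keep row 1.
latest_rows = conn.execute(
    """
    SELECT id, col1, col2, last_updated
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY last_updated DESC) AS seqnum
        FROM t
    ) AS numbered
    WHERE seqnum = 1
    ORDER BY id
    """
).fetchall()

print(latest_rows)
```

If two rows of the same id share the newest timestamp, row_number() picks one of them arbitrarily; adding a tie-breaker column to the ORDER BY makes the choice deterministic.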

Related

Hive query optimization

My requirement is to get the id and name of the students who have more than one email id and type=1.
I am using a query like this:
select distinct b.id, b.name, b.email, b.type,a.cnt
from (
select id, count(email) as cnt
from (
select distinct id, email
from table1
) c
group by id
) a
join table1 b on a.id = b.id
where b.type=1
order by b.id
Please let me know whether this is fine or whether a simpler version is available.
Sample data is like:
id name email type
123 AAA abc@xyz.com 1
123 AAA acd@xyz.com 1
123 AAA ayx@xyz.com 3
345 BBB nch@xyz.com 1
345 BBB nch@xyz.com 1
678 CCC iuy@xyz.com 1
Expected Output:
123 AAA abc@xyz.com 1 2
123 AAA acd@xyz.com 1 2
345 BBB nch@xyz.com 1 1
678 CCC iuy@xyz.com 1 1
You can use group by with having count() for this requirement. Group by id only and count distinct emails, so that a repeated email is not mistaken for a second address:
select distinct b.id
, b.name
, b.email
, b.type
from table1 b
where b.id in
(select id from table1 group by id having count(distinct email) > 1)
and b.type=1
order by b.id
You can try the analytic form of the count() function. The inner query has to select TYPE as well, because the outer query filters on it:
SELECT sub.ID, sub.NAME
FROM (SELECT ID, NAME, TYPE, COUNT (*) OVER (PARTITION BY ID, EMAIL) cnt
FROM raw.crddacia_raw) sub
WHERE sub.cnt > 1 AND sub.TYPE = 1
I strongly recommend using window functions. However, Hive does not support count(distinct) as a window function. There are different methods to solve this. One is the sum of the ascending and descending dense_rank()s, minus one:
select id, name, email, type, cnt
from (select t1.*,
(dense_rank() over (partition by id order by email) +
dense_rank() over (partition by id order by email desc) - 1
) as cnt
from table1 t1
) t
where type = 1;
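The identity behind the dense_rank() trick (ascending rank plus descending rank minus one equals the number of distinct values in the partition) can be checked with a small sketch. It uses Python's sqlite3 module rather than Hive, since SQLite 3.25+ supports dense_rank() and can also compute COUNT(DISTINCT ...) per group for comparison. The column aliases asc_rank and desc_rank are names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT, email TEXT, type INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (123, "AAA", "abc@xyz.com", 1),
        (123, "AAA", "acd@xyz.com", 1),
        (123, "AAA", "ayx@xyz.com", 3),
        (345, "BBB", "nch@xyz.com", 1),
        (345, "BBB", "nch@xyz.com", 1),
        (678, "CCC", "iuy@xyz.com", 1),
    ],
)

# For every row, asc_rank + desc_rank - 1 is the distinct email count of its id.
# MAX() just collapses the identical per-row values into one value per id.
counts = conn.execute(
    """
    SELECT id,
           MAX(asc_rank + desc_rank - 1) AS via_dense_rank,
           COUNT(DISTINCT email)         AS via_count_distinct
    FROM (
        SELECT id, email,
               DENSE_RANK() OVER (PARTITION BY id ORDER BY email)      AS asc_rank,
               DENSE_RANK() OVER (PARTITION BY id ORDER BY email DESC) AS desc_rank
        FROM table1
    ) AS ranked
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(counts)
```

Note that the count covers every row of an id, including the type 3 row of id 123, because the type filter in the query above is applied only after the window functions have run.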
I would expect this to have better performance than your version. However, it is worth testing different versions to see which has the better performance (and feel free to come back to let others know which is better).
One more method uses collect_set and takes the size of the returned array to count the distinct emails.
Demo:
--your data example
with table1 as ( --use your table instead of this
select stack(6,
123, 'AAA', 'abc@xyz.com', 1,
123, 'AAA', 'acd@xyz.com', 1,
123, 'AAA', 'ayx@xyz.com', 3,
345, 'BBB', 'nch@xyz.com', 1,
345, 'BBB', 'nch@xyz.com', 1,
678, 'CCC', 'iuy@xyz.com', 1
) as (id, name, email, type )
)
--query
select distinct id, name, email, type,
size(collect_set(email) over(partition by id)) cnt
from table1
where type=1
Result:
id name email type cnt
123 AAA abc@xyz.com 1 2
123 AAA acd@xyz.com 1 2
345 BBB nch@xyz.com 1 1
678 CCC iuy@xyz.com 1 1
DISTINCT is still needed here because an analytic function does not collapse duplicate rows, such as the two identical rows for 345 BBB nch@xyz.com.
This is very similar to your query, but it filters the data in the inner query first, so that the join runs on less data:
select distinct orig_table.id, orig_table.name, orig_table.email, orig_table.type, intr_table.cnt
from table1 orig_table
join (
-- count distinct emails so that repeated rows are not counted twice
select a.id, a.type, count(distinct a.email) as cnt
from table1 a
where a.type = 1
group by a.id, a.type
) intr_table
on intr_table.id = orig_table.id and intr_table.type = orig_table.type

Sum analytical function or any other easy way

I have the data below and need to select all columns together with the sum of one column:
id size desc1 desc2
1 13 xxx yyy
1 13 xxx yyy
1 10 mmm kkk
1 10 mmm kkk
I need the output below:
id total_size desc1 desc2
1 23 xxx yyy
1 23 xxx yyy
1 23 mmm kkk
1 23 mmm kkk
total_size should be sum (distinct size)
select a.id
,a.size
,sum(b.size) as 'total_size'
,a.desc1
,a.desc2
from (
select *, row_number() over (order by id, size, desc1, desc2) as 'RowNumber'
from #tmp
) a
left join (
select *, row_number() over(partition by id, size order by id) as 'dupe'
from #tmp
) b
on a.id = b.id
and b.dupe=1
group by a.RowNumber
,a.id
,a.size
,a.desc1
,a.desc2
Not here to argue, but you should really consider reviewing the data structure you're working with. The query above works in three steps:
1. Select your data, adding a column to number the rows.
2. Join a copy of your data (with distinct records only).
3. Sum the size column from the list of distinct records.
You just need to add sum(distinct "size") over (partition by id) to your SQL to compute the total_size column for each row:
with tab(id,"size","desc1","desc2") as
(
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,10,'mmm','kkk' from dual union all
select 1 ,10,'mmm','kkk' from dual
)
select t.id,
sum(distinct t."size") over (partition by id) as "total_size",
t."desc1",t."desc2"
from tab t;
P.S. size is a reserved keyword, so it cannot be used as a column name unless it is quoted, as in "size".
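Not every engine accepts DISTINCT inside a window aggregate. As a hedged alternative that relies only on a plain GROUP BY, the per-id total can be computed in a derived table and joined back to every row. The sketch below runs that variant through Python's sqlite3 module on the question's sample rows; the aliases s and total_size are chosen for this illustration, and this is a swapped-in technique rather than the windowed query shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE tab (id INTEGER, "size" INTEGER, desc1 TEXT, desc2 TEXT)')
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?, ?)",
    [
        (1, 13, "xxx", "yyy"),
        (1, 13, "xxx", "yyy"),
        (1, 10, "mmm", "kkk"),
        (1, 10, "mmm", "kkk"),
    ],
)

# The derived table sums each distinct size once per id (13 + 10 = 23);
# the join then attaches that total to every original row.
rows = conn.execute(
    """
    SELECT t.id, s.total_size, t.desc1, t.desc2
    FROM tab t
    JOIN (
        SELECT id, SUM(DISTINCT "size") AS total_size
        FROM tab
        GROUP BY id
    ) s ON s.id = t.id
    ORDER BY t.desc1 DESC
    """
).fetchall()

print(rows)
```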

SQL Server get column not in Group By clause?

How to get the following result from this table?
ID1|ID2| Date
----------------------
1 | 1 | 01-01-2014
1 | 2 | 02-01-2014
2 | 3 | 03-01-2014
I want to get ID1 & ID2 for the maximum date when grouped by ID1
Result:
1,2
2,3
My code:
SELECT
ID1, MAX(DATE)
FROM
Table
GROUP BY
ID1
I need something like
SELECT
ID1, ID2, MAX(DATE)
FROM
Table
GROUP BY
ID1
Can someone help me?
There are three ways to do it.
One, a join to an aggregating subquery. Join on both ID1 and the date so that only the row holding the maximum date survives:
SELECT t1.ID1, t1.ID2, t2.MAX_DATE
FROM Table t1
INNER JOIN (
SELECT ID1, MAX(DATE) AS "MAX_DATE" FROM Table GROUP BY ID1) t2
ON t1.ID1 = t2.ID1 AND t1.DATE = t2.MAX_DATE
Or you can use the OVER() clause if your database supports window functions (SQL Server 2005+, Oracle, PostgreSQL, MySQL 8.0+, MariaDB 10.2+):
SELECT ID1,
ID2,
MAX(DATE) OVER(PARTITION BY ID1)
FROM Table
Or you can use a correlated subquery:
SELECT t1.ID1,
t1.ID2,
(SELECT MAX(DATE) FROM Table WHERE ID1 = t1.ID1)
FROM Table t1
You can accomplish this by joining the table to the aggregate, like this:
SELECT t.*
FROM
Table t
INNER JOIN
(
SELECT
ID1,
MAX(Date) MaxDate
FROM Table
GROUP BY ID1
) MaxDate ON
t.ID1 = MaxDate.ID1 AND
t.Date = MaxDate.MaxDate
You can use the ROW_NUMBER analytic function:
SELECT *
FROM
(SELECT *,
ROW_NUMBER() over ( partition by ID1 order by [date] desc) as seq
FROM Table1
) T
WHERE T.seq =1

How to select a single row when grouping by column and by max date?

I have the following data, which I would like to filter so that I get only one row per value of the first column, namely the row with the max date.
col2 contains unique values.
col1 | col2 | date
1 | 123 | 2013
1 | 124 | 2012
1 | 125 | 2014
2 | 213 | 2011
2 | 214 | 2015
2 | 215 | 2018
so the results I want are:
1 | 125 | 2014
2 | 215 | 2018
I've tried a few examples that I found on here (as below), as well as other group by / distinct / max(date) variants, but with no luck:
select t.*
from (select t.*,
row_number() over (partition by col1, col2 order by date desc) as seqnum
from t
) t
where seqnum = 1
Change the partition in the row_number() to only partition by col1 but keep the order by date desc:
select col1, col2, date
from
(
select col1, col2, date,
row_number() over (partition by col1
order by date desc) as rn
from yourtable
) x
where rn = 1
Since you were partitioning by both col1 and col2, and col2 is unique, every partition contained exactly one row. Each row therefore got row number 1, and the filter returned every row instead of only the row with the max date.
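The effect of the partition choice can be reproduced with a short sketch. It loads the question's rows into an in-memory SQLite database through Python's sqlite3 module (window functions need SQLite 3.25 or later) and runs the same query twice, changing only the PARTITION BY list. The helper name top_rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (col1 INTEGER, col2 INTEGER, date INTEGER)")
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?)",
    [
        (1, 123, 2013), (1, 124, 2012), (1, 125, 2014),
        (2, 213, 2011), (2, 214, 2015), (2, 215, 2018),
    ],
)


def top_rows(partition_columns):
    """Return the rows numbered 1 within each partition (hypothetical helper)."""
    query = f"""
        SELECT col1, col2, date
        FROM (
            SELECT col1, col2, date,
                   ROW_NUMBER() OVER (PARTITION BY {partition_columns}
                                      ORDER BY date DESC) AS rn
            FROM yourtable
        ) AS x
        WHERE rn = 1
        ORDER BY col1, col2
    """
    return conn.execute(query).fetchall()


too_fine = top_rows("col1, col2")  # one partition per row: nothing is filtered
per_group = top_rows("col1")       # one partition per col1: newest row only

print(len(too_fine))
print(per_group)
```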
I prefer bluefeet's method, but here is an equivalent using MAX:
SELECT t.col1, t.col2, t.date
FROM yourtable t
JOIN (
SELECT col1, MAX(date) maxDate
FROM yourtable
GROUP BY col1
) t2 on t.col1 = t2.col1 AND t.date = t2.maxDate
Select * from yourtable t where date =
(select max(date) from yourtable t2 where t2.col1 = t.col1);

Sql query to sort data on one column only but not change the others columns

I want a SQL query to display the following data
ID Name
1 AAA
2 BBB
3 CCC
4 DDD
as this:
ID Name
4 AAA
3 BBB
2 CCC
1 DDD
without changing the other columns.
Kindly suggest?
Thanks
You could use row_number to number the table in two directions, and zip those together:
declare @t table (id int, name varchar(4))
insert @t values (1, 'AAA'), (2, 'BBB'), (3, 'CCC'), (4, 'DDD')
; with numbered as
(
select row_number() over (order by id) as rn1
, row_number() over (order by id desc) as rn2
, *
from @t
)
select t2.id
, t1.name
from numbered t1
join numbered t2
on t1.rn1 = t2.rn2
This prints:
id name
4 AAA
3 BBB
2 CCC
1 DDD
I'm going with something like this:
SELECT t2.ID, t1.NAME
FROM
(SELECT ROW_NUMBER() OVER(ORDER BY ID DESC) AS rownumber,
Name
FROM MyTable) as t1
INNER JOIN
(SELECT ROW_NUMBER() OVER(ORDER BY ID ASC) AS rownumber,
ID
FROM MyTable) as t2
ON t1.rownumber = t2.rownumber
You have to number the rows twice, once for the Name field and once for the ID field, in opposite orders, and then join the two numbered sets on the row number to pair each name with the ID from the other end of the table.
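The pairing described above can be tried out with a minimal sketch that runs the same two-subquery join on the question's four rows. It uses Python's sqlite3 module instead of SQL Server (window functions need SQLite 3.25 or later); the final ORDER BY is added here only to make the printed order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [(1, "AAA"), (2, "BBB"), (3, "CCC"), (4, "DDD")],
)

# t1 numbers the names from highest ID to lowest, t2 numbers the IDs from
# lowest to highest; joining on the row number pairs the two ends of the table.
reversed_ids = conn.execute(
    """
    SELECT t2.ID, t1.Name
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY ID DESC) AS rownumber, Name
          FROM MyTable) AS t1
    INNER JOIN
         (SELECT ROW_NUMBER() OVER (ORDER BY ID ASC) AS rownumber, ID
          FROM MyTable) AS t2
      ON t1.rownumber = t2.rownumber
    ORDER BY t1.Name
    """
).fetchall()

print(reversed_ids)
```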
I would use a subselect with order by clause for the ID column.