hadoop hive using row_number() - sql

I have a dataset with many duplicate IDs. I just want to apply row_number() and take the first row per ID. If I left join table1 with table2 and keep only table2.rownumber=1, it works, but if I run it standalone without a join, it doesn't. I have the following code:
SELECT
ID,
NAME,
NRIC,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) as RNK
FROM TABLE1
WHERE RNK=1;
The error message shows that RNK is not a valid table column or alias.
Any help would be greatly appreciated. Thanks.

You have to use a subquery or CTE to refer to a column alias for filtering:
SELECT ID, NAME, NRIC, RNK
FROM (SELECT t1.*, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) as RNK
FROM TABLE1 t1
) t
WHERE RNK = 1;
This is true of all column aliases, even those defined by window functions, because the WHERE clause is evaluated before the SELECT list.
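The CTE form mentioned above can be sketched as follows. This is only an illustration: SQLite (3.25 or later, via Python's sqlite3 module) stands in for Hive, and the rows in TABLE1 are made up.

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for Hive; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (ID TEXT, NAME TEXT, NRIC TEXT)")
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?, ?, ?)",
    [("1", "Ann", "S001"), ("1", "Ann", "S001"), ("2", "Bob", "S002")],
)

# The window function is computed inside the CTE, so the outer query
# can filter on RNK like any ordinary column.
query = """
WITH ranked AS (
    SELECT ID, NAME, NRIC,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RNK
    FROM TABLE1
)
SELECT ID, NAME, NRIC
FROM ranked
WHERE RNK = 1
ORDER BY ID
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Because the duplicates here are identical, it does not matter which one gets RNK = 1. With rows that differ, ORDER BY ID inside the partition leaves the choice arbitrary, so a more specific ordering column is usually wanted.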

Given:
create table dupes (
id string,
democode string,
extract_timestamp string
);
And:
insert into dupes (id, democode,extract_timestamp) values
('1','code','2020')
,('2','code2','2020')
,('2','code22','2021')
,('3','code3','2020')
,('3','code33','2021')
,('3','code333','2012')
;
When:
SELECT id,democode,extract_timestamp
FROM (
SELECT id,democode,extract_timestamp,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY extract_timestamp DESC) AS row_num
FROM dupes
) t1
WHERE row_num = 1;
Then:
+-----+-----------+--------------------+--+
| id | democode | extract_timestamp |
+-----+-----------+--------------------+--+
| 1 | code | 2020 |
| 2 | code22 | 2021 |
| 3 | code33 | 2021 |
+-----+-----------+--------------------+--+
Note that tables are often partitioned, and we might want to deduplicate within each partition. In that case we add the partition key(s) to the OVER clause. For example, if the table is partitioned by report_date DATE, we might use:
ROW_NUMBER() OVER (PARTITION BY id, report_date ORDER BY extract_timestamp DESC) AS row_num
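To see the effect of adding the partition key, here is a small sketch. It assumes SQLite 3.25+ as a stand-in for Hive, stores report_date as text, and uses invented rows.

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for Hive; report_date is plain text here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dupes (id TEXT, democode TEXT, "
    "extract_timestamp TEXT, report_date TEXT)"
)
conn.executemany(
    "INSERT INTO dupes VALUES (?, ?, ?, ?)",
    [
        ("2", "code2", "2020", "2021-01-01"),
        ("2", "code22", "2021", "2021-01-01"),
        ("2", "code222", "2019", "2021-01-02"),
        ("3", "code3", "2020", "2021-01-01"),
    ],
)

# Deduplicate per (id, report_date): the latest extract wins in each partition.
query = """
SELECT id, democode, report_date
FROM (
    SELECT id, democode, report_date,
           ROW_NUMBER() OVER (PARTITION BY id, report_date
                              ORDER BY extract_timestamp DESC) AS row_num
    FROM dupes
) t1
WHERE row_num = 1
ORDER BY id, report_date
"""
deduped = conn.execute(query).fetchall()
print(deduped)
```

Note that id 2 now appears once per report_date rather than once overall.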

Related

How to select the last record of each ID

I need to extract the last records of each user from the table. The table schema is like below.
mytable
product | user_id |
-------------------
A | 15 |
B | 15 |
A | 16 |
C | 16 |
-------------------
The output I want to get is
product | user_id |
-------------------
B | 15 |
C | 16 |
Basically the last records of each user.
Thanks in advance!
You can use the ROW_NUMBER window function. Here is a solution:
WITH CTE AS
(SELECT product, user_id,
ROW_NUMBER() OVER(PARTITION BY user_id order by product desc)
as RN
FROM Mytable)
SELECT product, user_id FROM CTE WHERE RN=1 ;
You can try using row_number():
select product, userid
from
(
select product, userid,row_number() over(partition by userid order by product desc) as rn
from tablename
)A where rn=1
There is no such thing as a "last" record unless you have a column that specifies the ordering. SQL tables represent unordered sets (well technically, multisets).
If you have such a column, then use distinct on:
select distinct on (user_id) t.*
from t
order by user_id, <ordering col> desc;
Distinct on is a very handy Postgres extension that returns one row per "group". It is the first row based on the ordering specified in the order by clause.
You should have a column that stores the insertion order, either an auto-increment value or a date and time.
Ex:
autoIncrement | product | user_id |
-----------------------------------
1 | A | 15 |
2 | B | 15 |
3 | A | 16 |
4 | C | 16 |
SELECT t.product, t.user_id FROM table t inner join
( SELECT MAX(autoIncrement) as id FROM table group by user_id ) as table_Aux
ON t.autoIncrement = table_Aux.id

Removing duplicate values from sql server on condition of 2 columns

|Rownumber |OldIdassigned |commoncode |
------------------------------------------
| 1 |FLEX |Y2573F102 |
------------------------------------------
| 2 |RCL |Y2573F102 |
------------------------------------------
| 3 |FLEX |Y2573F102 |
------------------------------------------
| 4 |QGEN |N72482123 |
------------------------------------------
| 5 |QGEN |N72482123 |
------------------------------------------
| 6 |QGEN |N72482123 |
------------------------------------------
| 7 |RACE |N72482123 |
------------------------------------------
| 8 |CLB |N22717107 |
------------------------------------------
| 9 |CLB |N22717107 |
------------------------------------------
| 10 |CLB |N22717107 |
I need to delete the duplicate records based on commoncode, with one condition: if OldIdassigned is also the same, delete the row; otherwise, don't.
For example, Y2573F102 has 3 duplicate records (rows 1, 2, 3). Rows 1 and 2 must not be deleted; only row 3 has to be deleted.
I like updatable CTEs and window functions for this purpose:
with todelete as (
select t.*,
row_number() over (partition by commoncode order by rownumber) as seqnum
from t
)
delete todelete
where seqnum > 1;
Use ROW_NUMBER() :
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY OldIdassigned, commoncode ORDER BY rownumber) AS Seq
FROM table t
) t
WHERE t.seq > 1;
EDIT : If you want to check the duplication based on commoncode only then remove OldIdassigned from PARTITION clause :
DELETE t
FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY commoncode ORDER BY rownumber DESC) AS Seq
FROM table t
) t
WHERE t.seq > 1;
Use the row_number window function. According to your description and comments, it seems you need to change the partition clause:
delete t
from
(select t1.*,row_number() over(partition by commoncode order by Rownumber) rn from table t1
)t where rn<>1
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=eacc0688efb534a0addee68678f323fe
Use Row_Number():
delete t from
(select *, row_number() over(partition by commoncode order by
rownumber) as rn from tablename) t
where rn<>1
Since all answers are similar (and correct), I will post one alternative way:
DELETE FROM TableA
WHERE EXISTS ( SELECT * FROM TableA AS A2
WHERE A2.commoncode = TableA.commoncode
AND A2.OldIdassigned = TableA.OldIdassigned
AND A2.Rownumber < TableA.Rownumber )
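The EXISTS variant can be checked against the sample data from the question. This sketch assumes SQLite (through Python's sqlite3) instead of SQL Server; the table name TableA follows the answer above.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; data is the question's sample.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TableA (Rownumber INTEGER, OldIdassigned TEXT, commoncode TEXT)"
)
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?, ?)",
    [
        (1, "FLEX", "Y2573F102"), (2, "RCL", "Y2573F102"),
        (3, "FLEX", "Y2573F102"), (4, "QGEN", "N72482123"),
        (5, "QGEN", "N72482123"), (6, "QGEN", "N72482123"),
        (7, "RACE", "N72482123"), (8, "CLB", "N22717107"),
        (9, "CLB", "N22717107"), (10, "CLB", "N22717107"),
    ],
)

# Delete every row that has an earlier row with the same
# (commoncode, OldIdassigned) pair; the earliest row of each pair survives.
conn.execute("""
DELETE FROM TableA
WHERE EXISTS (SELECT 1 FROM TableA AS A2
              WHERE A2.commoncode = TableA.commoncode
                AND A2.OldIdassigned = TableA.OldIdassigned
                AND A2.Rownumber < TableA.Rownumber)
""")
remaining = [r[0] for r in conn.execute(
    "SELECT Rownumber FROM TableA ORDER BY Rownumber")]
print(remaining)
```

Rows 1 and 2 both survive because their OldIdassigned values differ, which matches the requirement stated in the question.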

How to compare the column values of two last inserted specified rows on the same table?

I have a data set that is updated on each operation made by customers.
For example, I am getting a customer's last two operations by
select id,
referance
from (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id')
al where seqnum <= 2
where id is taken from a feature file. But now I need to compare the referance values of these two operations.
mytable:
id | name | referance | time |
-------------------------------------
11 | abc | 4589 | 09:05 |
11 | abc | 1234 | 09:04 |
10 | xyz | 0185 | 09:02 |
15 | qpr | 9564 | 08:54 |
so on...
Again, I can get the last two rows with id = 11, and as long as none of the columns are null, the query returns "true", which is exactly what I want.
But I would also like to check whether their referance values are the same; when I call the query, it has to return "true" or "false".
Thanks in advance
P.S. I actually just need a useful function or idea. I have already tried an inner join but couldn't get it to work:
select table1.id,
table1.referance,
table2.id,
table2.referance
from (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id') table1
inner join (select id,
referance,
row_number()
over (order by time desc) as seqnum
from mytable where id=':id') table2
on table1.referance != table2.referance
al where seqnum <= 2 order by seqnum
Aggregate your current query over the id and check whether the two reference values are the same.
select
id,
case when count(distinct reference) = 1
then 'true' else 'false' end as result
from
(
select id, reference,
row_number() over (order by time desc) as seqnum
from table
where id=':id'
) al
where seqnum <= 2
group by id;
If the distinct count of reference over the two records is 1, they have the same value. Otherwise, the values are different.
Why are you using row_number()? You can get the last two rows as:
select top 2 id, referance
from mytable
where id=':id'
order by time desc;
You can then determine if these are the same using aggregation:
select (case when min(referance) <> max(referance) then 'false'
else 'true'
end) as is_same
from (select top 2 id, referance
from mytable
where id=':id'
order by time desc
) t;
Note: This doesn't take NULL values for reference into account, but that is easily incorporated into the logic.
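One possible way to fold NULL handling into the comparison is sketched below. The rules are assumptions, not something stated in the thread: two NULLs count as the same, one NULL counts as different, and fewer than two rows gives 'false'. SQLite is used through Python's sqlite3, so LIMIT 2 replaces TOP 2; the rows for ids 20 and 21 are invented.

```python
import sqlite3

# Assumption: SQLite stands in for the real database (LIMIT instead of TOP).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, name TEXT, referance TEXT, time TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (11, "abc", "4589", "09:05"), (11, "abc", "1234", "09:04"),
        (10, "xyz", "0185", "09:02"), (15, "qpr", "9564", "08:54"),
        # Hypothetical rows added to exercise the other branches.
        (20, "same", "7777", "10:01"), (20, "same", "7777", "10:00"),
        (21, "nul", None, "10:01"), (21, "nul", "1111", "10:00"),
    ],
)

QUERY = """
SELECT CASE
         WHEN COUNT(*) < 2 THEN 'false'            -- not enough rows
         WHEN COUNT(referance) = 0 THEN 'true'     -- both values are NULL
         WHEN COUNT(referance) = 2
              AND MIN(referance) = MAX(referance) THEN 'true'
         ELSE 'false'                              -- differ, or exactly one NULL
       END AS is_same
FROM (SELECT referance
      FROM mytable
      WHERE id = ?
      ORDER BY time DESC
      LIMIT 2) t
"""

def is_same(customer_id):
    """Return 'true' or 'false' for the last two rows of one id."""
    return conn.execute(QUERY, (customer_id,)).fetchone()[0]

for cid in (11, 20, 21, 10):
    print(cid, is_same(cid))
```

COUNT(referance) counts only non-NULL values, which is what lets the CASE distinguish the NULL cases without extra subqueries.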

Selecting compared pairs from table

I don't really know how to describe it. I have a table:
ID | Name | Date
-------------------------
1 | Mike | 01.01.2016
1 | Michael | 02.03.2016
2 | Samuel | 23.12.2015
2 | Sam | 05.03.2015
3 | Tony | 02.04.2012
I want to select pairs of IDs and Names with latest dates in each pair. The result here should be:
ID | Name | Date
-------------------------
1 | Michael | 02.03.2016
2 | Samuel | 23.12.2015
3 | Tony | 02.04.2012
How do I achieve this?
Oracle Database 11g
You can do it using the ROW_NUMBER() analytic function:
SELECT id, name, "date"
FROM (
SELECT t.*,
ROW_NUMBER() OVER ( PARTITION BY id ORDER BY "date" DESC ) rn
FROM table_name t
)
WHERE rn = 1
This requires only a single table scan (it does not have a self-join or correlated sub-query - i.e. IN (...) or EXISTS(...)).
Have a sub-select that returns each id and its max date:
select * from table
where (id, date) in (select id, max(date) from table group by id)
You can use NOT EXISTS() :
SELECT * FROM YourTable t
WHERE NOT EXISTS(SELECT 1 FROM YourTable s
WHERE t.id = s.id and s.date > t.date)
Possibly the most efficient method is:
select t.*
from table t
where t.date = (select max(date) from table t2 where t2.id = t.id);
along with an index on table(id, date).
This version should scan the table and look up the correct value in the index.
Or, if there are only three columns, you can use keep:
select id, max(date) as date,
max(name) keep (dense_rank first order by date desc) as name
from table
group by id;
I have found that this version works very well in Oracle.
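The correlated-subquery version with its (id, date) index can be tried on the question's data. This is a sketch under assumptions: SQLite (via Python's sqlite3) replaces Oracle, the table is called people, and dates are stored as ISO text (YYYY-MM-DD) so that MAX compares them chronologically.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; dates are ISO strings so that
# text comparison matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE people (id INTEGER, name TEXT, "date" TEXT)')
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "Mike", "2016-01-01"), (1, "Michael", "2016-03-02"),
        (2, "Samuel", "2015-12-23"), (2, "Sam", "2015-03-05"),
        (3, "Tony", "2012-04-02"),
    ],
)
# The index lets the per-id MAX be answered without rescanning the table.
conn.execute('CREATE INDEX idx_people_id_date ON people (id, "date")')

query = """
SELECT t.id, t.name, t."date"
FROM people t
WHERE t."date" = (SELECT MAX(t2."date")
                  FROM people t2
                  WHERE t2.id = t.id)
ORDER BY t.id
"""
latest = conn.execute(query).fetchall()
print(latest)
```

If two rows of the same id share the maximum date, this form returns both of them, whereas the ROW_NUMBER() approach returns exactly one.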

SQL: Ignore some returned rows, deleting others

I have this table :
| Column | Type |
+---------------+--------------------------------+
| id | integer |
| recipient_id | integer |
| is_read | boolean |
| updated_at | timestamp(0) without time zone |
I have to delete items from this table with this specific rule:
for each recipient_id, we keep the 5 most recent read items and delete the older read ones.
I tried to bend my mind with RECURSIVE WITH statements but failed miserably. I've implemented my solution programmatically but I wanted to know if there was a decent pure SQL solution.
DELETE FROM tbl t
USING (
SELECT id, row_number() OVER (PARTITION BY recipient_id
ORDER BY updated_at DESC) as rn
FROM tbl
WHERE is_read
) x
WHERE x.rn > 5
AND x.id = t.id;
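A JOIN is usually faster than an IN expression, especially with larger numbers of items.
And use row_number(), not rank()! rank() gives tied rows the same number, so ties on updated_at can leave more than 5 rows per recipient.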
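The difference between row_number() and rank() only shows up when updated_at has ties. The sketch below uses SQLite 3.25+ through Python's sqlite3 as a stand-in for Postgres, with six invented read items for one recipient where the two oldest share a timestamp.

```python
import sqlite3

# Assumption: SQLite 3.25+ stands in for Postgres; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, recipient_id INTEGER, "
    "is_read INTEGER, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2024-01-06"), (2, 1, 1, "2024-01-05"),
        (3, 1, 1, "2024-01-04"), (4, 1, 1, "2024-01-03"),
        (5, 1, 1, "2024-01-02"), (6, 1, 1, "2024-01-02"),  # tie with id 5
    ],
)

def ids_beyond_five(window_fn):
    """Ids whose position exceeds 5 under the given window function."""
    query = f"""
    SELECT id FROM (
        SELECT id,
               {window_fn}() OVER (PARTITION BY recipient_id
                                   ORDER BY updated_at DESC) AS pos
        FROM tbl
        WHERE is_read
    ) x
    WHERE pos > 5
    """
    return [r[0] for r in conn.execute(query)]

by_row_number = ids_beyond_five("row_number")
by_rank = ids_beyond_five("rank")
print(len(by_row_number), len(by_rank))
```

With row_number() exactly one of the tied rows is selected for deletion, while rank() assigns both tied rows position 5 and selects nothing, leaving six read items. Which of the two tied rows row_number() picks is arbitrary unless a tiebreaker such as id is added to the ORDER BY.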
Check out window functions:
DELETE FROM table
WHERE id IN (
SELECT id
FROM (
SELECT id, rank() OVER (PARTITION BY recipient_id ORDER BY updated_at DESC) as position
FROM table
WHERE is_read
) subselect WHERE position > 5
)
delete from t
where id in (
select id
from (
select
id,
row_number() over(partition by recipient_id order by updated_at desc) rn
from t
where is_read
) s
where s.rn > 5
)