Hive : Identification of exact duplicate records

Hive : Identification of exact duplicate records - sql

I have a requirement.
I have a hive table which has more than 200 columns.
Now i have to write a insert query to load the data to another hive table after removing all identical duplicate records.
I know i can achive it by using row number () over () .
Code snippet
Insert into table target
Select col1,col2..col200
from
(
Select col1,col2...col200,row_number () over ( partition by col1,col2...col200 order by null ) as rn from source
) a
where
rn=1
But this would very lengthy as need to write all 200 columns name multiple time,
Is there any easier solution available?
Thanks for your advice.

You can use select distinct:
Insert into table target
Select distinct col1,col2..col200
from source ;

Related

Db2 SQL Select rows with excluding rows based on large number of data

I have following query to fetch data from table for reporting which excludes based on a condition
Select ACCOUNTNO, EFFECTIVE_DATE, RATEPERCENTAGE
FROM TESTXY.ACCINFODET
WHERE
EFFECTIVE_DATE > '2021-01-01'
AND ACCOUNTNO NOT IN
(
00000005367890,
00000005378912,
00000007326741,
.
.
.
.
00000089237410,
)
ORDER BY ACCOUNTNO;
Exclude condition data ranges from 600 to 2K account number
can you please advice the best way to try
Many Thanks!

Load the account numbers to be excluded in to a separate, single column table, then use a sub select in your not in clause:
...
AND ACCOUNTNO NOT IN
( SELECT * FROM TESTXY.EXCLACCTNOS )
Warning: Not tested.

Deduplicate rows in complex schema in a bigquery partition

I have read some threads but I know too little sql to solve my problem.
I have a table with a complex schema with records and nested fields.
Below you see a query which finds the exact row that I need to deduplicate.
SELECT *
FROM my-data-project-214805.rfid_data.rfid_data_table
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
AND DetectorDataMessage.Header.MessageID ='478993053'
DetectorDataMessage.Header.MessageID is supposed to be unique.
How can I delete one of these rows? (there are two)
If possible I would like deduplicate the whole table but its partitioned and I can't get it right. I try the suggestions in below threads but I get this error Column DetectorDataMessage of type STRUCT cannot be used in...
Threads of interest:
Deduplicate rows in a BigQuery partition
Delete duplicate rows from a BigQuery table
Any suggestions? Can you guide me in the right direction?

Try using a MERGE to remove the existing duplicate rows, and a single identical one. In this case I'm going for a specific date and id, as in the question:
MERGE `temp.many_random` t
USING (
# choose a single row to replace the duplicates
SELECT a.*
FROM (
SELECT ANY_VALUE(a) a
FROM `temp.many_random` a
WHERE DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID
)
)
ON FALSE
WHEN NOT MATCHED BY SOURCE
# delete the duplicates
AND DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Based on this answer:
Deduplicate rows in a BigQuery partition

If all of the values in the duplicate rows are the same, just use 'SELECT distinct'.
If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.
I don't know what your columns are, but here's an example:
WITH subquery as
(select MessageId
ROW_NUMBER() OVER(partition by MessageID order by MessageId ASC) AS rank
)
select *
from subquery
where rank = 1

Handling duplicates in BigQuery (Nested Table)

I think this is a very simple question but I would like some guidance: I didn't want to have to drop a table to send a new table with the deduplicated records, like using DELETE FROM based on the query below using BigQuery, is it possible? PS: This is a nested table!
SELECT
*
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY id, date_register) row_number
FROM
dataset.table)
WHERE
row_number = 1
order by id, date_register

To de-duplicate in place, without re-creating the table - use MERGE:
MERGE `temp.many_random` t
USING (
SELECT DISTINCT *
FROM `temp.many_random`
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
It's simpler than the current accepted answer, as it won't ask you to match the current partitioning or clustering - it will just respect it.

Update: please also check Felipe Hoffa's answer which is simpler, and learn more on this post: BigQuery Deduplication.
You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:
CREATE OR REPLACE TABLE your_table AS
PARTITION BY DATE(date_register)
SELECT
* EXCEPT(row_number)
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY id, date_register) row_number
FROM your_table)
WHERE
row_number = 1
If you don´t have a partition field defined at the source, I recommend that you create a new table with the partition field to make this query work so that you can automate the process.

How to merge two table using order by?

While trying to merge two tables, when rows not matched how do I insert rows based on an order. For example in table_2 I have a column "Type" (sample values 1,2,3 etc), so when I do an insert for unmatched codes I need to insert records with type as 1 first, then 2 etc.
So far I tried below code
WITH tab1 AS
(
select * From TABLE_2 order by Type
)
merge tab1 as Source using TABLE_1 as Target on Target.Code=Source.Code
when matched then update set Target.Description=Source.Description
when not matched then insert (Code,Description,Type)
values (Source.Code,Source.Description,Source.Type);
But I get "The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified." error because of using order by in sub query.
So how do I insert records based on an order while merging two table?
Thanks in advance.

Change
select *
to
select top 100 percent
That will allow you to use ORDER BY in the first select

Select into with max()

I have a basic query I use to determine the max value of a column in a table:
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
This results in ~580 rows (the entire table has over 2400 rows).
This works just fine for my query results but what I don't know is how to insert the 580 rows into a new able based on the max value. I realize this isn't the right code but what I am thinking of would look something like this:
select * into new_table from rev_code_lookup where max(revenue_code_version)

You can use the row_number() function to get the data you want. Combine with the other answer to insert the results into a table (I've made up a couple of extra columns as an example):
Select
x.revenue_code_id,
x.revenue_code_version,
x.update_timestamp,
x.updated_by
From (
Select
revenue_code_id,
revenue_code_version,
update_timestamp,
updated_by,
row_number() over (partition by revenue_code_id Order By revenue_code_version Desc) as rn
From
revenue_code_lookup
) x
Where
x.rn = 1
Example Fiddle

The insert in another table is always the same way, no matter the complexity of your select:
insert into table
[unbeliavablycomplicatedselecthere]
So in your case:
insert into new_table
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
Similarly, if you need to create a brand new table, do this first:
CREATE TABLE new_table
AS
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
This will create the corresponding table schema and then you can execute the previous query to insert the data.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Hive : Identification of exact duplicate records - sql

You can use select distinct: Insert into table target Select distinct col1,col2..col200 from source ;

Related

Db2 SQL Select rows with excluding rows based on large number of data

Deduplicate rows in complex schema in a bigquery partition

Handling duplicates in BigQuery (Nested Table)

How to merge two table using order by?

Select into with max()

Categories

Resources