Handling duplicates in BigQuery (Nested Table) - google-bigquery

I think this is a very simple question, but I would like some guidance. I don't want to drop the table and load a new one with the deduplicated records. Is it possible to remove the duplicates in place in BigQuery, for example with DELETE FROM, based on the query below? PS: This is a nested table!
SELECT
*
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY id, date_register) row_number
FROM
dataset.table)
WHERE
row_number = 1
order by id, date_register

To de-duplicate in place, without re-creating the table - use MERGE:
MERGE `temp.many_random` t
USING (
SELECT DISTINCT *
FROM `temp.many_random`
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
It's simpler than the current accepted answer, as it won't ask you to match the current partitioning or clustering - it will just respect it.

Update: please also check Felipe Hoffa's answer which is simpler, and learn more on this post: BigQuery Deduplication.
You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:
CREATE OR REPLACE TABLE your_table
PARTITION BY DATE(date_register)
AS
SELECT
* EXCEPT(row_number)
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY id, date_register) row_number
FROM your_table)
WHERE
row_number = 1
If you don't have a partition field defined at the source, I recommend creating a new table with a partition field first, so that this query works and the process can be automated.
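One way to try the ROW_NUMBER filter outside BigQuery is Python's built-in sqlite3 module. This is only a sketch under assumptions: the table events and its payload column are made up, SQLite 3.25 or newer is assumed for window functions, and SQLite has neither CREATE OR REPLACE TABLE nor * EXCEPT, so the kept columns are listed by name and written to a new table. It also adds an ORDER BY inside the window, which the query above does not have, so that the surviving row is deterministic.

```python
import sqlite3


def dedupe_keep_first(rows):
    """Keep one row per (id, date_register): the one with the smallest payload."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER, date_register TEXT, payload TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # Number the rows inside each (id, date_register) group and keep number 1.
    conn.execute(
        """
        CREATE TABLE events_dedup AS
        SELECT id, date_register, payload
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY id, date_register
                       ORDER BY payload
                   ) AS rn
            FROM events
        )
        WHERE rn = 1
        """
    )
    result = conn.execute(
        "SELECT id, date_register, payload FROM events_dedup "
        "ORDER BY id, date_register"
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data: id 1 appears twice for the same date.
sample = [
    (1, "2020-01-01", "a"),
    (1, "2020-01-01", "b"),
    (2, "2020-01-02", "c"),
]
deduped = dedupe_keep_first(sample)
```

Without the ORDER BY in the window, either of the two rows for id 1 could be the one that survives.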

Related

Deduplicate rows in complex schema in a bigquery partition

I have read some threads, but I know too little SQL to solve my problem.
I have a table with a complex schema with records and nested fields.
Below you see a query which finds the exact row that I need to deduplicate.
SELECT *
FROM my-data-project-214805.rfid_data.rfid_data_table
WHERE DATE(_PARTITIONTIME) = "2020-02-07"
AND DetectorDataMessage.Header.MessageID ='478993053'
DetectorDataMessage.Header.MessageID is supposed to be unique.
How can I delete one of these rows? (there are two)
If possible, I would like to deduplicate the whole table, but it's partitioned and I can't get it right. I tried the suggestions in the threads below, but I get this error: Column DetectorDataMessage of type STRUCT cannot be used in...
Threads of interest:
Deduplicate rows in a BigQuery partition
Delete duplicate rows from a BigQuery table
Any suggestions? Can you guide me in the right direction?
Try using a MERGE to remove the existing duplicate rows and insert a single identical one. In this case I'm targeting a specific date and id, as in the question:
MERGE `temp.many_random` t
USING (
# choose a single row to replace the duplicates
SELECT a.*
FROM (
SELECT ANY_VALUE(a) a
FROM `temp.many_random` a
WHERE DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID
)
)
ON FALSE
WHEN NOT MATCHED BY SOURCE
# delete the duplicates
AND DATE(_PARTITIONTIME)='2018-10-01'
AND DetectorDataMessage.Header.MessageID ='478993053'
THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
Based on this answer:
Deduplicate rows in a BigQuery partition
If all of the values in the duplicate rows are the same, just use 'SELECT distinct'.
If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.
I don't know what your columns are, but here's an example:
WITH subquery AS
(SELECT MessageID,
ROW_NUMBER() OVER(PARTITION BY MessageID ORDER BY MessageID ASC) AS rank
-- your_table is a placeholder for your own table name
FROM your_table
)
SELECT *
FROM subquery
WHERE rank = 1
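The difference between the two suggestions can be checked with a small sketch using Python's built-in sqlite3 module. Everything here is assumed for illustration: the messages table, the received_at column, and the sample rows are invented, and SQLite 3.25 or newer is needed for ROW_NUMBER(). SELECT DISTINCT only collapses rows that are identical in every column, while ROW_NUMBER() keeps one row per key even when other columns differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (MessageID TEXT, received_at TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [
        ("100", "2020-02-07 10:00"),  # exact duplicate pair
        ("100", "2020-02-07 10:00"),
        ("200", "2020-02-07 11:00"),  # same MessageID, different timestamp
        ("200", "2020-02-07 11:05"),
    ],
)

# DISTINCT removes only rows that match in every column.
distinct_rows = conn.execute(
    "SELECT DISTINCT MessageID, received_at FROM messages "
    "ORDER BY MessageID, received_at"
).fetchall()

# ROW_NUMBER keeps exactly one row per MessageID (here the earliest one).
first_per_id = conn.execute(
    """
    SELECT MessageID, received_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY MessageID ORDER BY received_at
               ) AS rn
        FROM messages
    )
    WHERE rn = 1
    ORDER BY MessageID
    """
).fetchall()
conn.close()
```

With this data, distinct_rows still holds two rows for MessageID 200, while first_per_id holds one row per MessageID.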

SQL Eliminate Duplicates with NO ID

I have a table with the following Columns...
Node, Date_Time, Market, Price
I would like to delete all but 1 record for each Node, Date time.
SELECT Node, Date_Time, MAX(Price)
FROM Hourly_Data
Group BY Node, Date_Time
That gets the results I would like to see, but I can't figure out how to remove the other records.
Note - There is no ID for this table
Here is a workaround rather than a single command. It should work in any relational database:
1. Create a new table that looks just like the one you already have.
2. Insert the data computed by your GROUP BY query into the newly created table.
3. Drop the old table.
4. Rename the new table to the name the old one used to have.
Remember that locking takes place, so you need some maintenance time to perform this.
There are simpler ways to achieve this, but they are DBMS-specific.
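The four steps can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the column types, the sample rows, and the temporary name Hourly_Data_new are made up, and MAX(Market) is used just so the Market column gets a value, which means Market and Price may come from different source rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Hourly_Data (Node TEXT, Date_Time TEXT, Market TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO Hourly_Data VALUES (?, ?, ?, ?)",
    [
        ("N1", "2020-01-01 00:00", "DA", 10.0),
        ("N1", "2020-01-01 00:00", "DA", 12.5),  # duplicate Node/Date_Time
        ("N2", "2020-01-01 00:00", "DA", 7.0),
    ],
)

# 1. Create a new table that looks like the existing one.
conn.execute(
    "CREATE TABLE Hourly_Data_new "
    "(Node TEXT, Date_Time TEXT, Market TEXT, Price REAL)"
)
# 2. Insert one row per (Node, Date_Time) computed by the GROUP BY query.
conn.execute(
    """
    INSERT INTO Hourly_Data_new (Node, Date_Time, Market, Price)
    SELECT Node, Date_Time, MAX(Market), MAX(Price)
    FROM Hourly_Data
    GROUP BY Node, Date_Time
    """
)
# 3. Drop the old table.
conn.execute("DROP TABLE Hourly_Data")
# 4. Rename the new table to the old name.
conn.execute("ALTER TABLE Hourly_Data_new RENAME TO Hourly_Data")
conn.commit()

final_rows = conn.execute(
    "SELECT Node, Date_Time, Market, Price FROM Hourly_Data ORDER BY Node"
).fetchall()
conn.close()
```

After the rename, Hourly_Data holds a single row per Node and Date_Time with the highest Price.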
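Here is an easy SQL Server method that computes a row number within a CTE and deletes through it. Deleting through a CTE like this is SQL Server behavior; other RDBMSs that support window functions and common table expressions (PostgreSQL and MySQL, for example) generally need a different form of DELETE.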
;WITH cte AS (
SELECT
*
,RowNum = ROW_NUMBER() OVER (PARTITION BY Node, Date_Time ORDER BY Price DESC)
FROM
Hourly_Data
)
DELETE
FROM
cte
WHERE
RowNum > 1
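For an engine that cannot delete through a CTE, the same idea can be expressed against a row identifier. The sketch below uses SQLite through Python's sqlite3 module and relies on SQLite's implicit rowid column; the column types and sample rows are assumptions, and SQLite 3.25 or newer is assumed for ROW_NUMBER(). Other databases would need their own row identifier or a real key instead of rowid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Hourly_Data (Node TEXT, Date_Time TEXT, Market TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO Hourly_Data VALUES (?, ?, ?, ?)",
    [
        ("N1", "2020-01-01 00:00", "DA", 10.0),
        ("N1", "2020-01-01 00:00", "DA", 12.5),
        ("N1", "2020-01-01 00:00", "DA", 11.0),
        ("N2", "2020-01-01 00:00", "DA", 7.0),
    ],
)

# Rank rows per (Node, Date_Time) by Price, highest first, and delete every
# row whose rank is greater than 1, identified by SQLite's implicit rowid.
conn.execute(
    """
    DELETE FROM Hourly_Data
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY Node, Date_Time
                       ORDER BY Price DESC
                   ) AS rn
            FROM Hourly_Data
        )
        WHERE rn > 1
    )
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT Node, Date_Time, Price FROM Hourly_Data ORDER BY Node"
).fetchall()
conn.close()
```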

Return only the newest rows from a BigQuery table with duplicate items

I have a table with many duplicate items – Many rows with the same id, perhaps with the only difference being a requested_at column.
I'd like to do a select * from the table, but only return one row with the same id – the most recently requested.
I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.
How do I make sure I get the value for title, etc that corresponds to that most recently updated row?
I suggest a similar form that avoids a sort in the window function:
SELECT *
FROM (
SELECT
*,
MAX(<timestamp_column>)
OVER (PARTITION BY <id_column>)
AS max_timestamp,
FROM <table>
)
WHERE <timestamp_column> = max_timestamp
Try something like this:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (
PARTITION BY <id_column>
ORDER BY <timestamp_column> DESC)
row_number,
FROM <table>
)
WHERE row_number = 1
Note that this adds a row_number column, which you might not want. To drop it, select individual columns by name in the outer SELECT statement, or in BigQuery standard SQL use SELECT * EXCEPT(row_number).
In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.
And, you will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).
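The two forms above differ when several rows share the newest timestamp: the MAX() form returns all of them, while the ROW_NUMBER() form returns exactly one. The sketch below checks this with Python's built-in sqlite3 module. The requests table and its sample rows are invented for illustration, SQLite 3.25 or newer is assumed, and the extra title tie-breaker in the ORDER BY is an addition that makes the chosen row deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (id INTEGER, title TEXT, requested_at TEXT)")
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        (1, "old title", "2020-01-01"),
        (1, "new title", "2020-01-02"),
        (2, "tie A", "2020-01-03"),  # two rows share the newest timestamp
        (2, "tie B", "2020-01-03"),
    ],
)

# MAX() form: keeps every row whose timestamp equals the per-id maximum.
max_form = conn.execute(
    """
    SELECT id, title, requested_at
    FROM (
        SELECT *, MAX(requested_at) OVER (PARTITION BY id) AS max_timestamp
        FROM requests
    )
    WHERE requested_at = max_timestamp
    ORDER BY id, title
    """
).fetchall()

# ROW_NUMBER() form: keeps exactly one row per id.
row_number_form = conn.execute(
    """
    SELECT id, title, requested_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY id
                   ORDER BY requested_at DESC, title
               ) AS rn
        FROM requests
    )
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()
conn.close()
```

If ties on the timestamp are possible and you need one row per id, the ROW_NUMBER() form is the safer choice.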

How can we delete one value out of two values from a table?

Is it possible to do this? I have a table with two rows and 1 column. Both rows have the same value. There is no primary key. Can we delete 1 row?
Here's one way to do it with ROW_NUMBER() and a common table expression:
with cte as (
select *,
row_number() over (partition by id order by id) rn
from yourtable)
-- rn > 1 removes only the extra copies and keeps one row per id
delete from cte
where rn > 1;
SQL Fiddle Demo
You can do this using the RANK() function.
Or you can use the TOP keyword.
You can get fancy and use cte to delete one but if they are the same value (and the table is as simple as you're describing it), you can also delete both and add one back. Much simpler.
Surrogate Key anyone?
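The "delete both and add one back" suggestion can be sketched as follows with Python's built-in sqlite3 module. The table name yourtable and the id column follow the snippet above; the value 7 is made up. Wrapping both statements in one transaction is an addition here, so the table is never left empty if the INSERT fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER)")
conn.executemany("INSERT INTO yourtable VALUES (?)", [(7,), (7,)])
conn.commit()

# The connection used as a context manager commits both statements
# together, or rolls both back if either one raises.
with conn:
    conn.execute("DELETE FROM yourtable WHERE id = ?", (7,))
    conn.execute("INSERT INTO yourtable VALUES (?)", (7,))

rows_left = conn.execute("SELECT id FROM yourtable").fetchall()
conn.close()
```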

Select into with max()

I have a basic query I use to determine the max value of a column in a table:
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
This results in ~580 rows (the entire table has over 2400 rows).
This works just fine for my query results, but what I don't know is how to insert the 580 rows into a new table based on the max value. I realize this isn't the right code, but what I am thinking of would look something like this:
select * into new_table from rev_code_lookup where max(revenue_code_version)
You can use the row_number() function to get the data you want. Combine with the other answer to insert the results into a table (I've made up a couple of extra columns as an example):
Select
x.revenue_code_id,
x.revenue_code_version,
x.update_timestamp,
x.updated_by
From (
Select
revenue_code_id,
revenue_code_version,
update_timestamp,
updated_by,
row_number() over (partition by revenue_code_id Order By revenue_code_version Desc) as rn
From
rev_code_lookup
) x
Where
x.rn = 1
Example Fiddle
The insert in another table is always the same way, no matter the complexity of your select:
insert into table
[unbelievablycomplicatedselecthere]
So in your case:
insert into new_table
select A.revenue_code_id, max(A.revenue_code_version) from rev_code_lookup A
group by A.revenue_code_id
Alternatively, if new_table does not exist yet, you can create and fill it in one statement:
CREATE TABLE new_table
AS
select A.revenue_code_id, max(A.revenue_code_version) as revenue_code_version from rev_code_lookup A
group by A.revenue_code_id
In DBMSs that support CREATE TABLE ... AS SELECT, this creates the corresponding table schema and copies the rows in one step, so you do not need to run the previous INSERT afterwards.
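A quick way to see what CREATE TABLE ... AS SELECT does is to run it in SQLite through Python's sqlite3 module. This is a sketch with made-up sample rows and assumed INTEGER column types; in SQLite, at least, the statement fills the new table as well as creating it, so running the INSERT afterwards would store every row twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rev_code_lookup "
    "(revenue_code_id INTEGER, revenue_code_version INTEGER)"
)
conn.executemany(
    "INSERT INTO rev_code_lookup VALUES (?, ?)",
    [(1, 1), (1, 2), (2, 1), (2, 3), (2, 2)],
)

# Create new_table and populate it with the highest version per id.
conn.execute(
    """
    CREATE TABLE new_table AS
    SELECT revenue_code_id,
           MAX(revenue_code_version) AS revenue_code_version
    FROM rev_code_lookup
    GROUP BY revenue_code_id
    """
)
conn.commit()

new_rows = conn.execute(
    "SELECT revenue_code_id, revenue_code_version FROM new_table "
    "ORDER BY revenue_code_id"
).fetchall()
conn.close()
```

The alias on the MAX() expression gives the new column a usable name; some databases require it.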