In BigQuery, keep one row and remove other duplicate data

I have a BigQuery table that contains duplicate rows with identical details. I want to keep one row and delete all the other duplicate rows.
Example:
Empno,empname,salary
1,Sam,1000
1,Sam,1000
1,Sam,1000
1,Sam,1000
My expectations:

If the table is small, we can use SELECT DISTINCT. But if the table is larger than 500 GB, that solution is not recommended, because the full table scan will be billed.

Related

Removing rows with duplicated column values based on another column's value

Hey guys, maybe this is a basic SQL question. Say I have this very simple table, and I need to run a simple SQL statement to return a result like this:
Basically, the goal is to dedup Name based on each row's Value column: whichever row has the larger Value should stay.
Thanks!

Framing the problem correctly would help you figure it out.
"Deduplication" suggests altering the table: starting with a state with duplicates and ending with a state without them. That is usually done in three steps (copying the rows without duplicates into a temp table, removing the original table, renaming the temp table).
"Removing rows with duplicated column values" also suggests alteration of data and derails the train of thought.
What you do want is to get the entire table and, in cases where the columns you care about have multiple values attached, get the highest one. One could say... group by the columns you care about? And attach them to the highest value, a maximum value?
select id,name,max(value) from table group by id,name
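As a rough illustration of the GROUP BY / MAX idea, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. The table name `items` and the sample rows are assumptions made up for the demo; the original question's table was not preserved.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT, value INTEGER)")

# Hypothetical sample data: ids 1 and 2 each appear twice with different values.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "A", 10), (1, "A", 25), (2, "B", 7), (2, "B", 3), (3, "C", 5)],
)

# One row per (id, name), carrying the largest value seen for that pair.
rows = conn.execute(
    "SELECT id, name, MAX(value) FROM items GROUP BY id, name ORDER BY id"
).fetchall()

print(rows)  # [(1, 'A', 25), (2, 'B', 7), (3, 'C', 5)]
```

Note that this returns a deduplicated result set and leaves the underlying table unchanged, which matches the framing in the answer.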

SQL Exclude a specific column from SQL query result

I have a query where I process columns from two tables, and at the end I want ALL columns from one temporary table and ONLY ONE column from the other table. I also do not want the KEY column to appear twice after the join.
I cannot find a clean, efficient way to do it. I found these solutions:
Specify all columns explicitly. Bad for obvious reasons if you have to type many columns.
Get all columns and then DROP the ones you don't need. Not efficient, because you carry loads of data and then throw it away.
Is there a one-liner SQL command that leaves out a single column?
Is there an SQL command that removes the duplicate KEY column after joining?
Thanks!!

How about selecting all columns from one table and one from the other?
select t1.*, t2.col
from t1 join
t2
on . . .
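To see what columns such a query produces, here is a small sketch using SQLite from Python. The column names (`id`, `a`, `b`, `unused`) and the join condition on `id` are assumptions for the demo, since the answer leaves the ON clause elided.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER, a TEXT, b TEXT);
CREATE TABLE t2 (id INTEGER, col TEXT, unused TEXT);
INSERT INTO t1 VALUES (1, 'a1', 'b1'), (2, 'a2', 'b2');
INSERT INTO t2 VALUES (1, 'c1', 'x'), (2, 'c2', 'y');
""")

# All columns from t1 plus a single column from t2.
# The join key (id) is a hypothetical choice for this example.
cur = conn.execute(
    "SELECT t1.*, t2.col FROM t1 JOIN t2 ON t1.id = t2.id ORDER BY t1.id"
)
columns = [d[0] for d in cur.description]
result = cur.fetchall()

print(columns)  # ['id', 'a', 'b', 'col']
print(result)   # [(1, 'a1', 'b1', 'c1'), (2, 'a2', 'b2', 'c2')]
```

Because only `t1.*` is expanded, the key column appears once and the other t2 columns never enter the result.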

SQL to identify duplicate columns from table having hundreds of column

I have 250+ columns in a customer table. As per my process, there should be only one row per customer; however, I have found a few customers who have more than one entry in the table.
After running DISTINCT on the entire table for that customer, it still returns two rows for me. I suspect one of the columns may be suffixed with a space or junk from the source tables, resulting in two rows with the same information.
select distinct * from ( select * from customer_table where custoemr = '123' ) a;
The above query returns two rows. If you look at the results with the naked eye, there is no difference in any of the columns.
I can identify which column is causing the duplicates if I run the query once per column with DISTINCT, but that would be a very manual task for 250+ columns.
This sounds like a very dumb question, but I am kind of stuck here. Please suggest a better way to identify this if you have one, thank you.

Solving this one-time issue with SQL is too much effort. Simply copy-paste the two rows into Excel, transpose the data into columns, and use a simple function like "if a==b then 1 else 0".
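The same column-by-column comparison can also be scripted instead of done in a spreadsheet. The following is only a sketch: the helper `differing_columns` and the sample rows are hypothetical, and it assumes both rows have already been fetched as dictionaries with the same column names.

```python
def differing_columns(row_a, row_b):
    """Return the names of columns whose values differ between two rows.

    Assumes both rows are dicts that share the same set of column names.
    """
    return [col for col in row_a if row_a[col] != row_b.get(col)]


# Hypothetical "duplicate" rows: the second name has a trailing space.
row_a = {"customer": "123", "name": "Sam", "city": "Pune"}
row_b = {"customer": "123", "name": "Sam ", "city": "Pune"}

diff = differing_columns(row_a, row_b)
for col in diff:
    # repr() makes invisible differences such as trailing spaces visible.
    print(col, repr(row_a[col]), repr(row_b[col]))
```

Printing with `repr` is the useful part here, since trailing whitespace is exactly the kind of difference that cannot be seen with the naked eye.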

H2 incrementally update counts from another table?

With the H2 database, suppose there is a SUMS table that has a key and several count fields and there is an UPDATES table which has the same key and count fields. The keys in the UPDATES table may or may not exist in the SUMS table.
What is the most efficient way to add all the counts for each key from the UPDATES table to the SUMS table, or insert a row with those counts if the SUMS table does not yet have it?
Of course I could always process the result set of a select on the UPDATES table and then one-by-one update or insert into the SUMS table, but this feels like there should be a more efficient way to do it.
If it is not possible in H2 but possible in some other Java-embeddable solution I would be interested in this too, because this processing is just an intermediate step for processing a larger number of these counts (a couple of dozen million keys and a couple of billion rows for updating them).

Eliminating Duplicate Records in a DB2 Table

How do I delete duplicate records in a DB2 table? I want to be left with a single record for each group of dupes.

1. Create another table "no_dups" that has exactly the same columns as the table you want to eliminate the duplicates from. (You may want to add an identity column, just to make it easier to identify individual rows).
2. Insert into "no_dups", select distinct column1, column2...columnN from the original table. The "select distinct" should only bring back one row for every duplicate in the original table. If it doesn't you may have to alter the list of columns or have a closer look at your data, it may look like duplicate data but actually is not.
3. When step 2 is done, you will have your original table, and "no_dups" will have all the rows without duplicates. At this point you can do any number of things - drop and rename tables, or delete all from the original and insert into the original, select * from no_dups.
If you're running into problems identifying duplicates, and you've added an identity column to "no_dups," you should be able to delete rows one by one using the identity column value.
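The copy-distinct-and-swap steps above can be sketched end to end as follows. This uses SQLite through Python rather than DB2, so the rename syntax in particular is SQLite's and may differ on DB2; the `employees` table and its rows are assumptions borrowed from the example data at the top of this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical original table containing duplicate rows.
CREATE TABLE employees (empno INTEGER, empname TEXT, salary INTEGER);
INSERT INTO employees VALUES
    (1, 'Sam', 1000), (1, 'Sam', 1000), (1, 'Sam', 1000), (2, 'Ann', 2000);

-- Create a table with exactly the same columns.
CREATE TABLE no_dups (empno INTEGER, empname TEXT, salary INTEGER);

-- Copy one row per distinct combination of column values.
INSERT INTO no_dups
SELECT DISTINCT empno, empname, salary FROM employees;

-- Swap the deduplicated table in place of the original.
DROP TABLE employees;
ALTER TABLE no_dups RENAME TO employees;
""")

remaining = conn.execute("SELECT * FROM employees ORDER BY empno").fetchall()
print(remaining)  # [(1, 'Sam', 1000), (2, 'Ann', 2000)]
```

Dropping and renaming is only one of the options the answer lists; deleting from the original and re-inserting from "no_dups" would keep the original table object (and anything attached to it) in place.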
