BigQuery delete duplicates keep last - google-bigquery

I have a BigQuery table with columns col1, col2, col3. I want to delete rows that have duplicated values in col1 but keep the last one (the row that was pushed to BigQuery last). col2 and col3 do not have to be duplicates.
My main problem is that I cannot identify the last row. I tried the query below, but the ordering was not right. Yet when I simply SELECT all the rows, they come back ordered from the oldest to the newest.
SELECT *, ROW_NUMBER() OVER (PARTITION BY col1) AS row_numb
FROM table
I saw other solutions, but they order by a timestamp/created_at column that I do not have. I know that one solution would be to add a timestamp column and then order by it to get the most recent row. But is there any other way?
Example:
col1 col2 col3
1 2 3
2 3 1
1 4 3
The last row (col1 = 1, col2 = 4, col3 = 3) was added last to BigQuery.
So what I want is to find duplicates in the first column (that would be rows 1 and 3) and delete all the duplicates except the one that was added last to BigQuery (that is row 3).
The result would be
col1 col2 col3
2 3 1
1 4 3
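The fallback mentioned in the question (add an ordering column, then keep the most recent row per col1) might look roughly like the sketch below. It is only an illustration under stated assumptions: Python's sqlite3 module stands in for BigQuery, the table name `t` and the insertion-order column `seq` are hypothetical, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# SQLite stands in for BigQuery; `seq` is a hypothetical insertion-order column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (seq INTEGER, col1 INTEGER, col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 1, 2, 3), (2, 2, 3, 1), (3, 1, 4, 3)],
)

# Rank rows inside each col1 group, newest first, then delete every rank above 1.
conn.execute("""
    DELETE FROM t
    WHERE seq IN (
        SELECT seq
        FROM (
            SELECT seq,
                   ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY seq DESC) AS rn
            FROM t
        ) AS ranked
        WHERE rn > 1
    )
""")

remaining = conn.execute("SELECT col1, col2, col3 FROM t ORDER BY seq").fetchall()
print(remaining)  # [(2, 3, 1), (1, 4, 3)]
```

Without an explicit ORDER BY inside the window, the row numbering within a partition is not guaranteed to follow insertion order, which is consistent with the behaviour described in the question.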

Related

delete duplicates in table using partition and where clause

Using SQL Server 2016
I have found the partition-over-rows method to be the fastest for deduplicating rows in large tables. I'm trying to use the same process to delete some duplicates, but now I have a unique situation.
Basically I need to delete rows that are duplicated on all columns except one. However, the rows would be allowed to be duplicated if the excluded column was also duplicated but not if it is different.
For example
col1 col2 col3 col4
1 2 3 4
1 2 3 4
1 2 3 5
The first 2 rows would be allowed to stay, but the 3rd row needs to be removed.
Normally I would use the code below to delete rows that are duplicated on certain criteria, but I don't know how to account for my current situation.
delete x from (select col1, ROW_NUMBER()
over (partition by col1 order by col1) As rn From table1) x
where rn > 1
Thanks for any help.
Just FYI the table contains 226 Million rows.
I think you want a count, not a row number:
with todelete as (
select t1.*,
count(*) over (partition by col1, col2, col3, col4) as cnt
from t1
)
delete from todelete
where cnt > 1;
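To see which rows a filter on cnt would select, the window count can be evaluated on the question's three sample rows. This is a minimal sketch that uses Python's sqlite3 as a stand-in for SQL Server (an assumption; it only illustrates the window function, not the DELETE syntax).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 INTEGER, col2 INTEGER, col3 INTEGER, col4 INTEGER)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?)",
    [(1, 2, 3, 4), (1, 2, 3, 4), (1, 2, 3, 5)],
)

# COUNT(*) OVER attaches the size of each (col1, col2, col3, col4) group to every row.
counts = conn.execute("""
    SELECT col1, col2, col3, col4,
           COUNT(*) OVER (PARTITION BY col1, col2, col3, col4) AS cnt
    FROM t1
    ORDER BY col4
""").fetchall()
print(counts)  # [(1, 2, 3, 4, 2), (1, 2, 3, 4, 2), (1, 2, 3, 5, 1)]
```

The two fully identical rows get cnt = 2 and the row that differs in col4 gets cnt = 1, so the comparison used in the final WHERE decides which of those two groups is deleted.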

how to select min value from table if table has two unique values with rest of columns are identical

Example input:
ID Col1 Col2 Col3
-- ---- ---- ----
1 a a sql
2 a a hive
Output:
ID Col1 Col2 Col3
-- ---- ---- ----
1 a a sql
Here my ID and Col3 values are unique, but I need to filter on the minimum ID and return all columns of that record.
I know the approach below will work, but please suggest a better one if there is any:
select Col1,Col2,min(ID) from table group by Col1,Col2;
and join this on ID,Col1,Col2
I think you want row_number():
select t.*
from (select t.*, row_number() over (partition by col1, col2 order by id) as seqnum
from t
) t
where seqnum = 1
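As a rough check of the row_number() approach on the sample input, here is a small sketch using Python's sqlite3 (an assumption standing in for Hive; window functions require SQLite 3.25+). The table name `t` follows the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "a", "a", "sql"), (2, "a", "a", "hive")],
)

# Number the rows within each (col1, col2) group by ascending id and keep the first.
rows = conn.execute("""
    SELECT id, col1, col2, col3
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY id) AS seqnum
        FROM t
    ) AS ranked
    WHERE seqnum = 1
""").fetchall()
print(rows)  # [(1, 'a', 'a', 'sql')]
```

This returns the whole minimum-ID row per group in a single pass, without the separate GROUP BY and join from the question.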
It appears that Hive supports ROW_NUMBER. Though I’ve never used Hive, other RDBMSs would use it like this to get the entire contents of the min row without needing a join (and it doesn’t suffer problems if there are repeated minimum values):
SELECT a.* FROM
(
SELECT *, ROW_NUMBER() OVER(ORDER BY id) rn FROM yourtable
) a
WHERE a.rn = 1
The inner query selects all the table data and establishes an incrementing counter in order of ID. It could be based on any column, the min ID (in this case) being row number 1. If you wanted the max, order by ID desc.
If you want the number to restart for different values of another column (e.g. if ten of your rows had Col3 “sql” and twenty rows had “hive”), you can say PARTITION BY col3 ORDER BY id, and the row number will be a counter that increments for identical values of col3, restarting from 1 for each distinct value of col3.
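The restarting counter described above can be illustrated with a small sketch. It assumes Python's sqlite3 in place of Hive and uses made-up sample rows; only the table name `yourtable` comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (id INTEGER, col3 TEXT)")
# Illustrative data only: two "sql" rows and three "hive" rows.
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(1, "sql"), (2, "hive"), (3, "sql"), (4, "hive"), (5, "hive")],
)

# The counter increments by id within each col3 value and restarts at 1 per value.
numbered = conn.execute("""
    SELECT id, col3,
           ROW_NUMBER() OVER (PARTITION BY col3 ORDER BY id) AS rn
    FROM yourtable
    ORDER BY id
""").fetchall()
print(numbered)
# [(1, 'sql', 1), (2, 'hive', 1), (3, 'sql', 2), (4, 'hive', 2), (5, 'hive', 3)]
```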

How to use group-by and get other rows results

Question: if this is my data:
col1,col2,col3,col4
===================
www.com,0,dangerous,reason A
www.com,1,dangerous 2,reason B
I want a single result where the column 2 value is the maximum, so I will use the Max(col2) function in my select. But how can I get the corresponding col3 and col4 values from that row?
select
col1, max(col2), col3, col4
group by
col1
and ???
Thanks
Idan
You can use order by and limit to one row. The ANSI-standard syntax is:
select t.*
from t
order by t.col2 desc
fetch first 1 row only;
Not all databases support the fetch first clause, so you might have to use select top 1, limit, or some other construct.
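For a database without fetch first, the same idea can be written with limit. The sketch below assumes Python's sqlite3 (SQLite uses LIMIT) and loads the two sample rows from the question into a table named `t`, as in the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER, col3 TEXT, col4 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("www.com", 0, "dangerous", "reason A"),
        ("www.com", 1, "dangerous 2", "reason B"),
    ],
)

# Sort by col2 descending and keep only the top row; all of its columns come along.
top_row = conn.execute("""
    SELECT t.*
    FROM t
    ORDER BY t.col2 DESC
    LIMIT 1
""").fetchone()
print(top_row)  # ('www.com', 1, 'dangerous 2', 'reason B')
```

Note that this returns one row for the whole table; getting one row per col1 value would need a per-group technique such as row_number().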
You can also filter in the WHERE clause. An aggregate such as max() cannot appear directly in WHERE, so compare against a subquery:
select * from table_name where col2 = (select max(col2) from table_name)
This returns the entire row that holds the maximum value.
If col2 contains repeated values such as 1, 1, 2, 2, the query above returns two rows. If you want a single row in that case, use:
select * from table_name where col2 = (select max(col2) from table_name) fetch first 1 row only
This might be helpful.

sqlite: select all columns where one field has max value over all columns

I have a table like this:
id int, col1 int, ...
Different rows can have col1 of same value.
Now I want to gather all rows where col1 has the maximum value.
e.g. this table values
1 4
2 3
3 4
The query should give me rows 1 and 3.
You can use subquery:
SELECT id, col1
FROM tab
WHERE col1 = (SELECT MAX(col1) FROM tab);
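The subquery can be tried on the sample values from the question with Python's built-in sqlite3 module. This is a minimal sketch; the ORDER BY is added only to make the printed order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id INTEGER, col1 INTEGER)")
conn.executemany("INSERT INTO tab VALUES (?, ?)", [(1, 4), (2, 3), (3, 4)])

# Keep every row whose col1 equals the table-wide maximum, ties included.
max_rows = conn.execute("""
    SELECT id, col1
    FROM tab
    WHERE col1 = (SELECT MAX(col1) FROM tab)
    ORDER BY id
""").fetchall()
print(max_rows)  # [(1, 4), (3, 4)]
```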

Returning a count of duplicate rows based on 3 matching column entries

In a table, I want to return duplicate rows based on three columns and a count of the duplicates found.
For example,
In a row, say I have an entry of 1 for column a, 1 for column b and 1 for column c.
I only want to return/count this row if other rows have the exact same entry for the 3 columns (1, 1 and 1).
Thanks in advance!
-N
-- You need to specify the columns you need, and the count
SELECT col1, col2, col3, COUNT(*)
FROM myTable
-- Then you have to group the tuples based on the columns you are doing the count on
GROUP BY col1, col2, col3
-- Here you specify the condition for COUNT(*)
HAVING COUNT(*) > 1;
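A quick way to see the GROUP BY / HAVING query in action is the sketch below. It assumes Python's sqlite3 and uses made-up sample rows; only the table and column names come from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (col1 INTEGER, col2 INTEGER, col3 INTEGER)")
# Illustrative data: (1,1,1) appears twice, (2,2,2) three times, (1,2,3) once.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (2, 2, 2), (2, 2, 2), (2, 2, 2)],
)

# Group on the three columns and keep only groups that occur more than once.
duplicates = conn.execute("""
    SELECT col1, col2, col3, COUNT(*)
    FROM myTable
    GROUP BY col1, col2, col3
    HAVING COUNT(*) > 1
    ORDER BY col1
""").fetchall()
print(duplicates)  # [(1, 1, 1, 2), (2, 2, 2, 3)]
```

The unique combination (1, 2, 3) is filtered out by the HAVING clause, and each remaining row carries the number of times that combination occurs.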