Delete duplicate equal rows in BigQuery - google-bigquery

There is a table with duplicate rows, where all column values are equal:
+----+-------+------------+
| id | value | timestamp  |
+----+-------+------------+
|  1 |   500 | 2019-10-12 |
|  2 |   400 | 2019-10-11 |
|  1 |   500 | 2019-10-12 |
+----+-------+------------+
I want to keep one of those equal rows and delete the others. I came up with:
DELETE FROM `table` t1
WHERE (
  SELECT ROW_NUMBER() OVER (PARTITION BY id)
  FROM `table` t2
  WHERE t1.id = t2.id
) > 1
However this does not work:
Correlated subqueries that reference other tables are not supported
unless they can be de-correlated, such as by transforming them into an
efficient JOIN.
Any ideas how to remove duplicate rows?

Below is for BigQuery Standard SQL
... where all column values are equal
So you can use a simple SELECT DISTINCT * and, instead of DELETE, use CREATE OR REPLACE TABLE to write the result back to the same table:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY date AS
SELECT DISTINCT *
FROM `project.dataset.table`
In the PARTITION BY clause you should list the field(s) used to partition the original table.
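
If the rows were duplicates only on a subset of columns (say, on id alone) rather than on every column, the ROW_NUMBER() idea from the question still works once it is de-correlated. A sketch, assuming an arbitrary one of the rows sharing an id is kept:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY date AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id) AS rn
  FROM `project.dataset.table`
)
WHERE rn = 1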

Related

ORACLE SELECT DISTINCT VALUE ONLY IN SOME COLUMNS

+----+-------+-------+------+---------+
| id | order | value | type | account |
+----+-------+-------+------+---------+
|  1 |     1 | a     |    2 |       1 |
|  1 |     2 | b     |    1 |       1 |
|  1 |     3 | c     |    4 |       1 |
|  1 |     4 | d     |    2 |       1 |
|  1 |     5 | e     |    1 |       1 |
|  1 |     5 | f     |    6 |       1 |
|  2 |     6 | g     |    1 |       1 |
+----+-------+-------+------+---------+
I need to select all fields of this table but get only one row for each combination of id + type (I don't care which of the rows is picked). I have tried several approaches without result.
The moment I use DISTINCT I can't include the rest of the fields to make them available in a subquery, and if I add ROWNUM in the subquery, all rows become distinct, so that doesn't work either.
Some ideas?
My best query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT MAX(ROWID)
                FROM MYTABLE
                GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
  select id, type, value, account,
         row_number() over (partition by id, type order by null) as rn
  from your_table
)
where rn = 1;
order by null means random ordering of rows within each group (partition) by (id, type); this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries (for the filter rn = 1).
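If instead you want a deterministic row per group, say the one with the smallest "order", just replace the null ordering. A sketch ("order" is quoted because it is a reserved word):
select id, type, value, account
from (
  select id, type, value, account,
         row_number() over (partition by id, type order by "order") as rn
  from your_table
)
where rn = 1;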
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from my_table
match_recognize (
  partition by id, type
  all rows per match
  pattern (^r)
  define r as null is null
);
This partitions the rows by id and type, it doesn't order them (which means random ordering), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is "random order"). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.
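If you did want a specific row per group with match_recognize, you would add an order by clause. A sketch that keeps the row with the largest "order" per (id, type):
select id, type, value, account
from my_table
match_recognize (
  partition by id, type
  order by "order" desc
  all rows per match
  pattern (^r)
  define r as null is null
);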

How to delete duplicates but keep one when all tuples are identical in duplicates and original? In PostgreSQL

Suppose we have the table below.
How do I delete the two duplicates and keep one? My code deletes all of them.
+----+------+
| ID | NAME |
+----+------+
|  2 | ARK  |
|  3 | CAR  |
|  9 | PAR  |
|  9 | PAR  |
|  9 | PAR  |
+----+------+
Ideally, your table should have a unique ID. If not, you can use ctid as a dummy unique ID field, as in the query below.
ctid represents the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. But it does the job here.
delete from my_table a using my_table b where a=b and a.ctid < b.ctid;
DB fiddle link - https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4888d519e125dc095496a57477a60b9f
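One caveat: a = b compares the rows as a whole, and a comparison involving NULLs is not true, so duplicate rows containing NULLs would survive. If NULLs are possible, compare column by column instead. A sketch for the two-column example:
delete from my_table a
using my_table b
where a.id is not distinct from b.id
  and a.name is not distinct from b.name
  and a.ctid < b.ctid;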
You could also do it by deleting on row_number; since the duplicates are identical in every column, ctid is again what tells them apart (a corrected version of the idea, assuming id and name are the only columns):
delete from my_table
where ctid in (
  select ctid from (
    select ctid,
           row_number() over (partition by id, name) as rn
    from my_table) t
  where rn > 1);
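For a large table it can also be cheaper to rewrite it in one pass, much like the BigQuery answer above. A sketch, assuming the table can be locked briefly and has no dependent views or foreign keys:
BEGIN;
CREATE TABLE my_table_dedup AS SELECT DISTINCT * FROM my_table;
DROP TABLE my_table;
ALTER TABLE my_table_dedup RENAME TO my_table;
COMMIT;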

Find difference between two consecutive rows from a result in SQL Server 2008

I want to fetch the difference in the "Data" column between two consecutive rows. For example, I need Row2 - Row1 (1902.4 - 1899.66), Row3 - Row2, and so on. The difference should be returned in a new column.
+----------+---------+-------+-----------------------+
| Name     | Data    | meter | Time                  |
+----------+---------+-------+-----------------------+
| Boiler-1 | 1899.66 |     1 | 5/16/2019 12:00:00 AM |
| Boiler-1 | 1902.4  |     1 | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1908.1  |     1 | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1911.7  |     6 | 5/16/2019 12:15:00 AM |
| Boiler-1 | 1926.4  |     6 | 5/16/2019 12:15:00 AM |
+----------+---------+-------+-----------------------+
The thing is, the table structure shown above is actually the result of a SELECT query over two different tables, something like "select name, data, unitId, Timestamp from table t1 join table t2 ...". So is there any way for me to calculate the difference in the "Data" column between consecutive rows, without first storing this result in a table?
I use SQL Server 2008, so the LEAD/LAG functions cannot be used.
The equivalent in SQL Server 2008 uses apply -- and it can be expensive:
with t as (
  <your query here>
)
select t.*,
       (t.data - tprev.data) as diff
from t outer apply
     (select top (1) tprev.*
      from t tprev
      where tprev.name = t.name and
            tprev.meter = t.meter and
            tprev.time < t.time
      order by tprev.time desc
     ) tprev;
This assumes that you want the previous row when the name and meter are the same. You can adjust the correlation clause if you have different groupings in mind.
Not claiming that this is best; it is just another option for SQL Server < 2012. From SQL Server 2012 onward it is easy to do the same using the LEAD and LAG functions. Anyway, for small and medium data sets you can consider the script below as well :)
Note: this is just an idea for you.
WITH CTE(Name,Data)
AS
(
    SELECT 'Boiler-1', 1899.66 UNION ALL
    SELECT 'Boiler-1', 1902.4  UNION ALL
    SELECT 'Boiler-1', 1908.1  UNION ALL
    SELECT 'Boiler-1', 1911.7  UNION ALL
    SELECT 'Boiler-1', 1926.4
    --Replace the SELECTs above with your query
)
SELECT A.Name, A.Data, A.Data - ISNULL(B.Data, 0) AS [Diff]
FROM
(
    --ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) numbers the rows without
    --imposing any particular order; if you have a real ordering column
    --(e.g. Time), use it in the OVER clause instead.
    SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN FROM CTE
) A
LEFT JOIN
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN FROM CTE
) B
    --Join each row to its previous row (B.RN = A.RN - 1),
    --so Diff = current Data - previous Data.
    ON A.RN = B.RN + 1
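
For reference, from SQL Server 2012 onward the whole thing collapses into a single LAG call. A sketch using the question's columns:
with t as (
  <your query here>
)
select t.*,
       data - lag(data) over (partition by name, meter order by time) as diff
from t;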

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have been performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
  id int,
  my_field text,
  id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database increases the global version number; changes always add new rows to the tables (instead of updating or deleting values), recording the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
+-------+----+----------+--------------+
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
|     3 |  1 | b        |            2 |
|     3 |  2 | c        |            4 |
|     3 |  3 | x        |            5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table in 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.

SQLite - select the newest row with a certain field value

I have an SQLite question which essentially boils down to the following problem.
id | key | data
1 | A | x
2 | A | x
3 | B | x
4 | B | x
5 | A | x
6 | A | x
New data is appended to the end of the table with an auto-incremented id.
Now, I want to create a query which returns the latest row for each key, like this:
id | key | data
4 | B | x
6 | A | x
I've tried some different queries but I have been unsuccessful. How do you select only the latest rows for each "key" value in the table?
Use this SQL query:
select * from tbl where id in (select max(id) from tbl group by key);
You could split the main task into two steps: first retrieve the max id for each key, then use those ids to fetch the latest row for each key. Because you then have the id of the latest row for every key, writing the final query is easy:
SELECT *
FROM mytable
JOIN
( SELECT MAX(id) AS maxid
FROM mytable
GROUP BY "key"
) AS grp
ON grp.maxid = mytable.id
Side note: it's best not to use reserved words like key as identifiers (for tables, fields, etc.).
Without nested SELECTs or JOINs, you can lean on a documented SQLite quirk: when an aggregate query uses a single MAX(), the other bare columns take their values from the row where that maximum occurs. This works here because the field determining "newest" is an autoincremented primary key:
SELECT *, MAX(id) FROM tbl GROUP BY "key";
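Since SQLite 3.25, window functions are available too, so the row_number pattern shown in the Oracle answer above carries over directly. A sketch:
SELECT id, "key", data
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY "key" ORDER BY id DESC) AS rn
  FROM tbl
)
WHERE rn = 1;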