Removing true duplicates from greenplum table


I am trying to remove true duplicates from a table. I have removed dupes many times in the past, but I can't figure out what's wrong with my syntax this time.
My code -
DELETE
FROM my_table_name
WHERE (
column1, column2, column3, column4, column5, column6, column7, column8, column9) IN
(
SELECT Row_number() OVER( partition BY column1, column2,column3, column4,column5,column6,column7,column8 ORDER BY column2 DESC, column3 ASC ) AS row_num,
column1,
column2,
column3,
column4,
column5,
column6,
column7,
column8,
column9
FROM my_table_name
WHERE column1='some_value') a
WHERE row_num=2;
Error
********** Error **********
ERROR: syntax error at or near "a"
SQL state: 42601
Character: 1607
I can see that the error occurs where the subquery is given the alias a, but I can't pinpoint what's wrong.
Any help is appreciated.
Edit 1 -
If I remove a, I get the below error
********** Error **********
ERROR: syntax error at or near "where"
SQL state: 42601
Character: 1608

If you have duplicate rows, you can't just delete all but one of the records in a single command. You either have to delete all duplicates and then insert one version of each duplicated row back, or build a new table without the duplicates (preferred).
Let's start with the preferred method, which is to create a new table without the duplicates. This approach uses disk space efficiently and avoids leaving a fragmented table behind.
Example:
create table foo
(id int, fname text)
with (appendonly=true)
distributed by (id);
Insert some data with duplicates:
insert into foo values (1, 'jon');
insert into foo values (1, 'jon');
insert into foo values (2, 'bill');
insert into foo values (2, 'bill');
insert into foo values (3, 'sue');
insert into foo values (4, 'ted');
insert into foo values (4, 'ted');
insert into foo values (4, 'ted');
insert into foo values (4, 'ted');
Create a new version of the table without the duplicates:
create table foo_new with (appendonly=true) as
select id, fname
from (
select row_number() over (partition by id) as row_num, id, fname
from foo
) as sub
where sub.row_num = 1
distributed by (id);
And now rename the tables:
alter table foo rename to foo_old;
alter table foo_new rename to foo;
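The same create-and-rename idea can be tried out locally. The following is only a minimal sketch, assuming Python's built-in sqlite3 module and SQLite 3.25 or later (for window functions) instead of Greenplum, so the Greenplum-specific clauses (appendonly, distributed by) are left out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same sample data as above, duplicates included.
cur.execute("CREATE TABLE foo (id INTEGER, fname TEXT)")
cur.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "jon"), (1, "jon"), (2, "bill"), (2, "bill"), (3, "sue"),
     (4, "ted"), (4, "ted"), (4, "ted"), (4, "ted")],
)

# Number the rows inside each id group and keep only the first one.
cur.execute("""
    CREATE TABLE foo_new AS
    SELECT id, fname
    FROM (
        SELECT row_number() OVER (PARTITION BY id) AS row_num, id, fname
        FROM foo
    ) AS sub
    WHERE sub.row_num = 1
""")

# Swap the tables by renaming them.
cur.execute("ALTER TABLE foo RENAME TO foo_old")
cur.execute("ALTER TABLE foo_new RENAME TO foo")
conn.commit()

deduped = cur.execute("SELECT id, fname FROM foo ORDER BY id").fetchall()
print(deduped)
```

The old table stays around as foo_old, so it can be checked against the new one before it is dropped.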
The second method is to use DELETE, but you'll see that it needs more steps to complete.
First, create a temp table with the IDs you want to delete. Greenplum typically doesn't enforce primary keys, but your data still has a logical PK: columns like customer_id, product_id, etc. So, find the duplicates first based on that PK.
drop table if exists foo_pk_delete;
create temporary table foo_pk_delete with (appendonly=true) as
select id
from foo
group by id
having count(*) > 1
distributed by (id);
Next, get the entire row for each duplicate but only one version of it.
drop table if exists foo_dedup;
create temporary table foo_dedup with (appendonly=true) as
select id, fname
from (
select row_number() over (partition by f.id) as row_num, f.id, f.fname
from foo f
join foo_pk_delete fd on f.id = fd.id
) as sub
where sub.row_num = 1
distributed by (id);
Now you can delete the duplicates:
delete
from foo f
using foo_pk_delete fk
where f.id = fk.id;
And then you can insert the deduplicated data back into the table.
insert into foo (id, fname)
select id, fname from foo_dedup;
You'll want to vacuum your table after this data manipulation.
vacuum foo;
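The delete-and-reinsert steps above can be sketched end to end as follows. This is an illustration under assumptions, not Greenplum code: it uses Python's sqlite3 module (SQLite 3.25+), and because SQLite has no DELETE ... USING, the delete is written with an IN subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE foo (id INTEGER, fname TEXT)")
cur.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "jon"), (1, "jon"), (2, "bill"), (2, "bill"), (3, "sue"),
     (4, "ted"), (4, "ted"), (4, "ted"), (4, "ted")],
)

# Step 1: collect the ids that occur more than once.
cur.execute("""
    CREATE TEMP TABLE foo_pk_delete AS
    SELECT id FROM foo GROUP BY id HAVING count(*) > 1
""")

# Step 2: keep exactly one full row for each duplicated id.
cur.execute("""
    CREATE TEMP TABLE foo_dedup AS
    SELECT id, fname
    FROM (
        SELECT row_number() OVER (PARTITION BY f.id) AS row_num, f.id, f.fname
        FROM foo f
        JOIN foo_pk_delete fd ON f.id = fd.id
    ) AS sub
    WHERE sub.row_num = 1
""")

# Step 3: delete every copy of the duplicated rows
# (IN subquery here; the Greenplum version above uses DELETE ... USING).
cur.execute("DELETE FROM foo WHERE id IN (SELECT id FROM foo_pk_delete)")

# Step 4: put one copy of each row back.
cur.execute("INSERT INTO foo (id, fname) SELECT id, fname FROM foo_dedup")
conn.commit()

rows = cur.execute("SELECT id, fname FROM foo ORDER BY id").fetchall()
print(rows)
```

Rows that were never duplicated (id 3 here) are not touched by any of the steps.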

Related

A sql query to create multiple rows in different tables using inserted id

I need to insert a row into one table and use this row's id to insert two more rows into a different table within one transaction. I've tried this
begin;
insert into table default values returning table.id as C;
insert into table1(table1_id, column1) values (C, 1);
insert into table1(table1_id, column1) values (C, 2);
commit;
But it doesn't work. How can I fix it?
updated

You need a CTE, and you don't need a begin/commit to do it in one transaction:
WITH inserted AS (
INSERT INTO ... RETURNING id
)
INSERT INTO other_table (id)
SELECT id
FROM inserted;
Edit:
To insert two rows into a single table using that id, you can do it in two ways:
two separate INSERT statements, one in the CTE and one in the "main" part
a single INSERT which joins on a list of values; a row will be inserted for each of those values.
With these tables as the setup:
CREATE TEMP TABLE t1 (id INTEGER);
CREATE TEMP TABLE t2 (id INTEGER, t TEXT);
Method 1:
WITH inserted1 AS (
INSERT INTO t1
SELECT 9
RETURNING id
), inserted2 AS (
INSERT INTO t2
SELECT id, 'some val'
FROM inserted1
RETURNING id
)
INSERT INTO t2
SELECT id, 'other val'
FROM inserted1;
Method 2:
WITH inserted AS (
INSERT INTO t1
SELECT 4
RETURNING id
)
INSERT INTO t2
SELECT id, v
FROM inserted
CROSS JOIN (
VALUES
('val1'),
('val2')
) vals(v);
If you run either, then check t2, you'll see it will contain the expected values.
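The CTE approach relies on data-modifying CTEs, which not every engine offers. As a rough alternative sketch, assuming Python's sqlite3 module (SQLite has no data-modifying CTEs), the generated id can be read from cursor.lastrowid and reused inside a single transaction. Declaring t1.id as INTEGER PRIMARY KEY so that SQLite assigns it automatically is an assumption of this example, not part of the setup above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: id is an auto-assigned INTEGER PRIMARY KEY.
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE t2 (id INTEGER, t TEXT)")

# "with conn" commits on success and rolls back on an exception,
# so all three inserts succeed or fail together.
with conn:
    cur = conn.execute("INSERT INTO t1 DEFAULT VALUES")
    new_id = cur.lastrowid  # id generated by the insert above
    conn.executemany(
        "INSERT INTO t2 (id, t) VALUES (?, ?)",
        [(new_id, "val1"), (new_id, "val2")],
    )

rows = conn.execute("SELECT id, t FROM t2 ORDER BY t").fetchall()
print(rows)
```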

Please find the query below (note that SCOPE_IDENTITY() is SQL Server syntax):
insert into table1(columnName)values('stack2');
insert into table_2 values(SCOPE_IDENTITY(),'val1','val2');

SQL Query - Ensuring my code is correct/Assistance with multiple queries in one Query (Subquerying)

I have a query today regarding SQL.
Basically here is what I am trying to do (this will also be useful for a couple other tables I have in this DB)
Table 1 = Members
Table 2 = Payments
Essentially, I'm trying to insert a record into one table and have the query also copy over the MemberID field if one exists for that individual.
INSERT INTO Payments (FirstName, LastName, PaymentMade)
VALUES ('', '', ''); AND UPDATE Payments
SET Payments.MemberID = Members.MemberID
FROM Members INNER JOIN Members ON Payments.MemberID = Members.MemberID;
Question is: Have I performed this correctly or have I missed a critical step here?
Many thanks! :)

I'm guessing you want to insert data from Table1 into Table2.
INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
Use a WHERE condition to filter out unwanted records.
INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
WHERE condition;
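For the Members/Payments case in the question, INSERT ... SELECT can also fetch the MemberID while the payment row is written. The sketch below is only a guess at the intended schema: it assumes Python's sqlite3 module, the column types shown, and that a member is matched by first and last name (none of which the question confirms). The add_payment helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Members (MemberID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
    CREATE TABLE Payments (FirstName TEXT, LastName TEXT, PaymentMade REAL, MemberID INTEGER);
    INSERT INTO Members VALUES (7, 'Ada', 'Lovelace');
""")

def add_payment(first, last, amount):
    """Insert a payment; MemberID is looked up by name and stays NULL if absent."""
    with conn:
        conn.execute(
            """
            INSERT INTO Payments (FirstName, LastName, PaymentMade, MemberID)
            SELECT :first, :last, :amount,
                   (SELECT MemberID FROM Members
                    WHERE FirstName = :first AND LastName = :last)
            """,
            {"first": first, "last": last, "amount": amount},
        )

add_payment("Ada", "Lovelace", 25.0)  # existing member
add_payment("Alan", "Turing", 10.0)   # not a member

payments = conn.execute(
    "SELECT FirstName, MemberID FROM Payments ORDER BY FirstName"
).fetchall()
print(payments)
```

Matching on names is fragile when two members share a name; a unique key such as an email address would be a safer lookup if the data has one.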

How to insert multiple insert sql statement

There is a table Person(id, name). I am inserting more than 1000 records into person table. Both id and name should be unique. I wrote something like this
INSERT ALL
INTO PERSON (1, 'MAYUR')
INTO PERSON (2, 'SALUNKE')
.....(1000 records)
SELECT * FROM DUAL;
I am getting a unique constraint violation on name with this query. How do I find out which record in particular is failing? All I see in the logs is this:
Error starting at line : 3 in command - ORA-00001: unique constraint
(UN_PERSON_NAME) violated.
This does not tell me exactly which record is the duplicate.

You are missing the VALUES keyword. Try this!
INSERT ALL
INTO PERSON values(1, 'MAYUR')
INTO PERSON values(2, 'SALUNKE')
.....(1000 records)
SELECT * FROM DUAL;

INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1

Unfortunately, Oracle doesn't support multiple inserts using a single VALUES() statement. I usually approach this as:
INSERT INTO PERSON (id, name)
SELECT 1, 'MAYUR' FROM DUAL UNION ALL
SELECT 2, 'SALUNKE' FROM DUAL UNION ALL
.....;
One advantage of this approach is you can use a subquery and assign the id:
INSERT INTO PERSON (id, name)
SELECT rownum, x.name
FROM (SELECT 'MAYUR' AS name FROM DUAL UNION ALL
SELECT 'SALUNKE' FROM DUAL UNION ALL
.....
) x
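The UNION ALL pattern with a generated id can be tried outside Oracle as well. This is a minimal sketch assuming Python's sqlite3 module (SQLite 3.25+): SQLite has neither DUAL nor rownum, so the FROM DUAL clauses are dropped and row_number() OVER (ORDER BY name) stands in for rownum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER UNIQUE, name TEXT UNIQUE)")

# The subquery lists the names; the outer SELECT numbers them.
conn.execute("""
    INSERT INTO person (id, name)
    SELECT row_number() OVER (ORDER BY x.name), x.name
    FROM (SELECT 'MAYUR' AS name UNION ALL
          SELECT 'SALUNKE') AS x
""")
conn.commit()

people = conn.execute("SELECT id, name FROM person ORDER BY id").fetchall()
print(people)
```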

PostgreSQL - Is there a way to do Upsert from one table to another?

I want to insert or update rows from one table into another in PostgreSQL.
So far I have written two SQL commands, one for the insert and one for the update. Is there a way to do this in just one command?
CREATE TEMP TABLE IF NOT EXISTS tmp_table_a (LIKE table_a INCLUDING DEFAULTS);
COPY tmp_table_a FROM STDIN DELIMITER ',' CSV;
UPDATE table_a a SET (id, column1, column2) = (tmp.id, tmp.column1, tmp.column2) FROM tmp_table_a tmp WHERE a.id = tmp.id;
INSERT INTO table_a (id, column1, column2) SELECT tmp.id, tmp.column1, tmp.column2 FROM tmp_table_a tmp WHERE tmp.id not in (select distinct id from table_a);
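One possible single-statement form is INSERT ... ON CONFLICT ... DO UPDATE, which PostgreSQL supports from version 9.5. The sketch below is not PostgreSQL itself: it assumes Python's sqlite3 module with SQLite 3.24 or later, which accepts very similar syntax, and it assumes table_a has a primary key (or unique constraint) on id. The WHERE true is only there because SQLite needs it to parse INSERT ... SELECT ... ON CONFLICT; PostgreSQL does not require it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumption: id is the primary key of table_a.
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT);
    CREATE TEMP TABLE tmp_table_a (id INTEGER, column1 TEXT, column2 TEXT);
    INSERT INTO table_a VALUES (1, 'old', 'old');
    INSERT INTO tmp_table_a VALUES (1, 'new', 'new'), (2, 'added', 'added');
""")

with conn:
    # Rows with a new id are inserted; rows whose id already exists
    # are updated from the incoming ("excluded") values.
    conn.execute("""
        INSERT INTO table_a (id, column1, column2)
        SELECT id, column1, column2 FROM tmp_table_a WHERE true
        ON CONFLICT (id) DO UPDATE SET
            column1 = excluded.column1,
            column2 = excluded.column2
    """)

result = conn.execute(
    "SELECT id, column1, column2 FROM table_a ORDER BY id"
).fetchall()
print(result)
```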

select table to select from, dependent on column-value of already given table

I need to choose which table to select columns from, depending on a column value in an already given table.
First I thought about a CASE construct, if that is possible in SQLite.
SELECT * FROM
CASE IF myTable.column1 = "value1" THEN (SELECT * FROM table1 WHERE ...)
ELSE IF myTable.column1 = "value2" THEN (SELECT * FROM table2 WHERE ...)
END;
I am new to SQL. What construct would be the most concise (not ugly) solution, and if I cannot have it in SQLite, which RDBMS would be the best fit?
Thanks

Here is a proposal for associating each entry in mytable with a value from one of the two tables. It assumes that mytable does not contain just a single entry for choosing the secondary table.
For details on what this means, see "MCVE" further down in this answer.
If you want to switch between two secondary tables based on a single entry in the main table, see the very end of this answer.
Details:
a hardcoded "value1"/"value2" is added on the fly as column1 to the result from each secondary table
the join uses the faked column1 and a secondary join key, assumed here to be id
a union all makes a single table from both secondary tables (including the fake column1)
select *
from mytable
left join
(select 'value1' as column1, * from table1
UNION ALL
select 'value2' as column1, * from table2)
using(id, column1);
Output (for the MCVE provided below, "a-f" from table1, "A-F" from table2):
value1|1|a
value2|2|B
value1|3|c
value1|4|d
value2|5|E
value2|6|F
MCVE:
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE mytable (column1 varchar(10), id int);
INSERT INTO mytable VALUES('value1',1);
INSERT INTO mytable VALUES('value2',2);
INSERT INTO mytable VALUES('value1',3);
INSERT INTO mytable VALUES('value1',4);
INSERT INTO mytable VALUES('value2',5);
INSERT INTO mytable VALUES('value2',6);
CREATE TABLE table2 (value varchar(2), id int);
INSERT INTO table2 VALUES('F',6);
INSERT INTO table2 VALUES('E',5);
INSERT INTO table2 VALUES('D',4);
INSERT INTO table2 VALUES('C',3);
INSERT INTO table2 VALUES('B',2);
INSERT INTO table2 VALUES('A',1);
CREATE TABLE table1 (value varchar(2), id int);
INSERT INTO table1 VALUES('a',1);
INSERT INTO table1 VALUES('b',2);
INSERT INTO table1 VALUES('c',3);
INSERT INTO table1 VALUES('d',4);
INSERT INTO table1 VALUES('e',5);
INSERT INTO table1 VALUES('f',6);
COMMIT;
For selecting between two tables based on a single entry in the main table (in this case "mytable2"):
select * from table1 where (select column1 from mytable2) = 'value1'
union all
select * from table2 where (select column1 from mytable2) = 'value2';
Output (with mytable2 only containing 'value1'):
a|1
b|2
c|3
d|4
e|5
f|6
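To check the per-row variant without a separate SQLite shell, the MCVE and the UNION ALL join can be run through Python's sqlite3 module. This is just a convenience sketch: the inserts are condensed into multi-row VALUES lists, and an ORDER BY id is added (not in the original query) so the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (column1 varchar(10), id int);
    INSERT INTO mytable VALUES ('value1',1), ('value2',2), ('value1',3),
                               ('value1',4), ('value2',5), ('value2',6);
    CREATE TABLE table1 (value varchar(2), id int);
    INSERT INTO table1 VALUES ('a',1), ('b',2), ('c',3), ('d',4), ('e',5), ('f',6);
    CREATE TABLE table2 (value varchar(2), id int);
    INSERT INTO table2 VALUES ('A',1), ('B',2), ('C',3), ('D',4), ('E',5), ('F',6);
""")

# Tag each secondary table with a constant column1, stack them with
# UNION ALL, then join on both id and the tag.
rows = conn.execute("""
    SELECT *
    FROM mytable
    LEFT JOIN (
        SELECT 'value1' AS column1, * FROM table1
        UNION ALL
        SELECT 'value2' AS column1, * FROM table2
    ) USING (id, column1)
    ORDER BY id
""").fetchall()

for row in rows:
    print("|".join(str(col) for col in row))
```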
