PostgreSQL/SQL query optimization

So, I have a log table with about 8M records. Because of a programming error, it happened that there is more than one record per company within the same date. Now, what I need is to delete all records from this log for each company with the same date except the latest one (which has the max id). The count of records to be deleted is approximately 300K.
The fastest and easiest thing that I tried is this:
delete from indexing_log
where id not in (
    select max(id)
    from indexing_log
    group by company_id, "date"
)
But this query is taking an enormous amount of time (about 3 days) on the production server (which for some reason doesn't have an SSD drive). I tried all the ways that I know and need some advice. How can it be made faster?
UPDATE
I decided to do it in batches through a Celery task.
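For illustration, a minimal sketch of what one such batch could look like (table and column names as above; the 10,000 batch size is an arbitrary choice). Run it repeatedly, e.g. from the Celery task, until it deletes zero rows:
delete from indexing_log
where id in (
    select id
    from (
        select id,
               row_number() over (partition by company_id, "date"
                                  order by id desc) as rn
        from indexing_log
    ) ranked
    where rn > 1
    limit 10000
);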

You can try:
delete from indexing_log as l
where exists (
    select *
    from indexing_log as i
    where i.id > l.id
      and i.company_id = l.company_id
      and i."date" = l."date"
);

Dump the distinct rows to a temporary table
create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;
Truncate the original
truncate table indexing_log;
Since the table is now empty, take the opportunity to do an instantaneous vacuum:
vacuum full indexing_log;
Move the rows from the temporary to the original
insert into indexing_log
select *
from t;

TRUNCATE TABLE should be much quicker. But there you cannot say "delete everything except...".
If it is possible with your data, you could write a procedure for that: save your max IDs into a temp table, truncate the table, and write your temp table back. For PostgreSQL the syntax is slightly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html):
SELECT *
INTO #temptable
FROM indexing_log
WHERE id IN (
    SELECT max(id)
    FROM indexing_log
    GROUP BY company_id, "date"
)

NOT EXISTS is sometimes faster than NOT IN:
delete from indexing_log
where not exists (
    select 1
    from (
        select max(id) as iid
        from indexing_log
        group by company_id, "date"
    ) mids
    where indexing_log.id = mids.iid
)

Related

SQL Delete Rows Based on Multiple Criteria

I am trying to delete rows from a data set based on multiple criteria, but I am receiving a syntax error. Here is the current code:
With cte As (
Select *,
Row_Number() Over(Partition By ID, Numb1 Order by ID) as RowNumb
from DataSet
)
Delete from cte Where RowNumb > 1;
Where DataSet looks like this:
I want to delete all records in which the ID and the Numb1 are the same. So I would expect the code to delete all rows except:
I am not very experienced with Vertica but it seems like it is not very flexible about delete statements.
One way to do it would be to use a temporary table to store the rows that you want to keep, then truncate the original table, and insert back into it from the temp table:
create temporary table MyTempTable as
select id, numb1, state_coding
from (
    select d.*, count(*) over(partition by id, numb1) as cnt
    from DataSet d
) as t
where cnt = 1;
truncate table DataSet;
insert into DataSet
select id, numb1, state_coding from MyTempTable;
Note that I used a window count instead of row_number. This will remove records for which at least one other record exists with the same id and numb1, which is what I understand that you want from your sample data and expected results.
Important: make sure to backup your entire table before you do this!
WITH Clauses in Vertica only support SELECT or INSERT, not DELETE/UPDATE.
Vertica Documentation
The cte is a named subquery, not a real table. You cannot delete from it. It is effectively read-only.
If you are trying to delete duplicates out of the original DataSet table, you have to delete from the DataSet, not from the cte table.
Try this:
with cte as
(
select
ID,
Row_Number() Over(Partition By ID, Numb1 Order by ID) as RowNumb
from
DataSet
)
delete from DataSet where ID in (select ID from cte where RowNumb > 1)
You can't delete from CTEs. Just use plain delete syntax and roll back the transaction to test it, or, if you have the permissions, you can always replicate the table and test against the copy.
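A minimal sketch of that rollback test, using one duplicated (ID, Numb1) pair from the question's sample data (assuming autocommit is off, as is the vsql default, so nothing is permanent until you commit):
delete from DataSet where ID = 204281 and Numb1 = 146199; -- the candidate delete
select count(*) from DataSet;                             -- inspect the effect
rollback;                                                 -- undo the test delete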
You'd have saved me ~5 min had you pasted the data as text and not as a picture - as I could not copy-paste and had to retype ...
Having said that:
Rebuild the table here:
DROP TABLE IF EXISTS input;
CREATE TABLE input(id,numb1,state_coding) AS (
SELECT 202003,4718868,'D'
UNION ALL SELECT 202003, 35756,'AA'
UNION ALL SELECT 204281, 146199,'D'
UNION ALL SELECT 204281, 146199,'D'
UNION ALL SELECT 204346, 108094,'D'
UNION ALL SELECT 204346, 108094,'D'
UNION ALL SELECT 204389, 14642,'DD'
UNION ALL SELECT 204389, 96504,'F'
UNION ALL SELECT 204392, 22010,'D'
UNION ALL SELECT 204392, 8051,'G'
UNION ALL SELECT 204400, 74118,'D'
UNION ALL SELECT 204400, 103900,'D'
UNION ALL SELECT 204406,1387304,'D'
UNION ALL SELECT 204406, 0,'HJ'
UNION ALL SELECT 204516, 894,'D'
UNION ALL SELECT 204516, 3927,'D'
UNION ALL SELECT 204586, 234235,'D'
UNION ALL SELECT 204586, 234235,'D'
)
;
And then:
Based on what was said in other responses, and keeping in mind that a mass delete of a large part of a table - not only in Vertica - is best implemented as an INSERT ... SELECT with the WHERE condition inverted - here goes:
CREATE TABLE input_help AS
SELECT * FROM input
GROUP BY id,numb1,state_coding
HAVING COUNT(*) = 1;
DROP TABLE input;
ALTER TABLE input_help RENAME TO input;
At least, it works this simply if the whole row is identical - I notice you don't put state_coding into the condition yourself. Otherwise, it gets slightly more complicated.
Or did you want to re-insert one row of each set of duplicates afterwards?
Then just build input_help as SELECT DISTINCT * FROM input, then drop, then rename.
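That last variant, spelled out (same drop-and-rename pattern as above):
CREATE TABLE input_help AS
SELECT DISTINCT * FROM input;
DROP TABLE input;
ALTER TABLE input_help RENAME TO input;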

How to optimize SELECT some_field, max(primary_key) FROM table GROUP BY some_field

I have SQL query in SQL Azure:
SELECT some_field, max(primary_key) FROM table GROUP BY some_field
The table currently has over 6 million rows. An index on (some_field asc, primary_key desc) is created. The primary_key field is incremental. There are about 700 distinct values of some_field. This select takes at least 30 seconds.
There are only inserts into this table, no updates or deletes.
I can create a separate table to store some_field and the maximal value of the primary key, and write a trigger to maintain it, but I am looking for a more elegant solution. Is there any?
Don't know if this will be performant, but you can give it a shot...
;WITH cte AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY some_field ORDER BY primary_key DESC) AS rn
FROM table
)
SELECT *
FROM cte
WHERE rn = 1
Definitely do the secondary table of "somefield" and "highestPK" columns that is indexed on the "somefield" column. Build that once up front as a baseline and use that.
Then, whenever any new records are inserted into your 6 million record table, have a simple trigger to update your secondary table with something as simple as..
update SecondaryTable
set highestPK = newlyInsertedPKID
where somefield = newlyInsertedSomeFieldValue
This way it stays updated with every insert, since the newest PK for a given "somefield" value will always be the highest; and if the update affects no rows, insert into the secondary table with the new "somefield" value instead, as sketched below.
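A minimal sketch of such a trigger in T-SQL - the table and column names (BigTable, some_field, id) are assumptions, and a MERGE covers both the update and the insert-if-missing cases in one statement:
CREATE TRIGGER trg_TrackHighestPK ON BigTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- fold the newly inserted rows into the summary table
    MERGE SecondaryTable AS tgt
    USING (SELECT some_field, MAX(id) AS max_id
           FROM inserted
           GROUP BY some_field) AS src
        ON tgt.somefield = src.some_field
    WHEN MATCHED AND src.max_id > tgt.highestPK THEN
        UPDATE SET highestPK = src.max_id
    WHEN NOT MATCHED THEN
        INSERT (somefield, highestPK) VALUES (src.some_field, src.max_id);
END;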

How can I delete one of two perfectly identical rows?

I am cleaning out a database table without a primary key (I know, I know, what were they thinking?). I cannot add a primary key, because there is a duplicate in the column that would become the key. The duplicate value comes from one of two rows that are in all respects identical. I can't delete the row via a GUI (in this case MySQL Workbench, but I'm looking for a database-agnostic approach) because it refuses to perform tasks on tables without primary keys (or at least a UQ NN column).
How can I delete one of the twins?
SET ROWCOUNT 1
DELETE FROM [table] WHERE ....
SET ROWCOUNT 0
This will only delete one of the two identical rows
One option to solve your problem is to create a new table with the same schema, and then do:
INSERT INTO new_table (SELECT DISTINCT * FROM old_table)
and then just rename the tables.
You will of course need spare disk space roughly equal to the size of the table to do this!
It's not efficient, but it's incredibly simple.
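A minimal sketch of the whole swap in MySQL syntax (table names are placeholders):
CREATE TABLE new_table LIKE old_table;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
RENAME TABLE old_table TO old_table_backup, new_table TO old_table;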
Note that MySQL has its own extension of DELETE, which is DELETE ... LIMIT, which works in the usual way you'd expect from LIMIT: http://dev.mysql.com/doc/refman/5.0/en/delete.html
The MySQL-specific LIMIT row_count option to DELETE tells the server
the maximum number of rows to be deleted before control is returned to
the client. This can be used to ensure that a given DELETE statement
does not take too much time. You can simply repeat the DELETE
statement until the number of affected rows is less than the LIMIT
value.
Therefore, you could use DELETE FROM some_table WHERE x="y" AND foo="bar" LIMIT 1; note that there isn't a simple way to say "delete everything except one" - just keep checking whether you still have row duplicates.
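A minimal sketch of that repeat-until-clean routine (hypothetical table and column names):
-- check whether twins remain:
SELECT x, foo, COUNT(*) FROM some_table GROUP BY x, foo HAVING COUNT(*) > 1;
-- delete one copy of a specific twin, then re-run the check:
DELETE FROM some_table WHERE x = 'y' AND foo = 'bar' LIMIT 1;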
delete top(1) works on Microsoft SQL Server (T-SQL).
This can be accomplished using a CTE and the ROW_NUMBER() function, as below:
/* Sample Data */
CREATE TABLE #dupes (ID INT, DWCreated DATETIME2(3))
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2015-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 2, '2014-08-03 01:02:03.456'
INSERT INTO #dupes (ID, DWCreated) SELECT 1, '2013-08-03 01:02:03.456'
/* Check sample data - returns three rows, with two rows for ID#1 */
SELECT * FROM #dupes
/* CTE to give each row that shares an ID a unique number */
;WITH toDelete AS
(
SELECT ID, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DWCreated) AS RN
FROM #dupes
)
/* Delete any row that is not the first instance of an ID */
DELETE FROM toDelete WHERE RN > 1
/* Check the results: ID is now unique */
SELECT * FROM #dupes
/* Clean up */
DROP TABLE #dupes
Having a column to ORDER BY is handy, but not necessary unless you have a preference for which of the rows to delete. This will also handle all instances of duplicate records, rather than forcing you to delete one row at a time.
For PostgreSQL you can do this:
DELETE FROM tablename
WHERE id IN (SELECT id
FROM (SELECT id, ROW_NUMBER()
OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
FROM tablename) t
WHERE t.rnum > 1);
column1, column2, column3 make up the column set that has duplicate values.
This works for PostgreSQL
DELETE FROM tablename WHERE id = 123 AND ctid IN (SELECT ctid FROM tablename WHERE id = 123 LIMIT 1)
Have you tried LIMIT 1? It will delete only one of the rows that match your DELETE query:
DELETE FROM `table_name` WHERE `column_name`='value' LIMIT 1;
In my case I could get the GUI to give me a string of values of the row in question (alternatively, I could have done this by hand). On the suggestion of a colleague, in whose debt I remain, I used this to create an INSERT statement:
INSERT INTO some_table
VALUES ('ID1219243408800307444663', '2004-01-20 10:20:55', 'INFORMATION', 'admin' (...));
I tested the insert statement, so that I now had triplets. Finally, I ran a simple DELETE to remove all of them...
DELETE FROM some_table WHERE logid = 'ID1219243408800307444663';
followed by the INSERT one more time, leaving me with a single row, and the bright possibilities of a primary key.
In case you can add a column, like
ALTER TABLE yourtable ADD IDCOLUMN bigint IDENTITY (1, 1) NOT NULL
do so.
Then count rows grouping by your problem column where the count is > 1; this will identify your twins (or triplets or whatever).
Then select your problem column where its content equals the identified content above and check the IDs in IDCOLUMN.
Delete from your table where IDCOLUMN equals one of those IDs. The steps are spelled out below.
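The same steps in T-SQL (yourtable, problemcolumn, and the literal values are placeholders):
-- 1. add the identity column:
ALTER TABLE yourtable ADD IDCOLUMN bigint IDENTITY (1, 1) NOT NULL;
-- 2. identify the twins (or triplets or whatever):
SELECT problemcolumn, COUNT(*) AS c
FROM yourtable
GROUP BY problemcolumn
HAVING COUNT(*) > 1;
-- 3. look up the generated IDs for one duplicated value:
SELECT IDCOLUMN, problemcolumn
FROM yourtable
WHERE problemcolumn = 'duplicated value';
-- 4. delete one of the twins by its ID:
DELETE FROM yourtable WHERE IDCOLUMN = 42;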
You could use a max, which was relevant in my case.
DELETE FROM [table] where id in
(select max(id) from [table] group by col2, col3 having count(id) > 1)
Be sure to test your results first and have a limiting condition in your "having" clause. With such a huge delete query you might want to back up your database first.
delete top(1) from tableName
where -- your conditions for filtering identical rows
I added a Guid column to the table and set it to generate a new id for each row. Then I could delete the rows using a GUI.
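A minimal sketch of that idea in T-SQL (the names and the GUID literal are placeholders):
ALTER TABLE some_table ADD RowGuid uniqueidentifier NOT NULL DEFAULT NEWID();
-- every existing row receives its own generated value, so each twin is now addressable:
SELECT RowGuid, * FROM some_table;
DELETE FROM some_table WHERE RowGuid = '0F8FAD5B-D9CB-469F-A165-70867728950E';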
In PostgreSQL there is an implicit column called ctid. See the wiki. So you are free to use the following:
WITH cte1 as(
SELECT unique_column, max( ctid ) as max_ctid
FROM table_1
GROUP BY unique_column
HAVING count(*) > 1
), cte2 as(
SELECT t.ctid as target_ctid
FROM table_1 t
JOIN cte1 USING( unique_column )
WHERE t.ctid != max_ctid
)
DELETE FROM table_1
WHERE ctid IN( SELECT target_ctid FROM cte2 )
I'm not sure how safe it is to use this when there is a possibility of concurrent updates. So one may find it sensible to run a LOCK TABLE table_1 IN ACCESS EXCLUSIVE MODE; before actually doing the cleanup.
In case there are multiple duplicate rows to delete, all fields are identical (there is no differing id), and the table has no primary key, one option is to save the duplicate rows once with DISTINCT in a new table, delete all the duplicate rows, and insert the rows back. This is helpful if the table is really big and the number of duplicate rows is small.
--- col1, col2 ... coln are the table columns that are relevant.
--- If not sure, add all columns of the table in the selects below and in the join conditions.
--- Make a copy of the table T first, so you can roll back anytime, if possible.
--- Check @@ROWCOUNT after each statement to be sure it is what you want.
--- Use transactions and roll back in case there is an error.
--- First find all values that have identical duplicate rows.
select col1, col2, --- other columns as needed
       count(*) as c
into temp_duplicate
from T
group by col1, col2
having count(*) > 1
--- Save each set of identical rows only once (DISTINCT).
select distinct T.*
into temp_insert
from T
join temp_duplicate D
  on T.col1 = D.col1
 and T.col2 = D.col2 --- and other columns if needed
--- Delete all the rows that are duplicated.
delete T
from T
join temp_duplicate D
  on T.col1 = D.col1
 and T.col2 = D.col2 --- and other columns if needed
--- Add the duplicate rows back, now only once.
insert into T select * from temp_insert
--- Drop the temp tables after you check all is ok.
If, like me, you don't want to have to list out all the columns of the table, you can convert each row to JSONB and compare by that.
(NOTE: This is incredibly inefficient - be careful!)
select to_jsonb(a.*), to_jsonb(b.*)
FROM
table a
left join table b
on
a.entry_date < b.entry_date
where (SELECT NOT exists(
SELECT
FROM jsonb_each_text(to_jsonb(a.*) - 'unwanted_column') t1
FULL OUTER JOIN jsonb_each_text(to_jsonb(b.*) - 'unwanted_column') t2 USING (key)
WHERE t1.value<>t2.value OR t1.key IS NULL OR t2.key IS NULL
))
Suppose we want to delete duplicate records, keeping only one unique record each, from an Employee table - Employee(id, name, age):
delete from Employee
where id not in (select MAX(id)
                 from Employee
                 group by name, age
                );
You can use LIMIT 1. This works perfectly for me with MySQL:
delete from `your_table` [where condition] limit 1;
DELETE FROM Table_Name
WHERE ID NOT IN
(
SELECT MAX(ID) AS MaxRecordID
FROM Table_Name
GROUP BY [FirstName],
[LastName],
[Country]
);

Fastest way to identify differences between two tables?

I have a need to check a live table against a transactional archive table and I'm unsure of the fastest way to do this...
For instance, let's say my live table is made up of these columns:
Term
CRN
Fee
Level Code
My archive table would have the same columns, but also have an archive date so I can see what values the live table had at a given date.
Now... How would I write a query to ensure that the values for the live table are the same as the most recent entries in the archive table?
PS I'd prefer to handle this in SQL, but PL/SQL is also an option if it's faster.
SELECT term, crn, fee, level_code
FROM live_data
MINUS
SELECT term, crn, fee, level_code
FROM historical_data
What's on live but not in historical. You can then union this to the reverse of it to get what's in historical but not live.
Simply:
SELECT collist
FROM TABLE A
minus
SELECT collist
FROM TABLE B
UNION ALL
SELECT collist
FROM TABLE B
minus
SELECT collist
FROM TABLE A;
You didn't mention how rows are uniquely identified, so I've assumed you also have an "id" column:
SELECT *
FROM livetable
WHERE (term, crn, fee, levelcode) NOT IN (
SELECT FIRST_VALUE(term) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(crn) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(fee) OVER (ORDER BY archivedate DESC)
,FIRST_VALUE(levelcode) OVER (ORDER BY archivedate DESC)
FROM archivetable
WHERE livetable.id = archivetable.id
);
Note: This query doesn't take NULLS into account - if any of the columns are nullable you can add suitable logic (e.g. NVL each column to some "impossible" value).
unload to table1.unl
select * from table1
order by 1,2,3,4
unload to table2.unl
select * from table2
order by 1,2,3,4
diff table1.unl table2.unl > diff.unl
Could you use a query of the form:
SELECT your columns FROM your live table
EXCEPT
SELECT your columns FROM your archive table WHERE archive date is most recent;
Any results will be rows in your live table that are not in your most recent archive.
If you also need rows in your most recent archive that are not in your live table, simply reverse the order of the selects and repeat, or get them all in the same query by performing a (live UNION archive) EXCEPT (live INTERSECT archive), as sketched below.
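One way to spell out that symmetric difference (standard SQL, with EXCEPT as in this answer; Oracle spells it MINUS; table and column names follow the question):
(SELECT term, crn, fee, level_code FROM live_table
 EXCEPT
 SELECT term, crn, fee, level_code FROM archive_table
 WHERE archive_date = (SELECT MAX(archive_date) FROM archive_table))
UNION ALL
(SELECT term, crn, fee, level_code FROM archive_table
 WHERE archive_date = (SELECT MAX(archive_date) FROM archive_table)
 EXCEPT
 SELECT term, crn, fee, level_code FROM live_table);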

How to keep only one row of a table, removing duplicate rows?

I have a table that has a lot of duplicates in the Name column. I'd
like to only keep one row for each.
The following lists the duplicates, but I don't know how to delete the
duplicates and just keep one:
SELECT name FROM members GROUP BY name HAVING COUNT(*) > 1;
Thank you.
See the following question: Deleting duplicate rows from a table.
The adapted accepted answer from there (which is my answer, so no "theft" here...):
You can do it in a simple way assuming you have a unique ID field: you can delete all records that are the same except for the ID, but don't have "the minimum ID" for their name.
Example query:
DELETE FROM members
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM members
GROUP BY name
)
In case you don't have a unique index, my recommendation is to simply add an auto-incremental unique index. Mainly because it's good design, but also because it will allow you to run the query above.
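In MySQL, for example, that is a one-liner (assuming the table has no id column yet):
ALTER TABLE members ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY;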
It would probably be easier to select the unique ones into a new table, drop the old table, then rename the temp table to replace it.
#create a table with same schema as members
CREATE TABLE tmp (...);
#insert the unique records
INSERT INTO tmp SELECT * FROM members GROUP BY name;
#swap it in
RENAME TABLE members TO members_old, tmp TO members;
#drop the old one
DROP TABLE members_old;
We have a huge database where deleting duplicates is part of the regular maintenance process. We use DISTINCT to select the unique records and write them into a TEMPORARY TABLE. After a TRUNCATE we write the TEMPORARY data back into the TABLE.
That is one way of doing it, and it works well as a STORED PROCEDURE.
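A minimal sketch of that procedure's body in MySQL (members stands in for the real table name):
CREATE TEMPORARY TABLE members_dedup AS
SELECT DISTINCT * FROM members;
TRUNCATE TABLE members;
INSERT INTO members SELECT * FROM members_dedup;
DROP TEMPORARY TABLE members_dedup;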
If you want to first see which rows you are about to delete, and then delete them:
with MYCTE as (
SELECT DuplicateKey1
,DuplicateKey2 --optional
,count(*) X
FROM MyTable
group by DuplicateKey1, DuplicateKey2
having count(*) > 1
)
SELECT E.*
FROM MyTable E
JOIN MYCTE cte
ON E.DuplicateKey1=cte.DuplicateKey1
AND E.DuplicateKey2=cte.DuplicateKey2
ORDER BY E.DuplicateKey1, E.DuplicateKey2, CreatedAt
Full example at http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
You can join the table with itself on the matching field and delete every row for which a copy with a smaller id exists (keeping the row with the smallest id per group):
DELETE t1 FROM table_name t1
JOIN table_name t2 ON t1.match_field = t2.match_field
WHERE t1.id > t2.id;
Delete duplicate rows, keeping one.
The table has duplicate rows, and some rows may have no duplicates; this keeps one row per group in either case.
The table has two columns, id and name; we want to remove duplicate names from the table and keep only one of each. It works fine at my end - you can use this query:
DELETE FROM tablename
WHERE id NOT IN (
    SELECT id FROM (
        SELECT MIN(id) AS id
        FROM tablename
        GROUP BY name
        HAVING COUNT(*) > 1
    ) AS a
)
AND id NOT IN (
    SELECT ids FROM (
        SELECT MIN(id) AS ids
        FROM tablename
        GROUP BY name
        HAVING COUNT(*) = 1
    ) AS a1
)
Screenshots of the table before and after the delete (not reproduced here) show that this query removes the duplicate rows for amit and akhil while keeping one record of each.
If you want to remove duplicate records from the table:
CREATE TABLE tmp SELECT lastname, firstname, sex
FROM user_tbl
GROUP BY lastname, firstname;
DROP TABLE user_tbl;
ALTER TABLE tmp RENAME TO user_tbl;
Show the duplicate records:
SELECT `page_url`,count(*) FROM wl_meta_tags GROUP BY page_url HAVING count(*) > 1
Delete the duplicate records:
DELETE FROM wl_meta_tags
WHERE meta_id NOT IN (
    SELECT meta_id FROM (
        SELECT MIN(meta_id) AS meta_id
        FROM wl_meta_tags
        GROUP BY page_url
        HAVING COUNT(*) > 1
    ) AS a
)
AND meta_id NOT IN (
    SELECT ids FROM (
        SELECT MIN(meta_id) AS ids
        FROM wl_meta_tags
        GROUP BY page_url
        HAVING COUNT(*) = 1
    ) AS a1
)
WITH CTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY [emp_id] ORDER BY [emp_id]) AS Row, *
    FROM employee_salary
)
DELETE FROM CTE
WHERE Row <> 1