Remove duplicate entries from database with conditions - sql

I've had a good look around but haven't been able to find a solution, so I'm hoping someone can help with this one.
I have a MySQL table of results from an internal logging application which records a result from a check routine; there are a number of check routines, identified by the tracker column:
id (int)(PK), tracker (int), time (timestamp), result (int)
A result only needs to be recorded if the previous result is not the same; only changes need to be captured. Unfortunately this was ignored when it was built (in a hurry) a month ago, and results have been recorded blindly with no checks on previous results. This has now been corrected, but I'm still left with a few thousand rows, a significant number of which are duplicate entries, and I'm after a way of clearing these out to leave just the change points.
So I need to go through each row, look at the previous result recorded by that tracker, and delete the row if it's the same. This is a bit beyond my experience with MySQL and the attempts I've made so far have been fairly poor!
Can anyone help?

Use:
DELETE a
FROM YOUR_TABLE a
LEFT JOIN (SELECT MAX(t.id) AS latest_id
           FROM YOUR_TABLE t
           GROUP BY t.tracker, t.result) b ON b.latest_id = a.id
WHERE b.latest_id IS NULL
Alternate using IN:
DELETE FROM YOUR_TABLE
WHERE id NOT IN (SELECT x.latest_id
                 FROM (SELECT MAX(t.id) AS latest_id
                       FROM YOUR_TABLE t
                       GROUP BY t.tracker, t.result) x)
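If you want to preview what either statement would remove before running it, the same join works as a plain SELECT (a sketch against the same placeholder YOUR_TABLE):
-- rows the DELETE above would remove: everything that is not the highest id for its (tracker, result) pair
SELECT a.*
FROM YOUR_TABLE a
LEFT JOIN (SELECT MAX(t.id) AS latest_id
           FROM YOUR_TABLE t
           GROUP BY t.tracker, t.result) b ON b.latest_id = a.id
WHERE b.latest_id IS NULL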

There are complaints that this is slow to execute, but that probably doesn't affect you. It will certainly be faster than anything else you might do:
select DISTINCT id, tracker, time, result
from table;

I think you want a unique index on the table:
ALTER IGNORE TABLE table ADD UNIQUE INDEX (tracker, time, result)
http://dev.mysql.com/doc/refman/5.1/en/alter-table.html
You'll have to use INSERT IGNORE ... when adding new rows, as inserts that would duplicate an existing (tracker, time, result) key will otherwise cause an error.
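For example, a sketch of what an insert could look like once the index exists (as in the ALTER above, "table" stands in for your real table name, and the values are placeholders):
-- a duplicate (tracker, time, result) row is now skipped with a warning instead of raising an error
INSERT IGNORE INTO table (tracker, time, result)
VALUES (3, NOW(), 1);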

Related

SQL - delete rows where only one column changes

I have a large table in SQL, in which an effective_from date column should update every time one of the other columns changes. However, for some reason, there are numerous rows in which the effective_from date changes, but no other values have changed. For example:
CODE  NAME    EFFECTIVE_FROM
CCWA  Oak     1999
CCWA  Willow  2001
CCWA  Willow  2004
How can I delete the rows where the change in the effective_from date doesn't provide any info, e.g. the third row in the above table?
The tables are very large, so I would prefer to use SELECT statements rather than DELETE or ALTER which seem to be slow.
Any help much appreciated!
I believe you are looking for:
SELECT Code, Name, MAX(EFFECTIVE_FROM)
FROM myTable
GROUP BY Code, Name
Since it is the later date that adds no information, you want to select the minimum date value.
SELECT Code, Name, MIN(EFFECTIVE_FROM)
FROM CodeTable
GROUP BY Code, Name
try this:
SELECT code, name, max(EFFECTIVE_FROM)
FROM tablename
GROUP BY code, name
You want to use lag(). The result set without duplicates:
select t.*
from (select t.*,
             lag(code) over (order by effective_from) as prev_code,
             lag(name) over (order by effective_from) as prev_name
      from t
     ) t
where prev_code is null or
      prev_code <> code or
      prev_name <> name;
This assumes that code and name are never NULL. That is easy to incorporate in the logic (but it makes the where clause a bit complicated).
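If code or name can themselves be NULL, one compact way to handle that in MySQL is the null-safe comparison operator <=> (a sketch along the same lines; other databases would need explicit IS NULL checks instead):
select t.*
from (select t.*,
             lag(code) over (order by effective_from) as prev_code,
             lag(name) over (order by effective_from) as prev_name
      from t
     ) t
where not (prev_code <=> code)   -- true when code really changed, NULLs included
   or not (prev_name <=> name);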
Your question does not clarify the real result you want to achieve: if you want to permanently delete elements from the table, you need to use a DELETE; if your target is simply to filter out the duplicates you described, you can use a SELECT (and the elements will remain in the table).
The fact that you are considering a DELETE makes me suppose that these "duplicates" (except for the date) are not desirable.
In this case you could also consider adding a trigger that prevents the insertion when the informative fields (all fields except EFFECTIVE_FROM) haven't changed; that way only interesting data changes will generate a new row.
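A hedged sketch of what such a trigger could look like in MySQL, reusing the CodeTable name from an earlier answer (the column type, trigger name and error message are illustrative):
DELIMITER //
CREATE TRIGGER only_real_changes
BEFORE INSERT ON CodeTable
FOR EACH ROW
BEGIN
  DECLARE last_name VARCHAR(100);
  -- look up the NAME on the most recent existing row for this CODE
  SELECT NAME INTO last_name
  FROM CodeTable
  WHERE CODE = NEW.CODE
  ORDER BY EFFECTIVE_FROM DESC
  LIMIT 1;
  -- reject the insert if nothing informative has changed
  IF last_name IS NOT NULL AND last_name = NEW.NAME THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'No change in informative fields; row not inserted';
  END IF;
END//
DELIMITER ;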
Then you can execute a one-shot operation which deletes all the duplicated elements that do not reflect any data change (an operation to run at night, or anyway when the system has a low load or no one is using it, if the table really is as large as you said).
This kind of solution changes the nature of the table: you lose the historical record of updates that carried no real data change. Consider it only if that information isn't necessary for your purposes.

SQL - renumbering a sequential column to be sequential again after deletion

I've researched and realize I have a unique situation.
First off, I am not allowed to post images yet to the board since I'm a new user, so see appropriate links below
I have multiple tables where a column (not always the identifier column) is sequentially numbered and shouldn't have any breaks in the numbering. My goal is to make sure this stays true.
Down and Dirty
We have an 'Event' table where we randomly select a percentage of the rows and insert the rows into table 'Results'. The "ID" column from the 'Results' is passed to a bunch of delete queries.
This more or less ensures that there are missing rows in several tables.
My problem:
Figuring out an SQL query that will renumber the column I specify. I would prefer not to drop the column.
Example delete query:
delete ItemVoid
from ItemTicket
join ItemVoid
  on ItemTicket.item_ticket_id = ItemVoid.item_ticket_id
where ItemTicket.ID in (select ID
                        from results)
Example Tables Before:
Example Tables After:
As you can see, 2 rows were deleted from both tables based on the ID column. So now I have to figure out how to renumber the item_ticket_id and the item_void_id columns so that the higher numbers shift down to fill the missing values. Problem #2: if the item_ticket_id changes in order to be sequential in ItemTicket, then that change has to be propagated to ItemVoid's item_ticket_id.
I appreciate any advice you can give on this.
(answering an old question as it's the first search result when I was looking this up)
(MS T-SQL)
Resequencing an ID column (not an identity one) that has gaps can be done with a simple CTE that uses ROW_NUMBER() to generate the new sequence.
The UPDATE works through the CTE 'virtual table' without any extra problems, actually updating the underlying original table.
Don't worry about the ID fields clashing during the update: if you wonder what happens when IDs are set to values that already exist, it doesn't suffer from that problem; the original sequence is changed to the new sequence in one go.
WITH NewSequence AS
(
    SELECT ID,
           ROW_NUMBER() OVER (ORDER BY ID) AS ID_New
    FROM YourTable
)
UPDATE NewSequence SET ID = ID_New;
Since you are looking for advice on this, my advice is that you need to redesign this, as I see a big flaw in your design.
Instead of deleting the records and then going through the hassle of renumbering the remaining records, use a bit flag that marks the records as inactive. Then when you are querying the records, just include a WHERE clause that only includes the records that are active:
SELECT *
FROM yourTable
WHERE Inactive = 0
Then you never have to worry about re-numbering the records. This also gives you the ability to go back and see the records that would have been deleted and you do not lose the history.
If you really want to delete the records and renumber them, then you can perform this task the following way (see the sketch after this list):
Create a new table
Insert your original data into the new table using the new numbers
Drop your old table
Rename your new table with the corrected numbers
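A rough T-SQL sketch of those steps (yourTable, ID and the other column names are placeholders for the real table):
-- steps 1 and 2: create the new table and copy the data with a fresh sequence
SELECT ROW_NUMBER() OVER (ORDER BY ID) AS ID,
       SomeColumn, AnotherColumn          -- replace with the table's real columns
INTO   yourTable_new
FROM   yourTable;
-- step 3: drop the old table
DROP TABLE yourTable;
-- step 4: rename the new table to the original name
EXEC sp_rename 'yourTable_new', 'yourTable';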
As you can see there would be a lot of steps involved in re-numbering the records. You are creating much more work this way when you could just perform an UPDATE of the bit flag.
You would change your DELETE query to something similar to this:
UPDATE ItemVoid
SET InActive = 1
FROM ItemVoid
JOIN ItemTicket
  ON ItemVoid.item_ticket_id = ItemTicket.item_ticket_id
WHERE ItemTicket.ID IN (SELECT ID FROM results)
The bit flag is much easier and that would be the method that I would recommend.
The function that you are looking for is a window function. In standard SQL (supported by SQL Server, and by MySQL 8.0 and later), the function is row_number(). You use it as follows:
select row_number() over (order by <col>)
from <table>
In order to use this in your case, you would delete the rows from the table, then use a with statement to recalculate the row numbers, and then assign them using an update. For transactional integrity, you might wrap the delete and update into a single transaction.
Oracle supports similar functionality, but the syntax is a bit different. Oracle calls these functions analytic functions and they support a richer set of operations on them.
I would strongly caution you against using cursors, since they have lousy performance. Of course, this will not work on an identity column, since such a column cannot be modified.
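For example, a hedged T-SQL sketch of that sequence for one of the tables, using the names from the question (ItemTicket, item_ticket_id, results); propagating the new numbers to ItemVoid would need its own update and is not shown:
BEGIN TRANSACTION;
-- delete the selected rows first
DELETE it
FROM ItemTicket it
WHERE it.ID IN (SELECT ID FROM results);
-- then close the gaps in the sequential column
WITH NewSequence AS
(
    SELECT item_ticket_id,
           ROW_NUMBER() OVER (ORDER BY item_ticket_id) AS new_id
    FROM ItemTicket
)
UPDATE NewSequence SET item_ticket_id = new_id;
COMMIT TRANSACTION;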

SQL: Remove rows whose associations are broken (orphaned data)

I have a table called "downloads" with two foreign key columns -- "user_id" and "item_id". I need to select all rows from that table and remove the rows where the User or the Item in question no longer exists. (Look up the User and if it's not found, delete the row in "downloads", then look up the Item and if it's not found, delete the row in "downloads").
It's 3.4 million rows, so all my scripted solutions have been taking 6+ hours. I'm hoping there's a faster, SQL-only way to do this?
Use two anti-joins and OR them together:
delete from your_table
where user_id not in (select id from users_table)
   or item_id not in (select id from items_table)
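An equivalent form with NOT EXISTS, which sidesteps the NULL pitfall mentioned further down and which some MySQL versions handle better than NOT IN (a sketch using the downloads table from the question and the same assumed users_table/items_table names):
delete from downloads
where not exists (select 1 from users_table u where u.id = downloads.user_id)
   or not exists (select 1 from items_table i where i.id = downloads.item_id)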
Once that's done, consider adding two foreign keys, each with an ON DELETE CASCADE clause; the database will then do this for you automatically.
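A sketch of those foreign keys (InnoDB assumed; the constraint names and the users_table/items_table names are placeholders matching the query above):
-- run this after the orphaned rows have been deleted, or the ALTER will fail
ALTER TABLE downloads
  ADD CONSTRAINT fk_downloads_user
      FOREIGN KEY (user_id) REFERENCES users_table (id) ON DELETE CASCADE,
  ADD CONSTRAINT fk_downloads_item
      FOREIGN KEY (item_id) REFERENCES items_table (id) ON DELETE CASCADE;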
delete from your_table where user_id not in (select id from users_table) or item_id not in (select id from items_table)
I think there is no faster solution when there are so many rows; on your server that works out to roughly 157 rows per second.
Check the user_id: if mysql_num_rows returns 0, delete the downloads row, and also check the item_id the same way.
There was also a similar question about the performance of mysql_num_rows:
MySQL: Fastest way to count number of rows
Edit: I think the best option is to create some triggers so the database server does the job for you; for the first pass I would currently use a cronjob.
For future reference: for these kinds of long operations it is possible to optimise the server independently of the SQL. For example, detach the SQL service, defrag the system disk, and, if you can, ensure the SQL log files are on a separate disk drive from the drive where the database is.
This will at least reduce the pain of these kinds of long operations.
I've found in SQL 2008 R2 that if your "in" clause contains a null value (perhaps from a table that has a nullable reference to this key), no records will be returned! To correct this, just add a clause to your selects in the union part:
delete from SomeTable where Key not in (
select SomeTableKey from TableB where SomeTableKey is not null
union
select SomeTableKey from TableC where SomeTableKey is not null
)

Semi-Distinct MySQL Query

I have a MySQL table called items that contains thousands of records. Each record has a user_id field and a created (datetime) field.
I'm trying to put together a query to SELECT 25 rows, passing a string of user ids as a condition, sorted by created DESC.
In some cases, there might be just a few user ids, while in other instances, there may be hundreds.
If the result set is greater than 25, I want to pare it down by eliminating duplicate user_id records. For instance, if there were two records for user_id = 3, only the most recent (according to created datetime) would be included.
In my attempts at a solution, I am having trouble because while, for example, it's easy to get a result set of 100 (allowing duplicate user_id records), or a result set of 16 (using GROUP BY for unique user_id records), it's hard to get 25.
One logical approach, which may not be the correct MySQL approach, is to get the most recent record for each user_id, and then, if the result set is less than 25, begin adding a second record for each user_id until the 25-record limit is met (maybe a third, fourth, etc. record for each user_id would be needed).
Can this be accomplished with a MySQL query, or will I need to take a large result set and trim it down to 25 with code?
I don't think what you're trying to accomplish is possible as a SQL query. Your desire is to return 25 rows no matter what the normal data groupings are, whereas SQL is usually picky about returning based on data groupings.
If you want a purely MySQL-based solution, you may be able to accomplish this with a stored procedure. (Supported in MySQL 5.0.x and later.) However, it might just make more sense to run the query to return all 100+ rows and then trim it programmatically within the application.
This will get you the most recent for each user --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
  ON i1.user_id = i2.user_id AND i1.created < i2.created
WHERE i2.id IS NULL
This will get you the most recent two records for each user --
SELECT i1.user_id, i1.created
FROM items AS i1
LEFT JOIN items AS i2
  ON i1.user_id = i2.user_id AND i1.created < i2.created
LEFT JOIN items AS i3
  ON i2.user_id = i3.user_id AND i2.created < i3.created
WHERE i3.id IS NULL
Try working from there.
You could nicely put this into a stored procedure.
My opinion is to use application logic, as this is very much application layer logic you are trying to implement at the DB level, i.e. filtering down the results to make the search more useful to the end user.
You could implement a stored procedure (personally I would never do such a thing) or just get the application to decide which 25 results.
One approach would be to get the most recent item from each user, followed by the most recent items from all users, and limit that. You could construct pathological examples where this probably isn't what you want, but it should be pretty good in general.
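A hedged sketch of that idea as a single MySQL query, assuming the items table from the question (id, user_id, created); the ids in the IN list stand in for the passed-in list of user ids:
SELECT i.id, i.user_id, i.created
FROM items i
WHERE i.user_id IN (3, 7, 12)                        -- the supplied user ids
ORDER BY
  (i.created = (SELECT MAX(created)
                FROM items
                WHERE user_id = i.user_id)) DESC,    -- each user's newest row first
  i.created DESC                                     -- then fill with the next most recent rows
LIMIT 25;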
Unfortunately, there is no easy way :( I had to do something similar when I built a report for my company that would pull up customer disables that were logged in a database. The only problem was that the disconnect is run and logged every 30 minutes. Therefore, the rows would not be distinct, since the timestamp was different in every disconnect. I solved this problem with subqueries. I don't have the exact code anymore, but I believe this is how I implemented it:
SELECT CORP, HOUSE, CUST,
       (SELECT TOP 1 hsd
        FROM #TempTable t2
        WHERE t1.corp = t2.corp
          AND t1.house = t2.house
          AND t1.cust = t2.cust
       ) AS DisableDate
FROM #TempTable t1
GROUP BY corp, house, cust -- selecting distinct
So, my answer is to eliminate the non-distinct column from the query by using subqueries. There might be an easier way to do it, though. I'm curious to see what others post.
Sorry, I keep editing this; I keep trying to find ways to make it easier to show what I did.

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query and check the number of rows returned. If 0 rows, do this; otherwise, do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
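In MySQL specifically you can skip the CASE wrapper and select the EXISTS directly; it returns 1 or 0 (your_table stands in for the real table name):
-- 1 if at least one matching row exists, 0 otherwise
SELECT EXISTS (SELECT 1 FROM your_table WHERE id = 5);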
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1, depending on the DB) to check for existence of a row makes a big difference in terms of performance for large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest. Even if the server's effort is similar, bandwidth is spent returning data that you aren't even going to do anything with.
As for the count, the previous answers will do for your result set, unless you're working with a client interface that reports the number of rows returned; that can sometimes be used to find out how many rows came back from the last query. You'll need to look at your interface documentation for how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of "Using index". If it's there, then your query is coming straight from the index and is going to be much faster.