SQL Delete low counts - sql

I have a table with this data:
Id Qty
-- ---
A 1
A 2
A 3
B 112
B 125
B 109
But I'm supposed to only have the max values for each id. Max value for A is 3 and for B is 125. How can I isolate (and delete) the other values?
The final table should look like this :
Id Qty
-- ---
A 3
B 125
Running MySQL 4.1

Oh wait. Got a simpler solution :
I'll select all the max values(group by id), export the data, flush the table, reimport only the max values.
CREATE TABLE tabletemp LIKE table;
INSERT INTO tabletemp SELECT id,MAX(qty) FROM table GROUP BY id;
DROP TABLE table;
RENAME TABLE tabletemp TO table;
Thanks to all !
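For anyone who wants to try the rebuild-and-swap idea before touching real data, here is a minimal sketch using Python's built-in sqlite3 module. It is not MySQL: SQLite has no CREATE TABLE ... LIKE or RENAME TABLE, so the columns are spelled out and ALTER TABLE ... RENAME TO is used instead. The table names stock and stock_tmp are made up for the illustration.

```python
import sqlite3

def keep_max_by_rebuild(conn):
    """Rebuild the (hypothetical) `stock` table so only the highest qty per id remains."""
    cur = conn.cursor()
    # SQLite has no CREATE TABLE ... LIKE, so the columns are listed explicitly.
    cur.execute("CREATE TABLE stock_tmp (id TEXT, qty INTEGER)")
    cur.execute("INSERT INTO stock_tmp SELECT id, MAX(qty) FROM stock GROUP BY id")
    cur.execute("DROP TABLE stock")
    cur.execute("ALTER TABLE stock_tmp RENAME TO stock")
    conn.commit()

rebuild_conn = sqlite3.connect(":memory:")
rebuild_conn.execute("CREATE TABLE stock (id TEXT, qty INTEGER)")
rebuild_conn.executemany(
    "INSERT INTO stock VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 112), ("B", 125), ("B", 109)],
)
keep_max_by_rebuild(rebuild_conn)
rebuilt_rows = rebuild_conn.execute("SELECT id, qty FROM stock ORDER BY id").fetchall()
print(rebuilt_rows)
```

Note that any indexes, triggers or foreign keys on the original table are not carried over by this kind of swap and would have to be recreated.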

Try this in SQL Server:
delete o from tbl o
left outer join
(Select max(qty) anz , id
from tbl i
group by i.id) k on o.id = k.id and k.anz = o.qty
where k.id is null
Revision 2 for MySQL... Can anyone check this one?:
delete from tbl
where concat(id,qty) not in
(select concat(id,anz) from (Select max(qty) anz , id
from tbl i
group by i.id) k)
Explanation:
Since I was not supposed to use joins (see the comments about MySQL support for joins in delete/update/insert), I moved the subquery into an IN(a,b,c) clause.
Inside an IN clause I can use a subquery, but that query is only allowed to return one field. So in order to filter out all elements that are not the maximum, I need to concat both fields into a single one that I can return inside the IN clause. So basically my query inside the IN returns only the biggest ID+QTY per id. To compare it with the main table I also need a concat on the outside, so the data for both fields match.
Basically the In clause contains:
("A3","B125")
Disclaimer: The above query is "evil!" since it uses a function (concat) on fields to compare against. This will cause any index on those fields to become almost useless. You should never formulate a query that way that is run on a regular basis. I only wanted to try to bend it so it works on mysql.
Example of this "bad construct":
(Get all o from the last 2 weeks)
select ... from orders where orderday + 14 > now()
You should always do:
select ... from orders where orderday > now() - 14
The difference is subtle: version 2 only has to do the math once and is able to use the index, while version 1 has to do the math for every single row in the orders table, and you can forget about the index usage...
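To see the index effect for yourself, here is a small sketch with Python's sqlite3 module. It is an illustration under assumptions: SQLite rather than MySQL, orderday stored as a plain day number, and the table, index and column names invented for the demo. It asks the planner for both query plans and checks that both forms return the same rows.

```python
import sqlite3

plan_conn = sqlite3.connect(":memory:")
plan_conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, orderday INTEGER)"
)
plan_conn.execute("CREATE INDEX idx_orders_orderday ON orders (orderday)")
plan_conn.executemany(
    "INSERT INTO orders (customer, orderday) VALUES (?, ?)",
    [("c%d" % day, day) for day in range(1, 101)],
)
today = 100  # stand-in for now(), as a day number

def query_plan(sql):
    """Return the detail text of EXPLAIN QUERY PLAN for a one-parameter query."""
    rows = plan_conn.execute("EXPLAIN QUERY PLAN " + sql, (today,)).fetchall()
    return " ".join(row[-1] for row in rows)

wrapped_sql = "SELECT customer FROM orders WHERE orderday + 14 > ?"  # math on the column
bare_sql = "SELECT customer FROM orders WHERE orderday > ? - 14"     # math on the constant

plan_wrapped = query_plan(wrapped_sql)
plan_bare = query_plan(bare_sql)
same_rows = sorted(plan_conn.execute(wrapped_sql, (today,)).fetchall()) == sorted(
    plan_conn.execute(bare_sql, (today,)).fetchall()
)
print(plan_wrapped)
print(plan_bare)
```

The exact plan wording differs between SQLite versions, but the first form should show a full scan and the second a search using the index.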

I'd try this:
delete from T
where exists (
select * from T as T2
where T2.Id = T.Id
and T2.Qty > T.Qty
);
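A quick way to check what this EXISTS delete leaves behind is to run it against an in-memory SQLite database from Python. This is only a sketch: SQLite accepts the same statement, but other engines (notably older MySQL) may refuse a subquery on the table being deleted from. The extra C row is added to show that a single-row group survives.

```python
import sqlite3

exists_conn = sqlite3.connect(":memory:")
exists_conn.execute("CREATE TABLE T (Id TEXT, Qty INTEGER)")
exists_conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 112), ("B", 125), ("B", 109), ("C", 7)],
)
# Delete every row for which a row with the same Id and a larger Qty exists.
exists_conn.execute(
    """
    DELETE FROM T
    WHERE EXISTS (
        SELECT 1 FROM T AS T2
        WHERE T2.Id = T.Id AND T2.Qty > T.Qty
    )
    """
)
remaining_after_exists = exists_conn.execute(
    "SELECT Id, Qty FROM T ORDER BY Id"
).fetchall()
print(remaining_after_exists)
```

If two rows tie for the maximum Qty, neither has a larger sibling, so both are kept.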
For those who might have a similar question in the future, this might be supported some day (it is now in SQL Server 2005 and later).
It doesn't require a join, and it has advantages over the use of a temporary table if the table has dependencies:
with Tranked(Id,Qty,rk) as (
select
Id, Qty,
rank() over (
partition by Id
order by Qty desc
)
from T
)
delete from Tranked
where rk > 1;

You'll have to go via another table. Among other things, what makes a single delete statement here quite impossible in MySQL is that you can't delete from a table and use the same table in a subquery.
BEGIN;
-- one row per id holding the max qty (GROUP BY is needed for that)
create temporary table tmp_del select id,max(qty) as qty from the_tbl group by id;
-- delete the rows that are below their id's max
delete the_tbl from the_tbl,tmp_del where
the_tbl.id=tmp_del.id and the_tbl.qty<tmp_del.qty;
drop table tmp_del;
COMMIT;

MySQL 4.0 and later supports a simple multi-table syntax for DELETE:
DELETE t1 FROM MyTable t1 JOIN MyTable t2 ON t1.id = t2.id AND t1.qty < t2.qty;
This produces a join of each row with a given id to all other rows with the same id, and deletes only the row with the lesser qty in each pairing. After this is all done, the row with the greatest qty per group of id is left not deleted.
If you only have one row with a given id, it still works because a single row is naturally the one with the greatest value.
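The pairing described above can be made visible with a small sketch in Python's sqlite3. SQLite has no multi-table DELETE, so as a stand-in the same join is run as a SELECT to list the rows that would be deleted, and those rows are then removed by rowid. The sample data and the extra single-row id C are assumptions for the demo.

```python
import sqlite3

join_conn = sqlite3.connect(":memory:")
join_conn.execute("CREATE TABLE MyTable (id TEXT, qty INTEGER)")
join_conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 112), ("B", 125), ("B", 109), ("C", 7)],
)
# Same join condition as the MySQL DELETE: t1 is the lesser row of each pairing.
doomed = join_conn.execute(
    """
    SELECT DISTINCT t1.rowid, t1.id, t1.qty
    FROM MyTable t1 JOIN MyTable t2
      ON t1.id = t2.id AND t1.qty < t2.qty
    ORDER BY t1.rowid
    """
).fetchall()
join_conn.executemany(
    "DELETE FROM MyTable WHERE rowid = ?", [(row[0],) for row in doomed]
)
remaining_after_join = join_conn.execute(
    "SELECT id, qty FROM MyTable ORDER BY id"
).fetchall()
print(doomed)
print(remaining_after_join)
```

The number of pairings grows roughly with the square of the group size, which may be part of why large groups take so long.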
FWIW, I just tried my solution using MySQL 5.0.75 on a Macbook Pro 2.40GHz. I inserted 1 million rows of synthetic data, with different numbers of rows per "group":
2 rows per id completes in 26.78 sec.
5 rows per id completes in 43.18 sec.
10 rows per id completes in 1 min 3.77 sec.
100 rows per id completes in 6 min 46.60 sec.
1000 rows per id didn't complete before I terminated it.

Related

SQL loop on duplicate row to combine into one

I have something to fix in my database here it is:
I have a table with duplicate rows like that:
The duplicated columns are IDPatient and IDObjet. The same pair should never appear twice, which is why I put a key on both columns, but it's a bit too late. So I have to fix this by combining the duplicate rows into one without losing data, and putting the values in order.
For example, as you can see in the picture, the column texte_1 holds a date in each row: 2010-11-25 and 2011-11-04. The date 2010-11-25 comes before 2011-11-04, so I have to put 2011-11-04 into the column texte_2 of the first row, and keep going like that for each value in the duplicate rows, checking each time whether the date is older or not. If it is, I have to replace the value in the first row with the one from the second row, hold the replaced value in a temp variable, and then find a new column ("Texte_X") in the same row to insert it into, again checking that it is not older.
I can have multiple duplicate rows in my table, and I know looping in SQL Server is slow, but I would really appreciate a good solution for this.
Here's an example of multiple duplicate rows
How about a MERGE:
merge mytable as t
using (
select idPatient, idObject, max(texte_1) dt
from mytable
group by idPatient, idObject
) s on t.idPatient = s.idPatient
and t.idObject = s.idObject
and t.texte_1 != s.dt
when matched then delete;
You could use the ROW_NUMBER() function and your ID field to order the duplicates, then PIVOT to de-normalize the records, or self-joins, like:
;with cte as (SELECT *,RN = ROW_NUMBER() OVER(PARTITION BY IDPatient,IDObjet ORDER BY ID)
FROM YourTable
)
SELECT a.IDPatient,a.IDObjet,a.Texte_1, b.Texte_1 as Texte_2, c.Texte_1 AS Texte_3
FROM cte a
LEFT JOIN cte b
ON a.IDPatient = b.IDPatient
AND a.IDObjet = b.IDObjet
AND b.RN = 2
LEFT JOIN cte c
ON a.IDPatient = c.IDPatient
AND a.IDObjet = c.IDObjet
AND c.RN = 3
WHERE a.RN = 1
This assumes the ID order is sufficient; you could change it to your date field if needed. Since you ultimately want to remove the duplicate lines, you could either run this query into a new table, or use it as the basis of your update and then DELETE the records from the cte above where RN > 1.
Personally, I would avoid the de-normalized Texte_1-10 structure, and add a new field that's the equivalent of the RN field as part of the key.
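Here is a minimal sketch of the ROW_NUMBER plus self-join idea, run through Python's sqlite3 (window functions need SQLite 3.25 or later). The sample rows are invented, only two Texte columns are produced, and the ordering is by the date text rather than by ID, which is an assumption about what the asker wants.

```python
import sqlite3

pivot_conn = sqlite3.connect(":memory:")
pivot_conn.execute(
    "CREATE TABLE YourTable ("
    "ID INTEGER PRIMARY KEY, IDPatient INTEGER, IDObjet INTEGER, Texte_1 TEXT)"
)
pivot_conn.executemany(
    "INSERT INTO YourTable (IDPatient, IDObjet, Texte_1) VALUES (?, ?, ?)",
    [(1, 10, "2011-11-04"), (1, 10, "2010-11-25"), (2, 10, "2012-01-15")],
)
# Number the duplicates per (IDPatient, IDObjet), oldest date first,
# then join row 1 to row 2 to put both dates on one line.
combined = pivot_conn.execute(
    """
    WITH cte AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY IDPatient, IDObjet ORDER BY Texte_1) AS RN
        FROM YourTable
    )
    SELECT a.IDPatient, a.IDObjet, a.Texte_1, b.Texte_1 AS Texte_2
    FROM cte a
    LEFT JOIN cte b
      ON a.IDPatient = b.IDPatient AND a.IDObjet = b.IDObjet AND b.RN = 2
    WHERE a.RN = 1
    ORDER BY a.IDPatient
    """
).fetchall()
print(combined)
```

Sorting the text column works here only because the dates are in ISO year-month-day form.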

UPDATE random row from another table SQL Server 2014

I tried to do an UPDATE statement with a random row from another table. I know this question has been asked before (here), but it doesn't seem to work for me.
I should update each row with a different value from the other table. In my case it only gets one random row from a table and puts that in every row.
UPDATE dbo.TABLE_CHARGE
SET COLRW_STREET =
(SELECT TOP 1 COLRW_STREET FROM CHIEF_PreProduction.dbo.TABLE_FAKESTREET
ORDER BY ABS(CHECKSUM(NewId())%250))
Thanks in advance!
I took the liberty of assuming that you have an ID field in your TABLE_CHARGE table. This is probably not the most efficient way, but it seems to work:
WITH random_values as
(
SELECT t.id, t.COLRW_STREET, t.random_street FROM (
SELECT c.id, c.COLRW_STREET,
f.COLRW_STREET as random_street, ROW_NUMBER() OVER (partition by c.id ORDER BY ABS(CHECKSUM(NewId())%250)) rn
FROM table_charge c, TABLE_FAKESTREET f) t
WHERE t.rn = 1
)
UPDATE random_values SET COLRW_STREET = random_street;
SQL Fiddle demo
Your original code did not work because when you do ... SET x = (SELECT TOP 1 ..) the database does an OUTER JOIN of your target table with the one TOP row, which means that this single row is applied to all rows in your target table. Hence you get the same value in all rows.
Following query demonstrates what is happening in the UPDATE:
SELECT * FROM
TABLE_CHARGE tc,
(SELECT TOP 1 COLRW_STREET as random_street FROM TABLE_FAKESTREET
ORDER BY ABS(CHECKSUM(NewId())%250)) t
My solution gets all fake records ordered randomly for each record in target table and only selects the first one per ID.
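The same "rank the cross join randomly, keep the first per id" idea can be tried out in Python's sqlite3 (SQLite 3.25 or later for ROW_NUMBER). This is a sketch, not the SQL Server statement: the column is shortened to street, random() replaces the NEWID() trick, and the pick is stored in a temp table named picked because SQLite cannot update through a CTE.

```python
import sqlite3

rand_conn = sqlite3.connect(":memory:")
rand_conn.execute("CREATE TABLE table_charge (id INTEGER PRIMARY KEY, street TEXT)")
rand_conn.execute("CREATE TABLE table_fakestreet (street TEXT)")
rand_conn.executemany(
    "INSERT INTO table_charge (street) VALUES (?)",
    [("real %d" % i,) for i in range(20)],
)
fake_streets = ["Elm St", "Oak Ave", "Pine Rd", "Maple Ln"]
rand_conn.executemany(
    "INSERT INTO table_fakestreet VALUES (?)", [(s,) for s in fake_streets]
)
# For every target id, order all fake streets randomly and keep the first one.
rand_conn.execute(
    """
    CREATE TEMP TABLE picked AS
    SELECT id, random_street FROM (
        SELECT c.id AS id, f.street AS random_street,
               ROW_NUMBER() OVER (PARTITION BY c.id ORDER BY random()) AS rn
        FROM table_charge c CROSS JOIN table_fakestreet f
    ) WHERE rn = 1
    """
)
rand_conn.execute(
    """
    UPDATE table_charge
    SET street = (SELECT random_street FROM picked WHERE picked.id = table_charge.id)
    """
)
updated_streets = [r[0] for r in rand_conn.execute("SELECT street FROM table_charge")]
print(updated_streets)
```

The cross join produces one row per (target, fake) pair, so for large tables this can get expensive.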

Deleting non distinct rows

I have a table that has a unique non-clustered index and 4 of the columns are listed in this index. I want to update a large number of rows in the table. If I do so, they will no longer be distinct, therefore the update fails because of the index.
I am wanting to disable the index and then delete the oldest duplicate rows. Here's my query so far:
SELECT t.itemid, t.fieldid, t.version, updated
FROM dbo.VersionedFields w
inner JOIN
(
SELECT itemid, fieldid, version, COUNT(*) AS QTY
FROM dbo.VersionedFields
GROUP BY itemid, fieldid, version
HAVING COUNT(*) > 1
) t
on w.itemid = t.itemid and w.fieldid = t.fieldid and w.version = t.version
The select inside the inner join returns the right number of records that we want to delete, but groups them so there is actually twice the amount.
After the join it shows all the records but all I want to delete is the oldest ones?
How can this be done?
If you say SQL (Structured Query Language) but really mean SQL Server (the Microsoft relational database system), and if you're using SQL Server 2005 or newer, you can use a CTE (Common Table Expression) for this purpose.
With this CTE, you can partition your data by some criteria - i.e. your ItemId (or a combination of columns) - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - i.e. probably version (or some other column).
So try something like this:
;WITH PartitionedData AS
(
SELECT
itemid, fieldid, version,
ROW_NUMBER() OVER(PARTITION BY ItemId ORDER BY version DESC) AS 'RowNum'
FROM dbo.VersionedFields
)
DELETE FROM PartitionedData
WHERE RowNum > 1
Basically, you're partitioning your data by some criteria and numbering each partition, starting at 1 for each new partition, ordered by some other criteria (e.g. Date or Version).
So for each "partition" of data, the "newest" entry has RowNum = 1, and any others that belong to the same partition (by means of having the same partition values) will have sequentially numbered values from 2 up to however many rows there are in that partition.
If you want to keep only the newest entry - delete anything with a RowNum larger than 1 and you're done!
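To watch the numbering at work, here is a small sketch in Python's sqlite3 (SQLite 3.25 or later). SQLite cannot delete through a CTE, so the row numbers are computed in a subquery and the delete goes by rowid. Partitioning on all three key columns and ordering by updated are assumptions about what "newest" means in the question, and the sample rows are made up.

```python
import sqlite3

rn_conn = sqlite3.connect(":memory:")
rn_conn.execute(
    "CREATE TABLE VersionedFields (itemid INT, fieldid INT, version INT, updated TEXT)"
)
rn_conn.executemany(
    "INSERT INTO VersionedFields VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2020-01-01"), (1, 1, 1, "2020-02-01"),
        (1, 2, 1, "2020-01-05"),
        (2, 1, 3, "2020-03-01"), (2, 1, 3, "2020-01-01"), (2, 1, 3, "2020-02-01"),
    ],
)
# Number rows per partition, newest first; everything numbered above 1 is deleted.
rn_conn.execute(
    """
    DELETE FROM VersionedFields
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid, ROW_NUMBER() OVER (
                PARTITION BY itemid, fieldid, version
                ORDER BY updated DESC) AS rn
            FROM VersionedFields
        ) WHERE rn > 1
    )
    """
)
newest_rows = rn_conn.execute(
    "SELECT itemid, fieldid, version, updated FROM VersionedFields "
    "ORDER BY itemid, fieldid"
).fetchall()
print(newest_rows)
```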
In SQL Server 2005 and above:
WITH q AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY itemid, fieldid, version ORDER BY updated DESC) AS rn
FROM versionedFields
)
DELETE
FROM q
WHERE rn > 1
Try something like:
DELETE w FROM dbo.VersionedFields w WHERE w.version < (SELECT MAX(version) FROM dbo.VersionedFields)
Of course, you'd want to limit the MAX(version) to only the versions of the field you're wanting to delete.
You probably need to look at this Stack Overflow answer (delete earlier of duplicate rows).
Essentially the technique uses grouping (or optionally, windowing) to find the minimum id value of a group in order to delete it. It may be more accurate to delete rows where the value <> max(row identifier).
So:
Drop unique index
Load data
Delete data using the grouping mechanism (ideally in a transaction, so that you can rollback if there is a mistake), then commit
Recreate the index.
Note that recreating an index on a big table can take a long time.
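The four steps can be rehearsed end to end with Python's sqlite3. This is only a sketch of the sequence: the table vf, the index ux_vf and the sample rows are invented, and "oldest" is approximated by keeping the highest rowid in each duplicate group, which is the "delete where the value <> max(row identifier)" variant.

```python
import sqlite3

idx_conn = sqlite3.connect(":memory:")
idx_conn.execute("CREATE TABLE vf (itemid INT, fieldid INT, version INT, updated TEXT)")
idx_conn.execute("CREATE UNIQUE INDEX ux_vf ON vf (itemid, fieldid, version)")
idx_conn.executemany(
    "INSERT INTO vf VALUES (?, ?, ?, ?)",
    [(1, 1, 1, "2020-01-01"), (1, 1, 2, "2020-02-01"), (2, 1, 1, "2020-01-15")],
)
idx_conn.commit()

# 1. Drop the unique index so the update is allowed to create duplicates.
idx_conn.execute("DROP INDEX ux_vf")
# 2. Load / update the data; rows 1 and 2 now collide on (1, 1, 2).
idx_conn.execute("UPDATE vf SET version = 2 WHERE version = 1")
# 3. Delete every row that is not the max rowid of its group, inside a transaction.
try:
    idx_conn.execute(
        """
        DELETE FROM vf
        WHERE rowid NOT IN (
            SELECT MAX(rowid) FROM vf GROUP BY itemid, fieldid, version
        )
        """
    )
    idx_conn.commit()
except sqlite3.Error:
    idx_conn.rollback()
    raise
# 4. Recreate the index; this would fail if duplicates were still present.
idx_conn.execute("CREATE UNIQUE INDEX ux_vf ON vf (itemid, fieldid, version)")

deduped_rows = idx_conn.execute(
    "SELECT itemid, fieldid, version, updated FROM vf ORDER BY itemid"
).fetchall()
index_names = [
    r[0] for r in idx_conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(deduped_rows)
```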

SQL Microsoft Access

I have a table of transactions in Microsoft Access that contains many transactions for many vendors. I need to identify if there is sequential transaction numbering for each vendor. I don't know what the sequence will be or the number of transactions per vendor. I need to write a SQL that identifies sequential numbering for vendors and sets a field to '1' if present. I was thinking of running nested loops that first determine number of transactions per vendor then loops through those transactions comparing the transaction numbers. Can anybody help me with this??
To find one sequential set (2 records where one transaction number follows the other):
SELECT transactionId FROM tbl WHERE EXISTS
(SELECT * FROM tbl as t WHERE tbl.vendorId = t.vendorId
AND tbl.transactionId+1 = t.transactionId)
I'm not sure this is the most straightforward approach but I think it could work. Apologies for using multiple steps but Jet 4.0 kind of forces one to do so.**
I've assumed all transactionId values are positive integers and that a sequence is a set of evenly spaced transactionId values by vendorId. I further assume there is a key on (vendorId, transactionId).
First step, eliminate invalid rows, e.g. you need at least three rows to be able to determine a sequence (do all other rows pass or fail?); you may want to filter other junk out here too (e.g. rows/groups with NULL values):
CREATE VIEW tbl1
AS
SELECT T1.vendorId, T1.transactionId
FROM tbl AS T1
WHERE EXISTS (
SELECT T2.vendorId
FROM tbl AS T2
WHERE T2.vendorId = T1.vendorId
GROUP
BY T2.vendorId
HAVING COUNT(*) > 2
);
Find the lowest value for each vendor (comes in handy later):
CREATE VIEW tbl2
AS
SELECT vendorId, MIN(transactionId) AS transactionId_min
FROM tbl1
GROUP
BY vendorId;
Make all sequences start at zero (transactionId_base_zero) by subtracting the lowest value for each vendor:
CREATE VIEW tbl3
AS
SELECT T1.vendorId, T1.transactionId,
T1.transactionId - T2.transactionId_min AS transactionId_base_zero
FROM tbl1 AS T1
INNER JOIN tbl2 AS T2
ON T1.vendorId = T2.vendorId;
Predict the step value (difference between adjacent sequence values) based on the MAX, MIN and COUNT set values for each vendor:
CREATE VIEW tbl4
AS
SELECT vendorId,
MAX(transactionId_base_zero) / (COUNT(*) - 1)
AS transactionId_predicted_step
FROM tbl3
GROUP
BY vendorId;
Test that the predicted step value holds true for each sequence value, i.e. (pseudocode) this_transactionId - step_value = prior_transactionId (omit the lowest transactionId because it doesn't have a prior value!):
SELECT DISTINCT T.vendorId
FROM tbl3 AS T
WHERE T.transactionId_base_zero > 0
AND NOT EXISTS (
SELECT *
FROM tbl3 AS T3
INNER JOIN tbl4 AS T4
ON T3.vendorId = T4.vendorId
WHERE T.vendorId = T3.vendorId
AND T.transactionId_base_zero
- T4.transactionId_predicted_step
= T3.transactionId_base_zero
);
The above query should return the vendorId of vendors whose transactionId values are not sequential.
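The logic of the views may be easier to follow outside SQL. Below is a plain Python sketch of the same steps (base-zero the ids, predict the step from MAX and COUNT, check that every non-lowest value has a predecessor one step below). The function names are made up, and one detail is an assumption on my part: a predicted step that is not a whole number is treated as "not sequential" straight away.

```python
from collections import defaultdict

def is_evenly_spaced(transaction_ids):
    """True/False for evenly spaced ids, None if there are too few rows to decide."""
    ids = sorted(set(transaction_ids))
    if len(ids) < 3:
        return None  # mirrors tbl1: fewer than three rows cannot show a sequence
    base_zero = [i - ids[0] for i in ids]                    # mirrors tbl3
    step, remainder = divmod(base_zero[-1], len(ids) - 1)    # mirrors tbl4
    if remainder:
        return False  # assumption: a fractional step cannot be a sequence
    present = set(base_zero)
    # Every value except the lowest must have a prior value exactly one step below.
    return all(b - step in present for b in base_zero if b > 0)

def non_sequential_vendors(rows):
    """rows is an iterable of (vendor_id, transaction_id) pairs."""
    by_vendor = defaultdict(list)
    for vendor_id, transaction_id in rows:
        by_vendor[vendor_id].append(transaction_id)
    return sorted(v for v, ids in by_vendor.items() if is_evenly_spaced(ids) is False)

sample_rows = [
    ("V1", 10), ("V1", 20), ("V1", 30),   # evenly spaced, step 10
    ("V2", 1), ("V2", 2), ("V2", 4),      # uneven
    ("V3", 5), ("V3", 6),                 # too few rows, filtered out
]
print(non_sequential_vendors(sample_rows))
```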
** In my defense, I ran into a couple of bugs in Jet 4.0 that I had to code around. Yes, I do know the bugs are in Jet 4.0 (or its OLE DB provider) because a) I double-checked the results using SQL Server and b) they defy logic! (even SQL's own strange 3VL logic :)
I would use a query that finds gaps in numbering for any vendor, and if that returns any records, then you do not have sequential numbering for all vendors.
SELECT *
FROM tblTransaction As T1
WHERE (
SELECT TOP 1 T2.transactionID
FROM tblTransaction As T2
WHERE T1.vendorID = T2.vendorID AND
T1.transactionID < T2.transactionID
ORDER BY T2.transactionID
) - T1.transactionID > 1
What this does is, for each record in the table, look for the lowest-numbered other transactionID in the same table that is for the same vendor and is higher than that record's transactionID. If the transactionID of that next record is more than one higher than the value in the first record, that represents a gap in numbering for the vendor.
Edit: Changed variable names above as requested.
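The gap query can be tried without Access by running an equivalent in Python's sqlite3. This is a sketch under assumptions: SQLite has no TOP, so LIMIT 1 is used instead, and the sample vendors and numbers are invented.

```python
import sqlite3

gap_conn = sqlite3.connect(":memory:")
gap_conn.execute("CREATE TABLE tblTransaction (vendorID TEXT, transactionID INTEGER)")
gap_conn.executemany(
    "INSERT INTO tblTransaction VALUES (?, ?)",
    [("V1", 1), ("V1", 2), ("V1", 3), ("V2", 5), ("V2", 6), ("V2", 9)],
)
# For each row, find the next higher transactionID of the same vendor;
# a difference above 1 means numbers are missing after this row.
gaps = gap_conn.execute(
    """
    SELECT T1.vendorID, T1.transactionID
    FROM tblTransaction AS T1
    WHERE (
        SELECT T2.transactionID
        FROM tblTransaction AS T2
        WHERE T1.vendorID = T2.vendorID
          AND T1.transactionID < T2.transactionID
        ORDER BY T2.transactionID
        LIMIT 1
    ) - T1.transactionID > 1
    ORDER BY T1.vendorID, T1.transactionID
    """
).fetchall()
print(gaps)
```

The last row of each vendor has no successor, so the subquery yields NULL and the row is filtered out rather than reported as a gap.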

Delete all but top n from database table in SQL

What's the best way to delete all rows from a table in sql but to keep n number of rows on the top?
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Edit:
Chris brings up a good point about the performance hit, since the TOP 10 query would be run for each row. If this is a one-time thing, then it may not be as big of a deal, but if it is a common thing, then I would look closer at it.
I would select ID column(s) the set of rows that you want to keep into a temp table or table variable. Then delete all the rows that do not exist in the temp table. The syntax mentioned by another user:
DELETE FROM Table WHERE ID NOT IN (SELECT TOP 10 ID FROM Table)
Has a potential problem. The "SELECT TOP 10" query will be executed for each row in the table, which could be a huge performance hit. You want to avoid making the same query over and over again.
This syntax should work, based what you listed as your original SQL statement:
create table #nuke(NukeID int)
insert into #nuke(NukeID) select top 1000 id from article
delete article where not exists (select 1 from #nuke where NukeID = id)
drop table #nuke
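Here is the same keep-list idea as a runnable sketch in Python's sqlite3. It is not T-SQL: LIMIT replaces TOP, a TEMP table replaces #nuke, and the table is called keep_ids here because it holds the rows to keep. The article data is invented, and an ORDER BY is added so that "top" is well defined.

```python
import sqlite3

keep_conn = sqlite3.connect(":memory:")
keep_conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
keep_conn.executemany(
    "INSERT INTO article (id, title) VALUES (?, ?)",
    [(i, "article %d" % i) for i in range(1, 26)],
)
rows_to_keep = 10

# Collect the ids to keep once, then delete everything that is not in the list.
keep_conn.execute("CREATE TEMP TABLE keep_ids (id INTEGER PRIMARY KEY)")
keep_conn.execute(
    "INSERT INTO keep_ids SELECT id FROM article ORDER BY id LIMIT ?", (rows_to_keep,)
)
keep_conn.execute(
    "DELETE FROM article "
    "WHERE NOT EXISTS (SELECT 1 FROM keep_ids WHERE keep_ids.id = article.id)"
)
keep_conn.execute("DROP TABLE keep_ids")
keep_conn.commit()

kept_ids = [r[0] for r in keep_conn.execute("SELECT id FROM article ORDER BY id")]
print(kept_ids)
```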
Future reference for those of us who don't use MS SQL.
In PostgreSQL use ORDER BY and LIMIT instead of TOP.
DELETE FROM table
WHERE id NOT IN (SELECT id FROM table ORDER BY id LIMIT n);
MySQL -- well...
Error -- This version of MySQL does not yet support 'LIMIT &
IN/ALL/ANY/SOME subquery'
Not yet I guess.
Here is how I did it. This method is faster and simpler:
Delete all but top n from database table in MS SQL using OFFSET command
WITH CTE AS
(
SELECT ID
FROM dbo.TableName
ORDER BY ID DESC
OFFSET 11 ROWS
)
DELETE CTE;
Replace ID with column by which you want to sort.
Replace number after OFFSET with number of rows which you want to keep.
Choose DESC or ASC - whatever suits your case.
I think using a virtual table would be much better than an IN-clause or temp table.
DELETE
Product
FROM
Product
LEFT OUTER JOIN
(
SELECT TOP 10
Product.id
FROM
Product
) TopProducts ON Product.id = TopProducts.id
WHERE
TopProducts.id IS NULL
This really is going to be language specific, but I would likely use something like the following for SQL server.
declare @n int
SET @n = (SELECT Count(*) FROM dTABLE);
DELETE TOP (@n - 10) FROM dTable
If you don't care about the exact number of rows, there is always
DELETE TOP (90) PERCENT FROM dTABLE;
I don't know about other flavors but MySQL DELETE allows LIMIT.
If you could order things so that the n rows you want to keep are at the bottom, then you could do a DELETE FROM table LIMIT tablecount-n.
Edit
Oooo. I think I like Cory Foy's answer better, assuming it works in your case. My way feels a little clunky by comparison.
I would solve it using the technique below. The example expect an article table with an id on each row.
Delete article where id not in (select top 1000 id from article)
Edit: Too slow to answer my own question ...
Refactored?
Delete a From Table a Inner Join (
Select Top ((Select Count(tableID) From Table) - 10) tableID
From Table Order By tableID Desc
) b On b.tableID = a.tableID
edit: tried them both in the query analyzer, current answer is fastest (damn order by...)
Better way would be to insert the rows you DO want into another table, drop the original table and then rename the new table so it has the same name as the old table
I've got a trick to avoid executing the TOP expression for every row. We can combine TOP with MAX to get the MaxId we want to keep. Then we just delete everything greater than MaxId.
-- Declare Variable to hold the highest id we want to keep.
DECLARE @MaxId as int = (
SELECT MAX(temp.ID)
FROM (SELECT TOP 10 ID FROM table ORDER BY ID ASC) temp
)
-- Delete anything greater than MaxId. If MaxId is null, there is nothing to delete.
IF @MaxId IS NOT NULL
DELETE FROM table WHERE ID > @MaxId
Note: It is important to use ORDER BY when declaring MaxId to ensure proper results are queried.
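The MaxId trick is easy to rehearse with Python's sqlite3. A sketch under assumptions: LIMIT stands in for TOP, a Python variable stands in for the T-SQL variable, and the items table with its non-contiguous ids is made up.

```python
import sqlite3

max_conn = sqlite3.connect(":memory:")
max_conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
max_conn.executemany(
    "INSERT INTO items (id) VALUES (?)", [(i,) for i in range(5, 105, 5)]
)
rows_wanted = 10

# Highest id among the first n rows in ascending id order; None if the table is empty.
max_id = max_conn.execute(
    "SELECT MAX(id) FROM (SELECT id FROM items ORDER BY id ASC LIMIT ?)",
    (rows_wanted,),
).fetchone()[0]

# One range delete instead of re-running the TOP query per row.
if max_id is not None:
    max_conn.execute("DELETE FROM items WHERE id > ?", (max_id,))
    max_conn.commit()

surviving_ids = [r[0] for r in max_conn.execute("SELECT id FROM items ORDER BY id")]
print(max_id, surviving_ids)
```

This only works when "top n" is defined by the same column that the final range delete compares against.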