SQL Query to delete oldest rows over a certain row count? - sql

I have a table that contains log entries for a program I'm writing. I'm looking for ideas on an SQL query (I'm using SQL Server Express 2005) that will keep the newest X number of records, and delete the rest. I have a datetime column that is a timestamp for the log entry.
I figure something like the following would work, but I'm not sure of the performance with the IN clause for larger numbers of records. Performance isn't critical, but I might as well do the best I can the first time.
DELETE FROM MyTable WHERE PrimaryKey NOT IN
(SELECT TOP 10000 PrimaryKey FROM MyTable ORDER BY TimeStamp DESC)
I should mention that this query will run 3-4 times a day (as part of another process), so the number of records that will be deleted with each query will be small in comparison to the number of records that will be kept.

Try this:
DECLARE @X int
SELECT @X = COUNT(*) FROM MyTable
SET @X = @X - 10000
DELETE MyTable
WHERE PrimaryKey IN (SELECT TOP (@X) PrimaryKey
                     FROM MyTable
                     ORDER BY TimeStamp ASC)
This somewhat depends on whether you are deleting fewer rows than the 10,000 you keep; if so, this might run faster, as it identifies the rows to delete rather than the rows to keep.

Try this. It uses a CTE to get each row's ordinal number, and then deletes only X rows at a time; you can tune that variable to suit your server.
Adding the READPAST table hint should prevent locking.
DECLARE @numberToDelete INT;
DECLARE @ROWSTOKEEP INT;
SET @ROWSTOKEEP = 50000;
SET @numberToDelete = 1000;
WHILE 1=1
BEGIN
    WITH ROWSTODELETE AS
    (
        SELECT ROW_NUMBER() OVER (ORDER BY dtsTimeStamp DESC) rn,
               *
        FROM MyTable
    )
    DELETE TOP (@numberToDelete) FROM ROWSTODELETE WITH (READPAST)
    WHERE rn > @ROWSTOKEEP;
    IF @@ROWCOUNT = 0
        BREAK;
END;

The query you have is about as efficient as it gets, and is readable.
NOT IN and NOT EXISTS are more efficient than LEFT JOIN/IS NULL, but only because both columns can never be null. You can read this link for a more in-depth comparison.
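For reference, the LEFT JOIN/IS NULL form of the same delete would look like the sketch below, using the question's table and column names and assuming PrimaryKey is not nullable:
-- LEFT JOIN / IS NULL equivalent of the NOT IN query above:
-- join each row to the set of rows to keep, delete the non-matches.
DELETE t
FROM MyTable t
LEFT JOIN (SELECT TOP 10000 PrimaryKey
           FROM MyTable
           ORDER BY TimeStamp DESC) k
       ON t.PrimaryKey = k.PrimaryKey
WHERE k.PrimaryKey IS NULL;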

This depends on your scenario (whether it's feasible for you) and how many rows you have, but there is a potentially far more optimal approach.
1. Create a new copy of the log table under a new name
2. Insert the most recent 10,000 records from the original table into the new table
3. Drop (or rename) the original table
4. Rename the new table to the proper name
This obviously requires more thought than just deleting rows (e.g. if the table has an IDENTITY column, that needs to be set up on the new table, etc.). But if you have a large table, it is more efficient to copy 10,000 rows to a new table and drop the original than to delete millions of rows just to leave 10,000 behind.
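A minimal T-SQL sketch of this approach, using the question's table and column names (the _new/_old suffixes are illustrative; note SELECT INTO does not copy indexes, constraints or permissions, so those would need to be re-created by hand):
-- copy the 10,000 newest rows into a fresh table
SELECT TOP 10000 *
INTO MyTable_new
FROM MyTable
ORDER BY TimeStamp DESC;

-- swap the tables, then discard the old data
BEGIN TRANSACTION;
EXEC sp_rename 'MyTable', 'MyTable_old';
EXEC sp_rename 'MyTable_new', 'MyTable';
COMMIT;
DROP TABLE MyTable_old;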

DELETE FROM MyTable
WHERE TimeStamp < (SELECT MIN(TimeStamp)
                   FROM (SELECT TOP 10000 TimeStamp
                         FROM MyTable
                         ORDER BY TimeStamp DESC) AS newest)
or
DELETE FROM MyTable
WHERE TimeStamp < (SELECT MIN(TimeStamp)
                   FROM MyTable
                   WHERE PrimaryKey IN (SELECT TOP 10000 PrimaryKey
                                        FROM MyTable
                                        ORDER BY TimeStamp DESC))
Not sure whether these are an improvement efficiency-wise, though.

Related

Getting a specific number of rows from Database using RowNumber; Inconsistent results

Here is my SQL query:
select * from TABLE T where ROWNUM<=100
If I execute this and then re-execute it, I don't get the same result. Why?
Also, on a Sybase system, if I execute
set rowcount 100
select * from TABLE
I get the same result even on re-execution.
Can someone explain why, and provide a possible solution for ROWNUM?
Thanks
If you don't use ORDER BY in your query you get the results in natural order.
Natural order is whatever is fastest for the database at the moment.
A possible solution is to ORDER BY your primary key, if it's an INT
SELECT TOP 100 START AT 0 * FROM TABLE
ORDER BY TABLE.ID;
If your primary key is not a sequentially incrementing integer and you don't have another column to order by (such as a timestamp), you may need to create an extra column, SORT_ORDER INT, and increment it automatically on insert using either an autoincrement column or a sequence plus an insert trigger, depending on the database.
Make sure to create an index on that column to speed up the query.
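A sketch of the sequence-plus-trigger variant, in Oracle-style syntax (all names are illustrative; use an autoincrement/identity column instead where your database supports one):
-- add the ordering column
ALTER TABLE my_table ADD (sort_order INT);

-- populate it automatically on every insert
CREATE SEQUENCE my_table_seq;

CREATE OR REPLACE TRIGGER my_table_sort_order_trg
BEFORE INSERT ON my_table
FOR EACH ROW
BEGIN
    SELECT my_table_seq.NEXTVAL INTO :NEW.sort_order FROM dual;
END;
/

-- index it so ORDER BY sort_order stays fast
CREATE INDEX my_table_sort_order_ix ON my_table (sort_order);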
You need to specify an ORDER BY. Queries without an explicit ORDER BY clause make no guarantee about the order in which the rows are returned, and from this result set you take the first 100 rows. As the order of the rows can be different every time, so can your first 100 rows.
You need to apply the ORDER BY first (in a subquery), and only then filter by ROWNUM. You will get inconsistent results if you don't follow this order.
select * from
(
    select * from TABLE T order by rowid
) where ROWNUM <= 100

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key, and I can't sort by TOP 1 ....ORDER BY DESC because it takes hours to complete this task. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT - 1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN @OFFSET AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or your primary key is not orderly... you can try the code below... if you want to see more than just the last record, you can change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you say 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option for determining row creation order.
There is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if the primary key is on an identity column, you can still always override identity values on insert by using
set identity_insert on.
It is a better idea to have a timestamp column, for example CreatedDateTime with a default constraint. You would have an index on this field. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
If you don't have a timestamp column, you can't determine the 'last rows'.
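A T-SQL sketch of that setup (column and constraint names are illustrative; the NOT NULL plus DEFAULT combination also backfills the existing rows with the default value):
-- add the timestamp column with a default...
ALTER TABLE MyTable
    ADD CreatedDateTime datetime NOT NULL
        CONSTRAINT DF_MyTable_Created DEFAULT (GETDATE());

-- ...and index it so ORDER BY CreatedDateTime DESC is cheap
CREATE INDEX IX_MyTable_Created ON MyTable (CreatedDateTime);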
If you need to select the row holding the min or max value of a column in a table of 800,000 rows, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside: on the face of it, reading all the rows of an 800,000-row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of workarounds (indexes, views, indexed views, periodically indexed copies of the table, run once and store the result for some period of time before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change to the database (call it an improvement when you discuss it with your project manager).
Can you add an index? Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some kind of timestamp or similar column on the table. If you create an index on this column, you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.

How do I delete a fixed number of rows with sorting in PostgreSQL?

I'm trying to port some old MySQL queries to PostgreSQL, but I'm having trouble with this one:
DELETE FROM logtable ORDER BY timestamp LIMIT 10;
PostgreSQL doesn't allow ordering or limits in its delete syntax, and the table doesn't have a primary key, so I can't use a subquery. Additionally, I want to preserve the behavior where the query deletes exactly the given number of records; for example, if the table contains 30 rows but they all have the same timestamp, I still want to delete 10, although it doesn't matter which 10.
So; how do I delete a fixed number of rows with sorting in PostgreSQL?
Edit: No primary key means there's no log_id column or similar. Ah, the joys of legacy systems!
You could try using the ctid:
DELETE FROM logtable
WHERE ctid IN (
SELECT ctid
FROM logtable
ORDER BY timestamp
LIMIT 10
)
The ctid is:
The physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row's ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier.
There's also oid but that only exists if you specifically ask for it when you create the table.
The Postgres docs recommend using an array instead of IN with a subquery. This should run much faster:
DELETE FROM logtable
WHERE id = any (array(SELECT id FROM logtable ORDER BY timestamp LIMIT 10));
This and some other tricks can be found here
delete from logtable where log_id in (
select log_id from logtable order by timestamp limit 10);
If you don't have a primary key, you can use the WHERE ... IN syntax with a composite key.
delete from table1 where (schema,id,lac,cid) in (select schema,id,lac,cid from table1 where lac = 0 limit 1000);
This worked for me.
Assuming you want to delete ANY 10 records (without the ordering) you could do this:
DELETE FROM logtable as t1 WHERE t1.ctid < (select t2.ctid from logtable as t2 where (Select count(*) from logtable t3 where t3.ctid < t2.ctid ) = 10 LIMIT 1);
For my use case, deleting 10M records, this turned out to be faster.
You could write a procedure that loops over the delete for individual rows; the procedure could take a parameter specifying the number of items you want to delete. But that's a bit of overkill compared to MySQL.
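A PL/pgSQL sketch of such a procedure (the function name is illustrative; it reuses the ctid trick from above to delete one ordered row per iteration):
CREATE OR REPLACE FUNCTION delete_oldest(n integer) RETURNS void AS $$
BEGIN
    FOR i IN 1..n LOOP
        -- delete the single oldest remaining row
        DELETE FROM logtable
        WHERE ctid = (SELECT ctid FROM logtable ORDER BY timestamp LIMIT 1);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- usage:
SELECT delete_oldest(10);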

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder whether this would be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.
this test/update will be done quite frequently and by many clients
this could lead to unexpected races over cache generation
I would suggest the following (a rough sketch follows the list):
1. upon each new addition to your table, add the new id into a queue table
2. use something like crontab to trigger the cache generation by checking the queue table
3. once the new cache is generated, delete the id from the queue table
As you stress that the majority of the attempts will bring no data, the above will only trigger work when there is a new addition.
The queue table concept can even be expanded to cover updates and deletes.
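A rough sketch of the queue-table idea in MySQL syntax (table, column and trigger names are all illustrative):
-- queue of ids whose rows are not yet reflected in the cache
CREATE TABLE cache_queue (
    id INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

-- feed the queue on every insert
DELIMITER //
CREATE TRIGGER main_table_queue AFTER INSERT ON main_table
FOR EACH ROW
BEGIN
    INSERT INTO cache_queue (id) VALUES (NEW.id);
END//
DELIMITER ;

-- the crontab job: if the queue is non-empty, regenerate the cache,
-- then delete the ids it has processed from cache_queue.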
I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because select count(*) may require a table scan. Just for clarification: do you never delete rows from this table?
SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index on SN, so looking at very few index nodes may suffice. Don't count items if you don't need the exact total; it's expensive.
You don't need to use SELECT COUNT(*).
There are two solutions (a sketch of the second follows):
1. Use a side table with one field containing the last row count of your table, and create an AFTER INSERT trigger on your table that increments that field.
2. Use a side table with one field containing the last cached SN of your table, and create an AFTER INSERT trigger on your table that updates that field.
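A sketch of the second option in MySQL syntax (names are illustrative): a one-row side table holding the highest SN, kept current by a trigger, so clients can poll it with a single-row read.
CREATE TABLE last_sn (
    sn INT UNSIGNED NOT NULL
);
INSERT INTO last_sn VALUES (0);

DELIMITER //
CREATE TRIGGER main_table_last_sn AFTER INSERT ON main_table
FOR EACH ROW
BEGIN
    UPDATE last_sn SET sn = NEW.sn;
END//
DELIMITER ;

-- clients compare their cached SN against this cheap single-row read:
SELECT sn FROM last_sn;
Note that the single-row update serializes concurrent inserts to some degree, which may matter under heavy write load.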
not much to this really
drop table if exists foo;
create table foo
(
foo_id int unsigned not null auto_increment primary key
)
engine=innodb;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;

Deleting duplicate rows in a database without using rowid or creating a temp table

Many years ago, I was asked during a phone interview to delete duplicate rows in a database. After giving several solutions that do work, I was eventually told the restrictions are:
Assume table has one VARCHAR column
Cannot use rowid
Cannot use temporary tables
The interviewer refused to give me the answer. I've been stumped ever since.
After asking several colleagues over the years, I'm convinced there is no solution. Am I wrong?!
And if you did have an answer, would a new restriction suddenly present itself? Since you mention ROWID, I assume you were using Oracle. The solutions are for SQL Server.
Inspired by SQLServerCentral.com http://www.sqlservercentral.com/scripts/T-SQL/62866/
while (1=1)
begin
    delete top (1)
    from MyTable
    where VarcharColumn in
        (select VarcharColumn
         from MyTable
         group by VarcharColumn
         having count(*) > 1)
    if @@rowcount = 0
        break
end
Deletes one row at a time. When the second-to-last row of a set of duplicates disappears, the remaining row won't be in the subselect on the next pass through the loop. (Big yuck!)
Also, see http://www.sqlservercentral.com/articles/T-SQL/63578/ for inspiration. There, RBarry Young suggests a way that might be modified to store the deduplicated data in the same table, delete all the original rows, and then convert the stored deduplicated data back into the right format. He had three columns, so it's not exactly analogous to what you are doing.
And then it might be doable with a cursor. I'm not sure and don't have time to look it up, but: create a cursor to select everything out of the table, in order, and a variable to track what the last row looked like. If the current row is the same, delete it; otherwise set the variable to the current row.
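One catch: DELETE ... WHERE CURRENT OF needs an updatable cursor, and an ordered select over a table with no unique index will typically degrade to a read-only (static) cursor in SQL Server. A variant that instead cursors over the duplicated values and uses DELETE TOP sidesteps that; here is a sketch with illustrative names, assuming SQL 2005 or later:
DECLARE @val varchar(100), @extra int;

DECLARE dups CURSOR LOCAL FAST_FORWARD FOR
    SELECT VarcharColumn, COUNT(*) - 1
    FROM MyTable
    GROUP BY VarcharColumn
    HAVING COUNT(*) > 1;

OPEN dups;
FETCH NEXT FROM dups INTO @val, @extra;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- remove all but one copy of each duplicated value
    DELETE TOP (@extra) FROM MyTable WHERE VarcharColumn = @val;
    FETCH NEXT FROM dups INTO @val, @extra;
END;
CLOSE dups;
DEALLOCATE dups;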
This is a completely jacked-up way to do it, but given the asinine requirements, here is a workable solution, assuming SQL 2005 or later:
WITH Dups AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY MyField ORDER BY MyField) AS rn
    FROM MyTable
)
DELETE FROM Dups WHERE rn > 1;
I would put a unique number of fixed size in the VARCHAR column for the duplicated rows, then parse out the number and delete all but the minimum row. Maybe that's what his VARCHAR constraint is for. But that stinks because it assumes that your unique number will fit. Lame question. You didn't want to work there anyway. ;-)
Assume you are implementing the DELETE statement for a SQL engine. How would you delete two rows from a table that are exactly identical? You need something to distinguish one from the other!
You actually cannot delete entirely duplicate rows (ALL columns being equal) under the following constraints (as provided to you):
no use of ROWID or ROWNUM
no temporary table
no procedural code
It can, however, be done if any one of the conditions is relaxed. Here are solutions, each relaxing at least one of the three conditions.
Assume table is defined as below
create table t1 (
    c1 varchar2(100),
    c2 number(5),
    c3 number(2)
);
Identifying the duplicate rows:
select c1, c2, c3
from t1
group by c1, c2, c3
having count(*) > 1
Duplicate rows can also be identified using this:
select c1, c2, c3, row_number() over (partition by c1, c2, c3 order by c1, c2, c3) rn from t1
NOTE: The row_number() analytic function cannot be used directly in a DELETE statement as suggested by JohnFx, at least in Oracle 10g.
Solution using ROWID
delete from t1
where rowid > (select min(t1_inner.rowid)
               from t1 t1_inner
               where t1_inner.c1 = t1.c1
                 and t1_inner.c2 = t1.c2
                 and t1_inner.c3 = t1.c3);
Solution using a temp table
create table t1_dups as
    -- the duplicate-row identification query from above
    select c1, c2, c3
    from t1
    group by c1, c2, c3
    having count(*) > 1;

delete from t1
where (t1.c1, t1.c2, t1.c3) in (select c1, c2, c3 from t1_dups);

insert into t1
    (select c1, c2, c3 from t1_dups);
Solution using procedural code
This will use an approach similar to the temp table case (a fully procedural sketch follows below). The group by collapses the duplicates:
create table temp as
select c1, c2, c3
from t1
group by c1, c2, c3;
Now drop the base table.
Rename the temp table to the base table's name.
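And for completeness, a PL/SQL sketch of the fully procedural route (assuming Oracle 10g or later; it uses an in-memory collection in place of a temp table, so neither ROWID nor a real temporary table is needed):
DECLARE
    TYPE t1_tab IS TABLE OF t1%ROWTYPE;
    v_rows t1_tab;
BEGIN
    -- collect one copy of each distinct row into memory
    SELECT c1, c2, c3 BULK COLLECT INTO v_rows
    FROM t1
    GROUP BY c1, c2, c3;

    -- empty the table, then write the deduplicated rows back
    DELETE FROM t1;

    FORALL i IN 1 .. v_rows.COUNT
        INSERT INTO t1 VALUES v_rows(i);
END;
/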
Mine was resolved using this query in PL/SQL:
delete from <table> where <column> in (select <column> from <table> group by <column> having count(*) > 1)