I have a table (let's say ErrorLog)
CREATE TABLE [dbo].[ErrorLog]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Created] [datetime] NOT NULL,
[Message] [varchar](max) NOT NULL,
CONSTRAINT [PK_ErrorLog]
PRIMARY KEY CLUSTERED ([Id] ASC)
)
I want to remove all records that are older than 3 months.
I have a non-clustered index on the Created column (ascending).
I am not sure which of these is better (they seem to take the same time).
Query #1:
DELETE FROM ErrorLog
WHERE Created <= DATEADD(month, - 3, GETDATE())
Query #2:
DECLARE @id INT
SELECT @id = MAX(l.Id)
FROM ErrorLog l
WHERE l.Created <= DATEADD(month, -3, GETDATE())
DELETE FROM ErrorLog
WHERE Id <= @id
Once you know the maximum clustered key you want to delete, it is definitely faster to delete by that key. The question is whether it is worth selecting this key first using the date. The right decision depends on the size of the table and on what portion of the data you need to delete. The smaller the table and the fewer records there are to delete, the more efficient the first option (Query #1) should be. However, if the number of records to delete is large enough, the non-clustered index on the Created column will be ignored and SQL Server will start scanning the base table. In that case the second option (Query #2) might be the better choice. There are usually other factors to consider as well.
I recently solved a similar issue (deleting about 600 million old records, roughly 2/3 of the rows, from a 1.5 TB table) and decided on the second approach in the end. There were several reasons for it, but the main ones were as follows.
The table had to be available for new inserts while the old records were being deleted. So I could not delete the records in one monstrous DELETE statement; instead I had to use several smaller batches in order to avoid lock escalation to the table level. Smaller batches also kept the transaction log size within reasonable limits. Furthermore, I had a maintenance window of only about one hour each day, and it was not possible to delete all the required records within one day.
With the above in mind, the fastest solution for me was to select the maximum Id I needed to delete according to the date column and then delete from the beginning of the clustered index up to the selected Id, one batch after another (DELETE TOP(@BatchSize) FROM ErrorLog WITH(PAGLOCK) WHERE Id <= @myMaxId). I used the PAGLOCK hint in order to increase the batch size without escalating the lock to the table level. In the end I deleted several batches each day.
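The batching described above could be sketched roughly as follows. This is only an illustration, not the exact script I ran: the batch size of 10000 is a made-up value, the loop assumes that Id order follows Created order (as with an identity column that is never overridden), and the DECLARE-with-initializer syntax needs SQL Server 2008 or later.

```sql
DECLARE @BatchSize int = 10000;   -- hypothetical value, tune for your system
DECLARE @myMaxId int;

-- Find the highest Id that is old enough to be deleted.
SELECT @myMaxId = MAX(Id)
FROM dbo.ErrorLog
WHERE Created <= DATEADD(month, -3, GETDATE());

-- Delete in small batches so each transaction stays short.
WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.ErrorLog WITH (PAGLOCK)
    WHERE Id <= @myMaxId;

    -- A short batch means nothing (or nothing more) is left to delete.
    IF @@ROWCOUNT < @BatchSize
        BREAK;
END
```

If no row is old enough, @myMaxId stays NULL, the first DELETE removes nothing and the loop exits immediately. To fit a maintenance window, the loop could additionally stop once a time limit is reached.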
We have a staging table that looks like this. It stores all our data in 15-minute intervals:
CREATE TABLE [dbo].[15MinDataRawStaging](
[RawId] [int] IDENTITY(1,1) NOT NULL,
[CityId] [varchar](15) NOT NULL,
[Date] [int] NULL,
[Hour] [int] NULL,
[Minute] [int] NULL,
[CounterValue] [int] NOT NULL,
[CounterName] [varchar](40) NOT NULL
)
It currently stores 20 different Counters, which means that we insert about 400K rows every hour of every day to this table.
Right now, I'm deleting data from before 03/2016, but even with the first 8 days of March data, there's over 58M rows.
Once all the hourly data is stored in [15MinDataRawStaging], we start copying data from this table to other tables, which are then used for the reports.
So, for example, we have a KPI called Downtime, which is composed of the counters VeryLongCounterName1 and VeryLongCounterName2. Once the hourly data is stored in [15MinDataRawStaging], we run a stored procedure that inserts these counters into their own table, called [DownTime]. It looks something like this:
insert into [DownTime] (CityKey, Datekey, HourKey, MinuteKey, DownTime, DowntimeType)
select CityId, [date], [hour], [minute], CounterValue, CounterName
From [15MinDataRawStaging] p
where
[date] = @Date
and [Hour] = @Hour
and CounterName in ('VeryLongCounterName1', 'VeryLongCounterName2')
and CounterValue > 0
This runs automatically every hour (through a C# console app), and I've noticed that with this query I'm getting timeout issues. I just ran it, and it indeed takes about 35 seconds to complete.
So my questions are:
Is there a way to optimize the structure of the staging table so these types of INSERTs to other tables don't take that long?
Or is it possible to optimize the INSERT query? The reason I have the staging table is because I need to store the data, even if it's for the current month. No matter what's done, the staging table will have tons of rows.
Do you guys have any other suggestions?
Thanks.
It sounds like you want to partition 15MinDataRawStaging into daily or hourly chunks. The documentation explains how to do this (better than a Stack Overflow answer).
Partitioning basically stores the table in multiple different files (at least conceptually). Certain actions can be very efficient. For instance, dropping a partition is much, much faster than dropping the individual records. In addition, fetching data from a single partition should be fast -- and in your case, the most recent partition will be in memory, making everything faster.
Depending on how the data is used, indexes might also be appropriate. But for this volume of data and the way you are using it, partitions seem like the key idea.
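As a rough sketch, daily partitioning of the staging table might look like the following. It assumes that [Date] holds integers in yyyymmdd form (the question does not say), and the names pfStagingByDate, psStagingByDate and CIX_Staging_Date are made up for illustration. TRUNCATE TABLE ... WITH (PARTITIONS ...) requires SQL Server 2016 or later; on older versions a partition is usually emptied by switching it out to an empty table instead.

```sql
-- Boundary values are examples. RANGE RIGHT puts each boundary value
-- into the partition to its right, so each partition holds one day.
CREATE PARTITION FUNCTION pfStagingByDate (int)
AS RANGE RIGHT FOR VALUES (20160301, 20160302, 20160303);

-- For simplicity, all partitions live on the PRIMARY filegroup.
CREATE PARTITION SCHEME psStagingByDate
AS PARTITION pfStagingByDate ALL TO ([PRIMARY]);

-- Building the clustered index on the scheme moves the table onto it.
CREATE CLUSTERED INDEX CIX_Staging_Date
ON dbo.[15MinDataRawStaging] ([Date], [Hour], [Minute])
ON psStagingByDate ([Date]);

-- Remove one whole day without deleting row by row
-- (partition 2 holds 20160301 with the boundaries above).
TRUNCATE TABLE dbo.[15MinDataRawStaging] WITH (PARTITIONS (2));
```

New boundaries have to be added as time moves on (ALTER PARTITION FUNCTION ... SPLIT RANGE), otherwise all new rows end up in the last partition.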
Assuming that the staging table has only one purpose, namely feeding the INSERT into DownTime, you can trade off a small amount of storage and insert performance (into the staging table) to improve the final ETL performance by adding a clustered index matching the query used in extraction:
CREATE UNIQUE CLUSTERED INDEX MyIndex
ON [15MinDataRawStaging]([Date], [Hour], [Minute], RawId);
I've added RawId in order to make the index unique (otherwise a 4-byte uniquifier would have been added in any event).
You'll also want to do some trial and error, testing whether adding [CounterName] and/or [CounterValue] to the index (before RawId) improves the overall process throughput (i.e. both staging insertion and extraction into the final DownTime table).
We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on) :
My recent question/answer here addresses a similar issue using a time series database.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This results in about 15,000*60*60 = 54 million records per partition, so 1 billion rows fill roughly 19 partitions (1,000,000,000 / 54,000,000 ≈ 18.5). Once a partition is about 20 hours old, you can remove it.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
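One way the hourly rotation could look is a so-called sliding window. The sketch below assumes a partition function pfLogByHour (RANGE RIGHT on a datetime2 column) with a scheme psLogByHour on a table dbo.LogData; none of these names appear in the question, and the boundary values are only examples. It also assumes that the first partition is kept empty, so the oldest data sits in partition 2, and that TRUNCATE TABLE ... WITH (PARTITIONS ...) is available (SQL Server 2016 or later).

```sql
-- 1. Add a partition for the upcoming hour before any data for it arrives.
ALTER PARTITION SCHEME psLogByHour NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfLogByHour() SPLIT RANGE ('2024-01-01T20:00:00');

-- 2. Empty the oldest populated partition without deleting row by row.
TRUNCATE TABLE dbo.LogData WITH (PARTITIONS (2));

-- 3. Drop the lowest boundary so the number of partitions stays constant.
--    This merges the (empty) partitions 1 and 2.
ALTER PARTITION FUNCTION pfLogByHour() MERGE RANGE ('2024-01-01T00:00:00');
```

Run once per hour from a scheduled job, with the boundary values computed from the current time, this keeps roughly the newest 20 hours of data.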
I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous row instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't already I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
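A minimal sketch of the sequence, modulo and merge idea, assuming SQL Server 2012 or later. The objects dbo.RingSeq and dbo.RingLog and their columns are hypothetical and only serve the illustration.

```sql
-- Hypothetical objects for illustration: dbo.RingSeq, dbo.RingLog.
CREATE SEQUENCE dbo.RingSeq AS bigint START WITH 0 INCREMENT BY 1;

CREATE TABLE dbo.RingLog
(
    Slot    bigint       NOT NULL PRIMARY KEY,  -- position in the ring
    Created datetime2    NOT NULL,              -- used for chronological order
    Message varchar(max) NOT NULL
);

DECLARE @Capacity bigint = 1000000000;           -- number of rows to keep
DECLARE @Message varchar(max) = 'SomeMessage';

-- The sequence hands out a unique number per call, also across sessions;
-- the modulo maps it onto a slot between 0 and @Capacity - 1.
DECLARE @Next bigint = NEXT VALUE FOR dbo.RingSeq;
DECLARE @Slot bigint = @Next % @Capacity;

-- Overwrite the slot if it exists, insert it if it is missing.
MERGE dbo.RingLog AS t
USING (SELECT @Slot AS Slot, @Message AS Message) AS s
    ON t.Slot = s.Slot
WHEN MATCHED THEN
    UPDATE SET t.Message = s.Message, t.Created = SYSUTCDATETIME()
WHEN NOT MATCHED THEN
    INSERT (Slot, Created, Message)
    VALUES (s.Slot, SYSUTCDATETIME(), s.Message);
```

I have not measured whether this keeps up with 15K rows per second; it is only meant to show how the pieces fit together.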
Hope this helps.
Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.
Just an idea that is too complicated to write in a comment.
Create a few log tables, 3 as an example: Log1, Log2, Log3
CREATE TABLE Log1 (
Id int NOT NULL
CHECK (Id BETWEEN 0 AND 9)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
Id int NOT NULL
CHECK (Id BETWEEN 10 AND 19)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
Id int NOT NULL
CHECK (Id BETWEEN 20 AND 29)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
Then create a partitioned view
CREATE VIEW LogView AS (
SELECT * FROM Log1
UNION ALL
SELECT * FROM Log2
UNION ALL
SELECT * FROM Log3
)
If you are on SQL2012 you can use a sequence
CREATE SEQUENCE LogSequence AS int
START WITH 0
INCREMENT BY 1
MINVALUE 0
MAXVALUE 29
CYCLE
;
And then start to insert values
INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
,'SomeMessage'
Now you just have to truncate the log tables on some kind of schedule.
If you don't have SQL Server 2012, you need to create the sequence some other way.
I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)
We have the following table:
CREATE TABLE TagValueDouble(
TagIdentity [int] NOT NULL,
TimestampInUtcTicks [bigint] NOT NULL,
DoubleValue [float] NULL,
CONSTRAINT [PK_TagValueDouble] PRIMARY KEY CLUSTERED
(
TagIdentity ASC,
TimestampInUtcTicks ASC
)
)
This table gets filled with many measurements from different sources (e.g. wind speed). The TagIdentity represents the source and, combined with the timestamp, identifies a unique record.
This table gets large: say 2000 different sources with a 2 Hz update rate.
Not often, but sometimes, we remove a source and need to drop all of its records from the table. The problem is that, using NHibernate, the query times out.
My plan was to delete X rows at a time of records that are no longer part of the system. Something along the lines of:
DELETE FROM TagValueDouble
WHERE TagIdentity in
(SELECT TOP 10 TagIdentity, TimestampInUtcTicks
FROM TagValueDouble
Where TagIdentity not in (12, 14))
But this does not work.
Any ideas on how I can clean up the table without risking the timeout?
I'm looking for stability not performance. Deleting all values for a source is something that is done rarely.
PS. It has to work on SQL Server, Oracle, Postgres and SQL CE.
It seems you are missing the index on TagIdentity. Create the index and make sure it is filled correctly before deleting the records.
One idea was to do something like this:
DELETE FROM TagValueDouble
WHERE (TagIdentity + 4000) * 315360000000000 + TimestampInUtcTicks IN
(SELECT TOP 1000 (TagIdentity + 4000) * 315360000000000 + TimestampInUtcTicks
FROM TagValueDouble
WHERE TagIdentity NOT IN (5, 10, 12, 13))
The rationale being that I do not expect the system to be running in the year 4000, so I combine TagIdentity and TimestampInUtcTicks into one unique value. But this ended up checking all entries in the table.
What I ended up with was querying for one value that was not in the "allowed" TagIdentity set, and then deleting any value from that value's timestamp back to 24 hours earlier. If that query also timed out, I reduced the number of hours until the deletion succeeded. Repeating this until no more values were found cleans up the database.
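One step of that loop might look roughly like this in T-SQL. This is a sketch under assumptions: it restricts the delete to the TagIdentity of the row that was found, it treats a tick as 100 ns (so 24 hours are 864,000,000,000 ticks), and the variable names are made up. For the other databases the same two statements would be issued from the application with parameters instead of T-SQL variables.

```sql
DECLARE @Tag int, @Ts bigint;
DECLARE @WindowTicks bigint = 864000000000;  -- 24 hours; reduce after a timeout

-- Step 1: find any one row that belongs to a removed source.
SELECT TOP (1) @Tag = TagIdentity, @Ts = TimestampInUtcTicks
FROM TagValueDouble
WHERE TagIdentity NOT IN (5, 10, 12, 13);

-- Step 2: delete that source's rows in the window ending at the found timestamp.
-- The predicate follows the clustered primary key, so it can seek instead of scan.
DELETE FROM TagValueDouble
WHERE TagIdentity = @Tag
  AND TimestampInUtcTicks BETWEEN @Ts - @WindowTicks AND @Ts;
```

When step 1 finds no row, @Tag stays NULL and the delete removes nothing, which is the signal to stop.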
I have a rather big table named FTPLog with around 3 million records. I wanted to add a mechanism to delete old logs, but the DELETE command takes a long time. I found that the clustered index delete is what takes the time.
DECLARE @MaxFTPLogId as bigint
SELECT @MaxFTPLogId = Max(FTPLogId) FROM FTPLog WHERE LogTime <= DATEADD(day, -10 , GETDATE())
PRINT @MaxFTPLogId
DELETE FROM FTPLog WHERE FTPLogId <= @MaxFTPLogId
How can I improve the performance of the delete?
It might be slow because a large delete generates a big transaction log. Try to delete it in chunks, like:
WHILE 1 = 1
BEGIN
DELETE TOP (256) FROM FTPLog WHERE FTPLogId <= @MaxFTPLogId
IF @@ROWCOUNT = 0
BREAK
END
This generates smaller transactions. And it mitigates locking issues by creating breathing space for other processes.
You might also look into partitioned tables. These potentially allow you to purge old entries by dropping an entire partition.
Since it's a log table, there is no need to make it clustered.
It's unlikely that you will search it by Id.
Alter your PRIMARY KEY so that it's nonclustered. The table will then use heap storage, which is faster for DML:
ALTER TABLE FTPLog DROP CONSTRAINT Primary_Key_Name
ALTER TABLE FTPLog ADD CONSTRAINT Primary_Key_Name PRIMARY KEY NONCLUSTERED (FTPLogId)
and then just issue:
DECLARE @MaxFTPLogTime datetime
SELECT @MaxFTPLogTime = DATEADD(day, -10 , GETDATE())
PRINT @MaxFTPLogTime
DELETE FROM FTPLog WHERE LogTime <= @MaxFTPLogTime
Check the density of your table (use the DBCC SHOWCONTIG command).
For best performance, the Scan Density [Best Count:Actual Count] value should be close to 100% and the Logical Scan Fragmentation value should be close to 0%. If they are not, rebuild or defragment the indexes on that table to improve query performance.
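For illustration, the check and the rebuild could be run like this. Note that DBCC SHOWCONTIG is deprecated in newer SQL Server versions, where sys.dm_db_index_physical_stats is the documented replacement; rebuilding all indexes is just one possible response to high fragmentation.

```sql
-- Report scan density and logical fragmentation for the table.
DBCC SHOWCONTIG ('FTPLog');

-- Rebuild every index on the table if the numbers look bad.
ALTER INDEX ALL ON FTPLog REBUILD;
```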
I assume that this table is not only huge in terms of number of rows, but also heavily used for logging new entries while you try to clean it up.
Andomar's suggestion should help, but I would try to clean it up when there are no inserts going on.
Alternative: when you write logs, you probably do not care much about transaction isolation. Therefore I would change the transaction isolation level for the code/processes that write the log entries, so that you may avoid creating a huge tempdb (by the way, check whether tempdb grows a lot during this DELETE operation).
Also, I think that deletions from a clustered index should not really be slower than from a non-clustered one: you are still physically deleting rows. Rebuilding the index afterward may take time, though.
I have a table with 800,000 entries and no primary key. I am not allowed to add a primary key, and I can't use TOP 1 ... ORDER BY ... DESC because it takes hours to complete. So I tried this workaround:
DECLARE @ROWCOUNT int, @OFFSET int
SELECT @ROWCOUNT = (SELECT COUNT(field) FROM TABLE)
SET @OFFSET = @ROWCOUNT-1
select TOP 1 FROM TABLE WHERE=?????NO PRIMARY KEY??? BETWEEN @Offset AND @ROWCOUNT
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or the primary key is not ordered, you can try the code below. If you want to see more of the last records, change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you are saying 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option to use it to determine the order in which rows were created.
There is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if the primary key is on an identity column, you can always override identity values on insert by using
set identity_insert on.
It is a better idea to have a timestamp column, for example CreatedDateTime with a default constraint.
You would have an index on this field. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
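Adding such a column and index could look roughly like this. The constraint and index names are made up, and datetime2 with SYSUTCDATETIME() is just one reasonable choice. Keep in mind that existing rows all receive the time of the change as their value, so the ordering is only meaningful for rows created afterwards.

```sql
-- Add the column; the default fills it in automatically on every insert.
ALTER TABLE MyTable
ADD CreatedDateTime datetime2 NOT NULL
    CONSTRAINT DF_MyTable_CreatedDateTime DEFAULT (SYSUTCDATETIME());
GO  -- batch separator (SSMS/sqlcmd) so the new column is visible below

-- Index the column so TOP 1 ... ORDER BY CreatedDateTime DESC can seek.
CREATE NONCLUSTERED INDEX IX_MyTable_CreatedDateTime
ON MyTable (CreatedDateTime);
```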
If you don't have a timestamp column, you can't determine the 'last rows'.
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of work-arounds (indexes, views, indexed views, periodically indexed copies of the table, running the query once and reusing the stored result for a period T before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change (call it an improvement when you discuss it with your project manager) to the database.
You need to add an index; can you?
Even if you don't have a primary key, an index will speed up the query considerably.
You say you don't have a primary key, but from your question I assume you have some type of timestamp or something similar on the table. If you create an index on this column, you will be able to execute a query like:
SELECT *
FROM table_name
WHERE timestamp_column_name=(
SELECT max(timestamp_column_name)
FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
SELECT * FROM table ORDER BY rowid DESC LIMIT 1;
Hope it helps.