Implement a ring buffer - sql

We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on) :
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. One billion rows is then roughly 18.5 hours of data, so once about 20 partitions exist, you can remove the oldest partition every hour.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
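The sizing arithmetic behind the partitioning idea can be sketched as follows. This is only an illustration under the figures from the question (15K rows per second, 1bn rows to keep); `partitions_to_drop` is a hypothetical helper, not part of any database API.

```python
import math

ROWS_PER_SECOND = 15_000
TARGET_ROWS = 1_000_000_000

# One partition holds one hour of data.
rows_per_partition = ROWS_PER_SECOND * 60 * 60
# Number of hourly partitions needed to cover at least TARGET_ROWS.
partitions_to_keep = math.ceil(TARGET_ROWS / rows_per_partition)


def partitions_to_drop(partition_hours, keep):
    """Return the oldest partitions beyond the newest `keep` (hypothetical helper)."""
    ordered = sorted(partition_hours)
    return ordered[:-keep] if len(ordered) > keep else []


print(rows_per_partition, partitions_to_keep)
print(partitions_to_drop(range(25), partitions_to_keep))
```

Dropping a whole partition replaces millions of single-row deletes with one metadata operation, which is the point of the suggestion.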

I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous row instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't already I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
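A minimal sketch of the counter-plus-modulo idea, using Python's `sqlite3` module so it can be run anywhere. The table and column names are made up for illustration, `INSERT OR REPLACE` stands in for a MERGE statement, and a plain loop counter stands in for a database sequence. The `seq` column also doubles as the ordering column here, where the answer suggests a timestamp.

```python
import sqlite3

CAPACITY = 4  # stands in for the 1bn rows in the question


def write(conn, seq, message, capacity=CAPACITY):
    # Insert the slot if it is missing, overwrite it otherwise.
    conn.execute(
        "INSERT OR REPLACE INTO log (slot, seq, message) VALUES (?, ?, ?)",
        (seq % capacity, seq, message),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE log ("
    " slot INTEGER PRIMARY KEY,"  # position in the ring, 0 .. CAPACITY-1
    " seq INTEGER NOT NULL,"      # ever-increasing counter, gives the order
    " message TEXT NOT NULL)"
)

for seq in range(10):
    write(conn, seq, f"event {seq}")

rows = conn.execute("SELECT seq, message FROM log ORDER BY seq").fetchall()
print(rows)
```

After ten writes into four slots, only the four newest entries remain in the table.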
Hope this helps.

Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.

Just an idea that is too complicated to write in a comment.
Create a few log tables, 3 as an example, Log1, Log2, Log3
,Message varchar(10) NOT NULL
,Message varchar(10) NOT NULL
,Message varchar(10) NOT NULL
Then create a partitioned view
If you are on SQL2012 you can use a sequence
And then start to insert values
INSERT INTO LogView (Id, Message)
Now you just have to truncate the log tables on some kind of schedule.
If you don't have SQL Server 2012, you need to create the sequence some other way.
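The rotation idea can be sketched roughly like this with Python's `sqlite3` module. It is not the answer's code: SQLite has no partitioned views or TRUNCATE, so the routing is done in a hypothetical `insert` helper and `DELETE` stands in for the truncate. `ROWS_PER_TABLE` is an assumed id range per table.

```python
import sqlite3

TABLES = ["Log1", "Log2", "Log3"]
ROWS_PER_TABLE = 10  # assumed id range covered by each table


def insert(conn, seq, message, tables=TABLES, rows_per_table=ROWS_PER_TABLE):
    target = tables[(seq // rows_per_table) % len(tables)]
    if seq % rows_per_table == 0:
        # Wrapping into this table again: clear its old contents first.
        conn.execute(f"DELETE FROM {target}")
    conn.execute(f"INSERT INTO {target} (Id, Message) VALUES (?, ?)", (seq, message))


conn = sqlite3.connect(":memory:")
for name in TABLES:
    conn.execute(f"CREATE TABLE {name} (Id INTEGER PRIMARY KEY, Message TEXT NOT NULL)")
# Read-only view over all log tables.
conn.execute(
    "CREATE VIEW LogView AS "
    + " UNION ALL ".join(f"SELECT Id, Message FROM {name}" for name in TABLES)
)

for seq in range(45):
    insert(conn, seq, f"m{seq}")

ids = [row[0] for row in conn.execute("SELECT Id FROM LogView ORDER BY Id")]
print(ids[0], ids[-1], len(ids))
```

The view always holds between two and three tables' worth of the newest rows, and old rows disappear one whole table at a time.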

I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)


Remove records by clustered or non-clustered index

I have a table (let's say ErrorLog)
CREATE TABLE [dbo].[ErrorLog] (
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Created] [datetime] NOT NULL,
    [Message] [varchar](max) NOT NULL
)
I want to remove all records that are older than 3 months.
I have a non-clustered index on the Created column (ascending).
I am not sure which one of these is better (they seem to take the same time).
Query #1:
DELETE FROM ErrorLog
WHERE Created <= DATEADD(month, -3, GETDATE())
Query #2:
DECLARE @id int
SELECT @id = MAX(l.Id)
FROM ErrorLog l
WHERE l.Created <= DATEADD(month, -3, GETDATE())
DELETE FROM ErrorLog
WHERE Id <= @id
Once you know the maximum clustered key you want to delete, it is definitely faster to use this key. The question is whether it is worth selecting this key first using the date. The right decision depends on the size of the table and on what portion of the data you need to delete. The smaller the table and the fewer records there are to delete, the more efficient the first option (Query #1) should be. However, if the number of records to delete is large enough, the non-clustered index on the Created column will be ignored and SQL Server will start scanning the base table. In such a case the second option (Query #2) might be better. There are usually other factors to consider as well.
I solved a similar issue recently (deleting about 600 million old records, roughly 2/3 of a 1.5TB table) and decided on the second approach in the end. There were several reasons for it, but the main ones were as follows.
The table had to be available for new inserts while the old records were being deleted. So I could not delete the records in one monstrous delete statement; I had to use several smaller batches in order to avoid lock escalation to the table level. Smaller batches also kept the transaction log size within reasonable limits. Furthermore, I had only about a one-hour maintenance window each day, and it was not possible to delete all the required records within one day.
With the above in mind, the fastest solution for me was to select the maximum Id I needed to delete according to the Created column and then just start deleting from the beginning of the clustered index up to the selected Id, one batch after the other (DELETE TOP(@BatchSize) FROM ErrorLog WITH(PAGLOCK) WHERE ID <= @myMaxId). I used the PAGLOCK hint in order to increase the batch size without escalating the lock to the table level. In the end I deleted several batches each day.
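The batching loop described above might look roughly like this. It is a sketch with Python's `sqlite3` module rather than SQL Server, so there is no PAGLOCK hint and `DELETE TOP(n)` is replaced by a subquery with `LIMIT`. The sample data, the batch size and the 90-day stand-in for "3 months" are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ErrorLog (Id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " Created TEXT NOT NULL, Message TEXT NOT NULL)"
)
conn.execute("CREATE INDEX IX_ErrorLog_Created ON ErrorLog (Created)")

now = datetime(2024, 1, 1)
# 200 sample rows, one per day, oldest first so Id and Created rise together.
conn.executemany(
    "INSERT INTO ErrorLog (Created, Message) VALUES (?, ?)",
    [((now - timedelta(days=199 - i)).isoformat(), f"error {i}") for i in range(200)],
)

# Step 1: use the date index once to find the highest Id to delete.
cutoff = (now - timedelta(days=90)).isoformat()
max_id = conn.execute(
    "SELECT MAX(Id) FROM ErrorLog WHERE Created <= ?", (cutoff,)
).fetchone()[0]

# Step 2: delete from the start of the primary key in small batches.
BATCH_SIZE = 25
deleted = 0
while True:
    cur = conn.execute(
        "DELETE FROM ErrorLog WHERE Id IN ("
        " SELECT Id FROM ErrorLog WHERE Id <= ? ORDER BY Id LIMIT ?)",
        (max_id, BATCH_SIZE),
    )
    conn.commit()  # one short transaction per batch
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount

remaining = conn.execute("SELECT COUNT(*) FROM ErrorLog").fetchone()[0]
print(max_id, deleted, remaining)
```

Because each batch commits on its own, the loop can be stopped at the end of a maintenance window and resumed the next day with the same `max_id`.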

VACUUM on Redshift (AWS) after DELETE and INSERT

I have a table as below (simplified example, we have over 60 fields):
CREATE TABLE "fact_table" (
"pk_a" bigint NOT NULL ENCODE lzo,
"pk_b" bigint NOT NULL ENCODE delta,
"d_1" bigint NOT NULL ENCODE runlength,
"d_2" bigint NOT NULL ENCODE lzo,
"d_3" character varying(255) NOT NULL ENCODE lzo,
"f_1" bigint NOT NULL ENCODE bytedict,
"f_2" bigint NULL ENCODE delta32k
)
DISTKEY ( d_1 )
SORTKEY ( pk_a, pk_b );
The table is distributed by a high-cardinality dimension.
The table is sorted by a pair of fields that increment in time order.
The table contains over 2 billion rows, and uses ~350GB of disk space, both "per node".
Our hourly house-keeping involves updating some recent records (within the last 0.1% of the table, based on the sort order) and inserting another 100k rows.
Whatever mechanism we choose, VACUUMing the table becomes overly burdensome:
- The sort step takes seconds
- The merge step takes over 6 hours
We can see from SELECT * FROM svv_vacuum_progress; that all 2 billion rows are being merged, even though the first 99.9% are completely unaffected.
Our understanding was that the merge should only affect:
1. Deleted records
2. Inserted records
3. And all the records from (1) or (2) up to the end of the table
We have tried DELETE and INSERT rather than UPDATE and that DML step is now significantly quicker. But the VACUUM still merges all 2 billion rows.
DELETE FROM fact_table WHERE pk_a > X;
-- 42 seconds
INSERT INTO fact_table SELECT <blah> FROM <query> WHERE pk_a > X ORDER BY pk_a, pk_b;
-- 90 seconds
VACUUM fact_table;
-- 23645 seconds
In fact, the VACUUM merges all 2 billion records even if we just trim the last 746 rows off the end of the table.
The Question
Does anyone have any advice on how to avoid this immense VACUUM overhead, and only MERGE on the last 0.1% of the table?
How often are you VACUUMing the table? How does the long duration affect you? Our load processing continues to run during VACUUM and we've never experienced any performance problems with doing that. Basically it doesn't matter how long it takes because we just keep running BAU.
I've also found that we don't need to VACUUM our big tables very often. Once a week is more than enough. Your use case may be very performance sensitive but we find the query times to be within normal variations until the table is more than, say, 90% unsorted.
If you find that there's a meaningful performance difference, have you considered using recent and history tables (inside a UNION view if needed)? That way you can VACUUM the small "recent" table quickly.
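The recent/history split could be laid out along these lines. This sketch uses Python's `sqlite3` module only to show the shape of the tables and the view; it says nothing about Redshift's VACUUM behaviour. The names `fact_history`, `fact_recent` and `fact_all`, the reduced column list and the value of `X` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for name in ("fact_history", "fact_recent"):
    conn.execute(f"CREATE TABLE {name} (pk_a INTEGER, pk_b INTEGER, f_1 INTEGER)")
# Queries go through the view and see both tables as one.
conn.execute(
    "CREATE VIEW fact_all AS "
    "SELECT * FROM fact_history UNION ALL SELECT * FROM fact_recent"
)
conn.executemany(
    "INSERT INTO fact_history VALUES (?, ?, ?)", [(i, 0, i) for i in range(1000)]
)
conn.executemany(
    "INSERT INTO fact_recent VALUES (?, ?, ?)", [(i, 0, i) for i in range(1000, 1010)]
)

# Hourly housekeeping touches only the small table.
X = 1004
conn.execute("DELETE FROM fact_recent WHERE pk_a > ?", (X,))
conn.executemany(
    "INSERT INTO fact_recent VALUES (?, ?, ?)",
    [(i, 0, i * 2) for i in range(1005, 1015)],
)

total = conn.execute("SELECT COUNT(*) FROM fact_all").fetchone()[0]
recent = conn.execute("SELECT COUNT(*) FROM fact_recent").fetchone()[0]
print(total, recent)
```

With this layout only `fact_recent` would need frequent maintenance, and rows can be moved into `fact_history` on a slower schedule.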
Couldn't fix it in comments section, so posting it as answer
I think that, if the sort keys are the same across the time series tables, you already have a UNION ALL view over them as the time series view, and performance is still bad, then you may want a time series view structure with explicit filters, such as:
create or replace view schemaname.table_name as
select * from table_20140901 where sort_key_date = '2014-09-01' union all
select * from table_20140902 where sort_key_date = '2014-09-02' union all .......
select * from table_20140925 where sort_key_date = '2014-09-25';
Also make sure to have stats collected on all these tables on sort keys after every load and try running queries against it. It should be able to push down any filter values into the view if you are using any. End of day after load, just run a VACUUM SORT ONLY or full vacuum on the current day's table which should be much faster.
Let me know if you are still facing any issues after the above test.

SQL get last rows in table WITHOUT primary ID

I have a table with 800,000 entries without a primary key. I am not allowed to add a primary key and I can't sort by TOP 1 ....ORDER BY DESC because it takes hours to complete this task. So I tried this workaround:
Of course this doesn't work.
Is there any way to use this code, or better code, to retrieve the last row in the table?
If your table has no primary key, or your primary key is not ordered, you can try the code below. If you want to see more of the last records, you can change the number in the code.
Select top (select COUNT(*) from table) * From table
EXCEPT
Select top ((select COUNT(*) from table)-(1)) * From table
I assume that when you are saying 'last rows', you mean 'last created rows'.
Even if you had a primary key, it would still not be the best option to use it to determine the order in which rows were created.
There is no guarantee that the row with the bigger primary key value was created after the row with a smaller primary key value.
Even if primary key is on identity column, you can still always override identity values on insert by using
set identity_insert on.
It is a better idea to have a timestamp column, for example CreatedDateTime with a default constraint.
You would have an index on this field. Then your query would be simple, efficient and correct:
select top 1 *
from MyTable
order by CreatedDateTime desc
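As a small runnable illustration of the same query, here is a sketch using Python's `sqlite3` module, where `LIMIT 1` plays the role of `top 1`. The `Payload` column, the index name and the sample rows are made up. The rows are deliberately inserted out of chronological order to show that the result depends on the timestamp and not on insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Payload TEXT NOT NULL,"
    " CreatedDateTime TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)
conn.execute("CREATE INDEX IX_MyTable_Created ON MyTable (CreatedDateTime)")

# Explicit timestamps so the example is deterministic.
conn.executemany(
    "INSERT INTO MyTable (Payload, CreatedDateTime) VALUES (?, ?)",
    [
        ("first", "2024-01-01 10:00:00"),
        ("third", "2024-01-03 10:00:00"),
        ("second", "2024-01-02 10:00:00"),
    ],
)

last = conn.execute(
    "SELECT Payload, CreatedDateTime FROM MyTable"
    " ORDER BY CreatedDateTime DESC LIMIT 1"
).fetchone()
print(last)
```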
If you don't have a timestamp column, you can't determine the 'last rows'.
If you need to select 1 column from a table of 800,000 rows where that column is the min or max possible value, and that column is not indexed, then the unassailable fact is that SQL will have to read every row in the table in order to identify that min or max value.
(An aside, on the face of it reading all the rows of an 800,000 row table shouldn't take all that long. How wide is the column? How often is the query run? Are there concurrency, locking, blocking, or deadlocking issues? These may be pain points that could be addressed. End of aside.)
There are any number of workarounds (indexes, views, indexed views, periodically indexed copies of the table, running the query once and reusing the stored result for a period T before refreshing, etc.), but virtually all of them require making permanent modifications to the database. It sounds like you are not permitted to do this, and I don't think there's much you can do here without some such permanent change to the database (call it an improvement when you discuss it with your project manager).
You need to add an Index, can you?
Even if you don't have a primary key an Index will speed up considerably the query.
You say you don't have a primary key, but for your question I assume you have some type of timestamp or something similar on the table, if you create an Index using this column you will be able to execute a query like :
SELECT *
FROM table_name
WHERE timestamp_column_name = (
    SELECT max(timestamp_column_name)
    FROM table_name
)
If you're not allowed to edit this table, have you considered creating a view, or replicating the data in the table and moving it into one that has a primary key?
Sounds hacky, but then, your 800k row table doesn't have a primary key, so hacky seems to be the order of the day. :)
I believe you could write it simply as
Hope it helps.

how to select the newly added rows in a table efficiently?

I need to periodically update a local cache with new additions to some DB table. The table rows contain an auto-increment sequential number (SN) field. The cache keeps this number too, so basically I just need to fetch all rows with SN larger than the highest I already have.
SELECT * FROM table where SN > <max_cached_SN>
However, the majority of the attempts will bring no data (I just need to make sure that I have an absolutely up-to-date local copy). So I wonder if this will be more efficient:
count = SELECT count(*) from table;
if (count > <cache_size>)
// fetch new rows as above
I suppose that selecting by an indexed numeric field is quite efficient, so I wonder whether using count has any benefit. On the other hand, this test/update will be done quite frequently and by many clients, so there is a motivation to optimize it.
"this test/update will be done quite frequently and by many clients"
This could lead to unexpected race conditions during cache generation.
I would suggest:
- when a row is added to your table, add the newest id to a queue table
- use something like crontab to trigger the cache generation by checking the queue table
- once the new cache has been generated, delete the id from the queue table
As you stress that the majority of the attempts will bring no data, the above will only trigger when there is a new addition.
The queue table concept can even be extended to updates and deletes.
I believe that
SELECT * FROM table where SN > <max_cached_SN>
will be faster, because select count(*) may cause a table scan. Just for clarification, do you never delete rows from this table?
SELECT COUNT(*) may involve a scan (even a full scan), while SELECT ... WHERE SN > constant can effectively use an index by SN, and looking at very few index nodes may suffice. Don't count items if you don't need the exact total, it's expensive.
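A minimal sketch of the polling approach with Python's `sqlite3` module. The `items` table, the `payload` column and the `refresh` helper are hypothetical names; the cache is a plain dict keyed by SN.

```python
import sqlite3


def refresh(conn, cache):
    """Fetch rows newer than the highest cached SN; return how many arrived."""
    max_cached_sn = max(cache, default=0)
    rows = conn.execute(
        "SELECT SN, payload FROM items WHERE SN > ?", (max_cached_sn,)
    ).fetchall()
    cache.update(rows)  # each row is an (SN, payload) pair
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (SN INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
cache = {}

conn.executemany("INSERT INTO items (payload) VALUES (?)", [("a",), ("b",), ("c",)])
first = refresh(conn, cache)   # picks up the three initial rows
second = refresh(conn, cache)  # nothing new: the index seek finds no rows
conn.execute("INSERT INTO items (payload) VALUES ('d')")
third = refresh(conn, cache)   # picks up only the new row
print(first, second, third, sorted(cache))
```

An empty poll costs only a lookup on the SN key, so no separate count is needed to decide whether to fetch.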
You don't need to use SELECT COUNT(*)
There are two solutions.
You can use a temp table with one field that holds the last row count of your table, and create a new AFTER INSERT trigger on your table that increments that field.
You can use a temp table with one field that holds the last SN of your table, and create a new AFTER INSERT trigger on your table that updates that field.
not much to this really
drop table if exists foo;
create table foo (
foo_id int unsigned not null auto_increment primary key
);
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;
insert into foo values (null),(null),(null),(null),(null),(null),(null),(null),(null);
select * from foo order by foo_id desc limit 10;

Should I create a unique clustered index, or non-unique clustered index on this SQL 2005 table?

I have a table storing millions of rows. It looks something like this:
ID, Bigint (Identity col)
OutputFileID, int
Sequence, int
…(many other fields)
We find ourselves in a situation where the developer who designed it made the OutputFileID the clustered index. It is not unique. There can be thousands of records with this ID. It has no benefit to any processes using this table, so we plan to remove it.
The question, is what to change it to… I have two candidates, the ID identity column is a natural choice. However, we have a process which does a lot of update commands on this table, and it uses the Sequence to do so. The Sequence is non-unique. Most records only contain one, but about 20% can have two or more records with the same Sequence.
The INSERT app is a VB6 piece of crud throwing thousands of insert commands at the table. The inserted values are never in any particular order. So the Sequence of one insert may be 12345, and the next could be 12245. I know that this could cause SQL to move a lot of data to keep the clustered index in order. However, the Sequences of the inserts are generally close to being in order. All inserts would take place at the end of the clustered table. E.g.: I have 5 million records with Sequence spanning 1 to 5 million. The INSERT app will be inserting sequences at the end of that range at any given time. Reordering of the data should be minimal (tens of thousands of records at most).
Now, the UPDATE app is our .NET star. It does all UPDATES on the Sequence column. “Update Table_Docs Set Field1=This, Field2=That…WHERE Sequence =12345” – hundreds of thousands of these a day. The UPDATES are completely and totally random, touching all points of the table.
All other processes are simply doing SELECTs on this (web pages). Regular indexes cover those.
So my question is, what’s better….a unique clustered index on the ID column, benefiting the INSERT app, or a non-unique clustered index on the Sequence, benefiting the UPDATE app?
First off, I would definitely recommend to have a clustered index!
Secondly, your clustered index should be:
- narrow
- static (never or hardly ever change)
- unique
- ever-increasing
so an INT IDENTITY is a very well thought out choice.
When your clustering key is not unique, SQL Server will add a 4-byte uniqueifier to those column values - thus making your clustering key and with it all non-clustered indices on that table larger and less optimal.
So in your case, I would pick the ID - it's narrow, static, unique and ever-increasing - can't be more optimal than that! Since the Sequence is used heavily in UPDATE statements, definitely put a non-clustered index on it, too!
See Kimberly Tripp's excellent blog posts on choosing the right clustering key for great background info on the topic.
As a general rule, you want your clustered index to be unique. If it is not, SQL Server will in fact add a hidden "uniquifier" to it to force it to be unique, and this adds overhead.
So, you are probably best using the ID column as your index.
Just as a side note, using an identity column as your primary key is normally referred to as a surrogate key since it is not inherent in your data. When you have a unique natural key available, that is probably a better choice. In this case it looks like you do not, so using the unique surrogate key makes sense.
The worst thing about the inserts out of order is page splits.
When SQL Server needs to insert a new record into an existing index page and finds no place there, it takes half the records from the page and moves them into a new one.
Say, you have these records filling the whole page:
1 2 3 4 5 6 7 8 9
and need to insert a 10. In this case, SQL Server will just start the new page.
However, if you have this:
1 2 3 4 5 6 7 8 11
then 10 should go before 11. In this case, SQL Server will move records from 6 to 11 into the new page:
6 7 8 10 11
The old page, as can easily be seen, will remain half filled (only records 1 to 5 will stay there).
This will increase the index size.
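The effect can be reproduced with a toy simulation. This is a deliberate simplification, not SQL Server's actual algorithm: pages hold a fixed number of keys (9, as in the example above) rather than bytes, a full page in the middle of the index is split in half, and an insert past the end of the index starts a new page. `count_pages` and `shift` are hypothetical helpers.

```python
from bisect import bisect_right, insort


def count_pages(keys, capacity=9):
    """Insert keys one by one into sorted leaf pages and return the page count."""
    pages = [[]]  # each page is a sorted list holding at most `capacity` keys
    for key in keys:
        # Pick the page whose key range covers `key`: the last page whose
        # first key is <= key (or the first page if there is none).
        firsts = [page[0] for page in pages if page]
        i = max(bisect_right(firsts, key) - 1, 0)
        page = pages[i]
        if len(page) < capacity:
            insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Appending past the end of the index: just start a new page.
            pages.append([key])
        else:
            # Page split: move the upper half to a new page, then insert.
            mid = len(page) // 2
            upper = page[mid:]
            del page[mid:]
            pages.insert(i + 1, upper)
            insort(upper if key >= upper[0] else page, key)
    return len(pages)


def shift(num):
    # One key in five arrives just after its successor.
    return num + (2 if num % 5 == 0 else 0 if num % 5 == 1 else 1)


N = 2000
perfect_pages = count_pages(range(1, N + 1))
almost_perfect_pages = count_pages(shift(n) for n in range(1, N + 1))
print(perfect_pages, almost_perfect_pages)
```

In-order inserts fill every page completely, while the slightly shuffled sequence keeps triggering splits that leave half-filled pages behind, so it needs more pages for the same number of keys.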
Let's create two sample tables:
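CREATE TABLE perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
CREATE TABLE almost_perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
;WITH q(num) AS (
    SELECT 1
    UNION ALL
    SELECT num + 1 FROM q
    WHERE num < 200000
)
INSERT INTO perfect
SELECT num, REPLICATE('*', 300) FROM q
OPTION (MAXRECURSION 0)
;WITH q(num) AS (
    SELECT 1
    UNION ALL
    SELECT num + 1 FROM q
    WHERE num < 200000
)
INSERT INTO almost_perfect
SELECT num + CASE num % 5 WHEN 0 THEN 2 WHEN 1 THEN 0 ELSE 1 END, REPLICATE('*', 300) FROM q
OPTION (MAXRECURSION 0)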
EXEC sp_spaceused N'perfect'
EXEC sp_spaceused N'almost_perfect'
name            rows    reserved   data       index_size  unused
perfect         200000  66960 KB   66672 KB   264 KB      24 KB
almost_perfect  200000  128528 KB  128000 KB  496 KB      32 KB
Even with only 20% probability of the records being out of order, the table becomes twice as large.
On the other hand, having a clustered key on Sequence will halve the I/O (since each update can be done with a single clustered index seek rather than two seeks through an unclustered index).
So I'd take a sample subset of your data, insert it into the test table with a clustered index on Sequence and measure the resulting table size.
If it is less than twice the size of the same table with an index on ID, I'd go for the clustered index on Sequence (since the total resulting I/O will be less).
If you decide to create a clustered index on Sequence, make ID an unclustered PRIMARY KEY and make the clustered index UNIQUE on Sequence, ID. This will use a meaningful ID instead of an opaque uniquifier.