Delete top X rows from a table with a composite key using NHibernate - sql

We have the following table:
CREATE TABLE TagValueDouble(
TagIdentity [int] NOT NULL,
TimestampInUtcTicks [bigint] NOT NULL,
DoubleValue [float] NULL,
CONSTRAINT [PK_TagValueDouble] PRIMARY KEY CLUSTERED
(
TagIdentity ASC,
TimestampInUtcTicks ASC
)
)
This table gets filled with many measurements from different sources (e.g. wind speed). The TagIdentity represents the source, and combined with the timestamp it identifies a unique record.
This table gets large, say 2000 different sources with a 2Hz update rate.
Not often, but sometimes, we remove a source and need to drop all of that source's records from the table. The problem is that the query times out when run through NHibernate.
My plan was to delete X rows at a time from the records that are no longer part of the system. Something along the lines of:
DELETE FROM TagValueDouble
WHERE TagIdentity in
(SELECT TOP 10 TagIdentity, TimestampInUtcTicks
FROM TagValueDouble
Where TagIdentity not in (12, 14))
But this does not work.
Any ideas on how I can clean up the table without risking the timeout?
I'm looking for stability not performance. Deleting all values for a source is something that is done rarely.
PS. It has to work on SQL Server, Oracle, Postgres and SQL CE.

It seems you are missing the index on TagIdentity. Create the index and make sure it is filled correctly before deleting the records.
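A minimal sketch of that suggestion (the index name is arbitrary):
CREATE INDEX IX_TagValueDouble_TagIdentity
ON TagValueDouble (TagIdentity);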

One idea was to do something like this:
DELETE FROM TagValueDouble
WHERE (TagIdentity + 4000) * 315360000000000 + TimestampInUtcTicks IN
(SELECT TOP 1000 (TagIdentity + 4000) * 315360000000000 + TimestampInUtcTicks
FROM TagValueDouble
WHERE TagIdentity NOT IN (5, 10, 12, 13))
The rationale being that I do not expect the system to be running in the year 4000, so I can combine TagIdentity and TimestampInUtcTicks into one unique value. But this ended up checking all entries in the table.
What I ended up with was querying for one value whose TagIdentity was not in the "allowed" set, and then deleting every value from that value's timestamp and 24 hours back in time. If that delete also timed out, I reduced the number of hours until the deletion succeeded. Repeating this job until no more values are found cleans up the database.
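A rough sketch of that final approach in SQL Server syntax (TOP and the tick arithmetic would need equivalents on the other engines; 864000000000 ticks is 24 hours):
DECLARE @tagId int, @ts bigint;
SELECT TOP 1 @tagId = TagIdentity, @ts = TimestampInUtcTicks
FROM TagValueDouble
WHERE TagIdentity NOT IN (5, 10, 12, 13);
-- Delete that tag's values from 24 hours before the found timestamp up to it.
-- Shrink the window if this still times out; repeat the whole job until nothing is found.
IF @tagId IS NOT NULL
DELETE FROM TagValueDouble
WHERE TagIdentity = @tagId
AND TimestampInUtcTicks BETWEEN @ts - 864000000000 AND @ts;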

Related

Remove records by clustered or non-clustered index

I have a table (let's say ErrorLog)
CREATE TABLE [dbo].[ErrorLog]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Created] [datetime] NOT NULL,
[Message] [varchar](max) NOT NULL,
CONSTRAINT [PK_ErrorLog]
PRIMARY KEY CLUSTERED ([Id] ASC)
)
I want to remove all records that are older than 3 months.
I have a non-clustered index on the Created column (ascending).
I am not sure which one of these is better (seem to take same time).
Query #1:
DELETE FROM ErrorLog
WHERE Created <= DATEADD(month, - 3, GETDATE())
Query #2:
DECLARE @id INT
SELECT @id = max(l.Id)
FROM ErrorLog l
WHERE l.Created <= DATEADD(month, - 3, GETDATE())
DELETE FROM ErrorLog
WHERE Id <= @id
Once you know the maximum clustered key you want to delete, it is definitely faster to use this key. The question is whether it is worth selecting this key first using the date. The right decision depends on the size of the table and what portion of the data you need to delete. The smaller the table and the smaller the number of records to delete, the more efficient the first option (Query #1) should be. However, if the number of records to delete is large enough, the non-clustered index on the date column will be ignored and SQL Server will start scanning the base table. In such a case the second option (Query #2) might be the better choice. And there are usually other factors to consider as well.
I solved a similar issue recently (deleting about 600 million old records, roughly 2/3 of a 1.5TB table) and decided on the second approach in the end. There were several reasons for it, but the main ones were as follows.
The table had to be available for new inserts while the old records were being deleted. So, I could not delete the records in one monstrous delete statement but rather had to use several smaller batches in order to avoid lock escalation to the table level. Smaller batches also kept the transaction log size within reasonable limits. Furthermore, I had only an approximately one-hour maintenance window each day, and it was not possible to delete all required records within one day.
With the above in mind, the fastest solution for me was to select the maximum ID I needed to delete according to the date column and then just start deleting from the beginning of the clustered index up to the selected Id, one batch after the other (DELETE TOP(@BatchSize) FROM ErrorLog WITH(PAGLOCK) WHERE ID <= @myMaxId). I used the PAGLOCK hint in order to increase the batch size without escalating the lock to the table level. In the end I deleted several batches each day.
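Roughly what that batched loop looks like, as a sketch (the batch size is a placeholder to tune):
DECLARE @BatchSize int = 100000;
DECLARE @myMaxId int;
SELECT @myMaxId = MAX(l.Id)
FROM ErrorLog l
WHERE l.Created <= DATEADD(month, -3, GETDATE());
WHILE 1 = 1
BEGIN
DELETE TOP (@BatchSize) FROM ErrorLog WITH (PAGLOCK)
WHERE Id <= @myMaxId;
IF @@ROWCOUNT = 0 BREAK;
END;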

Optimize tsql table and/or INSERT statement that will be source for other tables?

We have a staging table that looks like this; it stores all our data in 15-minute intervals:
CREATE TABLE [dbo].[15MinDataRawStaging](
[RawId] [int] IDENTITY(1,1) NOT NULL,
[CityId] [varchar](15) NOT NULL,
[Date] [int] NULL,
[Hour] [int] NULL,
[Minute] [int] NULL,
[CounterValue] [int] NOT NULL,
[CounterName] [varchar](40) NOT NULL
)
It currently stores 20 different Counters, which means that we insert about 400K rows every hour of every day to this table.
Right now, I'm deleting data from before 03/2016, but even with the first 8 days of March data, there's over 58M rows.
Once all the hourly data is stored in [15MinDataRawStaging], we start copying data from this table to other tables, which are then used for the reports.
So, for example, we have a KPI called Downtime, which is composed of counters VeryLongCounterName1 and VeryLongCounterName2. Once the hourly data is stored in [15MinDataRawStaging], we run a stored procedure that inserts these counters into their own table, called [DownTime]. It looks something like this:
insert into [DownTime] (CityKey, Datekey, HourKey, MinuteKey, DownTime, DowntimeType)
select CityId, [date], [hour], [minute], CounterValue, CounterName
From [15MinDataRawStaging] p
where
[date] = @Date
and [Hour] = @Hour
and CounterName in ('VeryLongCounterName1', 'VeryLongCounterName2')
and CounterValue > 0
This runs automatically every hour (through a C# console app), and I've noticed that with this query I'm getting timeout issues. I just ran it, and it indeed takes about 35 seconds to complete.
So my questions are:
Is there a way to optimize the structure of the staging table so these types of INSERTs to other tables don't take that long?
Or is it possible to optimize the INSERT query? The reason I have the staging table is because I need to store the data, even if it's for the current month. No matter what's done, the staging table will have tons of rows.
Do you guys have any other suggestions?
Thanks.
It sounds like you want to partition 15MinDataRawStaging into daily or hourly chunks. The documentation explains how to do this (better than a Stack Overflow answer).
Partitioning basically stores the table in multiple different files (at least conceptually). Certain actions can be very efficient. For instance, dropping a partition is much, much faster than dropping the individual records. In addition, fetching data from a single partition should be fast -- and in your case, the most recent partition will be in memory, making everything faster.
Depending on how the data is used, indexes might also be appropriate. But for this volume of data and the way you are using it, partitions seem like the key idea.
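As a hedged sketch only (assuming the integer [Date] column holds a yyyymmdd key; the names here are made up), daily partitioning could start like this:
CREATE PARTITION FUNCTION pfDailyStaging (int)
AS RANGE RIGHT FOR VALUES (20160301, 20160302, 20160303); -- one boundary per day
CREATE PARTITION SCHEME psDailyStaging
AS PARTITION pfDailyStaging ALL TO ([PRIMARY]);
-- Build the table (or its clustered index) on psDailyStaging([Date]); old days can then
-- be switched out or truncated instead of being deleted row by row.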
Assuming that the staging table has only one purpose, viz for the INSERT into DownTime, you can trade off a small amount of storage and insert performance (into the staging table) to improve the final ETL performance by adding a clustered index matching the query used in extraction:
CREATE UNIQUE CLUSTERED INDEX MyIndex
ON [15MinDataRawStaging]([Date], [Hour], [Minute], RawId);
I've added RawId in order to allow uniqueness (otherwise a 4-byte uniquifier would have been added in any event).
You'll also want to do some trial and error by testing whether adding [CounterName] and/or [CounterValue] to the index (but before RawId) will improve the overall process throughput (i.e. both the staging insertion and the extraction into the final DownTime table).
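One variant to trial, per the note above (a sketch; DROP_EXISTING replaces the index created earlier):
CREATE UNIQUE CLUSTERED INDEX MyIndex
ON [15MinDataRawStaging]([Date], [Hour], [Minute], [CounterName], RawId)
WITH (DROP_EXISTING = ON);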

Implement a ring buffer

We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on):
My recent question/answer here addresses a similar issue using a time series database.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. About every 20 hours, you can remove a partition.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
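A hedged sketch of that rotation in SQL Server syntax (the table name, column, and boundary values are made up, and the boundary maintenance would normally be scripted):
CREATE PARTITION FUNCTION pfLogHourly (datetime2(3))
AS RANGE RIGHT FOR VALUES ('2024-01-01 00:00', '2024-01-01 01:00', '2024-01-01 02:00');
CREATE PARTITION SCHEME psLogHourly
AS PARTITION pfLogHourly ALL TO ([PRIMARY]);
-- Periodically: switch the oldest partition into an empty twin table and truncate it,
-- then merge away the old boundary and split a new one at the leading edge.
-- ALTER TABLE LogData SWITCH PARTITION 1 TO LogData_Purge;
-- TRUNCATE TABLE LogData_Purge;
-- ALTER PARTITION FUNCTION pfLogHourly() MERGE RANGE ('2024-01-01 00:00');
-- ALTER PARTITION SCHEME psLogHourly NEXT USED [PRIMARY];
-- ALTER PARTITION FUNCTION pfLogHourly() SPLIT RANGE ('2024-01-01 03:00');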
I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous rows instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't have one already I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
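A sketch of how those pieces could fit together (the LogRing table and its columns are hypothetical; 1,000,000,000 is the ring size):
CREATE SEQUENCE dbo.RingSlotSeq AS bigint START WITH 0 INCREMENT BY 1 MINVALUE 0;
DECLARE @message varchar(100) = 'SomeMessage';
DECLARE @seq bigint;
SET @seq = NEXT VALUE FOR dbo.RingSlotSeq;
DECLARE @slot bigint = @seq % 1000000000; -- slot number within the ring
MERGE dbo.LogRing AS tgt -- hypothetical log table keyed on Slot
USING (SELECT @slot AS Slot, SYSUTCDATETIME() AS LoggedAt, @message AS Message) AS src
ON tgt.Slot = src.Slot
WHEN MATCHED THEN UPDATE SET LoggedAt = src.LoggedAt, Message = src.Message
WHEN NOT MATCHED THEN INSERT (Slot, LoggedAt, Message) VALUES (src.Slot, src.LoggedAt, src.Message);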
Hope this helps.
Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.
Just an idea that is too complicated to write in a comment.
Create a few log tables, 3 as an example, Log1, Log2, Log3
CREATE TABLE Log1 (
Id int NOT NULL
CHECK (Id BETWEEN 0 AND 9)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
Id int NOT NULL
CHECK (Id BETWEEN 10 AND 19)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
Id int NOT NULL
CHECK (Id BETWEEN 20 AND 29)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
Then create a partitioned view
CREATE VIEW LogView AS
SELECT * FROM Log1
UNION ALL
SELECT * FROM Log2
UNION ALL
SELECT * FROM Log3
If you are on SQL2012 you can use a sequence
CREATE SEQUENCE LogSequence AS int
START WITH 0
INCREMENT BY 1
MINVALUE 0
MAXVALUE 29
CYCLE
;
And then start to insert values
INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
,'SomeMessage'
Now you just have to truncate the log tables on some kind of schedule.
If you don't have SQL 2012 you need to create the sequence some other way.
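One pre-2012 alternative, sketched under the assumption that a single-row counter table is acceptable (the compound SET @next = CurrentValue = ... reads and advances the counter atomically):
CREATE TABLE LogSequenceCounter (CurrentValue int NOT NULL);
INSERT INTO LogSequenceCounter (CurrentValue) VALUES (-1);
-- Per insert: advance and read the counter in one UPDATE, cycling 0..29.
DECLARE @next int;
UPDATE LogSequenceCounter
SET @next = CurrentValue = (CurrentValue + 1) % 30;
INSERT INTO LogView (Id, Message) VALUES (@next, 'SomeMessage');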
I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)

Partitioned view and performance for a table with a huge number of records

Currently I am facing performance issues with queries and stored procedures. Following is the scenario:
We have 3-4 tables in a database (SQL Server 2000 SP4) which have huge amounts of records. One of the tables has more than 25 million records. These tables maintain sales records, and thousands of records are added to them daily. Whenever a stored procedure is executed it takes 15-30 minutes to complete. There are 3-4 joins on the table. Users are complaining about it frequently. Indexes are correct. To improve the performance we have implemented partitioned views. The solution was implemented by referring to the following article on MSDN.
We split the sales records year-wise and performance improved; a query/stored procedure now takes 3-5 minutes to run. To improve the performance further, we split the sales records month-wise. We maintain 4 years of data, so we are now close to having 48 tables for sales data (after splitting the sales data by month). I was expecting this to improve the performance, but that is not happening. The query executes much slower than the previous version (year-wise splitting of data), which surprises me. Also, after looking at the query plan I found that it is doing an index scan on all 48 sales tables instead of scanning only the relevant tables. E.g. when the stored procedure is queried for the period 19-NOV-2012 to 20-DEC-2012, it should consider only the two tables NOV-2012 and DEC-2012, but it is considering all 48 tables. So my questions are:
Why is it considering all tables instead of only the relevant ones (in the above example, NOV-2012 and DEC-2012)?
Why is the year-wise logic (sales records split by year) performing better than the month-wise logic (sales records split by month)?
Following is the code for the partitioned view, shown for one example year; other years are omitted.
SELECT * FROM tbl_Sales_Jan2010
UNION ALL
SELECT * FROM tbl_Sales_Feb2010
UNION ALL
SELECT * FROM tbl_Sales_Mar2010
UNION ALL
SELECT * FROM tbl_Sales_Apr2010
UNION ALL
SELECT * FROM tbl_Sales_May2010
UNION ALL
SELECT * FROM tbl_Sales_Jun2010
UNION ALL
SELECT * FROM tbl_Sales_Jul2010
UNION ALL
SELECT * FROM tbl_Sales_Aug2010
UNION ALL
SELECT * FROM tbl_Sales_Sep2010
UNION ALL
SELECT * FROM tbl_Sales_Oct2010
UNION ALL
SELECT * FROM tbl_Sales_Nov2010
UNION ALL
SELECT * FROM tbl_Sales_Dec2010
Following is the table structure.
CREATE TABLE [dbo].[tbl_Sales_Jan2010](
[SalesID] [numeric](10, 0) NOT NULL,
[StoreNumber] [char](3) NOT NULL,
[SomeColumn1] [varchar](15) NOT NULL,
[Quantity] [int] NOT NULL,
[SalePrice] [numeric](18, 2) NOT NULL,
[SaleDate] [datetime] NOT NULL,
[DeptID] [int] NOT NULL,
[CatCode] [char](3) NOT NULL,
[AuditDate] [datetime] NOT NULL CONSTRAINT [DF_tbl_Sales_Jan2010_EditDate] DEFAULT (getdate()),
[SomeColumn2] [varchar](15) NULL,
[SaleMonthYear] [int] NULL CONSTRAINT [DF__tbl_Sales__SaleY__Jan2010] DEFAULT (12010),
[SaleDateInIntFormat] [int] NULL,
CONSTRAINT [PK_tbl_Sales_Jan2010] PRIMARY KEY CLUSTERED
(
[SalesID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [dbo].[tbl_Sales_Jan2010] WITH CHECK ADD CHECK (([SaleMonthYear] = 12010))
Following is the query
SELECT SUM(C.Quantity) as total
FROM Productdatabase.dbo.tbl_Product A, Productdatabase.dbo.tbl_Product_Category B, XDatabase.dbo.vw_Sales_Test C, tbl_Store D
WHERE A.ProductID = B.ProductID AND B.CategoryID = @CateID
AND C.SomeColumn = A.ProductCode
AND D.StoreCode = C.StoreNumber
AND D.country = @country
AND D.status = 0
AND C.SaleMonthYear between @BeginMonthYear and @EndMonthYear
AND C.SaleDate between @FromSaleDate and @ToSaleDate
Whoever set up the partitioning did not really think about what they were doing. Besides not using real partitioning (which is a SQL Server feature), most likely for cost reasons, every branch of the union is just
SELECT * FROM tbl_Sales_Jan2010
Add the WHERE condition to each branch of the union, so the query analyzer can rule out the tables that are not relevant right there. I.e. add
WHERE [SaleMonthYear] = 12010
to the January branch, and the corresponding value to each of the others.
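A sketch of what that ends up looking like (boundary values follow the question's 12010/22010 pattern; the remaining months are omitted):
CREATE VIEW vw_Sales_Test AS
SELECT * FROM tbl_Sales_Jan2010 WHERE [SaleMonthYear] = 12010
UNION ALL
SELECT * FROM tbl_Sales_Feb2010 WHERE [SaleMonthYear] = 22010
UNION ALL
SELECT * FROM tbl_Sales_Mar2010 WHERE [SaleMonthYear] = 32010
-- ... and so on for the remaining months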
Second, fix your other issues. Really. Point being:
We have 3-4 tables in a database (SQL Server 2000 SP4) which have huge
amounts of records. One of the tables has more than 25 million
records.
Let me laugh. 25 million is not tiny, not small, but "huge"? I mean, I have worked with tables adding hundreds of millions of rows PER DAY and keeping the data for 2 years. 25 million is something a mid-range server handles easily. I suggest you have either bad hardware (and I mean bad), or something else really going on.
Design issues like:
[SaleMonthYear]
This should not exist - it should be SaleYearMonth, so you can make a range test (between 201005 and 201008), which you cannot do efficiently now, and you totally bork any index ordering if you ever use that column.
Because the month comes first in the number, you totally bork the gain here.
Whenever a stored procedure is executed it takes 15-30 minutes to complete
Let me be clear here. On acceptable mid-range hardware for a situation like that (i.e. a proper server, 32-64 GB RAM, a dozen to 24 high-speed discs) there is NO WAY this takes 15 to 30 minutes. Not with the code you wrote there.
Unless you have stuff like lock contention (bad application design) or the server is overloaded with other things (bad application design / bad administration), I would expect a query like that, with proper indices, to return in well under a minute.
Anyhow, partitioning works by eliminating a lot of the checks fast - and in your case it is also/mostly a delete optimization (you can just drop tables, no need to have a delete statement make heavy index updates). The way you implemented it, though, is not the way MS says it should be done, not the way logic says it should be done, and it will give no benefit because your partitioning is not integrated into the query.
If you look at the tables and the query, it still must check every table.
From the very same MSDN article you have quoted:
CHECK constraints are not needed for the partitioned view to return the correct results. However, if the CHECK constraints have not been defined, the query optimizer must search all the tables instead of only those that cover the search condition on the partitioning column. Without the CHECK constraints, the view operates like any other view with UNION ALL. The query optimizer cannot make any assumptions about the values stored in different tables and it cannot skip searching the tables that participate in the view definition.
In your question, you are specifying a query which has a date range - 19-Nov-2012 to 20-Dec-2012. I assume that would be the value contained in the SaleDate column, but your constraint is on the SaleMonthYear column.
Are you sure that the constraint defined is correct? Could you also please post your query?
Raj

Why is this query faster without an index?

I inherited a new system and I am trying to make some improvements on the data. I am trying to improve this table and can't seem to make sense of my findings.
I have the following table structure:
CREATE TABLE [dbo].[Calls](
[CallID] [varchar](8) NOT NULL PRIMARY KEY,
[RecvdDate] [varchar](10) NOT NULL,
[yr] [int] NOT NULL,
[Mnth] [int] NOT NULL,
[CallStatus] [varchar](50) NOT NULL,
[Category] [varchar](100) NOT NULL,
[QCall] [varchar](15) NOT NULL,
[KOUNT] [int] NOT NULL)
This table has about 220k records in it. I need to return all records that have a date greater than a specific date, in this case 12/1/2009. This query will return about 66k records and it takes about 4 seconds to run. From past systems I have worked on, this seems high, especially given how few records are in the table. So I would like to bring that time down.
So I'm wondering what would be some good ways to bring that down? I tried adding a date column to the table and converting the string date to an actual date column. Then I added an index on that date column but the time stayed the same. Given that there aren't that many records I can see how a table scan could be fast but I would think that an index could bring that time down.
I have also considered just querying off the month and year columns. But I haven't tried it yet. And would like to keep it off the date column if possible. But if not I can change it.
Any help is appreciated.
EDIT: Here is the query I am running to test the speed of the table. I usually list out the columns, but just for simplicity I used * :
SELECT *
FROM _FirstSlaLevel_Tickets_New
WHERE TicketRecvdDateTime >= '12/01/2009'
EDIT 2: So I mentioned that I had tried to create a table with a date column that contained the recvddate data but as a date rather than a varchar. That is what TicketRecvdDateTime column is in the query above. The original query I am running against this table is:
SELECT *
FROM Calls
WHERE CAST(RecvdDate AS DATE) >= '12/01/2009'
You may be encountering what is referred to as the Tipping Point in SQL Server. Even though you have the appropriate index on the column, SQL Server may decide to do a table scan anyway if the expected number of rows returned exceeds some threshold (the 'tipping point').
In your example, this seems likely since your query is returning about 1/4 of the rows in the table. The following is a good article that explains this: http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx
SELECT * will usually give poor performance.
Either the index will be ignored or you'll end up with a key/bookmark lookup into the clustered index. No matter which: both can run badly.
For example, if you had this query and the index on TicketRecvdDateTime INCLUDEd CallStatus, then it would most likely run as expected, because the index would be covering:
SELECT CallStatus
FROM _FirstSlaLevel_Tickets_New
WHERE TicketRecvdDateTime >= '12/01/2009'
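A covering index along those lines might look like this (a sketch; the index name is arbitrary):
CREATE NONCLUSTERED INDEX IX_Tickets_RecvdDateTime_Covering
ON _FirstSlaLevel_Tickets_New (TicketRecvdDateTime)
INCLUDE (CallStatus);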
This is in addition to Randy Minder's answer: a key/bookmark lookup may be cheap enough for a handful of rows but not for a large chunk of the table data.
Your query is faster w/o an index (or, more precisely, is the same speed w/ or w/o the index) because an index on RecvdDate will always be ignored in an expression like CAST(RecvdDate AS DATE) >= '12/01/2009'. This is a non-SARG-able expression, as it requires the column to be transformed through a function. In order for this index even to be considered, you have to express your filter criteria exactly on the column being indexed, not on an expression based on it. That would be the first step.
There are more steps:
Get rid of the VARCHAR(10) column for dates and replace it with the appropriate DATE or DATETIME column. Storing date and/or time as strings is riddled with problems. Not only for indexing, but also for correctness.
A table that is frequently scanned on a range based on a column (as most such call log tables are) should be clustered by that column.
It is highly unlikely you really need the yr and mnth columns. If you really do need them, then you probably need them as computed columns.
CREATE TABLE [dbo].[Calls](
[CallID] [varchar](8) NOT NULL,
[RecvdDate] [datetime] NOT NULL,
[CallStatus] [varchar](50) NOT NULL,
[Category] [varchar](100) NOT NULL,
[QCall] [varchar](15) NOT NULL,
[KOUNT] [int] NOT NULL,
CONSTRAINT [PK_Calls_CallId] PRIMARY KEY NONCLUSTERED ([CallID]));
CREATE CLUSTERED INDEX cdxCalls ON Calls(RecvdDate);
SELECT *
FROM Calls
WHERE RecvdDate >= '12/01/2009';
Of course, the proper structure of the table and indexes should be the result of careful analysis, considering all factors involved, including update performance, other queries etc. I recommend you start by going through all the topics included in Designing Indexes.
Can you alter your query? If only a few columns are needed, you can alter the SELECT clause to return fewer columns. Then you can create a covering index that includes all columns referenced, including TicketRecvdDateTime.
You might create the index on TicketRecvdDateTime, but you may not avoid the tipping point that @Randy Minder discusses. However, a scan on the smaller index (smaller than a table scan) would read fewer pages.
Assuming RecvdDate is the TicketRecvdDateTime you are talking about:
SQL Server only compares dates in single quotes if the field type is DATE. Your query is probably comparing them as VARCHAR. Try adding a row with '99/99/0001' and see if it shows up at the bottom.
If so, your query results are incorrect. Change type to DATE.
Note that VARCHAR does not index well, DATETIME does.
Check the query plan to see if it's using indices. If the DB is small compared to available RAM, it may simply table scan and hold everything in memory.
EDIT: On seeing your CAST/DATETIME edit, let me point out that parsing a date from a VARCHAR is a very expensive operation. You are doing this 220k times. This will kill performance.
Also, you are no longer filtering on the indexed field directly: a comparison with an expression involving an indexed field does not use the index.
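To illustrate that last point with the query from the question (a sketch; the second form assumes the column has been converted to a real datetime as suggested above):
-- Wraps the column in a conversion, so an index on RecvdDate cannot be used for a seek:
SELECT * FROM Calls WHERE CAST(RecvdDate AS DATE) >= '12/01/2009';
-- Filters on the bare column, so an index on RecvdDate can be used:
SELECT * FROM Calls WHERE RecvdDate >= '12/01/2009';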