Order by clause execution in SQL

This question isn't about the overall order of execution; it's just about the ORDER BY clause.
The standard logical processing order is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
TOP
EDIT: This question has more or less become "Does SQL Server apply short-circuit evaluation when executing ORDER BY expressions?" The answer is SOMETIMES! I just haven't found a reasonable explanation as to why. See Edit #4.
Now suppose I have a statement like this:
DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
Customers.Name
FROM
Customers
WHERE
Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
Contacts.LastName ASC, --STATEMENT1
Contacts.FirstName ASC, --STATEMENT2
(
SELECT
MAX(PurchaseDateTime)
FROM
Purchases
WHERE
Purchases.CustomerID = Customers.CustomerID
) DESC --STATEMENT3
This isn't the real statement I'm trying to execute, but just an example.
There are three ORDER BY statements.
The third statement is only used for rare cases where the last name and first name match.
If there are no duplicate last names, does SQL Server not execute ORDER BY statements #2 and #3? And, logically, if there are no duplicate last-name/first-name pairs, does SQL Server not execute statement #3?
This is really for optimization. Reading from the Purchases table should only be a last resort. In the case of my application, it wouldn't be efficient to read every single "PurchaseDateTime" from "Purchases" grouping by "CustomerID".
Please keep the answer related to my question and not a suggestion like building an index for CustomerID, PurchaseDateTime in Purchases. The real question is, does SQL Server skip unnecessary ORDER BY statements?
Edit: Apparently, SQL Server will always execute every statement as long as there is one row. Even with one row, this will give you a divide by zero error:
DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
Customers.Name
FROM
Customers
WHERE
Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
Contacts.LastName ASC, --STATEMENT1
Contacts.FirstName ASC, --STATEMENT2
1/(Contacts.ContactID - Contacts.ContactID) --STATEMENT3
Edit2:
Apparently, this doesn't give divide by zero:
DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
Customers.Name
FROM
Customers
WHERE
Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
Contacts.LastName ASC, --STATEMENT1
Contacts.FirstName ASC, --STATEMENT2
CASE WHEN 1=0
THEN Contacts.ContactID
ELSE 1/(Contacts.ContactID - Contacts.ContactID)
END --STATEMENT3
Well, the original answer to my question is YES, it does execute, but what's nice is that I can stop execution with a proper CASE WHEN.
Edit 3: We can stop execution of an ORDER BY statement with a proper CASE WHEN. The trick, I guess, is to figure out how to use it properly. CASE WHEN will give me what I want, which is short-circuit execution in an ORDER BY statement. I compared the execution plans in SSMS and, depending on the CASE WHEN statement, the Purchases table isn't scanned at all EVEN THOUGH it's a clearly visible SELECT/FROM statement:
DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
Customers.Name
FROM
Customers
WHERE
Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
Contacts.LastName ASC, --STATEMENT1
Contacts.FirstName ASC, --STATEMENT2
CASE WHEN 1=0
THEN
(
SELECT
MAX(PurchaseDateTime)
FROM
Purchases
WHERE
Purchases.CustomerID = Customers.CustomerID
)
ELSE Customers.DateOfBirth
END DESC
Edit 4: Now I'm completely confused. Here's an example by @Lieven:
WITH Test (name, ID) AS
(SELECT 'Lieven1', 1 UNION ALL SELECT 'Lieven2', 2)
SELECT * FROM Test ORDER BY name, 1/ (ID - ID)
This yields no divide by zero, which means SQL Server does, in fact, do short-circuit evaluation on SOME tables, specifically those created with the WITH command.
Trying this with a TABLE variable:
DECLARE @Test TABLE
(
NAME nvarchar(30),
ID int
);
INSERT INTO @Test (Name,ID) VALUES('Lieven1',1);
INSERT INTO @Test (Name,ID) VALUES('Lieven2',2);
SELECT * FROM @Test ORDER BY name, 1/ (ID - ID)
will yield a divide by zero error.

First of all, what you are calling "statements" are no such thing. They are sub-clauses of the ORDER BY (major) clause. The difference is important, because "statement" implies something separable, ordered and procedural, and SQL sub-clauses are none of those things.
Specifically, SQL sub-clauses (that is, the individual items of a SQL major clause (SELECT, FROM, WHERE, ORDER BY, etc.)) have no implicit (nor explicit) execution order of their own. SQL will re-order them in any way that it finds convenient and will almost always execute all of them if it executes any of them. In short, SQL Server does not do that kind of "short-circuit" optimization, because it is only trivially effective and seriously gets in the way of the very different kind of optimizations that it does do (i.e., statistical data access/operator optimizations).
So the correct answer to your original question (which you should not have changed) is NO, not reliably. You cannot rely on SQL Server to not use some sub-clause of the ORDER BY, simply because it looks like it does not need to.
The only common exception to this is that the CASE function can (in most circumstances) be used to short-circuit execution paths (within the CASE function though, not outside of it), but only because it is specifically designed for this. I cannot think of anything else in SQL that you can rely on to act like this.

DECLARE @MyTable TABLE
(
Data varchar(30)
)
INSERT INTO @MyTable (Data) SELECT 'One'
INSERT INTO @MyTable (Data) SELECT 'Two'
INSERT INTO @MyTable (Data) SELECT 'Three'
--SELECT *
--FROM @MyTable
--ORDER BY LEN(Data), LEN(Data)/0
-- Divide by zero error encountered.
SELECT *
FROM @MyTable
ORDER BY LEN(Data), CASE WHEN Data is null THEN LEN(Data)/0 ELSE 1 END
-- no problem
Also with SET STATISTICS IO ON I saw these results:
SELECT *
FROM @MyTable
ORDER BY LEN(Data)
--(3 row(s) affected)
--Table '#4F2895A9'. Scan count 1, logical reads 1
SELECT *
FROM @MyTable
ORDER BY LEN(Data), CASE WHEN Data = 'One' THEN (SELECT MAX(t2.Data) FROM @MyTable t2) ELSE Data END
--(3 row(s) affected)
--Table '#4F2895A9'. Scan count 2, logical reads 2
SELECT *
FROM @MyTable
ORDER BY LEN(Data), CASE WHEN Data = 'Zero' THEN (SELECT MAX(t2.Data) FROM @MyTable t2) ELSE Data END
--(3 row(s) affected)
--Table 'Worktable'. Scan count 0, logical reads 0
--Table '#4F2895A9'. Scan count 1, logical reads 1

I guess you have answered your own question. However, why are you sorting the data on just first name and last name, and falling back to the purchase date when those two match, rather than using the date of birth?
Logically, it should be first name, last name, DOB. Only if all three are the same should you evaluate the purchase date. Many people share the same name, but very few share the same name and DOB, so this will reduce how often you have to query the Purchases table.
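A rough sketch of that ordering against the tables from the question (note: this assumes Contacts joins to Customers on CustomerID, which the original example implies but doesn't show):
DECLARE @dt18YearsAgo AS DATETIME = DATEADD(YEAR,-18,GETDATE());
SELECT
Customers.Name
FROM
Customers
JOIN Contacts ON Contacts.CustomerID = Customers.CustomerID -- assumed join
WHERE
Customers.DateOfBirth > @dt18YearsAgo
ORDER BY
Contacts.LastName ASC,
Contacts.FirstName ASC,
Customers.DateOfBirth ASC, -- cheap tiebreaker before touching Purchases
(
SELECT MAX(PurchaseDateTime)
FROM Purchases
WHERE Purchases.CustomerID = Customers.CustomerID
) DESC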

Related

Update where select, guarantee atomicity

I have a T-SQL query like this:
UPDATE
[MyTable]
SET
[MyField] = @myValue
WHERE
[Id] =
(
SELECT TOP(1)
[Id]
FROM
[MyTable]
WHERE
[MyField] IS NULL
-- AND other conditions on [MyTable]
ORDER BY
[Id] ASC
)
It seems that this query is not atomic (the select of 2 concurrent executions can return the same Id twice).
Edit:
If I execute this query, the Id returned by the SELECT will not be available for the next execution (because [MyField] will not be NULL anymore). However, if I execute this query twice at the same time, both executions could return the same Id (and the second UPDATE would overwrite the first one).
I've read that one solution to avoid that is to use the SERIALIZABLE isolation level. Is that the best / fastest / simplest way?
As far as I can see, UPDLOCK would be enough (test code confirms that).
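For reference, that is just the original statement with the hint added to the inner SELECT; whether UPDLOCK alone is sufficient for your workload and isolation level is something to verify with your own test code, as noted above:
UPDATE
[MyTable]
SET
[MyField] = @myValue
WHERE
[Id] =
(
SELECT TOP(1)
[Id]
FROM
[MyTable] WITH (UPDLOCK) -- concurrent callers block here, then re-read and skip the now-updated row
WHERE
[MyField] IS NULL
-- AND other conditions on [MyTable]
ORDER BY
[Id] ASC
)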
This would return the top 1 value, but if the value is the same you have the potential to return more than one row.
Try using DISTINCT
I understand that TOP(1) should only return 1 row, but the question states it's returning more than one row, so I thought it might be because the value is the same. So you could possibly use something like this:
SELECT DISTINCT TOP 1 name FROM [Class];
It's either that, or a WHERE clause to narrow the results.
Calculate the max ID first, then use it in a cross join for the update.
SQL DEMO
WITH cte as (
SELECT TOP 1 ID
FROM [MyTable]
WHERE MYFIELD IS NULL
ORDER BY ID
)
UPDATE t
SET [ID] = cte.[ID]
FROM [MyTable] t
CROSS JOIN cte;

Pagination in SQL - Performance issue

I'm trying to use pagination and I found the perfect link on SO:
https://stackoverflow.com/a/109290/1481690
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
) AS RowConstrainedResult
WHERE RowNum >= 1
AND RowNum < 20
ORDER BY RowNum
I'm trying to use the exact same query, with an additional join of a few tables in my inner query.
I'm getting performance issues in the following scenarios:
WHERE RowNum >= 1
AND RowNum < 20 ==> executes fast, approx. 2 sec
WHERE RowNum >= 1000
AND RowNum < 1010 ==> more time, approx. 10 sec
WHERE RowNum >= 30000
AND RowNum < 30010 ==> more time, approx. 17 sec
Every time I select just 10 rows, but there's a huge time difference. Any ideas or suggestions?
I chose this approach because I'm binding columns dynamically and forming the query. Is there a better way I can organize the pagination query in SQL Server 2008?
Is there a way I can improve the performance of the query?
Thanks
I always check how much data I am accessing in a query and try to eliminate unnecessary columns as well as rows.
These are obvious points you might already have checked; I just wanted to point them out in case you haven't.
In your query the slow performance might be because you are doing SELECT *. Selecting all columns from the table does not let the optimizer come up with a good execution plan.
Check whether you need only selected columns, and make sure you have a correct covering index on the Orders table.
Because an explicit SKIP or OFFSET function is not available in SQL Server 2008, we need to create one, and we can do that with an INNER JOIN.
In the first query we generate only the row number together with OrderDate, and nothing else.
We do the same in the second query, but there we also select the other columns we are interested in from the Orders table (or all of them, if you need every column).
Then we JOIN the two result sets on the row number and OrderDate, and add the skip-rows filter to the first query, where the data set is at its minimal required size.
Try this code.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, OrderDate
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
)q2
ON q1.RowNum=q2.RowNum AND q1.OrderDate=q2.OrderDate AND q1.rownum BETWEEN 30000 AND 30020
To give you an estimate, I tried this with the following test data, and no matter what window you query, the results come back in less than 2 seconds. Note that the table is a HEAP (no index) and has 2M rows in total. The test select queries 10 rows, from 50,000 to 50,010.
The insert below took around 8 minutes.
IF object_id('TestSelect','u') IS NOT NULL
DROP TABLE TestSelect
GO
CREATE TABLE TestSelect
(
OrderDate DATETIME2(2)
)
GO
DECLARE @i bigint=1, @dt DATETIME2(2)='01/01/1700'
WHILE @i<=2000000
BEGIN
IF @i%15 = 0
SELECT @dt = DATEADD(DAY,1,@dt)
INSERT INTO dbo.TestSelect( OrderDate )
SELECT @dt
SELECT @i=@i+1
END
Selecting the window 50,000 to 50,010 took less than 3 seconds.
Selecting the last single row 2,000,000 to 2,000,000 also took 3 seconds.
SELECT q2.*
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,OrderDate
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q1
INNER JOIN
(
SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
,*
FROM TestSelect
WHERE OrderDate >= '1700-01-01'
)q2
ON q1.RowNum=q2.RowNum
AND q1.OrderDate=q2.OrderDate
AND q1.RowNum BETWEEN 50000 AND 50010
ROW_NUMBER is a crappy way of doing pagination, as the cost of the operation grows considerably.
Instead, you should use a double ORDER BY clause.
Say you want to get records with ROW_NUMBER between 1200 and 1210. Instead of using ROW_NUMBER() OVER (...) and later binding the result in WHERE you should rather:
SELECT TOP(11) *
FROM (
SELECT TOP(1210) *
FROM [...]
ORDER BY something ASC
) subQuery
ORDER BY something DESC
Note that this query will give the result in reverse order. That shouldn't, generally speaking, be an issue, as it's easy to reverse the set in the UI (e.g. in C#), especially as the resulting set should be relatively small.
The latter is generally a lot faster. Note that the latter solution will be greatly improved by CLUSTERING (CREATE CLUSTERED INDEX ...) on the column you use to sort the query by.
Hope that helps.
Even though you're always selecting the same number of rows, performance degrades when you want to select rows at the end of your data window. To get the first 10 rows, the engine fetches just 10 rows; to get the next 10 it has to fetch 20, discard the first 10, and return 10. To get rows 30000-30010, it has to read all 30010, skip the first 30k, and return 10.
Some tricks to improve performance (not a full list; building OLAP is completely skipped):
You mentioned joins; if possible, join not inside the inner query but to the result of it. You can also try to add some logic to ORDER BY OrderDate: ASC or DESC depending on which bucket you are retrieving. Say you want to grab the "last" 10; ORDER BY ... DESC will work much faster. Needless to say, there has to be an index on OrderDate.
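A rough sketch of that last idea against the Orders table from the question (the window values are made up, just to show the direction flip):
-- Fetch a window near the END of the data: number the rows from the other
-- side so only a small prefix has to be materialised.
SELECT *
FROM ( SELECT ROW_NUMBER() OVER ( ORDER BY OrderDate DESC ) AS RowNumFromEnd, *
FROM Orders
WHERE OrderDate >= '1980-01-01'
) AS q
WHERE q.RowNumFromEnd BETWEEN 1 AND 10 -- the "last" 10 rows
ORDER BY q.OrderDate ASC -- present them in ascending order again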
Incredibly, no other answer has mentioned the fastest way to do paging in all SQL Server versions, specifically with respect to the OP's question where offsets can be terribly slow for large page numbers as is benchmarked here.
There is an entirely different, much faster way to perform paging in SQL. This is often called the "seek method" as described in this blog post here.
SELECT TOP 10 *
FROM Orders
WHERE OrderDate >= '1980-01-01'
AND ((OrderDate > @previousOrderDate)
OR (OrderDate = @previousOrderDate AND OrderId > @previousOrderId))
ORDER BY OrderDate ASC, OrderId ASC
The @previousOrderDate and @previousOrderId values are the respective values of the last record from the previous page. This allows you to fetch the "next" page. If the ORDER BY direction is DESC, simply use < instead.
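For example, a descending variant might look like this (just a sketch; @previousOrderDate and @previousOrderId still hold the values from the last row of the page just shown):
SELECT TOP 10 *
FROM Orders
WHERE OrderDate >= '1980-01-01'
AND ((OrderDate < @previousOrderDate)
OR (OrderDate = @previousOrderDate AND OrderId < @previousOrderId))
ORDER BY OrderDate DESC, OrderId DESC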
With the above method, you cannot immediately jump to page 4 without having first fetched the previous 40 records. But often, you do not want to jump that far anyway. Instead, you get a much faster query that might be able to fetch data in constant time, depending on your indexing. Plus, your pages remain "stable", no matter if the underlying data changes (e.g. on page 1, while you're on page 4).
This is the best way to implement paging when lazy loading more data in web applications, for instance.
Note, the "seek method" is also called keyset paging.
declare @pageOffset int
declare @pageSize int
-- set variables at some point
declare @startRow int
set @startRow = @pageOffset * @pageSize
declare @endRow int
set @endRow = @startRow + @pageSize - 1
SELECT
o.*
FROM
(
SELECT
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum
, OrderId
FROM
Orders
WHERE
OrderDate >= '1980-01-01'
) q1
INNER JOIN Orders o
on q1.OrderId = o.OrderId
where
q1.RowNum between @startRow and @endRow
order by
o.OrderDate
@peru, regarding whether there is a better way, and to build on the explanation provided by @a1ex07, try the following.
If the table has a unique identifier such as a numeric (order-id) or (order-date, order-index) upon which a compare (greater-than, less-than) operation can be performed then use that as an offset instead of the row-number.
For example if the table orders has 'order_id' as primary-key then -
To get the first ten results -
1.
select RowNum, order_id from
( select
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
o.order_id
from orders o where o.order_id > 0
)
tmp_qry where RowNum between 1 and 10 order by RowNum; -- first 10
Assuming that the last order-id returned was 17 then,
To select the next 10,
2.
select RowNum, order_id from
( select
ROW_NUMBER() OVER ( ORDER BY OrderDate ) AS RowNum,
o.order_id
from orders o where o.order_id > 17
)
tmp_qry where RowNum between 1 and 10 order by RowNum; -- next 10
Note that the row-num values have not been changed; it's the order-id value being compared that has changed.
If such a key is not present, then consider adding one!
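A minimal sketch of adding such a key (the column and constraint names here are made up for illustration):
ALTER TABLE orders ADD order_id BIGINT IDENTITY(1,1) NOT NULL;
ALTER TABLE orders ADD CONSTRAINT UQ_orders_order_id UNIQUE (order_id);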
The main drawback of your query is that it sorts the whole table and calculates Row_Number for every query. You can make life easier for SQL Server by using fewer columns at the sorting stage (for example, as suggested by Anup Shah). However, you still make it read, sort and calculate row numbers for every query.
An alternative to calculating on the fly is reading values that were calculated before.
Depending on the volatility of your dataset and the number of columns used for sorting and filtering, you can consider:
Adding a row-number column (or 2-3 of them) and including it as the first column in the clustered index, or creating a non-clustered index on it.
Creating views for the most frequent combinations and then indexing those views. These are called indexed (materialised) views.
This will allow the row number to simply be read, so performance will hardly depend on volume. Maintaining these will have a cost, but less than sorting the whole table for each query.
Note that if this is a one-off query, or it runs infrequently compared to all other queries, it is better to stick with query optimisation only: the effort to create extra columns/views might not pay off.

Table-Valued function - Order by is ignored in output

We are moving from SQL Server 2008 to SQL Server 2012 and immediately noticed that all our table-valued functions no longer deliver their temp table contents in the correctly sorted order.
CODE:
INSERT INTO @Customer
SELECT Customer_ID, Name,
CASE
WHEN Expiry_Date < GETDATE() then 1
WHEN Expired = 1 then 1
ELSE 0
END
from Customer order by Name
In SQL Server 2008 this function returns the customers sorted by Name. In SQL Server 2012 it returns the table unsorted. The "order by" is ignored in SQL 2012.
Do we have to re-write all the functions to include a sort_id and then sort them when they are called in the main application or is there an easy fix??
There were two things wrong with your original approach.
On inserting to the table it was never guaranteed that the ORDER BY on the INSERT ... SELECT ... ORDER BY would be the order that the rows were actually inserted.
On selecting from it SQL Server does not guarantee that SELECT without an ORDER BY will return the rows in any particular order such as insertion order anyway.
In 2012 it looks as though the behaviour has changed with respect to item 1. It now generally ignores the ORDER BY on the SELECT statement that is the source for an INSERT:
DECLARE @T TABLE(number int)
INSERT INTO @T
SELECT number
FROM master..spt_values
ORDER BY name
2008 Plan
2012 Plan
The reason for the change of behaviour is that in previous versions SQL Server produced one plan that was shared between executions with SET ROWCOUNT 0 (off) and SET ROWCOUNT N. The sort operator was only there to ensure the correct semantics in case the plan was run by a session with a non-zero ROWCOUNT set. The TOP operator to the left of it is a ROWCOUNT TOP.
SQL Server 2012 now produces separate plans for the two cases so there is no need to add these to the ROWCOUNT 0 version of the plan.
A sort may still appear in the plan in 2012 if the SELECT has an explicit TOP defined (other than TOP 100 PERCENT) but this still doesn't guarantee actual insertion order of rows, the plan might then have another sort after the TOP N is established to get the rows into clustered index order for example.
For the example in your question I would just adjust the calling code to specify ORDER BY name if that is what it requires.
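For example (the function name here is hypothetical, since the question doesn't show how the TVF is declared):
SELECT *
FROM dbo.CustomerList() -- hypothetical table-valued function wrapping the code above
ORDER BY Name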
Regarding your sort_id idea: per "Ordering guarantees in SQL Server", when inserting into a table with an IDENTITY column it is guaranteed that the identity values are allocated in the order given by the ORDER BY, so you could also do
DECLARE @Customer TABLE (
Sort_Id INT IDENTITY PRIMARY KEY,
Customer_ID INT,
Name NVARCHAR(100),
Expired BIT )
INSERT INTO @Customer
SELECT Customer_ID,
Name,
CASE
WHEN Expiry_Date < Getdate() THEN 1
WHEN Expired = 1 THEN 1
ELSE 0
END
FROM Customer
ORDER BY Name
but you would still need to order by the sort_id in your selecting queries as there is no guaranteed ordering without that (perhaps this sort_id approach might be useful in the case where the original columns used for ordering aren't being copied into the table variable)
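For example, when reading the table variable back:
SELECT Customer_ID, Name, Expired
FROM @Customer
ORDER BY Sort_Id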
Add a column named rowno to the @Customer table:
INSERT INTO @Customer
SELECT ROW_NUMBER() OVER (ORDER BY Name) AS rowno, Customer_ID, Name,
CASE
WHEN Expiry_Date < GETDATE() then 1
WHEN Expired = 1 then 1
ELSE 0
END
from Customer

SQLServer SQL query with a row counter

I have a SQL query, that returns a set of rows:
SELECT id, name FROM users where group = 2
I need to also include a column that has an incrementing integer value, so the first row needs to have a 1 in the counter column, the second a 2, the third a 3 etc
The query shown here is just a simplified example, in reality the query could be arbitrarily complex, with several joins and nested queries.
I know this could be achieved using a temporary table with an autonumber field, but is there a way of doing it within the query itself ?
For starters, something along the lines of:
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
However, it's important to note that the ROW_NUMBER() OVER (ORDER BY ...) construct only determines the values of Row_Counter, it doesn't guarantee the ordering of the results.
Unless the SELECT itself has an explicit ORDER BY clause, the results could be returned in any order, dependent on how SQL Server decides to optimise the query. (See this article for more info.)
The only way to guarantee that the results will always be returned in Row_Counter order is to apply exactly the same ordering to both the SELECT and the ROW_NUMBER():
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY my_order_column) AS Row_Counter
FROM my_table
ORDER BY my_order_column -- exact copy of the ordering used for Row_Counter
The above pattern will always return results in the correct order and works well for simple queries, but what about an "arbitrarily complex" query with perhaps dozens of expressions in the ORDER BY clause? In those situations I prefer something like this instead:
SELECT t.*
FROM
(
SELECT my_first_column, my_second_column,
ROW_NUMBER() OVER (ORDER BY ...) AS Row_Counter -- complex ordering
FROM my_table
) AS t
ORDER BY t.Row_Counter
Using a nested query means that there's no need to duplicate the complicated ORDER BY clause, which means less clutter and easier maintenance. The outer ORDER BY t.Row_Counter also makes the intent of the query much clearer to your fellow developers.
In SQL Server 2005 and up, you can use the ROW_NUMBER() function, which has options for the sort order and the groups over which the counts are done (and reset).
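For example, against the users table from the question, a sketch of a counter that restarts for each group (the PARTITION BY is optional if you just want a single running count):
SELECT id, name,
ROW_NUMBER() OVER (PARTITION BY [group] ORDER BY id) AS Row_Counter
FROM users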
The simplest way is to use a variable row counter. However it would be two actual SQL commands. One to set the variable, and then the query as follows:
SET @n=0;
SELECT @n:=@n+1, a.* FROM tablename a
Your query can be as complex as you like with joins etc. I usually make this a stored procedure. You can have all kinds of fun with the variable, even use it to calculate against field values. The key is the :=
Here's a different approach.
If you have several tables of data that are not joinable, or you for some reason don't want to count all the rows at the same time but still want them to be part of the same row count, you can create a table that does the job for you.
Example:
create table #test (
rowcounter int identity,
invoicenumber varchar(30)
)
insert into #test(invoicenumber) select [column] from [Table1]
insert into #test(invoicenumber) select [column] from [Table2]
insert into #test(invoicenumber) select [column] from [Table3]
select * from #test
drop table #test

SQL Duplicate Delete Query over Millions of Rows for Performance

This has been an adventure. I started with the looping duplicate query located in my previous question, but each loop would go over all 17 million records, meaning it would take weeks (just running select count(*) from MyTable takes my server 4:30 minutes using MSSQL 2005). I gleaned information from this site and from this post.
I have arrived at the query below. The question is: is this the correct type of query to run on 17 million records for any kind of performance? If it isn't, what is?
SQL QUERY:
DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID in
(SELECT RecordID
FROM tl_acxiomimport.dbo.tblacxiomlistings
EXCEPT
SELECT RecordID
FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al WHERE Rank = 1)
Seeing the QueryPlan would help.
Is this feasible?
SELECT m.*
into #temp
FROM tl_acxiomimport.dbo.tblacxiomlistings m
inner join (SELECT RecordID,
Rank() over (Partition BY BusinessName,
latitude,
longitude,
Phone
ORDER BY webaddress DESC,
caption1 DESC,
caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al on (al.RecordID = m.RecordID and al.Rank = 1)
truncate table tl_acxiomimport.dbo.tblacxiomlistings
insert into tl_acxiomimport.dbo.tblacxiomlistings
select * from #temp
Something's up with your DB, server, storage or some combination thereof. 4:30 for a select count(*) seems VERY high.
Run DBCC SHOWCONTIG to see how fragmented your table is; this could cause a major performance hit on a table that size.
Also, to add to the comment by RyanKeeter: run the show plan, and if there are any table scans, create an index on the PK field of that table.
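Something along these lines, assuming RecordID uniquely identifies rows and isn't already the primary key (the constraint name is made up):
ALTER TABLE tl_acxiomimport.dbo.tblacxiomlistings
ADD CONSTRAINT PK_tblacxiomlistings PRIMARY KEY CLUSTERED (RecordID);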
Wouldn't it be simpler to do:
DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID in
(SELECT RecordID
FROM (
SELECT RecordID,
Rank() over (Partition BY BusinessName,
latitude,
longitude,
Phone
ORDER BY webaddress DESC,
caption1 DESC,
caption2 DESC) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al
WHERE Rank > 1
)
Run this in query analyzer:
SET SHOWPLAN_TEXT ON
Then ask query analyzer to run your query. Instead of running the query, SQL Server will generate a query plan and put it in the result set.
Show us the query plan.
17 million records is nothing. If it takes 4:30 to just do a select count(*) then there is a serious problem, probably related to either lack of memory in the server or a really old processor.
For performance, fix the machine. Pump it up to 2GB. RAM is so cheap these days that its cost is far less than your time.
Is the processor or disk thrashing when that query is going? If not, then something is blocking the calls. In that case you might consider putting the database in single user mode for the amount of time it takes to run the cleanup.
So you're deleting all the records that aren't ranked first? It might be worth comparing this against a join on a TOP 1 subquery (which might also work in 2000, as RANK is 2005 and above only).
Do you need to remove all the duplicates in a single operation? I assume that you're performing some sort of housekeeping task; you might be able to do it piece-wise.
Basically create a cursor that loops all the records (dirty read) and removes dupes for each. It'll be a lot slower overall, but each operation will be relatively minimal. Then your housekeeping becomes a constant background task rather than a nightly batch.
The suggestion above to select into a temporary table first is your best bet. You could also use something like:
set rowcount 1000
before running your delete. It will stop running after it deletes the 1000 rows. Then run it again and again until you get 0 records deleted.
If I get it correctly, your query is the same as:
DELETE allRecords
FROM
tl_acxiomimport.dbo.tblacxiomlistings allRecords
LEFT JOIN (
SELECT RecordID FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) ranked WHERE Rank = 1) myExceptions
ON allRecords.RecordID = myExceptions.RecordID
WHERE
myExceptions.RecordID IS NULL
I think that should run faster; I tend to avoid using the IN clause in favor of JOINs where possible.
You can actually test the speed and the results safely by simply calling SELECT * or SELECT COUNT(*) on the FROM part like e.g.
SELECT *
FROM
tl_acxiomimport.dbo.tblacxiomlistings allRecords
LEFT JOIN (
SELECT RecordID FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) ranked WHERE Rank = 1) myExceptions
ON allRecords.RecordID = myExceptions.RecordID
WHERE
myExceptions.RecordID IS NULL
That is another reason why I would prefer the JOIN approach
I hope that helps
This looks fine but you might consider selecting your data into a temporary table and using that in your delete statement. I've noticed huge performance gains from doing this instead of doing it all in that one query.
Remember, when doing a large delete it is best to have a good backup first. (And I also usually copy the deleted records to another table, just in case I need to recover them right away.)
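One way to keep such a copy as part of the delete itself is the OUTPUT clause (available in 2005); a sketch, assuming an archive table with matching columns already exists:
DELETE tl_acxiomimport.dbo.tblacxiomlistings
OUTPUT DELETED.* INTO tl_acxiomimport.dbo.tblacxiomlistings_deleted -- hypothetical archive table
WHERE RecordID in
(SELECT RecordID
FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al WHERE Rank > 1)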
Other than using truncate as suggested, I've had the best luck using this template for deleting lots of rows from a table. I don't remember off hand, but I think using the transaction helped to keep the log file from growing -- may have been another reason though -- not sure. And I usually switch the transaction logging method over to simple before doing something like this:
SET ROWCOUNT 5000
WHILE 1 = 1
BEGIN
begin tran
DELETE FROM ??? WHERE ???
IF @@rowcount = 0
BEGIN
COMMIT
BREAK
END
COMMIT
END
SET ROWCOUNT 0