SQL Duplicate Delete Query over Millions of Rows for Performance

This has been an adventure. I started with the looping duplicate query from my previous question, but each loop would go over all 17 million records, meaning it would take weeks (just running select count(*) from MyTable takes my server 4 minutes 30 seconds on MSSQL 2005). I gleaned information from this site and from this post.
And have arrived at the query below. The question is, is this the correct type of query to run on 17 million records for any type of performance? If it isn't, what is?
SQL QUERY:
DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID in
(SELECT RecordID
FROM tl_acxiomimport.dbo.tblacxiomlistings
EXCEPT
SELECT RecordID
FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al WHERE Rank = 1)

Seeing the QueryPlan would help.
Is this feasible?
SELECT m.*
into #temp
FROM tl_acxiomimport.dbo.tblacxiomlistings m
inner join (SELECT RecordID,
Rank() over (Partition BY BusinessName,
latitude,
longitude,
Phone
ORDER BY webaddress DESC,
caption1 DESC,
caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al on (al.RecordID = m.RecordID and al.Rank = 1)
truncate table tl_acxiomimport.dbo.tblacxiomlistings
insert into tl_acxiomimport.dbo.tblacxiomlistings
select * from #temp

Something's up with your DB, server, storage, or some combination thereof. 4:30 for a select count(*) seems VERY high.
Run DBCC SHOWCONTIG to see how fragmented your table is; fragmentation like that can cause a major performance hit on a table of this size.
Also, to add on to the comment by RyanKeeter, run the show plan, and if there are any table scans, create an index on the PK field for that table.
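A minimal sketch of both checks, assuming RecordID is the unique PK column from the question:
USE tl_acxiomimport
-- report fragmentation for the table (SQL Server 2005 syntax)
DBCC SHOWCONTIG ('dbo.tblacxiomlistings')
-- if the plan shows table scans, index the PK field
CREATE UNIQUE CLUSTERED INDEX IX_tblacxiomlistings_RecordID
    ON dbo.tblacxiomlistings (RecordID)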

Wouldn't it be simpler to do:
DELETE tl_acxiomimport.dbo.tblacxiomlistings
WHERE RecordID in
(SELECT RecordID
FROM (
SELECT RecordID,
Rank() over (Partition BY BusinessName,
latitude,
longitude,
Phone
ORDER BY webaddress DESC,
caption1 DESC,
caption2 DESC) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) al
WHERE Rank > 1
)

Run this in Query Analyzer:
SET SHOWPLAN_TEXT ON
Then ask query analyzer to run your query. Instead of running the query, SQL Server will generate a query plan and put it in the result set.
Show us the query plan.
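For example, a minimal sketch of the sequence (each SET must be in its own batch; the DELETE below just stands in for the full query from the question):
SET SHOWPLAN_TEXT ON
GO
-- the statement below is compiled but not executed while SHOWPLAN is on
DELETE tl_acxiomimport.dbo.tblacxiomlistings WHERE RecordID IN (...)
GO
SET SHOWPLAN_TEXT OFF
GO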

17 million records is nothing. If it takes 4:30 to just do a select count(*) then there is a serious problem, probably related to either lack of memory in the server or a really old processor.
For performance, fix the machine. Pump it up to 2GB. RAM is so cheap these days that its cost is far less than your time.
Is the processor or disk thrashing when that query is going? If not, then something is blocking the calls. In that case you might consider putting the database in single user mode for the amount of time it takes to run the cleanup.
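If you go the single-user route, a minimal sketch (standard T-SQL; the database name is taken from the question):
ALTER DATABASE tl_acxiomimport SET SINGLE_USER WITH ROLLBACK IMMEDIATE
-- ... run the duplicate cleanup here ...
ALTER DATABASE tl_acxiomimport SET MULTI_USER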

So you're deleting all the records that aren't ranked first? It might be worth comparing that against a join on a TOP 1 subquery (which would also work in 2000, as RANK is 2005 and above only).
Do you need to remove all the duplicates in a single operation? I assume you're performing some sort of housekeeping task; you might be able to do it piece-wise.
Basically, create a cursor that loops over all the records (dirty read) and removes dupes for each. It'll be a lot slower overall, but each operation will be relatively small. Then your housekeeping becomes a constant background task rather than a nightly batch.
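A rough sketch of that piece-wise approach, assuming the duplicate key columns from the question (the variable types are guesses):
DECLARE @name varchar(255), @lat decimal(9,6), @lon decimal(9,6), @phone varchar(20)

DECLARE dupes CURSOR FOR
    SELECT BusinessName, latitude, longitude, Phone
    FROM tl_acxiomimport.dbo.tblacxiomlistings WITH (NOLOCK)  -- dirty read
    GROUP BY BusinessName, latitude, longitude, Phone
    HAVING COUNT(*) > 1

OPEN dupes
FETCH NEXT FROM dupes INTO @name, @lat, @lon, @phone
WHILE @@FETCH_STATUS = 0
BEGIN
    -- keep the top-ranked record in this group, delete the rest
    DELETE t
    FROM tl_acxiomimport.dbo.tblacxiomlistings t
    WHERE t.BusinessName = @name AND t.latitude = @lat
      AND t.longitude = @lon AND t.Phone = @phone
      AND t.RecordID <> (SELECT TOP 1 RecordID
                         FROM tl_acxiomimport.dbo.tblacxiomlistings
                         WHERE BusinessName = @name AND latitude = @lat
                           AND longitude = @lon AND Phone = @phone
                         ORDER BY webaddress DESC, caption1 DESC, caption2 DESC)
    FETCH NEXT FROM dupes INTO @name, @lat, @lon, @phone
END
CLOSE dupes
DEALLOCATE dupes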

The suggestion above to select into a temporary table first is your best bet. You could also use something like:
set rowcount 1000
before running your delete. It will stop running after it deletes the 1000 rows. Then run it again and again until you get 0 records deleted.

If I get it correctly, your query is the same as:
DELETE allRecords
FROM
tl_acxiomimport.dbo.tblacxiomlistings allRecords
LEFT JOIN (
SELECT RecordID
FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) ranked
WHERE Rank = 1) myExceptions
ON allRecords.RecordID = myExceptions.RecordID
WHERE
myExceptions.RecordID IS NULL
I think that should run faster; I tend to avoid using the "IN" clause in favor of JOINs where possible.
You can actually test the speed and the results safely by simply running a SELECT * or SELECT COUNT(*) with the same FROM part, e.g.:
SELECT *
FROM
tl_acxiomimport.dbo.tblacxiomlistings allRecords
LEFT JOIN (
SELECT RecordID
FROM (
SELECT RecordID, Rank() over (Partition BY BusinessName, latitude, longitude, Phone ORDER BY webaddress DESC, caption1 DESC, caption2 DESC ) AS Rank
FROM tl_acxiomimport.dbo.tblacxiomlistings
) ranked
WHERE Rank = 1) myExceptions
ON allRecords.RecordID = myExceptions.RecordID
WHERE
myExceptions.RecordID IS NULL
That is another reason why I would prefer the JOIN approach
I hope that helps

This looks fine but you might consider selecting your data into a temporary table and using that in your delete statement. I've noticed huge performance gains from doing this instead of doing it all in that one query.

Remember, when doing a large delete it is best to have a good backup first. (I also usually copy the deleted records to another table, just in case I need to recover them right away.)
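On SQL Server 2005 the OUTPUT clause can do the copy and the delete in one statement; a minimal sketch, assuming an archive table with the same structure has been created beforehand:
-- e.g. created once with: SELECT * INTO tblacxiomlistings_deleted FROM tl_acxiomimport.dbo.tblacxiomlistings WHERE 1 = 0
DELETE tl_acxiomimport.dbo.tblacxiomlistings
OUTPUT DELETED.* INTO tblacxiomlistings_deleted
WHERE RecordID IN (...)  -- same predicate as the dedup query above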

Other than using truncate as suggested, I've had the best luck using this template for deleting lots of rows from a table. I don't remember offhand, but I think using the transaction helped keep the log file from growing -- it may have been for another reason though, not sure. And I usually switch the recovery model over to simple before doing something like this:
SET ROWCOUNT 5000
WHILE 1 = 1
BEGIN
    BEGIN TRAN
    DELETE FROM ??? WHERE ???
    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT
        BREAK
    END
    COMMIT
END
SET ROWCOUNT 0
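On 2005 and later the same chunking can be written with DELETE TOP (n), which avoids SET ROWCOUNT (deprecated for data modification statements in later versions); a sketch of the equivalent loop:
WHILE 1 = 1
BEGIN
    DELETE TOP (5000) FROM ??? WHERE ???
    IF @@ROWCOUNT = 0 BREAK
END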

Related

Reverse initial order of SELECT statement

I want to run a SQL query in Postgres that is exactly the reverse of the one that you'd get by just running the initial query without an order by clause.
So if your query was:
SELECT * FROM users
Then
SELECT * FROM users ORDER BY <something here to make it exactly the reverse of before>
Would it just be this?
ORDER BY Desc
You are building on the incorrect assumption that you would get rows in a deterministic order with:
SELECT * FROM users;
What you get is really arbitrary. Postgres returns rows in any way it sees fit. For simple queries typically in order of their physical storage, which typically is the order in which rows were entered. But there are no guarantees, and the order may change any time between two calls. For instance after any UPDATE (writing a new physical row version), or when any background process reorders rows - like VACUUM. Or a more complex query might return rows according to an index or a join. Long story short: there is no reliable order for table rows in a relational database unless you specify it with ORDER BY.
That said, assuming you get rows from the above simple query in the order of physical storage, this would get you the reverse order:
SELECT * FROM users
ORDER BY ctid DESC;
ctid is the internal tuple ID signifying physical order. Related:
In-order sequence generation
How list all tables with data changes in the last 24 hours?
Here is a T-SQL solution; this might give you an idea of how to do it in Postgres:
select * from (
SELECT *, row_number() over( order by (select 1)) rowid
FROM users
) x
order by rowid desc

SQL stored proc runs extremely slow when filtering by high row numbers

This query is generated from a very long dynamic SQL stored procedure -- the procedure returns the requested number of records starting at a given index, to be displayed in a Telerik RadGrid, effectively handling paging. A simplified version of the stored proc's output:
SELECT r.* FROM (
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row,
v.* FROM vInventorySearch v
) as R WHERE [ROW] BETWEEN 1 AND 10
When the "BETWEEN" clause is between 1 and 10, it runs in a fraction of a second, but if it's between something like 10000 and 1010 it takes almost a full minute to execute.
I feel like I may be missing something fundamental here, but it seems to me that it shouldn't matter which 10 records I'm retrieving, it should take the same amount of time.
Thanks for any input, I'm looking forward to being embarrassed!
Solution, courtesy of Martin Smith (below):
SELECT r.*, inv.* FROM
(
SELECT ROW_NUMBER() OVER(ORDER BY InventoryId DESC) as row, v.InventoryID
FROM vInventorySearch v
WHERE 1=1
) as R
inner join vInventory inv on r.InventoryID = inv.InventoryID
WHERE [ROW] BETWEEN 10001 AND 10010
Thanks for your help!
Paginating by ROW_NUMBER can indeed be pretty inefficient for higher row numbers.
Sometimes it is better to break it up a bit: run the ROW_NUMBER query against a narrow index to retrieve just the matching PKs, with a join back onto the base table to retrieve the missing columns.
SQL Server 2012 has a more efficient paging mechanism:
http://stevestedman.com/2012/04/tsql-2012-offset-and-fetch/
SELECT DepartmentID, Revenue, Year
FROM REVENUE
ORDER BY Year, DepartmentID ASC
OFFSET 10 ROWS FETCH NEXT 10 ROWS ONLY;

Set-based alternative to loop in SQL Server

I know that there are several posts about how BAD it is to loop in SQL Server in a stored procedure. But I haven't quite found what I am trying to do. We are using data connectivity that can be linked directly into Excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs who have orders right before an event of type 38 or 40, but only if there is no other order between the event and the order from the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
Select into temp1 odate, custId from orders where odate>'5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
Select into temp2 eventdate, temp1.custID from LogEvent inner join temp1 on
temp1.custID=LogEvent.custID where EventType in (38,40) and temp1.odate>eventdate
order by eventdate desc
The problem here is that the queries I am trying to run will return all rows for each of the customers from the first query, where I only want the latest for each customer. This is where, on the client side, I would loop to get only one event instead of all the old ones. But as the whole query has to run inside of Excel, I can't really loop client side.
The third step then could use the results from the second query to check whether the event occurred between the most recent order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
Select ordernum, shopcart.custID from shopcart right outer join temp2 on
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null
Is there a way to simplify this and make it set-based to run in SQL Server, instead of some kind of loop that I perform at the client?
This is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, having a single query lets the query optimizer do what it does best -- determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (Select eventdate, temp1.custID,
row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
from LogEvent inner join
(Select odate, custId
from order
where odate>'5/1/12'
) temp1
on temp1.custID=LogEvent.custID
where EventType in (38,40) and temp1.odate>eventdate
) temp2 left outer join
ShopCart
on shopcart.custID=temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't, it is bad practice to name tables and columns with reserved SQL words.
If you are using a newer version of SQL Server, you can use the ROW_NUMBER function. Here is an example:
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK; however, RANK will produce duplicate numbers for ties, whereas ROW_NUMBER guarantees no duplicate numbers within your partition.
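A small sketch of the difference on ties, using the tables from the question: with two events at the same eventdate for one customer, RANK gives both rows 1 (so both would survive a "= 1" filter), while ROW_NUMBER gives them 1 and 2.
SELECT custID, eventdate,
       ROW_NUMBER() OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rn,
       RANK()       OVER (PARTITION BY custID ORDER BY eventdate DESC) AS rnk
FROM LogEvent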

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1)
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when no index is on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The greatest two costs in your query are the re-ordering for the GROUP BY (due to lack of appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY fieldC) AS sequence_id
FROM
yourTable
)
SELECT
COUNT(*)
FROM
sequenced_data
WHERE
sequence_id = (n+1)
Assumes SQL Server 2005+.
Without indexes, the GROUP BY solution is the best; every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found the query is finished (though the worst case will take as long as the original approach). I have to say, though, that this is a somewhat imperative way of thinking -- not sure if it's correct...
Speed and "without indexes" almost never go together.
Athough as others here have mentioned I seriously doubt that it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.
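On the "without indexes" point above: if adding one is possible at all, even a narrow nonclustered index on c lets the GROUP BY be satisfied by an ordered scan instead of a sort. A minimal sketch, using the question's bracketed placeholders:
CREATE NONCLUSTERED INDEX IX_table_c ON [table] (c)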

How to read the last row with SQL Server

What is the most efficient way to read the last row with SQL Server?
The table is indexed on a unique key -- the "bottom" key values represent the last row.
If you're using MS SQL, you can try:
SELECT TOP 1 * FROM table_Name ORDER BY unique_column DESC
select whatever,columns,you,want from mytable
where mykey=(select max(mykey) from mytable);
You'll need some sort of uniquely identifying column in your table, like an auto-incrementing primary key or a datetime column (preferably the primary key). Then you can do this:
SELECT * FROM table_name ORDER BY unique_column DESC LIMIT 1
The ORDER BY clause tells it to rearrange the results according to that column's data, and DESC tells it to reverse the results (thus putting the last one first). After that, LIMIT 1 tells it to pass back only one row.
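Note that LIMIT is MySQL/PostgreSQL syntax and SQL Server doesn't recognize it; the same idea in T-SQL uses TOP, as in the first answer above:
SELECT TOP 1 * FROM table_name ORDER BY unique_column DESC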
If your IDs are in order, I am assuming there will be some order in your DB:
SELECT * FROM TABLE WHERE ID = (SELECT MAX(ID) FROM TABLE)
I think the query below will work for SQL Server with maximum performance, without any sortable column:
SELECT * FROM table
WHERE ID NOT IN (SELECT TOP (SELECT COUNT(1) - 1 FROM table) ID
                 FROM table)
Hope you have understood it... :)
I tried using LAST in a SQL query on SQL Server 2008, but it gives this error:
" 'last' is not a recognized built-in function name."
So I ended up using :
select max(WorkflowStateStatusId) from WorkflowStateStatus
to get the Id of the last row.
One could also use
Declare @i int
set @i=1
select WorkflowStateStatusId from Workflow.WorkflowStateStatus
where WorkflowStateStatusId not in (select top (
(select count(*) from Workflow.WorkflowStateStatus) - @i ) WorkflowStateStatusId from Workflow.WorkflowStateStatus)
You can use LAST_VALUE: SELECT LAST_VALUE(column) OVER (PARTITION BY column ORDER BY column)...
I tested it on one of my databases and it worked as expected.
You can also check the documentation here: https://msdn.microsoft.com/en-us/library/hh231517.aspx
OFFSET and FETCH NEXT are features of SQL Server 2012 for paging through results.
The OFFSET argument decides the starting row from which to return rows, and the FETCH argument returns a set number of rows:
SELECT *
FROM table_name
ORDER BY unique_column desc
OFFSET 0 ROWS
FETCH NEXT 1 ROW ONLY
SELECT TOP 1 id from comission_fees ORDER BY id DESC
To retrieve the last row of a table in an MS SQL Server 2005 database, you can use the following query:
select top 1 column_name from table_name order by column_name desc;
Note: To get the first row of the table, you can use the following query:
select top 1 column_name from table_name;
If you don't have any ordered column, you can use the physical ID of each row:
SELECT top 1 sys.fn_PhysLocFormatter(%%physloc%%) AS [File:Page:Slot],
T.*
FROM MyTable As T
order by sys.fn_PhysLocFormatter(%%physloc%%) DESC
SELECT * from Employees where [Employee ID] = ALL (SELECT MAX([Employee ID]) from Employees)
This is how you get the last record and update a field in an Access DB:
UPDATE compalints SET tkt = addzone &'-'& customer_code &'-'& sn where sn in (select max(sn) from compalints )
If you have a replicated table, you can have Identity=1000 in the local database and Identity=2000 in the client database, so if you grab the last ID you may always get the one from the client, not the last from the currently connected database.
So the best method, which returns the last identity for the connected database, is:
SELECT IDENT_CURRENT('tablename')
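For comparison, a quick sketch of the three identity functions ('tablename' is a placeholder):
SELECT IDENT_CURRENT('tablename') -- last identity generated for that table, by any session
SELECT @@IDENTITY                 -- last identity in the current session, any scope (triggers included)
SELECT SCOPE_IDENTITY()           -- last identity in the current session and scope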
Well, I'm not getting the "last value" in a table; I'm getting the last value per financial instrument. It's not the same, but I guess it is relevant for some who are looking for "how it is done now". I also used ROW_NUMBER() and CTEs, and before that simply TOP 1 with ORDER BY [column] DESC. However, we no longer need to...
I am using SQL Server 2017. We record all ticks on all exchanges globally, ~12 billion ticks a day; we store each bid, ask, and trade, including the volumes and attributes of a tick (bid, ask, trade), for any given exchange.
We have 253 types of tick data for any given contract (mostly statistics) in that table; the last traded price is tick type = 4, so when we need to get the "last" price we use:
select distinct T.contractId,
LAST_VALUE(t.Price)over(partition by t.ContractId order by created ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
from [dbo].[Tick] as T
where T.TickType=4
You can see from the execution plan on my dev system that it executes quite efficiently: it runs in 4 seconds while the exchange-import ETL is pumping data into the table. There will be some locking slowing me down... that's just how live systems work.
It is very simple:
select top 10 * from TableName order by 1 desc
I am pretty sure that it is:
SELECT last(column_name) FROM table
Because I use something similar:
SELECT last(id) FROM Status