Time complexity: UPDATE ... WHERE vs UPDATE ALL - sql

I have a database table DuplicatesRemoved with a possibly large number of records. I run operations to remove duplicate users in my application, and every time I remove duplicates, I keep track of the duplicate UserIDs by storing them in this table DuplicatesRemoved.
This table contains a bit field HistoryRecord. I need to update this field at the end of every "RemoveDuplicates" operation.
I do NOT have any indexes on DuplicatesRemoved.
I am wondering which of these would be better?
1.
UPDATE DuplicatesRemoved SET HistoryRecord=1 WHERE HistoryRecord<>1
OR
2.
UPDATE DuplicatesRemoved SET HistoryRecord=1
Will Query #1 take less time than Query #2?
I have referred to this question but am still not sure which one would be better for me.

In the first option:
UPDATE DuplicatesRemoved SET HistoryRecord=1 WHERE HistoryRecord<>1
You have to find those records and update only those.
In the second option:
UPDATE DuplicatesRemoved SET HistoryRecord=1
You have to update the entire table.
So the first option will be better, assuming you can find those records quickly. It also minimizes the number of locks acquired during the update, and the total size of the transaction that the engine writes to the log file (i.e., the records we need to be able to roll back).
Showing the execution plan will help in this decision.
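As a rough illustration of the difference (a sketch using SQLite from Python; the table layout follows the question, but the data is made up), the WHERE version only writes the rows that actually need changing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DuplicatesRemoved (UserID INTEGER, HistoryRecord INTEGER)")
conn.executemany(
    "INSERT INTO DuplicatesRemoved VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 0), (4, 1)],
)

# Option 1: only rows that still need the change are touched
n1 = conn.execute(
    "UPDATE DuplicatesRemoved SET HistoryRecord = 1 WHERE HistoryRecord <> 1"
).rowcount
# Option 2: every row is rewritten, including those already set to 1
n2 = conn.execute("UPDATE DuplicatesRemoved SET HistoryRecord = 1").rowcount
```

Here n1 is 2 (only the rows with HistoryRecord = 0) while n2 is 4 (the whole table), which is exactly the write-volume difference described above. Without an index, both still scan every row to find their targets.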

In databases, you measure the number of disk accesses to evaluate the complexity of a query, since the time to read something from external memory is orders of magnitude greater than the time to perform a few operations in main memory.
If no index is present, the two queries have the same number of disk accesses, since both require a complete scan of the relation.

Related

Postgres SQL sentence performance

I have a Postgres instance running on a 16-core/32 GB Windows Server workstation.
I followed performance improvements tips I saw in places like this: https://www.postgresql.org/docs/9.3/static/performance-tips.html.
When I run an update like:
analyze;
update amazon_v2
set states_id = amazon.states_id,
geom = amazon.geom
from amazon
where amazon_v2.fid = amazon.fid
where fid is the primary key in both tables and both have 68M records, it takes almost a day to run.
Is there any way to improve the performance of SQL sentences like this? Should I write a stored procedure to process it record by record, for example?
You don't show the execution plan, but I'd bet it's performing a full table scan on amazon_v2 and using an index seek on amazon.
I don't see how to improve performance much here, since it's already close to optimal. The only thing I can think of is table partitioning and parallelizing the execution.
Another totally different strategy, is to update the "modified" rows only. Maybe you can track those to avoid updating all 68 million rows every time.
Your query is executed in a very long transaction. The transaction may be blocked by other writers; query pg_locks to check.
Long transactions also have a negative impact on autovacuum. Does execution time increase over time? If so, check for table bloat.
Performance usually improves when big transactions are divided into smaller ones. Unfortunately, the operation is then no longer atomic, and there is no golden rule for the optimal batch size.
You should also follow advice from https://stackoverflow.com/a/50708451/6702373
Let's sum it up:
Update modified rows only (if only a few rows are modified)
Check locks
Check table bloat
Check hardware utilization (related to other issues)
Split the operation into batches.
Replace updates with delete/truncate & insert/copy (this works if the update changes most rows).
(if nothing else helps) Partition table
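The "split into batches" item above can be sketched as follows (a hypothetical example using SQLite from Python; the table name follows the question, but the schema, data, and batch size are made up). Each batch runs in its own short transaction, keyed off the primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amazon_v2 (fid INTEGER PRIMARY KEY, states_id INTEGER)")
conn.executemany("INSERT INTO amazon_v2 VALUES (?, 0)",
                 [(i,) for i in range(1, 101)])

batch_size = 25
last_fid = 0
batches = 0
while True:
    with conn:  # one commit per batch keeps each transaction short
        cur = conn.execute(
            "UPDATE amazon_v2 SET states_id = 1 WHERE fid > ? AND fid <= ?",
            (last_fid, last_fid + batch_size),
        )
    if cur.rowcount == 0:
        break
    last_fid += batch_size
    batches += 1
```

With 100 rows and a batch size of 25, this commits four small transactions instead of one large one; on the real 68M-row table the batch size would of course be much larger, and there is no universal optimum.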

Selecting 80% of rows and table lock

One of my colleagues came to me with this statement:
Having a SELECT on a table that fetches 80% of the rows while having a
WHERE clause on a column with an index. So to avoid that, add WITH (NOLOCK) in your FROM clause.
His only argument was: "Believe me, I've experienced it myself." I cannot find proper documentation for this.
As far as I know, WITH (NOLOCK) only affects the table by letting UPDATEs and INSERTs occur while selecting, which can lead to dirty reads.
Is my colleague's assumption correct?
I think you're referring to lock escalation, https://technet.microsoft.com/en-us/library/ms184286(v=sql.105).aspx , combined with a table scan caused by an index with bad selectivity, and some possibilities for blocking.
If the statistics on a non-clustered index show that the number of rows returned from the table for a specific value exceeds some threshold, the optimizer will choose a table scan to find the corresponding rows instead of an index seek with the corresponding bookmark lookups, because bookmark lookups are slow in quantity.
I typically tell people that you want that percentage to be 5% or lower, but sometimes it will still index seek up to 10% or so. At 80%, it's definitely going to table scan.
Also, since the query is doing a table scan, the query has to be able to acquire some kind of lock on every single row in the table. If there are any other queries that are running performing updates, or otherwise preventing locks from being acquired on even a single row, the query will have to wait.
With lock escalation, it's not a percentage, but instead a specific magic number of 5,000. A query generally starts reading rows using row locks. If a single query reads 5,000 or more rows, it will escalate the locks that it is using against the table from row and/or page locks to full table locks.
This is when deadlocks happen, because another query may be trying to do the same thing.
These locks don't necessarily have anything to do with inserts/updates.
This is an actual thing. No, this does not mean that you should use NOLOCK. You'd be much better off looking at READPAST, TABLOCK, or TABLOCKX, https://msdn.microsoft.com/en-us/library/ms187373.aspx , if you're having issues with deadlocks.
Do not do any of these things just out of habit and only look into them for specific instances with highly transactional tables that are experiencing actual problems.
By default, writers have priority and readers will wait for writers to finish. WITH (NOLOCK) allows readers to read uncommitted data, avoiding waits on writers. For read-only queries against very large tables, this is OK if you are querying data such as an old partition, or pulling back data that does not change often and where changes are not critical to the presentation of the data. This is the same as using the SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED directive in stored procedures.

How to efficiently keep count by reading, incrementing it & updating a column in the database

I have a column in the database which keeps counts of incoming requests, but updated from different sources and systems.
And the incoming requests are in thousands per minute.
What is the best way to update this column with the new request count?
The 2 ways at the top of my head are -
Read the current value from the column, increment it by one, and then update it back (all inside a stored procedure).
The problem I see with this is that every source/system that updates needs to lock this column, which might increase the wait time for reading and updating the column, and will slow down the DB.
Put requests in a queue, and have a job read the queue and update the column, one at a time. This method looks safer, at least to me, but is it too much work just to get a count of incoming requests?
What is the approach you would typically take in such a read & update in a column in huge amounts scenario?
Thanks
Thousands per minute is not "huge". Let's say it's 10k per minute. That leaves 6 ms per update. For an in-memory row with a simple integer increment and not too many indexes, expect <1 ms per update. That works out fine.
So just use
UPDATE T SET Count = Count + 1 WHERE ID = 1234
Put an index on the table and just do:
update table t
set request_count = request_count + 1
where <whatever conditions are appropriate>;
Be sure that the conditions in the where clause all refer to indexes, so finding the row is likely to be as fast as possible.
Without strenuous effort, I would expect the update to be fast enough, but you should test whether that is true. You could also insert a row into a requests table and do the counting when you query that table; inserts are faster than updates, because the engine doesn't have to find the row first.
If this doesn't meet performance goals, then some sort of distributed mechanism may prove successful. I don't see batching the requests using sequences as a simple solution, and although the queue is likely to be distributed, you then have the problem that the request counts are out of sync with the actual updates.
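Both suggestions can be sketched in miniature (using SQLite from Python for illustration; the counter-table layout and the id 1234 are invented): an atomic in-place increment, and the insert-then-count-later alternative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE counters (id INTEGER PRIMARY KEY, request_count INTEGER)")
conn.execute("INSERT INTO counters VALUES (1234, 0)")
conn.execute("CREATE TABLE requests (received_at TEXT)")

for _ in range(5):
    # Approach 1: atomic in-place increment; no read-then-write round trip
    conn.execute(
        "UPDATE counters SET request_count = request_count + 1 WHERE id = 1234")
    # Approach 2: append-only insert; counting is deferred to query time
    conn.execute("INSERT INTO requests VALUES (datetime('now'))")

counter_value = conn.execute(
    "SELECT request_count FROM counters WHERE id = 1234").fetchone()[0]
inserted_count = conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]
```

Both end up at 5 here; the trade-off is that the counter row is a write hotspot, while the append-only table spreads writes out but makes reads pay for the COUNT(*).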

T-SQL Optimize DELETE of many records

I have a table that can grow to millions of records (50 million, for example). Every 20 minutes, records that are older than 20 minutes are deleted.
The problem is that if the table has that many records, such a deletion can take a lot of time, and I want to make it faster.
I cannot do TRUNCATE TABLE because I want to remove only records that are older than 20 minutes. I suppose that when doing the DELETE and filtering the records that need to be deleted, the server is writing to a log file or something, and this takes a lot of time?
Am I right? Is there a flag or option I can turn off to optimize the delete, and then turn back on afterwards?
To expand on the batch delete suggestion, I'd suggest you do this far more regularly (every 20 seconds, perhaps); batch deletions are easy:
WHILE 1 = 1
BEGIN
DELETE TOP (4000)
FROM YOURTABLE
WHERE YourIndexedDateColumn < DATEADD(MINUTE, -20, GETDATE())
IF @@ROWCOUNT = 0
BREAK
END
Your inserts may lag slightly while they wait for the locks to release, but they should insert rather than error.
In regards to your table, though: a table with this much traffic I'd expect to see on a very fast RAID 10 array, perhaps even partitioned. Are your disks up to it? Are your transaction logs on different disks from your data files? They should be.
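A rough Python/SQLite equivalent of the batch-delete loop above, for illustration only (table name, batch size, and data are made up; in SQL Server the TOP (4000) loop does the same job):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id INTEGER PRIMARY KEY, created REAL)")
now = time.time()
# 50 rows older than 20 minutes, plus 10 fresh rows
conn.executemany("INSERT INTO YourTable (created) VALUES (?)",
                 [(now - 3600,)] * 50 + [(now,)] * 10)

cutoff = now - 20 * 60  # 20 minutes ago
deleted = 0
while True:
    with conn:  # one commit per batch, as in the T-SQL loop
        cur = conn.execute(
            "DELETE FROM YourTable WHERE id IN "
            "(SELECT id FROM YourTable WHERE created < ? LIMIT 20)",
            (cutoff,),
        )
    if cur.rowcount == 0:
        break
    deleted += cur.rowcount
```

Each pass removes at most 20 old rows and commits, so no single transaction holds locks for long; the loop exits once a pass deletes nothing.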
EDIT 1 - Response to your comment
To put a database into SIMPLE recovery:
ALTER DATABASE YourDatabaseName SET RECOVERY SIMPLE
This basically turns off transaction logging on the given database, meaning that in the event of data loss you would lose all data since your last full backup. If you're OK with that, this should save a lot of time when running large transactions. (Note that while a transaction is running, the logging still takes place in SIMPLE, to enable rolling the transaction back.)
If there are tables within your database where you can't afford to lose data, you'll need to leave the database in FULL recovery mode (i.e., every transaction gets logged, and hopefully flushed to *.trn files by your server's maintenance plans). As I stated, though, there is nothing stopping you from having two databases, one in FULL and one in SIMPLE: the FULL database would be for tables where you can't afford to lose any data (i.e., you could apply the transaction logs to restore data to a specific time), and the SIMPLE database would be for these massive high-traffic tables on which you can allow data loss in the event of a failure.
All of this is relevant assuming you're creating full (*.bak) backups every night and flushing your log files to *.trn files every half hour or so.
In regards to your index question, it's imperative that your date column is indexed; if you check your execution plan and see a TABLE SCAN, that is an indicator of a missing index.
Your date column, I presume, is a DATETIME with a constraint setting the DEFAULT to GETDATE()?
You may find that you get better performance by replacing that with a BIGINT in YYYYMMDDHHMMSS form and then applying a CLUSTERED index to that column. Note, however, that you can only have one clustered index per table, so if that table already has one you'll need to use a non-clustered index. (In case you didn't know, a clustered index basically tells SQL Server to store the information in that order, meaning that when you delete rows older than 20 minutes, SQL Server can delete them sequentially rather than hopping from page to page.)
The log problem is probably due to the number of records deleted in the transaction; to make things worse, the engine may be requesting a lock per record (or per page, which is not so bad).
The one big thing here is how you determine the records to be deleted. I'm assuming you use a datetime field; if so, make sure you have an index on that column, otherwise it's a sequential scan of the table that will really penalize your process.
There are two things you may do, depending on the concurrency of users and the timing of the delete:
If you can guarantee that no one is going to read or write while you delete, you can lock the table in exclusive mode (this takes only one lock from the engine), delete, and release the lock.
You can use batch deletes: write a script with a cursor that provides the rows you want to delete, then begin a transaction and commit every X records (ideally 5,000), so you keep the transactions short and don't take that many locks.
Take a look at the query plan for the delete process and see what it shows; a sequential scan of a big table is never good.
Unfortunately for the purpose of this question and fortunately for the sake of consistency and recoverability of the databases in SQL server, putting a database into Simple recovery mode DOES NOT disable logging.
Every transaction still gets logged before being committed to the data file(s); the only difference is that the space in the log gets released (in most cases) right after the transaction is either rolled back or committed in the Simple recovery mode. This is not going to affect the performance of the DELETE statement one way or the other.
I had a similar problem when I needed to delete more than 70% of the rows from a big table with 3 indexes and a lot of foreign keys.
For this scenario, I saved the rows I wanted in a temp table, truncated the original table and reinserted the rows, something like:
SELECT * INTO #tempuser FROM [User] WHERE [Status] >= 600;
TRUNCATE TABLE [User];
INSERT [User] SELECT * FROM #tempuser;
I learned this technique with this link that explains:
DELETE is a fully logged operation, and can be rolled back if something goes wrong
TRUNCATE removes all rows from a table without logging the individual row deletions
In the article you can explore other strategies for resolving the delay in deleting many records; that one worked for me.
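The same keep/truncate/reinsert pattern can be sketched with SQLite from Python (SQLite has no TRUNCATE, so an unqualified DELETE stands in for it here; the table, Status column, and data follow the answer's example but are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE User (id INTEGER, Status INTEGER)")
conn.executemany("INSERT INTO User VALUES (?, ?)",
                 [(i, 100 * i) for i in range(1, 11)])  # Status 100..1000

with conn:
    # Keep the wanted rows aside, empty the table, then re-insert them
    conn.execute(
        "CREATE TEMP TABLE tempuser AS SELECT * FROM User WHERE Status >= 600")
    conn.execute("DELETE FROM User")  # stands in for TRUNCATE TABLE
    conn.execute("INSERT INTO User SELECT * FROM tempuser")

kept = conn.execute("SELECT COUNT(*) FROM User").fetchone()[0]
```

Only the 5 rows with Status >= 600 survive. This pays off when the kept fraction is small; remember that in SQL Server the truncate step also requires no foreign keys referencing the table.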

SQL Server, Converting NTEXT to NVARCHAR(MAX)

I have a database with a large number of fields that are currently NTEXT.
Having upgraded to SQL 2005 we have run some performance tests on converting these to NVARCHAR(MAX).
If you read this article:
http://geekswithblogs.net/johnsPerfBlog/archive/2008/04/16/ntext-vs-nvarcharmax-in-sql-2005.aspx
This explains that a simple ALTER COLUMN does not re-organise the data into rows.
I experienced this with my data. We actually got much worse performance in some areas if we just ran the ALTER COLUMN. However, if I run an UPDATE Table SET Column = Column for all of these fields, we then get an extremely large performance increase.
The problem I have is that the database consists of hundreds of these columns with millions of records. A simple test (on a low-performance virtual machine) with a table containing a single NTEXT column and 7 million records took 5 hours to update.
Can anybody offer any suggestions as to how I can update the data in a more efficient way that minimises downtime and locks?
EDIT: My backup solution is to just update the data in blocks over time; however, with our data this results in worse performance until all the records have been updated, and the shorter that period is the better, so I'm still looking for a quicker way to update.
If you can't get scheduled downtime....
create two new columns:
an NVARCHAR(MAX) column
processedflag INT DEFAULT 0
Create a nonclustered index on the processedflag
You have UPDATE TOP available to you (you want to update the top rows ordered by the primary key).
Simply set the processedflag to 1 during the update so that the next update will only update where the processed flag is still 0
You can use @@ROWCOUNT after the update to see if you can exit the loop.
I suggest using WAITFOR for a few seconds after each update query, to give other queries a chance to acquire locks on the table and to avoid overloading disk usage.
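A miniature sketch of that flag-driven loop (in Python with SQLite for illustration; the table and column names are invented, and the WAITFOR pause is shown as a commented-out sleep):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, old_text TEXT, "
    "new_text TEXT, processedflag INTEGER DEFAULT 0)"
)
# Analogue of the suggested nonclustered index on the processed flag
conn.execute("CREATE INDEX ix_docs_flag ON docs (processedflag)")
conn.executemany("INSERT INTO docs (old_text) VALUES (?)",
                 [("row %d" % i,) for i in range(10)])

while True:
    with conn:  # one short transaction per batch
        cur = conn.execute(
            "UPDATE docs SET new_text = old_text, processedflag = 1 "
            "WHERE id IN (SELECT id FROM docs WHERE processedflag = 0 LIMIT 3)"
        )
    if cur.rowcount == 0:
        break
    # time.sleep(2)  # the WAITFOR suggestion: let other queries take locks

remaining = conn.execute(
    "SELECT COUNT(*) FROM docs WHERE processedflag = 0").fetchone()[0]
```

Because each batch marks its rows processed, the loop is resumable: if it is interrupted, rerunning it picks up exactly where it left off.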
How about running the update in batches: update 1000 rows at a time.
You would use a while loop that increments a counter corresponding to the IDs of the rows to be updated in each iteration of the update query. This may not speed up the total time it takes to update all 7 million records, but it should make it much less likely that users will experience an error due to record locking.
If you can get scheduled downtime:
Back up the database
Change recovery model to simple
Remove all indexes from the table you are updating
Add a column maintenanceflag(INT DEFAULT 0) with a nonclustered index
Run:
UPDATE TOP (1000) tablename
SET new_nvarchar_column = old_ntext_column,
    maintenanceflag = 1
WHERE maintenanceflag = 0
Multiple times as required (within a loop with a delay).
Once complete, do another backup then change the recovery model back to what it was originally on and add old indexes.
Remember that every index or trigger on that table causes extra disk I/O and that the simple recovery mode minimises logfile I/O.
Running a database test on a low-performance virtual machine is not really indicative of production performance; the heavy I/O involved will require a fast disk array, which virtualisation will throttle.
You might also consider testing to see if an SSIS package might do this more efficiently.
Whatever you do, make it an automated process that can be scheduled and run during off hours. The fewer users you have trying to access the data, the faster everything will go. If at all possible, pick out the three or four most critical columns to change, take the database down for maintenance (during a normally off time), and do them in single-user mode. Once you get the most critical ones done, the others can be scheduled one or two a night.