Postgres SQL sentence performance

Postgres SQL sentence performance - sql

I´ve a Postgres instance running in a 16 cores/32 Gb WIndows Server workstation.
I followed performance improvements tips I saw in places like this: https://www.postgresql.org/docs/9.3/static/performance-tips.html.
When I run an update like:
analyze;
update amazon_v2
set states_id = amazon.states_id,
geom = amazon.geom
from amazon
where amazon_v2.fid = amazon.fid
where fid is the primary key in both tables and both has 68M records, it takes almost a day to run.
Is there any way to improve the performance of SQL sentences like this? Should I write a stored procedure to process it record by record, for example?

You don't show the execution plan but I bet it's probably performing a Full Table Scan on amazon_v2 and using an Index Seek on amazon.
I don't see how to improve performance here, since it's close to optimal already. The only think I can think of is to use table partitioning and parallelizing the execution.
Another totally different strategy, is to update the "modified" rows only. Maybe you can track those to avoid updating all 68 million rows every time.

Your query is executed in a very log transaction. The transaction may be blocked by other writers. Query pg_locks.
Long transactions have negative impact on performance of autovacuum. Does execution time increase other time? If,so check table bloat.
Performance usually increases when big transactions are dived into smaller. Unfortunately, the operation is no longer atomic and there is no golden rule on optimal batch size.
You should also follow advice from https://stackoverflow.com/a/50708451/6702373
Let's sum it up:
Update modified rows only (if only a few rows are modified)
Check locks
Check table bloat
Check hardware utilization (related to other issues)
Split the operation into batches.
Replace updates with delete/truncate & insert/copy (this works if the update changes most rows).
(if nothing else helps) Partition table

Related

Azure SQL server deletes

I have a SQL server with 16130000 rows. I need to delete around 20%. When I do a simple:
delete from items where jobid=12
Takes forever.
I stopped the query after 11 minutes. Selecting data is pretty fast why is delete so slow? Selecting 850000 rows takes around 30 seconds.
Is it because of table locks? And can you do anything about it? I would expect delete rows should be faster because you dont transfer data on the wire?
Best R, Thomas

Without telling us what reservation size you are using, it is hard to give feedback on whether X records in Y seconds is expected or not. I can tell you about how the system works so that you can make this determination with a bit more investigation by yourself, however. The log commit rate is limited by the reservation size you purchase. Deletes are fundamentally limited on the ability to write out log records (and replicate them to multiple machines in case your main machine dies). When you select records, you don't have to go over the network to N machines and you may not even need to go to the local disk if the records are preserved in memory, so selects are generally expected to be faster than inserts/updates/deletes because of the need to harden log for you.
You can read about the specific limits for different reservation sizes are here:
DTU Limits and vCore Limits
One common problem customers hit is to do individual operations in a loop (like a cursor or driven from the client). This implies that each statement has a single row updated and thus has to harden each log record serially because the app has to wait for the statement to return before submitting the next statement. You are not hitting that since you are running a big delete as a single statement. That could be slow for other reasons such as:
Locking - if you have other users doing operations on the table, it could block the progress of the delete statement. You can potentially see this by looking at sys.dm_exec_requests to see if your statement is blocking on other locks.
Query Plan choice. If you have to scan a lot of rows to delete a small fraction, you could be blocked on the IO to find them. Looking at the query plan shape will help here, as will set statistics time on (I suggest you change the query to do TOP 100 or similar to get a sense of whether you are doing lots of logical read IOs vs. actual logical writes). This could imply that your on-disk layout is suboptimal for this problem. The general solutions would be to either pick a better indexing strategy or to use partitioning to help you quickly drop groups of rows instead of having to delete all the rows explicitly.

Try to use batching techniques to improve performance, minimize log usage and avoid consuming database space.
declare
#batch_size int,
#del_rowcount int = 1
set #batch_size = 100
set nocount on;
while #del_rowcount > 0
begin
begin tran
delete top (#batch_size)
from dbo.LargeDeleteTest
set #del_rowcount = ##rowcount
print 'Delete row count: ' + cast(#del_rowcount as nvarchar(32))
commit tran
end
Drop any foreign keys, delete the rows and then recreate the foreign keys can speed up things also.

deleting 500 records from table with 1 million records shouldn't take this long

I hope someone can help me. I have a simple sql statement
delete from sometable
where tableidcolumn in (...)
I have 500 records I want to delete and recreate. The table recently grew to over 1 mill records. The problem is the statement above is taking over 5 minutes without completing. I have a primary key and 2 non clustered non unique indexes. My delete statement is using the primary key.
Can anyone help me understand why this statement is taking so long and how I can speed it up?

There are two areas I would look at first, locking and a bad plan.
Locking - run your query and while it is running see if it is being blocked by anything else "select * from sys.dm_exec_requests where blocking_session_id <> 0" if you see anything blocking your request then I would start with looking at:
https://www.simple-talk.com/sql/database-administration/the-dba-as-detective-troubleshooting-locking-and-blocking/
If there is no locking then get the execution plan for the insert, what is it doing? it it exceptionally high?
Other than that, how long do you expect it to take? Is it a little bit longer than that or a lot longer? Did it only get so slow after it grew significantly or has it been getting slower over a long period of time?
What is the I/O performance, what are your average read / write times etc etc.

TL;DR: Don't do that (instead of a big 'in' clause: preload and use a temporary table).
With the number of parameters, unknown backend configuration (even though it should be fine by today's standards) and not able to guess what your in-memory size may be during processing, you may be hitting (in order) a stack, batch or memory size limit, starting with this answer. Also possible to hit an instruction size limit.
The troubleshooting comments may lead you to another answer. My pivot's the 'in' clause, statement size, and that all of these links include advice to preload a temporary table and use that with your query.

Running Updates on a large, heavily used table

I have a large table (~170 million rows, 2 nvarchar and 7 int columns) in SQL Server 2005 that is constantly being inserted into. Everything works ok with it from a performance perspective, but every once in a while I have to update a set of rows in the table which causes problems. It works fine if I update a small set of data, but if I have to update a set of 40,000 records or so it takes around 3 minutes and blocks on the table which causes problems since the inserts start failing.
If I just run a select to get back the data that needs to be updated I get back the 40k records in about 2 seconds. It's just the updates that take forever. This is reflected in the execution plan for the update where the clustered index update takes up 90% of the cost and the index seek and top operator to get the rows take up 10% of the cost. The column I'm updating is not part of any index key, so it's not like it reorganizing anything.
Does anyone have any ideas on how this could be sped up? My thought now is to write a service that will just see when these updates have to happen, pull back the records that have to be updated, and then loop through and update them one by one. This will satisfy my business needs but it's another module to maintain and I would love if I could fix this from just a DBA side of things.
Thanks for any thoughts!

Actually it might reorganise pages if you update the nvarchar columns.
Depending on what the update does to these columns they might cause the record to grow bigger than the space reserved for it before the update.
(See explanation now nvarchar is stored at http://www.databasejournal.com/features/mssql/physical-database-design-consideration.html.)
So say a record has a string of 20 characters saved in the nvarchar - this takes 20*2+2(2 for the pointer) bytes in space. This is written at the initial insert into your table (based on the index structure). SQL Server will only use as much space as your nvarchar really takes.
Now comes the update and inserts a string of 40 characters. And oops, the space for the record within your leaf structure of your index is suddenly too small. So off goes the record to a different physical place with a pointer in the old place pointing to the actual place of the updated record.
This then causes your index to go stale and because the whole physical structure requires changing you see a lot of index work going on behind the scenes. Very likely causing an exclusive table lock escalation.
Not sure how best to deal with this. Personally if possible I take an exclusive table lock, drop the index, do the updates, reindex. Because your updates sometimes cause the index to go stale this might be the fastest option. However this requires a maintenance window.

You should batch up your update into several updates (say 10000 at a time, TEST!) rather than one large one of 40k rows.
This way you will avoid a table lock, SQL Server will only take out 5000 locks (page or row) before esclating to a table lock and even this is not very predictable (memory pressure etc). Smaller updates made in this fasion will at least avoid concurrency issues you are experiencing.
You can batch the updates using a service or firehose cursor.
Read this for more info:
http://msdn.microsoft.com/en-us/library/ms184286.aspx
Hope this helps
Robert

The mos brute-force (and simplest) way is to have a basic service, as you mentioned. That has the advantage of being able to scale with the load on the server and/or the data load.
For example, if you have a set of updates that must happen ASAP, then you could turn up the batch size. Conversely, for less important updates, you could have the update "server" slow down if each update is taking "too long" to relieve some of the pressure on the DB.
This sort of "heartbeat" process is rather common in systems and can be very powerful in the right situations.

Its wired that your analyzer is saying it take time to update the clustered Index . Did the size of the data change when you update ? Seems like the varchar is driving the data to be re-organized which might need updates to index pointers(As KMB as already pointed out) . In that case you might want to increase the % free sizes on the data and the index pages so that the data and the index pages can grow without relinking/reallocation . Since update is an IO intensive operation ( unlike read , which can be buffered ) the performance also depends on several factors
1) Are your tables partitioned by data 2) Does the entire table lies in the same SAN disk ( Or is the SAN striped well ?) 3) How verbose is the transaction logging . Can the buffer size of the transaction loggin increased to support larger writes to the log to suport massive inserts ?
Its also important which API/Language are you using? e.g JDBC support a batch update feature which makes the updates a little bit efficient if you are doing multiple updates .

Fastest way to do mass update

Let’s say you have a table with about 5 million records and a nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.
The brute force UPDATE does not work very well here because it will create large implicit transaction and take forever.
Doing updates in small batches of 50K records at a time works but it’s still taking 47 hours to complete on beefy 32 core/64GB server.
Is there any way to do this update faster? Are there any magic query hints / table options that sacrifices something else (like concurrency) in exchange for speed?
NOTE: Creating temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!
PS: Yes, SomeOtherColumn is already indexed.

From everything I can see it does not look like your problems are related to indexes.
The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.
Since the column you are updating is likely more than 8000 characters it is stored off-page, which implies additional effort in reading this column when it is not NULL.
When you run a batch of 50000 updates SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back it has to store the original value of the column in the transaction log.
Assuming (for simplicity sake) that each column contains on average 10,000 bytes of data, that means 50,000 rows will contain around 500MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
There is no way to disable the logs as it will compromise the database integrity.
I ran a quick test on my dog slow desktop, and running batches of even 10,000 becomes prohibitively slow, but bringing the size down to 1000 rows, which implies a temporary log size of around 10MB, worked just nicely.
I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly you should be able to update your entire 5Million rows on my dog slow desktop in around 6 hours on my 1 processor 2GB desktop, so I would expect something much better on your beefy server backed by SAN or something.
You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.
As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.

Hopefully you already dropped any indexes on the column you are setting to null, including full text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.

You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.

Have you tried placing an index or statistics on someOtherColumn?

This really helped me. I went from 2 hours to 20 minutes with this.
/* I'm using database recovery mode to Simple */
/* Update table statistics */
set transaction isolation level read uncommitted
/* Your 50k update, just to have a measures of the time it will take */
set transaction isolation level READ COMMITTED
In my experience, working in MSSQL 2005, moving everyday (automatically) 4 Million 46-byte-records (no nvarchar(max) though) from one table in a database to another table in a different database takes around 20 minutes in a QuadCore 8GB, 2Ghz server and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. The CPU usage never goes over 30 %, even when the table being deleted has 28M records and it constantly makes around 4K insert per minute but no updates. Well, that's my case, it may vary depending on your server load.
READ UNCOMMITTED
"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.
I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.

Try indexing 'SomeOtherColumn'...50K records should update in a snap. If there is already an index in place see if the index needs to be reorganized and that statistics have been collected for it.

If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.
If you provide some info about the number of rows with SomeOtherColumn=1, perhaps we can think another way, but I suggest:
0) Backup your table
1) Index the flag column
2) Set the table option to "no log tranctions" ... if posible
3) write a stored procedure to run the updates

SQL Server, Converting NTEXT to NVARCHAR(MAX)

I have a database with a large number of fields that are currently NTEXT.
Having upgraded to SQL 2005 we have run some performance tests on converting these to NVARCHAR(MAX).
If you read this article:
http://geekswithblogs.net/johnsPerfBlog/archive/2008/04/16/ntext-vs-nvarcharmax-in-sql-2005.aspx
This explains that a simple ALTER COLUMN does not re-organise the data into rows.
I experience this with my data. We actually have much worse performance in some areas if we just run the ALTER COLUMN. However, if I run an UPDATE Table SET Column = Column for all of these fields we then get an extremely huge performance increase.
The problem I have is that the database consists of hundreds of these columns with millions of records. A simple test (on a low performance virtual machine) had a table with a single NTEXT column containing 7 million records took 5 hours to update.
Can anybody offer any suggestions as to how I can update the data in a more efficient way that minimises downtime and locks?
EDIT: My backup solution is to just update the data in blocks over time, however, with our data this results in worse performance until all the records have been updated and the shorter this time is the better so I'm still looking for a quicker way to update.

If you can't get scheduled downtime....
create two new columns:
nvarchar(max)
processedflag INT DEFAULT 0
Create a nonclustered index on the processedflag
You have UPDATE TOP available to you (you want to update top ordered by the primary key).
Simply set the processedflag to 1 during the update so that the next update will only update where the processed flag is still 0
You can use ##rowcount after the update to see if you can exit a loop.
I suggest using WAITFOR for a few seconds after each update query to give other queries a chance to acquire locks on the table and not to overload disk usage.

How about running the update in batches - update 1000 rows at a time.
You would use a while loop that increments a counter, corresponding to the ID of the rows to be updated in each iteration of the the update query. This may not speed up the amount of time it takes to update all 7 million records, but it should make it much less likely that users will experience an error due to record locking.

If you can get scheduled downtime:
Back up the database
Change recovery model to simple
Remove all indexes from the table you are updating
Add a column maintenanceflag(INT DEFAULT 0) with a nonclustered index
Run:
UPDATE TOP 1000
tablename
SET nvarchar from ntext,
maintenanceflag = 1
WHERE maintenanceflag = 0
Multiple times as required (within a loop with a delay).
Once complete, do another backup then change the recovery model back to what it was originally on and add old indexes.
Remember that every index or trigger on that table causes extra disk I/O and that the simple recovery mode minimises logfile I/O.

Running a database test on a low performance virtual machine is not really indicative of production performance, the heavy IO involved will require a fast disk array, which the virtualisation will throttle.

You might also consider testing to see if an SSIS package might do this more efficiently.
Whatever you do, make it an automated process that can be scheduled and run during off hours. the feweer users you have trying to access the data, the faster everything will go. If at all possible, pickout the three or four most critical to change and take the database down for maintentance (during a normally off time) and do them in single user mode. Once you get the most critical ones, the others can be scheduled one or two a night.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas