How to optimise an SQL request containing update when? - sql

I have an SQL request that is take more than 10 seconds to load at peak hours on my server.
UPDATE "events"
SET "metas" = 732899,
"count" = "count" + 1,
"timestamp" = 1633450429
WHERE "hash" = 'my_counter_453751'
Do you see any ways I can optimize this request?
I've tried to change my server but it doesn't change anything.
Usually, this request is almost instant, but when I have lots of users connected on my server this request takes more than 10 seconds and I don't understand why.
Any advice would help!

UPDATE "events" SET ... WHERE "hash" = 'my_counter_453751'
For this to work efficiently, you need an index on hash (if at all possible, it would be somewhat better to retrieve an integer-type primary index).
You also need as few other indexes as possible, because an index on metas, for example, too would be updated if you changed that column, and that takes time.
The fact that under load the update time goes up in that way, though, is suspicious. One of the few things that could explain this is disk thrashing. With many users, the events table gets evicted from memory and requires reloading. With too many indexes, and even more if there is no index on hash at all, this can be calamitous.
So I would also review the memory assigned to the RDBMS server, and the size of the events table. You could trim it down, or benefit from sharding. Or you could split a historical table into several tables, one per month - a poor man's sharding. This complicates retrieving information from that table, but accelerates writes.
I have also reaped huge benefits from dirty tricks such as writing these updates to a different, smaller table (or even to a temporary file). So, none of my hundred simultaneous clients has ever to do anything to the huge table; they all access the tiny one. Then a periodic cron job would run every minute or so, and perform all the updates together, loading the huge table in memory only once - and not locking or interfering in any way with the other threads, then remove the processed records from the tiny table. Maybe some such strategy can be doable for you too.
For a historic log table, you could write to a daily log and then pour the daily log into the large table late every night.

Related

What is the best way to ensure consistent ordering in an Oracle query?

I have an program that needs to run queries on a number of very large Oracle tables (the largest with tens of millions of rows). The output of these queries is fed into another process which (as a side effect) can record the progress of the query (i.e., the last row fetched).
It would be nice if, in the event that the task stopped half way through for some reason, it could be restarted. For this to happen, the query has to return rows in a consistent order, so it has to be sorted. The obvious thing to do is to sort on the primary key; however, there is probably going to be a penalty for this in terms of performance (an index access) versus a non-sorted solution. Given that a restart may never happen this is not desirable.
Is there some trick to ensure consistent ordering in another way? Any other suggestions for maintaining performance in this case?
EDIT: I have been looking around and seen "order by rowid" mentioned. Is this useful or even possible?
EDIT2: I am adding some benchmarks:
With no order by: 17 seconds.
With order by PK: 46 seconds.
With order by rowid: 43 seconds.
So any order by has a savage effect on performance, and using rowid makes little difference. Accepted answer is - there is no easy way to do it.
The best advice I can think of is to reduce the chance of a problem occurring that might stop the process, and that means keeping the code simple. No cursors, no commits, no trying to move part of the data, just straight SQL statements.
Unless a complete restart would be a completely unacceptable disaster, I'd go for simplicity without any part-way restart code at all.
If you want some order and queried data is unsorted then you need to sort it anyway, and spend some resources to do sorting.
So, there are at least two variants for optimization:
Minimize resources spent on sorting;
Query already sorted data.
For the first variant Oracle on its own calculates a best variant to minimize data access and overall query time. It may be possible to choose sorting order involved in unique index which already used by optimizer, but it's a very questionable tactic.
Second variant is about index-organized tables and about forcing Oracle with hints to use some specific index. It seems Ok if you need to process nearly all records in some specific table, but if selectivity of query is high it's significantly slows a process, even on a single table.
Think about a table with surrogate primary key which holds data with 10-year transaction history. If you need data only for previous year and you force order by primary key then Oracle need to process records in all 10 years one-by-one to find all records which belongs to a single year.
But if you need data for 9 years from this table then full table scan may be faster than index-based choice.
So selectivity of your query is a key to choose between full table scan and result sorting.
For storing results and restarting query a good solution is to use Oracle Streams Advanced Queuing to fed another process.
All unprocessed messages in queue redirected to Exception Queue where it may be processed separately.
Because you don't specify exact ordering for selected messages I suppose that you need ordering only to maintain unprocessed part of records. If it's true then with AQ you don't need ordering at all and may, even, process records in parallel.
So, finally, from my point of view Buffered Queue is what you really need.
You could skip ordering and just update the records you processed with something like SET is_processed = 'Y' or SET date_processed = sysdate. Complete restartability and no ordering.
For performance you can partition by is_processed. Yes, partition key changes might be slow, but it is all about trade-offs.

Speed-up SQL Insert Statements

I am facing an issue with an ever slowing process which runs every hour and inserts around 3-4 million rows daily into an SQL Server 2008 Database.
The schema consists of a large table which contains all of the above data and has a clustered index on a datetime field (by day), a unique index on a combination of fields in order to exclude duplicate inserts, and a couple more indexes on 2 varchar fields.
The typical behavior as of late, is that the insert statements get suspended for a while before they complete. The overall process used to take 4-5 mins and now it's usually well over 40 mins.
The inserts are executed by a .net service which parses a series of xml files, performs some data transformations and then inserts the data to the DB. The service has not changed at all, it's just that the inserts take longer than they use to.
At this point I'm willing to try everything. Please, let me know whether you need any more info and feel free to suggest anything.
Thanks in advance.
Sounds like you have exhausted the buffer pools ability to cache all the pages needed for the insert process. Append-style inserts (like with your date table) have a very small working set of just a few pages. Random-style inserts have basically the entire index as their working set. If you insert a row at a random location the existing page that row is supposed to be written to must be read first.
This probably means tons of disk seeks for inserts.
Make sure to insert all rows in one statement. Use bulk insert or TVPs. This allows SQL Server to optimize the query plan by sorting the inserts by key value making IO much more efficient.
This will, however, not realize a big speedup (I have seen 5x in similar situations). To regain the original performance you must bring the working set back into memory. Add RAM, purge old data, or partition such that you only need to touch very few partitions.
drop index's before insert and set them up on completion

Running Updates on a large, heavily used table

I have a large table (~170 million rows, 2 nvarchar and 7 int columns) in SQL Server 2005 that is constantly being inserted into. Everything works ok with it from a performance perspective, but every once in a while I have to update a set of rows in the table which causes problems. It works fine if I update a small set of data, but if I have to update a set of 40,000 records or so it takes around 3 minutes and blocks on the table which causes problems since the inserts start failing.
If I just run a select to get back the data that needs to be updated I get back the 40k records in about 2 seconds. It's just the updates that take forever. This is reflected in the execution plan for the update where the clustered index update takes up 90% of the cost and the index seek and top operator to get the rows take up 10% of the cost. The column I'm updating is not part of any index key, so it's not like it reorganizing anything.
Does anyone have any ideas on how this could be sped up? My thought now is to write a service that will just see when these updates have to happen, pull back the records that have to be updated, and then loop through and update them one by one. This will satisfy my business needs but it's another module to maintain and I would love if I could fix this from just a DBA side of things.
Thanks for any thoughts!
Actually it might reorganise pages if you update the nvarchar columns.
Depending on what the update does to these columns they might cause the record to grow bigger than the space reserved for it before the update.
(See explanation now nvarchar is stored at http://www.databasejournal.com/features/mssql/physical-database-design-consideration.html.)
So say a record has a string of 20 characters saved in the nvarchar - this takes 20*2+2(2 for the pointer) bytes in space. This is written at the initial insert into your table (based on the index structure). SQL Server will only use as much space as your nvarchar really takes.
Now comes the update and inserts a string of 40 characters. And oops, the space for the record within your leaf structure of your index is suddenly too small. So off goes the record to a different physical place with a pointer in the old place pointing to the actual place of the updated record.
This then causes your index to go stale and because the whole physical structure requires changing you see a lot of index work going on behind the scenes. Very likely causing an exclusive table lock escalation.
Not sure how best to deal with this. Personally if possible I take an exclusive table lock, drop the index, do the updates, reindex. Because your updates sometimes cause the index to go stale this might be the fastest option. However this requires a maintenance window.
You should batch up your update into several updates (say 10000 at a time, TEST!) rather than one large one of 40k rows.
This way you will avoid a table lock, SQL Server will only take out 5000 locks (page or row) before esclating to a table lock and even this is not very predictable (memory pressure etc). Smaller updates made in this fasion will at least avoid concurrency issues you are experiencing.
You can batch the updates using a service or firehose cursor.
Read this for more info:
http://msdn.microsoft.com/en-us/library/ms184286.aspx
Hope this helps
Robert
The mos brute-force (and simplest) way is to have a basic service, as you mentioned. That has the advantage of being able to scale with the load on the server and/or the data load.
For example, if you have a set of updates that must happen ASAP, then you could turn up the batch size. Conversely, for less important updates, you could have the update "server" slow down if each update is taking "too long" to relieve some of the pressure on the DB.
This sort of "heartbeat" process is rather common in systems and can be very powerful in the right situations.
Its wired that your analyzer is saying it take time to update the clustered Index . Did the size of the data change when you update ? Seems like the varchar is driving the data to be re-organized which might need updates to index pointers(As KMB as already pointed out) . In that case you might want to increase the % free sizes on the data and the index pages so that the data and the index pages can grow without relinking/reallocation . Since update is an IO intensive operation ( unlike read , which can be buffered ) the performance also depends on several factors
1) Are your tables partitioned by data 2) Does the entire table lies in the same SAN disk ( Or is the SAN striped well ?) 3) How verbose is the transaction logging . Can the buffer size of the transaction loggin increased to support larger writes to the log to suport massive inserts ?
Its also important which API/Language are you using? e.g JDBC support a batch update feature which makes the updates a little bit efficient if you are doing multiple updates .

Fastest way to do mass update

Let’s say you have a table with about 5 million records and a nvarchar(max) column populated with large text data. You want to set this column to NULL if SomeOtherColumn = 1 in the fastest possible way.
The brute force UPDATE does not work very well here because it will create large implicit transaction and take forever.
Doing updates in small batches of 50K records at a time works but it’s still taking 47 hours to complete on beefy 32 core/64GB server.
Is there any way to do this update faster? Are there any magic query hints / table options that sacrifices something else (like concurrency) in exchange for speed?
NOTE: Creating temp table or temp column is not an option because this nvarchar(max) column involves lots of data and so consumes lots of space!
PS: Yes, SomeOtherColumn is already indexed.
From everything I can see it does not look like your problems are related to indexes.
The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.
Since the column you are updating is likely more than 8000 characters it is stored off-page, which implies additional effort in reading this column when it is not NULL.
When you run a batch of 50000 updates SQL has to place this in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back it has to store the original value of the column in the transaction log.
Assuming (for simplicity sake) that each column contains on average 10,000 bytes of data, that means 50,000 rows will contain around 500MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).
There is no way to disable the logs as it will compromise the database integrity.
I ran a quick test on my dog slow desktop, and running batches of even 10,000 becomes prohibitively slow, but bringing the size down to 1000 rows, which implies a temporary log size of around 10MB, worked just nicely.
I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly you should be able to update your entire 5Million rows on my dog slow desktop in around 6 hours on my 1 processor 2GB desktop, so I would expect something much better on your beefy server backed by SAN or something.
You may want to run your update statement as a select, selecting only the primary key and the large nvarchar column, and ensure this runs as fast as you expect.
Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.
As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.
Hopefully you already dropped any indexes on the column you are setting to null, including full text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.
You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
What indexes are in place on the table? Given that batch updates of approx. 50,000 rows take so long, I would say you require an index.
Have you tried placing an index or statistics on someOtherColumn?
This really helped me. I went from 2 hours to 20 minutes with this.
/* I'm using database recovery mode to Simple */
/* Update table statistics */
set transaction isolation level read uncommitted
/* Your 50k update, just to have a measures of the time it will take */
set transaction isolation level READ COMMITTED
In my experience, working in MSSQL 2005, moving everyday (automatically) 4 Million 46-byte-records (no nvarchar(max) though) from one table in a database to another table in a different database takes around 20 minutes in a QuadCore 8GB, 2Ghz server and it doesn't hurt application performance. By moving I mean INSERT INTO SELECT and then DELETE. The CPU usage never goes over 30 %, even when the table being deleted has 28M records and it constantly makes around 4K insert per minute but no updates. Well, that's my case, it may vary depending on your server load.
READ UNCOMMITTED
"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.
I don't know what rg-tsql means but here you'll find info about transaction isolation levels in MSSQL.
Try indexing 'SomeOtherColumn'...50K records should update in a snap. If there is already an index in place see if the index needs to be reorganized and that statistics have been collected for it.
If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.
If you provide some info about the number of rows with SomeOtherColumn=1, perhaps we can think another way, but I suggest:
0) Backup your table
1) Index the flag column
2) Set the table option to "no log tranctions" ... if posible
3) write a stored procedure to run the updates

Stream with a lot of UPDATEs and PostgreSQL

I'm quite a newbie with PostgreSQL optimization and chosing whatever's appropriate job for it and whatever's not. So, I want to know whenever I'm trying to use PostgreSQL for inappropriate job, or it is suitable for it and I should set everything up properly.
Anyway, I have a need for a database with a lot of data that changes frequently.
For example, imagine an ISP, having a lot of clients, each having a session (PPP/VPN/whatever), with two self-describing frequently updated properties bytes_received and bytes_sent. There is a table with them, where each session is represented by a row with unique ID:
CREATE TABLE sessions(
id BIGSERIAL NOT NULL,
username CHARACTER VARYING(32) NOT NULL,
some_connection_data BYTEA NOT NULL,
bytes_received BIGINT NOT NULL,
bytes_sent BIGINT NOT NULL,
CONSTRAINT sessions_pkey PRIMARY KEY (id)
)
And as accounting data flows, this table receives a lot of UPDATEs like those:
-- There are *lots* of such queries!
UPDATE sessions SET bytes_received = bytes_received + 53554,
bytes_sent = bytes_sent + 30676
WHERE id = 42
When we receive a never ending stream with quite a lot (like 1-2 per second) of updates for a table with a lot (like several thousands) of sessions, probably thanks to MVCC, this makes PostgreSQL very busy. Are there any ways to speed everything up, or Postgres is just not exactly suitable for this task and I'd better consider it unsuitable for this job and put those counters to another storage like memcachedb, using Postgres only for fairly static data? But I'll miss an ability to infrequently query on this data, for example to find TOP10 downloaders, which is not really good.
Unfortunately, the amount of data cannot be lowered much. The ISP accounting example is all thought up to simplify the explanation. The real problem's with another system, which structure is somehow harder to explain.
Thanks for suggestions!
The database really isn't the best tool for collecting lots of small updates, but as I don't know your queryability and ACID requirements I can't really recommend something else. If it's an acceptable approach the application side update aggregation suggested by zzzeek can help lower the update load significantly.
There is an similar approach that can give you durability and ability to query the fresher data at some performance cost. Create a buffer table that can collect the changes to the values that need to be updated and insert the changes there. At regular intervals in a transaction rename the table to something else and create a new table in place of it. Then in a transaction aggregate all the changes, do the corresponding updates to the main table and truncate the buffer table. This way if you need a consistent and fresh snapshot of any data you can select from the main table and join in all the changes from the active and renamed buffer tables.
However if neither is acceptable you can also tune the database to deal better with heavy update loads.
To optimize the updating make sure that PostgreSQL can use heap-only tuples to store the updated versions of the rows. To do this make sure that there are no indexes on the frequently updated columns and change the fillfactor to something lower from the default 100%. You'll need to figure out a suitable fill factor on your own as it depends heavily on the details of the workload and the machine it is running on. The fillfactor needs to be low enough that allmost all of the updates fit on the same database page before autovacuum has the chance to clean up the old non-visible versions. You can tune autovacuum settings to trade off between the density of the database and vacuum overhead. Also, take into account that any long transactions, including statistical queries, will hold onto tuples that have changed after the transaction has started. See the pg_stat_user_tables view to see what to tune, especially the relationship of n_tup_hot_upd to n_tup_upd and n_live_tup to n_dead_tup.
Heavy updating will also create a heavy write ahead log (WAL) load. Tuning the WAL behavior (docs for the settings) will help lower that. In particular, a higher checkpoint_segments number and higher checkpoint_timeout can lower your IO load significantly by allowing more updates to happen in memory. See the relationship of checkpoints_timed vs. checkpoints_req in pg_stat_bgwriter to see how many checkpoints happen because either limit is reached. Raising your shared_buffers so that the working set fits in memory will also help. Check buffers_checkpoint vs. buffers_clean + buffers_backend to see how many were written to satisfy checkpoint requirements vs. just running out of memory.
You want to assemble statistical updates as they happen into an in-memory queue of some kind, or alternatively onto a message bus if you're more ambitious. A receiving process then aggregates these statistical updates on a periodic basis - which can be anywhere from every 5 seconds to every hour - depends on what you want. The counts of bytes_received and bytes_sent are then updated, with counts that may represent many individual "update" messages summed together. Additionally you should batch the update statements for multiple ids into a single transaction, ensuring that the update statements are issued in the same relative order with regards to primary key to prevent deadlocks against other transactions that might be doing the same thing.
In this way you "batch" activities into bigger chunks to control how much load is on the PG database, and also serialize many concurrent activities into a single stream (or multiple, depending on how many threads/processes are issuing updates). The tradeoff which you tune based on the "period" is, how much freshness vs. how much update load.