When to commit changes? - sql

Using Oracle 10g, accessed via Perl DBI, I have a table with a few tens of millions of rows that is updated a few times per second while being read from much more frequently by another process.
Soon the update frequency will increase by an order of magnitude (maybe two).
Someone suggested that committing every N updates instead of after every update will help performance.
I have a few questions:
Will it be faster, slower, or does it depend? (I'm planning to benchmark both ways as soon as I can get a decent simulation of the new load.)
Why would it help or hinder performance?
If "it depends...", on what?
If it helps, what's the best value of N?
Why can't my local DBA give me a helpful, straight answer when I need one? (Actually I know the answer to that one) :-)
EDIT:
#codeslave: Thanks. BTW, losing uncommitted changes is not a problem; I don't delete the original data used for updating until I'm sure everything is fine. And yes, the cleaning lady did unplug the server, TWICE :-)
Some googling showed it might help because of issues related to rollback segments, but I still don't know a rule of thumb for N. Every few tens? Hundreds? Thousands?
#diciu: Great info, I'll definitely look into that.

A commit results in Oracle writing data to disk - i.e., to the redo log file - so that whatever the committed transaction has done can be recovered in the event of a power failure, etc.
Writing to a file is slower than writing to memory, so committing after each of many operations in a row will be slower than committing once for a set of coalesced updates.
In Oracle 10g there's an asynchronous commit that makes it much faster but less reliable: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-6158695.html
PS: I know for sure that, in a scenario I've seen in a certain application, changing the number of coalesced updates from 5K to 50K made it faster by an order of magnitude (10 times faster).
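To make this concrete, here is a minimal PL/SQL sketch of committing every N updates instead of after every row; the table, column and staging names are hypothetical, and N = 1000 is only a starting point for benchmarking:

    -- Sketch only: commit once per 1000 updates instead of once per row.
    -- big_table, staging_updates and the column names are made up for illustration.
    DECLARE
        v_done PLS_INTEGER := 0;
    BEGIN
        FOR rec IN (SELECT id, new_value FROM staging_updates) LOOP
            UPDATE big_table
               SET value = rec.new_value
             WHERE id = rec.id;

            v_done := v_done + 1;
            IF MOD(v_done, 1000) = 0 THEN
                COMMIT;   -- or COMMIT WRITE BATCH NOWAIT for the 10g asynchronous commit mentioned above
            END IF;
        END LOOP;
        COMMIT;           -- commit the final partial batch
    END;
    /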

Reducing the frequency of commits will certainly speed things up, however as you are reading and writing to this table frequently there is the potential for locks. Only you can determine the likelihood of the same data being updated at the same time. If the chance of this is low, commit every 50 rows and monitor the situation. Trial and error I'm afraid :-)

As well as reducing the commit frequency, you should also consider performing bulk updates instead of individual ones.
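For example, a single set-based statement can replace a loop of per-row updates. This is only a sketch assuming an Oracle staging table with hypothetical names:

    -- Sketch: one MERGE applies the whole batch of updates in a single statement.
    -- staging_updates and big_table are illustrative names, not from the question.
    MERGE INTO big_table t
    USING staging_updates s
       ON (t.id = s.id)
    WHEN MATCHED THEN
        UPDATE SET t.value = s.new_value;
    COMMIT;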

If you "don't delete the original data used for updating till [you are] sure everything is fine", then why don't you remove all those incremental commits in between, and rollback if there's a problem? It sounds like you effectively have built a transaction systems on top of transactions.

#CodeSlave: your question is answered by #stevechol - if I remove ALL the incremental commits there will be locks. I guess if nothing better comes along I'll follow his advice: pick a random number, monitor the load, and adjust accordingly, while applying #diciu's tweaks.
PS: the transaction on top of transactions is just accidental. I get the files used for updates by FTP, and instead of deleting them immediately I set a cron job to delete them a week later (if no one using the application has complained). That means if something goes wrong I have a week to catch the errors.

Faster/Slower?
It will probably be a little faster. However, you run a greater risk of running into deadlocks, losing uncommitted changes should something catastrophic happen (cleaning lady unplugs the server), FUD, Fire, Brimstone, etc.
Why would it help?
Obviously fewer commit operations, which in turn means fewer disk writes, etc.
DBA's and straight answers?
If it were easy, you wouldn't need one.

Related

Redshift removal of explicit locks led to missing rows/duplicated rows. Best resolution?

I have a use case where some of our redshift tables are used by multiple data scientists at the same time for tuning. If they are tuning at the same time and import data at the same time, we end up with missing rows and duplicated rows.
A while ago, they had removed the explicit lock from the table to reduce the number of loads that would hang on those locks.
I'm assuming that this removal of the explicit lock is causing the duplicated and missing rows.
In terms of path forward, I was thinking about having kinesis, redis, or something similar to batch these to be one import instead of inserts (not great for redshift anyways). Or if the real solution is to add explicit locking back to the table and deal with the hanging loads.
Any guidance would be appreciated. Thanks
Putting the explicit locks back on works, but other procedures hang behind the table/proc locks and slow down significantly.
Yes removing locks is causing this and they shouldn’t do it. Tell them to stop.
They are likely running into this because they aren’t COMMITting their changes. Changing their connections to AUTOCOMMIT might fix things. If one person changes a table but doesn’t COMMIT the change then they have a local copy and a lock until they do. If they never disconnect then this situation can last forever. If many are doing this then you have a mess of local copies and locks waiting to be resolved but nobody COMMITs.
When people come from lock-on-write databases they can get confused about what is happening. Read up on MVCC database coherency.
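If you do go back to explicit locking, a minimal sketch (with a hypothetical table name; the actual load statement will differ) is to take the lock inside a short transaction and COMMIT promptly so other sessions are not left waiting:

    -- Sketch only: serialize competing loads with an explicit lock in a short transaction.
    BEGIN;
    LOCK tuning_results;                    -- blocks other writers until COMMIT
    INSERT INTO tuning_results
    SELECT * FROM tuning_results_staging;   -- or a COPY from S3
    COMMIT;                                 -- releases the lock right away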

Web application receiving millions of requests, leading to millions of row inserts per 30 seconds, in SQL Server 2008

I am currently addressing a situation where our web application receives at least a million requests per 30 seconds. These requests generate 3-5 million row inserts across 5 tables. This is a pretty heavy load to handle. Currently we are using multithreading to handle this situation (which is a bit faster, but we are unable to get better CPU throughput). However, the load will definitely increase in the future and we will have to account for that too. Six months from now we are looking at double the load size we are currently receiving, and I am looking for a possible new solution that is scalable and easy enough to accommodate any further increase in load.
Currently, multithreading is making the whole debugging scenario quite complicated, and sometimes we have problems tracing issues.
FYI, we are already utilizing the SQL Bulk Insert/Copy that is mentioned in this previous post
Sql server 2008 - performance tuning features for insert large amount of data
However I am looking for a more capable solution (which I think there should be one) that will address this situation.
Note: I am not looking for any code snippets or code examples. I am just looking for a big picture of a concept that I could possibly use and I am sure that I can take that further to an elegant solution :)
Also the solution should have a better utilization of the threads and processes. And I do not want my threads/processes to even wait to execute something because of some other resource.
Any suggestions will be deeply appreciated.
Update: Not every request will lead to an insert... however, most of them will lead to some SQL operation. The application performs different types of transactions and these lead to a lot of bulk SQL operations. I am more concerned with inserts and updates.
These operations need not be real time - there can be a bit of lag - however, processing them in real time would be very helpful.
I think your problem is more about getting better CPU throughput, which will lead to better performance. So I would probably look at something like asynchronous processing, wherein a thread never sits idle, and you will probably have to maintain a queue in the form of a linked list or any other data structure that suits your programming model.
The way this would work is: your threads try to perform a given job immediately, and if anything stops them from doing it, they push that job onto the queue; the pushed items are then processed based on how the container/queue orders its items.
In your case, since you are already using bulk SQL operations, you should be good to go with this strategy.
Let me know if this helps you.
Can you partition the database so that the inserts are spread around? How is this data used after insert? Is there a natural partition to the data by client or geography or some other factor?
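As a sketch of what partitioning the insert table might look like in SQL Server 2008 (Enterprise edition is required for partitioning; all object names and boundary values here are hypothetical):

    -- Sketch: spread inserts across monthly partitions by event date.
    CREATE PARTITION FUNCTION pf_EventMonth (datetime)
        AS RANGE RIGHT FOR VALUES ('20120101', '20120201', '20120301');

    CREATE PARTITION SCHEME ps_EventMonth
        AS PARTITION pf_EventMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.RequestEvents (
        EventId   bigint IDENTITY(1,1) NOT NULL,
        EventDate datetime             NOT NULL,
        Payload   varchar(400)         NULL
    ) ON ps_EventMonth (EventDate);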
Since you are using SQL Server, I would suggest you get several of the books on high availability and high performance for SQL Server. The internals book might help as well. Amazon has a bunch of these. This is a complex subject and requires too much depth for a simple answer on a bulletin board. But basically there are several keys to high-performance design, including hardware choices, partitioning, correct indexing, correct queries, etc. To do this effectively, you have to understand in depth what SQL Server does under the hood and how changes can make a big difference in performance.
Since you do not need your inserts/updates in real time, you might consider having two databases: one for reads and one for writes. Similar to having an OLTP db and an OLAP db:
Read Database:
Indexed as much as needed to maximize read performance.
Possibly denormalized if performance requires it.
Not always up to date.
Insert/Update database:
No indexes at all. This will help maximize insert/update performance
Try to normalize as much as possible.
Always up to date.
You would basically direct all insert/update actions to the insert/update db. You would then create a publication process to move data over to the read database at certain intervals. When I have seen this in the past, the data is usually moved over on a nightly basis, when few people are using the site. There are a number of options for moving the data over, but I would start by looking at SSIS.
This will depend on your ability to do a few things:
have read data be up to one day out of date
complete your nightly Read db update process in a reasonable amount of time.
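A minimal sketch of the publication step, assuming two databases named WriteDb and ReadDb and a hypothetical table with a LastModified column, could be a nightly cross-database insert (SSIS can wrap the same logic with more control):

    -- Sketch: move the last day's rows from the write database to the read database.
    INSERT INTO ReadDb.dbo.Orders (OrderId, CustomerId, Amount, LastModified)
    SELECT s.OrderId, s.CustomerId, s.Amount, s.LastModified
    FROM   WriteDb.dbo.Orders AS s
    WHERE  s.LastModified >= DATEADD(day, -1, GETDATE());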

Postgresql Application Insertion and Trigger Performance

I'm working on designing an application with a SQL backend (PostgreSQL) and I've got some design questions. In short, the DB will serve to store network events as they occur on the fly, so insertion speed and performance are critical due to 'real-time' actions that depend on these events. The data is dumped into a speedy default format across a few tables, and I am currently using PostgreSQL triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables that share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers, as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables too long, so I'll keep those regularly pruned for insertion performance. In thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the querying done by the triggers in the solution above.
Modify my initial output program to dump all the data to a single table (vs across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I over thinking this? I want to get this right. Any input is much appreciated!
You may experience a slight decrease in performance since there are more "things" to do (although they should not affect operations in any way). But using triggers/other PL is a good way to keep that to a minimum, since they execute faster than code that gets sent from your application to the DB server.
I would go with your first idea 1) since it seems to me the cleanest and most efficient way.
2) is the most performance-hungry solution, since cron will do more queries than the other solutions that use server-side functions. 3) would be possible but will result in an "uglier" database layout.
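A minimal sketch of what option 1 could look like, assuming a hypothetical insertion table events_raw and reporting table report_summary (column names are illustrative only):

    -- Sketch: an AFTER INSERT trigger copies each new event into a reporting table.
    CREATE OR REPLACE FUNCTION copy_to_reporting() RETURNS trigger AS $$
    BEGIN
        INSERT INTO report_summary (event_id, event_type, occurred_at)
        VALUES (NEW.event_id, NEW.event_type, NEW.occurred_at);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_copy_to_reporting
        AFTER INSERT ON events_raw
        FOR EACH ROW EXECUTE PROCEDURE copy_to_reporting();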
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers, as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
That may be way off, I'm afraid, but in a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block each other in PostgreSQL (unless they explicitly escalate locks - a SELECT ... FOR UPDATE will block writers, for example). Basically, if your tables fit comfortably in RAM you are better off reporting from them, since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM then you may have cache-miss issues induced by reporting. Here materializing your views (i.e. making trigger-maintained tables) may cut down on these, but it has a significant complexity cost. This, btw, is your option 1. So I would chalk this one up provisionally as premature optimization. Also keep in mind you may induce cache misses and lock contention by materializing the views this way, so you might induce performance problems for inserts.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2. If you mean temporary tables as CREATE TEMPORARY TABLE, that's asking for a mess including performance issues and reports not showing what you want them to show. Don't do it. If you do this, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.....
In short I think you are overthinking it. You can get very far by bumping RAM up on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.

How much does wrapping inserts in a transaction help performance on Sql Server?

Ok, so say I have 100 rows to insert and each row has about 150 columns (I know that sounds like a lot of columns, but I need to store this data in a single table). The inserts will occur at random (i.e. whenever a set of users decide to upload a file containing the data), about 20 times a month. However, the database will be under continuous load processing other functions of a large enterprise application. The columns are varchars, ints, as well as a variety of other types.
Is the performance gain of wrapping these inserts in a transaction (as opposed to running them one at a time) going to be huge, minimal, or somewhere in between?
Why?
EDIT:
This is for Sql Server 2005, but I'd be interested in 2000/2008 if there is something different to be said. Also I should mention that I understand the point about transactions being primarily for data-consistency, but I want to focus on performance effects.
It can have an impact, actually. The point of transactions is not about how many you do, it's about keeping the data update consistent. If you have rows that need to be inserted together and are dependent on each other, those are the records you wrap in a transaction.
Transactions are about keeping your data consistent. This should be the first thing you think about when using transactions. For example, if you have a debit (withdrawal) from your checking account, you want to make sure the credit (deposit) is also done. If either of those doesn't succeed, the whole "transaction" should be rolled back. Therefore, both actions MUST be wrapped in a transaction.
When doing batch inserts, break them up in to 3000 or 5000 records and cycle through the set. 3000-5000 has been a sweet number range for me for inserts; don't go above that unless you've tested that the server can handle it. Also, I will put GOs in the batch at about every 3000 or 5000 records for inserts. Updates and deletes I'll put a GO at about 1000, because they require more resources to commit.
If you're doing this from C# code, then in my opinion you should build a batch import routine instead of doing millions of inserts one at a time through code.
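A minimal sketch of that chunked approach, assuming a hypothetical staging table and a key column to track what has already been moved (the 5000 figure is just the range suggested above):

    -- Sketch: insert in committed batches of ~5000 rows until the staging table is drained.
    DECLARE @batch int;
    SET @batch = 1;
    WHILE @batch > 0
    BEGIN
        BEGIN TRANSACTION;

        INSERT INTO dbo.TargetRows (RowKey, Payload)
        SELECT TOP (5000) s.RowKey, s.Payload
        FROM   dbo.StagingRows AS s
        WHERE  NOT EXISTS (SELECT 1 FROM dbo.TargetRows AS t WHERE t.RowKey = s.RowKey);

        SET @batch = @@ROWCOUNT;   -- capture before COMMIT resets @@ROWCOUNT

        COMMIT TRANSACTION;
    END;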
While transactions are a mechanism for keeping data consistent they actually have a massive impact on performance if they are used incorrectly or overused. I've just finished a blog post on the impact on performance of explicitly specifying transactions as opposed to letting them occur naturally.
If you are inserting multiple rows and each insert occurs in its own transaction, there is a lot of overhead on locking and unlocking data. By encapsulating all inserts in a single transaction you can dramatically improve performance.
Conversely if you have many queries running against your database and have large transactions also occurring they can block each other and cause performance issues.
Transactions are definitively linked with performance, regardless of their underlying intent.
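For the original 100-row case, the difference boils down to something like this sketch (table and column names are hypothetical): one explicit transaction instead of 100 implicit, individually committed ones.

    -- Sketch: wrap the whole upload in a single explicit transaction.
    BEGIN TRANSACTION;

    INSERT INTO dbo.UploadRows (Col1, Col2) VALUES ('row 1', 1);
    INSERT INTO dbo.UploadRows (Col1, Col2) VALUES ('row 2', 2);
    -- ... the remaining rows of the file ...

    COMMIT TRANSACTION;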
It depends on what you call huge, but it will help (it really depends on the overall number of inserts you are doing). It will force SQL Server to not do a commit after every insert, which over time adds up. With 100 inserts, you probably won't notice too much of an increase, depending on how often it happens and what else is going on with the database.
As others have said, transactions have nothing to do with performance, but instead have to do with the integrity of your data.
That being said, worrying about the performance one way or the other when you're only talking about inserting 100 rows of data about 20 times a month (meaning 2000 records per month) is silly. Premature optimization is a waste of time; unless you have repeatedly tested the performance impact of these inserts (as small as they are, and as infrequent) and found them to be a major issue, don't worry about the performance. It's negligible compared to the other things you mentioned as being server load.
Transactions are not for performance but for data integrity. Depending on the implementation there will be no real gain/loss of performance for only 100 rows (they will just be logged additionally, so they can all be rolled back).
Things to consider about the performance issues:
TAs will interact with other queries
writing TAs will lock tuples/pages/files
commits just might be (depending on lock protocol) update of a timestamp
more logs might be written for TAs (one should be able to roll TAs back, but the DB might log extensively already, sequential logging is cheap)
the degree of isolation (I know that one can switch this level in some DBs - and that nearly nobody uses level 3)
All in all: use TAs for ensuring the integrity.
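To illustrate the isolation-level point from the list above, a sketch in SQL Server syntax (the table is hypothetical and the right level depends entirely on the workload):

    -- Sketch: the isolation level is set per session/transaction and trades
    -- consistency guarantees against locking overhead.
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    BEGIN TRANSACTION;
        UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
        UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
    COMMIT TRANSACTION;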
Practically - extremely. With large inserts, 100+ rows (provided that you have configured MySQL with increased query size and transaction size to support monstrous queries/transactions - sorry, I don't remember the exact variable names), insert times can commonly be 10 times as fast, and even much more.

Stored Procedure Execution Plan - Data Manipulation

I have a stored proc that processes a large amount of data (about 5m rows in this example). The performance varies wildly. I've had the process running in as little as 15 minutes and seen it run for as long as 4 hours.
For maintenance, and in order to verify that the logic and processing is correct, we have the SP broken up into sections:
TRUNCATE and populate a work table (indexed) we can verify later with automated testing tools.
Join several tables together (including some of these work tables) to produce another work table
Repeat 1 and/or 2 until a final output is produced.
My concern is that this is a single SP and so gets an execution plan when it is first run (even WITH RECOMPILE). But at that time, the work tables (permanent tables in a Work schema) are empty.
I am concerned that, regardless of the indexing scheme, the execution plan will be poor.
I am considering breaking up the SP and calling separate SPs from within it so that they could take advantage of a re-evaluated execution plan after the data in the work tables is built. I have also seen reference to using EXEC to run dynamic SQL which, obviously might get a RECOMPILE also.
I'm still trying to get SHOWPLAN permissions, so I'm flying quite blind.
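For reference, the dynamic SQL idea mentioned above amounts to something like this sketch (table names are hypothetical): the statement inside EXEC is compiled only when it runs, i.e. after the work table has been populated.

    -- Sketch: compile the heavy step only after the Work tables contain data.
    EXEC ('
        INSERT INTO Work.StageOutput (OrderId, Total)
        SELECT w.OrderId, SUM(w.Amount)
        FROM   Work.StageOrders AS w
        GROUP BY w.OrderId;
    ');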
Are you able to determine whether there are any locking problems? Are you running the SP in sufficiently small transactions?
Breaking it up into subprocedures should have no benefit.
Somebody should be concerned about your productivity, working without basic optimization resources. That suggests there may be other possible unseen issues as well.
Grab the free copy of "Dissecting Execution Plan" in the link below and maybe you can pick up a tip or two from it that will give you some idea of what's really going on under the hood of your SP.
http://dbalink.wordpress.com/2008/08/08/dissecting-sql-server-execution-plans-free-ebook/
Are you sure that the variability you're seeing is caused by "bad" execution plans? This may be a cause, but there may be a number of other reasons:
"other" load on the db machine
when using different data, there may be "easy" and "hard" data
issues with having to allocate more memory/file storage
...
Have you tried running the SP with the same data a few times?
Also, in order to figure out what is causing the runtime/variability, I'd try to do some detailed measuring to pin the problem down to a specific section of the code. (The easiest way would be to insert some log calls at various points in the SP.) Then try to explain why that section is slow (other than "5M rows" ;-)) and figure out a way to make it faster.
For now, I think there are a few questions to answer before going down the "splitting up the sp" route.
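A minimal sketch of such log calls, assuming a hypothetical dbo.SpRunLog table, so each run records how long every section took:

    -- Sketch: timestamp each section of the stored procedure.
    INSERT INTO dbo.SpRunLog (RunId, Section, LoggedAt)
    VALUES (@RunId, 'after step 1: populate work table', GETDATE());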
You're right it is quite difficult for you to get a clear picture of what is happening behind the scenes until you can get the "actual" execution plans from several executions of your overall process.
One point to consider, perhaps: are your work tables physical or temporary tables? If they are physical, you will get a performance gain by inserting new data into a new table without an index (i.e. a heap), which you can then build an index on after all the data has been inserted.
Also, what is the purpose of your process? It sounds like you are moving quite a bit of data around, in which case you may wish to consider the use of partitioning. You can switch data in and out of your main table with relative ease.
Hope what I have detailed is clear but please feel free to pose further questions.
Cheers, John
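A sketch of the heap-then-index idea, with purely hypothetical table and index names:

    -- Sketch: load the work table without indexes, then build the index afterwards.
    TRUNCATE TABLE Work.StageOrders;
    DROP INDEX IX_StageOrders_OrderId ON Work.StageOrders;   -- leave a heap for the load

    INSERT INTO Work.StageOrders (OrderId, Amount)
    SELECT o.OrderId, o.Amount
    FROM   dbo.Orders AS o;

    CREATE INDEX IX_StageOrders_OrderId ON Work.StageOrders (OrderId);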
In several cases I've seen, this level of diversity in execution times / query plans comes down to statistics. I would recommend some tests running UPDATE STATISTICS against the tables you are using just before the process is run. This will both force a re-evaluation of the execution plan by SQL Server and, I suspect, give you more consistent results. Additionally, you may do well to see if the differences in execution time correlate with re-indexing jobs by your DBAs. Perhaps you could also gather some index health statistics before each run.
If not, as other answerers have suggested, you are more likely suffering from locking and/or contention issues.
Good luck with it.
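A minimal sketch of that, with a hypothetical work table name:

    -- Sketch: refresh statistics on the freshly populated work table before the heavy step.
    UPDATE STATISTICS Work.StageOrders WITH FULLSCAN;
    -- optionally mark dependent procedures for recompilation on their next run
    EXEC sp_recompile N'Work.StageOrders';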
The only thing I can think that an execution plan would do wrong when there's no data is err on the side of using a table scan instead of an index, since table scans are super fast when the whole table will fit into memory. Are there other negatives you're actually observing or are sure are happening because there's no data when an execution plan is created?
You can force usage of indexes in your query...
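For example, an index hint in T-SQL looks like this sketch (table, index and parameter names are hypothetical); use it sparingly, since it overrides the optimizer:

    -- Sketch: force a specific index with a table hint.
    SELECT w.OrderId, w.Amount
    FROM   Work.StageOrders AS w WITH (INDEX (IX_StageOrders_OrderId))
    WHERE  w.OrderId = @OrderId;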
Seems to me like you might be going down the wrong path.
Is this an infeed or outfeed of some sort or are you creating a report? If it is a feed, I would suggest that you change the process to use SSIS which should be able to move 5 million records very fast.