How much does wrapping inserts in a transaction help performance on Sql Server? - sql

Ok, so say I have 100 rows to insert and each row has about 150 columns (I know that sounds like a lot of columns, but I need to store this data in a single table). The inserts will occur at random (i.e. whenever a set of users decides to upload a file containing the data), about 20 times a month. However, the database will be under continuous load processing other functions of a large enterprise application. The columns are varchars, ints, and a variety of other types.
Is the performance gain of wrapping these inserts in a transaction (as opposed to running them one at a time) going to be huge, minimal, or somewhere in between?
Why?
EDIT:
This is for Sql Server 2005, but I'd be interested in 2000/2008 if there is something different to be said. Also I should mention that I understand the point about transactions being primarily for data-consistency, but I want to focus on performance effects.

It can have an impact, actually. The point of transactions is not about how many you do, it's about keeping the data updates consistent. If you have rows that need to be inserted together and are dependent on each other, those are the records you wrap in a transaction.
Transactions are about keeping your data consistent. This should be the first thing you think about when using transactions. For example, if you have a debit (withdrawal) from your checking account, you want to make sure the credit (deposit) is also done. If either of those doesn't succeed, the whole "transaction" should be rolled back. Therefore, both actions MUST be wrapped in a transaction.
When doing batch inserts, break them up into batches of 3,000 or 5,000 records and cycle through the set. 3,000-5,000 has been a sweet spot for me for inserts; don't go above that unless you've tested that the server can handle it. Also, I will put a GO in the batch at about every 3,000 or 5,000 records for inserts. For updates and deletes I'll put a GO at about every 1,000, because they require more resources to commit.
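A minimal T-SQL sketch of that batching pattern; the table and column names are placeholders, and single-row INSERT ... VALUES is used so it also works on SQL Server 2005:

    -- One batch of a few thousand inserts wrapped in a single transaction.
    -- dbo.UploadedData and its columns are made-up names for illustration.
    BEGIN TRANSACTION;

    INSERT INTO dbo.UploadedData (Col001, Col002 /* ...the other ~150 columns... */)
    VALUES ('some varchar', 42);
    INSERT INTO dbo.UploadedData (Col001, Col002 /* ... */)
    VALUES ('another varchar', 43);
    -- ...repeat until the batch reaches roughly 3,000-5,000 rows...

    COMMIT TRANSACTION;
    GO   -- end of this batch; the next 3,000-5,000 rows go in the next batch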
If you're doing this from C# code, then in my opinion you should build a batch import routine instead of doing millions of inserts one at a time in code.

While transactions are a mechanism for keeping data consistent, they actually have a massive impact on performance if they are used incorrectly or overused. I've just finished a blog post on the performance impact of explicitly specifying transactions as opposed to letting them occur naturally.
If you are inserting multiple rows and each insert occurs in its own transaction, there is a lot of overhead on locking and unlocking data. By encapsulating all inserts in a single transaction you can dramatically improve performance.
Conversely, if you have many queries running against your database and large transactions also occurring, they can block each other and cause performance issues.
Transactions are definitely linked to performance, regardless of their underlying intent.
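To make the single-transaction point concrete, a rough T-SQL sketch (table and column names are invented): each standalone INSERT below runs in its own auto-commit transaction and forces its own log flush, while the explicit transaction pays that cost once for the whole set.

    -- Implicit (auto-commit) transactions: one commit per statement.
    INSERT INTO dbo.Rows (Id, Payload) VALUES (1, 'a');
    INSERT INTO dbo.Rows (Id, Payload) VALUES (2, 'b');
    -- ...98 more inserts, each committing on its own...

    -- One explicit transaction around the same amount of work: a single commit.
    BEGIN TRANSACTION;
    INSERT INTO dbo.Rows (Id, Payload) VALUES (101, 'a');
    INSERT INTO dbo.Rows (Id, Payload) VALUES (102, 'b');
    -- ...98 more inserts...
    COMMIT TRANSACTION;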

It depends on what you call huge, but it will help (it really depends on the overall number of inserts you are doing). It will stop SQL Server from doing a commit after every insert, which over time adds up. With 100 inserts, you probably won't notice too much of an increase, depending on how often this runs and what else is going on with the database.

As others have said, transactions have nothing to do with performance, but instead have to do with the integrity of your data.
That being said, worrying about the performance one way or the other when you're only talking about inserting 100 rows of data about 20 times a month (meaning 2,000 records per month) is silly. Premature optimization is a waste of time; unless you have repeatedly tested the performance impact of these inserts (as small and infrequent as they are) and found them to be a major issue, don't worry about the performance. It's negligible compared to the other server load you mentioned.

Transactions are not for performance but for data integrity. Depending on the implementation, there will be no real gain/loss of performance for only 100 rows (they will just be logged additionally, so they can all be rolled back).
Things to consider about the performance issues:
TAs (transactions) will interact with other queries
writing TAs will lock tuples/pages/files
commits might just be (depending on the locking protocol) an update of a timestamp
more logs might be written for TAs (one should be able to roll TAs back, but the DB might log extensively already, and sequential logging is cheap)
the degree of isolation (I know that one can switch this level in some DBs, and that nearly nobody uses level 3)
All in all: use TAs for ensuring integrity.

Practically: extremely. With large inserts, 100+ rows (provided that you have configured MySQL with increased query size and transaction size limits to support monstrous queries/transactions; sorry, I don't remember the exact variable names), insert times can commonly be 10 times as fast, and sometimes much more.

Related

How to maintain high performance in a medical production database with millions of rows

I have an application that is used to chart patient data during an ICU stay (Electronic record).
Patients are usually connected to several devices (monitors, ventilator, dialysis etc.)
that send data in a one minute interval. An average of 1800 rows are inserted per hour per patient.
Until now the integration engine receives the data and stores it in files on a dedicated drive.
The application reads it from there and plots it in graphs and data grids.
As there's a requirement for analysis we're thinking about writing the incoming signals immediately into the DB.
But there are a lot of concerns with respect to performance; especially in this work environment, people are very sensitive when it comes to performance.
Are there any techniques besides proper indexing to mitigate a possible performance impact?
I'm thinking of a job to load the data into a dedicated table, or maybe even into another database, e.g. one month after the record was closed.
Any experiences how to keep the production DB small and lightweight?
I have no idea how many patients you have in your ICU, but unless you have thousands of patients you should not have any problems - as long as you stick to inserts, use bind variables, and have as many freelists as necessary. An insert will only take locks on the freelist, so you can do as many parallel inserts as there are freelists available to determine a free block to write the data to. You may want to look at the discussion over at Tom Kyte's site:
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:950845531436
Generally speaking, 1,800 records per hour (or 10-20 times that) is not a lot for any decent-sized Oracle DB. If you are really fancy you could choose to partition based on the patient_id, as sketched after this list. This would be specifically useful if you:
Access the data for only one patient at a time, because you can just skip all other partitions.
Want to remove the data for a patient en bloc once he leaves the ICU - instead of DELETEing, you could just drop the patient's partition.
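A hedged Oracle sketch of that partitioning idea; the table, columns and partition names are invented, and in practice you would add a partition (or use a DEFAULT partition) as each patient is admitted:

    -- Keep each patient's signal data in its own list partition.
    CREATE TABLE vital_signs (
        patient_id   NUMBER     NOT NULL,
        recorded_at  TIMESTAMP  NOT NULL,
        device_id    NUMBER,
        reading      NUMBER
    )
    PARTITION BY LIST (patient_id) (
        PARTITION p_patient_1001 VALUES (1001),
        PARTITION p_patient_1002 VALUES (1002)
    );

    -- Queries for a single patient prune down to one partition, and when the
    -- patient leaves the ICU the data can be removed without a DELETE:
    ALTER TABLE vital_signs DROP PARTITION p_patient_1001;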
Define "immediately". One of the best things you can do to improve INSERT performance is to batch the commands instead of running them one-at-a-time.
Every SQL statement has overhead - sending the statement to the database, parsing it (remember to use bind variables so that you don't have to hard parse each statement), returning a message, etc. In many applications, that overhead takes longer than the actual INSERT.
You only have to do a small amount of batching to significantly reduce that overhead. Running an INSERT ALL with two rows instead of two separate statements reduces that overhead by half, running with three rows reduces it by two-thirds, etc. Waiting a minute, or even a few seconds, to batch up rows can make a big difference.
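For example, a hedged Oracle sketch of that kind of batching with INSERT ALL (the table and literal values are placeholders; in real code you would use bind variables instead of literals):

    -- Three rows sent, parsed and executed as one statement instead of three.
    INSERT ALL
        INTO vital_signs (patient_id, recorded_at, reading) VALUES (1001, SYSTIMESTAMP, 82)
        INTO vital_signs (patient_id, recorded_at, reading) VALUES (1001, SYSTIMESTAMP, 83)
        INTO vital_signs (patient_id, recorded_at, reading) VALUES (1002, SYSTIMESTAMP, 97)
    SELECT * FROM dual;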
As long as you avoid the common row-by-row blunders, an Oracle database with "millions" of rows is nothing to worry about. You don't need to think about cryptic table settings or replication yet.

Is it better to cache some value in a database table, or to recompute it each time?

For example, I have a table of bank users (user id, user name), and a table for transactions (user id, account id, amount).
Accounts have the same properties across different users, but hold different amounts (like Alex -> Grocery: it is specific to Alex, but all other users also have a Grocery account).
The question is, would it be better to create a separate table of accounts (account id, user id, amount left) or to get this value by selecting all transactions with the needed user id and account id and just summing the 'amount' values? It seems that the first approach would be faster, but more prone to error and database corruption - I would need to update accounts every time the transaction happens. The second approach seems cleaner, but would it lead to a significant speed reduction?
What would you recommend?
Good question!
In my opinion you should always avoid duplicated data, so I would go with the "summing every time" option.
"It seems that the first approach would be faster, but more prone to error and database corruption - I would need to update accounts every time the transaction happens"
That says everything: you are subject to errors and you'll have to build a mechanism to keep the data up to date.
Don't forget that the first approach would only be faster for selects; inserts, updates and deletes would be slower because you will have to update your second table.
This is an example of Denormalization.
In general, denormalization is discouraged, but there are certain exceptions - bank account balances are typically one such exception.
So if this is your exact situation, I would suggest going with the separate table of accounts solution - but if you have far fewer records than a bank typically would, then I recommend the derived approach instead.
To some extent, it depends.
With "small" data volumes, performance will more than likely be OK.
But as data volumes grow, having to SUM all transactions may become costlier to the point at which you start noticing a performance problem.
Also to consider are data access/usage patterns. In a read-heavy system, where you "write once, read many", the SUM approach hits performance on every read - in this scenario, it may make sense to take a performance hit once on write, to improve subsequent read performance.
If you anticipate "large" data volumes, I'd definitely go with the extra table to hold the high-level totals. You need to ensure, though, that it is updated when a (monetary) transaction is made, within a (SQL Server) transaction to make it an atomic operation (sketched below).
With smaller data volumes, you could get away without it...personally, I'd probably still go down that path, to simplify the read scenario.
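A minimal T-SQL sketch of that atomic write; the table, column and parameter names are assumptions, not the poster's actual schema:

    -- Record the monetary transaction and adjust the running balance as one unit.
    -- @UserId, @AccountId and @Amount would be parameters of the calling procedure.
    BEGIN TRANSACTION;

    INSERT INTO dbo.Transactions (UserId, AccountId, Amount)
    VALUES (@UserId, @AccountId, @Amount);

    UPDATE dbo.Accounts
    SET AmountLeft = AmountLeft + @Amount   -- @Amount is negative for a debit
    WHERE UserId = @UserId AND AccountId = @AccountId;

    COMMIT TRANSACTION;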
It makes sense to go with the denormalized approach (the first solution) only if you face significant performance issues. Since you are doing just a simple SUM (or GROUP BY and then SUM) with proper indexes, your normalized solution will work really well and will be a lot easier to maintain (as you noted).
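For what "proper indexes" might look like here, a hedged sketch in generic SQL with invented names: an index leading on (user_id, account_id) lets the SUM read only one user/account slice, and including amount lets it be answered from the index alone.

    -- Index that covers the balance query.
    CREATE INDEX ix_transactions_user_account
        ON transactions (user_id, account_id, amount);

    -- Balance of one account, computed on demand.
    SELECT SUM(amount) AS balance
    FROM   transactions
    WHERE  user_id = 42
      AND  account_id = 7;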
But depending on your queries, it can make sense to go with the denormalized solution... for example, if your database is read-only (you periodically load data from some other data source and don't make inserts/updates at all, or make them really rarely), then you can just load the data in whatever shape is easiest to query... and in that case, the denormalized solution might prove to be better.

SQL DB performance and repeated queries at short intervals

If a query is constantly sent to a database at short intervals, say every 5 seconds, could the number of reads generated cause problems in terms of performance or availability? If the database is Oracle are there any tricks that can be used to avoid a performance hit? If the queries are coming from an application is there a way to reduce any impact through software design?
Unless your query is very intensive or horribly written, it won't cause any noticeable issues running once every few seconds. That's not very often for queries that are generally measured in milliseconds.
You may still want to optimize it, though, simply because there are better ways to do it. In Oracle and ADO.NET you can use an OracleDependency for the command that ran the query the first time, and then subscribe to its OnChange event, which will get called automatically whenever the underlying data changes in a way that would change the query results.
It depends on the query. I assume the reason you want to execute it periodically is because the data being returned will change frequently. If that's the case, then application-level caching is obviously not an option.
Past that, is this query "big" in terms of the number of rows returned, tables joined, data aggregated / calculated? If so, it could be a problem if:
You are issuing the query faster than it can execute. If you are calling it once a second, but it takes 2 seconds to run, that's going to become a problem.
If the query is touching a lot of data and you have a lot of other queries accessing the same tables, you could run into lock escalation issues.
As with most performance questions, the only real answer is to test. In this case test with realistic data in the DB and run this query concurrent with the other query load you expect on the system.
Along the lines of Samuel's suggestion, Oracle provides facilities in JDBC to do database change notification so that your application can subscribe to changes in the underlying data rather than re-running the query every few seconds. If the data is changing less frequently than you're running the query, this can be a major performance benefit.
Another option would be to use Oracle TimesTen as an in-memory cache of the data on the middle-tier machine(s). That will reduce the network round-trips and it will go through a very optimized retrieval path.
Finally, I'd take a look at using the query result cache to have Oracle cache the results.
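If you are on Oracle 11g or later, the result cache can be requested per query with a hint; a hedged sketch against an invented table:

    -- Repeated executions are served from the server result cache until the
    -- underlying data changes, instead of re-running the aggregation each time.
    SELECT /*+ RESULT_CACHE */ status, COUNT(*) AS cnt
    FROM   orders
    GROUP  BY status;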

Postgresql Application Insertion and Trigger Performance

I'm working on designing an application with a SQL backend (Postgresql) and I've got some design questions. In short, the DB will serve to store network events as they occur on the fly, so insertion speed and performance are critical due to 'real-time' actions depending on these events. The data is dumped into a speedy default format across a few tables, and I am currently using postgresql triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables that share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables, so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables for too long, so I'll keep those regularly pruned for insertion performance. In thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then use a trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the querying done by the triggers in the solution above.
Modify my initial output program to dump all the data to a single table (vs across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I over thinking this? I want to get this right. Any input is much appreciated!
You may experience a slight performance hit since there are more "things" to do (although they should not affect operations in any way). But using triggers/other PL is a good way to reduce it to a minimum, since they are executed faster than code that gets sent from your application to the DB server.
I would go with your first idea (1) since it seems to me the cleanest and most efficient way.
(2) is the most performance-hungry solution, since cron will do more queries than the other solutions that use server-side functions. (3) would be possible but will result in an "uglier" database layout.
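A hedged PostgreSQL sketch of option 1; the event and reporting table names and columns are invented:

    -- Trigger function that pushes each new event into a reporting table.
    CREATE OR REPLACE FUNCTION populate_reporting() RETURNS trigger AS $$
    BEGIN
        INSERT INTO reporting_events (event_id, event_time, source, severity)
        VALUES (NEW.event_id, NEW.event_time, NEW.source, NEW.severity);
        RETURN NEW;  -- the return value is ignored for AFTER triggers, but returning NEW is conventional
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_populate_reporting
        AFTER INSERT ON network_events
        FOR EACH ROW
        EXECUTE PROCEDURE populate_reporting();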
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
That may be way off, I'm afraid, but in a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block each other on PostgreSQL (unless they explicitly elevate locks - a SELECT ... FOR UPDATE will block writers, for example). Basically, if your tables fit comfortably in RAM you are better off reporting from them, since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM then you may have cache-miss issues induced by reporting. Here materializing your views (i.e. making trigger-maintained tables) may cut down on these, but they have a significant complexity cost. This, btw, is your option 1. So I would chalk this one up provisionally as premature optimization. Also keep in mind you may induce cache misses and lock contention by materializing the views this way, so you might cause performance problems for inserts.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2: if you mean temporary tables as in CREATE TEMPORARY TABLE, that's asking for a mess, including performance issues and reports not showing what you want them to show. Don't do it. If you do this, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.
In short I think you are overthinking it. You can get very far by bumping RAM up on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.

When to commit changes?

Using Oracle 10g, accessed via Perl DBI, I have a table with a few tens of millions of rows that is being updated a few times per second while being read from much more frequently by another process.
Soon the update frequency will increase by an order of magnitude (maybe two).
Someone suggested that committing every N updates instead of after every update will help performance.
I have a few questions:
Will that be faster or slower, or does it depend? (I'm planning to benchmark both ways as soon as I can get a decent simulation of the new load.)
Why will it help/hinder performance?
If "it depends...", on what?
If it helps, what's the best value of N?
Why can't my local DBA give me a helpful straight answer when I need one? (Actually I know the answer to that one) :-)
EDIT:
#codeslave: Thanks. BTW, losing uncommitted changes is not a problem; I don't delete the original data used for updating till I am sure everything is fine. And BTW, the cleaning lady did unplug the server, TWICE :-)
Some googling showed it might help because of issues related to rollback segments, but I still don't know a rule of thumb for N. Every few tens? Hundreds? Thousands?
#diciu: Great info, I'll definitely look into that.
A commit results in Oracle writing stuff to disk - i.e. to the redo log file, so that whatever the committed transaction has done can be recovered in the event of a power failure, etc.
Writing to a file is slower than writing to memory, so a commit will be slower if performed after every operation rather than once for a set of coalesced updates.
In Oracle 10g there's an asynchronous commit that makes it much faster but less reliable: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-6158695.html
PS: I know for sure that, in a scenario I've seen in a certain application, changing the number of coalesced updates from 5K to 50K made it faster by an order of magnitude (10 times faster).
Reducing the frequency of commits will certainly speed things up, however as you are reading and writing to this table frequently there is the potential for locks. Only you can determine the likelihood of the same data being updated at the same time. If the chance of this is low, commit every 50 rows and monitor the situation. Trial and error I'm afraid :-)
As well as reducing the commit frequency, you should also consider performing bulk updates instead of individual ones.
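A hedged PL/SQL sketch combining both suggestions; the staging/target tables, columns and the LIMIT of 50 are illustrative only (50 matches the "commit every 50 rows" starting point above):

    DECLARE
        TYPE t_ids    IS TABLE OF big_table.id%TYPE;
        TYPE t_values IS TABLE OF big_table.val%TYPE;
        l_ids    t_ids;
        l_values t_values;
        CURSOR c_pending IS
            SELECT id, new_val FROM staging_updates;
    BEGIN
        OPEN c_pending;
        LOOP
            FETCH c_pending BULK COLLECT INTO l_ids, l_values LIMIT 50;  -- work in chunks
            EXIT WHEN l_ids.COUNT = 0;

            FORALL i IN 1 .. l_ids.COUNT      -- one bulk UPDATE per chunk instead of 50 single-row ones
                UPDATE big_table
                   SET val = l_values(i)
                 WHERE id  = l_ids(i);

            COMMIT;                           -- commit every N (here 50) rows, not after each update
        END LOOP;
        CLOSE c_pending;
    END;
    /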
If you "don't delete the original data used for updating till [you are] sure everything is fine", then why don't you remove all those incremental commits in between, and rollback if there's a problem? It sounds like you effectively have built a transaction systems on top of transactions.
#CodeSlave: your question is answered by #stevechol - if I remove ALL the incremental commits there will be locks. I guess if nothing better comes along I'll follow his advice: pick a random number, monitor the load and adjust accordingly, while applying #diciu's tweaks.
PS: the transaction on top of transactions is just accidental. I get the files used for updates by FTP, and instead of deleting them immediately I set a cron job to delete them a week later (if no one using the application has complained). That means if something goes wrong I have a week to catch the errors.
Faster/Slower?
It will probably be a little faster. However, you run a greater risk of running into deadlocks, losing uncommitted changes should something catastrophic happen (the cleaning lady unplugs the server), FUD, fire, brimstone, etc.
Why would it help?
Obviously fewer commit operations, which in turn means fewer disk writes, etc.
DBA's and straight answers?
If it were easy, you wouldn't need one.