SQL - Why transactions?

I just realized I've had a headache for years. Well, metaphorically speaking. In reality I was looking at my database structure and somehow just realized I never use transactions. Doh.
There's a lot of material on the internet about transactions (begin transaction, rollback, commit, etc.), but surprisingly little detail about exactly why they are vital, and just how vital.
I understand the concept of rolling back if something goes wrong. That makes sense when one is doing multiple updates, for example in multiple tables in one go, but that is bad practice as far as I know and I don't do it. All of my queries just update one table. If a query errors, it cancels, transaction or no transaction. What else could go wrong or potentially corrupt a one-table update, besides my pulling the plug out of my server?
In other words, my question is,
exactly how vital is it that I implement transactions on all of my tables - am I fully blasphemous for not having them, or does it really matter that much?
UPDATE
+1 to invisal, who pointed out that queries are automatically wrapped in transactions, which I did not know, and who gave several good references on the subject of my question.

This made a lot of sense when one is doing multiple updates, for
example, in multiple tables in one go. But basically all of my queries
just update one table at a time. If a query errors, it cancels,
transaction or no transaction.
In your case, it changes nothing. A single statement is already a transaction by itself. For more information you can read these existing questions and answers:
What does a transaction around a single statement do?
Transaction necessary for single update query?
Do I need a transaction for a joined query?
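To illustrate the point, here is a minimal sketch (table and column names are made up): wrapping a single statement in an explicit transaction does not change its behaviour - the statement is atomic either way.

    -- A single UPDATE is already atomic: either every qualifying row
    -- is changed, or, on error, none of them are.
    UPDATE Accounts
    SET    Balance = Balance - 100
    WHERE  AccountId = 42;

    -- Functionally equivalent for a single statement:
    BEGIN TRANSACTION;
        UPDATE Accounts
        SET    Balance = Balance - 100
        WHERE  AccountId = 42;
    COMMIT TRANSACTION;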

The most important property of a database is to keep your data reliably.
Database reliability is assured by conforming to ACID principles (Atomicity, Consistency, Isolation, Durability). In the context of databases, a single logical operation on the data is called a transaction. Without transactions, such reliability would not be possible.
In addition to reliability, using transactions properly lets you improve the performance of some data operations considerably. For example, you can start a transaction, insert a lot of data (say 100k rows), and only then commit. The server does not have to actually write to disk until commit is called, effectively batching the data in memory. This can improve performance a lot.
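As a rough sketch of that idea (table and values below are invented), the difference is between paying for a log flush on every row and paying for it once at commit:

    -- Slow: each INSERT runs in its own implicit transaction,
    -- so the transaction log is flushed to disk for every row.
    INSERT INTO Measurements (SensorId, Reading) VALUES (1, 20.5);
    INSERT INTO Measurements (SensorId, Reading) VALUES (1, 20.7);
    -- ...repeated ~100k times

    -- Faster: one explicit transaction around the whole batch,
    -- so the log is flushed once at COMMIT.
    BEGIN TRANSACTION;
    INSERT INTO Measurements (SensorId, Reading) VALUES (1, 20.5);
    INSERT INTO Measurements (SensorId, Reading) VALUES (1, 20.7);
    -- ...repeated ~100k times
    COMMIT;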

You should be aware that every updating action against your database is performed inside a transaction, even if it touches only one table (SQL Server automatically creates a transaction for it).
The reason for always using transactions is to ensure ACID, as others have mentioned. Here I'd like to elaborate on the isolation point. Without transaction isolation, you may run into problems such as dirty reads, non-repeatable reads, and phantom reads.
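For example, a small sketch (table names invented): raising the isolation level of a multi-statement transaction is how you avoid a non-repeatable read.

    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    BEGIN TRANSACTION;
        SELECT Quantity FROM Stock WHERE ProductId = 7;
        -- Under plain READ COMMITTED, another session could change the row
        -- here, and the second read would return a different value.
        SELECT Quantity FROM Stock WHERE ProductId = 7;
    COMMIT;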

It depends. If you are updating one table and one row, then the only advantage is going to be in the logging. But if you update multiple rows in a table at one time, then without transactions you could still run into some corruption.

Well, it depends. SQL is most often used to supply data to a host language such as C, C++, Java, PHP, or C#. I have not worked with that many technologies, but if you are using the following combinations, then here is my point of view:
SQL with C / C++ : Commit Required
SQL with Java : Not Required
SQL with C# : Not Required
SQL with PHP : Not Required
It also depends on which SQL product you are using; behaviour differs between flavors such as Oracle, SQL Server, SQLite, MySQL, etc.
When you are using Oracle in its console (Oracle 11g, Oracle 10g, etc.), COMMIT is required.
As far as corruption of tables and data is concerned: yes, it happens, and I had a very bad experience with it. If you pull the plug or lose the connection while you are updating a table, you might end up with a massive disaster.
Concluding, I suggest you do commit.
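For the Oracle console case, a minimal sketch (table and values are made up): until you COMMIT, your changes are visible only to your own session and are rolled back if the session ends abnormally.

    INSERT INTO employees (id, name) VALUES (101, 'Alice');
    UPDATE employees SET name = 'Alicia' WHERE id = 101;
    COMMIT;  -- without this, other sessions never see the changes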

Related

Liquibase: Rollback for data modification/deletion

I have little experience with Liquibase so far, but I have read that you should always have a rollback strategy.
https://www.liquibase.org/get-started/best-practices
Always have a rollback plan
Write changesets in a way that they can be rolled back. For example, use a relevant change clause instead of using a custom <sql> tag. Include a <rollback> clause whenever a change doesn't support an out-of-box rollback (e.g., sql, insert, etc.). Learn more about rollbacks.
When data needs to be modified, or when (obsolete) data needs to be deleted, I don't know how to handle the rollback in a good way.
I found examples where database entries are deleted and the rollback definition contains two insert statements with fixed data. But this is not a real-world scenario; here we may well have tables with millions of records.
Are there any best-practices regarding data manipulation / transformation / deletion, respectively the rollback strategy?
Should data manipulation / destructive operations generally be avoided at all costs?
Hint: I'm not talking about small applications, but about enterprise solutions that are developed by many people in different teams and delivered to hundreds of customers - just to give you some context.
There are various opinions about this but, in real-life situations, I don't find rollback strategies very practical; you typically need to roll forward. That said, Liquibase was meant for structural (DDL) changes.
For data, row data in particular in the millions, I would orchestrate this outside of Liquibase as part of your release plans. I have seen large proprietary banking software handle this by renaming/copying the table to another table and then making the change.
So, if you have tableX and you need to add a column, you can copy the table to tableX_old and then modify tableX.
If the migration succeeds, delete tableX_old. If not, roll back by using tableX_old to restore your data.
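A minimal sketch of that copy-and-fall-back pattern (object names are placeholders; SQL Server-style syntax assumed):

    -- Before the migration: keep a copy to fall back on.
    SELECT * INTO tableX_old FROM tableX;

    -- Apply the change to the live table.
    ALTER TABLE tableX ADD new_column INT NULL;

    -- If the migration succeeded, drop the copy:
    DROP TABLE tableX_old;

    -- If it failed, restore from the copy instead:
    -- DROP TABLE tableX;
    -- EXEC sp_rename 'tableX_old', 'tableX';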

Synchronizing tables - does order of UPDATE INSERT DELETE matter?

I need to synchronize tables between 2 databases daily, the source is MSSQL 2008, the target is MSSQL 2005. If I use UPDATE, INSERT, and DELETE statements (i.e. UPDATE rows that changed, INSERT new rows, DELETE rows no longer present), will there be performance improvements if I perform the DELETE statement first? i.e. so that the UPDATE statement doesn't look at rows that don't need to be updated, because they will be deleted.
Here are some other things I need to consider. The tables have 1-3 million+ rows, and because of the amount of transactions and business requirements, the source DB needs to remain online, and the query needs to be as efficient as possible. The job will be run daily in a SQL server agent job on the target DB. On top of that, I am a DB rookie!
Thanks StackOverflow community, you are awesome!
I'd say: first do the DELETE, then the UPDATE, then the INSERT, so that you don't update rows which will be deleted anyway and you don't update rows which have just been inserted.
But actually, have you seen the SQL Server MERGE syntax? It could save you a great amount of code.
Update: I have not checked the performance of the MERGE statement against INSERT/UPDATE/DELETE; here is a related link given by Aaron Bertrand for more details.
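For reference, a minimal MERGE sketch (key and column names are made up, and the two databases are shown on one instance for simplicity; note that MERGE requires SQL Server 2008+, so it could not run on the 2005 target itself):

    MERGE INTO TargetDb.dbo.Customers AS t
    USING SourceDb.dbo.Customers AS s
        ON t.Id = s.Id
    WHEN MATCHED AND (t.Name <> s.Name OR t.Email <> s.Email) THEN
        UPDATE SET t.Name = s.Name, t.Email = s.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Name, Email) VALUES (s.Id, s.Name, s.Email)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;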
Rule of Thumb: DELETE, then UPDATE, then INSERT.
Performance aside, my main concern is to avoid any potential Deadlocks when:
Updating something you will immediately Delete.
Inserting something you may immediately try to Update.
If you only modify what is necessary and use transactions correctly, then you may use any order.
P.S. Someone suggested using MERGE - I've tried it a few times and my preference is to never use it.
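For completeness, here is what the DELETE, then UPDATE, then INSERT order looks like as plain set-based statements (all names are invented; across servers the source would be reached through a linked server):

    -- 1. DELETE rows that no longer exist in the source.
    DELETE t
    FROM   TargetDb.dbo.Customers AS t
    WHERE  NOT EXISTS (SELECT 1 FROM SourceDb.dbo.Customers AS s WHERE s.Id = t.Id);

    -- 2. UPDATE rows that changed.
    UPDATE t
    SET    t.Name = s.Name, t.Email = s.Email
    FROM   TargetDb.dbo.Customers AS t
    JOIN   SourceDb.dbo.Customers AS s ON s.Id = t.Id
    WHERE  t.Name <> s.Name OR t.Email <> s.Email;

    -- 3. INSERT rows that are new in the source.
    INSERT INTO TargetDb.dbo.Customers (Id, Name, Email)
    SELECT s.Id, s.Name, s.Email
    FROM   SourceDb.dbo.Customers AS s
    WHERE  NOT EXISTS (SELECT 1 FROM TargetDb.dbo.Customers AS t WHERE t.Id = s.Id);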
I think Roman's answer is what you were looking for in your current situation: DELETE, UPDATE, INSERT (or MERGE.)
Now there are other possible routes which can make things even faster, but with a rather different process:
1. Consider saving all statements in a file that you, once in a while, run against the target
Assuming both databases are exactly the same, for each SQL statement that modifies the 2008 database, save that statement in a .sql file which you later execute against the 2005 database. You have to consider locking the file while writing to it, and maybe have some kind of redundancy. However, this means you need no access to the 2008 database at all while doing the work on the 2005 database. In other words, there are no side effects on the 2008 database's speed.
Pitfall: you may miss a statement and the destination will not be an exact equivalent...
2. Ongoing replication
I do not know about MSSQL enough to tell you of a good tool to do automatic replication (see here: http://technet.microsoft.com/en-us/library/ms151198.aspx), but I'd bet you can find a good tool. MySQL (http://dev.mysql.com/doc/refman/5.0/en/replication.html) and PostgreSQL (http://wiki.postgresql.org/wiki/Streaming_Replication) have such tools and those are all free.
This would be the solution I would choose. Depending on the tool you use, it can be really very well optimized meaning that the impact on the live system will be minimal and the 2005 duplicate will be up to date within seconds (depending on whether it is a long distance remote connection or not, the amount of work, the setup of each server, the Internet connections, etc.)
The pitfall is obviously that it adds an ongoing process on the database, but if you find an MSSQL tool that works like the streaming replication of PostgreSQL, it makes use of a copy of the journal which means it is dead fast (no heavy use of disk I/O.)
3. Cluster Database (like Cassandra)
This would involve a change of database which I'm totally sure you're not ready to do (especially because most of those systems do not offer SQL,) but I thought that it would be a good thing to talk about in your situation.
A system like Cassandra (http://cassandra.apache.org/) automatically replicates its data on many computers. It can actually be set up to replicate all the data 100%, or X% of the data per computer, with redundancy in case of failure (a computer that breaks down). This alleviates the need for a specific copy on a separate computer because performance can be increased simply by adding a few nodes to your system. (At less than $1,000 a computer, it is worth it! Frankly, you could create a petabyte system for $50k or less and end up with something a lot faster than any SQL database...)
The main problem is that the use of those clusters is completely different from SQL. But that could be a solution for big businesses having large databases that need to be really fast and that do not want to invest in a mini-computer (think COBOL and $250k computers that manage 100 million rows in a few milliseconds...)
With Cassandra you can run extremely heavy batch processes on back-end computers that do not make a dent in the front-end system!

SQL - Guaranteed execution of all statements

I want to run a series of SQL INSERT statements.
The problem is I want an all-or-nothing approach. Either they all execute or if one of them doesn't execute then no changes are made to the database.
The only solution I can think of is using a conditional loop, but that would mean a lot of redundant code (determining the changes made, dropping tables, etc.).
Is there a simpler solution?
I have searched extensively for a solution but didn't find any similar questions, so apologies if this has been asked before.
You need to use a Transaction, you can find an MSDN example here.
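A minimal T-SQL sketch of the all-or-nothing pattern (table and values are placeholders):

    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO Orders (OrderId, CustomerId) VALUES (1, 100);
        INSERT INTO Orders (OrderId, CustomerId) VALUES (2, 101);
        INSERT INTO Orders (OrderId, CustomerId) VALUES (3, 102);

        COMMIT TRANSACTION;       -- all inserts become permanent together
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0
            ROLLBACK TRANSACTION; -- any failure undoes every insert
        THROW;                    -- re-raise the error (SQL Server 2012+; use RAISERROR on older versions)
    END CATCH;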
Fortunately, a good database is one that has the four ACID properties - Atomicity, Consistency, Isolation, and Durability.
The first property - Atomicity - refers to the behaviour of transactions whereby either the entire transaction is committed at once or no changes take place at all.
Read Korth's book "Database System Concepts" for further reference.
If you are using an outstanding database like Oracle, MS SQL Server, MySQL, DB2, etc., all you have to do is study a little bit about how they handle transactions and place your DML statements within these transactions.
Find out about Oracle's transactions support here.
P.S. - It looks like you're working in the banking domain. Those people are hell-bent on these things.

Zero SQL deadlock by design - any coding patterns?

I am encountering very infrequent yet annoying SQL deadlocks on a .NET 2.0 web app running on top of MS SQL Server 2005. In the past, we have been dealing with the SQL deadlocks in a very empirical way - basically tweaking the queries until they worked.
Yet, I found this approach very unsatisfactory: time consuming and unreliable. I would highly prefer to follow deterministic query patterns that would ensure by design that no SQL deadlock will be encountered - ever.
For example, in C# multithreaded programming, a simple design rule such as taking locks in lexicographical order ensures that no deadlock will ever happen.
Are there any SQL coding patterns guaranteed to be deadlock-proof?
Writing deadlock-proof code is really hard. Even when you access the tables in the same order you may still get deadlocks [1]. I wrote a post on my blog that elaborates on some approaches that will help you avoid and resolve deadlock situations.
If you want to ensure two statements/transactions will never deadlock you may be able to achieve it by observing which locks each statement consumes using the sp_lock system stored procedure. To do this you have to either be very fast or use an open transaction with a holdlock hint.
Notes:
Any SELECT statement that needs more than one lock at once can deadlock against an intelligently designed transaction which grabs the locks in reverse order.
Zero deadlocks is basically an incredibly costly problem in the general case because you must know all the tables/objects that you're going to read and modify for every running transaction (this includes SELECTs). The general philosophy is called ordered strict two-phase locking (not to be confused with two-phase commit) (http://en.wikipedia.org/wiki/Two_phase_locking ; even 2PL does not guarantee no deadlocks).
Very few DBMSs actually implement strict 2PL because of the massive performance hit such a thing causes (there are no free lunches) while all your transactions wait around for even simple SELECT statements to be executed.
Anyway, if this is something you're really interested in, take a look at SET TRANSACTION ISOLATION LEVEL in SQL Server. You can tweak that as necessary. http://en.wikipedia.org/wiki/Isolation_level
For more info, see wikipedia on Serializability: http://en.wikipedia.org/wiki/Serializability
That said -- a great analogy is like source code revisions: check in early and often. Keep your transactions small (in # of SQL statements, # of rows modified) and quick (wall clock time helps avoid collisions with others). It may be nice and tidy to do a LOT of things in a single transaction -- and in general I agree with that philosophy -- but if you're experiencing a lot of deadlocks, you may break the trans up into smaller ones and then check their status in the application as you move along. TRAN 1 - OK Y/N? If Y, send TRAN 2 - OK Y/N? etc. etc
As an aside, in my many years of being a DBA and also a developer (of multiuser DB apps measuring thousands of concurrent users) I have never found deadlocks to be such a massive problem that I needed special cognizance of it (or to change isolation levels willy-nilly, etc).
There is no magic general-purpose solution to this problem that works in practice. You can push concurrency to the application, but this can be very complex, especially if you need to coordinate with other programs running in separate memory spaces.
General answers to reduce deadlock opportunities:
Basic query optimization (proper index use), hotspot-avoidant design, holding transactions for the shortest possible time, etc.
When possible set reasonable query timeouts so that if a deadlock should occur it is self-clearing after the timeout period expires.
Deadlocks in MSSQL are often due to its default read concurrency model, so it's very important not to depend on it - assume Oracle-style MVCC in all designs. Use snapshot isolation or, if possible, the READ UNCOMMITTED isolation level.
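A sketch of what enabling the snapshot-based options looks like in SQL Server (object names are placeholders; both options put extra load on tempdb's version store, so test first):

    -- Make the default READ COMMITTED level use row versioning:
    ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;

    -- Or let sessions opt in to SNAPSHOT isolation explicitly:
    ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;

    -- In a session that wants Oracle-style MVCC reads:
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    BEGIN TRANSACTION;
        SELECT OrderId, Status FROM dbo.Orders WHERE CustomerId = 42;
    COMMIT;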
I believe the following useful read/write pattern is deadlock-proof given some constraints:
Constraints:
One table
An index or PK is used for reads/writes so the engine does not resort to table locks.
A batch of records can be read using a single SQL where clause.
Using SQL Server terminology.
Write Cycle:
All writes within a single "Read Committed" transaction.
The first update in the transaction is to a specific, always-present record
within each update group.
Multiple records may then be written in any order. (They are "protected"
by the write to the first record).
Read Cycle:
The default read committed transaction level
No transaction
Read records as a single select statement.
Benefits:
Secondary write cycles are blocked at the write of the first record until the first write transaction completes entirely.
Reads are blocked/queued/executed atomically between the write commits.
Achieve transaction level consistency w/o resorting to "Serializable".
I need this to work too so please comment/correct!!
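For what it's worth, a minimal sketch of the write cycle described above (all object names and key values are invented):

    -- Write cycle: everything in one READ COMMITTED transaction.
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    BEGIN TRANSACTION;

        -- 1. Touch the always-present "anchor" record of the update group first;
        --    its row lock serialises competing write cycles for that group.
        UPDATE dbo.OrderGroups
        SET    LastWriteAt = GETDATE()
        WHERE  GroupId = 1;

        -- 2. The remaining records of the group may then be written in any order.
        UPDATE dbo.Orders SET Status = 'shipped' WHERE OrderId = 10;
        UPDATE dbo.Orders SET Status = 'shipped' WHERE OrderId = 11;

    COMMIT TRANSACTION;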
As you said, always accessing tables in the same order is a very good way to avoid deadlocks. Furthermore, shorten your transactions as much as possible.
Another cool trick is to combine two SQL statements into one whenever you can. Single statements are always transactional. For example, use "UPDATE ... SELECT" or "INSERT ... SELECT", and use "@@ERROR" and "@@ROWCOUNT" instead of "SELECT COUNT" or "IF (EXISTS ...)".
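A small sketch of the @@ROWCOUNT idea (table is made up; under heavy concurrency you would still want this inside a transaction, possibly with an UPDLOCK hint):

    -- Instead of: IF EXISTS (...) UPDATE ... ELSE INSERT ...
    UPDATE dbo.Settings
    SET    SettingValue = 'on'
    WHERE  SettingName = 'feature_x';

    IF @@ROWCOUNT = 0
        INSERT INTO dbo.Settings (SettingName, SettingValue)
        VALUES ('feature_x', 'on');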
Lastly, make sure that your calling code can handle deadlocks by retrying the query a configurable number of times. Sometimes it just happens, it's normal behaviour, and your application must be able to deal with it.
In addition to a consistent sequence of lock acquisition, another path is the explicit use of locking and isolation hints to reduce the time/resources wasted unintentionally acquiring locks, such as shared-intent locks during reads.
Something that no one has mentioned (surprisingly) is that, where SQL Server is concerned, many locking problems can be eliminated with the right set of covering indexes for a DB's query workload. Why? Because covering indexes can greatly reduce the number of bookmark lookups into a table's clustered index (assuming it's not a heap), thus reducing contention and locking.
If you have enough design control over your app, restrict your updates / inserts to specific stored procedures and remove update / insert privileges from the database roles used by the app (only explicitly allow updates through those stored procedures).
Isolate your database connections to a specific class in your app (every connection must come from this class) and specify that "query only" connections set the isolation level to "dirty read"... the equivalent of a (NOLOCK) hint on every join.
That way you isolate the activities that can cause locks (to specific stored procedures) and take "simple reads" out of the "locking loop".
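A sketch of what that "query only" connection class might run when it opens a connection (table names invented):

    -- Run once per read-only connection, right after it is opened.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    -- Every subsequent SELECT on this connection takes no shared locks,
    -- at the cost of possibly reading uncommitted ("dirty") data.
    SELECT OrderId, Status FROM dbo.Orders WHERE CustomerId = 42;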
Quick answer is no, there is no guaranteed technique.
I don't see how you can make any application deadlock proof in general as a design principle if it has any non-trivial throughput. If you pre-emptively lock all the resources you could potentially need in a process in the same order even if you don't end up needing them, you risk the more costly issue where the second process is waiting to acquire the first lock it needs, and your availability is impacted. And as the number of resources in your system grows, even trivial processes have to lock them all in the same order to prevent deadlocks.
The best way to solve SQL deadlock problems, like most performance and availability problems is to look at the workload in the profiler and understand the behavior.
Not a direct answer to your question, but food for thought:
http://en.wikipedia.org/wiki/Dining_philosophers_problem
The "Dining philosophers problem" is an old thought experiment for examining the deadlock problem. Reading about it might help you find a solution to your particular circumstance.

Effectively transforming data from one SQL database to another in live environment

We have a bit of a messy database situation.
Our main back-office system is written in Visual Fox Pro with local data (yes, I know!)
In order to effectively work with the data in our websites, we have chosen to regularly export data to a SQL database. However the process that does this basically clears out the tables each time and does a re-insert.
This means we have two SQL databases - one that our FoxPro export process writes to, and another that our websites read from.
This question is concerned with the transform from one SQL database to the other (SqlFoxProData -> SqlWebData).
For a particular table (one of our main application tables), because various data transformations take place during this process, it's not a case of straightforward UPDATE, INSERT, and DELETE statements using self-joins - we're having to use cursors instead (I know!)
This has been working fine for many months but now we are starting to hit upon performance problems when an update is taking place (this can happen regularly during the day)
Basically when we are updating SqlWebData.ImportantTable from SqlFoxProData.ImportantTable, it's causing occasional connection timeouts/deadlocks/other problems on the live websites.
I've worked hard at optimising queries, caching etc etc, but it's come to a point where I'm looking for another strategy to update the data.
One idea that has come to mind is to have two copies of ImportantTable (A and B), some concept of which table is currently 'active', updating the non-active table, then switching the currently active table,
i.e. websites read from ImportantTableA whilst we're updating ImportantTableB, then we switch websites to read from ImportantTableB.
Question is, is this feasible and a good idea? I have done something like it before but I'm not convinced it's necessarily good for optimisation/indexing etc.
Any suggestions welcome, I know this is a messy situation... and the long term goal would be to get our FoxPro application pointing to SQL.
(We're using SQL 2005 if it helps)
I should add that data consistency isn't particularly important in this instance, seeing as the data is always slightly out of date.
There are a lot of ways to skin this cat.
I would attack the locking issues first. It is extremely rare that I would use CURSORS, and I think improving the performance and locking behavior there might resolve a lot of your issues.
I expect that I would solve it by using two separate staging tables: one for the raw FoxPro export in SQL, and one transformed into the final format, side by side in SQL. Then either swap the final table into production using sp_rename, or simply use three INSERT/UPDATE/DELETE statements to apply all changes from the final table to production. Either way, there is going to be some locking there, but how big are we talking about?
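A rough sketch of that two-staging-table approach (all object names and the transformation itself are placeholders):

    -- 1. Load the raw FoxPro export into a staging table.
    TRUNCATE TABLE dbo.ImportantTable_Staging;
    INSERT INTO dbo.ImportantTable_Staging (Id, Col1, Col2)
    SELECT Id, Col1, Col2
    FROM   SqlFoxProData.dbo.ImportantTable;

    -- 2. Build the transformed result side by side, away from the live table.
    TRUNCATE TABLE dbo.ImportantTable_Final;
    INSERT INTO dbo.ImportantTable_Final (Id, Col1, Col2)
    SELECT Id, UPPER(Col1), Col2 * 100     -- whatever the transformation is
    FROM   dbo.ImportantTable_Staging;

    -- 3. Swap the finished table into place in one short transaction.
    BEGIN TRANSACTION;
        EXEC sp_rename 'dbo.ImportantTable',       'ImportantTable_Old';
        EXEC sp_rename 'dbo.ImportantTable_Final', 'ImportantTable';
    COMMIT TRANSACTION;
    -- Drop ImportantTable_Old, or rename it back to _Final for the next cycle.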
You should be able to maintain one db for the website and just replicate to that table from the other sql db table.
This is assuming that you do not update any data from the website itself.
"For a particular table (one of our main application tables), because various data transformations take places during this process, it's not a straightforward UPDATE, INSERT and DELETE statements using self-joins, but we're having to use cursors instead (I know!)"
I cannot think of a case where I would ever need to perform an insert, update, or delete using a cursor. If you can write the SELECT for the cursor, you can convert it into an insert, update, or delete. You can join to other tables in these statements and use the CASE statement for conditional processing. Taking the time to do this in a set-based fashion may solve your problem.
One thing you may consider if you have lots of data to move: we occasionally create a view over the data we want and then have two tables - one active and one that the data will be loaded into. When the data has finished loading, as part of your process, run a simple command to switch the table the view uses to the one you just finished loading. That way the users are only down for a couple of seconds at most, and you won't create locking issues where they are trying to access data as you are loading.
You might also look at using SSIS to move the data.
Do you have the option of making the updates more atomic, rather than the stated 'clear out and re-insert'? I think Visual Fox Pro supports triggers, right? For your key tables, can you add a trigger to the update/insert/delete to capture the ID of records that change, then move (or delete) just those records?
Or how about writing all changes to an offline database, and letting SQL Server replication take care of the sync?
[sorry, this would have been a comment, if I had enough reputation!]
Based on your response to Ernie above, you asked how you replicate databases. Here is Microsoft's how-to about replication in SQL2005.
However, if you're asking about replication and how to do it, it indicates to me that you are a little light on SQL Server experience. That being said, it's fairly easy to muck things up, and while I'm all for learning by experience, if this is mission-critical data you might be better off hiring a DBA or, at the very least, testing the #$##$% out of this before you actually implement it.