SQL Query with Table Locking - sql

I am having an argument with a developer colleague on the team.
Problem: SQL query X runs for 1 second on the test system, but for an unknown amount of time on the live system (where 150 users can run queries at the same time).
The query causes locks on 8 tables, of which 7 are useless.
His idea is to put a WITH (NOLOCK) on the 7 tables so there aren't any more locks.
My argument:
My suggestion is that with NOLOCK there is a chance that user 1 runs the SELECT query, which takes 10 seconds to complete because server performance is low at that moment, while user 2 changes a value in one of the 7 tables, e.g. a customer.
Then the query result would be completely wrong, or perhaps the expected dataset can't be filled and it crashes and throws an error. So use a ROWLOCK instead.
His argument:
We don't need a rowlock, the chances of getting such a scenario are too low. We don't need to be perfect. Do what is asked of you and don't think.
What can I say to win the argument against people like this, who dismiss concerns like mine as perfectionism?

I believe, based on what you have said, that you are correct in your reasoning.
If there is ANY chance that something could go wrong, no matter how small a chance in the operation that causes the database to lose integrity it MUST be fixed.
Integrity is one of the basic premises of database design; your co-worker sounds like he is not being rigorous in his work.
If you are trying to construct a technical argument to "beat" your co worker, note that it may not give you the desired outcome you imagine.
If your co-worker is not amenable to what you are saying AND if you are REALLY sure that you are correct in your reasoning, then I would inform your team leader why you think this is important and show him your solution. If he agrees with your co-worker because he believes that database integrity is not important, then perhaps you should look at working somewhere else.
Don't get me wrong, I realise that in the real world software cannot be 'perfect', otherwise it would never be released. But something as fundamental as data integrity checking should not be skipped over, and it isn't difficult to do. It's basically the same as saying, "well, let's not bother to validate user input". This is something you learn how to do in a first-year Computer Science class!
We have enough crappy software on this planet and this is the age where we are capable of AMAZING THINGS. Sloppiness in Software Engineering doesn't have a place anymore, and I hope that you do not let your co-worker lower your standards. Keep your standards high and you will learn more than he does and eventually do better in the long run.

Locking hints in SQL Server 2000 (SS2k) were useful because SS2k was greedy about locking on UPDATE statements and would default to TABLELOCK and narrow it as it progressed. If you knew your UPDATE statement's pattern you could use locking hints to increase performance and SS2k would escalate the lock if needed.
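For illustration, a hedged sketch of such a hint on an UPDATE (the table and column names here are made up):

-- Hypothetical example of a locking hint on an UPDATE: ask for row locks
-- up front instead of starting coarse; SQL Server may still escalate the
-- lock if it decides too many rows are affected.
UPDATE dbo.Orders WITH (ROWLOCK)
SET Status = 'Shipped'
WHERE OrderId = 42;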
NOLOCK was introduced for dirty reads of locked data. If a table is frequently updated and queries that don't rely on the validity of the underlying data are being blocked, you could use NOLOCK to read the data in whatever state it was in. If you need to read records to generate a search results page you might choose to specify the NOLOCK hint to ensure your query isn't blocked by any update statements.
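A minimal sketch of that kind of dirty read (hypothetical Products table):

-- Hypothetical search-results query where a dirty read is acceptable:
-- NOLOCK lets the SELECT proceed without waiting on concurrent UPDATEs,
-- at the price of possibly seeing uncommitted or inconsistent rows.
SELECT p.ProductId, p.ProductName, p.Price
FROM dbo.Products AS p WITH (NOLOCK)
WHERE p.ProductName LIKE N'Widget%';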
I believe lock escalation was reworked in SQL Server 2005 and locking hints are no longer respected.

If you are using SQL Server, which it sounds like you are, then instead of worrying about using NOLOCK to stop readers blocking writers (a common issue on high-use SQL Server databases doing lots of reads and writes), you should consider using SQL Server row-versioning transaction isolation. This works with SQL Server 2005 and above.
This makes SQL Server work much more like Oracle does and eliminates the issues caused by readers blocking writers. Please read into the disadvantages too before you make the decision to use it.
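For reference, a minimal sketch of one way to turn this on (the database name SalesDb is hypothetical; as noted, weigh the disadvantages, such as the extra tempdb version-store load, before enabling it):

-- Read Committed Snapshot: readers see the last committed version of a
-- row instead of blocking behind writers. Switching it on requires
-- momentary exclusive access to the database.
ALTER DATABASE SalesDb SET READ_COMMITTED_SNAPSHOT ON;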

ACID: atomicity, consistency, isolation and durability. These are the basic tenets of databases that you ignore at your peril.
What your colleague is stating is that it's okay to ignore isolation, the property that you don't get to see half-done transactions. That's okay in some situations.
For example, we have a set of reports that are not used for critical business purposes but merely to give an indication as to the general health of the system. For that, 95% accuracy is good enough and we don't want the reporting to get in the way of the real work.
But, for a statement from a bank to one of its customers, 100% is the absolute minimum accuracy. In situations where you will rely on the data, isolation must be adhered to.
You need to decide which bucket your particular system falls into. I'd be willing to bet good money that the number of situations in which you can ignore any of the ACID principles is minimal.

From my experience, Murphy's law is true: If anything can go wrong, it will.
"We don't need to be perfect" is not an argument. You and your colleague certainly have requirements to conform to.

"Do what is want from you and don't
think."
Remember that you're always the person in charge of your own code, if something goes wrong, you can't say "He told me that to do it bla bla bla" ...
Your colleague is wrong; you always have to think. They pay you to use your brain; you're not an aerobics teacher (only a joke, sorry to all those aerobics teachers who love programming).

Related

SQL Server NOLOCK vs READPAST in a Queue system

I am developing a brand new product at work that is supposed to go live soon. It's predominantly an ETL product that will be dealing with enormous volumes of data in a queue type operation. So records come in, we do work on them, and then they are picked up and sent back out. So everything that's being done in this system is done over and over again until all the records have completed processing.
I'm being told by my boss (who is open and reasonable, this isn't a demand) that I should add NOLOCK hints to all the queries.
I'm torn on this because I've always read that it's bad practice. I've also read about the READPAST hint and I'm thinking that might be a good alternative, but I want to get others' opinions as no one in my organization has used it before.
My understanding is that READPAST would pick up any records that aren't locked, and would just ignore locked records, which in some systems I could see being a problem. In this system where the job will run again a few minutes later and will likely pick up the previously locked record, I don't see that it would matter. It's not super time-sensitive, so if a record takes a few minutes longer because it's locked, that's acceptable.
Wondering what others' thoughts on this are? I realize this is not a replacement for proper indexing, and I'm working on that as well.
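For reference, a minimal sketch of the READPAST dequeue pattern being discussed (the dbo.WorkQueue table and its columns are hypothetical):

-- READPAST skips rows that are currently locked by other workers instead
-- of waiting for them; UPDLOCK + ROWLOCK keep the claimed rows locked
-- until the transaction commits, so two workers cannot pick up the same
-- record.
BEGIN TRANSACTION;

SELECT TOP (10) q.QueueId, q.Payload
FROM dbo.WorkQueue AS q WITH (UPDLOCK, ROWLOCK, READPAST)
WHERE q.Status = 'Pending'
ORDER BY q.QueueId;

-- ... process the claimed rows and mark them done ...

COMMIT TRANSACTION;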

are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

summary: I've seen a lot of advice against using SELECT * in MS SQL, due to both performance and maintainability concerns. however, many of these posts are very old - 5 to 10 years! it seems, from many of these posts, that the performance concerns may have actually been quite small, even in their time, and as to the maintainability concerns ("oh no, what if someone changes the columns, and you were getting data by indexing an array! your SELECT * would get you in trouble!"), modern coding practices and ORMs (such as Dapper) seem - at least in my experience - to eliminate such concerns.
and so: are there concerns with SELECT * that are still relevant today?
greater context: I've started working at a place with a lot of old MS code (ASP scripts, and the like), and I've been helping to modernize a lot of it, however: most of my SQL experience is actually from MySQL and PHP frameworks and ORMs - this is my first time working with MS SQL - and I know there are subtle differences between the two. ALSO: my co-workers are a little older than I am, and have some concerns that - to me - seem "older". ("nullable fields are slow! avoid them!") but again: in this particular field, they definitely have more experience than I do.
for this reason, I'd also like to ask: whether SELECT * with modern ORMs is or isn't safe and sane to do today, are there recent online resources which indicate such?
thanks! :)
I will not touch maintainability in this answer, only performance part.
Performance in this context has little to do with ORMs.
It doesn't matter to the server how the query that it is running was generated, whether it was written by hand or generated by the ORM.
It is still a bad idea to select columns that you don't need.
It doesn't really matter from the performance point of view whether the query looks like:
SELECT * FROM Table
or all columns are listed there explicitly, like:
SELECT Col1, Col2, Col3 FROM Table
If you need just Col1, then make sure that you select only Col1. Whether it is achieved by writing the query by hand or by fine-tuning your ORM, it doesn't matter.
Why selecting unnecessary columns is a bad idea:
extra bytes to read from disk
extra bytes to transfer over the network
extra bytes to parse on the client
But the most important reason is that the optimiser may not be able to generate a good plan. For example, if there is a covering index that includes all requested columns, the server will usually read just this index; but if you request more columns, it will do extra lookups, use some other index, or just scan the whole table. The final impact can vary from negligible to seconds versus hours of run time. The larger and more complicated the database, the more likely you are to see a noticeable difference.
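As a rough illustration (SQL Server flavour; the table and columns follow the examples above, and the index itself is hypothetical):

-- Hypothetical covering index: Col1 is the search key, Col2 is carried
-- along in the leaf level so the first query never touches the table.
CREATE INDEX IX_Table_Col1 ON dbo.[Table] (Col1) INCLUDE (Col2);

-- Can be answered from the index alone:
SELECT Col1, Col2 FROM dbo.[Table] WHERE Col1 = 'x';

-- Forces lookups into the base table (or a scan), because the index
-- does not contain every column:
SELECT * FROM dbo.[Table] WHERE Col1 = 'x';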
There is a detailed article on this topic, Myth: Select * is bad, on the Use the Index, Luke website.
Now that we have established a common understanding of why selecting everything is bad for performance, you may ask why it is listed as a myth? It's because many people think the star is the bad thing. Further they believe they are not committing this crime because their ORM lists all columns by name anyway. In fact, the crime is to select all columns without thinking about it—and most ORMs readily commit this crime on behalf of their users.
I'll add answers to your comments here.
I have no idea how to approach an ORM that doesn't give me an option of which fields to select. I personally would try not to use it. In general, an ORM adds a layer of abstraction that leaks badly. https://en.wikipedia.org/wiki/Leaky_abstraction
It means that you still need to know how to write SQL code and how DBMS runs this code, but also need to know how ORM works and generates this code. If you choose not to know what's going on behind ORM you'll have unexplainable performance problems when your system grows beyond trivial.
You said that at your previous job you used an ORM for a large system without problems. It worked for you. Good. I have a feeling, though, that your database was not really large (did you have billions of rows?) and the nature of the system allowed performance questions to be hidden behind the cache (this is not always possible). The system may never grow beyond the hardware capacity. If your data fits in cache, it will usually be reasonably fast in any case. It begins to matter only when you cross a certain threshold, after which suddenly everything becomes slow and it is hard to fix.
It is common for a business/project manager to ignore possible future problems which may never happen. Business always has more pressing, urgent issues to deal with. If the business/system grows enough that performance becomes a problem, it will either have accumulated enough resources to refactor the whole system, or it will continue working with increasing inefficiency, or, if the system happens to be really critical to the business, it will just fail and give a chance to another company to overtake it.
Answering your question "whether to use ORMs in applications where performance is a large concern": of course you can use an ORM. But you may find it more difficult than not using one. With an ORM and performance in mind, you have to manually inspect the SQL code that the ORM generates and make sure that it is good code from a performance point of view. So you still need to know SQL and the specific DBMS that you use very well, and you need to know your ORM very well to make sure it generates the code that you want. Why not just write the code that you want directly?
You may think that this situation with ORM vs raw SQL somewhat resembles a highly optimising C++ compiler vs writing your code in assembler manually. Well, it is not. A modern C++ compiler will indeed, in most cases, generate code that is better than what you could write manually in assembler. But the compiler knows the processor very well, and the nature of the optimisation task is much simpler than what you have in the database. An ORM has no idea about the volume of your data and knows nothing about your data distribution.
The simple classic example of top-n-per-group can be done in two ways, and the best method depends on the data distribution that only the developer knows. If performance is important, even when you write SQL code by hand you have to know how the DBMS works and interprets this SQL code, and lay out your code in such a way that the DBMS accesses the data in an optimal way. SQL itself is a high-level abstraction that may require fine-tuning to get the best performance (for example, there are dozens of query hints in SQL Server). The DBMS has some statistics and its optimiser tries to use them, but it is often not enough.
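As a sketch of that top-n-per-group example (hypothetical Customers/Orders tables; which variant wins depends on the data distribution, as described above):

-- Variant 1: number the rows per customer, then filter.
-- Tends to win when most groups are needed and the table is scanned anyway.
SELECT CustomerId, OrderId, OrderDate
FROM (
    SELECT CustomerId, OrderId, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
) AS x
WHERE rn <= 3;

-- Variant 2: TOP per customer via CROSS APPLY.
-- Tends to win when there are few customers and an index on
-- (CustomerId, OrderDate DESC) exists.
SELECT c.CustomerId, o.OrderId, o.OrderDate
FROM dbo.Customers AS c
CROSS APPLY (
    SELECT TOP (3) OrderId, OrderDate
    FROM dbo.Orders
    WHERE CustomerId = c.CustomerId
    ORDER BY OrderDate DESC
) AS o;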
And now on top of this you add another layer of ORM abstraction.
Having said all this, "performance" is a vague term. All these concerns become important only after a certain threshold. Since modern hardware is pretty good, this threshold has been pushed rather far, allowing a lot of projects to ignore all these concerns.
Example: an optimal query over a table with a million rows returns in 10 milliseconds; a non-optimal query returns in 1 second. 100 times slower. Would the end user notice? Maybe, but likely it's not critical. Grow the table to a billion rows, or instead of one user have 1,000 concurrent users. 1 second vs 100 seconds. The end user would definitely notice, even though the ratio (100 times slower) is the same. In practice the ratio would increase as the data grows, because various caches would become less and less useful.
From a SQL Server performance point of view, you should NEVER EVER use SELECT *, because it means SQL Server has to read the complete row from disk or RAM. Even if you need all fields, I would suggest not doing SELECT *, because you do not know who might append data to the table that your application does NOT need. For details see the answer of #sandip-patel.
From a DBA perspective: if you give exactly those column names you need, the DBA can better analyse and optimize his databases.
From an ORM point of view with changing column names, I would suggest NOT using SELECT *. You WANT to know if the table changes. How can you guarantee that your application runs and gives correct results if you do not get errors when the underlying tables change?
Personal opinion: I really do not work with ORMs in applications that need to perform well...
This question has been out for some time now, and no one seems to be able to find what Ben is looking for...
I think this is because the answer is "it depends".
There just IS NOT one single answer to this.
Examples
As I pointed out before, if a database is not yours and it may be altered often, you cannot guarantee performance, because with SELECT * the amount of data per row may explode
If you write an application using ITS OWN database, no one alters your DB (hopefully) and you need your columns, so what's wrong with SELECT *?
If you build some kind of lazy loading with "main properties" being loaded instantly and others being loaded later (of the same entity), you cannot go with SELECT * because you get everything
If you use SELECT *, other developers will every time wonder "did he think about SELECT *?" as they try to optimize. So you should add enough comments...
If you build a 3-tier application with large caches in the middle tier and performance is handled by the cache, you may use SELECT *
Expanding on 3-tier: if you have many, many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle tier with every byte being wasted (as someone pointed out in the comments before)
If you build a small app for 3 users and some thousands of records, the budget may not give you time to optimize speed/DB layout/anything
Speak to your DBA... He will advise you WHICH statements have to be changed/optimized/stripped down/...
I could go on. There just is not ONE answer. It just depends on too many factors.
It is generally a better idea to select the column names explicitly. Should a table receive an extra column, it would be loaded by a SELECT * call even though the extra column is not needed.
This can have several implications:
More network traffic
More I/O (got to read more data from disk)
Possibly even more I/O (a covering index cannot be used - a table scan is performed to get the data)
Possibly even more CPU (a covering index cannot be used so data needs sorting)
EXCEPTION: the only place where SELECT * is OK is in the sub-query after an EXISTS or NOT EXISTS predicate, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2 Where column = t1.colA)
Maintainability point.
If you do a "Select * from Table"
Then I alter the Table and add a column.
Your old code will likely crash as it now has an additional column in it.
This creates a nightmare for future revisions because you have to identify all the locations of the SELECT *.
The speed difference is so minimal I would not be concerned about it. There is a speed difference between using VARCHAR and CHAR (CHAR is faster), but it is so minimal it is hardly worth talking about.
Select *'s biggest issue is with changes (additions) to the table structure.
Maintainability nightmare. A sign of a junior programmer and poor project code. That being said, I still use SELECT *, but I intend to remove it before I go to production with my code.

How far can you really go with "eventual" consistency and no transactions (aka SimpleDB)?

I really want to use SimpleDB, but I worry that without real locking and transactions the entire system is fatally flawed. I understand that for high-read/low-write apps it makes sense, since eventually the system becomes consistent, but what about the time in between? It seems like the right query against an inconsistent DB would perpetuate havoc throughout the entire database in a way that's very hard to track down. Hopefully I'm just being a worrywart...
This is the pretty classic battle between consistency and scalability and - to some extent - availability. Some data doesn't always need to be that consistent. For instance, look at digg.com and the number of diggs against a story. There's a good chance that value is duplicated in the "digg" record rather than forcing the DB to do a join against the "user_digg" table. Does it matter if that number isn't perfectly accurate? Probably not. Then using something like SimpleDB might be a good fit. However if you are writing a banking system, you should probably value consistency above all else. :)
Unless you know from day 1 that you will have to deal with massive scale, I would stick to simpler, more conventional systems like an RDBMS. If you are working somewhere with a reasonable business model, you will hopefully see a big spike in revenue if there's a big spike in traffic. Then you can use that money to help solve the scaling problems. Scaling is hard and scaling is hard to predict. Most of the scaling problems that hurt you will be ones that you never expect.
I would much rather get a site off the ground and spend a few weeks fixing scale issues when traffic picks up than spend so much time worrying about scale that we never make it to production because we run out of money. :)
Assuming you're talking about this SimpleDB, you're not being a worrywart; there are real reasons not to use it as a real world DBMS.
The properties that you get from transaction support in a DBMS can be abbreviated by the acronym "A.C.I.D.": Atomicity, Consistency, Isolation, and Durability. The A and D have mostly to do with system crashes, and the C and I have to do with regular operation. They're all things people totally take for granted when working with commercial databases, so if you work with a database that doesn't have one or more of them, you might be in for any number of nasty surprises.
Atomicity: Any transaction will either complete fully or not at all (i.e. it will either commit or abort cleanly). This applies to single statements (like "UPDATE table ...") as well as longer, more complicated transactions. If you don't have this, then anything that goes wrong (like, the disk getting full, the computer crashing, etc.) might leave something half-done. In other words, you can't ever rely on the DBMS to really do the things you tell it to, because any number of real-world problems can get in the way, and even a simple UPDATE statement might get partially completed.
Consistency: Any rules you've set up about the database will always be enforced. Like, if you have a rule that says A always equals B, then nothing anybody does to the database system can break that rule - it'll fail any operation that tries. This isn't quite as important if all your code is perfect ... but really, when is that ever the case? Plus, if you're missing this safety net, things get really yucky when you lose ...
Isolation: Any actions taken on the database will execute as if they happened serially (one at a time), even if in reality they're happening concurrently (interleaved with each other). If more than one user is going to hit this database at the same time, and you don't have this, then things you can't even dream up will go wrong; even atomic statements can interact with each other in unforeseen ways and screw things up.
Durability: If you lose power or the software crashes, what happens to database transactions that were in progress? If you have durability, the answer is "nothing - they're all safe". Databases do this by using something called "Undo / Redo Logging", where every little thing you do to the database is first logged (typically on a separate disk for safety) in a way such that you can reconstruct the current state after a failure. Without that, the other properties above are sort of useless, because you can never be 100% sure that things will stay consistent after a crash.
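For contrast, a minimal T-SQL sketch (hypothetical Accounts table) of the atomicity and consistency guarantees described above: the transfer either applies completely or not at all, and the CHECK constraint enforces the balance rule.

-- Atomicity + consistency sketch: both updates commit together or
-- neither does, and the constraint refuses any state with a negative
-- balance.
CREATE TABLE dbo.Accounts (
    AccountId INT PRIMARY KEY,
    Balance   DECIMAL(18, 2) NOT NULL CHECK (Balance >= 0)
);

SET XACT_ABORT ON;  -- any run-time error rolls the whole transaction back

BEGIN TRANSACTION;
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 1;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 2;
COMMIT TRANSACTION;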
Do any of these things matter to you? The answer has everything to do with the types of transactions you're doing, and what guarantees you want in a failure situation. There may well be cases (like a read-only database) where you don't need these, but as soon as you start doing anything non-trivial, and something bad happens, you'll wish you had 'em. Maybe it's OK for you to just revert to a backup anytime something unexpected happens, but my guess is that it isn't.
Also note that dropping all of these protections doesn't make it a given that your database will perform better; in fact, it's probably the opposite. That's because real-world DBMS software also has tons of code to optimize query performance. So, if you write a query that joins 6 tables on SimpleDB, don't assume that it'll figure out the optimal way to run that query - you might end up waiting hours for it to complete, when a commercial DBMS could use an indexed hash join and get it in .5 seconds. There are a zillion little tricks that you can do to optimize query performance, and believe me, you'll really miss them when they're gone.
None of this is meant as a knock on SimpleDB; take it from the author of the software: "Although it is a great teaching tool, I can't imagine that anyone would want to use it for anything else."

Zero SQL deadlock by design - any coding patterns?

I am encountering very infrequent yet annoying SQL deadlocks on a .NET 2.0 web app running on top of MS SQL Server 2005. In the past, we have been dealing with the SQL deadlocks in a very empirical way - basically tweaking the queries until they work.
Yet I found this approach very unsatisfactory: time-consuming and unreliable. I would highly prefer to follow deterministic query patterns that ensure by design that no SQL deadlock will be encountered - ever.
For example, in C# multithreaded programming, a simple design rule such as "locks must be taken following their lexicographical order" ensures that no deadlock will ever happen.
Are there any SQL coding patterns guaranteed to be deadlock-proof?
Writing deadlock-proof code is really hard. Even when you access the tables in the same order you may still get deadlocks [1]. I wrote a post on my blog that elaborates through some approaches that will help you avoid and resolve deadlock situations.
If you want to ensure two statements/transactions will never deadlock you may be able to achieve it by observing which locks each statement consumes using the sp_lock system stored procedure. To do this you have to either be very fast or use an open transaction with a holdlock hint.
Notes:
Any SELECT statement that needs more than one lock at once can deadlock against an intelligently designed transaction which grabs the locks in reverse order.
Zero deadlocks is basically an incredibly costly problem in the general case because you must know all the tables/obj that you're going to read and modify for every running transaction (this includes SELECTs). The general philosophy is called ordered strict two-phase locking (not to be confused with two-phase commit) (http://en.wikipedia.org/wiki/Two_phase_locking ; even 2PL does not guarantee no deadlocks)
Very few DBMS actually implement strict 2PL because of the massive performance hit such a thing causes (there are no free lunches) while all your transactions wait around for even simple SELECT statements to be executed.
Anyway, if this is something you're really interested in, take a look at SET TRANSACTION ISOLATION LEVEL in SQL Server. You can tweak that as necessary. http://en.wikipedia.org/wiki/Isolation_level
For more info, see wikipedia on Serializability: http://en.wikipedia.org/wiki/Serializability
That said -- a great analogy is source code revisions: check in early and often. Keep your transactions small (in number of SQL statements and rows modified) and quick (wall clock time helps avoid collisions with others). It may be nice and tidy to do a LOT of things in a single transaction -- and in general I agree with that philosophy -- but if you're experiencing a lot of deadlocks, you may break the transaction up into smaller ones and then check their status in the application as you move along. TRAN 1 - OK Y/N? If Y, send TRAN 2 - OK Y/N? etc.
As an aside, in my many years of being a DBA and also a developer (of multiuser DB apps measuring thousands of concurrent users) I have never found deadlocks to be such a massive problem that I needed special cognizance of it (or to change isolation levels willy-nilly, etc).
There is no magic general-purpose solution to this problem that works in practice. You can push concurrency to the application, but this can be very complex, especially if you need to coordinate with other programs running in separate memory spaces.
General answers to reduce deadlock opportunities:
Basic query optimization (proper index use), hotspot-avoidance in the design, holding transactions for the shortest possible time, etc.
When possible set reasonable query timeouts so that if a deadlock should occur it is self-clearing after the timeout period expires.
Deadlocks in MSSQL are often due to its default read concurrency model, so it's very important not to depend on it - assume Oracle-style MVCC in all designs. Use snapshot isolation or, if possible, the READ UNCOMMITTED isolation level.
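A minimal sketch of the snapshot-isolation suggestion (the database and table names are hypothetical):

-- One-time database setting: allow the SNAPSHOT isolation level.
ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Per session/transaction: readers work against row versions instead of
-- taking shared locks, so they neither block writers nor deadlock with them.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    SELECT OrderId, Status FROM dbo.Orders WHERE CustomerId = 42;
COMMIT TRANSACTION;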
I believe the following read/write pattern is deadlock-proof given some constraints:
Constraints:
One table
An index or PK is used for reads/writes so the engine does not resort to table locks.
A batch of records can be read using a single SQL where clause.
Using SQL Server terminology.
Write Cycle:
All writes within a single "Read Committed" transaction.
The first update in the transaction is to a specific, always-present record within each update group.
Multiple records may then be written in any order. (They are "protected" by the write to the first record.)
Read Cycle:
The default read committed transaction level
No transaction
Read records as a single select statement.
Benefits:
Secondary write cycles are blocked at the write of first record until the first write transaction completes entirely.
Reads are blocked/queued/executed atomically between the write commits.
Achieve transaction level consistency w/o resorting to "Serializable".
I need this to work too so please comment/correct!!
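To make the pattern above concrete, a rough sketch under the stated constraints (the table and column names are hypothetical, and as the author asks, treat it as something to review rather than proven deadlock-proof):

-- Sketch only: illustrative values, not production code.
DECLARE @GroupId INT = 7, @WriterId INT = 1, @NewValue INT = 42;

-- Write cycle: the first UPDATE always hits the same always-present
-- "anchor" row of the group, so a competing writer blocks here before
-- it can take any other locks in a conflicting order.
BEGIN TRANSACTION;  -- default READ COMMITTED

UPDATE dbo.Items
SET LastWriterId = @WriterId
WHERE GroupId = @GroupId AND IsGroupAnchor = 1;

UPDATE dbo.Items
SET ItemValue = @NewValue
WHERE GroupId = @GroupId AND IsGroupAnchor = 0;  -- rest of the group, any order

COMMIT TRANSACTION;

-- Read cycle: a single SELECT, no explicit transaction.
SELECT ItemId, ItemValue
FROM dbo.Items
WHERE GroupId = @GroupId;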
As you said, always accessing tables in the same order is a very good way to avoid deadlocks. Furthermore, shorten your transactions as much as possible.
Another cool trick is to combine two SQL statements into one whenever you can. Single statements are always transactional. For example, use "UPDATE ... SELECT" or "INSERT ... SELECT", and use "@@ERROR" and "@@ROWCOUNT" instead of "SELECT COUNT" or "IF (EXISTS ...)".
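For instance, a small sketch of the "single statement instead of check-then-act" idea (hypothetical Jobs table):

-- Instead of SELECT COUNT / IF EXISTS followed by a separate UPDATE, do
-- the conditional UPDATE in one statement and inspect @@ROWCOUNT
-- afterwards: one atomic statement, fewer lock windows.
UPDATE dbo.Jobs
SET Status = 'Running'
WHERE JobId = 42
  AND Status = 'Queued';

IF @@ROWCOUNT = 0
    PRINT 'Job 42 was not in the Queued state (or does not exist).';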
Lastly, make sure that your calling code can handle deadlocks by retrying the query a configurable number of times. Sometimes it just happens; it's normal behaviour and your application must be able to deal with it.
In addition to a consistent sequence of lock acquisition, another path is explicit use of locking and isolation hints to reduce the time/resources wasted unintentionally acquiring locks, such as shared-intent locks during reads.
Something that no one has mentioned (surprisingly) is that where SQL Server is concerned, many locking problems can be eliminated with the right set of covering indexes for a DB's query workload. Why? Because covering indexes can greatly reduce the number of bookmark lookups into a table's clustered index (assuming it's not a heap), thus reducing contention and locking.
If you have enough design control over your app, restrict your updates / inserts to specific stored procedures and remove update / insert privileges from the database roles used by the app (only explicitly allow updates through those stored procedures).
Isolate your database connections to a specific class in your app (every connection must come from this class) and specify that "query only" connections set the isolation level to "dirty read" ... the equivalent to a (nolock) on every join.
That way you isolate the activities that can cause locks (to specific stored procedures) and take "simple reads" out of the "locking loop".
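A minimal sketch of that setup (the role, procedure, and table names are hypothetical):

-- The application role can read and can execute the write procedures,
-- but cannot modify tables directly.
DENY INSERT, UPDATE, DELETE ON SCHEMA::dbo TO AppRole;
GRANT EXECUTE ON dbo.usp_UpdateOrderStatus TO AppRole;

-- "Query only" connections: every read behaves like WITH (NOLOCK).
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT OrderId, Status FROM dbo.Orders WHERE CustomerId = 42;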
Quick answer is no, there is no guaranteed technique.
I don't see how you can make any application deadlock proof in general as a design principle if it has any non-trivial throughput. If you pre-emptively lock all the resources you could potentially need in a process in the same order even if you don't end up needing them, you risk the more costly issue where the second process is waiting to acquire the first lock it needs, and your availability is impacted. And as the number of resources in your system grows, even trivial processes have to lock them all in the same order to prevent deadlocks.
The best way to solve SQL deadlock problems, like most performance and availability problems is to look at the workload in the profiler and understand the behavior.
Not a direct answer to your question, but food for thought:
http://en.wikipedia.org/wiki/Dining_philosophers_problem
The "Dining philosophers problem" is an old thought experiment for examining the deadlock problem. Reading about it might help you find a solution to your particular circumstance.

SQL With A Safety Net

My firm have a talented and smart operations staff who are working very hard. I'd like to give them a SQL-execution tool that helps them avoid common, easily-detected SQL mistakes that are easy to make when they are in a hurry. Can anyone suggest such a tool? Details follow.
Part of the operations team remit is writing very complex ad-hoc SQL queries. Not surprisingly, operators sometimes make mistakes in the queries they write because they are so busy.
Luckily, their queries are all SELECTs not data-changing SQL, and they are running on a copy of the database anyway. Still, we'd like to prevent errors in the SQL they run. For instance, sometimes the mistakes lead to long-running queries that slow down the duplicate system they're using and inconvenience others until we find the culprit query and kill it. Worse, occasionally the mistakes lead to apparently-correct answers that we don't catch until much later, with consequent embarrassment.
Our developers also make mistakes in complex code that they write, but they have Eclipse and various plugins (such as FindBugs) that catch errors as they type. I'd like to give operators something similar - ideally it would see
SELECT U.NAME, C.NAME FROM USER U, COMPANY C WHERE U.NAME = 'ibell';
and before you executed it, it would say "Hey, did you realise that's a Cartesian product? Are you sure you want to do that?" It doesn't have to be very smart - finding obviously missing join conditions and similar evident errors would be fine.
It looks like TOAD should do this but I can't seem to find anything about such a feature. Are there other tools like TOAD that can provide this kind of semi-intelligent error correction?
Update: I forgot to mention that we're using MySQL.
If your people are using the mysql(1) program to run queries, you can use the safe-updates option (aka i-am-a-dummy) to get you part of what you need. Its name is somewhat misleading; it not only prevents UPDATE and DELETE without a WHERE (which you're not worried about), but also adds an implicit LIMIT 1000 to SELECT statements, and aborts SELECTs that have joins and are estimated to consider over 1,000,000 tuples --- perfect for discouraging Cartesian joins.
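For reference, a sketch of the equivalent per-session settings (the limits mirror the ones described above; adjust as needed):

-- MySQL session settings behind --safe-updates / --i-am-a-dummy:
SET sql_safe_updates = 1;      -- refuse UPDATE/DELETE without a key-based WHERE/LIMIT
SET sql_select_limit = 1000;   -- implicit LIMIT on SELECTs
SET max_join_size = 1000000;   -- abort SELECTs estimated to examine more rows than this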
..."writing very complex ad-hoc SQL queries.... they are so busy"
Danger Will Robinson!
Automate Automate Automate.
Ideally, the ops team should not be put into a position where they have to write queries on the fly in a high-stress situation - it's a recipe for disaster! Better for them to build up a library of pre-written scripts that have undergone the appropriate testing to make sure they a) do what you want, b) provide an audit trail, and c) have a possible 'undo' type function.
Failing that, giving them a user ID that only has SELECT permissions might help :-)
You might find SQL Prompt from Redgate useful. I'm not sure what database engine you're using, though, as it's only for MS SQL Server.
I'm not expecting anything like this to exist. The tool would have to first implement everything that the SQL parser in your database implements, and then it would have to do a data model analysis to predict "bad" queries.
Your best bet might be to write a plugin for a text editor that did some basic checking for suspicious patterns and highlighted them differently than the standard .sql mode. But even that would be quite difficult.
I would be happy with a tool that set off alarm bells whenever I typed in an update statement without a where clause. And perhaps administered a mild electric shock, since it's usually about 1 in the morning after a long day when mistakes like that happen.
It would be pretty easy to build this by setting up a sample database with an extremely small amount of dummy data, which would receive the query first. A couple of things will happen:
You might get a SQL syntax error, which would not load the database much since it's a small database.
You might get back a response which could clearly be shown to contain every row in one or more tables, which is probably not what they want.
Things which pass the above conditions are likely to be okay, so you can run them against the copy of the production database.
Assuming your schema doesn't change much and is not particularly weird, writing the above is likely the quickest solution to your problem.
I'd start with some coding standards - for instance, never use the type of join in your example; it often leads to bad results (especially in SQL Server: if you try to do an outer join that way, you will get bad results). Require them to use explicit joins, as in the sketch below.
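A sketch of the rewrite (the COMPANY_ID join column is an assumption, since the original schema isn't shown):

-- Explicit INNER JOIN: the join condition has to be written out, so an
-- accidental Cartesian product is much harder to produce.
SELECT U.NAME, C.NAME
FROM USER U
INNER JOIN COMPANY C ON C.COMPANY_ID = U.COMPANY_ID
WHERE U.NAME = 'ibell';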
If you have complex relationships, you might consider putting them in views and then writing the ad-hoc queries from the views. Then at least they will never make the mistake of getting the joins wrong.
Can't you just limit the amount of time a query can run for? I'm not sure about MySQL, but in SQL Server even the default Query Analyzer can restrict how long queries will run before they time out. Couple that with limited rights so they can only run SELECT queries, and you should be pretty much covered.