Would you send this simple SQL back for rework?

We have a web application in which the users perform ad-hoc queries based on parameters that have been entered. I might also mention that response time is of high importance to the users.
The web page dynamically constructs a SQL statement to execute based on the parameters entered. For example, if the user enters "1" for "Business Unit", we construct SQL like this:
SELECT * FROM FACT WHERE
BUSINESS_UNIT = '1'
--AND other criteria based on the input params
I found that when the user does not specify a BUSINESS_UNIT, the following query is constructed:
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%'
--AND other criteria based on the input params
IMHO, this is unnecessarily (if not grossly) inefficient and warrants sending the code back for modification, but since I already have a much higher rate of sending code back for rework than others, I believe I may be earning a reputation for being "too picky."
If this is an inappropriate question because it is not a direct coding Q, let me know and I will delete it immediately. I am very confused whether subjective questions like this are allowed or not! I'll be watching your replies.
ty
Update:
I am using an Oracle database.
My impression is that Oracle does not optimize away "LIKE '%'" by removing the condition, and that leaving it in is less efficient. Can someone confirm?

The two queries are completely different (from the point of view of a result set)
SELECT * FROM FACT;
and
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%';
The first will return all rows; the second will not return rows where BUSINESS_UNIT is NULL, because anything compared to NULL yields NULL and therefore does not satisfy the predicate. This is how it would work in Oracle.
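A minimal illustration of the difference (a hypothetical one-column version of the FACT table):
-- Hypothetical single-column FACT table, just to show the NULL behaviour.
CREATE TABLE FACT (BUSINESS_UNIT VARCHAR2(10));
INSERT INTO FACT VALUES ('1');
INSERT INTO FACT VALUES (NULL);

SELECT COUNT(*) FROM FACT;                               -- 2: all rows
SELECT COUNT(*) FROM FACT WHERE BUSINESS_UNIT LIKE '%';  -- 1: the NULL row fails the predicate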

Although that looks grossly inefficient, I just tested it out in SQL Server, and the query optimizer was smart enough to filter it out.
In other words,
SELECT * FROM FACT WHERE
BUSINESS_UNIT LIKE '%'
and
SELECT * FROM FACT
generated the exact same query plan. So there shouldn't be a performance difference (depending on your DB engine I guess), although it does look kind of sloppy.
That may affect your decision to send it back or not. Personally I probably would, but if you're under a cloud already then it's something you can probably relax on, at least in terms of performance.

That query, with the BUSINESS_UNIT LIKE '%', doesn't look particularly efficient -- I suppose it'll force a full scan of the table...
So, yes, I would try to have that query reworked to deal with that kind of situation in a proper way -- or, at least, I would report the problem through our bug tracker (I'm not sure what priority I would assign, but at least the problem would be logged somewhere, and someone will correct it someday, when there is no higher-priority problem to deal with).

SELECT * is a red flag. Specify a column list for the query.

I would send it back and I like the question.
All code review is to some extent subjective. The criteria should be based on a number of factors.
Does it work?
Does it meet reasonable expectations of performance, maintainability, usability, and scalability?
While I have not tested this particular construct, I suspect (as do you) that this code will do horrible things to your SQL server. Thus performance and scalability are called into question.

The writer wanted a dummy condition so that he/she could append "AND" to all the subsequent conditions. Changing it to a better no-op condition like 1=1 would help. Or they could do slightly more work and insert the "WHERE" intelligently.
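For example, a sketch of what the generated statements could look like (the date criterion is a hypothetical extra parameter):
-- No parameters supplied: the always-true condition keeps the WHERE clause valid.
SELECT * FROM FACT
WHERE 1 = 1;

-- BUSINESS_UNIT and a (hypothetical) date parameter supplied: each condition is simply appended with AND.
SELECT * FROM FACT
WHERE 1 = 1
  AND BUSINESS_UNIT = '1'
  AND TRANS_DATE >= DATE '2024-01-01';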

I would question why bind variables are not being used (unless there is a good reason in particular cases):
SELECT * FROM FACT WHERE
BUSINESS_UNIT = '1'
Why is it not:
SELECT * FROM FACT WHERE
BUSINESS_UNIT = :bu
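To illustrate why this matters (a SQL*Plus-style sketch, reusing the :bu bind name from above): with a literal, each distinct business unit produces a different statement text and a fresh hard parse, while a bind variable lets Oracle reuse one cached plan.
-- SQL*Plus / SQL Developer sketch: one statement text, reused for every value.
VARIABLE bu VARCHAR2(10)
EXEC :bu := '1'

SELECT * FROM FACT
WHERE BUSINESS_UNIT = :bu;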

Related

Performance in these two SQL scenarios

I had assumed the first call below was more efficient, because it only makes the check once to see whether @myInputParameter is less than 5000.
If the check fails, I avoid a query altogether. However, I've seen other people code like the second example, saying it's just as efficient, if not more so.
Can anyone tell me which is quicker? It seems like the second one would be much slower, especially if the call is combing through a large data set.
First call:
IF (@myInputParameter < 5000)
BEGIN
    SELECT @myCount = COUNT(1)
    FROM myTable
    WHERE someColumn = @someInputParameter
      AND someOtherColumn = 'Hello'
      --and so on
END
Second call:
SELECT @myCount = COUNT(1)
FROM myTable
WHERE someColumn = @someInputParameter
  AND someOtherColumn = 'Hello'
  AND @myInputParameter < 5000
  --and so on
Edit: I'm using SQL Server 2008 R2, but I'm really asking to get a feel for which query is "best practice" for SQL. I'm sure the difference in query-time between these two statements is a thousandth of a second, so it's not THAT critical. I'm just interested in writing better SQL code in general. Thanks
Sometimes, SQL Server is clever enough to transform the latter into the former. This manifests itself as a "startup predicate" on some plan operator like a filter or a loop-join. This causes the query to evaluate very quickly in small, constant time. I just tested this, btw.
You can't rely on this for all queries but once having verified that it works for a particular query through testing I'd rely on it.
If you use OPTION (RECOMPILE) it becomes even more reliable, because this option inlines the parameter values into the query plan, causing the entire query to turn into a constant scan.
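For instance, a sketch using the (hypothetical) names from the question:
-- With RECOMPILE, @myInputParameter is treated as a known value at compile time,
-- so when it is >= 5000 the plan can collapse to a constant scan.
SELECT @myCount = COUNT(1)
FROM myTable
WHERE someColumn = @someInputParameter
  AND someOtherColumn = 'Hello'
  AND @myInputParameter < 5000
OPTION (RECOMPILE);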
SQL Server is well designed. It will evaluate literal conditions without actually scanning the table/indexes if it doesn't need to. So I would expect them to perform basically identically.
From a practice standpoint, I think one should use the if statement, especially if it wraps multiple statements. But, this really is a matter of preference to me. For me, code that can't be executed would logically be faster than code that "should" execute without actually hitting the data.
Also, there is the possibility that SQL Server could create a bad plan and actually hit the data. I've never seen this specific scenario with literals, but I've had bad execution plans created before.

is "where (ParamID = #ParamID) OR (#ParamID = -1)" a good practice in sql selection

I used to write SQL statements like
select * from teacher where (TeacherID = @TeacherID) OR (@TeacherID = -1)
and pass a @TeacherID value of -1 to select all teachers.
Now I'm worried about the performance.
Can you tell me whether that is a good practice or a bad one?
Many thanks
If TeacherID is indexed and you are passing a value other than -1 as TeacherID to search for details of a specific teacher then this query will end up doing a full table scan rather than the potentially far more efficient option of seeking into the index to retrieve the details of the specific teacher...
... Unless you are on SQL 2008 SP1 CU5 and later and use the OPTION (RECOMPILE) hint. See Dynamic Search Conditions in T-SQL for the definitive article on the topic.
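For example (a sketch using the query from the question):
-- On SQL Server 2008 SP1 CU5 and later, OPTION (RECOMPILE) lets the optimizer see
-- the actual @TeacherID value and prune the redundant half of the OR at compile time.
SELECT *
FROM teacher
WHERE (TeacherID = @TeacherID) OR (@TeacherID = -1)
OPTION (RECOMPILE);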
We use this in a very limited fashion in stored procedures.
The problem is that the database engine isn't able to keep a good query plan for it. When dealing with a lot of data this can have a serious negative performance impact.
However, for smaller data sets (I'd say less than 1000 records, but that's a guess) it should be fine. You'll have to test in your particular environment.
If it's in a stored procedure, you might want to include something like a WITH RECOMPILE option so that the plan is regenerated on each execution. This adds (slightly) to the time for each run, but over several runs can actually reduce the average execution time. Also, this allows the database to inspect the actual query and "short circuit" the parts that aren't necessary on each call.
If you are directly creating your SQL and passing it through, then I'd suggest you make the part that builds your sql a little smarter so that it only includes the part of the where clause you actually need.
Another path you might consider is using UNION ALL queries as opposed to optional parameters. For example:
SELECT * FROM Teacher WHERE (TeacherID = @TeacherID)
UNION ALL
SELECT * FROM Teacher WHERE (@TeacherID = -1)
This actually accomplishes the exact same thing; however, the query plan is cacheable. We've used this method in a few places as well and saw performance improvements over using WITH RECOMPILE. We don't do this everywhere because some of our queries are extremely complicated and I'd rather have a performance hit than to complicate them further.
Ultimately though, you need to do a lot of testing.
There is a second part here that you should reconsider. SELECT *. It is ALWAYS preferable to actually name the columns you want returned and to make sure that you are only returning the ones you will actually need. Moving data across network boundaries is very expensive and you can generally get a fair amount of performance boost simply by specifying exactly what you want. In addition if what you need is very limited you can sometimes do covering indexes so that the database engine doesn't even have to touch the underlying tables to get the data you want.
If you're really worried about performance, you could break up your procedure to call on two different procs: one for all records, and one based on the parameter.
IF @TeacherID = -1
    EXEC proc_Get_All_Teachers
ELSE
    EXEC proc_Get_Teacher_By_TeacherID @TeacherID
Each one can be optimized individually.
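A sketch of what the two procedures might look like (the column list is hypothetical):
-- Each procedure gets its own simple, cacheable plan.
CREATE PROCEDURE proc_Get_All_Teachers
AS
    SELECT TeacherID, FirstName, LastName
    FROM Teacher;
GO

CREATE PROCEDURE proc_Get_Teacher_By_TeacherID
    @TeacherID INT
AS
    SELECT TeacherID, FirstName, LastName
    FROM Teacher
    WHERE TeacherID = @TeacherID;
GO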
It's your system; compare the performance. Consider optimizing for the most popular choice. If most users are going to select a single record, why hinder their performance just to accommodate the few that select all teachers (and who should still have a reasonable expectation of performance)?
I know a single select query is easier to maintain, but at some point ease of maintenance eventually gives way to performance.

SQL Query theory question - single-statement vs multi-statement queries

When I write SQL queries, I find myself often thinking that "there's no way to do this with a single query". When that happens I often turn to stored procedures or multi-statement table-valued functions that use temp tables (of one sort or another) and end up simply combining the results and returning the result table.
I'm wondering if anyone knows, simply as a matter of theory, whether it should be possible to write ANY query that returns a single result set as a single query (not multiple statements). Obviously, I'm ignoring relevant points such as code readability and maintainability, maybe even query performance/efficiency. This is more about theory - can it be done... and don't worry, I certainly don't plan to start forcing myself to write a single-statement query when multi-statement will better suit my purpose in all cases, but it might make me think twice or a little bit longer on whether there is a viable way to get the result from a single query.
I guess a few parameters are in order - I'm thinking of a relational database (such as MS SQL) with tables that follow common best practices (such as all tables having a primary key and so forth).
Note: in order to win 'Accepted Answer' on this, you'll need to provide a definitive proof (reference to web material or something similar.)
I believe it is possible. I've worked with very difficult, very long queries, and often it is possible to do it with a single query. But most of the time it's harder to maintain, so if you do it with a single query, make sure you comment your query carefully.
I've never encountered something that could not be done in a single query.
But sometimes it's best to do it in more than one query.
At least with a recent version of Oracle it is absolutely possible. It has a MODEL clause, which makes SQL Turing complete ( http://blog.schauderhaft.de/2009/06/18/building-a-turing-engine-in-oracle-sql-using-the-model-clause/ ). Of course this is all with the usual limitation that we don't really have unlimited time and memory.
For a normal SQL dialect without these abominations I don't think it is possible.
A task that I can't see how to implement in 'normal sql' would be:
Assume a table with a single column of type integer
For every row
'take the value at the current row and go that many rows back, fetch that value, go that many rows back, and continue until you fetch the same value twice consecutively and return that as the result.'
I can't prove it, but I believe the answer is a cautious yes - provided your database design is done properly. Usually being forced to write multiple statements to get a certain result is a sign that your schema may need some improvements.
I'd say "yes" but can't prove it. However, my main thought process:
Any SELECT should be a set-based operation
Your assumption is that you are dealing with mathematically correct sets (i.e. normalised correctly)
Set theory should guarantee it's possible
Other thoughts:
Multiple SELECT statements often load temp tables/table variables. These can be converted to derived tables or separated out into CTEs (see the sketch after the glossary below).
Any RBAR processing (for good or bad) can now be dealt with via CROSS/OUTER APPLY onto derived tables.
UDFs would be classed as "cheating" in this context, I feel, because they allow you to put a SELECT into another module rather than into your single one.
No writes are allowed in your "before" sequence of DML: writes would change state from one SELECT to the next.
Have you seen some of the code in our shop?
Edit, glossary
RBAR = Row By Agonising Row
CTE = Common Table Expression
UDF = User Defined Function
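As an illustration of the first point above (a sketch with hypothetical table names), a temp-table step folded back into a single statement as a CTE:
-- The multi-statement version would SELECT ... INTO a temp table first;
-- here the same intermediate result becomes a CTE inside one SELECT.
WITH RecentOrders AS (
    SELECT CustomerId, COUNT(*) AS OrderCount
    FROM Orders
    WHERE OrderDate >= '20240101'
    GROUP BY CustomerId
)
SELECT c.CustomerName, r.OrderCount
FROM Customers c
JOIN RecentOrders r ON r.CustomerId = c.CustomerId;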
Edit: APPLY: cheating?
SELECT *
FROM MyTable1 t1
CROSS APPLY
(
    SELECT * FROM MyTable2 t2
    WHERE t1.something = t2.something
) t2
In theory yes, if you use functions or a torturous maze of OUTER APPLYs or sub-queries; however, for readability and performance, we have always ended up going with temp tables and multi-statement stored procedures.
As someone above commented, this is usually a sign that your data structure is starting to smell; not that it's bad, but that maybe it's time to denormalise for performance reasons (happens to the best of us), or maybe put a denormalised querying layer in front of your normalised "real" data.

Performance of SQL "EXISTS" usage variants

Is there any difference in the performance of the following three SQL statements?
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y)
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y)
They all should work and return the same result set. But does it matter if the inner SELECT selects all fields of tableB, one field, or just a constant?
Is there any best practice when all statements behave equal?
The truth about EXISTS is that the SELECT list is not evaluated inside an EXISTS clause - you could try:
SELECT *
FROM tableA
WHERE EXISTS (SELECT 1/0
              FROM tableB
              WHERE tableA.x = tableB.y)
...and you should expect a divide-by-zero error, but you won't get one, because the expression is not evaluated. This is why my habit is to specify NULL in an EXISTS, to demonstrate that the SELECT list can be ignored:
SELECT *
FROM tableA
WHERE EXISTS (SELECT NULL
              FROM tableB
              WHERE tableA.x = tableB.y)
All that matters in an EXISTS clause is the FROM clause and everything after it - WHERE, GROUP BY, HAVING, etc.
This question wasn't marked with a database in mind, and it should be because vendors handle things differently -- so test, and check the explain/execution plans to confirm. It is possible that behavior changes between versions...
Definitely #1. It "looks" scary, but realize the optimizer will do the right thing, and it is expressive of intent. Also, there is a slight typo bonus should one accidentally think EXISTS but type IN. #2 is acceptable but not expressive. The third option stinks, in my not so humble opinion. It's too close to saying "if 'no value' exists" for comfort.
In general it's important not to be scared of writing code that merely looks inefficient if it provides other benefits and does not actually affect performance.
That is, the optimizer will almost always execute your complicated join/select/grouping wizardry and a simple EXISTS/subquery the same way.
After having given yourself kudos for cleverly rewriting that nasty OR out of a join, you will eventually realize the optimizer still used the same crappy execution plan to resolve the much easier to understand query with the embedded OR anyway.
The moral of the story is to know your platform's optimizer. Try different things and see what is actually being done, because rampant knee-jerk assumptions regarding 'decorative' query optimization are almost always incorrect and irrelevant, in my experience.
I realize this is an old post, but I thought it important to add clarity about why one might choose one format over another.
First, as others have pointed out, the database engine is supposed to ignore the Select clause. Every version of SQL Server has and does, Oracle does, MySQL does, and so on. In many, many moons of database development, I have only ever encountered one DBMS that did not properly ignore the Select clause: Microsoft Access. Specifically, older versions of MS Access (I can't speak to current versions).
Prior to my discovery of this "feature", I used to use Exists( Select *.... However, I discovered that MS Access would stream every column in the subquery across and then discard them (Select 1/0 also would not work). That convinced me to switch to Select 1. If even one DBMS was that stupid, another could exist.
Writing Exists( Select 1... is abundantly clear in conveying intent (it is frankly silly to claim "It's too close to saying "if 'no value' exists" for comfort") and makes the odds of a DBMS doing something stupid with the Select statement nearly impossible. Select Null would serve the same purpose but is simply more characters to write.
I switched to Exists( Select 1 to make absolutely sure the DBMS couldn't be stupid. However, that was many moons ago, and today I would expect that most developers would expect seeing Exists( Select * which will work exactly the same.
That said, I can provide one good reason for avoiding Exists(Select * even if your DBMS evaluates it properly. It is much easier to find and trounce all uses of Select * if you don't have to skip every instance of its use in an Exists clause.
In SQL Server at least,
The smallest amount of data that can be read from disk is a single "page" of disk space. As soon as the processor reads one record that satisfies the subquery's predicates, it can stop. The subquery is not executed as though it were standing on its own and then included in the outer query; it is executed as part of the complete query plan for the whole thing. So when used as a subquery, it really doesn't matter what is in the Select clause; nothing is "returned" to the outer query anyway, except a boolean to indicate whether a single record was found or not...
All three use the exact same execution plan
I always use [Select * From ... ] as I think it reads better, by not implying that I want something in particular returned from the subquery.
EDIT: From Dave Costa's comment... Oracle also uses the same execution plan for all three options.
This is one of those questions that verges on initiating some kind of holy war.
There's a fairly good discussion about it here.
I think the answer is probably to use the third option, but the speed increase is so infinitesimal it's really not worth worrying about. It's easily the kind of query that SQL Server can optimise internally anyway, so you may find that all options are equivalent.
EXISTS returns a boolean, not actual data; that said, best practice is to use #3.
Execution Plan.
Learn it, use it, love it
There is no possible way to guess, really.
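For what it's worth, a quick way to compare the three variants yourself in SQL Server (a sketch; looking at actual plans or STATISTICS IO output works just as well):
-- Capture the estimated plans for all three variants and compare them.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM tableA WHERE EXISTS (SELECT * FROM tableB WHERE tableA.x = tableB.y);
SELECT * FROM tableA WHERE EXISTS (SELECT y FROM tableB WHERE tableA.x = tableB.y);
SELECT * FROM tableA WHERE EXISTS (SELECT 1 FROM tableB WHERE tableA.x = tableB.y);
GO
SET SHOWPLAN_XML OFF;
GO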
In addition to what others have said, the practice of using SELECT 1 originated with old versions of Microsoft SQL Server (prior to 2005) - its query optimizer wasn't clever enough to avoid physically fetching fields from the table for SELECT *. No other DBMS, to my knowledge, has this deficiency.
The EXISTS tests for existence of rows, not what's in them, so other than some optimizer quirk similar to above, it doesn't really matter what's in the SELECT list.
The SELECT * seems to be most usual, but others are acceptable as well.
#3 should be the best one, as you don't need the returned data anyway. Bringing the fields back would only add extra overhead.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space; joining tables is one of those things, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing are the only things that will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
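A sketch of how the branch could then be handled from the single result set (the IS NULL test replaces the foreign_key > 0 check; the names are the question's placeholders):
Select a.blah1, a.blah2, b.something,
       Case When b.key Is Null Then 1 Else 0 End As no_related_row
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.Key = bar;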
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need, having one single SQL statement would give better performance 99% of the time. I'm not sure whether the connection is being created dynamically in this case or not, but if so, doing that is expensive. Even if the process is reusing existing connections, the DBMS isn't able to optimize the queries the best way and isn't really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved via the foreign key is a large amount and is only needed in some cases. But in the sample you describe it just grabs it if it exists, so this is not the case, and therefore nothing is gained in performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query with so many joins in it that it would take SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables), broke the query down into a number of smaller single-select statements, and constructed the final result set in this manner.
This change dramatically improved the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would lead to better performance, as the SQL server (which sometimes isn't even in the same location) just needs to handle one request. If you use multiple SQL queries then you introduce a lot of overhead: executing more CPU instructions, sending a second query to the server, creating a second thread on the server, executing possibly more CPU instructions on the server, destroying a second thread on the server, and sending the second results back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records; that is the biggest performance hit a database takes (in my limited experience).
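For example, a covering index might look something like this (SQL Server syntax; the table and column names are hypothetical, and other engines use different mechanisms):
-- Queries that filter on CustomerId and select only OrderDate and TotalAmount
-- can be answered entirely from this index, without touching the base table.
CREATE INDEX IX_Orders_CustomerId_Covering
    ON Orders (CustomerId)
    INCLUDE (OrderDate, TotalAmount);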
You should always try to minimize the number of queries to the database when you can. Your example is perfect for only one query. That way you will later be able to cache more easily, or to handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only one each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
A join scans both tables and loops to match records from the first table in the second table. A simple SELECT query will work faster in many cases, as it only has to take care of the primary/unique key (if one exists) to search the data internally.