Can this SQL Query be optimized - sql

I have the SQL query below in a procedure. Can it be optimized further for best results?
SELECT DISTINCT
[PUBLICATION_ID] as n
,[URL] as u
FROM [LINK_INFO]
WHERE Component_Template_Priority > 0
AND PUBLICATION_ID NOT IN (232,481)
ORDER BY URL
Please suggest whether using NOT EXISTS would be a better approach here.
Thanks

It is possible to use NOT EXISTS. Going just from the code above you probably shouldn't, but it's technically possible. As a general rule, a very small, quickly resolved set (two literals definitely qualifies) will perform better as a NOT IN than as a NOT EXISTS. NOT EXISTS wins when NOT IN has to do so many comparisons against each row that the correlated subquery for NOT EXISTS (which stops at the first match) resolves more quickly.
This assumes that the comparison set cannot include NULL. Otherwise NOT IN and NOT EXISTS do not return the same results: NOT IN (NULL, ...) can never evaluate to TRUE (it is FALSE on a match and NULL otherwise), so it returns no rows, whereas NOT EXISTS excludes only the rows for which it finds a match, and NULL never generates a match, so it never excludes a row.
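The NULL behaviour described above can be reproduced with a small sketch. It assumes SQLite (through Python's sqlite3 module) rather than the asker's database, and the tables link_info and excluded with their sample rows are made up for illustration; the three-valued logic it exercises is standard SQL.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE link_info (publication_id INTEGER, url TEXT);
    INSERT INTO link_info VALUES (100, 'a'), (232, 'b'), (481, 'c');

    -- Hypothetical exclusion list that happens to contain a NULL.
    CREATE TABLE excluded (publication_id INTEGER);
    INSERT INTO excluded VALUES (232), (NULL);
""")

# NOT IN: the NULL in the set makes the predicate FALSE or NULL, never TRUE.
not_in_rows = conn.execute("""
    SELECT url FROM link_info
    WHERE publication_id NOT IN (SELECT publication_id FROM excluded)
    ORDER BY url
""").fetchall()

# NOT EXISTS: only rows with an actual match are excluded.
not_exists_rows = conn.execute("""
    SELECT url FROM link_info AS l
    WHERE NOT EXISTS (
        SELECT 1 FROM excluded AS e
        WHERE e.publication_id = l.publication_id
    )
    ORDER BY url
""").fetchall()

print(not_in_rows)      # no rows survive the NOT IN filter
print(not_exists_rows)  # rows for publication 100 and 481 survive
```

With literals such as (232,481) in the original query there is no NULL in the set, so the two forms agree there.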
A third way to compare two sets for mismatches is with an OUTER JOIN. I don't see a reason to go into that from what we've got so far, so I'll let that one go for now.
A definitive answer would depend on a lot of variables (hence the comments on your question)...
What is the cardinality (number of different values) of the publication_id column?
Is there an index on the column?
How many rows are in the table?
Where did you get the values in your NOT IN clause?
Will they always be literals or are they going to come from parameters or a subquery?
... just to name a few. Of course, the best way to find out is by writing the query different ways and looking at execution times and query plans.
EDIT: Another option is set operators such as EXCEPT. Again, probably overkill to go into that.

Related

SQL Exist Predicate Explanation

Can someone please explain to me how EXISTS works in SQL?
I am not entirely sure my result is working the way I need it to.
I just need an example using 3 different queries.
Please be very specific and explain it to me like you are talking to a 5 year old :)
Edit: Here is what I do not understand and need someone to make clear:
The subquery will generally only be executed long enough to determine
whether at least one row is returned, not all the way to completion.
It is unwise to write a subquery that has side effects (such as
calling sequence functions); whether the side effects occur might be
unpredictable.
Since the result depends only on whether any rows are returned, and
not on the contents of those rows, the output list of the subquery is
normally unimportant. A common coding convention is to write all
EXISTS tests in the form EXISTS(SELECT 1 WHERE ...). There are
exceptions to this rule however, such as subqueries that use
INTERSECT.
I am trying to create a query that selects values that do not exist in another query. Based on what I just quoted above, is it not a good idea to use EXISTS?
Edit Part 2:
If I am trying to make sure a value doesn't exist in column 1 OR column 2 OR column 3 OR column 4, should I use ALL or EXISTS?
My result looks weird with EXISTS, so I just wanted to make sure I understand correctly how it works.
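As a rough illustration of the quoted passage, the sketch below selects values that do not appear in any of four columns of another table, and runs the same NOT EXISTS test with two different select lists to show that the list does not affect the result. It assumes SQLite via Python's sqlite3; the tables candidates and other, their columns, and the helper missing_values are hypothetical names, not taken from the question.

```python
import sqlite3

# Assumption: in-memory SQLite; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE candidates (val INTEGER);
    INSERT INTO candidates VALUES (1), (2), (3), (4);

    CREATE TABLE other (c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER);
    INSERT INTO other VALUES (1, 10, 11, 12), (20, 21, 3, 22);
""")

def missing_values(select_list):
    """Return candidate values found in none of other.c1..c4.

    select_list is what the subquery selects; EXISTS only cares whether
    a row comes back, not what the row contains.
    """
    sql = f"""
        SELECT val FROM candidates AS c
        WHERE NOT EXISTS (
            SELECT {select_list} FROM other AS o
            WHERE c.val IN (o.c1, o.c2, o.c3, o.c4)
        )
        ORDER BY val
    """
    return [row[0] for row in conn.execute(sql)]

with_constant = missing_values("1")  # the SELECT 1 convention
with_star = missing_values("*")      # same test, different select list

print(with_constant, with_star)
```

Here 1 and 3 each appear in some column of other, so only 2 and 4 are reported as missing, whichever select list is used.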

Determine if a SQL Insert/Update statement affects the result from a stored Select Statement

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But, given any possible Insert/Update/Whatever statement, as well as any possible starting Select (but we know the Selects and can analyze them all day) I'd like to determine if it would affect the result of the Select Statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but can it be worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to warrant an NP Complete answer.
My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).
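The trigger idea above might look roughly like the following. This is a sketch under several assumptions: it uses SQLite via Python's sqlite3 instead of the getdate()-style database the answer implies, it swaps the "last update timestamp" for an integer change counter so the example is deterministic, and the names query_watch, a_gt_10, and change_count are invented. It is also conservative: any update touching a row that matches A > 10 before or after the change is counted, even if the selected values end up the same.

```python
import sqlite3

# Assumption: in-memory SQLite; a counter replaces the timestamp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_A (A INTEGER, B INTEGER);

    -- One row per watched SELECT (hypothetical bookkeeping table).
    CREATE TABLE query_watch (
        query_name   TEXT PRIMARY KEY,
        change_count INTEGER NOT NULL
    );
    INSERT INTO query_watch VALUES ('a_gt_10', 0);

    CREATE TRIGGER table_A_ins AFTER INSERT ON table_A
    WHEN NEW.A > 10
    BEGIN
        UPDATE query_watch SET change_count = change_count + 1
        WHERE query_name = 'a_gt_10';
    END;

    CREATE TRIGGER table_A_upd AFTER UPDATE ON table_A
    WHEN OLD.A > 10 OR NEW.A > 10
    BEGIN
        UPDATE query_watch SET change_count = change_count + 1
        WHERE query_name = 'a_gt_10';
    END;

    CREATE TRIGGER table_A_del AFTER DELETE ON table_A
    WHEN OLD.A > 10
    BEGIN
        UPDATE query_watch SET change_count = change_count + 1
        WHERE query_name = 'a_gt_10';
    END;
""")

def change_count():
    """Read the counter for: SELECT A, B FROM table_A WHERE A > 10."""
    row = conn.execute(
        "SELECT change_count FROM query_watch WHERE query_name = 'a_gt_10'"
    ).fetchone()
    return row[0]

conn.execute("INSERT INTO table_A (A, B) VALUES (5, 1)")    # outside A > 10
after_irrelevant_insert = change_count()

conn.execute("INSERT INTO table_A (A, B) VALUES (12, 15)")  # inside A > 10
after_relevant_insert = change_count()

conn.execute("UPDATE table_A SET A = 3 WHERE A = 12")       # row leaves the result
after_update = change_count()

conn.execute("DELETE FROM table_A WHERE A = 5")             # never in the result
after_delete = change_count()
```

A caller would remember the counter value from its last run of the SELECT and re-run the query only when the stored value differs.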

SQL: IN vs EXISTS

I read that normally you should use EXISTS when the results of the subquery are large, and IN when the subquery results are small.
But it would seem to me that it's also relevant if a subquery has to be re-evaluated for each row, or if it can be evaluated once for the entire query.
Consider the following example of two equivalent queries:
SELECT * FROM t1
WHERE attr IN
(SELECT attr FROM t2
WHERE attr2 = ?);
SELECT * FROM t1
WHERE EXISTS
(SELECT * FROM t2
WHERE t1.attr = t2.attr
AND attr2 = ?);
The former subquery can be evaluated once for the entire query, the latter has to be evaluated for each row.
Assume that the results of the subquery are very large. Which would be the best way to write this?
This is a good question, especially as in Oracle you can convert every EXISTS clause into an IN clause and vice versa, because Oracle's IN clause can deal with tuples (where (a,b,c) in (select x,y,z from ...)), which most other dbms cannot.
And your reasoning is good. Yes, with the IN clause you propose loading all the subquery's data once instead of looking up the records in a loop. However, this is only partly true, because:
As good as it seems to get all subquery data selected just once, the outer query must loop through the resulting array for every record. This can be quite slow, because it's just an array. If Oracle looks up data in a table instead there are often indexes to help it, so the nested loop with repeated table lookups is eventually faster.
Oracle's optimizer re-writes queries. So it can come to the same execution plan for the two statements or even get to quite unexpected plans. You never know ;-)
Oracle might decide not to loop at all. It may choose a hash join instead, which works completely differently and is usually very effective.
Having said this, Oracle's optimizer should notice that the two statements are exactly the same actually and should generate the same execution plan. But experience shows that the optimizer sometimes doesn't notice, and quite often the optimizer does better with the EXISTS clause for whatever reason. (Not as much difference as in MySQL, but still, EXISTS seems preferable over IN in Oracle, too.)
So as to your question "Assume that the results of the subquery are very large. Which would be the best way to write this?", it is unlikely for the IN clause to be faster than the EXISTS clause.
I often like the IN clause better for its simplicity and mostly find it a bit more readable. But when it comes to performance, it is sometimes better to use EXISTS (or even outer joins for that matter).
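One way to check this on your own data is to run both forms and look at the plan the optimizer picks for each. The sketch below does that with SQLite via Python's sqlite3, purely as an assumption for the sake of a runnable example; Oracle's optimizer and plan output differ, and the sample rows, the index t2_attr, and the helper plan are invented.

```python
import sqlite3

# Assumption: in-memory SQLite, not Oracle; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (attr INTEGER);
    CREATE TABLE t2 (attr INTEGER, attr2 INTEGER);
    CREATE INDEX t2_attr ON t2 (attr);

    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (1, 7), (2, 8), (3, 7), (3, 7);
""")

IN_QUERY = """
    SELECT attr FROM t1
    WHERE attr IN (SELECT attr FROM t2 WHERE attr2 = ?)
    ORDER BY attr
"""

EXISTS_QUERY = """
    SELECT attr FROM t1
    WHERE EXISTS (SELECT * FROM t2
                  WHERE t1.attr = t2.attr
                    AND attr2 = ?)
    ORDER BY attr
"""

in_rows = conn.execute(IN_QUERY, (7,)).fetchall()
exists_rows = conn.execute(EXISTS_QUERY, (7,)).fetchall()

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (7,))]

print(in_rows == exists_rows)  # both forms return the same rows
for label, sql in (("IN", IN_QUERY), ("EXISTS", EXISTS_QUERY)):
    print(label, plan(sql))    # plan text varies by engine and version
```

The plan text is printed rather than compared because it depends on the engine, its version, and the table statistics; the point is only that the two spellings return the same rows while the optimizer is free to execute them differently.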

SQL SELECT clause tuning

Why does the sql query execute faster if I use the actual column names in the SELECT statement instead of SELECT *?
A noticeable difference at all seems odd... since I'd expect it to be a very minuscule difference and am intrigued to test it out.
Any difference in a statement using SELECT * might be due to the extra time it takes to find out what all of the column names are.
Depending on the query, it has to work out whether the names are unique, what they all are, and so on, whereas if you specify them, that is all done for it.
Generally, the more you tell it, the less it has to calculate. This is the same for many systems.
It's possible for performance to be much better when you select specific columns rather than SELECT *. One good reason: if the columns you select are all covered by an index, the optimizer can make a plan that reads the data from the index alone instead of from the actual table. But check the plan to be sure.
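The index-only effect can be observed in a query plan. The following sketch assumes SQLite via Python's sqlite3, whose EXPLAIN QUERY PLAN output labels such access as a covering index; other databases report it differently. The table people, its columns, and the index people_city_name are hypothetical.

```python
import sqlite3

# Assumption: in-memory SQLite; schema and index are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (
        id   INTEGER PRIMARY KEY,
        name TEXT,
        city TEXT,
        bio  TEXT
    );
    CREATE INDEX people_city_name ON people (city, name);
""")

def plan(sql):
    """Join the detail column of EXPLAIN QUERY PLAN into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("Oslo",))
    return " | ".join(row[-1] for row in rows)

# Every selected column lives in the index, so the table need not be read.
named_plan = plan("SELECT name FROM people WHERE city = ?")

# SELECT * also needs bio, which is not in the index.
star_plan = plan("SELECT * FROM people WHERE city = ?")

print(named_plan)
print(star_plan)
```

In SQLite the first plan mentions a covering index and the second does not, which matches the explanation above: the named-column query is answered from the index alone.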

Where should I do the rowcount when checking for existence: sql or php?

In the case when I want to check, if a certain entry in the database exists I have two options.
I can create an sql query using COUNT() and then check, if the result is >0...
...or I can just retrieve the record(s) and then count the number of rows in the returned rowset. For example with $result->num_rows;
What's better/faster? in mysql? in general?
YMMV, but I suspect that if you are only checking for existence, and don't need to use the retrieved data in any way, the COUNT() query will be faster. How much faster will depend on how much data.
The fastest is probably asking the database if something exists:
SELECT EXISTS ([your query here])
SELECT 1
FROM (SELECT 1) t
WHERE EXISTS( SELECT * FROM foo WHERE id = 42 )
Just tested, works fine on MySQL v5
COUNT(*) is generally less efficient if:
you can have duplicates (because the DBMS will have to exhaustively search all of the records/indexes to give you the exact answer) or
have NULL entries (for the same reason)
If you are COUNT'ing based on a WHERE clause that is guaranteed to produce a single record (or 0) and the DBMS knows this (based upon UNIQUE indexes), then it ought to be just as efficient. But, it is unlikely that you will always have this condition. Also, the DBMS may not always pick up on this depending on the version and DBMS.
Counting in the application (when you don't need the row) is almost always guaranteed to be slower/worse because:
You have to send data to the client, the client has to buffer it and do some work
You may bump out things in the DBMS MRU/LRU data cache that are more important
Your DBMS will (generally) have to do more disk I/O to fetch record data that you will never use
You have more network activity
Of course, if you want to DO something with the row if it exists, then it is definitely faster/best to simply try and fetch the row to begin with!
If all you are doing is checking for existence, then
Select count(*) ...
But if you will retrieve the data if it exists, then just get the data and check it in PHP, otherwise you'll have two calls.
For me, it belongs in the database.
Doing a COUNT(1) is faster than $result->num_rows, because with $result->num_rows you perform two operations: a select and then a count. If the select itself does the count, you get the result faster.
Except if you also want the information from the db.
If you want raw speed, benchmark! In addition to the methods others have suggested:
SELECT 1 FROM table_name WHERE ... LIMIT 1
may be faster due to avoiding the subselect. Benchmark it.
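The three database-side checks mentioned so far (EXISTS, COUNT, and LIMIT 1) can be put side by side as below. This is only a sketch: it assumes SQLite via Python's sqlite3 instead of MySQL and PHP, reuses the table name foo and id 42 from the earlier answer, and the function names are invented. It shows that the checks agree, not which one is fastest; that still needs a benchmark on the real data.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO foo VALUES (42, 'x');
""")

def exists_via_exists(row_id):
    """Ask the database directly; it returns 1 or 0."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM foo WHERE id = ?)", (row_id,)
    ).fetchone()
    return bool(row[0])

def exists_via_count(row_id):
    """Count matching rows and compare against zero."""
    row = conn.execute(
        "SELECT COUNT(*) FROM foo WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0] > 0

def exists_via_limit(row_id):
    """Fetch at most one constant row; no row back means no match."""
    row = conn.execute(
        "SELECT 1 FROM foo WHERE id = ? LIMIT 1", (row_id,)
    ).fetchone()
    return row is not None

checks = (exists_via_exists, exists_via_count, exists_via_limit)
found = [check(42) for check in checks]
missing = [check(7) for check in checks]
print(found, missing)
```

None of the three transfers the row's payload to the client, which is the main cost the answers above attribute to counting rows in the application.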
SELECT COUNT(*) FROM table
is the best choice, this operation is extremely fast both on small tables and large tables. While it's possible that
SELECT id FROM table
is faster on small tables, the difference in speed will be microscopic. But if you have a large table, this operation can be very slow.
Therefore, your best bet is to always choose to COUNT(*) the table (and it's faster to do * than it is to pick a specific column) as overall, it will be the fastest operation.
I would definitely do it in the PHP to decrease load on the database.
To get both a count and the returned rows in SQL, you would have to run two queries: a COUNT and then a SELECT.
The PHP way gives you everything you need in one result object.