Comparing a column with itself in the WHERE clause of an Oracle SELECT - sql

I have a multi-table SELECT query that compares column values with themselves, like below:
SELECT * FROM table1 t1, table2 t2
WHERE t1.col1 = t2.col1 -- different tables, so OK
AND t1.col1 = t1.col1   -- same table??
AND t2.col1 = t2.col1   -- same table??
This seems redundant to me.
My question is: will removing them have any impact on logic/performance?
Thanks in advance.

This does seem redundant; its only effect is removing rows that have NULL values in these columns. Make sure the columns are NOT NULL before removing those clauses.
If the columns are nullable, you can safely replace those lines with the following (easier to read, easier to maintain):
AND t1.col1 IS NOT NULL
AND t2.col1 IS NOT NULL
Update following Jeffrey's comment
You're absolutely right, I don't know how I didn't see it myself: the join condition t1.col1=t2.col1 implies that only rows whose join columns are not null will be considered. The clauses tx.col1=tx.col1 are therefore completely redundant and can be safely removed.
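A quick way to see the NULL behaviour mentioned above for yourself (a throwaway sketch on DUAL, reusing the col1 name from the question):
WITH t AS (
  SELECT 1 AS col1 FROM dual
  UNION ALL
  SELECT NULL FROM dual
)
SELECT * FROM t WHERE col1 = col1; -- returns only the row with col1 = 1; the NULL row is filtered out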

Don't remove them until you understand the impact. If, as others are pointing out, they have no effect on the query and are probably optimised out, there's no harm in leaving them there, but there may be harm in removing them.
Don't try to fix something that's working until you're damn sure you're not breaking something else.
The reason I mention this is because we inherited a legacy reporting application that had exactly this construct, along the lines of:
where id = id
And, being a sensible fellow, I ditched it, only to discover that the database engine wasn't the only thing using the query.
It first went through a pre-processor which extracted every column that existed in a where clause and ensured they were indexed. Basically an auto-tuning database.
Well, imagine our surprise on the next iteration when the database slowed to a fraction of its former speed when users were doing ad-hoc queries on the id field :-)
Turns out this was a kludge put in by the previous support team to ensure common ad-hoc queries were using indexed columns as well, even though none of our standard queries did so.
So, I'm not saying you can't do it, just suggesting that it might be a good idea to understand why it was put in first.

Yes, the conditions are obviously redundant since they're identical!
SELECT * FROM table1 t1,table2 t2
WHERE t1.col1=t2.col1
But you do need at least one of them. Otherwise, you'd have a cartesian join on your hands: each row from table1 will be joined to every row in table2. If table1 had 100 rows, and table2 had 1,000 rows, the resultant query would return 100,000 results.
SELECT * FROM table1 t1,table2 t2 --warning!
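To see the blow-up for yourself, a quick count makes it obvious (just a sketch; the row counts are whatever your tables actually hold):
SELECT COUNT(*) FROM table1 t1, table2 t2; -- rows(table1) * rows(table2), e.g. 100 * 1,000 = 100,000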

Related

Querying Table vs Querying Subquery of Same Table

I'm working with some very old legacy code, and I've seen a few queries that are structured like this
SELECT
FieldA,
FieldB,
FieldC
FROM
(
SELECT * FROM TABLE1
)
LEFT JOIN TABLE2 ON...
Is there any advantage to writing a query this way?
This is in Oracle.
There would seem to be no advantage to using a subquery like this. The reason may be a historical relic of the code.
Perhaps once upon a time, there was a more complicated query there. The query was replaced by a table/view, and the author simply left the original structure.
Similarly, once upon a time, perhaps a column needed to be calculated (say for the outer query or select). This column was then included in the table/view, but the structure remained.
I'm pretty sure that Oracle is smart enough to ignore the subquery when optimizing the query. Not all databases are that smart, but you might want to clean up the code. At the very least, such a subquery looks awkward.
As a basic good practice in SQL, you should not code a full scan of a table (SELECT * FROM table without a WHERE clause) unless necessary, for performance reasons.
In this case, it's not necessary: the same result can be obtained with:
SELECT
FieldA,
FieldB,
FieldC
FROM
TABLE1 LEFT JOIN TABLE2 ON...

Inner joins involving three tables

I have a SELECT statement that has three inner joins involving two tables.
Apart from creating indexes on the columns referenced in the ON and WHERE clauses, are there other things I can do to optimize the joins, such as rewriting the query?
SELECT
...
FROM
my_table AS t1
INNER JOIN
my_table AS t2
ON
t2.id = t1.id
INNER JOIN
other_table AS t3
ON
t2.id = t3.id
WHERE
...
You can tune the PostgreSQL config, run VACUUM ANALYZE, and apply all the general optimizations.
If this is not enough and you can spend a few days, you may write code to create a materialized view as described in the PostgreSQL wiki.
You likely have an error in your example: because you're selecting the same record from my_table twice, you could really just do:
SELECT
...
FROM
my_table AS t1
INNER JOIN
other_table AS t3
ON
t1.id = t3.id
WHERE
...
Because in your example code, t1 will always be t2.
But let's assume you mean ON t2.idX = t1.id; then, to answer your question, you can't get much better performance than what you have. You could index the columns, or go further and define them as foreign key relationships (which wouldn't do much in terms of performance benefits compared with simply indexing them).
You might instead like to look at restricting your WHERE clause; perhaps that is where your indexing would be as (if not more) beneficial.
You could write your query using WHERE EXISTS (if you don't need to select data from all three tables) rather than INNER JOINs, but the performance will be almost identical (except when this is itself inside a nested query), as it still needs to find the records.
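For illustration, the EXISTS form of the simplified two-table query above might look like this (a sketch; it assumes you only need columns from my_table):
SELECT t1.*
FROM my_table AS t1
WHERE EXISTS (
    SELECT 1
    FROM other_table AS t3
    WHERE t3.id = t1.id
);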
In PostgreSQL, most of your tuning will not be on the actual query. The goal is to help the optimizer figure out how best to execute your declarative query, not to specify how to do it from your program. That isn't to say that queries can't sometimes be optimized themselves, or that they might not need to be, but this one doesn't have any of the problem areas that I am aware of, unless you are retrieving a lot more records than you need to (which I have seen happen occasionally).
The first thing to do is to run VACUUM ANALYZE to make sure you have up-to-date statistics. Then use EXPLAIN ANALYZE to compare expected query performance to actual. From that point, we'd look at indexes etc. There isn't anything in this query that needs to be optimized at the query level. However, without seeing the actual filters in your WHERE clause and the actual output of EXPLAIN ANALYZE, there isn't much more that can be suggested.
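As a concrete starting point for that workflow (a sketch; the ellipses stand for the original select list and filters):
VACUUM ANALYZE my_table;
VACUUM ANALYZE other_table;
EXPLAIN ANALYZE
SELECT ...
FROM my_table AS t1
INNER JOIN other_table AS t3 ON t1.id = t3.id
WHERE ...;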
Typically you tweak the db to choose a better query plan rather than specifying it in your query. That's usually the PostgreSQL way. The comment is of course qualified by noting there are exceptions.

How can I use "FOR UPDATE" with a JOIN on Oracle?

The answer to another SO question was to use this SQL query:
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
JOIN (SELECT DISTINCT Id
FROM table1, table2, table3
WHERE ...) T1 ON o.id = T1.Id
Now I wonder how I can use this statement together with the keyword FOR UPDATE. If I simply append it to the query, Oracle will tell me:
ORA-02014: cannot select FOR UPDATE from view
Do I have to modify the query or is there a trick to do this with Oracle?
With MySql the statement works fine.
try:
select .....
from <choose your table>
where id in (<your join query here>) for UPDATE;
EDIT: that might seem a bit counter-intuitive bearing in mind the question you linked to (which asked how to dispense with an IN), but it may still provide benefit if your join returns a restricted set. However, there is no workaround: the Oracle exception is pretty self-explanatory; Oracle doesn't know which rows to lock because of the DISTINCT. You could either leave out the DISTINCT or define everything in a view and then update that, if you wanted to, without the explicit lock: http://www.dba-oracle.com/t_ora_02014_cannot_select_for_update.htm
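Applied to the query from the question, that pattern would look roughly like this (the inner WHERE is left as in the original):
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
WHERE o.Id IN (SELECT DISTINCT Id
               FROM table1, table2, table3
               WHERE ...)
FOR UPDATE;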
It may depend what you want to update. You can do
... FOR UPDATE OF o.attrib1
to tell it you're only interested in updating data from the main table, which seems likely to be the case; this means it will only try to lock that table and not worry about the implicit view in the join. (And you can still update multiple columns within that table; naming one column still locks the whole row, though it will be clearer if you specify all the columns you want to update in the FOR UPDATE OF clause.)
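On the original query, that suggestion would come out roughly as follows (an untested sketch):
SELECT o.Id, o.attrib1, o.attrib2
FROM table1 o
JOIN (SELECT DISTINCT Id
      FROM table1, table2, table3
      WHERE ...) T1 ON o.Id = T1.Id
FOR UPDATE OF o.attrib1;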
Don't know if that will work with MySQL though, which brings us back to Mark Byers' point.

SQL Server - Query Short-Circuiting?

Do T-SQL queries in SQL Server support short-circuiting?
For instance, I have a situation where I have two databases and I'm comparing data between two tables to match and copy some info across. In one table, the "ID" field will always have leading zeros (such as "000000001234"), and in the other table, the ID field may or may not have leading zeros (might be "000000001234" or "1234").
So my query to match the two is something like:
select * from table1 where table1.ID LIKE '%1234'
To speed things up, I'm thinking of adding an OR before the LIKE that just says:
table1.ID = table2.ID
to handle the case where both IDs have the padded zeros and are equal.
Will doing so speed up the query by matching items on the "=" and not evaluating the LIKE for every single row (will it short circuit and skip the LIKE)?
SQL Server does NOT short-circuit WHERE conditions.
It can't, since it's a cost-based system: How SQL Server short-circuits WHERE condition evaluation.
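A common way this bites people (a sketch; whether it actually errors depends on the plan SQL Server picks, which is exactly the point):
-- Even though the ISNUMERIC test is written first, SQL Server is free to
-- evaluate the CAST first, so this can still fail with a conversion error.
SELECT *
FROM table1
WHERE ISNUMERIC(table1.ID) = 1
  AND CAST(table1.ID AS INT) = 1234;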
You could add a computed column to the table. Then, index the computed column and use that column in the join.
Ex:
Alter Table Table1 Add PaddedId As Right('000000000000' + Id, 12)
Create Index idx_WhateverIndexNameYouWant On Table1(PaddedId)
Then your query would be...
select * from table1 where table1.PaddedID ='000000001234'
This will use the index you just created to quickly return the row.
You want to make sure that at least one of the tables is using its actual data type for the IDs and that it can use an index seek if possible. Which one should be converted to the other depends on the selectivity of your query and the rate of matches, though. If you know that you have to scan through the entire first table, then you can't use a seek anyway, and you should convert that ID to the data type of the other table.
To make sure that you can use indexes, also avoid LIKE. As an example, it's much better to have:
WHERE
T1.ID = CAST(T2.ID AS VARCHAR) OR
T1.ID = RIGHT('0000000000' + CAST(T2.ID AS VARCHAR), 10)
than:
WHERE
T1.ID LIKE '%' + CAST(T2.ID AS VARCHAR)
As Steven A. Lowe mentioned, the second query might be inaccurate as well.
If you are going to be using all of the rows from T1 though (in other words a LEFT OUTER JOIN to T2) then you might be better off with:
WHERE
CAST(T1.ID AS INT) = T2.ID
Do some query plans with each method if you're not sure and see what works best.
The absolute best route to go though is as others have suggested and change the data type of the tables to match if that's at all possible. Even if you can't do it before this project is due, put it on your "to do" list for the near future.
How about:
table1WithZero.ID = REPLICATE('0', 12 - LEN(table2.ID)) + table2.ID
In this case, it should be able to use the index on table1.
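Spelled out as a join, that would be something like this (a sketch; it assumes table1 is the one whose IDs carry the leading zeros, the 12-character width comes from the examples above, and no ID in table2 is longer than 12 characters):
SELECT t1.*
FROM table1 t1
JOIN table2 t2
  ON t1.ID = REPLICATE('0', 12 - LEN(t2.ID)) + t2.ID;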
Just in case it's useful, as the linked page in Mladen Prajdic's answer explains, CASE clauses are short-circuit evaluated.
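For example, the classic divide-by-zero guard relies on that behaviour (a minimal sketch):
DECLARE @n INT = 10, @d INT = 0;
-- The ELSE branch is not evaluated when the WHEN condition is true,
-- so this returns NULL instead of raising a divide-by-zero error.
SELECT CASE WHEN @d = 0 THEN NULL ELSE @n / @d END;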
If the ID is purely numeric (as in your example), I would recommend (if possible) changing that field to a number type instead. If the database is already in use it might be hard to change the type, though.
fix the database to be consistent
select * from table1 where table1.ID LIKE '%1234'
will match '1234', '01234', '00000000001234', but also '999991234'. Using LIKE pretty much guarantees an index scan (assuming table1.ID is indexed!). Cleaning up the data will improve performance significantly.
if cleaning up the data is not possible, write a user-defined function (UDF) to strip off leading zeros, e.g.
select * from table1 where dbo.udfStripLeadingZeros(table1.ID) = '1234'
this may not improve performance (since the function will have to run for each row) but it will eliminate false matches and make the intent of the query more obvious
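A possible implementation of such a UDF, if you go that route (an illustrative sketch, not tuned):
CREATE FUNCTION dbo.udfStripLeadingZeros (@s VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    -- PATINDEX finds the first non-zero character; appending '.' handles
    -- an all-zero input (which comes back as an empty string here).
    RETURN SUBSTRING(@s, PATINDEX('%[^0]%', @s + '.'), LEN(@s));
END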
EDIT: Tom H's suggestion to CAST to an integer would be best, if that is possible.

Is it better to join two fields together, or to compare them each to the same constant?

For example which is better:
select * from t1, t2 where t1.country='US' and t2.country=t1.country and t1.id=t2.id
or
select * from t1, t2 where t1.country='US' and t2.country='US' and t1.id=t2.id
better as in less work for the database, faster results.
Note: Sybase, and there's an index on both tables of country+id.
I don't think there is a global answer to your question. It depends on the specific query. You would have to compare the execution plans for the two queries to see if there are significant differences.
I personally prefer the first form:
select * from t1, t2 where t1.country='US' and t2.country=t1.country and t1.id=t2.id
because if I want to change the literal there is only one change needed.
There are a lot of factors at play here that you've left out. What kind of database is it? Are those tables indexed? How are they indexed? How large are those tables?
(Premature optimization is the root of all evil!)
It could be that if "t1.id" and "t2.id" are indexed, the database engine joins them together based on those fields, and then uses the rest of the WHERE clause to filter out rows.
They could be indexed but incredibly small tables, and both fit in a page of memory. In which case the database engine might just do a full scan of both rather than bother loading up the index.
You just don't know, really, until you try.
I had a situation similar to this and this was the solution I resorted to:
Select *
FROM t1
INNER JOIN t2 ON t1.id = t2.id AND t1.country = t2.country AND t1.country = 'US'
I noticed that my query ran faster in this scenario. I made the assumption that joining on the constant saved the engine time, because the WHERE clause executes at the end. Joining and then filtering by 'US' means you still pulled all the other countries from your table and then had to filter out the ones you wanted. This method pulls fewer records in the end, because it will only find US records.
The correct answer probably depends on your SQL engine. For MS SQL Server, the first approach is clearly better, because the statistical optimizer is given an additional clue which may help it find a better (more optimal) resolution path.
I think it depends on the library and database engine. Each one will execute the SQL differently, and there's no telling which one will be optimized.
I'd lean towards only including your constant in the code once. There might be a performance advantage one way or the other, but it's probably so small the maintenance advantage of only one parameter trumps it.
If you ever wish to make the query more general, perhaps substituting a parameter for the target country, then I'd go with your first example, as it requires only a single change. That's less to worry about getting wrong in the future.
I suspect this is going to depend on the tables, the data and the meta-data. I expect I could work up examples that would show results both ways - benchmark!
The expressions should be equivalent with any decent optimizer, but it depends on which database you're using and what indexes are defined on your table.
I would suggest using the EXPLAIN feature to figure out which of the expressions is the most optimal.
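On Sybase, the rough equivalent of EXPLAIN is the showplan option (a sketch, assuming Sybase ASE and an isql-style client):
SET SHOWPLAN ON
GO
select * from t1, t2 where t1.country='US' and t2.country=t1.country and t1.id=t2.id
GO
SET SHOWPLAN OFF
GO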
I think a better SQL would be:
select * from t1, t2 where t1.id=t2.id
and t1.country ='US'
There's no need to use the second comparison to 'US' unless it's possible that the country in t2 could be different from t1's for the same id.
Rather than use an implicit inner join, I would explicitly join the tables.
Since you want both the id fields and country fields to be the same, and you mentioned that both are indexed (I'm presuming in the same index), I would include both columns in the join so you can make use of an index seek instead of a scan. Finally, add your where clause.
SELECT *
FROM t1
JOIN t2 ON t1.id = t2.id AND t1.country = t2.country
WHERE t1.country = 'US'