Concatenating two columns in Oracle giving performance issue - sql

I have a select query where I am doing inner joins and in AND clause I am checking like this
AND UPPER (b.a1||b.a2) IN
(
select a.a1||a.a2
from a
where a.a3 =
(
select UPPER(decode('609',null,c.c1,'609'))
from dual
)
)
but so because of || opertaor it is taking more than 2 minutes. Can we have any other solution to this?

How about using EXISTS clause?
AND EXISTS (
SELECT
1
FROM
a
WHERE
a.a3 = (SELECT UPPER(DECODE('609',c.c1,'609')) FROM dual) -- this condition is pretty odd
AND a.a1 = UPPER(b.a1)
AND a.a2 = UPPER(b.a2)
)
Adding Function Based Indexes on UPPER(b.a1) AND UPPER(b.a2) may help as well.
Speaking of that odd condition: (SELECT UPPER(DECODE('609',c.c1,'609')) FROM dual):
Why do you perform a SELECT from dual there?
What you check is - if '609' equals c.c1 ('609') then a.a3 must equal '609', in any other case your SELECT returns NULL, thus no value from table a is returned. So you can just change the entire condition to a.a3 = '609'.

Try with a JOIN
SELECT *
FROM b
JOIN ( select UPPER(a.a1), UPPER(a.a2)
from a
where a.a3 = (select UPPER(decode('609',c.c1,'609')) from dual)
) a
ON UPPER(b.b1) = a.a1
AND UPPER(b.b2) = a.a2
But the problem is when you do UPPER(b.b1) or b1||b2 you cant use the index anymore.
You may need a function index

You don't need to concatenate. In fact: concatenating the values is a bug waiting to happen: consider the the values foob and ar in table b and foo and bar in a - the concatenation treats those as the same tuple although they are not.
You also don't need the select from dual for a constant:
AND (UPPER (b.a1), upper(b.a2)) IN
(
select a.a1. a.a2
from a
where a.a3 = UPPER(decode('609',null,c.c1,'609'))
)
An index on b(a1,a2) can't be used for this, but you can create an index on b (UPPER (b.a1), upper(b.a2)) which would be used.

Related

Oracle SQL XOR condition with > 14 tables

I have a question on sql desgin.
Context:
I have a table called t_master and 13 other tables (lets call them a,b,c... for simplicity) where it needs to compared.
Logic:
t_master will be compared to table 'a' where t_master.gen_val =
a.value.
If record exist in t_master, retrieve t_master record, else retrieve 'a' record.
I do not need to retrieve the records if it exists in both tables (t_master and a) - XOR condition
Repeat this comparison with the remaining 12 tables.
I have some idea on doing this, using WITH to subquery the non-master tables (a,b,c...) first with their respective WHERE clause.
Then use XOR statement to retrieve the records.
Something like
WITH a AS (SELECT ...),
b AS (SELECT ...)
SELECT field1,field2...
FROM t_master FULL OUTER JOIN a FULL OUTER JOIN b FULL OUTER JOIN c...
ON t_master.gen_value = a.value
WHERE ((field1 = x OR field2 = y ) AND NOT (field1 = x AND field2 = y))
AND ....
.
.
.
.
Seeing that I have 13 tables that I need to full outer join, is there a better way/design to handle this?
Otherwise I would have at least 2*13 lines of WHERE clause which I'm not sure if that will have impact on the performance as t_master is sort of a log table.
**Assume I cant change any schema.
Currently I'm not sure if this SQL will working correctly yet, so I'm hoping someone can guide me in the right direction regarding this.
update from used_by_already's suggestion:
This is what I'm trying to do (comparison between 2 tables first, before I add more, but I am unable to get values from ATP_R.TBL_HI_HDR HI_HDR as it is in the NOT EXISTS subquery.
How do i overcome this?
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO JOIN ATP_R.TBL_HI_HDR HI_HDR ON LOG_REPO.GEN_VAL = HI_HDR.HI_NO
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_R.TBL_HI_HDR HI_HDR
WHERE LOG_REPO.GEN_VAL = HI_HDR.HI_NO
)
UNION ALL
SELECT LOG_REPO.UNIQ_ID,
LOG_REPO.REQUEST_PAYLOAD,
LOG_REPO.GEN_VAL,
LOG_REPO.CREATED_BY,
TO_CHAR(LOG_REPO.CREATED_DT,'DD/MM/YYYY') AS CREATED_DT,
HI_HDR.HI_NO R_VALUE,
HI_HDR.CREATED_BY R_CREATED_BY,
TO_CHAR(HI_HDR.CREATED_DT,'DD/MM/YYYY') AS R_CREATED_DT
FROM ATP_R.TBL_HI_HDR HI_HDR JOIN ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO ON HI_HDR.HI_NO = LOG_REPO.GEN_VAL
WHERE NOT EXISTS
(SELECT NULL
FROM ATP_COMMON.VW_CMN_LOG_GEN_REPO LOG_REPO
WHERE HI_HDR.HI_NO = LOG_REPO.GEN_VAL
)
Full outer joins used to exclude all matching rows can be an expensive query. You don't supply much detail, but perhaps using NOT EXISTS would be simpler and maybe it will produce a better explain plan. Something along these lines.
select
cola,colb,colc
from t_master m
where not exists (
select null from a where m.keycol = a.fk_to_m
)
and not exists (
select null from b where m.keycol = b.fk_to_m
)
and not exists (
select null from c where m.keycol = c.fk_to_m
)
union all
select
cola,colb,colc from a
where not exists (
select null from t_master m where a.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from b
where not exists (
select null from t_master m where b.fk_to_m = m.keycol
)
union all
select
cola,colb,colc from c
where not exists (
select null from t_master m where c.fk_to_m = m.keycol
)
You could union the 13 a,b,c ... tables to simplify the coding, but that may not perform so well.

Cartesian product of two tables: Oracle Sql Join On True

I want to concatenate _BAR to the results of a Query.
One of my first attempts is this:
SELECT lhs.f||rhs.f as concat_bar FROM (
SELECT 'FOO' as f FROM DUAL) lhs
LEFT JOIN
(
SELECT '_BAR' as f FROM DUAL) rhs
ON ('' != rhs.f)
;
but I got no results. I was expecting ON ('' != rhs.f) to evaluate to true so I expected as a result a single row: 'FOO_BAR'. Which is the result of concatenating the cartesian product of the lhs and rhs tables.
How can I JOIN on TRUE?
I know that, for the specific problem, other solutions as
SELECT lhs.f||'_BAR' FROM (
SELECT 'FOO' as f FROM DUAL) lhs;
are possible.
My question is on an effective syntax to make a cartesian product of two tables as a LEFT JOIN ON TRUE.
Aside from the issues with null, is a cross join what you wanted?
select lhs.f || rhs.f as concat_bar
from ( select 'FOO' as f from dual ) lhs
cross join
( select '_BAR' as f from dual ) rhs
Oracle (by default) treats an empty string as NULL. This is a real pain. It is different from other databases. I suppose their argument is: "Well, we were doing it for years before ANSI defined NULL values." Great, but those years were the 1980s.
In any case, this logic:
'' <> rhs.f
(<> is the original not-equals operator in SQL.)
is exactly the same as:
NULL <> rhs.f
This always returns NULL and that is treated as not-true in WHERE conditions.
In Oracle, express this emptiness as:
rhs.f is not null

Exists, or Within

I'm writing a some filtering logic that basically wants to first check if there's a value in the filter table, then if there is return the filtered values. When there isn't a value in the filter table just return everything. The following table does this correctly but I have to write the same select statement twice.
select *
from personTbl
where (not exists (select filterValue from filterTable where filterType = 'name') or
personTbl.name in (select filterValue from filterTable where filterType = 'name'))
Is there some better way to do this that will return true if the table is empty, or the value is contained within it?
One approach is to do a left outer join to your filter-subquery, and then select all the rows where the join either failed (meaning that the subquery returned no rows) or succeeded and had the right value:
SELECT personTbl.*
FROM personTbl
LEFT
OUTER
JOIN ( SELECT DISTINCT filterValue
FROM filterTable
WHERE filterType = 'name'
) filter
ON 1 = 1
WHERE filter.filterValue = personTbl.name
OR filter.filterValue IS NULL
;
To be honest, I'm not sure if the above is a good idea — it's not very intuitive1, and I'm not sure how well it will perform — but you can judge for yourself.
1. As evidence of its unintuitiveness, witness the mistaken claim below that it doesn't work. As of this writing, that comment has garnered two upvotes. So although the query is correct, it clearly inspires people to great confidence that it's wrong. Which is a nice party trick, but not generally desirable in production code.
You can use a collection to try to make the query more intuitive (and only require a single select from the filter table):
CREATE TYPE filterlist IS TABLE OF VARCHAR2(100);
/
SELECT p.*
FROM PersonTbl p
INNER JOIN
( SELECT CAST(
MULTISET(
SELECT filterValue
FROM filterTable
WHERE filterType = 'name'
)
AS filterlist
) AS filters
FROM DUAL ) f
ON ( f.filters IS EMPTY OR p.name MEMBER OF f.filters );

sql parameterised cte query

I have a query like the following
select *
from (
select *
from callTableFunction(#paramPrev)
.....< a whole load of other joins, wheres , etc >........
) prevValues
full join
(
select *
from callTableFunction(#paramCurr)
.....< a whole load of other joins, wheres , etc >........
) currValues on prevValues.Field1 = currValues.Field1
....<other joins with the same subselect as the above two with different parameters passed in
where ........
group by ....
The following subselect is common to all the subselects in the query bar the #param to the table function.
select *
from callTableFunction(#param)
.....< a whole load of other joins, wheres , etc >........
One option is for me to convert this into a function and call the function, but i dont like this as I may be changing the
subselect query quite often for.....or I am wondering if there is an alternative using CTE
like
with sometable(#param1) as
(
select *
from callTableFunction(#param)
.....< a whole load of other joins, wheres , etc >........
)
select
sometable(#paramPrev) prevValues
full join sometable(#currPrev) currValues on prevValues.Field1 = currValues.Field1
where ........
group by ....
Is there any syntax like this or technique I can use like this.
This is in SQL Server 2008 R2
Thanks.
What you're trying to do is not supported syntax - CTE's cannot be parameterised in this way.
See books online - http://msdn.microsoft.com/en-us/library/ms175972.aspx.
(values in brackets after a CTE name are an optional list of output column names)
If there are only two parameter values (paramPrev and currPrev), you might be able to make the code a little easier to read by splitting them into two CTEs - something like this:
with prevCTE as (
select *
from callTableFunction(#paramPrev)
.....< a whole load of other joins, wheres , etc
........ )
,curCTE as (
select *
from callTableFunction(#currPrev)
.....< a whole load of other joins, wheres , etc
........ ),
select
prevCTE prevValues
full join curCTE currValues on
prevValues.Field1 = currValues.Field1 where
........ group by
....
You should be able to wrap the subqueries up as parameterized inline table-valued functions, and then use them with an OUTER JOIN:
CREATE FUNCTION wrapped_subquery(#param int) -- assuming it's an int type, change if necessary...
RETURNS TABLE
RETURN
SELECT * FROM callTableFunction(#param)
.....< a whole load of other joins, wheres , etc ........
GO
SELECT *
FROM
wrapped_subquery(#paramPrev) prevValues
FULL OUTER JOIN wrapped_subquery(#currPrev) currValues ON prevValues.Field1 = currValues.Field1
WHERE ........
GROUP BY ....
After failing to assign scalar variables before with, i finally got a working solution using a stored procedure and a temp table:
create proc hours_absent(#wid nvarchar(30), #start date, #end date)
as
with T1 as(
select c from t
),
T2 as(
select c from T1
)
select c from T2
order by 1, 2
OPTION(MAXRECURSION 365)
Calling the stored procedure:
if object_id('tempdb..#t') is not null drop table #t
create table #t([month] date, hours float)
insert into #t exec hours_absent '9001', '2014-01-01', '2015-01-01'
select * from #t
There may be some differences between my example and what you want depending on how your subsequent ON statements are formulated. Since you didn't specify, I assumed that all the subsequent joins were against the first table.
In my example I used literals rather than #prev,#current but you can easily substitute variables in place of literals to achieve what you want.
-- Standin function for your table function to create working example.
CREATE FUNCTION TestMe(
#parm int)
RETURNS TABLE
AS
RETURN
(SELECT #parm AS N, 'a' AS V UNION ALL
SELECT #parm + 1, 'b' UNION ALL
SELECT #parm + 2, 'c' UNION ALL
SELECT #parm + 2, 'd' UNION ALL
SELECT #parm + 3, 'e');
go
-- This calls TestMe first with 2 then 4 then 6... (what you don't want)
-- Compare these results with those below
SELECT t1.N AS AN, t1.V as AV,
t2.N AS BN, t2.V as BV,
t3.N AS CN, t3.V as CV
FROM TestMe(2)AS t1
FULL JOIN TestMe(4)AS t2 ON t1.N = t2.N
FULL JOIN TestMe(6)AS t3 ON t1.N = t3.N;
-- Put your #vars in place of 2,4,6 adding select statements as needed
WITH params
AS (SELECT 2 AS p UNION ALL
SELECT 4 AS p UNION ALL
SELECT 6 AS p)
-- This CTE encapsulates the call to TestMe (and any other joins)
,AllData
AS (SELECT *
FROM params AS p
OUTER APPLY TestMe(p.p)) -- See! only coded once
-- Add any other necessary joins here
-- Select needs to deal with all the columns with identical names
SELECT d1.N AS AN, d1.V as AV,
d2.N AS BN, d2.V as BV,
d3.N AS CN, d3.V as CV
-- d1 gets limited to values where p = 2 in the where clause below
FROM AllData AS d1
-- Outer joins require the ANDs to restrict row multiplication
FULL JOIN AllData AS d2 ON d1.N = d2.N
AND d1.p = 2 AND d2.p = 4
FULL JOIN AllData AS d3 ON d1.N = d3.N
AND d1.p = 2 AND d2.p = 4 AND d3.p = 6
-- Since AllData actually contains all the rows we must limit the results
WHERE(d1.p = 2 OR d1.p IS NULL)
AND (d2.p = 4 OR d2.p IS NULL)
AND (d3.p = 6 OR d3.p IS NULL);
What you want to do is akin to a pivot and so the complexity of the needed query is similar to creating a pivot result without using the pivot statement.
Were you to use Pivot, duplicate rows (such as I included in this example) would be aggreagted. This is also a solution for doing a pivot where aggregation is unwanted.

General rules for simplifying SQL statements

I'm looking for some "inference rules" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size.
Does there exist something like that? Any papers, any tools? Any equivalencies that you found on your own? It's somehow similar to query optimization, but not in terms of performance.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
So, I'm looking for equivalent transformations of SQL statements like the fact that most SUBSELECTs can be rewritten as a JOIN.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.
That's exactly what optimizers do for a living (not that I'm saying they always do this well).
Since SQL is a set based language, there are usually more than one way to transform one query to other.
Like this query:
SELECT *
FROM mytable
WHERE col1 > #value1 OR col2 < #value2
can be transformed into this one (provided that mytable has a primary key):
SELECT *
FROM mytable
WHERE col1 > #value1
UNION
SELECT *
FROM mytable
WHERE col2 < #value2
or this one:
SELECT mo.*
FROM (
SELECT id
FROM mytable
WHERE col1 > #value1
UNION
SELECT id
FROM mytable
WHERE col2 < #value2
) mi
JOIN mytable mo
ON mo.id = mi.id
, which look uglier but can yield better execution plans.
One of the most common things to do is replacing this query:
SELECT *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)
with this one:
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)
In some RDBMS's (like PostgreSQL 8.4), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace the one with the other:
SELECT mo.grouper,
(
SELECT SUM(col)
FROM mytable mi
WHERE mi.grouper = mo.grouper
)
FROM (
SELECT DISTINCT grouper
FROM mytable
) mo
vs.
SELECT mo.grouper, SUM(col)
FROM mytable
GROUP BY
mo.grouper
In PostgreSQL, DISTINCT sorts and GROUP BY hashes.
MySQL 5.6 lacks FULL OUTER JOIN, so it can be rewritten as following:
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT OUTER JOIN
table2 t2
ON t1.id = t2.id
vs.
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL
, but see this article in my blog on how to do this more efficiently in MySQL:
Emulating FULL OUTER JOIN in MySQL
This hierarchical query in Oracle 11g:
SELECT DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id
can be transformed to this:
SELECT DISTINCT(animal_id) AS animal_id
FROM (
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id
, the latter one being more efficient.
See this article in my blog for the execution plan details:
Genealogy query on both parents
To find all ranges that overlap the given range, you can use the following query:
SELECT *
FROM ranges
WHERE end_date >= #start
AND start_date <= #end
, but in SQL Server this more complex query yields same results faster:
SELECT *
FROM ranges
WHERE (start_date > #start AND start_date <= #end)
OR (#start BETWEEN start_date AND end_date)
, and believe it or not, I have an article in my blog on this too:
Overlapping ranges: SQL Server
SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:
SELECT mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id
can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).
See this article in my blog on how to do it:
Flattening timespans: SQL Server
There is a certain kind of query, commonly met in financial applications, that pulls effective exchange rate for a currency, like this one in Oracle 11g:
SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)
This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:
WITH v_rate AS
(
SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
FROM (
SELECT cur_id, dte_date,
(
SELECT MAX(rte_date)
FROM t_rate ri
WHERE rte_currency = cur_id
AND rte_date <= dte_date
) AS rte_effdate
FROM (
SELECT (
SELECT MAX(rte_date)
FROM t_rate
) - level + 1 AS dte_date
FROM dual
CONNECT BY
level <=
(
SELECT MAX(rte_date) - MIN(rte_date)
FROM t_rate
)
) v_date,
(
SELECT 1 AS cur_id
FROM dual
UNION ALL
SELECT 2 AS cur_id
FROM dual
) v_currency
) v_eff
LEFT JOIN
t_rate
ON rte_currency = cur_id
AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
FROM t_transaction x
GROUP BY
xac_currency, TRUNC(xac_date)
)
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date
Despite being bulky as hell, the latter query is six times as fast.
The main idea here is replacing <= with =, which requires building an in-memory calendar table to join with.
Converting currencies
Here's a few from working with Oracle 8 & 9 (of course, sometimes doing the opposite might make the query simpler or faster):
Parentheses can be removed if they are not used to override operator precedence. A simple example is when all the boolean operators in your where clause are the same: where ((a or b) or c) is equivalent to where a or b or c.
A sub-query can often (if not always) be merged with the main query to simplify it. In my experience, this often improves performance considerably:
select foo.a,
bar.a
from foomatic foo,
bartastic bar
where foo.id = bar.id and
bar.id = (
select ban.id
from bantabulous ban
where ban.bandana = 42
)
;
is equivalent to
select foo.a,
bar.a
from foomatic foo,
bartastic bar,
bantabulous ban
where foo.id = bar.id and
bar.id = ban.id and
ban.bandana = 42
;
Using ANSI joins separates a lot of "code monkey" logic from the really interesting parts of the where clause: The previous query is equivalent to
select foo.a,
bar.a
from foomatic foo
join bartastic bar on bar.id = foo.id
join bantabulous ban on ban.id = bar.id
where ban.bandana = 42
;
If you want to check for the existence of a row, don't use count(*), instead use either rownum = 1 or put the query in a where exists clause to fetch only one row instead of all.
I suppose the obvious one is look for any Cursors that can be replaced with a SQL 'Set' based operation.
Next on my list, is look for any correlated sub-queries that can be re-written as a un-correlated query
In long stored procedures, break out separate SQL statements into their own stored procedures. That way they will get there own cached query plan.
Look for transactions that can have their scope shortened. I regularly find statements inside a transaction that can safely be outside.
Sub-selects can often be re-written as straight forward joins (modern optimisers are good at spotting simple ones)
As #Quassnoi mentioned, the Optimiser often does a good job. One way to help it is to ensure indexes and statistics are up to date, and that suitable indexes exist for your query workload.
I like everyone on a team to follow a set of standards to make code readable, maintainable, understandable, washable, etc.. :)
everyone uses the same alias
no cursors. no loops
why even think of IN when you can EXISTS
INDENT
Consistency in coding style
there is some more stuff here What are some of your most useful database standards?
I like to replace all sort of subselect by join query.
This one is obvious :
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT *
FROM othertable o
WHERE o.othercol = mo.col
)
by
SELECT mo.*
FROM mytable mo inner join othertable o on o.othercol = mo.col
And this one is under estimate :
SELECT *
FROM mytable mo
WHERE NOT EXISTS
(
SELECT *
FROM othertable o
WHERE o.othercol = mo.col
)
by
SELECT mo.*
FROM mytable mo left outer join othertable o on o.othercol = mo.col
WHERE o.othercol is null
It could help the DBMS to choose the good execution plan in a big request.
Given the nature of SQL, you absolutely have to be aware of the performance implications of any refactoring. Refactoring SQL Applications is a good resource on refactoring with a heavy emphasis on performance (see Chapter 5).
Although simplification may not equal optimization, simplification can be important in writing readable SQL code, which is in turn critical to being able to check your SQL code for conceptual correctness (not syntactic correctness, which your development environment should check for you). It seems to me that in an ideal world, we would write the most simple, readable SQL code and then the optimizer would rewrite that SQL code to be in whatever form (perhaps more verbose) would run the fastest.
I have found that thinking of SQL statements as based on set logic is very useful, particularly if I need to combine where clauses or figure out a complex negation of a where clause. I use the laws of boolean algebra in this case.
The most important ones for simplifying a where clause are probably DeMorgan's Laws (note that "·" is "AND" and "+" is "OR"):
NOT (x · y) = NOT x + NOT y
NOT (x + y) = NOT x · NOT y
This translates in SQL to:
NOT (expr1 AND expr2) -> NOT expr1 OR NOT expr2
NOT (expr1 OR expr2) -> NOT expr1 AND NOT expr2
These laws can be very useful in simplifying where clauses with lots of nested AND and OR parts.
It is also useful to remember that the statement field1 IN (value1, value2, ...) is equivalent to field1 = value1 OR field1 = value2 OR ... . This allows you to negate the IN () one of two ways:
NOT field1 IN (value1, value2) -- for longer lists
NOT field1 = value1 AND NOT field1 = value2 -- for shorter lists
A sub-query can be thought of this way also. For example, this negated where clause:
NOT (table1.field1 = value1 AND EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2))
can be rewritten as:
NOT table1.field1 = value1 OR NOT EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2))
These laws do not tell you how to transform a SQL query using a subquery into one using a join, but boolean logic can help you understand join types and what your query should be returning. For example, with tables A and B, an INNER JOIN is like A AND B, a LEFT OUTER JOIN is like (A AND NOT B) OR (A AND B) which simplifies to A OR (A AND B), and a FULL OUTER JOIN is A OR (A AND B) OR B which simplifies to A OR B.
jOOQ supports pattern based transformation, which can be used in the online SQL parser and translator (look for the "patterns" dropdown), or as a parser CLI, or programmatically.
Since you're mainly looking for ways to turn your query into something simpler, not necessarily faster (which may depend on the target RDBMS), jOOQ could help you here.
Some examples include:
CASE to CASE abbreviation
-- Original
SELECT
CASE WHEN x IS NULL THEN y ELSE x END,
CASE WHEN x = y THEN NULL ELSE x END,
CASE WHEN x IS NOT NULL THEN y ELSE z END,
CASE WHEN x IS NULL THEN y ELSE z END,
CASE WHEN x = 1 THEN y WHEN x = 2 THEN z END,
FROM tab;
-- Transformed
SELECT
NVL(x, y), -- If available in the target dialect, otherwise COALESCE
NULLIF(x, y),
NVL2(x, y, z), -- If available in the target dialect
NVL2(x, z, y), -- If available in the target dialect
CHOOSE(x, y, z) -- If available in the target dialect
FROM tab;
COUNT(*) scalar subquery comparison
-- Original
SELECT (SELECT COUNT(*) FROM tab) > 0;
-- Transformed
SELECT EXISTS (SELECT 1 FROM tab)
Flatten CASE
-- Original
SELECT
CASE
WHEN a = b THEN 1
ELSE CASE
WHEN c = d THEN 2
END
END
FROM tab;
-- Transformed
SELECT
CASE
WHEN a = b THEN 1
WHEN c = d THEN 2
END
FROM tab;
NOT AND (De Morgan's rules)
-- Original
SELECT
NOT (x = 1 AND y = 2),
NOT (x = 1 AND y = 2 AND z = 3)
FROM tab;
-- Transformed
SELECT
NOT (x = 1) OR NOT (y = 2),
NOT (x = 1) OR NOT (y = 2) OR NOT (z = 3)
FROM tab;
Unnecessary EXISTS subquery clauses
-- Original
SELECT EXISTS (SELECT DISTINCT a, b FROM t);
-- Transformed
SELECT EXISTS (SELECT 1 FROM t);
There's a lot more.
Disclaimer: I work for the company behind jOOQ
My approach is to learn relational theory in general and relational algebra in particular. Then learn to spot the constructs used in SQL to implement operators from the relational algebra (e.g. universal quantification a.k.a. division) and calculus (e.g. existential quantification). The gotcha is that SQL has features not found in the relational model e.g. nulls, which are probably best refactored away anyhow. Recommended reading: SQL and Relational Theory: How to Write Accurate SQL Code By C. J. Date.
In this vein, I'm not convinced "the fact that most SUBSELECTs can be rewritten as a JOIN" represents a simplification.
Take this query for example:
SELECT c
FROM T1
WHERE c NOT IN ( SELECT c FROM T2 );
Rewrite using JOIN
SELECT DISTINCT T1.c
FROM T1 NATURAL LEFT OUTER JOIN T2
WHERE T2.c IS NULL;
The join is more verbose!
Alternatively, recognize the construct is implementing an antijoin on the projection of c e.g. pseudo algrbra
T1 { c } antijoin T2 { c }
Simplification using relational operators:
SELECT c FROM T1 EXCEPT SELECT c FROM T2;