Filtering out children in a table with parentid - sql

I need a bit of help constructing a query that will let me filter the following data.
Table: MyTree
Id ParentId Visible
=====================
1 null 0
2 1 1
3 2 1
4 3 1
5 null 1
6 5 1
I expect the following result from the query:
Id ParentId Visible
=====================
5 null 1
6 5 1
That is, none of the children of the hidden node should be returned. What's more, the depth of the hierarchy is not limited. And please don't answer "just set 2, 3 & 4 to visible=0"; for non-obvious reasons that is not possible... I'm fixing a horrible "legacy system".
I was thinking of something like:
SELECT *
FROM MyTree m1
JOIN MyTree m2 ON m1.ParentId = m2.Id
WHERE m1.Visible = 1
AND (m1.ParentId IS NULL OR m2.Id IS NOT NULL)
Sorry for any syntactical mistakes
But that will only filter the first level, right? Hope you can help.
Edit: Finished up the title, whoops. The server is a brand spanking new MSSQL 2008 server but the database is running in 2000 compatibility mode.

In SQL Server 2005+:
WITH q (id, parentid, visible) AS
(
SELECT id, parentid, visible
FROM mytree
WHERE id = 5
UNION ALL
SELECT m.id, m.parentid, m.visible
FROM q
JOIN mytree m
ON m.parentid = q.id
WHERE q.visible = 1
)
SELECT *
FROM q

I agree with @Quassnoi's focus on recursive CTEs (in SQL Server 2005 or later) but I think the logic is different to answer the original question:
WITH visall(id, parentid, visible) AS
(SELECT id, parentid, visible
FROM mytree
WHERE parentid IS NULL
UNION ALL
SELECT m.id, m.parentid, m.visible & visall.visible AS visible
FROM visall
JOIN mytree m
ON m.parentid = visall.id
)
SELECT *
FROM visall
WHERE visall.visible = 1
A probably more optimized way to express the same logic should be to have the visible checks in the WHERE as much as possible -- stop recursion along invisible "subtrees" ASAP. I.e.:
WITH visall(id, parentid, visible) AS
(SELECT id, parentid, visible
FROM mytree
WHERE parentid IS NULL AND visible = 1
UNION ALL
SELECT m.id, m.parentid, m.visible
FROM visall
JOIN mytree m
ON m.parentid = visall.id
WHERE m.visible = 1
)
SELECT *
FROM visall
As usual with performance issues, benchmarking both versions on realistic data is necessary to decide with confidence (it also helps to check that they do indeed produce identical results;-) -- as DB engines' optimizers sometimes do strange things for strange reasons;-).
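To make the "check that they produce identical results" advice concrete, here is a minimal sketch that runs both variants against the sample data from the question. It uses SQLite through Python's sqlite3 module purely as a convenient stand-in (an assumption; the thread targets SQL Server, where the keyword RECURSIVE is not written), so it says nothing about relative performance on the real server.

```python
import sqlite3

# In-memory SQLite database as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytree (id INTEGER PRIMARY KEY, parentid INTEGER, visible INTEGER);
INSERT INTO mytree VALUES
  (1, NULL, 0), (2, 1, 1), (3, 2, 1), (4, 3, 1), (5, NULL, 1), (6, 5, 1);
""")

# Variant 1: carry the visibility flag down the tree, filter at the end.
carry_flag = conn.execute("""
WITH RECURSIVE visall(id, parentid, visible) AS (
  SELECT id, parentid, visible FROM mytree WHERE parentid IS NULL
  UNION ALL
  SELECT m.id, m.parentid, m.visible & visall.visible
  FROM visall JOIN mytree m ON m.parentid = visall.id
)
SELECT id, parentid, visible FROM visall WHERE visible = 1 ORDER BY id
""").fetchall()

# Variant 2: stop recursing as soon as a hidden node is reached.
pruned = conn.execute("""
WITH RECURSIVE visall(id, parentid, visible) AS (
  SELECT id, parentid, visible FROM mytree
  WHERE parentid IS NULL AND visible = 1
  UNION ALL
  SELECT m.id, m.parentid, m.visible
  FROM visall JOIN mytree m ON m.parentid = visall.id
  WHERE m.visible = 1
)
SELECT id, parentid, visible FROM visall ORDER BY id
""").fetchall()

print(carry_flag)
print(pruned)
```

On this data both variants return only rows 5 and 6, which is the result the question asks for.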

I think Quassnoi was close to what the questioner wants, but not quite. I think this is what the questioner is looking for (SQL Server 2005+):
WITH q (id) AS
(
SELECT id
FROM mytree
WHERE parentid is null and visible=1
UNION ALL
SELECT m.id
FROM q
JOIN mytree m
ON m.parentid = q.id
WHERE m.visible = 1
)
SELECT *
FROM q
Common Table Expressions are great for this kind of work.

I don't think what you need is possible from a single query. This looks more like something to do from code and still it will require multiple queries to DB.
If you really need to do it from SQL I think your best bet would be to use a cursor and build a table with hidden IDs. If data doesn't change often you might keep that 'temporary' table as a kind of cache.
Edit: I stand corrected (for SQL 2005) and also learned something new today :)

Related

Recursive Query CTE Father - Son - Grandson error

I have a table that has an ID and IDFATHER of some projects, these projects can receive N sons, so, the structure is like
ID IDFATHER REV
=================
1 1 0
2 1 1
5 2 2
Starting at ID 5, I need to walk up to ID 1, so I wrote a CTE query:
WITH lb (ID, IDFATHER) AS (
SELECT ID, IDFATHER
FROM PROJECTS
WHERE ID = 5
UNION ALL
SELECT I.ID, I.IDFATHER
FROM PROJECTS I
JOIN lb LBI ON I.ID = LBI.IDFATHER
--WHERE I.ID = LBI.IDFATHER -- Recursive Subquery
)
SELECT *
FROM lb
WHERE LB.ID = LB.IDFATHER
When this code runs it gives me:
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
So basically I handle it by just adding:
SELECT TOP 1 * FROM LB WHERE LB.ID = LB.IDFATHER
But I really want to know where my error is. Can anyone give me a hand with this?
The first row points to itself so the recursion never stops. You need to add this condition inside the recursive cte:
WHERE LBI.ID <> LBI.IDFATHER
I would rather set IDFather of the first row to NULL.
The recursion didn't stop because your top row refers to itself endlessly.
If the top row has a null parent, that would have stopped the recursion.
Another approach is to use that case id = parentid as the termination logic.
The fiddle
WITH LB (ID, IDFATHER, idstart) AS (
SELECT ID, IDFATHER, id
FROM PROJECTS WHERE ID = 5
UNION ALL
SELECT I.ID, I.IDFATHER, lbi.idstart
FROM PROJECTS I
JOIN LB LBI
ON I.ID = lbi.IDFATHER
AND lbi.id <> lbi.idfather
)
SELECT id AS idtop, idstart
FROM LB
WHERE LB.ID = LB.IDFATHER
;
The result, for the sample data above, is a single row: idtop = 1, idstart = 5.

What is the practical difference between these two SQL statements?

In an exam I was asked to retrieve the name of the transporters never having transported a container based in Rotterdam. The correct answer was
select Transporter.ID
from Transporter
where Transporter.ID not in (
select TransporterID
from Container
inner join Transportation on Container.ID = Transportation.ContainerID
where Container.City = 'Rotterdam')
and nevertheless the following was marked as a wrong answer:
select Transporter.ID
from Transporter
where Transporter.ID in (
select TransporterID
from Container
inner join Transportation on Container.ID = Transportation.ContainerID
where Container.City <> 'Rotterdam')
Why don't both statements lead to the same result? What is the practical difference between in ( ... where A <> B ) and not in ( ... where A = B )?
[Note that Transportation is in the center of the relational scheme, with all its prime attributes being foreign keys]
Let's build a simple table as example :
Container
TransporterID | City
1 | 'Rotterdam'
1 | 'Paris'
2 | 'Rotterdam'
And then this query
SELECT TransporterID
FROM Container
WHERE Container.City <> 'Rotterdam'
This returns 1 (the row with Paris).
Then, WHERE Transporter.ID IN ( ... statement will give wrong result (transporter 1 has been to 'Rotterdam')
Besides what the other answers point out, take NULLs into consideration:
If City is NULL, the comparison evaluates to UNKNOWN in both queries, so the WHERE clause of either subquery rejects that row...
Your version is answering a slightly different question: "What are the ids of transporters that have transported a container somewhere other than Rotterdam?".
As for the best answer, I would use not exists (which is material) and table aliases (more stylistic):
select t.ID
from Transporter t
where not exists (select 1
from Container c join
Transportation tr
on c.ID = tr.ContainerID
where tr.TransporterID = t.id and
c.City = 'Rotterdam'
);
NOT IN does not behave the way most people expect when any row in the subquery returns NULL (all rows are filtered out in that case). NOT EXISTS has the expected behavior.
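The differences discussed above can be reproduced with a small, made-up data set. The sketch below uses SQLite through Python's sqlite3 module as a stand-in database (an assumption), and the rows, including the Transportation row with a NULL TransporterID, are hypothetical and chosen only to trigger each behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Transporter (ID INTEGER PRIMARY KEY);
CREATE TABLE Container (ID INTEGER PRIMARY KEY, City TEXT);
CREATE TABLE Transportation (TransporterID INTEGER, ContainerID INTEGER);
INSERT INTO Transporter VALUES (1), (2), (3);
INSERT INTO Container VALUES (10, 'Rotterdam'), (11, 'Paris');
-- Transporter 1 moved both containers, transporter 2 only the Paris one,
-- transporter 3 nothing. The last row has a NULL TransporterID (hypothetical).
INSERT INTO Transportation VALUES (1, 10), (1, 11), (2, 11), (NULL, 10);
""")

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# "IN (... <> 'Rotterdam')": transporters that moved something elsewhere.
in_not_equal = ids("""
SELECT t.ID FROM Transporter t
WHERE t.ID IN (SELECT tr.TransporterID
               FROM Container c
               JOIN Transportation tr ON c.ID = tr.ContainerID
               WHERE c.City <> 'Rotterdam')
ORDER BY t.ID""")

# "NOT IN (... = 'Rotterdam')": the subquery yields 1 and NULL here.
not_in = ids("""
SELECT t.ID FROM Transporter t
WHERE t.ID NOT IN (SELECT tr.TransporterID
                   FROM Container c
                   JOIN Transportation tr ON c.ID = tr.ContainerID
                   WHERE c.City = 'Rotterdam')
ORDER BY t.ID""")

# NOT EXISTS: unaffected by the NULL.
not_exists = ids("""
SELECT t.ID FROM Transporter t
WHERE NOT EXISTS (SELECT 1
                  FROM Container c
                  JOIN Transportation tr ON c.ID = tr.ContainerID
                  WHERE tr.TransporterID = t.ID AND c.City = 'Rotterdam')
ORDER BY t.ID""")

print(in_not_equal, not_in, not_exists)
```

With this data the IN version wrongly includes transporter 1 and misses transporter 3, the NOT IN version returns nothing because of the NULL, and NOT EXISTS returns transporters 2 and 3.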

How to get first entry with a value from an hierarchical setting structure?

I have a couple of tables. One table with Groups:
[ID] - [ParentGroupID]
1 - NULL
2 - 1
3 - 1
4 - 2
And another with settings
[Setting] - [GroupId] - [Value]
Title - 1 - Hello
Title - 2 - World
Now I'd like to get "Hello" back if I'd query the Title for Group 3
And I'd like to get "World" back if I'd query the Title for Group 4 (And not "Hello" as well)
Is there any way to efficiently do this in MSSQL? At the moment I am resolving this recursively in code. But I was hoping that SQL could solve this problem for me.
I don't know the SQL Server syntax, but something like the following?
SELECT settings.value
FROM settings
JOIN groups ON settings.groupid = groups.parentgroupid
WHERE settings.setting = 'Title'
AND groups.id = 3
This is a problem we've encountered multiple times in our company. This would work for any case, including when the settings can be set only at some levels and not others (see SQL Fiddle http://sqlfiddle.com/#!3/16af0/1/0):
With GroupSettings(group_id, parent_group_id, value, current_level)
As
(
Select g.id as group_id, g.parent_id, s.value, 0 As current_Level
From Groups As g
Join Settings As s On s.group_id = g.id
Where g.parent_id Is Null
Union All
Select g.id, g.parent_id, Coalesce((Select value From Settings s Where s.group_id=g.id), gs.value), current_level+1
From GroupSettings as gs
Join Groups As g On g.parent_id = gs.group_id
)
Select *
From GroupSettings
Where group_id=4
I believe the following is what you are seeking. See the sqlfiddle
SELECT Value FROM
Groups g inner join Settings s
ON g.ParentGroupId = s.GroupID
WHERE g.ID = 3 -- returns Hello; with ID = 4 it returns World
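For comparison, here is a different sketch from the answers above: instead of looking only at the direct parent, it walks up from the requested group to the root and picks the setting found at the smallest distance. This is an illustration under assumptions, not any poster's method: it runs on SQLite via Python's sqlite3 module, the helper name effective_setting is hypothetical, and in T-SQL you would write TOP 1 instead of LIMIT 1 and drop the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Groups" is quoted because GROUPS is a keyword in recent SQLite versions.
conn.executescript("""
CREATE TABLE "Groups" (ID INTEGER PRIMARY KEY, ParentGroupID INTEGER);
CREATE TABLE Settings (Setting TEXT, GroupId INTEGER, Value TEXT);
INSERT INTO "Groups" VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
INSERT INTO Settings VALUES ('Title', 1, 'Hello'), ('Title', 2, 'World');
""")

def effective_setting(group_id, setting):
    """Return the value set on the nearest ancestor (or the group itself)."""
    row = conn.execute("""
    WITH RECURSIVE chain(ID, ParentGroupID, depth) AS (
      -- Start at the requested group ...
      SELECT ID, ParentGroupID, 0 FROM "Groups" WHERE ID = ?
      UNION ALL
      -- ... and climb one parent per step until ParentGroupID is NULL.
      SELECT g.ID, g.ParentGroupID, chain.depth + 1
      FROM chain JOIN "Groups" g ON g.ID = chain.ParentGroupID
    )
    SELECT s.Value
    FROM chain JOIN Settings s ON s.GroupId = chain.ID
    WHERE s.Setting = ?
    ORDER BY chain.depth
    LIMIT 1
    """, (group_id, setting)).fetchone()
    return row[0] if row else None

print(effective_setting(3, "Title"))
print(effective_setting(4, "Title"))
```

Group 3 has no Title of its own and inherits "Hello" from group 1, while group 4 stops at group 2 and gets "World" only.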

Recursive Query on a self referential table (not hierarchical)

I am creating a state chart of sorts with the data being stored in a simple self referencing table (JobPath)
JobId - ParentJobId
I was using a standard SQL CTE to get the data out which was working perfectly until I ended up with the following data
JobId - ParentId
1 2
2 3
3 4
4 2
Now as you can see Job 4 links to Job 2 which goes to Job 3 and then to Job 4 and so on.
Is there any way I can tell my query not to pull out data it already has?
Here is my current query
WITH JobPathTemp (JobId, ParentId, Level)
AS
(
-- Anchor member definition
SELECT j.JobId, jp.ParentJobId, 1 AS Level
FROM Job AS j
LEFT OUTER JOIN dbo.JobPath AS jp
ON j.JobId = jp.JobId
where j.JobId=1516
UNION ALL
-- Recursive member definition
SELECT j.JobId, jp.ParentJobId, Level + 1
FROM dbo.Job as j
INNER JOIN dbo.JobPath AS jp
ON j.JobId = jp.JobId
INNER JOIN JobPathTemp AS jpt
ON jpt.ParentId = jp.JobId
WHERE jp.ParentJobId <> jpt.JobId
)
-- Statement that executes the CTE
SELECT * FROM JobPathTemp
If you are not dealing with a large number of entries, the following solution might be suitable. The idea is to build the complete "id path" for each row and make sure the "current id" (in the recursive part) is not already in the path being processed:
(I removed the join to jobpath for testing purposes but the basic pattern should be the same)
WITH JobPathTemp (JobId, ParentId, Level, id_path)
AS
(
SELECT jobid,
parentid,
1 as level,
'|' + cast(jobid as varchar(max)) as id_path
FROM job
WHERE jobid = 1
UNION ALL
SELECT j.JobId,
j.parentid,
Level + 1,
jpt.id_path + '|' + cast(j.jobid as varchar(max))
FROM Job as j
INNER JOIN JobPathTemp AS jpt ON j.jobid = jpt.parentid
AND charindex('|' + cast(j.jobid as varchar), jpt.id_path) = 0
)
SELECT *
FROM JobPathTemp
;
This solution doesn't work: SQL Server doesn't support using UNION to join together the recursive term. Since you can't refer to the recursion except as the join, tbh I don't see any alternative to using a stored function...
You didn't post your query... but I tried (in postgres, which works in much the same way) and if you use "UNION" (not "UNION ALL") in the recursive term, then it should automatically remove duplicate rows:
with /*recursive*/ jobs as
(select jobpath.jobid, jobpath.parentjobid from jobpath where jobid = 1
union
select jobpath.jobid, jobpath.parentjobid
from jobpath
join jobs on jobs.parentjobid = jobpath.jobid
)
select jobpath.* from jobpath join jobs on jobpath.jobid = jobs.jobid;
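The UNION idea can be tried on the cyclic sample data from the question. The sketch below uses SQLite via Python's sqlite3 module (an assumption; SQLite, like PostgreSQL, accepts UNION in a recursive CTE, whereas SQL Server does not, as noted above).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobpath (jobid INTEGER, parentjobid INTEGER);
-- 4 -> 2 closes a cycle: 2 -> 3 -> 4 -> 2 -> ...
INSERT INTO jobpath VALUES (1, 2), (2, 3), (3, 4), (4, 2);
""")

# UNION (not UNION ALL) discards rows that were already produced,
# so the walk stops once the cycle starts repeating.
reachable = conn.execute("""
WITH RECURSIVE jobs(jobid, parentjobid) AS (
  SELECT jobid, parentjobid FROM jobpath WHERE jobid = 1
  UNION
  SELECT jp.jobid, jp.parentjobid
  FROM jobpath jp JOIN jobs ON jobs.parentjobid = jp.jobid
)
SELECT jobid, parentjobid FROM jobs ORDER BY jobid
""").fetchall()

print(reachable)
```

Note that this only terminates because repeated rows are identical. Adding a Level counter column, as in the question's query, would make every row distinct and the recursion would run forever again.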

General rules for simplifying SQL statements

I'm looking for some "inference rules" (similar to set operation rules or logic rules) which I can use to reduce a SQL query in complexity or size.
Does there exist something like that? Any papers, any tools? Any equivalencies that you found on your own? It's somehow similar to query optimization, but not in terms of performance.
To state it different: Having a (complex) query with JOINs, SUBSELECTs, UNIONs is it possible (or not) to reduce it to a simpler, equivalent SQL statement, which is producing the same result, by using some transformation rules?
So, I'm looking for equivalent transformations of SQL statements like the fact that most SUBSELECTs can be rewritten as a JOIN.
This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.
That's exactly what optimizers do for a living (not that I'm saying they always do this well).
Since SQL is a set-based language, there is usually more than one way to transform one query into another.
Like this query:
SELECT *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2
can be transformed into this one (provided that mytable has a primary key):
SELECT *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2
or this one:
SELECT mo.*
FROM (
SELECT id
FROM mytable
WHERE col1 > @value1
UNION
SELECT id
FROM mytable
WHERE col2 < @value2
) mi
JOIN mytable mo
ON mo.id = mi.id
, which look uglier but can yield better execution plans.
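A quick way to convince yourself that the OR form and the UNION form return the same rows is to run both on a tiny table. The sketch below does that in SQLite via Python's sqlite3 module; the table contents and the threshold values are made up for illustration, and nothing here says which form is faster on a real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 INTEGER, col2 INTEGER);
-- Row 1 matches both predicates, row 2 neither, rows 3 and 4 one each.
INSERT INTO mytable VALUES (1, 10, 1), (2, 1, 10), (3, 10, 10), (4, 1, 1);
""")
params = {"value1": 5, "value2": 5}

with_or = conn.execute("""
SELECT * FROM mytable
WHERE col1 > :value1 OR col2 < :value2
ORDER BY id""", params).fetchall()

# UNION removes the duplicate copy of row 1; the primary key guarantees
# that two distinct rows can never be merged by mistake.
with_union = conn.execute("""
SELECT * FROM mytable WHERE col1 > :value1
UNION
SELECT * FROM mytable WHERE col2 < :value2
ORDER BY id""", params).fetchall()

# UNION ALL keeps the duplicate, so it is not an equivalent rewrite.
with_union_all = conn.execute("""
SELECT * FROM mytable WHERE col1 > :value1
UNION ALL
SELECT * FROM mytable WHERE col2 < :value2
ORDER BY id""", params).fetchall()

print(with_or)
print(with_union)
print(len(with_union_all))
```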
One of the most common things to do is replacing this query:
SELECT *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)
with this one:
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)
In some RDBMS's (like PostgreSQL 8.4), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace the one with the other:
SELECT mo.grouper,
(
SELECT SUM(col)
FROM mytable mi
WHERE mi.grouper = mo.grouper
)
FROM (
SELECT DISTINCT grouper
FROM mytable
) mo
vs.
SELECT mo.grouper, SUM(col)
FROM mytable mo
GROUP BY
mo.grouper
In PostgreSQL, DISTINCT sorts and GROUP BY hashes.
MySQL 5.6 lacks FULL OUTER JOIN, so it can be rewritten as follows:
SELECT t1.col1, t2.col2
FROM table1 t1
FULL OUTER JOIN
table2 t2
ON t1.id = t2.id
vs.
SELECT t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL
, but see this article in my blog on how to do this more efficiently in MySQL:
Emulating FULL OUTER JOIN in MySQL
This hierarchical query in Oracle 11g:
SELECT DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id
can be transformed to this:
SELECT DISTINCT(animal_id) AS animal_id
FROM (
SELECT 0 AS gender, animal_id, father AS parent
FROM animal
UNION ALL
SELECT 1, animal_id, mother
FROM animal
)
START WITH
animal_id = :id
CONNECT BY
parent = PRIOR animal_id
ORDER BY
animal_id
, the latter one being more efficient.
See this article in my blog for the execution plan details:
Genealogy query on both parents
To find all ranges that overlap the given range, you can use the following query:
SELECT *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end
, but in SQL Server this more complex query yields same results faster:
SELECT *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)
, and believe it or not, I have an article in my blog on this too:
Overlapping ranges: SQL Server
SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:
SELECT mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id
can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).
See this article in my blog on how to do it:
Flattening timespans: SQL Server
There is a certain kind of query, commonly met in financial applications, that pulls effective exchange rate for a currency, like this one in Oracle 11g:
SELECT TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)
This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:
WITH v_rate AS
(
SELECT cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
FROM (
SELECT cur_id, dte_date,
(
SELECT MAX(rte_date)
FROM t_rate ri
WHERE rte_currency = cur_id
AND rte_date <= dte_date
) AS rte_effdate
FROM (
SELECT (
SELECT MAX(rte_date)
FROM t_rate
) - level + 1 AS dte_date
FROM dual
CONNECT BY
level <=
(
SELECT MAX(rte_date) - MIN(rte_date)
FROM t_rate
)
) v_date,
(
SELECT 1 AS cur_id
FROM dual
UNION ALL
SELECT 2 AS cur_id
FROM dual
) v_currency
) v_eff
LEFT JOIN
t_rate
ON rte_currency = cur_id
AND rte_date = rte_effdate
)
SELECT TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM (
SELECT xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
FROM t_transaction x
GROUP BY
xac_currency, TRUNC(xac_date)
)
JOIN v_rate
ON eff_currency = xac_currency
AND eff_date = xac_date
Despite being bulky as hell, the latter query is six times as fast.
The main idea here is replacing <= with =, which requires building an in-memory calendar table to join with.
Converting currencies
Here's a few from working with Oracle 8 & 9 (of course, sometimes doing the opposite might make the query simpler or faster):
Parentheses can be removed if they are not used to override operator precedence. A simple example is when all the boolean operators in your where clause are the same: where ((a or b) or c) is equivalent to where a or b or c.
A sub-query can often (if not always) be merged with the main query to simplify it. In my experience, this often improves performance considerably:
select foo.a,
bar.a
from foomatic foo,
bartastic bar
where foo.id = bar.id and
bar.id = (
select ban.id
from bantabulous ban
where ban.bandana = 42
)
;
is equivalent to
select foo.a,
bar.a
from foomatic foo,
bartastic bar,
bantabulous ban
where foo.id = bar.id and
bar.id = ban.id and
ban.bandana = 42
;
Using ANSI joins separates a lot of "code monkey" logic from the really interesting parts of the where clause: The previous query is equivalent to
select foo.a,
bar.a
from foomatic foo
join bartastic bar on bar.id = foo.id
join bantabulous ban on ban.id = bar.id
where ban.bandana = 42
;
If you want to check for the existence of a row, don't use count(*), instead use either rownum = 1 or put the query in a where exists clause to fetch only one row instead of all.
I suppose the obvious one is look for any Cursors that can be replaced with a SQL 'Set' based operation.
Next on my list is to look for any correlated sub-queries that can be rewritten as an uncorrelated query.
In long stored procedures, break out separate SQL statements into their own stored procedures. That way they will get their own cached query plan.
Look for transactions that can have their scope shortened. I regularly find statements inside a transaction that can safely be outside.
Sub-selects can often be rewritten as straightforward joins (modern optimisers are good at spotting simple ones).
As @Quassnoi mentioned, the Optimiser often does a good job. One way to help it is to ensure indexes and statistics are up to date, and that suitable indexes exist for your query workload.
I like everyone on a team to follow a set of standards to make code readable, maintainable, understandable, washable, etc.. :)
everyone uses the same alias
no cursors. no loops
why even think of IN when you can EXISTS
INDENT
Consistency in coding style
There is some more stuff here: What are some of your most useful database standards?
I like to replace all sorts of subselects with join queries.
This one is obvious:
SELECT *
FROM mytable mo
WHERE EXISTS
(
SELECT *
FROM othertable o
WHERE o.othercol = mo.col
)
by (equivalent only if othertable.othercol is unique; otherwise the join returns duplicate rows)
SELECT mo.*
FROM mytable mo inner join othertable o on o.othercol = mo.col
And this one is underestimated:
SELECT *
FROM mytable mo
WHERE NOT EXISTS
(
SELECT *
FROM othertable o
WHERE o.othercol = mo.col
)
by
SELECT mo.*
FROM mytable mo left outer join othertable o on o.othercol = mo.col
WHERE o.othercol is null
This can help the DBMS choose a good execution plan in a large query.
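Before applying these two rewrites it is worth checking them on data with repeated values. The following sketch runs all four queries in SQLite via Python's sqlite3 module; the sample rows are hypothetical, with 'a' deliberately appearing twice in othertable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (id INTEGER PRIMARY KEY, col TEXT);
CREATE TABLE othertable (othercol TEXT);
INSERT INTO mytable VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO othertable VALUES ('a'), ('a'), ('b');  -- 'a' twice on purpose
""")

def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

exists_ids = ids("""
SELECT mo.id FROM mytable mo
WHERE EXISTS (SELECT * FROM othertable o WHERE o.othercol = mo.col)
ORDER BY mo.id""")

# The inner join repeats row 1 once per matching row in othertable.
join_ids = ids("""
SELECT mo.id FROM mytable mo
INNER JOIN othertable o ON o.othercol = mo.col
ORDER BY mo.id""")

not_exists_ids = ids("""
SELECT mo.id FROM mytable mo
WHERE NOT EXISTS (SELECT * FROM othertable o WHERE o.othercol = mo.col)
ORDER BY mo.id""")

# The anti-join pattern is not affected by duplicates in othertable.
left_join_ids = ids("""
SELECT mo.id FROM mytable mo
LEFT OUTER JOIN othertable o ON o.othercol = mo.col
WHERE o.othercol IS NULL
ORDER BY mo.id""")

print(exists_ids, join_ids, not_exists_ids, left_join_ids)
```

So the NOT EXISTS rewrite holds on this data, while the EXISTS rewrite needs either a unique othercol or a DISTINCT to stay equivalent.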
Given the nature of SQL, you absolutely have to be aware of the performance implications of any refactoring. Refactoring SQL Applications is a good resource on refactoring with a heavy emphasis on performance (see Chapter 5).
Although simplification may not equal optimization, simplification can be important in writing readable SQL code, which is in turn critical to being able to check your SQL code for conceptual correctness (not syntactic correctness, which your development environment should check for you). It seems to me that in an ideal world, we would write the most simple, readable SQL code and then the optimizer would rewrite that SQL code to be in whatever form (perhaps more verbose) would run the fastest.
I have found that thinking of SQL statements as based on set logic is very useful, particularly if I need to combine where clauses or figure out a complex negation of a where clause. I use the laws of boolean algebra in this case.
The most important ones for simplifying a where clause are probably DeMorgan's Laws (note that "·" is "AND" and "+" is "OR"):
NOT (x · y) = NOT x + NOT y
NOT (x + y) = NOT x · NOT y
This translates in SQL to:
NOT (expr1 AND expr2) -> NOT expr1 OR NOT expr2
NOT (expr1 OR expr2) -> NOT expr1 AND NOT expr2
These laws can be very useful in simplifying where clauses with lots of nested AND and OR parts.
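Since SQL conditions can also evaluate to NULL (unknown), one may wonder whether De Morgan's laws still hold there. The sketch below checks all nine combinations of 0, 1 and NULL in SQLite via Python's sqlite3 module (an assumption: other engines are expected to follow the same three-valued logic, but this only tests SQLite). The IS NOT operator is used because it treats two NULLs as equal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Count the (x, y) combinations where either law fails.
violations = conn.execute("""
WITH b(v) AS (VALUES (0), (1), (NULL))
SELECT COUNT(*)
FROM b AS x CROSS JOIN b AS y
WHERE (NOT (x.v AND y.v)) IS NOT ((NOT x.v) OR (NOT y.v))
   OR (NOT (x.v OR y.v)) IS NOT ((NOT x.v) AND (NOT y.v))
""").fetchone()[0]

print(violations)
```

No combination violates either law, so the rewrites are safe even when the sub-expressions can be NULL.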
It is also useful to remember that the statement field1 IN (value1, value2, ...) is equivalent to field1 = value1 OR field1 = value2 OR ... . This allows you to negate the IN () one of two ways:
NOT field1 IN (value1, value2) -- for longer lists
NOT field1 = value1 AND NOT field1 = value2 -- for shorter lists
A sub-query can be thought of this way also. For example, this negated where clause:
NOT (table1.field1 = value1 AND EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2))
can be rewritten as:
NOT table1.field1 = value1 OR NOT EXISTS (SELECT * FROM table2 WHERE table1.field1 = table2.field2)
These laws do not tell you how to transform a SQL query using a subquery into one using a join, but boolean logic can help you understand join types and what your query should be returning. For example, with tables A and B, an INNER JOIN is like A AND B, a LEFT OUTER JOIN is like (A AND NOT B) OR (A AND B) which simplifies to A OR (A AND B), and a FULL OUTER JOIN is A OR (A AND B) OR B which simplifies to A OR B.
jOOQ supports pattern based transformation, which can be used in the online SQL parser and translator (look for the "patterns" dropdown), or as a parser CLI, or programmatically.
Since you're mainly looking for ways to turn your query into something simpler, not necessarily faster (which may depend on the target RDBMS), jOOQ could help you here.
Some examples include:
CASE to CASE abbreviation
-- Original
SELECT
CASE WHEN x IS NULL THEN y ELSE x END,
CASE WHEN x = y THEN NULL ELSE x END,
CASE WHEN x IS NOT NULL THEN y ELSE z END,
CASE WHEN x IS NULL THEN y ELSE z END,
CASE WHEN x = 1 THEN y WHEN x = 2 THEN z END
FROM tab;
-- Transformed
SELECT
NVL(x, y), -- If available in the target dialect, otherwise COALESCE
NULLIF(x, y),
NVL2(x, y, z), -- If available in the target dialect
NVL2(x, z, y), -- If available in the target dialect
CHOOSE(x, y, z) -- If available in the target dialect
FROM tab;
COUNT(*) scalar subquery comparison
-- Original
SELECT (SELECT COUNT(*) FROM tab) > 0;
-- Transformed
SELECT EXISTS (SELECT 1 FROM tab)
Flatten CASE
-- Original
SELECT
CASE
WHEN a = b THEN 1
ELSE CASE
WHEN c = d THEN 2
END
END
FROM tab;
-- Transformed
SELECT
CASE
WHEN a = b THEN 1
WHEN c = d THEN 2
END
FROM tab;
NOT AND (De Morgan's rules)
-- Original
SELECT
NOT (x = 1 AND y = 2),
NOT (x = 1 AND y = 2 AND z = 3)
FROM tab;
-- Transformed
SELECT
NOT (x = 1) OR NOT (y = 2),
NOT (x = 1) OR NOT (y = 2) OR NOT (z = 3)
FROM tab;
Unnecessary EXISTS subquery clauses
-- Original
SELECT EXISTS (SELECT DISTINCT a, b FROM t);
-- Transformed
SELECT EXISTS (SELECT 1 FROM t);
There's a lot more.
Disclaimer: I work for the company behind jOOQ
My approach is to learn relational theory in general and relational algebra in particular. Then learn to spot the constructs used in SQL to implement operators from the relational algebra (e.g. universal quantification a.k.a. division) and calculus (e.g. existential quantification). The gotcha is that SQL has features not found in the relational model e.g. nulls, which are probably best refactored away anyhow. Recommended reading: SQL and Relational Theory: How to Write Accurate SQL Code By C. J. Date.
In this vein, I'm not convinced "the fact that most SUBSELECTs can be rewritten as a JOIN" represents a simplification.
Take this query for example:
SELECT c
FROM T1
WHERE c NOT IN ( SELECT c FROM T2 );
Rewrite using JOIN
SELECT DISTINCT T1.c
FROM T1 NATURAL LEFT OUTER JOIN T2
WHERE T2.c IS NULL;
The join is more verbose!
Alternatively, recognize that the construct is implementing an antijoin on the projection of c, e.g. in pseudo-algebra:
T1 { c } antijoin T2 { c }
Simplification using relational operators:
SELECT c FROM T1 EXCEPT SELECT c FROM T2;
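One caveat when swapping NOT IN for EXCEPT: the two agree only up to duplicates and nulls, which ties in with the remark above about nulls not fitting the relational model. A minimal sketch, using SQLite via Python's sqlite3 module and made-up rows, shows both differences.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T1 (c INTEGER);
CREATE TABLE T2 (c INTEGER);
INSERT INTO T1 VALUES (1), (2), (2), (3);  -- 2 appears twice
INSERT INTO T2 VALUES (1);
""")

def column(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

NOT_IN_SQL = "SELECT c FROM T1 WHERE c NOT IN (SELECT c FROM T2) ORDER BY c"
EXCEPT_SQL = "SELECT c FROM T1 EXCEPT SELECT c FROM T2 ORDER BY c"

# Without nulls: NOT IN keeps duplicates, EXCEPT returns a set.
not_in_rows = column(NOT_IN_SQL)
except_rows = column(EXCEPT_SQL)

# With a NULL in T2: NOT IN returns nothing, EXCEPT is unchanged.
conn.execute("INSERT INTO T2 VALUES (NULL)")
not_in_rows_null = column(NOT_IN_SQL)
except_rows_null = column(EXCEPT_SQL)

print(not_in_rows, except_rows)
print(not_in_rows_null, except_rows_null)
```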