How do I avoid repeating this subquery for the IN clause?

I have an SQL script (currently running against SQLite, but it should probably work against any DB engine) that uses the same subquery twice, and since it might be fetching a lot of records (the table has a couple of million rows) I'd like to only call it once.
A shortened pseudo-version of the query looks like this:
SELECT * FROM
([the subquery, returns a column of ids]) AS sq
[a couple of joins, that fetches things from other tables based on the ids]
WHERE thisorthat NOT IN ([the subquery again])
I tried just using the name (sq) in various ways (with/without parentheses, with/without naming the column of sq, etc.) but to no avail.
Do I really have to repeat this subquery?
Clarification:
I am doing this in python and sqlite as a small demo of what can be done, but I would like my solution to scale as well as possible with as little modification as possible. In the real situation, the database will have a couple of million rows, but in my example there is just 10 rows with dummy data. Thus, code that would be well optimized on for example MySQL is absolutely good enough - it doesn't have to be optimized specifically for SQLite. But as I said, the less modification needed, the better.

There is a WITH clause in standard SQL; however, I don't know whether it is supported by SQLite, though it is of course worth a try:
WITH mySubQuery AS
(
[the subquery code]
)
SELECT * FROM
mySubQuery AS sq
[a couple of joins, that fetches things from other tables based on the ids]
WHERE thisorthat NOT IN (SELECT [the id column] FROM mySubQuery)
That said, what you do here will likely be horribly slow for any data set that is more than a few thousand rows, so I'd try to remodel it if possible - NOT IN should be avoided in general, especially if you also have a couple of joins.
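For what it's worth, SQLite has supported WITH since version 3.8.3, so the CTE route is viable there. A minimal sketch of the combined idea, replacing the NOT IN with a LEFT JOIN anti-join (one common remodel); the table and column names (big_table, items, other_id) are hypothetical, and the bracketed part is the question's placeholder:

WITH sq AS (
    SELECT id FROM big_table WHERE [the subquery's filter]
)
SELECT i.*
FROM sq
JOIN items i ON i.id = sq.id
-- Anti-join: keep rows whose other_id has no match in sq.
-- Equivalent to NOT IN only while the ids are never NULL.
LEFT JOIN sq AS excl ON excl.id = i.other_id
WHERE excl.id IS NULL;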

Do you need a subquery? You could probably rewrite using an OUTER JOIN e.g. something like:
SELECT *
FROM [the subquery's FROM clause] AS sq
RIGHT OUTER JOIN [a couple of tables based on the ids]
ON thisorthat = sq.[a column of ids]
WHERE sq.[a column of ids] IS NULL;

In general, I question the need to eliminate the duplication. The SQL compiler can see that two subqueries are identical and choose to only do them once if that seems optimal.
In addition, by leaving duplicates in the source, the SQL compiler and optimizer are given the opportunity to treat them differently. For example, the subquery flattening optimization of SQLite might be applied to one of a pair of duplicates, or applied differently to each. See section 9.0, "Subquery flattening", of https://www.sqlite.org/optoverview.html.
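If you want to verify what SQLite actually did with each copy of the subquery, EXPLAIN QUERY PLAN is the tool; a quick sketch using the question's placeholders (the output format varies between SQLite versions):

EXPLAIN QUERY PLAN
SELECT * FROM ([the subquery]) AS sq
[a couple of joins]
WHERE thisorthat NOT IN ([the subquery again]);

A flattened subquery simply disappears into the parent query's scan/search steps, while a materialized one shows up as its own step.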

You can put the SELECT part into a view; then you can filter the view's results using the alias "sq".
I hope it's helpful.
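A minimal sketch of the view approach; the bracketed parts are the question's placeholders, and note that a plain view is just a saved query, so most engines will still execute its body at each reference. The gain is avoiding repetition in the source, not necessarily avoiding repeated work:

CREATE VIEW sq AS
[the subquery code];

SELECT *
FROM sq
[a couple of joins, that fetch things from other tables based on the ids]
WHERE thisorthat NOT IN (SELECT id FROM sq);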

Oracle 12c Subquery Factoring Inline View now has bad plan?

Update 11/2
After some additional troubleshooting, my team was able to tie this Oracle bug directly to a parameter change that was made on the 12c database the night before the query stopped working. After experiencing some performance issues from an application tied to this database, my team had our DBA change the OPTIMIZER_FEATURES_ENABLE parameter from 12.1.0.2 to 11.2.0.4. This fixed the performance issue for the problem application but caused the bug I have described above. To verify, I've been able to replicate this same issue in a separate environment by changing this parameter. My DBA has filed a ticket with Oracle to have this looked at.
As a workaround, I was able to make a slight change to my query in order to retrieve the expected results. Specifically, I combined Subquery1 with Subquery2, and I moved a few predicates in Subquery1 from the WHERE clause to the JOIN (where they more properly belonged). This change altered my execution plan (it is slightly less efficient than what was listed before) but was enough to address the original issue.
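For readers wondering why predicate placement matters: with an outer join, the two positions are not equivalent. A generic sketch with hypothetical tables t1 and t2:

-- Filter in the WHERE clause: unmatched t1 rows are discarded
-- after the join, silently turning the outer join into an inner join.
SELECT t1.id, t2.val
FROM t1
LEFT JOIN t2 ON t2.t1_id = t1.id
WHERE t2.status = 'ACTIVE';

-- Filter in the ON clause: applied while matching, so unmatched
-- t1 rows survive with NULLs in the t2 columns.
SELECT t1.id, t2.val
FROM t1
LEFT JOIN t2 ON t2.t1_id = t1.id AND t2.status = 'ACTIVE';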
Original Post
Firstly, let me apologize for any vagueness in this question but I'm dealing with a confidential financial system so I am forced to hide certain implementation details.
Background
I have an Oracle query that I put into production a long time ago that has recently stopped producing expected results coincidentally after an upgrade from 11g to 12c. To my (and my production support team's) knowledge this query had been working fine for well over a year before that.
Details
The query is overly complicated and not very efficient but this is in large part because I am dealing with non-normalized tables (historically modeled after a Mainframe) and poor data input from upstream systems. In order to deal with a complicated business situation I leveraged multiple levels of Subquery Factoring (the WITH statement) and then my final statement joins together two Inline Views. The basic structure of the query without all of the complicated predicates is as follows:
I have 3 tables: Table1, Table2, and Table3. Table1 is a processing table made up of records from Table2.
--This grabs a subset from Table1
WITH Subquery1 as (
SELECT FROM Table1),
--This eliminates certain records from the first subset based on sister records
--from the original source table
Subquery2 as (
SELECT FROM Subquery1
WHERE NOT EXISTS (SELECT FROM Table2)),
--This ties the records from Subquery2 to Table3
Subquery3 as (
SELECT FROM Table3
JOIN (SELECT Max(Date) FROM Table3)
JOIN Subquery2)
--This final query evaluates subquery3 in two different ways and
--only takes those records which fit the criteria items from both sets
SELECT FROM
(SELECT FROM Subquery3) -- Call this Inline View A
JOIN (SELECT FROM Subquery3) -- Call this Inline View B
The final query is pretty basic:
SELECT A.Group_No, B.Sub_Group, B.Key, B.Lob
FROM (SELECT Group_No, Lob, COUNT(Sub_Group)
FROM Subquery3
GROUP BY Group_No, Lob
HAVING COUNT(Sub_Group) = 1) A
JOIN (SELECT Group_No, Sub_Group, Key, Lob
FROM Subquery3
WHERE Sub_Group LIKE '0000%') B
ON A.Group_No = B.Group_No
AND A.Lob = B.Lob
Problem
If I edit the final query to remove the second Inline View and evaluate the output of the A inline view, I come away with 0 returned rows. I've manually evaluated the records for each individual subquery and can confirm this is an expected result.
Likewise, if I edit the final query to produce the output of only the 'B' inline view, I come away with 6 returned rows. Again, I've manually evaluated the data and this is exactly as expected.
Now when joining these two subsets (Inline View A and Inline View B) together, I would expect that the final query result would be 0 rows (since an inner join between a full set and an empty set produces no matches). However, when I run the entire query with the inner join as described above, I am getting back 1158 rows!
I have reviewed the execution plan, but nothing jumps out at me.
Questions
Clearly I have done something to confuse the Oracle optimizer, and the updated query plan is producing a much different query than the one I submitted. My best guess is that with all of these temporary views floating around within the same query, I have confused Oracle into evaluating some set before one that it depends upon.
To this day I've been unable to locate the official Oracle documentation around the WITH statement, so I've never been completely confident about the order in which subqueries are evaluated. I did notice while searching SO (can't find it now) that someone mentioned a factored subquery cannot refer to another factored query. I've never known this to be true before, but the bizarre output above is making me wonder if I had only been lucky with this query before.
Can anyone explain the behavior I am seeing? Am I attempting to do something obviously incorrect with this query plan? Or alternatively, is there any chance that something changed between 11g and 12c that could explain why the behavior of this query might have changed?
This sounds like a "wrong results" bug in Oracle. These bugs are usually extremely specific to the version and the features you are using. There's nothing obviously wrong with the queries or execution plan you posted.
You have two ways of handling this:
Try to find the precise bug. What you're doing with common table expressions looks fine. There are some rare times when your query is technically invalid, you get "lucky" in one version and it works, and then it fails when you upgrade. But when that happens, the new version usually throws an error rather than returning wrong results. There's probably some extremely weird, specific combination of features you're using that's causing the issue. To find the real issue you need to massively simplify the query until you can make the smallest possible change and see the problem appear and disappear. You'll also want to remove all objects and only use DUAL (see the sketch after these two options). This process can take hours. At the end, when you're left with only a few lines of code, either post them here, look on Oracle Support, or create a Service Request.
Avoid the bug. Even if you go through the above steps there may not be a fix anyway. Sometimes the best work-around is to do something differently. It's nice to get to the bottom of every problem but you don't always have time. Instead, try re-writing the query in syntactically different but logically equivalent ways. Remove some or all of the common table expressions, maybe even repeat some SQL. But be sure to leave a comment warning future programmers of why you're doing things in a weird way.
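As a rough illustration of that simplification process, the end state is a script with no real tables at all, something shaped like the following (entirely hypothetical, built only on DUAL):

WITH sub1 AS (SELECT 1 AS n FROM DUAL),
     sub2 AS (SELECT n FROM sub1 WHERE n = 1)
SELECT a.n, b.n
FROM (SELECT n FROM sub2) a
JOIN (SELECT n FROM sub2) b ON a.n = b.n;

If a skeleton this small still reproduces the wrong row count when you toggle the suspect feature, it is ideal material for a Service Request.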

Are joins the proper way to do cross table queries?

I have a few tables in their third normal form and I need to do some cross table queries to get the information I need.
I looked at joins, but it seems like they will create a new table. Is this the proper way to perform such queries? Or should I just do nested queries? I guess it might make sense if I have to do these queries a lot? I'm really not sure how well optimized these operations are. I'm using the Sequelize ORM and I'm not sure I see any clear solution.
It seems to me you are asking about joins vs subqueries. These are to some extent different. But let's start with a couple of points.
A join creates a new relvar, not a new table. A relvar is a variable standing in for the relation output by the join operation. It is transient (as opposed to a view, which is persistent).
Joins and subqueries are not always perfect substitutes. Sometimes you will need both.
Your query output is also a relvar.
The above being said, generally where possible I think joins are preferable. The major reason is that a SQL query written using the structure below is far easier (as you master the language) to both understand and debug than most alternatives; additionally, subqueries in column lists are typically evaluated once per output row, so they tend to perform badly:
SELECT [column_list]
FROM [initial_table]
[join list]
WHERE [filters]
GROUP BY [grouping list]
HAVING [post-aggregation filters]
ORDER BY [ordering list]
LIMIT [limit and offset]
If your query fits the above structure then you can usually expect that specific kinds of problems will occur in logic in specific parts of the query. On the other hand, with subqueries, you have to check these independently.
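For instance, here is a concrete query following that shape, using hypothetical orders and customers tables:

SELECT c.name, COUNT(*) AS order_count      -- column list
FROM orders o                               -- initial table
JOIN customers c ON c.id = o.customer_id    -- join list
WHERE o.status = 'shipped'                  -- filters
GROUP BY c.name                             -- grouping list
HAVING COUNT(*) > 5                         -- post-aggregation filters
ORDER BY order_count DESC                   -- ordering list
LIMIT 10;                                   -- limit and offset

If the row set is wrong, look at the joins and WHERE; if the aggregates are wrong, look at the GROUP BY and HAVING.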

What are the pros/cons of using SQL variables versus subqueries?

I'm wondering whether there is a difference between SQL variables and subqueries: whether one uses more processing power, one is quicker, or one is merely more readable.
For (a very basic) example, I like to use variables to hold polygons and transformations in PostGIS:
WITH region_polygon AS (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
), raster_pixels AS (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
)
SELECT x, y
FROM raster_pixels a, region_polygon b
WHERE ST_Within(a.geom, b.geom)
But would it be better in any way to use subqueries?
SELECT x, y
FROM (
SELECT ST_Transform(wkb_geometry, %(fishnet_srid)d) geom
FROM regions
LIMIT 1
) a, (
SELECT (ST_PixelAsPolygons(rast)).*
FROM test_regions_raster
LIMIT 1
) b
WHERE ST_Within(a.geom, b.geom)
Note that I'm using PostgreSQL.
There's an important syntactic advantage of common table expressions over derived tables when it comes to reuse. Consider the following, equivalent examples using self-joins:
Using common table expressions
WITH a(v) AS (SELECT 1 UNION SELECT 2)
SELECT *
FROM a AS x, a AS y
Using derived tables
SELECT *
FROM (SELECT 1 UNION SELECT 2) x(v),
(SELECT 1 UNION SELECT 2) y(v)
As you can see, using common table expressions, the view (SELECT 1 UNION SELECT 2) can be reused multiple times in your query. With derived tables, you will have to repeat your view declaration. In my example, this is still OK. In your own example, this starts getting a bit more hairy.
It's all about scope
Views in SQL are all about scoping. There are essentially four levels of declaring views:
As derived tables. They can be consumed exactly once.
As common table expressions. They can be consumed several times, but only in one query.
As views. They can be consumed several times in several queries.
As materialized views. Same as views, but the data is pre-calculated.
Some databases (in particular PostgreSQL) also know table-valued functions. From a mere syntax perspective, they're just like views - parameterised views.
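For example, a minimal PostgreSQL sketch of such a parameterised view; the function, table, and columns are made up:

-- A "parameterised view": consumable like a view, but takes an argument.
CREATE FUNCTION active_users(min_age int)
RETURNS TABLE (id int, name text) AS $$
    SELECT u.id, u.name
    FROM users u
    WHERE u.active AND u.age >= min_age
$$ LANGUAGE sql;

-- Consumed just like a view, but with a parameter:
SELECT * FROM active_users(18);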
Performance
Note that these thoughts only focus on syntax, not query planning. The different approaches may have very different performance implications, depending on the database vendor.
Those aren't variables, they're common table expressions (CTEs). In your query above, the execution plans are likely identical, because the optimizer should recognize they are equivalent queries. I prefer to use CTEs because I think they're easier to read than subqueries, but that's it.
Edit: Upon further reading, it looks like PostgreSQL does treat common table expressions differently than other databases; you can't update a CTE in PostgreSQL, for instance. I'll leave my answer here because I believe for your query there won't be a difference, but I'm not terribly familiar with PostgreSQL.
As pointed out, this construct is called a common table expression (CTE), not a variable.
I prefer to use CTE, rather than subquery, because it is way easier to read and write for me, especially when you have several nested CTEs.
You can write CTE once and refer to it several times in the rest of the query. With subquery you'll have to repeat the code several times.
An important difference between PostgreSQL and other databases (at least MS SQL Server) is that PostgreSQL evaluates each CTE only once:
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries. Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work. Another possible application is to prevent unwanted multiple evaluations of functions with side-effects.
However, the other side of this coin is that the optimizer is less able to push restrictions from the parent query down into a WITH query than an ordinary sub-query. The WITH query will generally be evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)
MS SQL Server would inline each reference to the CTE into the main query and optimize the whole result, but PostgreSQL doesn't. In some sense PostgreSQL is more flexible here: if you want the subquery to be evaluated only once, put it in a CTE; if you don't, put it in a subquery and repeat the code. In SQL Server you'd have to use a temporary table explicitly.
Your example in the question is too simple and most likely both variants are equivalent - check the execution plan.
The official docs mention it, as I quoted above, but Nick Barnes gave a link to a good article explaining it in more detail, and I thought it worth putting in an answer rather than a comment:
When optimising queries in PostgreSQL (true at least in 9.4 and older), it's worth keeping in mind that, unlike newer versions of various other databases, PostgreSQL will always materialise a CTE term in a query.
This can have quite surprising effects for those used to working with DBs like MS SQL:
A query that should touch a small amount of data instead reads a whole table and possibly spills it to a tempfile; and you cannot UPDATE or DELETE FROM a CTE term, because it's more like a read-only temp table rather than a dynamic view.
So, there is no definite answer whether a CTE is better than a subquery in PostgreSQL. In some cases it can be faster, in some cases slower. But, IMHO, in most cases a CTE is easier to write, read and maintain.
And, obviously, there are cases where you have no option but to use a so-called recursive CTE (recursive queries are typically used to deal with hierarchical or tree-structured data).
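For illustration, a small hypothetical recursive CTE walking an employees table with a self-referencing manager_id column:

WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id
    FROM employees
    WHERE id = 1                 -- the root of the tree
    UNION ALL
    -- Repeatedly join back to the rows found so far
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;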

SQL Joins Vs SQL Subqueries (Performance)?

I wish to know if I have a join query something like this -
Select E.Id,E.Name from Employee E join Dept D on E.DeptId=D.Id
and a subquery something like this -
Select E.Id,E.Name from Employee E Where E.DeptId in (Select Id from Dept)
When I consider performance which of the two queries would be faster and why ?
Also is there a time when I should prefer one over the other?
Sorry if this is too trivial or has been asked before, but I am confused about it. Also, it would be great if you could suggest tools I should use to measure the performance of the two queries. Thanks a lot!
Well, I believe it's an "old but gold" question. The answer is: "It depends!".
Performance is such a delicate subject that it would be silly to say: "Never use subqueries, always join".
In the following links, you'll find some basic best practices that I have found to be very helpful:
Optimizing Subqueries
Optimizing Subqueries with Semijoin Transformations
Rewriting Subqueries as Joins
I have a table with 50,000 elements; the result I was looking for was 739 elements.
My query at first was this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND p.anno = (
SELECT MAX(p2.anno)
FROM prodotto p2
WHERE p2.fixedId = p.fixedId
)
and it took 7.9s to execute.
My query at last is this:
SELECT p.id,
p.fixedId,
p.azienda_id,
p.categoria_id,
p.linea,
p.tipo,
p.nome
FROM prodotto p
WHERE p.azienda_id = 2699 AND (p.fixedId, p.anno) IN
(
SELECT p2.fixedId, MAX(p2.anno)
FROM prodotto p2
WHERE p.azienda_id = p2.azienda_id
GROUP BY p2.fixedId
)
and it took 0.0256s
Good SQL, good.
I would EXPECT the first query to be quicker, mainly because you have an equivalence and an explicit JOIN. In my experience IN is a very slow operator, since SQL normally evaluates it as a series of WHERE clauses separated by "OR" (WHERE x=Y OR x=Z OR...).
As with ALL THINGS SQL though, your mileage may vary. The speed will depend a lot on indexes (do you have indexes on both ID columns? That will help a lot...) among other things.
The only REAL way to tell with 100% certainty which is faster is to turn on performance tracking (IO Statistics is especially useful) and run them both. Make sure to clear your cache between runs!
Performance depends on the amount of data you are executing on...
If it is a smaller data set, around 20k rows, JOIN works better.
If the data set is larger, like 100k+ rows, IN works better.
If you do not need the data from the other table, IN is good, but it is always better to go for EXISTS.
I tested all these criteria, and the tables had proper indexes.
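For reference, the EXISTS form of the query from the question would look roughly like this (an untested sketch using the question's Employee/Dept tables):

-- Semi-join: returns each employee at most once,
-- as soon as one matching department is found.
SELECT E.Id, E.Name
FROM Employee E
WHERE EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId);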
Start by looking at the execution plans to see the differences in how SQL Server will interpret them. You can also use Profiler to actually run the queries multiple times and get the difference.
I would not expect these two to be horribly different; where you can get real, large performance gains by using joins instead of subqueries is when you use correlated subqueries.
EXISTS is often better than either of these two, and when you want all the records from one table that are not in the joined table, NOT EXISTS is often a much better choice (see the sketch below).
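Using the question's tables, that NOT EXISTS anti-join and its LEFT JOIN equivalent would look roughly like this (hypothetical sketch):

-- Employees with no matching department
SELECT E.Id, E.Name
FROM Employee E
WHERE NOT EXISTS (SELECT 1 FROM Dept D WHERE D.Id = E.DeptId);

-- The same result via an outer join
SELECT E.Id, E.Name
FROM Employee E
LEFT JOIN Dept D ON D.Id = E.DeptId
WHERE D.Id IS NULL;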
The performance should be the same; it's much more important to have the correct indexes and clustering applied on your tables (there exist some good resources on that topic).
I know this is an old post, but I think this is a very important topic, especially nowadays when we have 10M+ records and talk about terabytes of data.
I will also weigh in with the following observations. I have about 45M records in my table ([data]), and about 300 records in my [cats] table. I have extensive indexing for all of the queries I am about to talk about.
Consider Example 1:
UPDATE d SET category = c.categoryname
FROM [data] d
JOIN [cats] c on c.id = d.catid
versus Example 2:
UPDATE d SET category = (SELECT TOP(1) c.categoryname FROM [cats] c WHERE c.id = d.catid)
FROM [data] d
Example 1 took about 23 mins to run. Example 2 took around 5 mins.
So I would conclude that the subquery in this case is much faster. Of course, keep in mind that I am using M.2 SSD drives capable of I/O at 1 GB/sec (that's bytes, not bits), so my indexes are really fast too. This may affect the speeds in your circumstances as well.
If it's a one-off data cleansing job, it's probably best to just let it run and finish. I use TOP(10000) and see how long it takes, then multiply by the number of records before I hit the big query.
If you are optimizing production databases, I would strongly suggest pre-processing data, i.e. using triggers or a job broker to update records asynchronously, so that real-time access retrieves static data.
The two queries may not be semantically equivalent. If an employee works for more than one department (possible in the enterprise I work for; admittedly, this would imply your table is not fully normalized) then the first query would return duplicate rows, whereas the second query would not. To make the queries equivalent in this case, the DISTINCT keyword would have to be added to the SELECT clause, which may have an impact on performance.
Note there is a design rule of thumb that states a table should model an entity/class or a relationship between entities/classes but not both. Therefore, I suggest you create a third table, say OrgChart, to model the relationship between employees and departments.
You can use an Explain Plan to get an objective answer.
For your problem, an EXISTS filter would probably perform the fastest.

Subqueries vs joins

I refactored a slow section of an application we inherited from another company to use an inner join instead of a subquery like:
WHERE id IN (SELECT id FROM ...)
The refactored query runs about 100x faster (~50 seconds down to ~0.3). I expected an improvement, but can anyone explain why it was so drastic? The columns used in the WHERE clause were all indexed. Does SQL execute the query in the WHERE clause once per row or something?
Update - Explain results:
The difference is in the second part of the "where id in ()" query:
2 | DEPENDENT SUBQUERY | submission_tags | ref | st_tag_id | st_tag_id | 4 | const | 2966 | Using where
versus one indexed row with the join:
SIMPLE | s | eq_ref | PRIMARY | PRIMARY | 4 | newsladder_production.st.submission_id | 1 | Using index
A "correlated subquery" (i.e., one in which the where condition depends on values obtained from the rows of the containing query) will execute once for each row. A non-correlated subquery (one in which the where condition is independent of the containing query) will execute once at the beginning. The SQL engine makes this distinction automatically.
But, yeah, explain-plan will give you the dirty details.
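To make the distinction concrete, two hypothetical variants (orders and customers are made-up tables):

-- Non-correlated: the inner query runs once and its result is reused
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'DE');

-- Correlated: the inner query references the outer row (o.customer_id),
-- so naively it runs once per row of orders
SELECT *
FROM orders o
WHERE o.total > (SELECT AVG(o2.total) FROM orders o2
                 WHERE o2.customer_id = o.customer_id);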
You are running the subquery once for every row whereas the join happens on indexes.
Here's an example of how subqueries are evaluated in MySQL 6.0.
The new optimizer will convert this kind of subquery into a join.
Before the queries are run against the dataset, they are put through a query optimizer. The optimizer attempts to organize the query in such a fashion that it can remove as many tuples (rows) from the result set as quickly as it can. Often, when you use subqueries (especially bad ones), the tuples can't be pruned out of the result set until the outer query starts to run.
Without seeing the query it's hard to say what was so bad about the original, but my guess would be that it was something the optimizer just couldn't make much better. Running EXPLAIN will show you the optimizer's method for retrieving the data.
Look at the query plan for each query.
WHERE IN and JOIN can typically be implemented using the same execution plan, so typically there is zero speed-up from changing between them.
The optimizer didn't do a very good job here. Usually the two forms can be transformed into each other without any difference, and the optimizer can do this.
This question is somewhat general, so here's a general answer:
Basically, queries take longer when MySQL has tons of rows to sort through.
Do this:
Run an EXPLAIN on each of the queries (the JOIN'ed one, then the Subqueried one), and post the results here.
I think seeing the difference in MySQL's interpretation of those queries would be a learning experience for everyone.
The WHERE subquery has to run one query for each returned row. The inner join just has to run one query.
Usually it's the result of the optimizer not being able to figure out that the subquery can be executed as a join, in which case it executes the subquery for each record in the table rather than joining the table in the subquery against the table you are querying. Some of the more "enterprisey" databases are better at this, but they still miss it sometimes.
With a subquery, you have to re-execute the 2nd SELECT for each result, and each execution typically returns 1 row.
With a join, the 2nd SELECT returns a lot more rows, but you only have to execute it once. The advantage is that now you can join on the results, and joining relations is what a database is supposed to be good at. For example, maybe the optimizer can spot how to take better advantage of an index now.
It isn't so much the subquery as the IN clause, although joins are at the foundation of at least Oracle's SQL engine and run extremely quickly.
The subquery was probably executing a "full table scan". In other words, it was not using the index and was returning way too many rows that the WHERE in the main query then had to filter out.
Just a guess without details of course but that's the common situation.
Taken from the MySQL Reference Manual (14.2.10.11, "Rewriting Subqueries as Joins"):
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone.
So subqueries can be slower than LEFT [OUTER] JOINs.
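The manual's rewrite pattern, sketched with hypothetical tables t1 and t2:

-- Subquery form
SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2);

-- Equivalent LEFT JOIN form (equivalent only while t2.id is never NULL;
-- NOT IN and the anti-join diverge once NULLs are involved)
SELECT t1.*
FROM t1
LEFT JOIN t2 ON t1.id = t2.id
WHERE t2.id IS NULL;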