SQL 'group by' equivalent query - sql

I've seen the following example:
Let T be a table with 2 columns - [id,value] (both int)
Then:
SELECT * FROM T
WHERE id=(SELECT MAX(id) FROM T t2 where T.value=t2.value);
is equivalent to:
SELECT MAX(id) FROM T GROUP BY value
What is going on behind the scene? How can we refer to T1.value?
What is the meaning of T1.value=t2.value?

#JuanCarlosOropeza is correct, your premise is false. Those are not equivalent queries. The second query should error out. But more to the point. The purpose of the WHERE clause in the subquery is to restrict the rows in the subquery to the id from the outer query.

For what's going on behind the scenes, use the explain plan, which provides information about how the optimizer decides to get the data your query asks for.

Related

Oracle how to avoid merge cartesian join in execution plan?

The following query is sometimes resulting in a merge cartesian join in the execution plan, we're trying to rewrite the query (in the simplest fashion) in order to ensure the merge cartesian join will not occur anymore.
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2)
After reviewing a similar question "Why would this query cause a Merge Cartesian Join in Oracle", the problem seems to be "Oracle doesn't know that (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2) returns one single result. So it's assuming that it will generate lots of rows."
Is there some way to tell the Oracle optimizer that the sub-select will only return one row?
Would using a function that returns a datetime value in place of the sub-select help, assuming that the optimizer would then know that the function can only return one value?
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > SCHEMA.FN_GET_DATE_VAL()
The Oracle DBA recommended using a WITH statement, which seems like it will work, but we were wondering if there were any shorter options.
with mx_dt as (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2)
SELECT COL1
FROM SCHEMA.VIEW_NAME1, mx_dt a
WHERE DATE_VAL > a.DATE_VAL
I wouldn't worry about the Cartesian join, because the subquery is only returning one row (at most). Otherwise, you would get a "subquery returns too many rows" error.
It is possible that Oracle runs that the subquery once for each comparison -- possible, but the Oracle optimizer is smart and I doubt that would happen. However, it is easy enough to phrase this as a JOIN:
SELECT n1.COL1
FROM SCHEMA.VIEW_NAME1 n1 JOIN
SCHEMA.VIEW_NAME2 n2
ON n1.DATE_VAL > n2.DATE_VAL;
However, it is possible that this execution plan is even worse, because you have not specified that n2 is only supposed to return (at most) one value.
An aggregate function in the sub-select ensures a single row is returned. Probably would be a good hint to the optimizer and if there is only 1 row in VIEW_NAME2 then the result of the sub-select is the same.
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT MIN(DATE_VAL) FROM SCHEMA.VIEW_NAME2)
Try adding WHERE ROWNUM >= 1:
SELECT COL1
FROM SCHEMA.VIEW_NAME1
WHERE DATE_VAL > (SELECT DATE_VAL FROM SCHEMA.VIEW_NAME2 WHERE ROWNUM >= 1)
That predicate looks totally extraneous, or the kind of thing Oracle would just ignore, but the ROWNUM pseudo-function is special. When Oracle sees it, it thinks "these rows must be returned in order, I better not do any query transformations". Which means that it won't try to push predicates, merge views, etc. Which means VIEW_NAME1 will be run separately from VIEW_NAME2 and they now both run just as fast as before.
You'll probably still see a Cartesian product in the explain plan but hopefully only near the top, between the two view result sets. If there really is only one row returned then a Cartesian product is probably the right operation.

SQL nesting on highschool test

We made this test on my school, and according to the test this is the right answer:
SELECT
names
FROM
COMPANY
WHERE
NOT EXISTS (SELECT
kgmilk
FROM
COWS
WHERE kgmilk < 1000 AND COMPANY.nr = COWS.nr)
Now my question is, can you actually do COMPANY.nr = COWS.nr in the nested query, since you only select one database in that query.
You didn't specify which kind of SQL this is.
If it's MS SQL Server, COMPANY.nr = COWS.nr is possible.
See this example.
Yes, it's called a correlated subquery and it will in effect get evaluated for each row in the main table (ignoring optimization).
The nested query - commonly called a subquery - is a correlated subquery, which means that it is evaluated for each row in the outer query. The correlation occurs when, within the subquery, you reference a table in the outer query. Here's an example of a plain subquery:
select
NAME
from
COMPANY
where
NAME not in (
select
NAME
from
COMPANY_BLACKLIST
)

Why is an IN statement with a list of items faster than an IN statement with a subquery?

I'm having the following situation:
I've got a quite complex view from which I've to select a couple of records.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (1000,1001,1002,1003,1004,[etc])
This returns a result practically instantly (currently with 25 items in that IN statement). However when I use the following query it slows down really fast.
SELECT * FROM VW_Test INNER JOIN TBL_Test ON VW_Test.id = TBL_Test.id
WHERE VW_Test.id IN (SELECT id FROM TBL_Test)
With 25 records in the TBL_Test this query takes about 5 seconds. I've got an index on that id in the TBL_Test.
Anyone got an idea why this happens and how to get performance up?
EDIT: I forgot to mention that this subquery
SELECT id FROM TBL_Test
returns a result instantly as well.
Well, when using a subquery the database engine will first have to generate the results for the subquery before it can do anything else, which takes time. If you have a predefined list, this will not need to happen and the engine can simply use those values 'as is'. At least, this is how I understand it.
How to improve performance: do away with the subquery. I don't think you even need the IN clause in this case. The INNER JOIN should suffice.

Sql Server query syntax

I need to perform a query like this:
SELECT *,
(SELECT Table1.Column
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 WHERE tmp = 1
I know I can take a workaround but I would like to know if this syntax is possible as it is (I think) in Mysql.
The query you posted won't work on sql server, because the sub query in your select clause could possibly return more than one row. I don't know how MySQL will treat it, but from what I'm reading MySQL will also yield an error if the sub query returns any duplicates. I do know that SQL Server won't even compile it.
The difference is that MySQL will at least attempt to run the query and if you're very lucky (Table2Id is unique in Table1) it will succeed. More probably is will return an error. SQL Server won't try to run it at all.
Here is a query that should run on either system, and won't cause an error if Table2Id is not unique in Table1. It will return "duplicate" rows in that case, where the only difference is the source of the Table1.Column value:
SELECT Table2.*, Table1.Column AS tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
Perhaps if you shared what you were trying to accomplish we could help you write a query that does it.
SELECT *
FROM (
SELECT t.*,
(
SELECT Table1.Column
FROM Table1
INNER JOIN
Table2
ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 t
) q
WHERE tmp = 1
This is valid syntax, but it will fail (both in MySQL and in SQL Server) if the subquery returns more than 1 row
What exactly are you trying to do?
Please provide some sample data and desired resultset.
I agree with Joel's solution but I want to discuss why your query would be a bad idea to use (even though the syntax is essentially valid). This is a correlated subquery. The first issue with these is that they don't work if the subquery could possibly return more than one value for a record. The second and more critical problem (in my mind) is that they must work row by row rather than on the set of data. This means they will virtually always affect performance. So correlated subqueries should almost never be used in a production system. In this simple case, the join Joel showed is the correct solution.
If the subquery is more complicated, you may want to turn it into a derived table instead (this also fixes the more than one value associated to a record problem). While a derived table looks a lot like a correlated subquery to the uninitated, it does not perform the same way because it acts on the set of data rather than row-by row and thus will often be significantly faster. You are essentially making the query a table in the join.
Below is an example of your query re-written as a derived table. (Of course in production code you would not use select * either especially in a join, spell out the fields you need)
SELECT *
FROM Table2 t2
JOIN
(SELECT Table1.[Column], Table1.Table2Id as tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id ) as t
ON t.Table2Id = Table2.Id
WHERE tmp = 1
You've already got a variety of answers, some of them more useful than others. But to answer your question directly:
No, SQL Server will not allow you to reference the column alias (defined in the select list) in the predicate (the WHERE clause). I think that is sufficient to answer the question you asked.
Additional details:
(this discussion goes beyond the original question you asked.)
As you noted, there are several workarounds available.
Most problematic with the query you posted (as others have already pointed out) is that we aren't guaranteed that the subquery in the SELECT list returns only one row. If it does return more than one row, SQL Server will throw a "too many rows" exception:
Subquery returned more than 1 value.
This is not permitted when the subquery
follows =, !=, , >= or when the
subquery is used as an expression.
For the following discussion, I'm going to assume that issue is already sufficiently addressed.
Sometimes, the easiest way to make the alias available in the predicate is to use an inline view.
SELECT v.*
FROM ( SELECT *
, (SELECT Table1.Column
FROM Table1
JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
) as tmp
FROM Table2
) v
WHERE v.tmp = 1
Note that SQL Server won't push the predicate for the outer query (WHERE v.tmp = 1) into the subquery in the inline view. So you need to push that in yourself, by including the WHERE Table1.Column = 1 predicate in the subquery, particularly if you're depending on that to make the subquery return only one value.
That's just one approach to working around the problem, there are others. I suspect that query plan for this SQL Server query is not going to be optimal, for performance, you probably want to go with a JOIN or an EXISTS predicate.
NOTE: I'm not an expert on using MySQL. I'm not all that familiar with MySQL support for subqueries. I do know (from painful experience) that subqueries weren't supported in MySQL 3.23, which made migrating an application from Oracle 8 to MySQL 3.23 particularly painful.
Oh and btw... of no interest to anyone in particular, the Teradata DBMS engine DOES have an extension that allows for the NAMED keyword in place of the AS keyword, and a NAMED expression CAN be referenced elsewhere in the QUERY, including the WHERE clause, the GROUP BY clause and the ORDER BY clause. Shuh-weeeet
That kind of syntax is basically valid (you need to move the where tmp=... to on outer "select * from (....)", though), although it's ambiguous since you have two sets named "Table2"- you should probably define aliases on at least one of your usages of that table to clear up the ambiguity.
Unless you intended that to return a column from table1 corresponding to columns in table2 ... in which case you might have wanted to simply join the tables?

How does the IN predicate work in SQL?

After prepairing an answer for this question I found I couldn't verify my answer.
In my first programming job I was told that a query within the IN () predicate gets executed for every row contained in the parent query, and therefore using IN should be avoided.
For example, given the query:
SELECT count(*) FROM Table1 WHERE Table1Id NOT IN (
SELECT Table1Id FROM Table2 WHERE id_user = 1)
Table1 Rows | # of "IN" executions
----------------------------------
10 | 10
100 | 100
1000 | 1000
10000 | 10000
Is this correct? How does the IN predicate actually work?
The warning you got about subqueries executing for each row is true -- for correlated subqueries.
SELECT COUNT(*) FROM Table1 a
WHERE a.Table1id NOT IN (
SELECT b.Table1Id FROM Table2 b WHERE b.id_user = a.id_user
);
Note that the subquery references the id_user column of the outer query. The value of id_user on each row of Table1 may be different. So the subquery's result will likely be different, depending on the current row in the outer query. The RDBMS must execute the subquery many times, once for each row in the outer query.
The example you tested is a non-correlated subquery. Most modern RDBMS optimizers worth their salt should be able to tell when the subquery's result doesn't depend on the values in each row of the outer query. In that case, the RDBMS runs the subquery a single time, caches its result, and uses it repeatedly for the predicate in the outer query.
PS: In SQL, IN() is called a "predicate," not a statement. A predicate is a part of the language that evaluates to either true or false, but cannot necessarily be executed independently as a statement. That is, you can't just run this as an SQL query: "2 IN (1,2,3);" Although this is a valid predicate, it's not a valid statement.
It will entirely depend on the database you're using, and the exact query.
Query optimisers are very smart at times - in your sample query, I'd expect the better databases to be able to use the same sort of techniques that they do with a join. More naive databases may just execute the same query many times.
This depends on the RDBMS in question.
See detailed analysis here:
MySQL, part 1
MySQL, part 2
SQL Server
Oracle
PostgreSQL
In short:
MySQL will optimize the query to this:
SELECT COUNT(*)
FROM Table1 t1
WHERE NOT EXISTS
(
SELECT 1
FROM Table2 t2
WHERE t2.id_user = 1
AND t2.Table1ID = t1.Table2ID
)
and run the inner subquery in a loop, using the index lookup each time.
SQL Server will use MERGE ANTI JOIN.
The inner subquery will not be "executed" in a common sense of word, instead, the results from both query and subquery will be fetched concurrently.
See the link above for detailed explanation.
Oracle will use HASH ANTI JOIN.
The inner subquery will be executed once, and a hash table will be built from the resultset.
The values from the outer query will be looked up in the hash table.
PostgreSQL will use NOT (HASHED SUBPLAN).
Much like Oracle.
Note that rewriting the query as this:
SELECT (
SELECT COUNT(*)
FROM Table1
) -
(
SELECT COUNT(*)
FROM Table2 t2
WHERE (t2.id_user, t2.Table1ID) IN
(
SELECT 1, Table1ID
FROM Table1
)
)
will greatly improve the performance in all four systems.
Depends on optimizer. Check exact query plan for each particular query to see how the RDBMS will actually execute that.
In Oracle that'd be:
EXPLAIN PLAN FOR «your query»
In MySQL or PostgreSQL
EXPLAIN «your query»
Most SQL engines nowadays will almost always create the same execution plan for LEFT JOIN, NOT IN and NOT EXISTS
I would say look at your execution plan and find out :-)
Also if you have NULL values for the Table1Id column you will not get any data back
Not really. But it's butter to write such queries using JOIN
Yes, but execution stops as soon as the query processer "finds" the value you are looking for... So if, for example the first row in the outer select has Table1Id = 32, and if Table2 has a record with a TableId = 32, then
as soon as the subquery finds the row in Table2 where TableId = 32, it stops...