jOOQ - group by uses alias instead of columns - sql

I'm running jOOQ 3.13.6 in a Spring Boot environment on Oracle 11g.
To use listagg function, I'm trying the solution provided here: https://stackoverflow.com/a/69482329/17505774
Example code:
@Autowired
private DSLContext dsl;
// setting tables
Table table1 = DSL.table("table_1").as("t1");
Table table2 = DSL.table("table_2").as("t2");
// creating fields
final List<Field<?>> fields = new ArrayList<Field<?>>();
fields.add(DSL.field(DSL.name("t1", "t1_id")).as("id"));
Field cf = DSL.field(DSL.name("t2", "t2_code")).as("code");
fields.add(listAgg(cf, ";", null)); // no order
// running query
dsl.settings().withRenderQuotedNames(RenderQuotedNames.EXPLICIT_DEFAULT_UNQUOTED);
dsl.select(fields)
.from(table1)
.join(table2).on("t1.t1_id = t2.t1_id")
.where("t1.t1_id = ?", id)
.groupBy(fields)
However, it executes the following query:
select t1.t1_id id,
listagg(t2.t2_code) within group (null) as code
from table_1 t1
join table_2 t2 on ( t1.t1_id = t2.t2_id )
where t1.t1_id = 1
group by id
The group by should use the original column names (t1.t1_id), not the alias.

Note, for this answer, and for brevity reasons, I'm assuming you had been using the code generator. The answer is the same without code generation usage.
Why the current behaviour?
Note, this behaviour is also documented here in the manual.
In jOOQ, an aliased column expression T1.T1_ID.as("id") can only generate 2 different versions of itself:
T1.T1_ID as ID, i.e. the alias declaration (when inside of SELECT, at the top level)
ID, i.e. the alias reference (when inside of any other clause / expression than the SELECT clause)
There isn't a third type of generated SQL that depends on the location of where you embed the alias expression, e.g. the unaliased column expression T1.T1_ID when you put the expression in WHERE or GROUP BY, etc. The rationale is simple. What would a user expect when they write:
groupBy(T1.T1_ID.as("id"))
Why would they expect the as() call to be a no-op? That would be more surprising than the status quo.
Consistency with other rendering modes
There are other types of QueryPart in jOOQ, which have similar aliasing capabilities:
Field
Table
WindowSpecification
Parameter
CTE
Let's look at the CTE example:
Table<?> cte = name("cte").as(select(...))
That cte reference has 2 modes of rendering itself to SQL:
The CTE declaration (if placed in the WITH clause)
The CTE reference (if placed in FROM, etc.)
I don't think you'd expect that cte reference to ever ignore the aliasing, and just render the SELECT itself?
Likewise with table aliasing:
T1 x = T1.as("x");
This can render itself as:
The alias declaration (if placed in the FROM clause)
The alias reference
Because the FROM clause is logically before any other clauses, you'd never expect your x reference to render only T1, instead of x or T1 as x, right?
So, for consistency reasons across the jOOQ API, the Field aliases must also behave like all the others.
What to do instead?
Don't re-use the alias expression outside of your SELECT clause. Write the jOOQ SQL exactly as you'd write the actual SQL:
ctx.select(T1.T1_ID.as("id"), ...)
.from(T1)
.groupBy(T1.T1_ID)
...
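The two rendering modes described above can be mimicked with a toy model. This is only a sketch in Python, not jOOQ code: the classes `Column` and `AliasedColumn` and the helper `build_query` are hypothetical names invented for illustration. It shows why reusing the aliased expression in GROUP BY produces `group by id`, and why passing the unaliased column produces `group by t1.t1_id`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for an unaliased column such as t1.t1_id."""
    qualified_name: str

    def render(self, in_select: bool) -> str:
        # An unaliased column renders the same way in every clause.
        return self.qualified_name

    def alias(self, name: str) -> "AliasedColumn":
        return AliasedColumn(self, name)


@dataclass(frozen=True)
class AliasedColumn:
    """Hypothetical stand-in for an aliased column such as t1.t1_id.as("id")."""
    column: Column
    name: str

    def render(self, in_select: bool) -> str:
        if in_select:
            # Alias declaration: only inside the SELECT clause.
            return f"{self.column.qualified_name} {self.name}"
        # Alias reference: everywhere else. There is no third mode.
        return self.name


def build_query(select, table, group_by):
    select_sql = ", ".join(f.render(in_select=True) for f in select)
    group_sql = ", ".join(f.render(in_select=False) for f in group_by)
    return f"select {select_sql} from {table} group by {group_sql}"


t1_id = Column("t1.t1_id")
aliased = t1_id.alias("id")

# Reusing the aliased expression reproduces the behaviour from the question.
reused_alias = build_query([aliased], "table_1 t1", [aliased])
# Passing the unaliased column to GROUP BY gives the intended SQL.
unaliased_group_by = build_query([aliased], "table_1 t1", [t1_id])

print(reused_alias)
print(unaliased_group_by)
```

In the toy model, as in the advice above, the fix is to keep a reference to the unaliased column and use that one outside of SELECT.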

Related

Can I use WHERE clause after JOIN USING in snowflake?

Can I use WHERE after JOIN USING?
In my case, if I run the same code on Snowflake multiple times:
with CTE1 as
(
select *
from A
left join B
on A.date_a = B.date_b
)
select *
from CTE1
inner join C
using(var1_int)
where CTE1.date_a >= date('2020-10-01')
limit 1000;
sometimes I get a result and sometimes I get the error:
SQL compilation error: Can not convert parameter 'DATE('2020-10-01')' of type [DATE] into expected type [NUMBER(38,0)]
where NUMBER(38,0) is the type of var1_int column
Your problem has nothing to do with the existence of a where clause. Of course you can use a where clause after joins. That is how SQL queries are constructed.
According to the error message, CTE1.date_a is a number. Comparing it to a date results in a type-conversion error. If you provided sample data and desired results, then it might be possible to suggest a way to fix the problem.
tl;dr: Instead of JOIN .. USING() always prefer JOIN .. ON.
You are right to be suspicious of the results. Given your staging, only one of these queries returns without errors:
select a.date_1, id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
using(id_1)
where A.date_1 >= date('2020-10-01')
-- Can not convert parameter 'DATE('2020-10-01')' of type
-- [DATE] into expected type [NUMBER(38,0)]
;
select a.date_1, a.id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
on a.id_1=v.id_1
where A.date_1 >= date('2020-10-01')
-- 2020-10-11 2
;
I would call this a bug, except that the documentation is clear about not writing this kind of query with JOIN .. USING:
To use the USING clause properly, the projection list (the list of columns and other expressions after the SELECT keyword) should be “*”. This allows the server to return the key_column exactly once, which is the standard way to use the USING clause. For examples of standard and non-standard usage, see the examples below.
https://docs.snowflake.com/en/sql-reference/constructs/join.html
The documentation doubles down on the problems of using USING() in non-standard situations, with a different query acting "wrong":
The following example shows non-standard usage; the projection list contains something other than “*”. Because the usage is non-standard, the output contains two columns named “userid”, and the second occurrence (which you might expect to contain a value from table ‘r’) contains a value that is not in the table (the value ‘a’ is not in the table ‘r’).
So just prefer JOIN .. ON. For extra discussion on the SQL ANSI standard not defining behavior for some cases of USING() check:
https://community.snowflake.com/s/question/0D50Z00008WRZBBSA5/bug-with-join-using-
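The recommended query shape (JOIN .. ON with every column qualified) can be tried out locally. The sketch below uses SQLite through Python's `sqlite3` module, not Snowflake, so it cannot reproduce the Snowflake error; it only illustrates the shape of the query. The table names `a`, `b`, `c` and the sample rows are made-up stand-ins for the staging tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id_1 INTEGER, date_1 TEXT);
    CREATE TABLE b (date_2 TEXT);
    CREATE TABLE c (id_1 INTEGER);
    INSERT INTO a VALUES (1, '2020-09-30'), (2, '2020-10-11');
    INSERT INTO b VALUES ('2020-10-11');
    INSERT INTO c VALUES (1), (2);
""")

# JOIN .. ON with qualified columns: no ambiguity about which id_1 or
# which date column the projection and the WHERE clause refer to.
rows = conn.execute("""
    SELECT a.date_1, a.id_1
    FROM a
    LEFT JOIN b ON a.date_1 = b.date_2
    JOIN c ON a.id_1 = c.id_1
    WHERE a.date_1 >= '2020-10-01'
""").fetchall()

print(rows)
```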

BigQuery - using SQL UDF in join predicate

I'm trying to use a SQL UDF when running a left join, but get the following error:
Subquery in join predicate should only depend on exactly one join side.
Query is:
CREATE TEMPORARY FUNCTION game_match(game1 STRING,game2 STRING) AS (
strpos(game1,game2) >0
);
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2 on t1.gameId=t2.gameId and game_match(t1. gameId, t2.gameId)
When writing the condition inline, instead of the function call (strpos(t1. gameId, t2. gameId) >0), the query works.
Is there something problematic with this specific function, or is it that in general SQL UDF aren't supported in join predicate (for some reason)?
You could file a feature request on the issue tracker to make this work. It's a limitation of query planning/optimization; for some background, BigQuery converts the function call so that the query's logical representation is like this:
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2
on t1.gameId=t2.gameId
and (SELECT strpos(game1,game2) > 0 FROM (SELECT t1.gameId AS game1, t2.gameId AS game2))
The reason that BigQuery transforms the SQL UDF call like this is that it needs to avoid computing the inputs more than once. While it's not an issue in this particular case, it makes a difference if you reference one of the inputs more than once in the UDF body, e.g. consider this UDF:
CREATE TEMP FUNCTION Foo(x FLOAT64) AS (x - x);
SELECT Foo(RAND());
If BigQuery were to inline the expression directly, you'd end up with this:
SELECT RAND() - RAND();
The result would not be zero, which is unexpected given the definition of the UDF.
In most cases, BigQuery's logical optimizations transform the more complicated subselect as shown above into a simpler form, assuming that doing so doesn't change the semantics of the query. That didn't happen in this case, though, hence the error.
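The "compute the input once" point can be illustrated outside of BigQuery. The sketch below is plain Python, not BigQuery's planner: `make_source` is a hypothetical deterministic stand-in for `RAND()` (it returns 1.0, 2.0, 3.0, ... on successive calls) so that the difference between the two evaluation strategies is repeatable.

```python
import itertools


def make_source():
    """Stand-in for RAND(): each call yields a new value (1.0, 2.0, ...)."""
    counter = itertools.count(1)
    return lambda: float(next(counter))


def foo(x):
    # Mirrors the UDF body (x - x): the argument is bound once, used twice.
    return x - x


# Function-call semantics: the argument expression is evaluated once.
src = make_source()
evaluated_once = foo(src())

# Naive textual inlining: the argument expression is evaluated twice.
src = make_source()
naively_inlined = src() - src()

print(evaluated_once, naively_inlined)
```

The first result is zero regardless of the input, while the inlined form depends on two different draws, which is the discrepancy the subselect transformation is meant to avoid.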

SQL Correlated subquery

I am trying to execute this query but am getting ORA-00904:"QM"."MDL_MDL_ID":invalid identifier. What is more confusing to me is the main query has two sub queries which only differ in the where clause. However, the first query is running fine but getting error for the second one. Below is the query.
select (
select make_description
from make_colours@dblink1
where makc_id = (
select makc_makc_id
from model_colours@dblink1
where to_char(mdc_id) = md.allocate_vehicle_colour_id
)
) as colour,
(
select make_description
from make_colours@dblink1
where makc_id = (
select makc_makc_id
from model_colours@dblink1
where mdl_mdl_id = qm.mdl_mdl_id
)
) as vehicle_colour
from schema1.web_order wo,
schema1.tot_order tot,
suppliers@dblink1 sp,
external_accounts@dblink1 ea,
schema1.location_contact_detail lcd,
quotation_models@dblink1 qm,
schema1.manage_delivery md
where wo.reference_id = tot.reference_id
and sp.ea_c_id = ea.c_id
and sp.ea_account_type = ea.account_type
and sp.ea_account_code = ea.account_code
and lcd.delivery_det_id = tot.delivery_detail_id
and sp.sup_id = tot.dealer_id
and wo.qmd_id = qm.qmd_id
and wo.reference_id = md.web_reference_id(+)
and supplier_category = 'dealer'
and wo.order_type = 'tot'
and trunc(wo.confirmdeliverydate - 3) = trunc(sysdate)
Oracle usually doesn't recognise table aliases (or anything else) more than one level down in a nested subquery; from the documentation:
Oracle performs a correlated subquery when a nested subquery references a column from a table referred to a parent statement one level above the subquery. [...] A correlated subquery conceptually is evaluated once for each row processed by the parent statement.
Note the 'one level' part. So your qm alias isn't being recognised where it is, in the nested subquery, as it is two levels away from the definition of the qm alias. (The same thing would happen with the original table name if you hadn't aliased it - it isn't specifically to do with aliases).
When you modified your query to just have select qm.mdl_mdl_id as Vehicle_colour - or a valid version of that, maybe (select qm.mdl_mdl_id from dual) as Vehicle_colour - you removed the nesting, and the qm was now only one level down from its definition in the main body of the query, so it was recognised.
Your reference to md in the first nested subquery probably won't be recognised either, but the parser tends to sort of work backwards, so it's seeing the qm problem first; although it's possible a query rewrite would make it valid:
However, the optimizer may choose to rewrite the query as a join or use some other technique to formulate a query that is semantically equivalent.
You could also add hints to encourage that but it's better not to rely on that.
But you don't need nested subqueries, you can join inside each top level subquery:
select (
select mc2.make_description
from model_colours@dblink1 mc1,
make_colours@dblink1 mc2
where mc2.makc_id = mc1.makc_makc_id
and to_char(mc1.mdc_id) = md.allocate_vehicle_colour_id
) as colour,
(
select mc2.make_description
from model_colours@dblink1 mc1,
make_colours@dblink1 mc2
where mc2.makc_id = mc1.makc_makc_id
and mc1.mdl_mdl_id = qm.mdl_mdl_id
) as vehicle_colour
from schema1.web_order wo,
...
I've stuck with old-style join syntax to match the main query, but you should really consider rewriting the whole thing with modern ANSI join syntax. (I've also removed the rogue comma @Serg mentioned, but you may just have left out other columns in your real select list when posting the question.)
You could probably avoid subqueries altogether by joining to the make and model colour tables in the main query, either twice to handle the separate filter conditions, or once with a bit of logic in the column expressions. One step at a time though...
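The single-level form (a join inside each top-level subquery, correlated directly to the main query) can be exercised on a small data set. The sketch below uses SQLite through Python's `sqlite3` module, so there are no database links and SQLite's own scoping rules apply, not Oracle's; it only shows that the flattened subquery returns one colour per outer row. The sample rows are invented, and only the `vehicle_colour` column is reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quotation_models (qmd_id INTEGER, mdl_mdl_id INTEGER);
    CREATE TABLE model_colours (mdl_mdl_id INTEGER, makc_makc_id INTEGER);
    CREATE TABLE make_colours (makc_id INTEGER, make_description TEXT);
    INSERT INTO quotation_models VALUES (1, 100), (2, 200);
    INSERT INTO model_colours VALUES (100, 7), (200, 8);
    INSERT INTO make_colours VALUES (7, 'Red'), (8, 'Blue');
""")

# The reference to qm is exactly one level below the query that defines qm.
rows = conn.execute("""
    SELECT qm.qmd_id,
           (SELECT mc2.make_description
              FROM model_colours mc1
              JOIN make_colours mc2 ON mc2.makc_id = mc1.makc_makc_id
             WHERE mc1.mdl_mdl_id = qm.mdl_mdl_id) AS vehicle_colour
      FROM quotation_models qm
     ORDER BY qm.qmd_id
""").fetchall()

print(rows)
```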

Sql Server query syntax

I need to perform a query like this:
SELECT *,
(SELECT Table1.Column
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 WHERE tmp = 1
I know I can use a workaround, but I would like to know if this syntax is possible as it is (I think) in MySQL.
The query you posted won't work on SQL Server, because the subquery in your select clause could possibly return more than one row. I don't know how MySQL will treat it, but from what I'm reading MySQL will also yield an error if the subquery returns any duplicates. I do know that SQL Server won't even compile it.
The difference is that MySQL will at least attempt to run the query and, if you're very lucky (Table2Id is unique in Table1), it will succeed. More probably it will return an error. SQL Server won't try to run it at all.
Here is a query that should run on either system, and won't cause an error if Table2Id is not unique in Table1. It will return "duplicate" rows in that case, where the only difference is the source of the Table1.Column value:
SELECT Table2.*, Table1.Column AS tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
Perhaps if you shared what you were trying to accomplish we could help you write a query that does it.
SELECT *
FROM (
SELECT t.*,
(
SELECT Table1.Column
FROM Table1
INNER JOIN
Table2
ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 t
) q
WHERE tmp = 1
This is valid syntax, but it will fail (both in MySQL and in SQL Server) if the subquery returns more than 1 row
What exactly are you trying to do?
Please provide some sample data and desired resultset.
I agree with Joel's solution but I want to discuss why your query would be a bad idea to use (even though the syntax is essentially valid). This is a correlated subquery. The first issue with these is that they don't work if the subquery could possibly return more than one value for a record. The second and more critical problem (in my mind) is that they must work row by row rather than on the set of data. This means they will virtually always affect performance. So correlated subqueries should almost never be used in a production system. In this simple case, the join Joel showed is the correct solution.
If the subquery is more complicated, you may want to turn it into a derived table instead (this also fixes the problem of more than one value being associated with a record). While a derived table looks a lot like a correlated subquery to the uninitiated, it does not perform the same way, because it acts on the set of data rather than row by row and thus will often be significantly faster. You are essentially making the query a table in the join.
Below is an example of your query rewritten as a derived table. (Of course, in production code you would not use select * either, especially in a join; spell out the fields you need.)
SELECT *
FROM Table2 t2
JOIN
(SELECT Table1.[Column] as tmp, Table1.Table2Id
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id ) as t
ON t.Table2Id = t2.Id
WHERE t.tmp = 1
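To compare the approaches discussed in this thread side by side, here is a small sketch using SQLite through Python's `sqlite3` module. It is not SQL Server, the column is named `Col` because `Column` is a reserved word, and the sample rows are invented with `Table2Id` unique in `Table1` so the scalar subquery returns at most one value per row. It checks that wrapping the scalar subquery in a derived table and filtering outside gives the same rows as the plain join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table2 (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Table1 (Id INTEGER PRIMARY KEY, Table2Id INTEGER, Col INTEGER);
    INSERT INTO Table2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO Table1 VALUES (10, 1, 1), (11, 2, 5), (12, 3, 1);
""")

# Scalar subquery computed in a derived table, alias filtered one level up.
via_derived_table = conn.execute("""
    SELECT q.Id, q.tmp
    FROM (SELECT t2.Id,
                 (SELECT t1.Col FROM Table1 t1 WHERE t1.Table2Id = t2.Id) AS tmp
          FROM Table2 t2) q
    WHERE q.tmp = 1
    ORDER BY q.Id
""").fetchall()

# Plain join: filter on the underlying column instead of the alias.
via_join = conn.execute("""
    SELECT t2.Id, t1.Col AS tmp
    FROM Table2 t2
    JOIN Table1 t1 ON t1.Table2Id = t2.Id
    WHERE t1.Col = 1
    ORDER BY t2.Id
""").fetchall()

print(via_derived_table, via_join)
```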
You've already got a variety of answers, some of them more useful than others. But to answer your question directly:
No, SQL Server will not allow you to reference the column alias (defined in the select list) in the predicate (the WHERE clause). I think that is sufficient to answer the question you asked.
Additional details:
(this discussion goes beyond the original question you asked.)
As you noted, there are several workarounds available.
Most problematic with the query you posted (as others have already pointed out) is that we aren't guaranteed that the subquery in the SELECT list returns only one row. If it does return more than one row, SQL Server will throw a "too many rows" exception:
Subquery returned more than 1 value.
This is not permitted when the subquery
follows =, !=, <, <= , >, >= or when the
subquery is used as an expression.
For the following discussion, I'm going to assume that issue is already sufficiently addressed.
Sometimes, the easiest way to make the alias available in the predicate is to use an inline view.
SELECT v.*
FROM ( SELECT *
, (SELECT Table1.Column
FROM Table1
JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
) as tmp
FROM Table2
) v
WHERE v.tmp = 1
Note that SQL Server won't push the predicate for the outer query (WHERE v.tmp = 1) into the subquery in the inline view. So you need to push that in yourself, by including the WHERE Table1.Column = 1 predicate in the subquery, particularly if you're depending on that to make the subquery return only one value.
That's just one approach to working around the problem; there are others. I suspect that the query plan for this SQL Server query is not going to be optimal. For performance, you probably want to go with a JOIN or an EXISTS predicate.
NOTE: I'm not an expert on using MySQL. I'm not all that familiar with MySQL support for subqueries. I do know (from painful experience) that subqueries weren't supported in MySQL 3.23, which made migrating an application from Oracle 8 to MySQL 3.23 particularly painful.
Oh and btw... of no interest to anyone in particular, the Teradata DBMS engine DOES have an extension that allows for the NAMED keyword in place of the AS keyword, and a NAMED expression CAN be referenced elsewhere in the QUERY, including the WHERE clause, the GROUP BY clause and the ORDER BY clause. Shuh-weeeet
That kind of syntax is basically valid (you need to move the where tmp=... to an outer "select * from (....)", though), although it's ambiguous since you have two sets named "Table2". You should probably define aliases on at least one of your usages of that table to clear up the ambiguity.
Unless you intended that to return a column from table1 corresponding to columns in table2 ... in which case you might have wanted to simply join the tables?

SQL - table alias scope

I've just learned ( yesterday ) to use "exists" instead of "in".
BAD
select * from table where nameid in (
select nameid from othertable where otherdesc = 'SomeDesc' )
GOOD
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
And I have some questions about this:
1) The explanation as I understood was: "The reason why this is better is because only the matching values will be returned instead of building a massive list of possible results". Does that mean that while the first subquery might return 900 results the second will return only 1 ( yes or no )?
2) In the past I have had the RDBMS complaining: "only the first 1000 rows might be retrieved". Would this second approach solve that problem?
3) What is the scope of the alias in the second subquery? Does the alias only live inside the parentheses?
for example
select * from table t where exists (
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeDesc' )
AND
select nameid from othertable o where t.nameid = o.nameid and otherdesc = 'SomeOtherDesc' )
That is, if I use the same alias ( o for table othertable ) In the second "exist" will it present any problem with the first exists? or are they totally independent?
Is this specific to Oracle, or is it valid for most RDBMSs?
Thanks a lot
It's specific to each DBMS and depends on the query optimizer. Some optimizers detect IN clause and translate it.
In all DBMSes I tested, alias is only valid inside the ( )
BTW, you can rewrite the query as:
select t.*
from table t
join othertable o on t.nameid = o.nameid
and o.otherdesc in ('SomeDesc','SomeOtherDesc');
And, to answer your questions:
1) Yes
2) Yes
3) Yes
You are treading into complicated territory, known as 'correlated sub-queries'. Since we don't have detailed information about your tables and the key structures, some of the answers can only be 'maybe'.
In your initial IN query, the notation would be valid whether or not OtherTable contains a column NameID (and, indeed, whether OtherDesc exists as a column in Table or OtherTable - which is not clear in any of your examples, but presumably is a column of OtherTable). This behaviour is what makes a correlated sub-query into a correlated sub-query. It is also a routine source of angst for people when they first run into it - invariably by accident. Since the SQL standard mandates the behaviour of interpreting a name in the sub-query as referring to a column in the outer query if there is no column with the relevant name in the tables mentioned in the sub-query but there is a column with the relevant name in the tables mentioned in the outer (main) query, no product that wants to claim conformance to (this bit of) the SQL standard will do anything different.
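The name-resolution rule described above is easy to trip over, so here is a small demonstration. It is a sketch using SQLite through Python's `sqlite3` module; the table `main_table` stands in for the question's `table` (a reserved word), and `othertable` is deliberately created without a `nameid` column. The unqualified `nameid` inside the subquery then resolves to the outer table, and every outer row matches itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    -- Deliberately no nameid column here.
    CREATE TABLE othertable (otherid INTEGER, otherdesc TEXT);
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO othertable VALUES (99, 'SomeDesc');
""")

# The inner "nameid" is not found in othertable, so it refers to
# main_table.nameid: the subquery is silently correlated.
rows = conn.execute("""
    SELECT nameid
    FROM main_table
    WHERE nameid IN (SELECT nameid FROM othertable WHERE otherdesc = 'SomeDesc')
    ORDER BY nameid
""").fetchall()

print(rows)
```

Qualifying every column with its table alias (for example `o.nameid`) turns this silent behaviour into an error, which is usually what you want.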
The answer to your Q1 is "it depends", but given plausible assumptions (NameID exists as a column in both tables; OtherDesc only exists in OtherTable), the results should be the same in terms of the data set returned, but may not be equivalent in terms of performance.
The answer to your Q2 is that in the past, you were using an inferior if not defective DBMS. If it supported EXISTS, then the DBMS might still complain about the cardinality of the result.
The answer to your Q3 as applied to the first EXISTS query is "t is available as an alias throughout the statement, but o is only available as an alias inside the parentheses". As applied to your second example box - with AND connecting two sub-selects (the second of which is missing the open parenthesis when I'm looking at it), then "t is available as an alias throughout the statement and refers to the same table, but there are two different aliases both labelled 'o', one for each sub-query". Note that the query might return no data if OtherDesc is unique for a given NameID value in OtherTable; otherwise, it requires two rows in OtherTable with the same NameID and the two OtherDesc values for each row in Table with that NameID value.
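The alias scoping described above can be checked with a small sketch, again using SQLite through Python's `sqlite3` module with invented sample rows (`main_table` stands in for the question's `table`). Both subqueries declare their own alias `o`; the two do not interfere, while `t` is visible in both. The second subquery is written with its own EXISTS and opening parenthesis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    CREATE TABLE othertable (nameid INTEGER, otherdesc TEXT);
    INSERT INTO main_table VALUES (1), (2), (3);
    INSERT INTO othertable VALUES
        (1, 'SomeDesc'), (1, 'SomeOtherDesc'), (2, 'SomeDesc');
""")

# Each "o" lives only inside its own parentheses; "t" is shared.
rows = conn.execute("""
    SELECT t.nameid
    FROM main_table t
    WHERE EXISTS (SELECT 1 FROM othertable o
                  WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeDesc')
      AND EXISTS (SELECT 1 FROM othertable o
                  WHERE o.nameid = t.nameid AND o.otherdesc = 'SomeOtherDesc')
    ORDER BY t.nameid
""").fetchall()

print(rows)
```

Only the `nameid` that has both descriptions in `othertable` is returned, matching the remark about needing two rows per `nameid`.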
Oracle-specific: When you write a query using the IN clause, you're telling the rule-based optimizer that you want the inner query to drive the outer query. When you write EXISTS in a where clause, you're telling the optimizer that you want the outer query to be run first, using each value to fetch a value from the inner query. See "Difference between IN and EXISTS in subqueries".
Probably.
Alias declared inside subquery lives inside subquery. By the way, I don't think your example with 2 ANDed subqueries is valid SQL. Did you mean UNION instead of AND?
Personally I would use a join, rather than a subquery for this.
SELECT t.*
FROM yourTable t
INNER JOIN otherTable ot
ON (t.nameid = ot.nameid AND ot.otherdesc = 'SomeDesc')
It is difficult to generalize that EXISTS is always better than IN. Logically, if that were the case, then the SQL community would have replaced IN with EXISTS...
Also, please note that IN and EXISTS are not same, the results may be different when you use the two...
With IN, usually it's a Full Table Scan of the inner table once without removing NULLs (so if you have NULLs in your inner table, IN will not remove NULLS by default)... While EXISTS removes NULL and in case of correlated subquery, it runs inner query for every row from outer query.
Assuming there are no NULLS and it's a simple query (with no correlation), EXISTS might perform better if the row you are finding is not the last row. If it happens to be the last row, EXISTS may need to scan till the end like IN.. so similar performance...
But IN and EXISTS are not interchangeable...
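One concrete case where NULLs make the two constructs diverge is the negated form. The sketch below uses SQLite through Python's `sqlite3` module with invented rows (`main_table` stands in for the question's `table`); it is an illustration of standard three-valued logic, not a statement about any particular optimizer. With a NULL in the inner table, the positive IN and EXISTS queries agree, but NOT IN returns nothing while NOT EXISTS still returns the unmatched row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (nameid INTEGER);
    CREATE TABLE othertable (nameid INTEGER);
    INSERT INTO main_table VALUES (1), (2);
    INSERT INTO othertable VALUES (1), (NULL);
""")


def run(sql):
    return conn.execute(sql).fetchall()


in_rows = run("""
    SELECT nameid FROM main_table
    WHERE nameid IN (SELECT nameid FROM othertable)""")
exists_rows = run("""
    SELECT t.nameid FROM main_table t
    WHERE EXISTS (SELECT 1 FROM othertable o WHERE o.nameid = t.nameid)""")

# 2 NOT IN (1, NULL) evaluates to NULL, so the row is filtered out.
not_in_rows = run("""
    SELECT nameid FROM main_table
    WHERE nameid NOT IN (SELECT nameid FROM othertable)""")
# NOT EXISTS only asks whether a matching row was found.
not_exists_rows = run("""
    SELECT t.nameid FROM main_table t
    WHERE NOT EXISTS (SELECT 1 FROM othertable o WHERE o.nameid = t.nameid)""")

print(in_rows, exists_rows, not_in_rows, not_exists_rows)
```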