BigQuery - using SQL UDF in join predicate

I'm trying to use a SQL UDF when running a left join, but get the following error:
Subquery in join predicate should only depend on exactly one join side.
Query is:
CREATE TEMPORARY FUNCTION game_match(game1 STRING,game2 STRING) AS (
strpos(game1,game2) >0
);
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2 on t1.gameId=t2.gameId and game_match(t1.gameId, t2.gameId)
When writing the condition inline instead of the function call (strpos(t1.gameId, t2.gameId) > 0), the query works.
Is there something problematic with this specific function, or is it that SQL UDFs in general aren't supported in join predicates (for some reason)?

You could file a feature request on the issue tracker to make this work. It's a limitation of query planning/optimization; for some background, BigQuery converts the function call so that the query's logical representation is like this:
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2
on t1.gameId=t2.gameId
and (SELECT strpos(game1,game2) > 0 FROM (SELECT t1.gameId AS game1, t2.gameId AS game2))
The reason that BigQuery transforms the SQL UDF call like this is that it needs to avoid computing the inputs more than once. While it's not an issue in this particular case, it makes a difference if you reference one of the inputs more than once in the UDF body, e.g. consider this UDF:
CREATE TEMP FUNCTION Foo(x FLOAT64) AS (x - x);
SELECT Foo(RAND());
If BigQuery were to inline the expression directly, you'd end up with this:
SELECT RAND() - RAND();
The result would not be zero, which is unexpected given the definition of the UDF.
In most cases, BigQuery's logical optimizations transform the more complicated subselect as shown above into a simpler form, assuming that doing so doesn't change the semantics of the query. That didn't happen in this case, though, hence the error.
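To make the RAND() point concrete, here is a minimal Python sketch (not BigQuery code; the helper names foo and foo_inlined are made up for illustration). It contrasts binding the argument once, as a function call does, with naive textual inlining, where the argument expression is evaluated twice:

```python
import random


def foo(x: float) -> float:
    """Analogue of the SQL UDF Foo(x) AS (x - x): x is evaluated once, then reused."""
    return x - x


def foo_inlined() -> float:
    """Analogue of naive inlining, i.e. RAND() - RAND(): two independent draws."""
    return random.random() - random.random()


random.seed(0)  # fixed seed only so the sketch is repeatable

bound_once = foo(random.random())
inlined = [foo_inlined() for _ in range(5)]

print("argument bound once:", bound_once)
print("argument inlined:   ", inlined)
```

The first value is always zero because the random draw happens once before the body runs. The inlined values are differences of two separate draws, so they are not zero, which is the behavior the subselect wrapping is there to prevent.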

Related

Possible to do VLOOKUP as a udf in SQL

Let's take the following:
The pseudo-sql for this would be:
SELECT
Name,
Age,
VLOOKUP(Name, OtherTable, Letter)
FROM
Table
I suppose the inefficient way to write this as a function would be something along the lines of the following scalar subselect:
VLOOKUP(TargetTable, TargetLookupField, TargetLookupValue, TargetReturnValue)
--> SELECT TargetReturnValue FROM TargetTable WHERE TargetLookupField=TargetLookupValue
Would a better way be doing a query-rewrite to pushdown the scalar subselect to become a join instead? Is that a common technique in query rewriting? And is BigQuery (or any other RDBMS) ever smart enough to detect that and be able to do that on its own?
The best approach is to use a join from the start, although the UDF will also work.
I've compared both approaches as an illustration:
1-Implementing VLOOKUP with a User Defined Function (Not recommended):
CREATE TEMP FUNCTION VLOOKUP_SCALAR(user string)
AS (( SELECT letter FROM `data_letter` WHERE name=user));
SELECT name, VLOOKUP_SCALAR(name) as vlookup FROM `data`;
1-Result (More steps, time and bytes shuffled):
2-Implementing VLOOKUP with a JOIN (Recommended):
SELECT t1.name, letter
from `data` AS t1
LEFT JOIN `data_letter` AS t2
ON t1.name = t2.name;
2-Result (Fewer steps, less time and fewer bytes shuffled):
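If you want to check locally that the two forms return the same rows, here is a small sketch using Python's built-in sqlite3 module. This is SQLite, not BigQuery, so it says nothing about the execution plans or bytes shuffled; the table names follow the queries above, while the age column and the sample rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (name TEXT, age INTEGER);
    CREATE TABLE data_letter (name TEXT, letter TEXT);
    -- hypothetical sample rows; 'Cy' deliberately has no match
    INSERT INTO data VALUES ('Ann', 31), ('Bob', 45), ('Cy', 27);
    INSERT INTO data_letter VALUES ('Ann', 'A'), ('Bob', 'B');
""")

# VLOOKUP written as a correlated scalar subselect (what the UDF body does)
scalar_rows = conn.execute("""
    SELECT d.name,
           (SELECT l.letter FROM data_letter AS l WHERE l.name = d.name) AS vlookup
    FROM data AS d
    ORDER BY d.name
""").fetchall()

# The same lookup written as a LEFT JOIN
join_rows = conn.execute("""
    SELECT d.name, l.letter AS vlookup
    FROM data AS d
    LEFT JOIN data_letter AS l ON d.name = l.name
    ORDER BY d.name
""").fetchall()

print(scalar_rows)
print(join_rows)
```

The two results match only because name is unique in data_letter. With duplicate keys the join would return extra rows, while a scalar subselect would either fail or pick one value, depending on the engine.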

Can I use WHERE clause after JOIN USING in snowflake?

Can I use WHERE after JOIN USING?
In my case, if I run the same code on Snowflake multiple times:
with CTE1 as
(
select *
from A
left join B
on A.date_a = B.date_b
)
select *
from CTE1
inner join C
using(var1_int)
where CTE1.date_a >= date('2020-10-01')
limit 1000;
Sometimes I get a result and sometimes I get the error:
SQL compilation error: Can not convert parameter 'DATE('2020-10-01')' of type [DATE] into expected type [NUMBER(38,0)]
where NUMBER(38,0) is the type of the var1_int column.
Your problem has nothing to do with the existence of a where clause. Of course you can use a where clause after joins. That is how SQL queries are constructed.
According to the error message, CTE1.date_a is a number. Comparing it to a date results in a type-conversion error. If you provided sample data and desired results, then it might be possible to suggest a way to fix the problem.
tl;dr: Instead of JOIN .. USING() always prefer JOIN .. ON.
You are right to be suspicious of the results. Given your staging, only one of these queries returns without errors:
select a.date_1, id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
using(id_1)
where A.date_1 >= date('2020-10-01')
-- Can not convert parameter 'DATE('2020-10-01')' of type
-- [DATE] into expected type [NUMBER(38,0)]
;
select a.date_1, a.id_1
from AE_USING_TST_A a
left join AE_USING_TST_B b
on a.date_1 = b.date_2
join AE_USING_TST_C v
on a.id_1=v.id_1
where A.date_1 >= date('2020-10-01')
-- 2020-10-11 2
;
I would call this a bug, except that the documentation is clear about not writing this kind of query with JOIN .. USING:
To use the USING clause properly, the projection list (the list of columns and other expressions after the SELECT keyword) should be “*”. This allows the server to return the key_column exactly once, which is the standard way to use the USING clause. For examples of standard and non-standard usage, see the examples below.
https://docs.snowflake.com/en/sql-reference/constructs/join.html
The documentation doubles down on the problems of using USING() in non-standard situations, with a different query acting "wrong":
The following example shows non-standard usage; the projection list contains something other than “*”. Because the usage is non-standard, the output contains two columns named “userid”, and the second occurrence (which you might expect to contain a value from table ‘r’) contains a value that is not in the table (the value ‘a’ is not in the table ‘r’).
So just prefer JOIN .. ON. For extra discussion on the SQL ANSI standard not defining behavior for some cases of USING() check:
https://community.snowflake.com/s/question/0D50Z00008WRZBBSA5/bug-with-join-using-

Why using a UDF in a SQL query leads to cartesian product?

I saw Databricks-Question and don't understand why using UDFs leads to a Cartesian product instead of a full outer join. Obviously the Cartesian product would produce a lot more rows than a full outer join (Joins is an example), which is a potential performance hit.
Is there any way to force an outer join over the Cartesian product in the example given in Databricks-Question?
Quoting the Databricks-Question here:
I have a Spark Streaming application that uses SQLContext to execute
SQL statements on streaming data. When I register a custom UDF in
Scala, the performance of the streaming application degrades
significantly. Details below:
Statement 1:
Select col1, col2 from table1 as t1 join table2 as t2 on t1.foo = t2.bar
Statement 2:
Select col1, col2 from table1 as t1 join table2 as t2 on equals(t1.foo,t2.bar)
I register a custom UDF using SQLContext as follows:
sqlc.udf.register("equals", (s1: String, s2:String) => s1 == s2)
On the same input and Spark configuration, Statement2 performance
significantly worse(close to 100X) compared to Statement1.
Why using UDFs leads to a Cartesian product instead of a full outer join?
The reason why using UDFs requires a Cartesian product is quite simple. Since you pass an arbitrary function with a possibly infinite domain and non-deterministic behavior, the only way to determine its value is to pass the arguments and evaluate it. That means you simply have to check all possible pairs.
Simple equality, on the other hand, has predictable behavior. If you use the t1.foo = t2.bar condition, you can simply shuffle t1 and t2 rows by foo and bar respectively to get the expected result.
And just to be precise, in relational algebra an outer join is actually expressed using a natural join. Anything beyond that is simply an optimization.
Any way to force an outer join over the Cartesian product
Not really, unless you want to modify Spark SQL engine.
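To illustrate the difference in work, here is a rough Python sketch (plain Python, not Spark; nested_loop_join and hash_equi_join are made-up helper names). An opaque predicate forces a check of every pair, while a known equality lets the engine bucket one side by key and probe it once per row of the other side:

```python
from collections import defaultdict


def nested_loop_join(left, right, predicate):
    """Join with an opaque predicate: every (l, r) pair must be evaluated."""
    calls = 0
    out = []
    for l in left:
        for r in right:
            calls += 1
            if predicate(l, r):
                out.append((l, r))
    return out, calls


def hash_equi_join(left, right):
    """Join on equality: bucket the right side by key, then probe once per left row."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r].append(r)
    probes = 0
    out = []
    for l in left:
        probes += 1
        for r in buckets.get(l, []):
            out.append((l, r))
    return out, probes


# hypothetical join keys: 200 rows per side, 100 of them overlapping
left = [f"k{i}" for i in range(200)]
right = [f"k{i}" for i in range(100, 300)]

udf_rows, udf_calls = nested_loop_join(left, right, lambda a, b: a == b)
eq_rows, eq_probes = hash_equi_join(left, right)

print("matches:", len(udf_rows), len(eq_rows))
print("predicate calls with opaque UDF:", udf_calls)
print("probes with equality join:", eq_probes)
```

Both joins return the same 100 matches, but the opaque predicate is called 200 * 200 = 40000 times versus 200 probes for the equality join. The gap grows with the product of the table sizes, which is consistent with the large slowdown reported in the question.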

In T-SQL is it possible to have a result set returned by a select with subselects that cannot be obtained using joins instead?

Given a SQL SELECT expression with arbitrarily nested subselects, is it always possible to rewrite said SQL expression so that it contains no subselects and returns the same result set?
If so, is there an algorithm for doing so?
If not, is there a characterization of those SELECT expressions that cannot be rewritten?
I'm making an application that will generate SQL SELECT statements. I'm still designing how it will work at this point. Here's the general idea, though:
The user will select what columns are displayed, how the results are sorted and how they are restricted.
The columns will not just be SQL columns but named objects such that the object can contain a SQL expression with column variables from multiple tables. These objects will contain information on how to join to each other.
I want to make the configuration of these expressions as flexible as possible; if it's possible to write a SELECT statement that returns some result set S, then I'd like the application to be able to generate a SELECT statement that returns S. One thing that's possible in SQL is sub-selects. I've read that rewriting sub-selects as JOINs is better performance-wise. Therefore I am considering disallowing sub-selects in the configuration. However, I do not want to do this unless every sub-select can be rewritten as a join.
Subselects in the WHERE clause can often be impossible to rewrite as JOIN, especially if aggregate functions are in use.
Quoted from here:
Here is an example of a common-form subquery comparison which you can't do with a join: find all the values in table t1 which are equal to a maximum value in table t2.
SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);
Here is another example, which again is impossible with a join because it involves aggregating for one of the tables: find all rows in table t1 which contain a value which occurs twice.
SELECT * FROM t1
WHERE 2 = (SELECT COUNT(column1) FROM t1);
Therefore, if a complex subselect in the SELECT clause itself has subselects in its WHERE clause, that could be impossible to express as a JOIN.
SELECT T2.B, (SELECT A from t1 where t1.ID=T2.ID
and 2=(SELECT COUNT(A) from t1 as TX WHERE TX.A=T1.A))
FROM T2
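To make the first quoted example concrete, here is a minimal sketch in Python with the standard sqlite3 module (SQLite rather than MySQL or SQL Server; the sample rows are made up). It runs the MAX subquery and, for comparison, a join against an aggregated derived table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (column1 INTEGER);
    CREATE TABLE t2 (column2 INTEGER);
    -- hypothetical sample rows
    INSERT INTO t1 VALUES (1), (5), (9), (9);
    INSERT INTO t2 VALUES (3), (9), (4);
""")

# The quoted form: scalar subquery with an aggregate in the WHERE clause
subquery_rows = conn.execute("""
    SELECT column1 FROM t1
    WHERE column1 = (SELECT MAX(column2) FROM t2)
""").fetchall()

# Join against a derived table that holds the aggregate
derived_join_rows = conn.execute("""
    SELECT t1.column1
    FROM t1
    JOIN (SELECT MAX(column2) AS max2 FROM t2) AS m
      ON t1.column1 = m.max2
""").fetchall()

print(subquery_rows)
print(derived_join_rows)
```

The second form returns the same rows here, but the derived table is itself a subselect in the FROM clause, so it does not remove subselects altogether. Whether that counts as "rewritten as a join" depends on what the generator is allowed to emit.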

Sql Server query syntax

I need to perform a query like this:
SELECT *,
(SELECT Table1.Column
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 WHERE tmp = 1
I know I can take a workaround but I would like to know if this syntax is possible as it is (I think) in Mysql.
The query you posted won't work on sql server, because the sub query in your select clause could possibly return more than one row. I don't know how MySQL will treat it, but from what I'm reading MySQL will also yield an error if the sub query returns any duplicates. I do know that SQL Server won't even compile it.
The difference is that MySQL will at least attempt to run the query, and if you're very lucky (Table2Id is unique in Table1) it will succeed. More probably it will return an error. SQL Server won't try to run it at all.
Here is a query that should run on either system, and won't cause an error if Table2Id is not unique in Table1. It will return "duplicate" rows in that case, where the only difference is the source of the Table1.Column value:
SELECT Table2.*, Table1.Column AS tmp
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
Perhaps if you shared what you were trying to accomplish we could help you write a query that does it.
SELECT *
FROM (
SELECT t.*,
(
SELECT Table1.Column
FROM Table1
INNER JOIN
Table2
ON Table1.Table2Id = Table2.Id
) as tmp
FROM Table2 t
) q
WHERE tmp = 1
This is valid syntax, but it will fail (both in MySQL and in SQL Server) if the subquery returns more than one row.
What exactly are you trying to do?
Please provide some sample data and desired resultset.
I agree with Joel's solution but I want to discuss why your query would be a bad idea to use (even though the syntax is essentially valid). This is a correlated subquery. The first issue with these is that they don't work if the subquery could possibly return more than one value for a record. The second and more critical problem (in my mind) is that they must work row by row rather than on the set of data. This means they will virtually always affect performance. So correlated subqueries should almost never be used in a production system. In this simple case, the join Joel showed is the correct solution.
If the subquery is more complicated, you may want to turn it into a derived table instead (this also fixes the problem of more than one value being associated with a record). While a derived table looks a lot like a correlated subquery to the uninitiated, it does not perform the same way, because it acts on the set of data rather than row by row and thus will often be significantly faster. You are essentially making the query a table in the join.
Below is an example of your query re-written as a derived table. (Of course in production code you would not use select * either especially in a join, spell out the fields you need)
SELECT *
FROM Table2 t2
JOIN
(SELECT Table1.[Column] AS tmp, Table1.Table2Id
FROM Table1
INNER JOIN Table2 ON Table1.Table2Id = Table2.Id ) AS t
ON t.Table2Id = t2.Id
WHERE t.tmp = 1
You've already got a variety of answers, some of them more useful than others. But to answer your question directly:
No, SQL Server will not allow you to reference the column alias (defined in the select list) in the predicate (the WHERE clause). I think that is sufficient to answer the question you asked.
Additional details:
(this discussion goes beyond the original question you asked.)
As you noted, there are several workarounds available.
Most problematic with the query you posted (as others have already pointed out) is that we aren't guaranteed that the subquery in the SELECT list returns only one row. If it does return more than one row, SQL Server will throw a "too many rows" exception:
Subquery returned more than 1 value.
This is not permitted when the subquery
follows =, !=, <, <= , >, >= or when the
subquery is used as an expression.
For the following discussion, I'm going to assume that issue is already sufficiently addressed.
Sometimes, the easiest way to make the alias available in the predicate is to use an inline view.
SELECT v.*
FROM ( SELECT *
, (SELECT Table1.Column
FROM Table1
JOIN Table2 ON Table1.Table2Id = Table2.Id
WHERE Table1.Column = 1
) as tmp
FROM Table2
) v
WHERE v.tmp = 1
Note that SQL Server won't push the predicate for the outer query (WHERE v.tmp = 1) into the subquery in the inline view. So you need to push that in yourself, by including the WHERE Table1.Column = 1 predicate in the subquery, particularly if you're depending on that to make the subquery return only one value.
That's just one approach to working around the problem; there are others. I suspect that the query plan for this SQL Server query is not going to be optimal. For performance, you probably want to go with a JOIN or an EXISTS predicate.
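Here is a small runnable sketch of the inline-view pattern, using Python's built-in sqlite3 module. Several things in it are assumptions: this is SQLite rather than SQL Server, the columns val and label and the sample rows are made up, and the subquery is correlated on Table2.Id so that it returns at most one value per row (Table2Id is unique in the sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table2 (Id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE Table1 (Table2Id INTEGER, val INTEGER);
    -- hypothetical sample rows; Table2Id is unique in Table1
    INSERT INTO Table2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO Table1 VALUES (1, 1), (2, 7), (3, 1);
""")

# The alias tmp is defined inside the inline view v,
# so the outer WHERE clause can refer to it as an ordinary column.
rows = conn.execute("""
    SELECT v.Id, v.label, v.tmp
    FROM (
        SELECT t2.Id,
               t2.label,
               (SELECT t1.val FROM Table1 AS t1 WHERE t1.Table2Id = t2.Id) AS tmp
        FROM Table2 AS t2
    ) AS v
    WHERE v.tmp = 1
    ORDER BY v.Id
""").fetchall()

print(rows)
```

Note that SQLite is more lenient than SQL Server about referencing select-list aliases in WHERE, so this only demonstrates that the inline-view form works, not that the original form fails.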
NOTE: I'm not an expert on using MySQL. I'm not all that familiar with MySQL support for subqueries. I do know (from painful experience) that subqueries weren't supported in MySQL 3.23, which made migrating an application from Oracle 8 to MySQL 3.23 particularly painful.
Oh and btw... of no interest to anyone in particular, the Teradata DBMS engine DOES have an extension that allows for the NAMED keyword in place of the AS keyword, and a NAMED expression CAN be referenced elsewhere in the QUERY, including the WHERE clause, the GROUP BY clause and the ORDER BY clause. Shuh-weeeet
That kind of syntax is basically valid (you need to move the where tmp=... to an outer "select * from (....)", though), although it's ambiguous since you have two sets named "Table2". You should probably define aliases on at least one of your usages of that table to clear up the ambiguity.
Unless you intended that to return a column from table1 corresponding to columns in table2 ... in which case you might have wanted to simply join the tables?