Is it possible to do VLOOKUP as a UDF in SQL?

Let's take the following:
The pseudo-sql for this would be:
SELECT
Name,
Age,
VLOOKUP(Name, OtherTable, Letter)
FROM
Table
I suppose the inefficient way to write this as a function would be something along the lines of the following scalar subselect:
VLOOKUP(TargetTable, TargetLookupField, TargetLookupValue, TargetReturnValue)
--> SELECT TargetReturnValue FROM TargetTable WHERE TargetLookupField=TargetLookupValue
Would a better way be to rewrite the query so that the scalar subselect is pushed down and becomes a join instead? Is that a common technique in query rewriting? And is BigQuery (or any other RDBMS) ever smart enough to detect that and do it on its own?

The best approach is to use a join from the start, although the UDF will also work.
I've compared both ways as an illustration:
1-Implementing VLOOKUP with a User Defined Function (Not recommended):
CREATE TEMP FUNCTION VLOOKUP_SCALAR(user string)
AS (( SELECT letter FROM `data_letter` WHERE name=user));
SELECT name, VLOOKUP_SCALAR(name) as vlookup FROM `data`;
1-Result (More steps, time and bytes shuffled):
2-Implementing VLOOKUP with a JOIN (Recommended):
SELECT t1.name, letter
from `data` AS t1
LEFT JOIN `data_letter` AS t2
ON t1.name = t2.name;
2-Result (Fewer steps, less time and fewer bytes shuffled):
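To check that the two formulations return the same rows, here is a minimal sketch that runs both against SQLite from Python. SQLite only stands in for BigQuery here, and the sample rows are hypothetical, since the original tables were shown only as an image.

```python
import sqlite3

# Hypothetical sample data; the table names follow the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (name TEXT, age INTEGER);
    CREATE TABLE data_letter (name TEXT, letter TEXT);
    INSERT INTO data VALUES ('Ann', 31), ('Bob', 45), ('Cid', 27);
    INSERT INTO data_letter VALUES ('Ann', 'A'), ('Bob', 'B');
""")

# VLOOKUP written as a correlated scalar subselect in the SELECT list.
scalar_rows = conn.execute("""
    SELECT d.name,
           (SELECT l.letter FROM data_letter AS l WHERE l.name = d.name) AS letter
    FROM data AS d
    ORDER BY d.name
""").fetchall()

# The same lookup written as a LEFT JOIN.
join_rows = conn.execute("""
    SELECT d.name, l.letter
    FROM data AS d
    LEFT JOIN data_letter AS l ON l.name = d.name
    ORDER BY d.name
""").fetchall()

print(scalar_rows)
print(join_rows)
```

The two results only match when the lookup key is unique in the lookup table. With duplicate keys, the join returns one row per match, while a scalar subselect is expected to yield a single value.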

Related

BigQuery - using SQL UDF in join predicate

I'm trying to use a SQL UDF when running a left join, but get the following error:
Subquery in join predicate should only depend on exactly one join side.
Query is:
CREATE TEMPORARY FUNCTION game_match(game1 STRING,game2 STRING) AS (
strpos(game1,game2) >0
);
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2 on t1.gameId=t2.gameId and game_match(t1.gameId, t2.gameId)
When the condition is written inline instead of as a function call (strpos(t1.gameId, t2.gameId) > 0), the query works.
Is there something problematic with this specific function, or are SQL UDFs in general not supported in join predicates (for some reason)?
You could file a feature request on the issue tracker to make this work. It's a limitation of query planning/optimization; for some background, BigQuery converts the function call so that the query's logical representation is like this:
SELECT
t1.gameId
FROM `bigquery-public-data.baseball.games_post_wide` t1
left join `bigquery-public-data.baseball.games_post_wide` t2
on t1.gameId=t2.gameId
and (SELECT strpos(game1,game2) > 0 FROM (SELECT t1.gameId AS game1, t2.gameId AS game2))
The reason that BigQuery transforms the SQL UDF call like this is that it needs to avoid computing the inputs more than once. While it's not an issue in this particular case, it makes a difference if you reference one of the inputs more than once in the UDF body, e.g. consider this UDF:
CREATE TEMP FUNCTION Foo(x FLOAT64) AS (x - x);
SELECT Foo(RAND());
If BigQuery were to inline the expression directly, you'd end up with this:
SELECT RAND() - RAND();
The result would not be zero, which is unexpected given the definition of the UDF.
In most cases, BigQuery's logical optimizations transform the more complicated subselect as shown above into a simpler form, assuming that doing so doesn't change the semantics of the query. That didn't happen in this case, though, hence the error.
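The double-evaluation hazard described above is not specific to SQL. As a rough analogue (ordinary Python, not BigQuery's planner), the sketch below contrasts a function call, where the argument is evaluated once, with naive textual inlining, where the random expression is evaluated twice. The function name foo mirrors the UDF in the answer.

```python
import random

def foo(x: float) -> float:
    # The argument is evaluated once before the call,
    # so both uses of x see the same value.
    return x - x

# Function-call semantics: one random draw, subtracted from itself.
once = foo(random.random())

# Naive inlining: two independent random draws.
inlined = random.random() - random.random()

print(once, inlined)
```

The first value is always 0.0. The second is almost never zero, which is the behavior the subquery-based rewrite is meant to prevent.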

How to speed up a SQL query which has an IN clause that contains 4,000 elements?

I am using python to generate a query text which I then send to the SQL server. The query is created in a function that accepts a list of strings which are then inserted into the query.
The query looks like:
SELECT *
FROM DB
WHERE last_word in ('red', 'phone', 'robin')
The issue is that here I have just 3 words, red, phone, and robin, but in another use case I have over 4,000 words and the response takes about 2 hours. How can I rewrite this query to make it more performant?
optimization strategies:
add an index on last_word
CREATE INDEX ON db(last_word)
store the filter words in a table and use a WHERE exists (or inner join)
WITH words (word) AS (
VALUES ('red'), ('phone'), ('robin')
)
SELECT *
FROM db
WHERE EXISTS (SELECT TRUE FROM words WHERE word = last_word)
or
WITH words (word) AS (
VALUES ('red'), ('phone'), ('robin')
)
SELECT db.*
FROM db
JOIN words ON db.last_word = words.word
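The WHERE EXISTS version may be slightly faster than the JOIN, since it can stop at the first matching word, and it never duplicates rows if the word list contains repeats. Check the execution plan to confirm which one your database prefers.
How many rows do you have in "DB"? Do more of the "last_word" values match the 4,000 words in the IN clause than not? If so, it may be cheaper to filter with NOT IN on the (shorter) list of words to exclude. Also, avoid SELECT * where you can: it returns every column, which adds I/O and can prevent the use of a covering index, so explicitly list the columns you want to include in your query.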
You could also try to put the 4000 words to match on in a (temporary) table or a CTE and then join on it, since joins usually work better than large loads of data within the IN clause. With this, I still recommend to not use the wildcard in the SELECT statement.
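Since the query text is generated from Python, the temp-table approach could look roughly like the sketch below. It uses the standard-library sqlite3 module as a stand-in for the real server; the helper rows_matching, the table filter_words and the sample rows are hypothetical names chosen for illustration.

```python
import sqlite3

def rows_matching(conn, words):
    """Return (id, last_word) rows of db whose last_word is in words."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS filter_words (word TEXT PRIMARY KEY)"
    )
    conn.execute("DELETE FROM filter_words")
    # Parameterized bulk insert; OR IGNORE silently drops duplicate words.
    conn.executemany(
        "INSERT OR IGNORE INTO filter_words (word) VALUES (?)",
        [(w,) for w in words],
    )
    return conn.execute("""
        SELECT db.id, db.last_word
        FROM db
        JOIN filter_words AS f ON f.word = db.last_word
        ORDER BY db.id
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE db (id INTEGER PRIMARY KEY, last_word TEXT);
    CREATE INDEX idx_db_last_word ON db (last_word);
    INSERT INTO db (last_word)
    VALUES ('red'), ('blue'), ('phone'), ('robin'), ('red');
""")

result = rows_matching(conn, ["red", "phone", "robin", "red"])
print(result)
```

Passing the words as bound parameters also avoids building a 4,000-element string literal list by hand. The syntax for temp tables and bulk inserts differs between database systems, so treat this as the shape of the solution rather than a drop-in implementation.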
Put your data into a temp table or CTE, which also makes it easier to add new words later. Then do an inner join from your source table to that list so that you capture every matching row.
Hope this helps.
Try doing something like this:
SELECT *
FROM DB INNER JOIN WORDS_TABLE
ON DB.last_word = WORDS_TABLE.WORDS;
Instead of the * use whatever columns you want to get.
With the words stored in a table, JOIN should be faster than IN here, because IN would require writing another inner query against that table.

Hive Query: defining a variable which is a list of strings

How can I create a constant list and use it in the WHERE clause of my query?
For example, I have a hive query, where I say
Select t1.Id,
t1.symptom
from t1
WHERE lower(symptom) NOT IN ('coughing','sneezing','xyz', etc,...)
Instead of keep repeating this long list of symptoms (which makes the code very ugly), is there a way to define it ahead of time as
MyList = ('coughing','sneezing','x',...)
and then in WHERE clause I'd just say WHERE lower(symptom) not in MyList.
You can put the list in a table and filter with a subquery:
Select t1.Id, t1.symptom
from t1
where lower(symptom) NOT IN (select symptom from mysymptoms_list);
This persists the list, so it can be used in multiple queries.
You can use hive variable to do this.
SET hivevar:InClause=('coughing','sneezing','x',...)
Make sure you don't leave spaces either side of equals.
SELECT t1.Id,
t1.symptom
FROM t1
WHERE LOWER(symptom) NOT IN ${InClause}
If you are comfortable with joins, you can use a left join with a where clause:
Select A.Id, A.symptom
from
t1 A left join MyList B
on
lower(A.symptom) = lower(B.symptom)
where B.symptom IS NULL;
The left join retains every row of t1 (A.symptom). For each of those rows, B.symptom from the table MyList equals the symptom in t1 if a match is found, and is NULL if a match is not found.
You want those where a match is not found, hence the where clause.
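To illustrate that the left-join form and the NOT IN form select the same rows, here is a small sketch run against SQLite from Python (standing in for Hive). The sample rows are made up, and MyList is assumed to be a one-column table of symptoms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (Id INTEGER, symptom TEXT);
    CREATE TABLE MyList (symptom TEXT);
    INSERT INTO t1 VALUES (1, 'Coughing'), (2, 'fever'), (3, 'sneezing'), (4, 'rash');
    INSERT INTO MyList VALUES ('coughing'), ('sneezing');
""")

# Anti-join: keep rows of t1 that found no partner in MyList.
anti_join = conn.execute("""
    SELECT A.Id, A.symptom
    FROM t1 AS A
    LEFT JOIN MyList AS B ON lower(A.symptom) = lower(B.symptom)
    WHERE B.symptom IS NULL
    ORDER BY A.Id
""").fetchall()

# The same filter written with NOT IN and a subquery.
not_in = conn.execute("""
    SELECT Id, symptom
    FROM t1
    WHERE lower(symptom) NOT IN (SELECT lower(symptom) FROM MyList)
    ORDER BY Id
""").fetchall()

print(anti_join)
print(not_in)
```

One difference to keep in mind: under standard SQL semantics, if the list table contains a NULL, the NOT IN form returns no rows at all, whereas the left-join form is unaffected.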

In T-SQL is it possible to have a result set returned by a select with subselects that cannot be obtained using joins instead?

Given a SQL SELECT expression with arbitrarily nested subselects, is it always possible to rewrite said SQL expression so that it contains no subselects and returns the same result set?
If so, is there an algorithm for doing so?
If not, is there a characterization of those SELECT expressions that cannot be rewritten?
I'm making an application that will generate SQL SELECT statements. I'm still designing how it will work at this point. Here's the general idea, though:
The user will select what columns are displayed, how the results are sorted and how they are restricted.
The columns will not just be SQL columns but named objects such that the object can contain a SQL expression with column variables from multiple tables. These objects will contain information on how to join to each other.
I want to make the configuration of these expressions as flexible as possible; if it's possible to write a SELECT statement that returns some result set S, then I'd like the application to be able to generate a SELECT statement that returns S. One thing that's possible in SQL is the sub-select. I've read that rewriting sub-selects as JOINs is better performance-wise. Therefore I am considering disallowing sub-selects in the configuration. However, I do not want to do this unless every sub-select can be rewritten as a join.
Subselects in the WHERE clause can often be impossible to rewrite as JOIN, especially if aggregate functions are in use.
Quoted from here:
Here is an example of a common-form subquery comparison which you can't do with a join: find all the values in table t1 which are equal to a maximum value in table t2.
SELECT column1 FROM t1
WHERE column1 = (SELECT MAX(column2) FROM t2);
Here is another example, which again is impossible with a join because it involves aggregating for one of the tables: find all rows in table t1 which contain a value which occurs twice in a given column.
SELECT * FROM t1 AS t
WHERE 2 = (SELECT COUNT(*) FROM t1 WHERE t1.column1 = t.column1);
Therefore, if a complex subselect in the SELECT clause itself has subselects in its WHERE clause, that could be impossible to express as a JOIN.
SELECT T2.B, (SELECT A from t1 where t1.ID=T2.ID
and 2=(SELECT COUNT(A) from t1 as TX WHERE TX.A=T1.A))
FROM T2
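For the "value that occurs twice" filter quoted above to behave as described, the inner COUNT has to be correlated with the outer row. A minimal sketch of that correlated form, run against SQLite from Python with made-up rows (the id column and the aliases t and x are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, column1 TEXT);
    INSERT INTO t1 VALUES
        (1, 'a'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'c'), (6, 'c');
""")

# Keep the rows whose column1 value appears exactly twice in t1.
rows = conn.execute("""
    SELECT t.id, t.column1
    FROM t1 AS t
    WHERE 2 = (SELECT COUNT(*) FROM t1 AS x WHERE x.column1 = t.column1)
    ORDER BY t.id
""").fetchall()

print(rows)
```

Here 'a' occurs twice, 'b' once and 'c' three times, so only the two 'a' rows are returned.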

Explanation of using the EXISTS operator on a correlated subquery

What is an explanation of the mechanics behind the following Query?
It looks like a powerful method of doing dynamic filtering on a table.
CREATE TABLE tbl (ID INT, amt INT)
INSERT tbl VALUES
(1,1),
(1,1),
(1,2),
(1,3),
(2,3),
(2,400),
(3,400),
(3,400)
SELECT *
FROM tbl T1
WHERE EXISTS
(
SELECT *
FROM tbl T2
WHERE
T1.ID = T2.ID AND
T1.amt < T2.amt
)
Live test of it here on SQL Fiddle
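You can usually convert correlated subqueries into an equivalent expression using explicit joins. Here is one way (note that, as written with the IS NULL filter, this left join returns the rows that have no larger amt for the same ID, i.e. the NOT EXISTS counterpart; the inner-join version further down is the one that matches the original query):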
SELECT distinct t1.*
FROM tbl T1 left outer join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
where t2.id is null
Martin Smith shows another way.
It is true that they are a "powerful way of doing dynamic filtering", but that is (usually) unimportant: you can do the same filtering using other SQL constructs.
Why use correlated subqueries? There are several positives and several negatives, and one important reason that is both. On the positive side, you do not have to worry about "multiplication" of rows, as happens in the above query. Also, when you have other filtering conditions, the correlated subquery is often more efficient. And sometimes, in a delete or update, it seems to be the only way to express a query.
The Achilles heel is that many SQL optimizers implement correlated subqueries as nested loop joins (even though they do not have to). So, they can be highly inefficient at times. However, the particular "exists" construct that you have is often quite efficient.
In addition, the nature of the joins between the tables can get lost in nested subqueries with complicated conditions in where clauses. It can get hard to understand what is going on in more complicated cases.
My recommendation: if you are going to use them on large tables, learn about SQL execution plans in your database. Correlated subqueries can bring out the best or the worst in SQL performance.
Possible edit: this is closer to the script in the OP:
SELECT distinct t1.*
FROM tbl T1 inner join
tbl t2
on t1.id = t2.id and
t1.amt < t2.amt
Let's translate this to English:
"Select rows from tbl where tbl has a row of the same ID and bigger amt."
What this does is select everything except the rows with maximum values of amt for each ID.
Note, the last line SELECT * FROM tbl is a separate query and probably not related to the question at hand.
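To see what the query returns on the sample data, here is a small sketch that runs the EXISTS version and the DISTINCT inner-join version against SQLite from Python (SQLite is only a stand-in for the database in the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (ID INT, amt INT);
    INSERT INTO tbl VALUES
        (1,1), (1,1), (1,2), (1,3), (2,3), (2,400), (3,400), (3,400);
""")

# Rows that have a larger amt somewhere within the same ID.
exists_rows = conn.execute("""
    SELECT * FROM tbl AS T1
    WHERE EXISTS (SELECT 1 FROM tbl AS T2
                  WHERE T1.ID = T2.ID AND T1.amt < T2.amt)
    ORDER BY ID, amt
""").fetchall()

# Inner-join rewrite; DISTINCT removes the rows multiplied by the join.
join_rows = conn.execute("""
    SELECT DISTINCT T1.ID, T1.amt
    FROM tbl AS T1
    INNER JOIN tbl AS T2 ON T1.ID = T2.ID AND T1.amt < T2.amt
    ORDER BY T1.ID, T1.amt
""").fetchall()

print(exists_rows)  # [(1, 1), (1, 1), (1, 2), (2, 3)]
print(join_rows)    # [(1, 1), (1, 2), (2, 3)]
```

Both versions drop the per-ID maximum rows. The EXISTS version keeps the duplicate (1, 1) row from the table, while DISTINCT in the join version collapses it, so the two are only interchangeable when the table has no duplicate rows.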
As others have already pointed out, using EXISTS in a correlated subquery is essentially telling the database engine "return all records for which there is a corresponding record which meets the criteria specified in the subquery." But there's more.
The EXISTS keyword represents a boolean value. It could also be taken to mean "Where at least one record exists that matches the criteria in the WHERE statement." In other words, if a single record is found, "I'm done, and I don't need to search any further."
The efficiency gain that CAN result from using EXISTS in a correlated subquery comes from the fact that as soon as EXISTS returns TRUE, the subquery stops scanning records and returns a result. Similarly, a subquery which employs NOT EXISTS will return as soon as ANY record matches the criteria in the WHERE statement of the subquery.
I believe the idea is that the subquery using EXISTS is SUPPOSED to avoid the use of nested loop searches. As @Gordon Linoff states above though, the query optimizer may or may not perform as desired. I believe MS SQL Server usually takes full advantage of EXISTS.
My understanding is that not all queries benefit from EXISTS, but often, they will, particularly in the case of simple structures such as that in your example.
I may have butchered some of this, but conceptually I believe it's on the right track.
The caveat is that if you have a performance-critical query, it would be best to evaluate execution of a version using EXISTS with one using simple JOINS as Mr. Linoff indicates. Depending on your database engine, table structure, time of day, and the alignment of the moon and stars, it is not cut-and-dried which will be faster.
Last note - I agree with lc. When you use SELECT * in your subquery, you may well be negating some or all of any performance gain. SELECT only the PK field(s).