Calling table valued function twice. Could be once? - sql

I need to use a table-valued function in a SQL query.
At the moment I have:
SELECT
K.ID,
(SELECT A from dbo.TableFunction1 (K.ID,0,77)) AS A,
(SELECT B from dbo.TableFunction1 (K.ID,0,77)) AS B
FROM K
I'm worried because I execute the same function with the same parameters twice, once to get one column and next time to get another column.
It turns out I can't do:
SELECT
K.ID,
(SELECT A,B from dbo.TableFunction1 (K.ID,0,77))
FROM K
as I get: Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
Could this query be improved so I called the function only once?

Try using cross apply:
select K.ID, tf1.A, tf1.B
from K cross apply
dbo.TableFunction1(K.ID, 0, 77) tf1

You can do 3 things here:
Use cross apply
select k.id, f.a, f.b
from k
cross apply dbo.TableFunction1(k.id, 0, 77) f
This will work fine if your query stays that simple, but if you start to have other joins, that limit the number of rows that you would return from "K", then you can still end up running "TableFunction" on every row in "K". I've seen that turn into a performance nightmare.
Convert the function into two scalar functions
select
K.ID,
dbo.ScalarFunctionA(K.ID, 0, 77) AS A,
dbo.ScalarFunctionB(K.ID, 0, 77) AS B
FROM K
This also has drawbacks: if the function contains a big query, you're now running that query twice for every row you return. If you're only returning 1 row, that's no problem; if you're returning thousands, performance takes another hit.
Unwrap the function completely and include it within the query. Most likely the fastest, but comes with the drawback of not reusing the code.

Related

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line: when I try to self join on the transaction ids, it always returns zero rows. It does work if I join only by wl1.n = wl2.n-1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support the ngrams function by presto, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transaction_id in the CTE, before the limit.
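The join the question is aiming for can be sketched outside Athena. This is a minimal illustration in Python with the standard sqlite3 module, not Athena/Presto code: the sample rows are made up, and the word list is stored once in a real table so that both sides of the self join are guaranteed to see the same rows (the thing the answer above suspects the unordered limit breaks).

```python
import re
import sqlite3

# Hypothetical sample rows standing in for exploration_transaction.
rows = [
    (1, "payment for invoice"),
    (2, "rent march"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_list (transaction_id INTEGER, n INTEGER, word TEXT)"
)

# Split each text into words and store the 1-based position of every word,
# mirroring "unnest(...) with ordinality" from the question.
for tx_id, text in rows:
    words = re.findall(r"[a-zA-Z]+", text)
    conn.executemany(
        "INSERT INTO word_list VALUES (?, ?, ?)",
        [(tx_id, pos, w) for pos, w in enumerate(words, start=1)],
    )

# Self join on the same transaction and on adjacent positions.
bigrams = conn.execute(
    """
    SELECT wl1.transaction_id, wl1.word, wl2.word
    FROM word_list wl1
    JOIN word_list wl2
      ON wl1.transaction_id = wl2.transaction_id
     AND wl1.n = wl2.n - 1
    ORDER BY wl1.transaction_id, wl1.n
    """
).fetchall()

print(bigrams)
```

Because the table is populated once, the join cannot end up comparing two different arbitrary samples of 50 rows.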

Adding a "calculated column" to BigQuery query without repeating the calculations

I want to reuse the values of calculated columns in a new third column.
For example, this query works:
select
countif(cond1) as A,
countif(cond2) as B,
countif(cond1)/countif(cond2) as prct_pass
From
Where
Group By
But when I try to use A,B instead of repeating the countif, it doesn't work because A and B are invalid:
select
countif(cond1) as A,
countif(cond2) as B,
A/B as prct_pass
From
Where
Group By
Can I somehow make the more readable second version work?
Is the first one inefficient?
You should construct a subquery (i.e. a double select) like
SELECT A, B, A/B as prct_pass
FROM
(
SELECT countif(cond1) as A,
countif(cond2) as B
FROM <yourtable>
)
The same amount of data will be processed in both queries.
In the subquery version you write only 2 countif() calls; if that step takes a long time, doing 2 instead of 4 should indeed be more efficient.
Looking at an example using bigquery public datasets:
SELECT
countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B,
countif(homeFinalRuns>3)/countif(awayFinalRuns>3) as division
FROM `bigquery-public-data.baseball.games_post_wide`
or
SELECT A, B, A/B as division FROM
(
SELECT countif(homeFinalRuns>3) as A,
countif(awayFinalRuns>3) as B
FROM `bigquery-public-data.baseball.games_post_wide`
)
we can see that doing all in one (without a subquery) is actually slightly faster. (I ran the queries 6 times for different values of the inequality, 5 times was faster and one time slower)
In any case, the efficiency will depend on how taxing it is to compute the condition in your particular dataset.
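The subquery pattern itself is not specific to BigQuery. Below is a small sketch using Python's sqlite3 module; the games table and its values are invented for illustration, and since SQLite has no countif(), SUM over a 0/1 comparison stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (home_runs INTEGER, away_runs INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [(5, 2), (4, 6), (1, 7), (8, 5), (2, 9)],  # made-up sample data
)

# The inner query computes each count once; the outer query reuses the
# aliases A and B. Multiplying by 1.0 avoids SQLite's integer division.
a, b, prct_pass = conn.execute(
    """
    SELECT A, B, 1.0 * A / B AS prct_pass
    FROM (
        SELECT SUM(home_runs > 3) AS A,
               SUM(away_runs > 3) AS B
        FROM games
    )
    """
).fetchone()

print(a, b, prct_pass)
```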

Infinite loop with recursive SQL query

I can't seem to find the reason behind the infinite loop in this query, nor how to correct it.
Here is the context :
I have a table called mergesWith with this description :
mergesWith: information about neighboring seas. Note that in this relation, for every pair of
neighboring seas (A,B), only one tuple is given – thus, the relation is not symmetric.
sea1: a sea
sea2: a sea.
I want to know every sea accessible from the Mediterranean Sea by navigating. I have opted for a recursive query using "with" :
With
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
select d
from acces
where p= 'Mediterranean Sea';
I think the cause is either the case when or the a.d=mw.sea1 or a.d=mw.sea2 that is not restrictive enough, but I can't seem to pinpoint why.
I get this error message :
32044. 00000 - "cycle detected while executing recursive WITH query"
*Cause: A recursive WITH clause query produced a cycle and was stopped
in order to avoid an infinite loop.
*Action: Rewrite the recursive WITH query to stop the recursion or use
the CYCLE clause.
The cycles are caused by the structure of your query, not by cycles in the data. You ask for the reason for cycling. That should be obvious: at the first iteration, one row of output has d = 'Aegean Sea'. At the second iteration, you will find a row with d = 'Mediterranean Sea', right? Can you now see how this will result in cycles?
Recursive queries have a cycle clause used exactly for this kind of problem. For some reason, even many users who learned the recursive with clause well, and use it all the time, seem unaware of the cycle clause (as well as the unrelated, but equally useful, search clause - used for ordering the output).
In your code, you need to make two changes. Add the cycle clause, and also in the outer query filter for non-cycle rows only. In the cycle clause, you can decide what to call the "cycle" column, and what values to give it. To make this look as similar to connect by queries as possible, I like to call the new column IS_CYCLE and to give it the values 0 (for no cycle) and 1 (for cycle). In the outer query below, add is_cycle to the select list to see what it adds to the recursive query.
Notice the position of the cycle clause: it comes right after the recursive with clause (in particular, after the closing parenthesis at the end of the recursive factored subquery).
with
acces(p,d) as (
select sea1 as p, sea2 as d
from MERGESWITH
UNION ALL
select a.p, case when mw.sea1=a.d
then mw.sea2
else mw.sea1
end as d
from acces a, MERGESWITH mw
where a.d=mw.sea1 or a.d=mw.sea2)
cycle d set is_cycle to 1 default 0 -- add this line
select d
from acces
where p= 'Mediterranean Sea'
and is_cycle = 0 -- and this line
;
Clearly, this would be data-dependent due to cycles in the data. I typically include a lev value when developing recursive CTEs. This makes it simpler to debug them.
So, try something like this:
with acces(p, d, lev) as (
select sea1 as p, sea2 as d, 1 as lev
from MERGESWITH
union all
select a.p,
(case when mw.sea1 = a.d then mw.sea2 else mw.sea1 end) as d,
lev + 1
from acces a join
MERGESWITH mw
on a.d in (mw.sea1, mw.sea2)
where lev < 5)
select d
from acces
where p = 'Mediterranean Sea';
If you find the reason but can't fix the code, ask a new question with sample data and desired results. A DB fiddle of some sort is also helpful.
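For experimenting with the traversal outside Oracle, here is a rough sketch using Python's sqlite3 module. It swaps in a different technique from the CYCLE clause discussed above: SQLite has no CYCLE clause, so the recursive CTE uses UNION (not UNION ALL), which discards rows that were already produced and therefore stops once no new sea is found. The rows in mergesWith are invented sample data, not the real table contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mergesWith (sea1 TEXT, sea2 TEXT)")
conn.executemany(
    "INSERT INTO mergesWith VALUES (?, ?)",
    [
        # Made-up sample data; each neighboring pair is stored only once.
        ("Mediterranean Sea", "Aegean Sea"),
        ("Aegean Sea", "Sea of Marmara"),
        ("Sea of Marmara", "Black Sea"),
        ("Atlantic Ocean", "Mediterranean Sea"),
        ("Isolated Sea A", "Isolated Sea B"),
    ],
)

# Start from the Mediterranean Sea and follow each pair in both directions.
# UNION removes duplicates, so a sea that was already reached is not queued
# again and the recursion terminates.
reachable = [
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE reach(sea) AS (
            SELECT 'Mediterranean Sea'
            UNION
            SELECT CASE WHEN mw.sea1 = r.sea THEN mw.sea2 ELSE mw.sea1 END
            FROM reach r
            JOIN mergesWith mw ON r.sea IN (mw.sea1, mw.sea2)
        )
        SELECT sea FROM reach
        WHERE sea <> 'Mediterranean Sea'
        ORDER BY sea
        """
    )
]

print(reachable)
```

Starting the recursion from a single sea, instead of from every row of the table, also keeps the intermediate result small.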

Stream Analytics UDF - using the output from one UDF in another

The following code results in my GT2HP value being null in the follow on UDFs:
SELECT
UDF.GT2HP(Collect()) as GT2HP,
UDF.LPLPReturns(Collect()) as LPLPReturns,
UDF.LPGasHeater(Collect()) as LPGasHeater,
UDF.HPRaisedSW(Collect(), AVG(GT2HP)) as HPRaisedSW,
UDF.HPCustomerDemand(Collect(), AVG(GT2HP)) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
The following code works:
SELECT
UDF.GT2HP(Collect()) as GT2HP,
UDF.LPLPReturns(Collect()) as LPLPReturns,
UDF.LPGasHeater(Collect()) as LPGasHeater,
UDF.HPRaisedSW(Collect(), UDF.GT2HP(Collect())) as HPRaisedSW,
UDF.HPCustomerDemand(Collect(), UDF.GT2HP(Collect())) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
Obviously the second code is way more computationally expensive than the first and I'd like to avoid it if possible.
I'd like to use the output of the first UDF in my follow on UDFs, but it seems to pass on null. All the select statements appear to execute in parallel not serial, which probably explains the null.
Is there a way to use the output of one UDF in another UDF?
The reason that GT2HP column referenced in the AVG(GT2HP) is always null is due to the SQL semantics.
Columns in the SELECT clause can only refer to sources referenced in FROM, and since there is no IotHubInput.GT2HP - it is interpreted as null.
If you separate your query into multiple steps, as Vignesh suggested you will end up with the first step just computing the COLLECT over the 60 sec window:
SELECT Collect() AS c
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
Let's name it step1. Now, since you are grouping only by a window, you will have just one value of column c every 60 sec.
Any aggregation over this is not necessary unless you increase the size of the window to aggregate more than one value...
So the AVG in the AVG(GT2HP) is unnecessary.
The second step then will be:
SELECT
c,
UDF.GT2HP(c) AS GT2HP
FROM step1
Let's call this step step2.
Now the final selection will be:
SELECT
GT2HP,
UDF.LPLPReturns(c) as LPLPReturns,
UDF.LPGasHeater(c) as LPGasHeater,
UDF.HPRaisedSW(c, GT2HP) as HPRaisedSW,
UDF.HPCustomerDemand(c, GT2HP) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM step2
And putting it all together:
WITH step1 AS (
SELECT Collect() AS c
FROM IotHubInput
WHERE IoTHub.ConnectionDeviceId = 'uk-iotedge'
GROUP BY TumblingWindow(second, 60)
),
step2 AS (
SELECT
c,
UDF.GT2HP(c) AS GT2HP
FROM step1
)
SELECT
GT2HP,
UDF.LPLPReturns(c) as LPLPReturns,
UDF.LPGasHeater(c) as LPGasHeater,
UDF.HPRaisedSW(c, GT2HP) as HPRaisedSW,
UDF.HPCustomerDemand(c, GT2HP) as HPCustomerDemand
INTO SQLDWUKSTEAMLOSS
FROM step2
You can write it as two statements. First one selects Collect() and avg() with a group by. Second select uses the results to call UDF.

RECURSIVE in SQL

I'm learning SQL and had a hard time understanding the following recursive SQL statement.
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
What are n and t in SELECT sum(n) FROM t;? As far as I could understand, n is a number and t is a set. Am I right?
Also how is recursion triggered in this statement?
The syntax that you are using looks like Postgres. "Recursion" in SQL is not really recursion, it is iteration. Your statement is:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The statement for t is evaluated as:
Evaluate the non-self-referring part (select 1).
Then evaluate the self-referring part. (Initially this gives 2.)
Then evaluate the self-referring part again. (3).
And so on while the condition is still valid (n < 100).
When this is done the t subquery is finished, and the final statement can be evaluated.
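To see the result of these steps, the same statement can be run as-is in SQLite, which also accepts WITH RECURSIVE. The sketch below uses Python's built-in sqlite3 module purely as a convenient way to execute it; nothing here is Postgres-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

cte = """
    WITH RECURSIVE t(n) AS (
        SELECT 1
        UNION ALL
        SELECT n+1 FROM t WHERE n < 100
    )
"""

# All rows produced by the CTE: the seed row plus one row per iteration.
numbers = sorted(row[0] for row in conn.execute(cte + "SELECT n FROM t"))

# The statement from the question.
total = conn.execute(cte + "SELECT sum(n) FROM t").fetchone()[0]

print(len(numbers), numbers[0], numbers[-1], total)
```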
This is called a Common Table Expression, or CTE.
The RECURSIVE keyword by itself doesn't make anything recurse: it only tells the parser that the CTE is allowed to reference itself. What makes things recursive is that the CTE named t references itself inside the expression. To produce the result of the expression, the query engine must therefore recursively build the result, where each evaluation triggers the next. It reaches this point: SELECT n+1 FROM t... and has to stop and evaluate t. To do that, it has to call itself again, and so on, until the condition (n < 100) no longer holds. The SELECT 1 provides a starting point, and the WHERE n < 100 makes it so that the query does not recur forever.
At least, that's how it's supposed to work conceptually. What generally really happens is that the query engine builds the result iteratively, rather than recursively, if it can, but that's another story.
Let's break this apart:
WITH RECURSIVE t(n) AS (
A Common Table Expression (CTE) which is supposed to include a seed query and a recursive query. CTE is called t and returns 1 column: n
The seed query:
SELECT 1
returns an answer set (in this case just a single row: 1) and puts a copy of it into the final answer set
Now starts the recursive part:
UNION ALL
The rows returned from the seed query are now processed and n+1 is returned (again a single row answer set: 2) and copied into the final answer set:
SELECT n+1 FROM t WHERE n < 100
If this step returned a non-empty answer set (activity_count > 0), it's repeated, potentially forever.
A WHERE-condition on a calculation like this n+1 is usually used to avoid an endless recursion. One usually knows the maximum possible level based on the data and for complex queries it's too easy to put some conditions wrong ;-)
Finally the answer set is returned:
)
SELECT sum(n) FROM t;
When you simply do a SELECT * FROM t; you'll see all numbers from 1 to 100, though it's not a very efficient way to produce this list.
The most important thing to remember is that each step produces a part of the final result and only those rows from the previous step are processed in the next recursion level.
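That last point can be modelled in a few lines of plain Python. This is only a conceptual sketch of the evaluation rule described above, not how any particular database implements it; the function names are made up for the illustration.

```python
def evaluate_recursive_cte(seed_rows, recursive_step):
    """Model of the rule: each level only sees the previous level's rows."""
    result = list(seed_rows)    # final answer set, starts with the seed rows
    working = list(seed_rows)   # rows produced by the previous level
    while working:              # stop once a level returns no rows
        working = recursive_step(working)
        result.extend(working)  # UNION ALL: append without deduplicating
    return result


def next_level(rows):
    # Equivalent of: SELECT n+1 FROM t WHERE n < 100
    return [n + 1 for n in rows if n < 100]


rows = evaluate_recursive_cte([1], next_level)
print(rows[:3], rows[-1], sum(rows))
```

Each call to next_level receives a single row here, because every level of this particular query produces exactly one new row.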