Evaluation of CTEs in SQL Server 2005 - sql

I have a question about how MS SQL evaluates functions inside CTEs. A couple of searches didn't turn up any results related to this issue, but I apologize if this is common knowledge and I'm just behind the curve. It wouldn't be the first time :-)
This query is a simplified (and obviously less dynamic) version of what I'm actually doing, but it does exhibit the problem I'm experiencing. It looks like this:
CREATE TABLE #EmployeePool(EmployeeID int, EmployeeRank int);
INSERT INTO #EmployeePool(EmployeeID, EmployeeRank)
SELECT 42, 1
UNION ALL
SELECT 43, 2;
DECLARE @NumEmployees int;
SELECT @NumEmployees = COUNT(*) FROM #EmployeePool;
WITH RandomizedCustomers AS (
SELECT CAST(c.Criteria AS int) AS CustomerID,
dbo.fnUtil_Random(@NumEmployees) AS RandomRank
FROM dbo.fnUtil_ParseCriteria(@CustomerIDs, 'int') c)
SELECT rc.CustomerID,
ep.EmployeeID
FROM RandomizedCustomers rc
JOIN #EmployeePool ep ON ep.EmployeeRank = rc.RandomRank;
DROP TABLE #EmployeePool;
The following can be assumed about all executions of the above:
The result of dbo.fnUtil_Random() is always an int value greater than zero and less than or equal to the argument passed in. Since it's being called above with @NumEmployees, which has the value 2, this function always evaluates to 1 or 2.
The result of dbo.fnUtil_ParseCriteria(@CustomerIDs, 'int') is a one-column, one-row table that contains a sql_variant with a base type of 'int' that has the value 219935.
Given the above assumptions, it makes sense (to me, anyway) that the result of the expression above should always produce a two-column table containing one record - CustomerID and an EmployeeID. The CustomerID should always be the int value 219935, and the EmployeeID should be either 42 or 43.
However, this is not always the case. Sometimes I get the expected single record. Other times I get two records (one for each EmployeeID), and still others I get no records. However, if I replace the RandomizedCustomers CTE with a true temp table, the problem vanishes completely.
Every time I think I have an explanation for this behavior, it turns out to not make sense or be impossible, so I literally cannot explain why this would happen. Since the problem does not happen when I replace the CTE with a temp table, I can only assume it has something to do with how functions inside CTEs are evaluated during joins to that CTE. Do any of you have any theories?

SQL Server's optimizer is free to decide whether to reevaluate a CTE or not.
For instance, this query:
WITH q AS
(
SELECT NEWID() AS n
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will produce two different NEWID() values. However, if you use a cached XML plan to wrap the CTE in an Eager Spool operation, the two records will be the same.
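As a minimal sketch of the temp-table workaround mentioned in the question, the following uses SQLite through Python's sqlite3 module as a stand-in for SQL Server. That substitution is an assumption made only so the example is self-contained: SQLite's random() replaces dbo.fnUtil_Random, and the customer ID is hard-coded instead of coming from dbo.fnUtil_ParseCriteria. It does not reproduce the optimizer behaviour itself; it only shows that once the random rank is stored in a table, every later read sees the same value, so the join yields exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EmployeePool (EmployeeID INTEGER, EmployeeRank INTEGER);
    INSERT INTO EmployeePool VALUES (42, 1), (43, 2);

    -- Materialize the random rank once; later reads see the stored value
    -- instead of re-running the random function per reference.
    CREATE TEMP TABLE RandomizedCustomers AS
    SELECT 219935 AS CustomerID,
           (random() & 1) + 1 AS RandomRank;  -- always 1 or 2
""")

rows = conn.execute("""
    SELECT rc.CustomerID, ep.EmployeeID
    FROM RandomizedCustomers rc
    JOIN EmployeePool ep ON ep.EmployeeRank = rc.RandomRank
""").fetchall()

print(rows)  # one row: customer 219935 paired with employee 42 or 43
```

Because the rank is read from a stored row, the join condition is evaluated against a single fixed value, which is the property the CTE version does not guarantee.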

OUTER/CROSS APPLY Subquery without FROM clause

Most online documentation or tutorials discussing OUTER|CROSS APPLY describe something like:
SELECT columns
FROM table OUTER|CROSS APPLY (SELECT … FROM …);
The subquery is normally a full SELECT … FROM … query.
I must have read somewhere that the subquery doesn’t need a FROM in which case the columns appear to come from the main query:
SELECT columns
FROM table OUTER|CROSS APPLY (SELECT … );
because I have used it routinely as a method to pre-calculate columns.
The question is what is really happening if the FROM is omitted from the sub query? Is it short for something else? I found that it does not mean the same as from the main table.
I have a sample here: http://sqlfiddle.com/#!18/0188f7/4/1
First consider
SELECT o.name, o.type
FROM sys.objects o
Now consider
SELECT o.name, (SELECT o.type) AS type
FROM sys.objects o
A SELECT without a FROM is as though selecting from an imaginary single-row table. The above doesn't change the results; the scalar subquery just acts as a correlated subquery and uses the value from the outer query.
APPLY behaves in the same way. References to columns from the outer query are just passed in as correlated parameters. So this is the same as
SELECT o.name, ca.type
FROM sys.objects o
CROSS APPLY (SELECT o.type) AS ca
But APPLY in general is more capable than a scalar subquery in the SELECT (in that it can act to expand a row out or remove rows from the result).
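The equivalence between the plain column reference and the FROM-less scalar subquery can be checked on any engine that accepts a SELECT without FROM. The sketch below uses SQLite via Python's sqlite3 module, which is an assumption for the sake of a runnable example; SQLite has no APPLY operator, and the objects table here is a hypothetical stand-in for sys.objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (name TEXT, type TEXT);  -- stand-in for sys.objects
    INSERT INTO objects VALUES ('t1', 'U'), ('v1', 'V'), ('p1', 'P');
""")

plain = conn.execute(
    "SELECT o.name, o.type FROM objects o ORDER BY o.name"
).fetchall()

# The FROM-less scalar subquery produces one row per outer row,
# built from that outer row's own column value.
correlated = conn.execute(
    "SELECT o.name, (SELECT o.type) AS type FROM objects o ORDER BY o.name"
).fetchall()

print(plain == correlated)
```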
What you have mentioned is not a subquery; it is a separate table expression. Whether or not you use a FROM clause in the right expression is not a problem.
If you use a FROM clause in the right table expression, then that expression has its own source of data.
If you don't use a FROM clause in the right expression, its data comes from the left table expression.
First, let's see what the APPLY operator is. Reference: BOL
Using APPLY
Both the left and right operands of the APPLY operator are table
expressions. The main difference between these operands is that the
right_table_source can use a table-valued function that takes a column
from the left_table_source as one of the arguments of the function.
The left_table_source can include table-valued functions, but it
cannot contain arguments that are columns from the right_table_source.
The APPLY operator works in the following way to produce the table
source for the FROM clause:
Evaluates right_table_source against each row of the left_table_source to produce rowsets.
The values in the right_table_source depend on left_table_source.
right_table_source can be represented approximately this way:
TVF(left_table_source.row), where TVF is a table-valued function.
Combines the result sets that are produced for each row in the evaluation of right_table_source with the left_table_source by
performing a UNION ALL operation.
The list of columns produced by the result of the APPLY operator is
the set of columns from the left_table_source that is combined with
the list of columns from the right_table_source.
Depending on how you use the APPLY operator, it will behave like a correlated subquery or like a CROSS JOIN.
Using values of the left table expression in right table expression
-- without FROM (similar to Correlated Subquery)
SELECT id, data, value
FROM test OUTER APPLY(SELECT data*10 AS value) AS sq;
Not using values of left table expression in right table expression
-- FROM table (Similar to cross join)
SELECT id, data, value
FROM test OUTER APPLY(SELECT data*10 AS value FROM test) AS sq;
Omitting the FROM statement is not specific to a CROSS/OUTER APPLY; any valid SQL select statement can omit it. By not using FROM you have no source for your data, so you can't specify columns within that source. Rather you can only select values that already exist; be that constants defined in the statement itself, or in some cases (e.g. subqueries) columns referenced from other parts of the query.
This is simpler to understand if you're familiar with Oracle's Dual table; a table with 1 row. In MS SQL that table would look like this:
-- Ref: https://blog.sqlauthority.com/2010/07/20/sql-server-select-from-dual-dual-equivalent/
CREATE TABLE DUAL
(
DUMMY VARCHAR(1) NOT NULL
, CONSTRAINT CHK_ColumnD_DocExc CHECK (DUMMY = 'X') -- ensure this column can only hold the value X
, CONSTRAINT PK_DUAL PRIMARY KEY (DUMMY) -- ensure we can only have unique values... combined with the above means we can only ever have 1 row
)
GO
INSERT INTO DUAL (DUMMY)
VALUES ('X')
GO
You can then do select 1 one, 'something else' two from dual. You're not really using dual; just ensuring that you have a table which will always return exactly 1 row.
Now, anywhere in SQL that you omit a FROM clause, consider that statement as if it said FROM DUAL; it has the same meaning, only SQL allows this shorthand.
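A small sketch of that mental model, assuming SQLite (through Python's sqlite3 module) as the engine and a simplified one-row dual table modelled on the definition above: the FROM-less statement and the FROM dual statement return the same single row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One-row helper table; the CHECK plus PRIMARY KEY keep it at one row.
    CREATE TABLE dual (dummy TEXT PRIMARY KEY CHECK (dummy = 'X'));
    INSERT INTO dual VALUES ('X');
""")

without_from = conn.execute(
    "SELECT 1 AS one, 'something else' AS two"
).fetchall()
with_dual = conn.execute(
    "SELECT 1 AS one, 'something else' AS two FROM dual"
).fetchall()

print(without_from, with_dual)
```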
Update
You mention in the comments that you don't see how you can reference columns from the original statement when in a subquery (e.g. of the kind you may see when using APPLY). The code below shows this without the APPLY scenario. Admittedly the demo code here is not something you'd ever use (since you could just do where Something like '%o%' on the original statement without needing the subquery/in statement), but for illustrative purposes it shows exactly the same sort of scenario as you've got with APPLY; i.e. the statement is just returning the value of Something for the current row.
declare @someTable table (
Id bigint not null identity(1,1)
, Something nvarchar(32) not null
)
insert @someTable (Something) values ('one'), ('two'), ('three')
select *
from @someTable x
where x.Something in
(
-- this subquery references the SOMETHING column from above, but doesn't have a FROM statement
-- note: there is only 1 value at a time for something here; not all 3 values at once; it's the same single value as Something as we have before the in keyword above
select Something
where Something like '%o%'
)

Hdp, Hive, Lateral view and null: disappearing rows

Since the upgrade from hdp 3.1.0 to 3.1.4, I have some issue in Hive I do not understand. Note that I am only using ORC transactional tables.
For instance this query:
with cte as (
select
e.id
, '{}' as json
from event e
)
-- select count(*) from cte
select
id
, lv.customfield
from cte
lateral view outer
json_tuple(cte.json, 'customfield') lv AS `customfield`
It worked perfectly before the upgrade.
Now, even though the CTE returns a certain number of rows, using the lateral view just drops rows from the result set without any error, even though there is no extra where clause outside the CTE (in my real example, the query returns 66 rows without the lateral view, but only 19 with it).
In my case I have:
select count(*) give me 66 rows
when the lateral view on a static string is added, I only get 19 rows.
I tried quite a few variations:
if I replace the event table by a static CTE (select stack(1, ...)) I have the result I expect
if I remove the lateral view, I have the number of rows I expect (as long as I do not use is distinct from)
if instead of a CTE I create and use a temporary table, the outcome does not change.
if I put json_tuple(cte.json, 'customfield') in the select part outside the CTE (and nothing else as it would not be valid), without the lateral view, I have the number of expected rows,
If I use get_json_object in the select part outside the CTE (and no lateral view) I have the expected results.
of course, there is nothing in the hive (server or metastore) logs.
as a side note, since the upgrade a merge statement keeps generating duplicates, whereas it worked perfectly before.
Another extremely surprising thing is that inside the CTE there is an if statement, for instance: if(is_deleted is null, 'true', 'false').
If I replace the is null with is not distinct from null, which should be perfectly valid, no rows are returned by the CTE.
I am completely at a loss; I have no idea why this happens or how I can trust Hive.
I cannot replicate the error by generating manual data so I cannot give a (not) working example.
I do not yet understand the actual cause, but I was able to isolate the problem and submit a bug report: https://issues.apache.org/jira/browse/HIVE-22500
In short, a less-than-or-equal comparison with an implicit string-to-timestamp conversion fails if a sort by (implicit or explicit) is involved.
-- valid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00') t;
-- invalid result
select count(*) from ( select * from opens where load_ts <= '2019-11-13 09:07:00' sort by id) t;
You can see the bug report for full set up or other examples. The workaround is to explicitly cast the string to a timestamp.

How Do I Combine Multiple SQL Queries?

I'm having some trouble figuring out any way to combine two SQL queries into a single one that expresses some greater idea.
For example, let's say that I have query A, and query B. Query A returns the total number of hours worked. Query B returns the total number of hours that were available for workers to work. Each one of these queries returns a single column with a single row.
What I really want, though, is essentially query A over query B. I want to know the percentage of capacity that was worked.
I know how to write query A and B independently, but my problem comes when I try to figure out how to use those prewritten queries to come up with a new SQL query that uses them together. I know that, on a higher level, like say in a report, I could just call both queries and then divide them, but I'd rather encompass it all into a single SQL query.
What I'm looking for is a general idea on how to combine these queries using SQL.
Thanks!
Unconstrained JOIN, Cartesian Product of 1 row by 1 row
SELECT worked/available AS PercentageCapacity
FROM ( SELECT worked FROM A ) a,
( SELECT available FROM B ) b
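The one-row-by-one-row join above can be sketched end to end as follows. This uses SQLite via Python's sqlite3 module, and the shifts table with its two columns is hypothetical, since the question does not show queries A and B. Multiplying by 100.0 avoids integer division, and NULLIF guards against dividing by zero when no hours were available; both are additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data standing in for whatever queries A and B read.
    CREATE TABLE shifts (hours_worked INTEGER, hours_available INTEGER);
    INSERT INTO shifts VALUES (6, 8), (4, 8), (8, 8);
""")

pct = conn.execute("""
    SELECT 100.0 * a.worked / NULLIF(b.available, 0) AS pct_capacity
    FROM (SELECT SUM(hours_worked) AS worked FROM shifts) AS a        -- query A
    CROSS JOIN
         (SELECT SUM(hours_available) AS available FROM shifts) AS b  -- query B
""").fetchone()[0]

print(pct)  # 18 of 24 hours worked -> 75.0
```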
You can declare variables to store the results of each query and return the difference:
DECLARE @first INT
DECLARE @second INT
SET @first = (SELECT val FROM Table...)
SET @second = (SELECT val FROM Table...)
SELECT @first - @second
The answer depends on where the data is coming from.
If it's coming from a single table, it could be something as easy as:
select totalHours, availableHours, (totalHours - availableHours) as difference
from hoursTable
But if the data is coming from separate tables, you need to add some identifying column so that the rows can be joined together to provide some useful view of the data.
You may want to post examples of your queries so we know better how to answer your question.
You can query the queries:
SELECT
a.ID,
a.HoursWorked/b.HoursAvailable AS UsedWork
FROM
( SELECT ID, HoursWorked FROM Somewhere ) a
INNER JOIN
( SELECT ID, HoursAvailable FROM SomewhereElse ) b
ON
a.ID = b.ID

When to use EXCEPT as opposed to NOT EXISTS in Transact SQL?

I just recently learned of the existence of the new "EXCEPT" clause in SQL Server (a bit late, I know...) through reading code written by a co-worker. It truly amazed me!
But then I have some questions regarding its usage: when is it recommended to be employed? Is there a difference, performance-wise, between using it versus a correlated query employing "AND NOT EXISTS..."?
After reading EXCEPT's article in the BOL I thought it was just a shorthand for the second option, but was surprised when I rewrote a couple queries using it (so they had the "AND NOT EXISTS" syntax much more familiar to me) and then checked the execution plans - surprise! The EXCEPT version had a shorter execution plan, and executed faster, also. Is this always so?
So I'd like to know: what are the guidelines for using this powerful tool?
EXCEPT treats NULL values as matching.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE value NOT IN
(
SELECT value
FROM p
)
will return an empty rowset.
This query:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
WHERE NOT EXISTS
(
SELECT NULL
FROM p
WHERE p.value = q.value
)
will return
NULL
1
and this one:
WITH q (value) AS
(
SELECT NULL
UNION ALL
SELECT 1
),
p (value) AS
(
SELECT NULL
UNION ALL
SELECT 2
)
SELECT *
FROM q
EXCEPT
SELECT *
FROM p
will return:
1
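The three results above can be checked side by side. The sketch below runs the same CTEs on SQLite through Python's sqlite3 module, which is an assumption chosen so the example is runnable without a SQL Server instance; the NULL semantics shown here are standard SQL behaviour that SQLite shares. The run helper is a hypothetical convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same two row sets as in the queries above.
CTES = """
WITH q (value) AS (SELECT NULL UNION ALL SELECT 1),
     p (value) AS (SELECT NULL UNION ALL SELECT 2)
"""

def run(tail):
    """Run the shared CTE prefix plus the given final SELECT; return rows as a set."""
    return set(conn.execute(CTES + tail).fetchall())

# NOT IN: any NULL in p makes the predicate unknown for every row of q.
not_in = run("SELECT * FROM q WHERE value NOT IN (SELECT value FROM p)")

# NOT EXISTS with '=': NULL never equals NULL, so the NULL row survives.
not_exists = run(
    "SELECT * FROM q WHERE NOT EXISTS "
    "(SELECT NULL FROM p WHERE p.value = q.value)"
)

# EXCEPT: NULLs are treated as matching, so only 1 remains.
except_ = run("SELECT * FROM q EXCEPT SELECT * FROM p")

print(not_in, not_exists, except_)
```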
Recursive reference is also allowed in EXCEPT clause in a recursive CTE, though it behaves in a strange way: it returns everything except the last row of a previous set, not everything except the whole previous set:
WITH q (value) AS
(
SELECT 1
UNION ALL
SELECT 2
UNION ALL
SELECT 3
),
rec (value) AS
(
SELECT value
FROM q
UNION ALL
SELECT *
FROM (
SELECT value
FROM q
EXCEPT
SELECT value
FROM rec
) q2
)
SELECT TOP 10 *
FROM rec
---
1
2
3
-- original set
1
2
-- everything except the last row of the previous set, that is 3
1
3
-- everything except the last row of the previous set, that is 2
1
2
-- everything except the last row of the previous set, that is 3, etc.
1
SQL Server developers must just have forgotten to forbid it.
I have done a lot of analysis of except, not exists, not in and left outer join. Generally the left outer join is the fastest for finding missing rows, especially joining on a primary key. Not In can be very fast if you know it will be a small list returned in the select.
I use EXCEPT a lot to compare what is being returned when rewriting code. Run the old code saving results. Run new code saving results and then use except to capture all differences. It is a very quick and easy way to find differences, especially when needing to get all differences including null. Very good for on the fly easy coding.
But every situation is different. As I say to every developer I have ever mentored: try it. Do timings all different ways. Try it, time it, do it.
EXCEPT compares all (paired)columns of two full-selects.
NOT EXISTS compares two or more tables according to the conditions specified in the WHERE clause of the subquery following the NOT EXISTS keyword.
EXCEPT can be rewritten by using NOT EXISTS.
(EXCEPT ALL can be rewritten by using ROW_NUMBER and NOT EXISTS.)
Got this from here
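As a sketch of the EXCEPT-to-NOT EXISTS rewrite (not EXCEPT ALL), assuming SQLite via Python's sqlite3 module and two hypothetical single-column tables q and p: the rewrite needs DISTINCT, because EXCEPT removes duplicates, and an explicit NULL-matching condition, because EXCEPT treats NULLs as equal while = does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q (value INTEGER);
    CREATE TABLE p (value INTEGER);
    INSERT INTO q VALUES (NULL), (1), (1), (3);  -- note the duplicate 1
    INSERT INTO p VALUES (NULL), (2);
""")

via_except = conn.execute(
    "SELECT value FROM q EXCEPT SELECT value FROM p ORDER BY value"
).fetchall()

via_not_exists = conn.execute("""
    SELECT DISTINCT q.value
    FROM q
    WHERE NOT EXISTS (
        SELECT 1
        FROM p
        WHERE p.value = q.value
           OR (p.value IS NULL AND q.value IS NULL)  -- treat NULLs as matching
    )
    ORDER BY q.value
""").fetchall()

print(via_except, via_not_exists)
```

With more than one column, the matching condition has to be repeated for each compared column pair.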
There is no accounting for SQL server's execution plans. I have always found when having performance issues that it was utterly arbitrary (from a user's perspective, I'm sure the algorithm writers would understand why) when one syntax made a better execution plan rather than another.
In this case, something about the query parameter comparison allows SQL to figure out a shortcut that it couldn't from a straight select statement. I'm sure that is a deficiency in the algorithm. In other words, you could logically interpolate the same thing, but the algorithm doesn't make that translation on an exists query. Sometimes that is because an algorithm that could reliably figure it out would take longer to execute than the query itself, or at least the algorithm designer thought so.
If your query is fine-tuned, there is no performance difference between EXCEPT and NOT EXISTS/NOT IN. The first time I ran EXCEPT after converting my correlated query to it, I was surprised because it returned in just 7 seconds, while the correlated query took 22 seconds. Then I added DISTINCT to my correlated query and reran it, and it also returned in 7 seconds. So EXCEPT is good when you don't know how, or don't have time, to fine-tune your query; otherwise both perform the same.

SQL - Use results of a query as basis for two other queries in one statement

I'm doing a probability calculation. I have a query to calculate the total number of times an event occurs. From these events, I want to get the number of times a sub-event occurs. The query to get the total events is 25 lines long and I don't want to just copy + paste it twice.
I want to do two things to this query: calculate the number of rows in it, and calculate the number of rows in the result of a query on this query. Right now, the only way I can think of doing that is this (replace #total# with the complicated query to get all rows, and #conditions# with the less-complicated conditions that rows, from #total#, must have to match the sub-event):
SELECT (SELECT COUNT(*) FROM (#total#) AS t1 WHERE #conditions#) AS suboccurs,
COUNT(*) AS totaloccurs FROM (#total#) as t2
As you notice, #total# is repeated twice. Is there any way around this? Is there a better way to do what I'm trying to do?
To re-emphasize: #conditions# does depend on what #total# returns (it does stuff like t1.foo = bar).
Some final notes: #total# by itself takes ~250ms. This more complicated query takes ~300ms, so postgres is likely doing some optimization, itself. Still, the query looks terribly ugly with #total# literally pasted in twice.
If your sql supports subquery factoring, then rewriting it using the WITH statement is an option. It allows subqueries to be used more than once. With will create them as either an inline-view or a temporary table in Oracle.
Here is a contrived example.
WITH
x AS
(
SELECT this
FROM THERE
WHERE something is true
),
y AS
(
SELECT this-other-thing
FROM somewhereelse
WHERE something else is true
),
z AS
(
select count(*) k
FROM X
)
SELECT z.k, y.*, x.*
FROM x,y, z
WHERE X.abc = Y.abc
SELECT COUNT(*) as totaloccurs, COUNT(CASE WHEN #conditions# THEN 1 END) as suboccurs
FROM (#total#) as t1
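A minimal sketch of the single-pass approach, assuming SQLite via Python's sqlite3 module rather than Postgres, with a hypothetical events table and the condition foo = 'bar' standing in for #total# and #conditions#. The CASE expression matters: COUNT over a bare boolean would count every non-NULL result, false ones included, whereas CASE yields NULL for rows that do not match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER, foo TEXT);
    INSERT INTO events VALUES (1, 'bar'), (2, 'baz'), (3, 'bar'), (4, NULL);
""")

total_occurs, sub_occurs = conn.execute("""
    WITH total AS (              -- stands in for the long #total# query
        SELECT id, foo FROM events
    )
    SELECT COUNT(*) AS totaloccurs,
           COUNT(CASE WHEN foo = 'bar' THEN 1 END) AS suboccurs
    FROM total
""").fetchone()

print(total_occurs, sub_occurs)  # 4 2
```

The long query appears only once, inside the WITH clause, and both counts are computed from it in one statement.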
Put the reused sub-query into a temp table, then select what you need from the temp table.
@EvilTeach:
I've not seen the "with" (probably not implemented in Sybase :-(). I like it: does what you need in one chunk then goes away, with even less cruft than temp tables. Cool.