Is the 'WITH' clause only syntactic sugar?

I'm trying to find the best way to iterate over the same set of data multiple times in a single query. Along the way I started wondering whether 'with' would save some execution time (by reducing the number of times I have to query the data I need to work with), since some people claim that it creates only a reference (a subquery) to the source, while others claim that it stores the result of the query in memory and then lets you query it (that's what I would like it to be), saving tons of execution time.
For a better explanation, here's a summary of my code:
WITH MY_DATA AS (
    SELECT
        TABLE_A.A, TABLE_A.B, TABLE_B.C
    FROM TABLE_A
    JOIN TABLE_B
        ON TABLE_A.SOME_FILTER = TABLE_B.SOME_FILTER
    WHERE SOME_OTHER_CONDITION = 0
)
SELECT
    SUM(CASE WHEN A = 'SOME_VALUE' THEN 1 ELSE 0 END)
FROM MY_DATA
UNION ALL
SELECT
    SUM(CASE WHEN B = 'SOME_OTHER_VALUE' THEN 1 ELSE 0 END)
FROM MY_DATA
UNION ALL
SELECT
    SUM(CASE WHEN C = 'YET_ANOTHER_VALUE' THEN 1 ELSE 0 END)
FROM MY_DATA
Would using subqueries in each union make any difference at all? Or am I just making it aesthetically pleasing?

SQL Server treats CTEs as code blocks that are inserted into the query each time the CTE is referenced. Hence, your interpretation of them as "syntactic sugar" does apply -- in this case and in this database.
This is specific to SQL Server. Other databases materialize CTEs sometimes or always. In those databases, CTEs make it easier to factor out common code (i.e. to re-use the materialized version of the CTE).
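PostgreSQL 12+, for example, lets you choose the behavior per query. A minimal sketch using the CTE from the question:
WITH MY_DATA AS MATERIALIZED (  -- compute once and store the result;
                                -- NOT MATERIALIZED forces inlining instead
    SELECT TABLE_A.A, TABLE_A.B, TABLE_B.C
    FROM TABLE_A
    JOIN TABLE_B ON TABLE_A.SOME_FILTER = TABLE_B.SOME_FILTER
    WHERE SOME_OTHER_CONDITION = 0
)
SELECT SUM(CASE WHEN A = 'SOME_VALUE' THEN 1 ELSE 0 END)
FROM MY_DATA;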
In addition, recursive CTEs cannot be expressed in any other way using a single SELECT query, so they serve an independent purpose as well.
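For instance, here is a minimal recursive CTE that generates the integers 1 through 10 (SQL Server syntax; most other databases want the keyword WITH RECURSIVE):
WITH numbers(n) AS (
    SELECT 1          -- anchor member
    UNION ALL
    SELECT n + 1      -- recursive member references the CTE itself
    FROM numbers
    WHERE n < 10
)
SELECT n FROM numbers;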

Related

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count  sys_timestamp
        100  167877672772
        110  167877672769
        121  167877672987
        111  167877673877
        ...  ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
       (SELECT COUNT(*)
        FROM tableA t2
        WHERE t2.sys_timestamp > t1.sys_timestamp AND
              t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
              t2.event_count >= t1.event_count + 10)
       AS positive,
       (SELECT COUNT(*)
        FROM tableA t2
        WHERE t2.sys_timestamp > t1.sys_timestamp AND
              t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
              t2.event_count <= t1.event_count - 10)
       AS negative
FROM tableA AS t1
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+1000 [milliseconds]).
However, I will have to run such queries on tables with several million (perhaps even 100+ million) entries, and even at about 500k rows the query takes a very long time to complete. Furthermore, while the time frame always remains the same within a given query [though the window size can change from query to query], in some instances I will have to add maybe 10 more conditions similar to the positive / negative excesses to the same query.
Thus, I am looking for ways to improve the above query, primarily to achieve better performance given the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM   tableA t1
CROSS  JOIN LATERAL (
   SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
        , COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
   FROM   tableA t2
   WHERE  t2.sys_timestamp >  t1.sys_timestamp
   AND    t2.sys_timestamp <= t1.sys_timestamp + 1000
   ) t2;
It can be a CROSS JOIN because the subquery always returns exactly one row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
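If the numeric values are in fact epoch milliseconds, a hypothetical conversion could look like this (an assumption about your data; adjust the divisor to the actual unit):
ALTER TABLE tableA
    ALTER COLUMN sys_timestamp TYPE timestamptz
    USING to_timestamp(sys_timestamp / 1000.0);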
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans out of it.
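A minimal sketch of those indexes (the index names are placeholders):
CREATE INDEX tablea_sys_timestamp_idx ON tableA (sys_timestamp);
-- the multicolumn variant, which can enable index-only scans here:
CREATE INDEX tablea_ts_count_idx ON tableA (sys_timestamp, event_count);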
Depending on the exact data distribution (most importantly, how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is the "with" statement (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step creates parent-child detail rows: the table joined to itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary key, or that you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp > t1.sys_timestamp AND t2.sys_timestamp <= t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount >= bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount <= bq.nEndEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
left join posSummary ps on t1.event_count=ps.startEventCount
left join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct in baseQuery may not be necessary, depending on your actual keys.
The final join is done against tableA, but it could also use a summary of baseQuery (which has already run once) as a separate "with" statement; that seemed unnecessary.
You can play around to see what works.
There are other ways, of course, but this best illustrates how and where things could be improved.
"With" statements are used in multi-dimensional data warehouse queries because, when you have so much data to join across so many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and how to minimize the rows the query has to carry further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned by baseQuery or make it run faster (check the explain plans), the query improves overall.

Improve Netezza SQL Query That Contains Hundreds of Strings in WHERE Clause

I have a Netezza query with a WHERE clause that includes several hundred potential strings. I'm surprised that it runs, but it takes time to complete and occasionally errors out ('transaction rolled back by client'). Here's a pseudo code version of my query.
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
X.I_CD AS CODE,
COUNT(DISTINCT CASE WHEN X.I_FLG = 1 THEN X.UID ELSE NULL END) AS WIDGETS
FROM
(SELECT
A.I_TS,
A.I_SRC_NM,
A.I_CD,
B.UID,
B.I_FLG
FROM
SCHEMA.DATABASE.TABLE_A A
LEFT JOIN SCHEMA.DATABASE.TABLE_B B ON A.UID = B.UID
WHERE
A.I_TS BETWEEN '2017-01-01' AND '2017-01-15'
AND B.TAB_CODE IN ('00AV', '00BX', '00C2', '00DJ',
...)
) X
GROUP BY
X.I_TS,
X.I_SRC_NM,
X.I_CD
;
In my query, I'm limiting the results on B.TAB_CODE to about 1,200 values (out of more than 10k). I'm honestly surprised that it works at all, but it does most of the time.
Is there a more efficient way to handle this?
If the IN clause becomes too cumbersome, you can build your query in multiple parts: materialize the TAB_CODE set once, then use it in a JOIN.
WITH tab_codes(tab_code) AS (
SELECT '00AV'
UNION ALL
SELECT '00BX'
--- etc ---
)
SELECT
TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
X.I_SRC_NM AS CHANNEL,
--- etc ---
INNER JOIN tab_codes Q ON B.TAB_CODE = Q.tab_code
If you want to boost performance even more, consider using a real temporary table (CTAS).
We've seen situations where it's "cheaper" to CTAS the original table to another, distributed on your primary condition, and then query that table instead.
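A hypothetical CTAS sketch (the table name and code source are placeholders):
-- materialize the code list as a real table, distributed on the join key
CREATE TABLE tab_codes_tmp AS
SELECT tab_code
FROM (
    SELECT '00AV' AS tab_code
    UNION ALL SELECT '00BX'
    UNION ALL SELECT '00C2'
    -- ... remaining codes ...
) codes
DISTRIBUTE ON (tab_code);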
If I'm guessing correctly, X.I_TS is in fact a timestamp, and as such I expect it to contain many different values per day. Can you confirm that?
If I'm right, the query can probably benefit from changing 'GROUP BY X.I_TS, ...' to 'GROUP BY 1, ...' (grouping on the formatted date instead of the raw timestamp).
Furthermore, the 'COUNT(DISTINCT CASE ...' can never return anything other than 1 or NULL. Can you confirm that?
If I'm right about that, you can get rid of the expensive DISTINCT by changing it to 'MAX(CASE ...'.
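A sketch of that rewrite, reusing the derived table X from the question (and relying on the assumption above that each group matches at most one distinct UID):
SELECT
    TO_CHAR(X.I_TS, 'YYYY-MM-DD') AS DATE,
    X.I_SRC_NM AS CHANNEL,
    X.I_CD AS CODE,
    -- MAX(CASE ...) yields 1 (or NULL) without the expensive DISTINCT
    MAX(CASE WHEN X.I_FLG = 1 THEN 1 ELSE NULL END) AS WIDGETS
FROM ( ... ) X   -- same derived table as in the question
GROUP BY 1, 2, 3;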
Can you follow me :)

Conditional aggregate database queries and their performance implications

I think this question is best asked with an example: if you want two counts from a table, say one count of the rows with a bit flag set to false and another of the rows with it set to true, is there a best practice for this kind of query, and what are the performance implications of the approaches that could be taken?
To expand a little, and basing it off of the article below, how would separate queries compare, performance-wise, to the version with the CASE evaluation in the SELECT list? Are there other methods?
http://www.codeproject.com/Articles/310674/Conditional-Sums-in-SQL-Aggregate-Methods
Other than Blam's way, I think there are three basic ways to get the desired result. I tested the three options below, as well as Blam's, on my system. (A side note: we didn't have any bit data in our system, so I counted an indexed column with two values, "H" or "R".) The results I found were as follows.
The Conditional Aggregates method resulted in the fastest performance. Blam's Grouping with an Aggregate method was the second fastest, consistently taking about 33% longer than the Conditional Aggregates. The Two Separate Select Statements method was the third fastest, consistently taking close to 50% longer than the Conditional Aggregates. Finally, the Joins method took the longest, coming in close to 1000% slower than the Conditional Aggregates. I expected the joins to take the longest, since you're joining to the table multiple times; I included the method because it was not discussed (possibly for obvious reasons) in the question, and performance issues aside, it is a viable, if extremely slow, option. The two separate select statements also make sense, since you're running two separate aggregates and accessing the table two separate times.
I'm not sure what accounts for the difference between the conditional aggregate method and Blam's method. I've always been pleasantly surprised by the speed and performance of case statements, and today was no different.
I think the case statement method, performance considerations aside, is possibly the most versatile. It works with just about any type of field and makes it easy to select a subset of values, whereas Blam's Grouping with an Aggregate method shows all possible column values unless a Where clause is included.
Conditional Aggregates
Select SUM(Case When bitcol = 1 Then 1 Else 0 End) as True_Count
, SUM(Case When bitcol = 0 Then 1 Else 0 End) as False_Count
From Table;
Two separate select statements
Select Count(1) as True_Count
From Table
Where bitcol = 1;
Select Count(1) as False_Count
From Table
Where bitcol = 0;
Using Joins
Select Count(T2.bitcol) as True_Count
, Count(T3.bitcol) as False_Count
From Table T1
Left Outer Join Table T2
on T1.ID = T2.ID and T2.bitcol = 1
Left Outer Join Table T3
on T1.ID = T3.ID and T3.bitcol = 0;
SELECT [bitCol], count(*)
FROM [table]
GROUP BY [bitCol]
If that column is indexed, this is an index scan followed by a stream aggregate.
I doubt you can do better than that.
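If you need the grouped result in the same one-row shape as the other methods, a hypothetical pivot on top of it:
Select SUM(Case When bitCol = 1 Then cnt Else 0 End) as True_Count
, SUM(Case When bitCol = 0 Then cnt Else 0 End) as False_Count
From (Select bitCol, Count(*) as cnt
From [table]
Group By bitCol) t;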

SQL EXISTS Why does selecting rownum cause inefficient execution plan?

Problem
I'm trying to understand why what seems like a minor difference between these two Oracle UPDATE queries causes radically different execution plans.
Query 1:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select *
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
Query 2:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select rownum
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3)
As you can see the only difference between the two is that the subquery in Query 2 returns a rownum instead of the values of every row.
The execution plans for these two couldn't be more different:
Query 1 - Pulls the full results from both tables and uses a sort and a hash join to return the results. This performs well, with a favorable cost of 2,346 (despite the use of the EXISTS clause and the correlated subquery).
Query 2 - Pulls both table results as well, but uses a count and a filter to accomplish the same task, and returns an execution plan with an astonishing cost of 77,789,696! I should note that this query just hangs on me, so I'm not actually positive it returns the same results (though I believe it should).
From my understanding of the EXISTS clause, it is just a simple boolean check that runs per row of the main table. It doesn't matter whether my EXISTS condition returns a single row or 100,000 rows; if any results are returned for the row being checked, the EXISTS check passes. So why would it matter what my subquery's SELECT statement returns?
--------------------EDIT----------------------
Per request, below are the execution plans from TOAD. Please note I edited the table names in my example above for ease - in these plans, ALSS_SALES2 = sales above and SALESEXT_TMP = tempTable above.
Also, I should have mentioned that neither of the two tables has indexes at this point. I haven't yet added them to my tempTable, and I'm testing with a cheap copy of the sales table, which contains only the fields and data but no indexes, constraints or security.
Thanks for the assistance everyone!
Query 1 Execution Plan
Query 2 Execution Plan
------------------------------------------------
Questions
1) Why did the call for rownum cause the execution plan to change?
2) What is it about the Filter that is so incredibally inefficient?
3) Am I missing something fundamental with the way the Exists clause works that is causing this change?
Posting the actual query plans would be quite helpful.
In general, though, when the optimizer sees a subquery with rownum, that radically limits its ability to transform the query and merge the results from the subquery with the main query because doing so potentially affects the results. That can be a quick way to force Oracle to materialize a subquery if that happens to be more efficient than the plan chosen by the optimizer. In this case, though, it is probably causing the optimizer to forego a transform step that makes the query more efficient.
Occasionally, you'll see someone take a query like
SELECT b.*
FROM (SELECT <<columns>>
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
and tack on a rownum to the a subquery
SELECT b.*
FROM (SELECT <<columns>>, rownum
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id
in order to force the optimizer to evaluate the a subquery before executing the join. Normally, of course, the optimizer should do this by default if it is more efficient. But if the optimizer makes a mistake, adding rownum can be quicker than figuring out the right set of hints to force a plan, or digging into the underlying problem to figure out the right solution.
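For reference, the documented hint for this is NO_MERGE, which stops the optimizer from merging the inline view into the outer query (a sketch, reusing the placeholder query above):
SELECT /*+ NO_MERGE(a) */ b.*
FROM (SELECT <<columns>>
FROM driving_table
WHERE <<conditions>>) a,
b
WHERE a.id = b.id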
Of course, in the particular case that you have a subquery in a WHERE EXISTS where the only use of rownum comes in the SELECT list, we humans can detect that the rownum shouldn't prevent any query transform step that the optimizer would care to use. The optimizer, though, is probably using a more general rule that says that subqueries that reference a function like rownum must be completely executed (this may depend on the exact Oracle version and/or the optimizer settings). So the optimizer is realistically doing a bunch of extra work because it's not smart enough to recognize that the rownum you added cannot possibly affect the results of the query.
Just a question, what's the execution plan for this query:
UPDATE sales s
SET status = 'DONE', trandate = sysdate
WHERE EXISTS (Select NULL
FROM tempTable tmp
WHERE s.key1 = tmp.key1
AND s.key2 = tmp.key2
AND s.key3 = tmp.key3);
It visualizes what is needed in an EXISTS (...) expression: actually nothing! As already stated, Oracle just has to check whether anything is returned, not what is returned in the subquery.

SQL: ... WHERE X IN (SELECT Y FROM ...)

Is the following the most efficient way in SQL to achieve its result:
SELECT *
FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID FROM SUBSCRIBERS)
Could some use of joins be better and achieve the same result?
Any mature enough SQL database should be able to execute that just as effectively as the equivalent JOIN. Use whatever is more readable to you.
One reason why you might prefer to use a JOIN rather than NOT IN is that if the values in the NOT IN clause contain any NULLs, you will always get back no results. If you do use NOT IN, remember to always consider whether the subquery might bring back a NULL value!
RE: Question in Comments
'x' NOT IN (NULL, 'a', 'b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'
≡ Unknown and True and True
≡ Unknown
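A quick way to see this in practice (runs as-is on SQL Server or PostgreSQL; add FROM dual on Oracle):
-- the NOT IN predicate evaluates to Unknown, so the ELSE branch fires
SELECT CASE WHEN 'x' NOT IN (NULL, 'a', 'b')
THEN 'row would match'
ELSE 'row is filtered out'
END AS result;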
Maybe try this
Select cust.*
From dbo.Customers cust
Left Join dbo.Subscribers subs on cust.Customer_ID = subs.Cust_ID
Where subs.Cust_ID Is Null
SELECT Customers.*
FROM Customers
WHERE NOT EXISTS (
SELECT *
FROM SUBSCRIBERS AS s
WHERE s.Cust_ID = Customers.Customer_ID)
When using "NOT IN", the query may fall back to nested full table scans, whereas with "NOT EXISTS" the query can use an index within the subquery.
If you want to know which is more efficient, look at the estimated query plans, or the actual query plans after execution. They'll tell you the costs of the queries (I find CPU and IO cost interesting). I wouldn't be surprised if there's little to no difference, but you never know. I've seen certain queries use multiple cores on our database server, while a rewritten version of the same query used only one core (needless to say, the query that used all 4 cores was a good 3 times faster). I never quite put my finger on why that is, but if you're working with large result sets, such differences can occur without your knowing about it.
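For example, on SQL Server you could compare the two forms like this (a sketch using the tables from the question):
-- report IO and CPU timings for each statement in the session
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT * FROM Customers
WHERE Customer_ID NOT IN (SELECT Cust_ID FROM SUBSCRIBERS);
SELECT * FROM Customers c
WHERE NOT EXISTS (SELECT 1 FROM SUBSCRIBERS s
WHERE s.Cust_ID = c.Customer_ID);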