How does select * on a subquery affect performance - sql

I am writing a query that involves using several subqueries using a WITH clause.
i.e.
WITH z as
(WITH x AS
(SELECT col1, col2 FROM foo LEFT JOIN bar on foo.col1 = bar.col1)
SELECT foo, bar
FROM x
INNER JOIN table2
ON x.col2 = table2.col2)
SELECT *
FROM z
LEFT JOIN table3
ON z.col1 = table3.col2
In reality, there are a few more subqueries and a lot more columns. Are there any performance issues with using the SELECT * on the subquery table (in this case, x or z)?
I want to avoid re-typing the same column names multiple times within one query but also need to optimize performance.

The answer depends on the database. CTEs can be handled by:
materializing an intermediate table and storing the results
merging the CTE code with the rest of the query
combining these two approaches
In the first approach, additional columns could have a small effect on performance. In the second, there should be no effect.
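In PostgreSQL 12+, for instance, you can choose between the first two behaviours explicitly (this hint is PostgreSQL-specific; other databases make the choice on their own):
-- force the CTE to be computed once and stored
WITH x AS MATERIALIZED (
    SELECT col1, col2 FROM foo
)
SELECT * FROM x;
-- ask the planner to merge (inline) the CTE into the outer query
WITH x AS NOT MATERIALIZED (
    SELECT col1, col2 FROM foo
)
SELECT * FROM x;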
That said, what usually dominates query performance is the work done for joins and group bys. Assuming the columns are not unreasonably large, I wouldn't worry about the performance implications of using select * in a CTE.
I would question how you write the CTEs. There is no need for nested CTEs, because they can be defined sequentially.
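For example, the nested query from the question can be flattened into sequential CTEs like this (the column lists are illustrative, since the original mixes names):
WITH x AS
(
    SELECT foo.col1, foo.col2
    FROM foo
    LEFT JOIN bar ON foo.col1 = bar.col1
),
z AS
(
    SELECT x.col1, x.col2
    FROM x
    INNER JOIN table2 ON x.col2 = table2.col2
)
SELECT *
FROM z
LEFT JOIN table3 ON z.col1 = table3.col2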

Related

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count sys_timestamp
100 167877672772
110 167877672769
121 167877672987
111 167877673877
... ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected and, in this particular example, returns for each row a count of positive and negative excesses (range +/- 10) within the defined time window (+1000 [milliseconds]).
However, I will have to run such queries on tables with several million (perhaps even 100+ million) entries, and even with about 500k rows the query takes a very long time to complete. Furthermore, while the time frame always remains the same within a given query [though the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive/negative excesses in the same query.
Thus, I am looking for ways to improve the above query primarily to achieve better performance considering primarily the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query aside from reusing the (partial) subquery that materially increases the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python, etc.?
How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
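A sketch of those suggestions, assuming sys_timestamp currently stores epoch milliseconds (adjust the USING expression to whatever the numeric actually encodes):
-- better-fitting types
ALTER TABLE tableA ALTER COLUMN event_count TYPE bigint;
ALTER TABLE tableA ALTER COLUMN sys_timestamp TYPE timestamptz
    USING to_timestamp(sys_timestamp / 1000.0);
-- multicolumn index; with enough vacuuming it enables index-only scans
CREATE INDEX ON tableA (sys_timestamp, event_count);
After the type change, the + 1000 in the query becomes + interval '1 second'.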
Depending on exact data distribution (most importantly how much the time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
You have the right idea.
The way to write statements you can reuse in a query is "with" statements (AKA subquery factoring). The "with" statement runs once as a subquery of the main query and can be reused by subsequent subqueries or the final query.
The first step is creating the parent-child detail rows - the table joined to itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary key, or that you have a compound index on (event_count, sys_timestamp), this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount between bq.startEventCount and bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount between bq.nEndEventCount and bq.startEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
WITH statements are used in multi-dimensional data warehouse queries because, when you have that much data to join across that many tables (dimensions and facts), a strategy of isolating the queries helps you understand where indexes are needed and how to minimize the rows the query has to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.
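For example, to see how the self-join at the heart of baseQuery actually behaves (PostgreSQL syntax; EXPLAIN ANALYZE executes the query, so try it on a test-sized table first):
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.event_count, t2.event_count
FROM tableA t1, tableA t2
WHERE t2.sys_timestamp BETWEEN t1.sys_timestamp AND t1.sys_timestamp + 1000;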

Use of temp table in joining for performance issue

Is there any basic difference in terms of performance and output between these two queries below?
select * from table1
left outer join table2 on table1.col=table2.col
and table2.col1='shhjs'
and
select * into #temp from table2 where table2.col1='shhjs'
select * from table1 left outer join #temp on table1.col=#temp.col
Here table2 has a huge number of records while #temp has far fewer.
Yes, there is. The second method is going to materialize a table in the temp database, which requires additional overhead.
The first method does not require such overhead. And, it can be better optimized. For instance, if an index existed on table2(col, col1), the first version might take advantage of it. The second would not.
However, you can always try the two queries on your system with your data and determine if one noticeably outperforms the other.
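If you do compare them, it is worth testing with the supporting indexes in place, e.g. (SQL Server syntax; the index names are illustrative):
-- lets the single-statement version filter and join table2 efficiently
CREATE INDEX IX_table2_col_col1 ON table2 (col, col1);
-- if you keep the temp-table approach, index the temp table before joining
CREATE INDEX IX_temp_col ON #temp (col);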

Multiple inner joins in a single statement versus pairwise joins

I was given someone else's code that joins 9 (!) tables - I've used it with no problem in the past, but now all the tables have grown over time so that I'm getting weird space errors.
I got advice to break up the joins and do multiple pairwise joins. Should be simple since all the joins are inner and everything I read says order should make no difference in this case - but I'm getting a different number of cases than I should. Without giving any specific very complicated example, what are some possible reasons for this?
Thanks
To me, joining 9 tables in a single statement is a lot! Pairwise may have been imprecise - I mean joining two tables then joining that result to another table, then that result to another table. Obviously they are ordered to the degree that the necessary key is available at each point.
This is not obvious to me. In fact, it is not true. Most SQL platforms (and you still have not said which one you are using) compile SQL statements into an execution plan. That plan will optimize and reorder when the joins are executed. On many systems that run in parallel, the joins will even be executed at the same time.
The way to understand the "order" of the statements is to look at the execution plan.
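For example (the exact command depends on your platform, which you have not named; the table and filter below are just placeholders):
-- PostgreSQL / MySQL: show the plan (in PostgreSQL, add ANALYZE to actually execute it)
EXPLAIN SELECT * FROM reallybigtable WHERE date = '2014-01-01';
-- SQL Server: return the estimated plan instead of running the statement
SET SHOWPLAN_XML ON;
GO
SELECT * FROM reallybigtable WHERE date = '2014-01-01';
GO
SET SHOWPLAN_XML OFF;
GO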
The way to control the order (on many systems) is to use a CTE. Something like this:
WITH subsetofbigtable AS
(
SELECT *
FROM reallybigtable
WHERE date = '2014-01-01'
)
SELECT *
FROM subsetofbigtable
JOIN anothertable1 ...
JOIN anothertable2 ...
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...
You can also chain CTEs to "order" joins:
WITH subsetofbigtable AS
(
SELECT *
FROM reallybigtable
WHERE date = '2014-01-01'
), chain1 AS
(
SELECT *
FROM subsetofbigtable
JOIN anothertable1 ...
), chain2 AS
(
SELECT *
FROM chain1
JOIN anothertable2 ...
)
SELECT *
FROM chain2
JOIN anothertable3 ...
JOIN anothertable4 ...
JOIN anothertable5 ...

Conditional aggregate database queries and their performance implications

I think this question is best asked with an example: if you want two counts from a table - say one with all the rows with a bit flag set to false and another with all of the ones set to true - is there a best practice for this kind of query and what are the performance implications of any approaches that could be taken?
To expand a little, and basing it off of the article below, how would separate queries compare to the version with the CASE evaluation in the SELECT list from a performance point of view? Are there other methods?
http://www.codeproject.com/Articles/310674/Conditional-Sums-in-SQL-Aggregate-Methods
Other than Blam's way, I think there are three basic ways to get the desired result. I tested the three options below as well as Blam's on my system. The results I found were as follows. Also, as a side note, we didn't have any bit data in our system, so I counted an indexed column with two values ("H" or "R").
The Conditional Aggregates method was the fastest.
Blam's Grouping with an Aggregate method was second fastest, consistently taking about 33% longer than the Conditional Aggregates.
The Two Separate Select Statements method was third, consistently taking close to 50% longer than the Conditional Aggregates.
The Joins method took the longest, coming in close to 1000% slower than the Conditional Aggregates.
I expected the joins to take the longest, since you're joining to that table multiple times. I included that method because it was not discussed (possibly for obvious reasons) in the question; performance aside, it is viable, if extremely slow. The two separate select statements also make sense: you're running two separate aggregates, accessing the table two separate times.
I'm not sure what accounts for the differences between the conditional aggregate method and Blam's method. I've always been pleasantly surprised by the speed and performance of case statements, and today was no different.
I think the case statement method, aside from the performance considerations, is possibly the most versatile method. It allows you to work with just about any type of field and facilitates the selection of a subset of values, whereas Blam's Grouping with an Aggregate method would show all possible column values unless a Where clause were included.
Conditional Aggregates
Select SUM(Case When bitcol = 1 Then 1 Else 0 End) as True_Count
, SUM(Case When bitcol = 0 Then 1 Else 0 End) as False_Count
From Table;
Two separate select statements
Select Count(1) as True_Count
From Table
Where bitcol = 1;
Select Count(1) as False_Count
From Table
Where bitcol = 0;
Using Joins
Select Count(T2.bitcol) as True_Count
, Count(T3.bitcol) as False_Count
From Table T1
Left Outer Join Table T2
on T1.ID = T2.ID
and T2.bitcol = 1
Left Outer Join Table T3
on T1.ID = T3.ID
and T3.bitcol = 0;
SELECT [bitCol], count(*)
FROM [table]
GROUP BY [bitCol]
If that column is indexed, it is an index scan followed by a stream aggregate.
Doubt you can do better than that
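For reference, the kind of index that enables that plan would be something like this (the name is illustrative):
CREATE INDEX IX_table_bitCol ON [table] (bitCol);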

Is derived table executed once or three times?

Every time you make use of a derived table, that query is going to be executed. When using a CTE, that result set is pulled back once and only once within a single query.
Does the quote suggest that the following query will cause derived table to be executed three times ( once for each aggregate function’s call ):
SELECT
AVG(OrdersPlaced),MAX(OrdersPlaced),MIN(OrdersPlaced)
FROM (
SELECT
v.VendorID,
v.[Name] AS VendorName,
COUNT(*) AS OrdersPlaced
FROM Purchasing.PurchaseOrderHeader AS poh
INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
GROUP BY v.VendorID, v.[Name]
) AS x
Thanks
No, that should be one pass; take a look at the execution plan.
Here is an example where something will run for every row in table2:
select *,(select COUNT(*) from table1 t1 where t1.id <= t2.id) as Bla
from table2 t2
Stuff like this with running counts will fire for each row in the table2 table.
A CTE or a nested (uncorrelated) subquery will generally have no different execution plan. Whether a CTE or a subquery is used has never had an effect on whether my intermediate queries were spooled.
With regard to the Tony Rogerson link - the explicit temp table performs better than the self-join to the CTE because it's indexed better - many times when you go beyond declarative SQL and start to anticipate the work process for the engine, you can get better results.
Sometimes, the benefit of a simpler and more maintainable query with many layered CTEs instead of a complex multi-temp-table process outweighs the performance benefits of a multi-table process. A CTE-based approach is a single SQL statement, which cannot be as quietly broken by a step being accidentally commented out or a schema changing.
Probably not, but it may spool the derived results so it only needs to access them once.
In this case, there should be no difference between a CTE and derived table.
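For comparison, the same query written with a CTE, which should produce the same plan here:
WITH x AS
(
    SELECT v.VendorID, v.[Name] AS VendorName, COUNT(*) AS OrdersPlaced
    FROM Purchasing.PurchaseOrderHeader AS poh
    INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
    GROUP BY v.VendorID, v.[Name]
)
SELECT AVG(OrdersPlaced), MAX(OrdersPlaced), MIN(OrdersPlaced)
FROM x;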
Where is the quote from?