Conditional aggregate database queries and their performance implications

I think this question is best asked with an example: if you want two counts from a table - say one with all the rows with a bit flag set to false and another with all of the ones set to true - is there a best practice for this kind of query and what are the performance implications of any approaches that could be taken?
To expand a little, and basing it off of the article below, how would separate queries compare to the version with the CASE evaluation in the SELECT list from a performance point of view? Are there other methods?
http://www.codeproject.com/Articles/310674/Conditional-Sums-in-SQL-Aggregate-Methods

Other than Blam's way, I think there are three basic ways to get the desired result. I tested the three options below as well as Blam's on my system. The results I found were as follows. Also, a side note, we didn't have any bit data in our system so I counted an indexed column with two values ("H" or "R").
The Conditional Aggregates method was the fastest. Blam's Grouping with an Aggregate method came second, consistently taking about 33% longer than the Conditional Aggregates. The Two Separate Select Statements method was third, consistently taking close to 50% longer than the Conditional Aggregates. Finally, the Joins method took the longest, and was close to 1000% slower than the Conditional Aggregates. I expected the joins to take the longest, as you're joining to that table multiple times. I included this method because it was not discussed (possibly for obvious reasons) in the question; performance aside, it is a viable, if extremely slow, option. The result for the two separate select statements also makes sense, as you're running two separate aggregates and accessing the table two separate times.
I'm not sure what accounts for the differences between the conditional aggregate method and Blam's method. I've always been pleasantly surprised by the speed and performance of case statements, and today was no different.
I think the case statement method, aside from the performance considerations, is possibly the most versatile method. It allows you to work with just about any type of field and facilitates the selection of a subset of values, whereas Blam's Grouping with an Aggregate method would show all possible column values unless a Where clause were included.
Conditional Aggregates
Select SUM(Case When bitcol = 1 Then 1 Else 0 End) as True_Count
, SUM(Case When bitcol = 0 Then 1 Else 0 End) as False_Count
From Table;
Two separate select statements
Select Count(1) as True_Count
From Table
Where bitcol = 1;
Select Count(1) as False_Count
From Table
Where bitcol = 0;
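Using Joins
-- The flag test belongs in each ON clause; without it both counts equal the number of rows.
Select Count(T2.bitcol) as True_Count
, Count(T3.bitcol) as False_Count
From Table T1
Left Outer Join Table T2
on T1.ID = T2.ID and T2.bitcol = 1
Left Outer Join Table T3
on T1.ID = T3.ID and T3.bitcol = 0;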
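As a rough way to check that these approaches agree, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in database. The table name t, the helper flag_counts, and the sample data are assumptions made for illustration, and the join variant assumes the flag test is meant to sit in each ON clause. It only confirms that the forms return the same counts; it says nothing about the timings reported above, which depend on the engine, the indexes, and the data.

```python
import sqlite3


def flag_counts(flags):
    """Return (true_count, false_count) computed four ways on a throwaway table."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: an integer key plus a 0/1 flag column.
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, bitcol INTEGER NOT NULL)")
    conn.executemany("INSERT INTO t (bitcol) VALUES (?)", [(f,) for f in flags])

    results = {}

    # 1. Conditional aggregates: both counts from a single statement.
    results["conditional"] = conn.execute(
        "SELECT SUM(CASE WHEN bitcol = 1 THEN 1 ELSE 0 END), "
        "       SUM(CASE WHEN bitcol = 0 THEN 1 ELSE 0 END) FROM t"
    ).fetchone()

    # 2. Two separate statements: the table is queried twice.
    true_count = conn.execute("SELECT COUNT(*) FROM t WHERE bitcol = 1").fetchone()[0]
    false_count = conn.execute("SELECT COUNT(*) FROM t WHERE bitcol = 0").fetchone()[0]
    results["separate"] = (true_count, false_count)

    # 3. GROUP BY: one row per distinct flag value, pivoted here in Python.
    grouped = dict(conn.execute("SELECT bitcol, COUNT(*) FROM t GROUP BY bitcol"))
    results["grouped"] = (grouped.get(1, 0), grouped.get(0, 0))

    # 4. Self-joins with the flag test inside each ON clause.
    results["joins"] = conn.execute(
        "SELECT COUNT(t2.bitcol), COUNT(t3.bitcol) FROM t t1 "
        "LEFT JOIN t t2 ON t1.id = t2.id AND t2.bitcol = 1 "
        "LEFT JOIN t t3 ON t1.id = t3.id AND t3.bitcol = 0"
    ).fetchone()

    conn.close()
    return results
```

Calling flag_counts([1, 0, 1, 1, 0]) returns a dictionary whose four values should all be the same pair of counts. Note that on an empty table the SUM-based form yields NULL rather than 0, which is one practical difference between it and COUNT.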

SELECT [bitCol], count(*)
FROM [table]
GROUP BY [bitCol]
If that column is indexed it is an index scan followed by a stream aggregate.
Doubt you can do better than that

Related

How to improve SQL query performance containing partially common subqueries

I have a simple table tableA in PostgreSQL 13 that contains a time series of event counts. In stylized form it looks something like this:
event_count   sys_timestamp
100           167877672772
110           167877672769
121           167877672987
111           167877673877
...           ...
With both fields defined as numeric.
With the help of answers from stackoverflow I was able to create a query that basically counts the number of positive and negative excess events within a given time span, conditioned on the current event count. The query looks like this:
SELECT t1.*,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count >= t1.event_count+10)
AS positive,
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp AND
t2.sys_timestamp <= t1.sys_timestamp + 1000 AND
t2.event_count <= t1.event_count-10)
AS negative
FROM tableA as t1
The query works as expected, and returns in this particular example for each row a count of positive and negative excesses (range + / - 10) given the defined time window (+ 1000 [milliseconds]).
However, I will have to run such queries for tables with several million (perhaps even 100+ million) entries, and even with about 500k rows, the query takes a very long time to complete. Furthermore, whereas the time frame always remains the same within a given query [but the window size can change from query to query], in some instances I will have to use maybe 10 additional conditions similar to the positive / negative excesses in the same query.
Thus, I am looking for ways to improve the above query, primarily to achieve better performance given the size of the envisaged dataset, and secondarily with more conditions in mind.
My concrete questions:
How can I reuse the common portion of the subquery to ensure that it's not executed twice (or several times), i.e. how can I reuse this within the query?
(SELECT COUNT(*) FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000)
Is there some performance advantage in turning the sys_timestamp field, which is currently numeric, into a timestamp field, and attempting to use any of the PostgreSQL window functions? (Unfortunately I don't have enough experience with this at all.)
Are there some clever ways to rewrite the query, aside from reusing the (partial) subquery, that materially increase the performance for large datasets?
Is it perhaps even faster for these types of queries to run them outside of the database using something like Java, Scala, Python etc. ?

How can I reuse the common portion of the subquery ...?
Use conditional aggregates in a single LATERAL subquery:
SELECT t1.*, t2.positive, t2.negative
FROM tableA t1
CROSS JOIN LATERAL (
SELECT COUNT(*) FILTER (WHERE t2.event_count >= t1.event_count + 10) AS positive
, COUNT(*) FILTER (WHERE t2.event_count <= t1.event_count - 10) AS negative
FROM tableA t2
WHERE t2.sys_timestamp > t1.sys_timestamp
AND t2.sys_timestamp <= t1.sys_timestamp + 1000
) t2;
It can be a CROSS JOIN because the subquery always returns a row. See:
JOIN (SELECT ... ) ue ON 1=1?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Use conditional aggregates with the FILTER clause to base multiple aggregates on the same time frame. See:
Aggregate columns with additional (distinct) filters
event_count should probably be integer or bigint. See:
PostgreSQL using UUID vs Text as primary key
Is there any difference in saving same value in different integer types?
sys_timestamp should probably be timestamp or timestamptz. See:
Ignoring time zones altogether in Rails and PostgreSQL
An index on (sys_timestamp) is the minimum requirement for this. A multicolumn index on (sys_timestamp, event_count) typically helps some more. If the table is vacuumed enough, you get index-only scans from it.
Depending on exact data distribution (most importantly how much time frames overlap) and other db characteristics, a tailored procedural solution may be faster still. It can be done in any client-side language, but a server-side PL/pgSQL solution is superior because it saves all the round trips to the DB server, type conversions, etc. See:
Window Functions or Common Table Expressions: count previous rows within range
What are the pros and cons of performing calculations in sql vs. in your application
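The "tailored procedural solution" mentioned above could look something like the following sketch. It is written client-side in Python purely for illustration (the answer itself recommends doing this server-side in PL/pgSQL). The function name excess_counts and its parameters window and delta are hypothetical; the idea is to sort by timestamp once, locate each row's time window with a binary search, and evaluate every condition in a single pass over that window.

```python
from bisect import bisect_right


def excess_counts(rows, window=1000, delta=10):
    """Count positive and negative excesses per row.

    rows is a list of (event_count, sys_timestamp) pairs. The result is a list
    of (event_count, sys_timestamp, positive, negative) ordered by timestamp.
    """
    ordered = sorted(rows, key=lambda r: r[1])
    timestamps = [ts for _, ts in ordered]
    results = []
    for count, ts in ordered:
        # First index with timestamp > ts, and one past the last index
        # with timestamp <= ts + window.
        start = bisect_right(timestamps, ts)
        end = bisect_right(timestamps, ts + window)
        positive = 0
        negative = 0
        # Single pass over the window; further conditions can be added here.
        for other_count, _ in ordered[start:end]:
            if other_count >= count + delta:
                positive += 1
            if other_count <= count - delta:
                negative += 1
        results.append((count, ts, positive, negative))
    return results
```

The cost is one sort plus, for each row, a binary search and a scan of its window, so it pays off mainly when windows are short relative to the table. Whether it beats the LATERAL query depends on the data distribution, as noted above.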

You have the right idea.
The way to write statements you can reuse in a query is WITH clauses (common table expressions, which Oracle calls subquery factoring). A WITH query that is referenced more than once, like baseQuery below, is computed once by PostgreSQL and can be reused by subsequent subqueries or the final query.
The first step includes creating parent-child detail rows - table multiplied by itself and filtered down by the timestamp.
Then the next step is to reuse that same detail query for everything else.
Assuming that event_count is a primary index or you have a compound index on event_count and sys_timestamp, this would look like:
with baseQuery as
(
SELECT distinct t1.event_count as startEventCount, t1.event_count+10 as pEndEventCount
,t1.event_count-10 as nEndEventCount, t2.event_count as t2EventCount
FROM tableA t1, tableA t2
where t2.sys_timestamp between t1.sys_timestamp AND t1.sys_timestamp + 1000
), posSummary as
(
select bq.startEventCount, count(*) as positive
from baseQuery bq
where t2EventCount between bq.startEventCount and bq.pEndEventCount
group by bq.startEventCount
), negSummary as
(
select bq.startEventCount, count(*) as negative
from baseQuery bq
where t2EventCount between bq.nEndEventCount and bq.startEventCount
group by bq.startEventCount
)
select t1.*, ps.positive, ns.negative
from tableA t1
inner join posSummary ps on t1.event_count=ps.startEventCount
inner join negSummary ns on t1.event_count=ns.startEventCount
Notes:
The distinct for baseQuery may not be necessary based on your actual keys.
The final join is done with tableA but could also use a summary of baseQuery as a separate "with" statement which already ran once. Seemed unnecessary.
You can play around to see what works.
There are other ways of course but this best illustrates how and where things could be improved.
With statements are used in multi-dimensional data warehouse queries because when you have so much data to join with so many tables(dimensions and facts), a strategy of isolating the queries helps understand where indexes are needed and perhaps how to minimize the rows the query needs to deal with further down the line to completion.
For example, it should be obvious that if you can minimize the rows returned in baseQuery or make it run faster (check explain plans), your query improves overall.

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
Id,
(SELECT COUNT(1) AS [A1]
FROM [dbo].Table2 AS [Extent4]
WHERE (Table1.Id = [Extent4].Id2)) AS [C1]
FROM TPO_User
This query structure is usually used by LINQ as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN
(SELECT COUNT(1) AS [A1], [Extent4].Id2
FROM [dbo].Table2 AS [Extent4]
GROUP BY [Extent4].Id2) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in execution of such a query?
And is it worth it to ever have a subquery in your select statement instead of an inner join?

I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the first run loads the table data into memory and the second benefits from it.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.
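Before comparing timings it helps to confirm that the two shapes return the same data. Here is a minimal sketch using Python's standard-library sqlite3 module as a stand-in engine. The table and column names mirror the question, while the helper child_counts, the index name, and the sample data are assumptions. Note that the join form needs the count in its select list, with COALESCE to turn the NULL for parents without children into 0.

```python
import sqlite3


def child_counts(parent_ids, child_refs):
    """Return per-parent child counts from a correlated subquery and from a join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE table2 (id2 INTEGER)")
    # The index suggested in the answer above.
    conn.execute("CREATE INDEX ix_table2_id2 ON table2 (id2)")
    conn.executemany("INSERT INTO table1 (id) VALUES (?)", [(i,) for i in parent_ids])
    conn.executemany("INSERT INTO table2 (id2) VALUES (?)", [(i,) for i in child_refs])

    # Correlated subquery in the select list: evaluated per outer row.
    correlated = conn.execute(
        "SELECT t1.id, (SELECT COUNT(1) FROM table2 t2 WHERE t2.id2 = t1.id) "
        "FROM table1 t1 ORDER BY t1.id"
    ).fetchall()

    # Pre-aggregated derived table, outer-joined back to the parent table.
    joined = conn.execute(
        "SELECT t1.id, COALESCE(c.a1, 0) "
        "FROM table1 t1 "
        "LEFT OUTER JOIN (SELECT id2, COUNT(1) AS a1 FROM table2 GROUP BY id2) c "
        "ON c.id2 = t1.id ORDER BY t1.id"
    ).fetchall()

    conn.close()
    return correlated, joined
```

With one outer row and a large table2, the correlated form only touches the matching index entries, whereas the derived table aggregates everything first unless the optimizer pushes the predicate down. That is the scenario described above.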

In my understanding:
the FROM clause is the definition of the target.
the SELECT clause is the projection (line-by-line) definition.
So the FROM clause loads the data you need into memory, and after that the projection is made on each line of your SELECT statement.
So if you run a query (or call a function...) in the SELECT clause, you are saying that you want this sub-job to be done for each line of your projection. Seems quite heavy ;)
A short source on the order in which an SQL query runs: https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and please do not hesitate to correct me if I am wrong)
(And if I remember well there is now an automatic feature to optimize queries in sql server. I think it will do the correction by itself, should it not?)

Time based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by summing records with an older TheDate value. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the 2nd instance of Table1 is for accumulating the counts of the records which have TheDate no later than that of a given record in MainQuery (1st instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is done in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense, it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practice or considerations in choosing an approach for time-based accumulation?
I've posted this to Microsoft Answers and Stack Exchange.

In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.

First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, and the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated version (the join with GROUP BY) has to aggregate all the data in the table at once. The correlated version only has to aggregate the matching rows, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). Taking the cost of aggregating n rows as proportional to n * log2(n), that is 100 * 10 * log2(10), about 3,300, versus 1,000 * log2(1,000), about 10,000. That means that 100 aggregations of 10 records take less time than 1 aggregation of 1,000 records.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.

Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matches (say) three records in Table1. You will then get a single SUM for all of them. The problem is, each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.
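The duplicate-row caveat can be reproduced with a small sketch. It uses Python's standard-library sqlite3 module rather than Access, so it only illustrates the counting logic, not Access's behaviour. The helper running_counters and the sample rows are assumptions; the two queries follow Method 1 and Method 2 from the question, trimmed to return just the counter.

```python
import sqlite3

# Method 1: correlated subquery, one output row per row of Table1.
CORRELATED = """
SELECT (SELECT SUM(1) FROM Table1 InnerQuery
        WHERE InnerQuery.Field1 = MainQuery.Field1
          AND InnerQuery.Field2 = MainQuery.Field2
          AND InnerQuery.TheDate <= MainQuery.TheDate) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.TheDate
"""

# Method 2: self-join plus GROUP BY, one output row per distinct group.
JOINED = """
SELECT SUM(1) AS RunningCounter
FROM Table1 MainQuery
INNER JOIN Table1 InnerQuery
   ON InnerQuery.Field1 = MainQuery.Field1
  AND InnerQuery.Field2 = MainQuery.Field2
  AND InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1, MainQuery.Field2, MainQuery.Field3, MainQuery.TheDate
ORDER BY MainQuery.TheDate
"""


def running_counters(rows):
    """Return the counters produced by Method 1 and Method 2 for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (Field1 TEXT, Field2 TEXT, Field3 TEXT, TheDate TEXT)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?, ?)", rows)
    correlated = [r[0] for r in conn.execute(CORRELATED)]
    joined = [r[0] for r in conn.execute(JOINED)]
    conn.close()
    return correlated, joined
```

With three identical rows followed by one later row, Method 1 yields a counter per row, while Method 2 collapses the three duplicates into one group and counts each of the 3 x 3 matching pairs, which is the overcount described above.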

Oracle optimization -- weird execution plan to left join an uncorrelated subquery

I wrote this query to tie product forecast data to historical shipment data in an Oracle star schema database and the optimizer did not behave in the way that I expected, so I am kind of curious as to what is going on.
Essentially, I have a bunch of dimension tables that will be consistent for both the forecast and the sales fact tables but the fact tables are aggregated at a different level, so I set them up as two subqueries and roll them up so I can tie them together (query example below.) In this case, I want all of the forecast data but only the sales data that matches.
The odd thing is that if I use either of the subqueries by themselves, they each seem to behave the way I would expect and each returns in less than a second (using the same filters -- I tested by just removing one or the other subquery and changing the alias).
Here is an example of the query structure -- I kept it as generic as I could, so there may be a few typos from changing it:
SELECT
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE,
SUM(NVL(FIRST_SUBQUERY.VALUE,0)) VALUE1,
SUM(NVL(SECOND_SUBQUERY.VALUE,0)) VALUE2
FROM
TIME_DIMENSION,
LOCATION_DIMENSION SOURCE_DIMENSION,
LOCATION_DIMENSION DESTINATION_DIMENSION,
PRODUCT_DIMENSION,
(SELECT
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY,
SUM(FORECAST_FACT.VALUE) AS VALUE
FROM FORECAST_FACT
WHERE [FORECAST_FACT FILTERS HERE]
GROUP BY
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY) FIRST_SUBQUERY
LEFT JOIN
(SELECT
--This is just as an example offset
(LAST_YEAR_FACT.TIME_KEY + 52) TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY,
SUM(LAST_YEAR_FACT.VALUE) AS VALUE
FROM LAST_YEAR_FACT
WHERE [LAST_YEAR_FACT FILTERS HERE]
GROUP BY
LAST_YEAR_FACT.TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY) SECOND_SUBQUERY
ON
FIRST_SUBQUERY.TIME_KEY = SECOND_SUBQUERY.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SECOND_SUBQUERY.SOURCE_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = SECOND_SUBQUERY.DESTINATION_KEY
--I also tried to tie the last_year subquery to the dimension tables here
WHERE
FIRST_SUBQUERY.TIME_KEY = TIME_DIMENSION.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SOURCE_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = DESTINATION_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.PRODUCT_KEY = PRODUCT_DIMENSION.PRODUCT_KEY
--I also tried, separately, to tie the last_year subquery to the dimension tables here
AND TIME_DIMENSION.WEEK = 'VALUE'
AND SOURCE_DIMENSION.SOURCE_CODE = 'VALUE'
AND DESTINATION_DIMENSION.REGION IN ('VALUE', 'VALUE')
AND PRODUCT_DIMENSION.CLASS_CODE = 'VALUE'
GROUP BY
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE
Essentially, when I run either subquery independently it will utilize the indexes and search only a specific range of the specific partition, whereas with the left join it always does a full table scan on one of the fact tables. What seems to be happening is Oracle is applying the dimension table filters to only the first subquery -- and thus to do the left join it first needs to scan the entire sales table -- even if I explicitly tie and filter the values twice, instead of relying on the implicit filtering... I tried that. Am I thinking about this wrong? To me, the optimizer should use the indexes on both of the fact tables to filter each by the values in the WHERE clause and then left join the resulting subset.
I realize that I could simply add the filters to each of the subqueries, or set this up as a union of two independent queries, but I am curious as to what exactly is going on in terms of the optimization engine -- I can post the execution plan, if that would help.
Thanks!

Be sure that the tables are all analysed. Do it again. The optimizer uses those statistics to calculate its execution plan. In cases where Oracle really chooses a wrong plan, your workaround is to force the optimizer with hints /*+ ... */, specifying the use of indexes, join order, etc.

Creating view, SQL query performance

I am trying to create a view, but a select statement from this view takes more than 15 seconds. How can I make it faster? My query for the view is below.
create view Summary as
select distinct A.Process_date,A.SN,A.New,A.Processing,
COUNT(case when B.type='Sold' and A.status='Processing' then 1 end) as Sold,
COUNT(case when B.type='Repaired' and A.status='Processing' then 1 end) as Repaired,
COUNT(case when B.type='Returned' and A.status='Processing' then 1 end) as Returned
from
(select distinct M.Process_date,M.SN,max(P.enter_date) as enter_date,M.status,
COUNT(case when M.status='New' then 1 end) as New,
COUNT(case when M.status='Processing' and P.cn is null then 1 end) as Processing
from DB1.dbo.Item_details M
left outer join DB2.dbo.track_data P on M.SN=P.SN
group by M.Process_date,M.SN,M.status) A
left outer join DB2.dbo.track_data B on A.SN=B.SN
where A.enter_date=B.enter_date or A.enter_date is null
group by A.Process_date,A.New,A.Processing,A.SN
After creating this view, my select query is:
select distinct process_date,sum(New),sum(Processing),sum(sold),sum(repaired),sum(returned) from Summary where month(process_date)=03 and year(process_date)=2011
Please suggest what changes would make the query perform faster.
Thank you
ARB

It is hard to give advice without seeing the actual data and the structure of the tables. I would rewrite the query keeping in mind these principles:
Use inner join instead of outer join if possible.
Get rid of the CASE expressions inside the COUNT function. Build the query so that you use conditions in the WHERE clause, not in COUNT.
Try not to use aggregated values in GROUP BY. Currently you use the aggregated values New and Processing for grouping. Group by existing table values if possible.
If the query gets too complicated, break it into smaller queries and combine the results in the final query. Writing a stored procedure may help in this case.
I hope this helps.

For tuning a database query, I shall add a few items to what @Davyd has already listed:
Look at the tables and the indexing on those tables. Putting the right indexes in place and avoiding the wrong ones always speeds up the query.
Is there anything in the WHERE condition that is not part of any index? At times we put an index on a column and then use a cast or convert on that column in the query, so the underlying index is not effective. You may consider setting the index on the cast/convert of the column.
Look at the normal form conformity or over-normalisation.
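The point about wrapping an indexed column in a function applies to the month(process_date) and year(process_date) filter in the question. Below is a minimal sketch using Python's standard-library sqlite3 module, with strftime standing in for MONTH() and YEAR(). The table summary, its columns, the index name, and the helper march_2011_totals are all hypothetical. It shows a half-open date range returning the same total as the function-wrapped filter, and returns each query's plan text so the access paths can be compared.

```python
import sqlite3


def march_2011_totals(rows):
    """Return {name: (total, plan_text)} for two equivalent March 2011 filters."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table holding ISO dates (YYYY-MM-DD) and a count per row.
    conn.execute("CREATE TABLE summary (process_date TEXT, new_count INTEGER)")
    conn.execute("CREATE INDEX ix_summary_date ON summary (process_date)")
    conn.executemany("INSERT INTO summary VALUES (?, ?)", rows)

    queries = {
        # Functions applied to the column: the plain index cannot be used to seek.
        "wrapped": (
            "SELECT SUM(new_count) FROM summary "
            "WHERE strftime('%m', process_date) = '03' "
            "AND strftime('%Y', process_date) = '2011'"
        ),
        # Bare column compared against a half-open range.
        "ranged": (
            "SELECT SUM(new_count) FROM summary "
            "WHERE process_date >= '2011-03-01' AND process_date < '2011-04-01'"
        ),
    }

    out = {}
    for name, sql in queries.items():
        total = conn.execute(sql).fetchone()[0]
        # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
        plan = " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))
        out[name] = (total, plan)
    conn.close()
    return out
```

The exact plan wording varies between SQLite versions, and other engines report plans differently, so treat the plan text as something to inspect rather than rely on.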
Good luck.

If you are using PostgreSQL, I suggest you use a tool like "http://explain.depesz.com/" in order to see more clearly what part of your query is slow. Depending on what you get, you could either optimize your indexes, or rewrite part of your query. If you are using another database, I'm sure a similar tool exists.
If none of these ideas help, the final solution would be to create a "materialized query" (a materialized view in PostgreSQL, an indexed view in SQL Server). There is plenty of information on the web regarding this.
Good luck.
