Sum vs count(distinct): huge performance difference on a query - SQL

I have two similar queries whose only difference is that one does a sum of a column and the other does a count(distinct) of another column.
The first one runs in seconds (17 s) and the other one never finishes (1 hour and counting). I've looked at the plan for the count query and it shows huge costs; I don't understand why.
They hit the exact same views.
Why is this happening, and what can I do?
The one that is running fine:
select a11.SOURCEPP SOURCEPP,
a12.DUMMY DUMMY,
a11.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
sum(a11.REVENUE) WJXBFS1
from CLIENT_SOURCE_DATA a11
join DUMMY_V a12
on (a11.SOURCEPP = a12.SOURCEPP)
join SIM_INFO a13
on (a11.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a11.SIM_NAME in ('ETS40'))
group by a11.SOURCEPP,
a12.DUMMY,
a11.SIM_NAME,
a13.THEORETICAL
The one that doesn't finish:
select a12.SOURCEPP SOURCEPP,
a12.SIM_NAME SIM_NAME,
a13.THEORETICAL THEORETICAL,
count(distinct a12.CLIENTID) WJXBFS1
from CLIENT_SOURCE_DATA a12
join SIM_INFO a13
on (a12.SIM_NAME = a13.SIM_NAME)
where (a13.THEORETICAL in (0)
and a12.SIM_NAME in ('ETS40'))
group by a12.SOURCEPP,
a12.SIM_NAME,
a13.THEORETICAL

DISTINCT is very slow when there are many distinct values: the database needs to sort/hash and store all the values (or sets) in memory or in the temporary tablespace, and it also makes parallel execution much harder to apply.
If there is a way to rewrite the query without DISTINCT, you should definitely do it.

As answered above, DISTINCT has to do a table scan and then hash, aggregate, and sort the data into sets. This increases the time it takes across the board (CPU, disk access, and the time to return the data). I would recommend trying a subquery instead, if possible. That limits the aggregation to only the data you want to be distinct, instead of having the engine perform it on all of the data. Here's an article on how this works in practice, with an example.
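For illustration, a hedged rewrite of the slow query along those lines, de-duplicating CLIENTID in a subquery before the join and only then counting (it assumes SIM_INFO has at most one row per SIM_NAME, so the join does not multiply rows):

select t.SOURCEPP SOURCEPP,
       t.SIM_NAME SIM_NAME,
       a13.THEORETICAL THEORETICAL,
       count(*) WJXBFS1
from (
       -- de-duplicate CLIENTID first, on the smallest possible row set
       select distinct a12.SOURCEPP, a12.SIM_NAME, a12.CLIENTID
       from CLIENT_SOURCE_DATA a12
       where a12.SIM_NAME in ('ETS40')
     ) t
join SIM_INFO a13
  on (t.SIM_NAME = a13.SIM_NAME)
where a13.THEORETICAL in (0)
group by t.SOURCEPP,
         t.SIM_NAME,
         a13.THEORETICAL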

Related

SQL Server query plan sort operator cost spikes when column count increased

I use a ROW_NUMBER() function inside a CTE, which causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but it recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue, but it seems my query has passed some threshold and now costs so much that the execution time has doubled from 1 hour to 2+ hours.
I can't figure out what caused the spike in cost, and it's having an impact on the business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
AND ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (all the others look the same, except this operator takes much less time, 40-50 mins):
I did a comparison of three executions, each with a different column count in the INSERT INTO ... SELECT clause. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
- It takes more than twice as long to execute as the other two executions
- It performs more logical and physical reads and scans
- It uses more CPU time
- It reads the most from tempdb
- The increase in execution time is not proportional to the increase in reads or the other metrics mentioned
There is a level 8 memory spill (the sort spills to tempdb). I have tried updating statistics, but it didn't help, and all versions of the query suffer the same problem, so the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around, but I would be grateful if anyone could point me in the direction of what to check or try next.
P.S. The table it reads from is a heap and the table it joins to is indexed. The source table needs to stay a heap, otherwise inserts into it would take too long and the problem would just be kicked down the road.
Also, when I say I added more columns, I mean in the SELECT FROM versioned_events statement. Those columns are replaced with "...." in the example above.
UPDATE
Using a temp table halved the execution time at the high column count that caused the issue, but it actually takes longer at the reduced column count. That goes back to the idea that a threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.
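For reference, a rough sketch of the temp-table variant described in the update, using the same names as the query above (the '....' placeholders stand for the same column lists as in the question):

-- 1) Materialize the CTE once instead of letting it be expanded twice
SELECT [event].*
      ,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
            ELSE [event].[involvement_id]
       END AS [derived_involvement_id]
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version]
                          ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] AS [inv]
       ON [event].[service_delivery_id] = [inv].[service_delivery_id]
      AND [inv].[role_type_code] = 't'
      AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL
  AND (event.handle_space IS NULL
       OR (event.handle_space NOT LIKE 'x%' AND event.handle_space NOT LIKE 'y%'));

-- 2) Run the INSERT against the temp table instead of the CTE
INSERT INTO db.schema.table ( .... )
SELECT ....
FROM #versioned_events AS [event]
INNER JOIN (
    SELECT DISTINCT service_delivery_id, derived_involvement_id
    FROM #versioned_events
    WHERE latest_version = 1
      AND [timestamp] > '2022-02-07 14:18:09.777610 +00:00'
) AS [delta_events]
       ON COALESCE([event].[service_delivery_id], 'NULL') = COALESCE([delta_events].[service_delivery_id], 'NULL')
      AND COALESCE([event].[derived_involvement_id], 'NULL') = COALESCE([delta_events].[derived_involvement_id], 'NULL')
WHERE [event].[latest_version] = 1;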

BigQuery join too slow for a table of small size

I have a table with the following details:
- Table size: 39.6 MB
- Number of rows: 691,562
- 2 columns: contact_guid STRING, program_completed STRING
- Column 1 is a UUID-like string, around 30 characters long
- Column 2 is a string around 50 characters long
I am trying this query:
#standardSQL
SELECT
  cp1.contact_guid AS p1,
  cp2.contact_guid AS p2,
  COUNT(*) AS cnt
FROM
  `data.contact_pairs_program_together` cp1
JOIN
  `data.contact_pairs_program_together` cp2
ON
  cp1.program_completed = cp2.program_completed
WHERE
  cp1.contact_guid < cp2.contact_guid
GROUP BY
  cp1.contact_guid,
  cp2.contact_guid
HAVING cnt > 1
ORDER BY cnt DESC
Time taken to execute: 1,200 seconds.
I know I am doing a self join, and the best practices say to avoid self joins.
My questions:
I feel this table is too small (in MB) for BigQuery, so why is it taking so much time? And what does a "small table" mean for BigQuery in the context of a join, in terms of number of rows and size in bytes?
Is the number of rows too large? 700k^2 is on the order of 10^11 row pairs during the join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much about how big a table can be for a join or how long it can be expected to run. How do we estimate a rough execution time?
Execution Details:
As shown in the screenshot you provided - you are dealing with an exploding join.
In this case step 3 takes 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query didn't provision enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: either avoid exploding joins, or accept that exploding joins will take a long time to run. But as explained in the question - you already knew that!
How about generating all the extra rows in one operation (do the join, materialize it) and then running another query to process the 459 million rows? The first query will be slow for the reasons explained, but the second one will run quickly, as BigQuery will provision enough resources to deal with that amount of data.
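A minimal sketch of that two-step approach using DDL (the intermediate table name here is made up for illustration; writing the first query to a destination table via the job configuration works just as well):

#standardSQL
-- Step 1: materialize the exploded join (slow, but it only has to run once)
CREATE OR REPLACE TABLE `data.contact_pairs_exploded` AS
SELECT
  cp1.contact_guid AS p1,
  cp2.contact_guid AS p2
FROM `data.contact_pairs_program_together` cp1
JOIN `data.contact_pairs_program_together` cp2
  ON cp1.program_completed = cp2.program_completed
WHERE cp1.contact_guid < cp2.contact_guid;

-- Step 2: aggregate the materialized rows (fast - BigQuery provisions for the real row count)
SELECT p1, p2, COUNT(*) AS cnt
FROM `data.contact_pairs_exploded`
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC;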
I agree with the suggestions below:
- see if you can rephrase your query using analytic functions (by Tim)
- using analytic functions would be a much better idea (by Elliott)
Below is how I would write it:
#standardSQL
SELECT
p1, p2, COUNT(1) AS cnt
FROM (
SELECT
contact_guid AS p1,
ARRAY_AGG(contact_guid) OVER(my_win) guids
FROM `data.contact_pairs_program_together`
WINDOW my_win AS (
PARTITION BY program_completed
ORDER BY contact_guid DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try it and let us know if it helped.

fuzzy merge using SAS proc sql

I have two files that I would like to match by name, and I would like to account for spelling errors by using the COMPGED function. The names have been thoroughly cleaned, and I have no other useful match variables that could be used to reduce the search space.
The files name1 and name2 each have over 500k rows, and after 11 hours this code still has not finished.
Is there some way I can code this more efficiently, or is my issue purely due to computing power?
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on COMPGED(a.match_name, b.match_name) < 200;
quit;
There is a parameter of the COMPGED function that you didn't use, and it can improve the performance (maybe 6 or 7 hours instead of 11).
This parameter is the cutoff: if you choose 300 as the cutoff, then as soon as the distance between the words reaches 300, SAS stops the calculation and returns 300.
So here in your case, you should choose a cutoff above 200 (and NOT >= 200).
The COMPLEV function is also faster than COMPGED. If you don't need the exact cost of each operation (set with the CALL COMPCOST routine), you can use it instead of COMPGED and save minutes or maybe hours of computation. COMPLEV has the cutoff option as well.
Hope this helps!
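As a sketch, the cutoff is simply passed as the third argument to COMPGED (250 here, i.e. anything above the 200 threshold used in the filter):

proc sql;
  create table name1_name2_Fuzzy as
  select a.*, b.*
  from name1 as a
  inner join name2 as b
    /* third argument is the cutoff: SAS stops computing once the distance reaches it */
    on COMPGED(a.match_name, b.match_name, 250) < 200;
quit;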
Working from memory here, but if the first character of each match_name is different, the COMPGED will be over 200, true? So you wouldn't consider them a match?
If so, make an indexed column with the first character of match_name in each table, and join on that before the COMPGED. That should eliminate most of the non-matches, so far fewer COMPGED calculations will be needed.
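Building on that, a hedged sketch that blocks on the first character before running COMPGED (the blk column and the *_b table names are made up, and it assumes, as above, that names differing in their first character can never score under 200):

proc sql;
  /* add a one-character blocking key to each table and index it */
  create table name1_b as
    select *, substr(match_name, 1, 1) as blk from name1;
  create table name2_b as
    select *, substr(match_name, 1, 1) as blk from name2;
  create index blk on name1_b(blk);
  create index blk on name2_b(blk);

  create table name1_name2_Fuzzy as
  select a.*, b.*
  from name1_b as a
  inner join name2_b as b
    on a.blk = b.blk                                    /* cheap equality block first */
   and COMPGED(a.match_name, b.match_name, 250) < 200;  /* fuzzy match on far fewer pairs */
quit;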

SQL Server aggregate performance

I am wondering whether SQL Server knows to 'cache', if you like, aggregates within a query when they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the SUM on the field and just divide the result by 12 for the second column? Or would it run the SUM again and then divide it by 12?
Thanks
It calculates it once.
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
|--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = the SUM
Expr1003 = some internal handling for the "no rows" case etc., but it is basically Expr1011
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
Good question. I think the answer is no, it doesn't cache it.
I ran a test query with around 3,000 counts in it, and it was much slower than one with only a few. I still want to test whether the query would be just as slow selecting plain columns.
Edit: OK, I just tried selecting a large number of columns versus just one, and the number of columns (when thousands are returned) does affect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. If push comes to shove, you could always save the outcome to a variable and do the math after the fact.
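If you do go the variable route, a minimal T-SQL sketch using the MyTable/Price example from the other answer (the variable's type is an assumption; match it to the actual type of Price):

DECLARE @total decimal(18, 2);   -- assumed type; use Price's real type

-- compute the aggregate once
SELECT @total = SUM(Price)
FROM MyTable;

-- reuse it as often as needed without touching the table again
SELECT @total      AS total,
       @total / 12 AS monthly_average;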

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the ORDER BY line is uncommented, in which case the same query takes forever to run (185 ms without the sort compared to 25 minutes with it). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sort only adds 10 ms to a dumb fast count(*), versus a lot longer for a real sort. That's a very neat feature; I use it a lot.
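If you want to verify this on your own data, EXPLAIN ANALYZE shows it directly: with the LIMIT present, the Sort node reports a "top-N heapsort" sort method using only a small amount of memory (the table and column names below are placeholders):

-- look for: Sort Method: top-N heapsort  Memory: ...kB
explain analyze
select * from some_big_table order by x limit 10;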
OK, without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in each table (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we have the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means every row of A is matched against every row of B (so this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, CPU-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, Postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops once the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
- Add columns which will cache:
  ST_Centroid(A.wkb_geometry)
  ST_Buffer((B.wkb_geometry), 1000)
  There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
- Add columns of type BOX which will cache:
  the bounding box of ST_Centroid(A.wkb_geometry)
  the bounding box of ST_Buffer((B.wkb_geometry), 1000)
- Add GiST indexes on the BOXes.
- Add a box overlap test (using the && operator), which will use the index.
- Keep your ST_Within, which will act as a final filter on the rows that pass.
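A hedged sketch of the caching approach above, using the table and column names from the query (the new column and index names are made up; depending on your PostGIS version you may prefer AddGeometryColumn, and the trigger that keeps the cached columns up to date is not shown):

-- cache the expensive geometry computations once per row
alter table global_site     add column centroid_geom geometry;
alter table na_utility_line add column buffer_geom   geometry;

update global_site     set centroid_geom = ST_Centroid(wkb_geometry);
update na_utility_line set buffer_geom   = ST_Buffer(wkb_geometry, 1000);

-- GiST indexes give the planner a fast bounding-box pre-filter
create index global_site_centroid_gix   on global_site     using gist (centroid_geom);
create index na_utility_line_buffer_gix on na_utility_line using gist (buffer_geom);

-- in the query: indexed overlap test first, exact test kept as the final filter
-- ... and A.centroid_geom && B.buffer_geom
--     and ST_Within(A.centroid_geom, B.buffer_geom)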
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours, though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
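For reference, the index creation and the periodic rebuild are one statement each (the index name here is illustrative):

-- create the index once, then rebuild it during off-peak hours if it degrades over time
create index global_site_area_acre_idx on global_site (area_acre);
reindex index global_site_area_acre_idx;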
If the A.area_acre field is not indexed, that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes, ensure your db is being vacuumed, increase the shared buffers for your db install, and review the work_mem setting.
The first thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know PostgreSQL that well, but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak hours.
I am not familiar with the PostgreSQL optimizer, but it sounds like what happens when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field may very well not help with the sorting (ORDER BY), depending on how the result set is built. In theory, the index could be used to generate the result set by traversing the global_site table in area_acre order; in that case the results would come out in the desired order (and the scan could stop after producing 11 rows). If the results are not generated in that order (and it seems they may not be), then that index will not help in sorting the results.
One thing you might try is to remove the CROSS JOIN from the query, as sketched below. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause joins the two tables (via ST_Within), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for the fields being filtered), you could play a bit of a guessing game with the query. One condition that stands out is area_acre >= 500, which means the query engine considers every row that meets it even though only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue, where somevalue is the guessing part that would need adjustment to make sure you still get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
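For reference, the CROSS JOIN / WHERE form rewritten as an explicit INNER JOIN (same predicates, just moved into the ON clause; whether the planner produces a different plan is exactly what the test would show):

select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A
inner join na_utility_line B
  -- the spatial predicate becomes the explicit join condition
  on ST_Within(ST_Centroid(A.wkb_geometry), ST_Buffer(B.wkb_geometry, 1000))
where (A.power_peak between 1.0 and 100.0)
  and A.area_acre >= 500
  and A.solar_avg >= 5.0
  and A.pc_num <= 1000
  and (A.fips_level1 = '06' and A.fips_country = 'US' and A.fips_level2 = '025')
  and B.volt_mn_kv >= 69
  and B.fips_code like '%US06%'
  and B.status = 'active'
order by A.area_acre
offset 0 limit 11;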
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
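For example, hedged sketches of expression-based GiST indexes on the geometry functions used in the join (the index names are made up; ST_Centroid and ST_Buffer are immutable, so they can be indexed directly):

create index global_site_centroid_expr_gix
    on global_site using gist (ST_Centroid(wkb_geometry));

create index na_utility_line_buffer_expr_gix
    on na_utility_line using gist (ST_Buffer(wkb_geometry, 1000));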