PostgreSQL Distinct Sort For Huge Amount of Data

Here is my query:
explain(buffers, analyze) SELECT DISTINCT e.eventid, e.objectid, e.clock, e.ns, e.name, e.severity
FROM EVENTS e, functions f, items i, hosts_groups hg
WHERE e.source='0' AND e.object='0' AND NOT EXISTS
(SELECT NULL FROM functions f, items i, hosts_groups hgg
LEFT JOIN rights r ON r.id=hgg.groupid AND r.groupid IN (12, 13, 14, ...)
WHERE e.objectid=f.triggerid AND f.itemid=i.itemid AND i.hostid=hgg.hostid
GROUP BY i.hostid HAVING MAX(permission)<2 OR MIN(permission) IS NULL OR MIN(permission)=0)
AND e.objectid=f.triggerid AND f.itemid=i.itemid AND i.hostid=hg.hostid
AND hg.groupid IN (1, 2, 3, ...)
AND e.value=1
ORDER BY e.eventid DESC;
You can find the related execution plan here.
As you can see, it spills to disk, because the default value of work_mem is 8 MB. Then I set work_mem to 1 GB for my session and ran the query again. The new execution plan is here. Now it does the quicksort in memory, but the execution time is still 779213.763 ms.
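(For reference, the session-level change was along these lines:)
SET work_mem = '1GB';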
This query is auto-generated by a third-party tool, but I assume we can change it.
Doing a distinct sort over ~602k rows is insane, which is why I want to add an additional filter on the clock column. Still, I want to ask: are there any other options to decrease the execution time of this query?
Specifications for the database server:
$ lscpu
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Memory: 96 GB
The relevant database settings:
max_parallel_workers_per_gather
---------------------------------
4
max_worker_processes
----------------------
16
max_parallel_workers
----------------------
16
Thanks!

It looks like the core of the problem is that the planner is not using a hashed subplan for the NOT EXISTS (where it runs the subquery in bulk once and memoizes the results in a hash), but rather is running it parameterized for each tuple in a loop. Usually this is because the planner thinks it would take too much memory to hash the results, but in this case I think it is just because it cannot figure out how to analyze the GROUP BY ... HAVING.
You can guide it down the (presumably) correct path here by replacing the NOT EXISTS (...) with:
AND e.objectid NOT IN (
    SELECT triggerid
    FROM functions f, items i, hosts_groups hgg
    LEFT JOIN rights r ON r.id=hgg.groupid AND r.groupid IN (12, 13, 14 /*...*/)
    WHERE f.itemid=i.itemid AND i.hostid=hgg.hostid
    GROUP BY triggerid, i.hostid
    HAVING MAX(permission)<2 OR MIN(permission) IS NULL OR MIN(permission)=0
)
But before trying this, I might run just the inner query there by itself to see how long it takes and how many rows it returns.
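For example, a quick way to time it on its own (just the subquery from the rewrite above, wrapped in EXPLAIN (ANALYZE, BUFFERS)):
EXPLAIN (ANALYZE, BUFFERS)
SELECT triggerid
FROM functions f, items i, hosts_groups hgg
LEFT JOIN rights r ON r.id=hgg.groupid AND r.groupid IN (12, 13, 14 /*...*/)
WHERE f.itemid=i.itemid AND i.hostid=hgg.hostid
GROUP BY triggerid, i.hostid
HAVING MAX(permission)<2 OR MIN(permission) IS NULL OR MIN(permission)=0;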
If this ends up working, it might be worthwhile to investigate what it would take to make the planner smart enough to do this conversion on its own.

Related

SQL Server query plan sort operator cost spikes when column count increased

I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but has recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue. However, it seems my query has passed some threshold and now costs so much that the query execution time has doubled from 1 hour to 2+ hours.
I can't figure out what has caused the spike in cost and it's having an impact on business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
AND ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (the plans for the other versions look the same, except this operator takes much less time, 40-50 mins):
I did a comparison of three executions, each with a different column count in the INSERT INTO ... SELECT FROM clause. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
- It takes more than twice as long to execute as the other two executions
- It performs more logical & physical reads and scans
- It has more CPU time
- It reads the most from tempdb
- The increase in execution time is not proportional to the increase in reads or the other mentioned metrics
It is true that a level 8 memory spill is happening. I have tried updating statistics, but it didn't help, and all versions of the query suffer the same problem, so the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around, but I would be grateful if anyone could point me in the right direction for what to check or try next.
P.S. The table it reads from is a heap, and the table it joins to is indexed. The heap table needs to stay a heap, otherwise inserts into it will take too long and the problem is just kicked down the road.
Also, when I say I added more columns, I mean in the SELECT ... FROM versioned_events statement. The columns are replaced with "...." in the example above.
UPDATE
Using a temp table halved the execution time when the column count is at the high number that caused the issue, but it actually takes longer with a reduced column count. That goes back to the idea that a threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.
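A sketch of what that temp-table change can look like (assuming SELECT ... INTO a local #temp table, reusing the CTE's definition; this is illustrative, not the exact production change):
SELECT [event].*
      ,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
            ELSE [event].[involvement_id]
       END AS [derived_involvement_id]
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version]
                          ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] AS [inv]
       ON [event].[service_delivery_id] = [inv].[service_delivery_id]
      AND [inv].[role_type_code] = 't'
      AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL
  AND (event.handle_space IS NULL
       OR (event.handle_space NOT LIKE 'x%'
           AND event.handle_space NOT LIKE 'y%'));

-- the INSERT ... SELECT then reads from #versioned_events (both references) instead of the CTE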

SQL Query takes a long time when filtering recent rows

I have this SQL query, but I've found that it can take up to 11 seconds to run. I'm really confused because when I change the date selection to a 2018 date, it returns instantly.
Here's the query:
select
cv.table3ID, dm.Column1 ,dm.Column2, mm.Column1,
convert(varchar, cv.Date, 107) as Date,
mm.table2ID, dm.table1ID, mm.Column2,
count(ctt.table4ID) as Total
from
table1 dm
inner join
table2 mm on mm.table2ID = dm.table1ID
inner join
table3 cv on cv.table3ID = mm.table2ID
left join
table4 ct on ct.table4CVID = cv.table3ID
inner join
table4 ctt on ctt.table4MMID = mm.table2ID
where
ctt.table4Date >= '2019-01-19'
and ct.table4CVID is null
and dm.Column1 like '%Albert%'
and cv.Column1 = 39505
and cv.Status = 'A'
group by
cv.table3ID, dm.Column1 ,dm.Column2, mm.Column1,
cv.Date, mm.table2ID, dm.table1ID, mm.Column2
I've found that when I execute that query with ctt.table4Date >= '2018-01-19', the response is immediate. But with '2019-01-19', it takes 11 seconds.
Initially, when I found that the query took 11 seconds, I thought it had to be an indexing issue, but I'm not sure anymore if it's got to do with the index, since it executes well for an older date.
I've looked at the execution plan for the query with the different dates and they look completely different.
Any thoughts on why this might be happening? Does it have anything to do with updating the statistics?
[Update]
This image below is the comparison of the execution plan between 2018 and 2019 for table4 ctt. According to the execution plan, this takes up 43% of the operator cost in 2018 and 45% in 2019.
Execution Plan comparison of table4 ctt 2019 and 2018. Top is 2019, bottom is 2018
The image here is the comparison of the execution plan again for table4 as ct. Same here, top is 2019 and bottom is 2018.
Execution plan of table4 ct comparison 2019 and 2018. Top is 2019, bottom is 2018
[Update 2]
Here are the SQL Execution Plans:
When using '2018-01-19' as the date: https://www.brentozar.com/pastetheplan/?id=SyUh8xXQV
When using '2019-01-19' as the date: https://www.brentozar.com/pastetheplan/?id=rkELW1Q7V
The problem is most likely the fact that more rows are being returned from the other tables. The clustered index scan comparison that you linked in your [update] just shows the clustered index seek.
You do, however, need to realise that the index seek is being invoked 144 times, and the actual number of rows read is in the 8-digit range, which is what is causing the slow response.
I'm guessing that when this works fine for you, the actual number of executions on this table is 1. The 144 executions are killing you here, given the poor seek predicates. If you know the query plan that works for you, and the indexes are already present to back it up, you can use FORCESEEK table hints and give explicit hints to join in a particular order.
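As a purely illustrative sketch (not taken from your actual plans), those hints could be applied to the query from the question roughly like this; test carefully before relying on them:
-- illustrative only: forces the written join order and a seek on the ctt access path
-- (assumes a suitable index on table4 covering table4MMID / table4Date exists)
select
    cv.table3ID, dm.Column1, dm.Column2, mm.Column1,
    convert(varchar, cv.Date, 107) as Date,
    mm.table2ID, dm.table1ID, mm.Column2,
    count(ctt.table4ID) as Total
from
    table1 dm
inner join
    table2 mm on mm.table2ID = dm.table1ID
inner join
    table3 cv on cv.table3ID = mm.table2ID
left join
    table4 ct on ct.table4CVID = cv.table3ID
inner join
    table4 ctt with (forceseek) on ctt.table4MMID = mm.table2ID
where
    ctt.table4Date >= '2019-01-19'
    and ct.table4CVID is null
    and dm.Column1 like '%Albert%'
    and cv.Column1 = 39505
    and cv.Status = 'A'
group by
    cv.table3ID, dm.Column1, dm.Column2, mm.Column1,
    cv.Date, mm.table2ID, dm.table1ID, mm.Column2
option (force order);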
Edit
Took a look at the shared plans: changing the date to 2018 is faster for you because SQL Server switches to using a Hash Match in place of a nested loop join, given the amount of data being processed.

BigQuery join too slow for a table of small size

I have a table with the following details:
- Table Size 39.6 MB
- Number of Rows 691,562
- 2 columns : contact_guid STRING, program_completed STRING
- column 1 is a uuid-like string, around 30 characters long
- column 2 is a string, around 50 characters long
I am trying this query:
#standardSQL
SELECT
cp1.contact_guid AS p1,
cp2.contact_guid AS p2,
COUNT(*) AS cnt
FROM
`data.contact_pairs_program_together` cp1
JOIN
`data.contact_pairs_program_together` cp2
ON
cp1.program_completed=cp2.program_completed
WHERE
cp1.contact_guid < cp2.contact_guid
GROUP BY
cp1.contact_guid,
cp2.contact_guid having cnt >1 order by cnt desc
Time taken to execute: 1200 secs
I know I am doing a self join, and it is mentioned in best practices to avoid self joins.
My Questions:
I feel this table size in MB is too small for BigQuery, so why is it taking so much time? And what does a small table mean for BigQuery in the context of joins, in terms of number of rows and size in bytes?
Is the number of rows too large? 700k^2 is on the order of 10^11 rows produced during the join. What would be a realistic number of rows for joins?
I did check the documentation regarding joins, but did not find much regarding how big a table can be for joins and how much time can be expected for it to run. How do we estimate rough execution time?
Execution Details:
As shown in the screenshot you provided, you are dealing with an exploding join.
In this case step 3 takes in 1.3 million rows and manages to produce 459 million rows. Steps 04 to 0B deal with repartitioning and re-shuffling all that extra data, as the query didn't provision enough resources to deal with this number of rows: it scaled up from 1 parallel input to 10,000!
You have 2 choices here: either avoid exploding joins, or accept that exploding joins will take a long time to run. But as explained in the question, you already knew that!
How about generating all the extra rows in one operation (do the join, materialize the result) and then running another query to process the 459 million rows? The first query will be slow for the reasons explained, but the second one will run quickly, as BigQuery will provision enough resources to deal with that amount of data.
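A rough sketch of that two-step approach (the intermediate table name `data.contact_pairs_exploded` is made up; you could equally write the first query's results to a destination table instead of using DDL):
#standardSQL
-- step 1: materialize the exploded join once
CREATE TABLE `data.contact_pairs_exploded` AS
SELECT
  cp1.contact_guid AS p1,
  cp2.contact_guid AS p2
FROM `data.contact_pairs_program_together` cp1
JOIN `data.contact_pairs_program_together` cp2
  ON cp1.program_completed = cp2.program_completed
WHERE cp1.contact_guid < cp2.contact_guid;

-- step 2: aggregate the ~459 million materialized rows in a separate query
SELECT p1, p2, COUNT(*) AS cnt
FROM `data.contact_pairs_exploded`
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC;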
I agree with the suggestions below:
- see if you can rephrase your query using analytic functions (by Tim)
- using analytic functions would be a much better idea (by Elliott)
Below is how I would do it:
#standardSQL
SELECT
p1, p2, COUNT(1) AS cnt
FROM (
SELECT
contact_guid AS p1,
ARRAY_AGG(contact_guid) OVER(my_win) guids
FROM `data.contact_pairs_program_together`
WINDOW my_win AS (
PARTITION BY program_completed
ORDER BY contact_guid DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
), UNNEST(guids) p2
GROUP BY p1, p2
HAVING cnt > 1
ORDER BY cnt DESC
Please try it and let us know if it helped.

Slow performance running ordered query on SQL Server table with hierarchyid field

I have a tree of categories stored in a table which at present has both a traditional ParentCategoryId field and a hierarchyid field called Echelon.
The following query successfully pulls out the data in the format I require, with a depth level and the categories ordered by their depth and by the category name:
WITH q AS
(
SELECT
c.Id,
c.Name,
c.Echelon,
c.Echelon AS NewEchelon
FROM
Category c
WHERE
Deleted IS NULL AND ParentCategoryId IS NULL
UNION ALL
SELECT
c2.Id,
c2.Name,
c2.Echelon,
hierarchyid::Parse(q.NewEchelon.ToString() + CAST(ROW_NUMBER() OVER (ORDER BY q.Name) AS NVARCHAR(MAX)) + '/')
FROM
q
JOIN
Category c2 ON c2.Deleted IS NULL AND c2.Echelon.IsDescendantOf(q.Echelon) = 1 AND c2.Echelon.GetLevel() = q.Echelon.GetLevel() + 1
)
SELECT * FROM q ORDER BY NewEchelon
The performance of this query is, unfortunately, not great. The real table has only 319 categories, and 89 of them are soft-deleted with a non-null value in the Deleted column.
The timings for that query are as follows:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
(230 row(s) affected)
SQL Server Execution Times:
CPU time = 218 ms, elapsed time = 321 ms.
Now a third of a second seems pretty crazy for such a small amount of data to me. The table has no indexes on it aside from the clustered PK one. Is there a way to rewrite it so that it's faster? Do I just need to add appropriate indexes? Should I look at storing a HierarchySortOrder that I generate with the above query whenever the categories or their structure change? Should I look into caching the category tree at an application level?
My gut tells me that this query shouldn't be taking as long as it is, and that I'm likely missing a trick, but I'd appreciate any advice on the matter!
Here is the execution plan:
Execution Plan http://s27.postimg.org/mo30oy2yp/executionplam.png
Link to full sized image
The processing time seems fair to me, given that it's a union query with a few functions and joins.
Also, using WITH brings sub-query processing time into the picture as well.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem. In fact, the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort, where the result set is scanned while keeping up to date a small sort buffer holding only the top N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
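If you want to reproduce those numbers yourself, here is a minimal, self-contained sketch (table and column names are arbitrary):
CREATE TABLE demo AS
  SELECT g AS x, md5(g::text) AS payload
  FROM generate_series(1, 1000000) g;

EXPLAIN ANALYZE SELECT count(*) FROM demo;               -- plain sequential scan
EXPLAIN ANALYZE SELECT * FROM demo ORDER BY x LIMIT 10;  -- shows "top-N heapsort" in the plan
EXPLAIN ANALYZE SELECT * FROM demo ORDER BY x;           -- full sort of 1M rows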
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we have the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, CPU-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, Postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example:
Tables a and b are identical, each containing 1000 rows and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
Add columns which will cache:
- ST_Centroid(A.wkb_geometry)
- ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
Then:
- add columns of type BOX which will cache the bounding box of ST_Centroid(A.wkb_geometry) and the bounding box of ST_Buffer((B.wkb_geometry), 1000)
- add GiST indexes on the BOXes
- add a box overlap test (using the && operator) which will use the index
- keep your ST_Within, which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator; see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
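Putting those suggestions together, a rough sketch of the cached-column approach (PostGIS; the new column and index names are invented for illustration, and you would still want triggers to keep the cached values current):
ALTER TABLE global_site     ADD COLUMN centroid_geom geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_geom   geometry;

UPDATE global_site     SET centroid_geom = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_geom   = ST_Buffer(wkb_geometry, 1000);

CREATE INDEX global_site_centroid_gix   ON global_site     USING gist (centroid_geom);
CREATE INDEX na_utility_line_buffer_gix ON na_utility_line USING gist (buffer_geom);

-- in the query, let the indexed bounding-box overlap (&&) prune first,
-- and keep ST_Within as the exact final filter:
--   AND A.centroid_geom && B.buffer_geom
--   AND ST_Within(A.centroid_geom, B.buffer_geom)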
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing outside of peak hours, though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, making sure your DB is being vacuumed, increasing shared_buffers for your installation, and reviewing your work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
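One PostgreSQL-specific note worth adding: the index can also be built without blocking writes, which eases the off-peak concern (the build itself takes longer):
CREATE INDEX CONCURRENTLY area_acre ON global_site (area_acre);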
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?