I'm trying to use ST_SnapToGrid and then GROUP BY the grid cells (x, y). Here is what I did first:
SELECT
COUNT(*) AS n,
ST_X(ST_SnapToGrid(geom, 50)) AS x,
ST_Y(ST_SnapToGrid(geom, 50)) AS y
FROM points
GROUP BY x, y
I don't want to recompute ST_SnapToGrid for both x and y. So I changed it to use a sub-query:
SELECT
COUNT(*) AS n,
ST_X(geom) AS x,
ST_Y(geom) AS y
FROM (
SELECT
ST_SnapToGrid(geom, 50) AS geom
FROM points
) AS tmp
GROUP BY x, y
But when I run EXPLAIN, both of these queries have the exact same execution plan:
GroupAggregate (...)
-> Sort (...)
Sort Key: (st_x(st_snaptogrid(points.geom, 0::double precision))), (st_y(st_snaptogrid(points.geom, 0::double precision)))
-> Seq Scan on points (...)
Question: Will PostgreSQL reuse the result value of ST_SnapToGrid()?
If not, is there a way to make it do this?
Test timing
You don't see the evaluation of individual functions per row in the EXPLAIN output.
Test with EXPLAIN ANALYZE to get actual query times and compare overall effectiveness. Run it a couple of times to rule out caching artifacts. For simple queries like this, you get more reliable numbers for the total runtime with:
EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Requires Postgres 9.2+. Per documentation:
TIMING
Include actual startup time and time spent in each node in the output. The overhead of repeatedly reading the system clock can slow
down the query significantly on some systems, so it may be useful to
set this parameter to FALSE when only actual row counts, and not exact
times, are needed. Run time of the entire statement is always
measured, even when node-level timing is turned off with this option.
This parameter may only be used when ANALYZE is also enabled. It
defaults to TRUE.
Prevent repeated evaluation
Normally, expressions in a subquery are evaluated once. But Postgres can collapse trivial subqueries if it thinks that will be faster.
To introduce an optimization barrier, you could use a CTE instead of the subquery. Up to Postgres 11, this guarantees that Postgres computes ST_SnapToGrid(geom, 50) only once:
WITH cte AS (
SELECT ST_SnapToGrid(geom, 50) AS geom1
FROM points
)
SELECT COUNT(*) AS n
, ST_X(geom1) AS x
, ST_Y(geom1) AS y
FROM cte
GROUP BY geom1; -- see below
However, this is probably slower than a subquery due to the extra overhead of a CTE. The function call is probably very cheap. Generally, Postgres knows best how to optimize a query plan. Only introduce such an optimization barrier if you know better.
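Since Postgres 12, a plain CTE like this can be inlined by the planner, so it is no longer a guaranteed barrier. To force single evaluation there, add the MATERIALIZED keyword. A sketch of the same query for Postgres 12 or later:
WITH cte AS MATERIALIZED (  -- forces the CTE to be computed once
   SELECT ST_SnapToGrid(geom, 50) AS geom1
   FROM   points
   )
SELECT COUNT(*) AS n
     , ST_X(geom1) AS x
     , ST_Y(geom1) AS y
FROM   cte
GROUP  BY geom1;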
Simplify
I changed the name of the computed point in the subquery / CTE to geom1 to make clear it's different from the original geom. That highlights the more important thing here:
GROUP BY geom1
instead of:
GROUP BY x, y
That's obviously cheaper - and may have an influence on whether the function call is repeated. So, this is probably fastest:
SELECT COUNT(*) AS n
, ST_X(ST_SnapToGrid(geom, 50)) AS x
, ST_Y(ST_SnapToGrid(geom, 50)) AS y
FROM points
GROUP BY ST_SnapToGrid(geom, 50); -- same here!
Or maybe this:
SELECT COUNT(*) AS n
, ST_X(geom1) AS x
, ST_Y(geom1) AS y
FROM (
SELECT ST_SnapToGrid(geom, 50) AS geom1
FROM points
) AS tmp
GROUP BY geom1;
Test all three with EXPLAIN ANALYZE or EXPLAIN (ANALYZE, TIMING OFF) and see for yourself. Testing >> guessing.
Related
I use a ROW_NUMBER() function inside a CTE that causes a SORT operator in the query plan. This SORT operator has always been the most expensive element of the query, but has recently spiked in cost after I increased the number of columns read from the CTE/query.
What confuses me is that the increase in cost is not proportional to the column count. Normally I can increase the column count without much issue. However, it seems my query has passed some threshold and now costs so much that the execution time has doubled from 1 hour to over 2 hours.
I can't figure out what has caused the spike in cost and it's having an impact on business. Any ideas or next steps for troubleshooting you can advise?
Here is the query (simplified):
WITH versioned_events AS (
SELECT [event].*
,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
ELSE [event].[involvement_id]
END AS [derived_involvement_id]
,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC ) AS [latest_version]
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] as [inv]
ON [event].[service_delivery_id] = [inv].[service_delivery_id]
AND [inv].[role_type_code] = 't'
AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL AND (event.handle_space IS NULL
OR (event.handle_space NOT LIKE 'x%'
AND event.handle_space NOT LIKE 'y%'))
)
INSERT INTO db.schema.table (
....
)
SELECT
....
FROM versioned_events AS [event]
INNER JOIN (
SELECT DISTINCT service_delivery_id, derived_involvement_id
FROM versioned_events
WHERE latest_version = 1
AND ([versioned_events].[timestamp] > '2022-02-07 14:18:09.777610 +00:00')
) AS [delta_events]
ON COALESCE([event].[service_delivery_id],'NULL') = COALESCE([delta_events].[service_delivery_id],'NULL')
AND COALESCE([event].[derived_involvement_id],'NULL') = COALESCE([delta_events].[derived_involvement_id],'NULL')
WHERE [event].[latest_version] = 1
Here is the query plan from the version with the most columns, which experiences the cost spike (the plans for the others look the same, except that this operator takes much less time, 40-50 mins):
I did a comparison of three executions, each with a different column count in the INSERT INTO SELECT FROM clause. I can't share the spreadsheet, but I will try to convey my findings so far. The following is true of the query with the most columns:
It takes more than twice as long to execute than the other two executions
It performs more logical & physical reads and scans
It has more CPU time
It reads the most from Tempdb
The increase in execution time is not proportional to the increase in reads or the other mentioned metrics
There is indeed a level 8 memory spill happening. I have tried updating statistics, but it didn't help, and all versions of the query suffer the same spill, so the comparison is still like-for-like.
I know it can be hard to help with this kind of problem without being able to poke around but I would be grateful if anyone could point me in the direction for what to check / try next.
P.S. The table it reads from is a heap and the table it joins to is indexed. The heap table needs to stay a heap, otherwise inserts into it will take too long and the problem is just kicked down the road.
Also, when I say I added more columns, I mean in the SELECT FROM versioned_events statement. The columns are replaced with "...." in the above example.
UPDATE
Using a temp table halved the execution time at the high column count that caused the issue, but it actually takes longer at the reduced column counts. That goes back to the idea that a threshold is crossed when the column count is increased :(. In any event, we've used a temp table for now to see if it helps in production.
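Roughly, the temp-table variant looks like this (a sketch only: names are taken from the simplified query above, and SELECT ... INTO assumes SQL Server):
SELECT [event].*
      ,CASE WHEN [event].[handle_space] IS NOT NULL THEN [inv].[involvement_id]
            ELSE [event].[involvement_id]
       END AS [derived_involvement_id]
      ,ROW_NUMBER() OVER (PARTITION BY [event_id], [event_version] ORDER BY [event_created_date] DESC, [timestamp] DESC) AS [latest_version]
INTO #versioned_events          -- materialized once instead of being re-expanded per reference
FROM [database].[schema].[event_table] [event]
LEFT JOIN [database].[schema].[involvement] AS [inv]
       ON [event].[service_delivery_id] = [inv].[service_delivery_id]
      AND [inv].[role_type_code] = 't'
      AND [inv].latest_involvement = 1
WHERE event.deletion_type IS NULL
  AND (event.handle_space IS NULL
       OR (event.handle_space NOT LIKE 'x%' AND event.handle_space NOT LIKE 'y%'));
The INSERT ... SELECT then reads #versioned_events twice instead of evaluating the CTE body twice; SQL Server generally expands a CTE at every reference, so materializing an expensive CTE once can pay off when it is referenced more than once.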
I've encountered a problem with optimization.
When I use a query like this:
Select * (around 100 columns)
from x
where RepoDate = '2020-05-18'
It's taking around 0.2 seconds.
But when I use a query like this:
Select * (around 100 columns)
from x
where RepoDate = (select max(RepoDate) from y)
It takes around 1 hour.
Table y has only dates (2020-05-17, 2020-05-18, ... )
Can you tell me why there is so much difference in time to execute?
Technically, you are comparing a simple first query with a query that contains another subquery, which might already be processing a heavy table "y".
So these two queries are not the same to start with. We need the explain plans to get an estimate of the cost of the subquery first (i.e. how many rows, index usage, etc.), and then we can move on to the parent query.
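As a first step, you could look at the plan for the subquery on its own and for the combined statement, and check whether RepoDate is indexed in both tables. A sketch, assuming Postgres-style EXPLAIN syntax and hypothetical index names (adjust for your DBMS):
-- how expensive is the subquery on its own?
EXPLAIN ANALYZE SELECT max(RepoDate) FROM y;

-- plan for the combined statement
EXPLAIN ANALYZE
SELECT *
FROM x
WHERE RepoDate = (SELECT max(RepoDate) FROM y);

-- an index on y(RepoDate) makes max(RepoDate) cheap,
-- and one on x(RepoDate) lets the outer query seek instead of scanning
CREATE INDEX y_repodate_idx ON y (RepoDate);
CREATE INDEX x_repodate_idx ON x (RepoDate);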
I'm learning SQL and had a hard time understanding the following recursive SQL statement.
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
What are n and t in SELECT sum(n) FROM t;? As far as I understand, n is a number and t is a set. Am I right?
Also how is recursion triggered in this statement?
The syntax that you are using looks like Postgres. "Recursion" in SQL is not really recursion, it is iteration. Your statement is:
WITH RECURSIVE t(n) AS (
SELECT 1
UNION ALL
SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
The statement for t is evaluated as:
Evaluate the non-self-referring part (select 1).
Then evaluate the self-referring part. (Initially this gives 2.)
Then evaluate the self-referring part again (giving 3).
And so on while the condition is still valid (n < 100).
When this is done the t subquery is finished, and the final statement can be evaluated.
This is called a Common Table Expression, or CTE.
The RECURSIVE keyword by itself doesn't do the recursing: it merely allows the CTE named t to reference itself, and that self-reference is what makes things recursive. To produce the result of the expression, the query engine must therefore recursively build the result, where each evaluation triggers the next. It reaches this point: SELECT n+1 FROM t... and has to stop and evaluate t. To do that, it has to call itself again, and so on, until the condition (n < 100) no longer holds. The SELECT 1 provides a starting point, and the WHERE n < 100 makes sure the query does not recurse forever.
At least, that's how it's supposed to work conceptually. What generally really happens is that the query engine builds the result iteratively, rather than recursively, if it can, but that's another story.
Let's break this apart:
WITH RECURSIVE t(n) AS (
A Common Table Expression (CTE), which consists of a seed query and a recursive query. The CTE is called t and returns one column: n
The seed query:
SELECT 1
returns an answer set (in this case just a single row: 1) and puts a copy of it into the final answer set
Now starts the recursive part:
UNION ALL
The rows returned from the seed query are now processed and n+1 is returned (again a single row answer set: 2) and copied into the final answer set:
SELECT n+1 FROM t WHERE n < 100
If this step returns a non-empty answer set (activity_count > 0), it is repeated, and it keeps repeating until a step returns no rows.
A WHERE condition on a calculation like this n+1 is usually used to avoid endless recursion. One usually knows the maximum possible level based on the data, and for complex queries it's too easy to get some condition wrong ;-)
Finally the answer set is returned:
)
SELECT sum(n) FROM t;
When you simply do a SELECT * FROM t; you'll see all numbers from 1 to 100 (though it's not a very efficient way to produce such a list).
The most important thing to remember is that each step produces a part of the final result and only those rows from the previous step are processed in the next recursion level.
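To watch that accumulation step by step, it helps to lower the bound and select the rows themselves instead of the sum. A small variant of the same query (the limit is reduced to 5 only for readability):
WITH RECURSIVE t(n) AS (
    SELECT 1            -- seed: the working table starts as {1}
    UNION ALL
    SELECT n + 1        -- each step builds one new row from the previous step's row
    FROM t
    WHERE n < 5         -- lower bound so the whole result is easy to read
)
SELECT n FROM t;        -- returns 1, 2, 3, 4, 5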
I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort where the result set is scanned while maintaining a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we get the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and CPU-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example:
Tables a and b are identical, contain 1000 rows each, and have a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date (see the sketch after this list).
add columns of type BOX which will cache:
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
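A minimal sketch of the cached-column idea (column and index names are made up; this variant caches the geometries themselves and puts GiST indexes on them, which index their bounding boxes, rather than adding separate BOX columns):
ALTER TABLE global_site     ADD COLUMN centroid_geom geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_geom   geometry;

UPDATE global_site     SET centroid_geom = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_geom   = ST_Buffer(wkb_geometry, 1000);

CREATE INDEX global_site_centroid_gist   ON global_site     USING gist (centroid_geom);
CREATE INDEX na_utility_line_buffer_gist ON na_utility_line USING gist (buffer_geom);

-- in the query, let the indexed && (bounding boxes overlap) test narrow
-- things down before the exact test:
--   AND A.centroid_geom && B.buffer_geom
--   AND ST_Within(A.centroid_geom, B.buffer_geom)
Keep the cached columns up to date with triggers, as described above.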
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing outside of peak hours, though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, ensuring your db is being vacuumed, and increasing the shared_buffers and work_mem settings for your db install.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
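For example, something along these lines (hypothetical index name; this works because ST_Centroid is immutable):
CREATE INDEX global_site_centroid_expr_gist
    ON global_site
 USING gist (ST_Centroid(wkb_geometry));
The planner can then use the index for conditions written against ST_Centroid(wkb_geometry) without a separate cached column.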
I need to select data from one table and insert it into another table. Currently the SQL looks something like this:
INSERT INTO A (x, y, z)
SELECT x, y, z
FROM B b
WHERE ...
However, the SELECT is huge, resulting in over 2 million rows, and we think it is taking up too much memory. Informix, the db in this case, runs out of virtual memory when the query is run.
How would I go about selecting and inserting a set of rows at a time (say 2000), given that I don't think there are any row ids, etc.?
You can do SELECT FIRST n * FROM Table, where n is the number of rows you want, say 2000. Also, in the WHERE clause, add an embedded select that checks the target table for rows that already exist, so that the next time the statement is run, it will not include already-inserted data.
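A sketch of what that could look like (the NOT EXISTS check assumes x identifies a row; also note the caveat in the next answer about where IDS allows the FIRST clause):
INSERT INTO A (x, y, z)
SELECT FIRST 2000 b.x, b.y, b.z
FROM B b
WHERE NOT EXISTS (SELECT 1 FROM A a WHERE a.x = b.x)  -- skip rows already copied (assumes x is a key)
-- AND <the original filter from the question goes here>
;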
I assume that you have some script that this is executed from? You can just loop and limit as long as you order the values returned from the nested select. Here is some pseudo code.
total = SELECT COUNT(x) FROM B WHERE ...
while (total > 0)
-- the WHERE ... must also exclude rows already copied into A
-- (e.g. the embedded-select check from the previous answer),
-- otherwise every pass re-selects the same first 2000 rows
INSERT INTO A (x, y, z) SELECT x, y, z FROM B b WHERE ... ORDER BY x LIMIT 2000
total = total - 2000
end
I'm almost certain that IDS only lets you use the FIRST clause where the data is returned to the client¹, and that is something you want to avoid if at all possible.
You say you get an out of memory error (rather than, say, a long transaction aborted error)? Have you looked at the configuration of your server to ensure it has a reasonable amount of memory?
It depends in part on how big your data set is and what the constraints are - why you are doing the load across tables. But I would normally aim to determine a way of partitioning the data into loadable subsets and run those sequentially in a loop. For example, if the sequence numbers are between 1 and 10,000,000, I might run the loop ten times, with a condition on the sequence number such as 'AND seqnum >= 0 AND seqnum < 1000000', then 'AND seqnum >= 1000000 AND seqnum < 2000000', etc. Preferably in a language with the ability to substitute the range via variables.
This is a bit of a nuisance, and you want to err on the conservative side in terms of range size (more smaller partitions rather than fewer bigger ones, to reduce the risk of running out of memory).
¹ Over-simplifying slightly. A stored procedure would have to count as 'the client', for example, and the communication cost in a stored procedure is (a lot) less than the cost of going to the genuine client.
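A minimal sketch of one chunk of the partitioned load described above, assuming B has a numeric seqnum column (repeat with the next range, ideally from a small driver script that substitutes the bounds):
INSERT INTO A (x, y, z)
SELECT b.x, b.y, b.z
FROM B b
WHERE b.seqnum >= 0 AND b.seqnum < 1000000   -- next pass: 1000000 to 2000000, and so on
-- AND <the original filter from the question goes here>
;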