are SQL subqueries within CASE WHEN run once for a query, or for each row? - sql

Basically, is the code below efficient (if I cannot use # variables in MonetDB), or will this call the subqueries more than once each?
CREATE VIEW sys.share26cuts_2007 (peorglopnr,share26cuts_2007) AS (
SELECT peorglopnr, CASE WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.25) FROM sys.share26_2007) THEN 1
WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.5) FROM sys.share26_2007) THEN 2
WHEN share26_2007 < (SELECT QUANTILE(share26_2007,0.75) FROM sys.share26_2007) THEN 3
ELSE 4 END AS share26cuts_2007
FROM sys.share26_2007
);
I would rather not use a user-defined function either, though this came up in other questions.

As e.g. GoatCO commented on the question, this is probably better avoided. The SET command that MonetDB support can be used with SELECT as in the code below. The remaining question is why all quantiles are zero where my data is surely not (I also got division by zero errors before using NULLIF). I show more of the code now.
CREATE VIEW sys.over26_2007 (personlopnr,peorglopnr,loneink,below26_loneink) AS (
SELECT personlopnr,peorglopnr,loneink, CASE WHEN fodelsear < 1981 THEN 0 ELSE loneink END AS below26_loneink
FROM sys.ds_chocker_lev_lisaindivid_2007
);
SELECT COUNT(*) FROM over26_2007;
CREATE VIEW sys.share26_2007 (peorglopnr,share26_2007) AS (
SELECT peorglopnr, SUM(below26_loneink)/NULLIF(SUM(loneink),0)
FROM sys.over26_2007
GROUP BY peorglopnr
);
SELECT COUNT(*) FROM share26_2007;
DECLARE firstq double;
SET firstq = (SELECT QUANTILE(share26_2007,0.25) FROM sys.share26_2007);
SELECT firstq;
DECLARE secondq double;
SET secondq = (SELECT QUANTILE(share26_2007,0.5) FROM sys.share26_2007);
SELECT secondq;
DECLARE thirdq double;
SET thirdq = (SELECT QUANTILE(share26_2007,0.275) FROM sys.share26_2007);
SELECT thirdq;
CREATE VIEW sys.share26cuts_2007 (peorglopnr,share26cuts_2007) AS (
SELECT peorglopnr, CASE WHEN share26_2007 < firstq THEN 1
WHEN share26_2007 < secondq THEN 2
WHEN share26_2007 < thirdq THEN 3
ELSE 4 END AS share26cuts_2007
FROM sys.share26_2007
);
SELECT COUNT(*) FROM share26cuts_2007;

About inspecting plans, MonetDB supports:
PLAN to see the logical plan
EXPLAIN to see the physical plan in terms of MAL instructions
TRACE same as EXPLAIN, but actually execute the MAL plan and return statistics for all instructions.
About your question on repeating the subqueries, in principle nothing will be repeated and you will not need to take care of it explicitly.
That's because the default optimization pipeline includes the commonTerms optimizer. Your SQL will be translated to a sequence of MAL instructions, with duplicate calls. MAL is designed to be simple: many short instruction calls, a bit like assembly, which operate on columns, not rows (hence don't apply the same reasoning you would use for SQL Server when you think of execution efficiency). This makes it easier to run some optimizations on it. The commonTerms optimizer will detect the duplicate calls and reuse all results that it can. This is done per-column. So you should really be able to run your query and be happy.
However, I said in principle. Not all cases will be detected (though most will), plus some limitations have been introduced on purpose. For example, the search-space for detecting duplicates is a window (too small for my taste - I have removed it altogether in my installations) over the whole MAL plan: if the duplicate instruction is too far down the plan it won't be detected. This was done for efficiency. In your case, that single query isn't that big, but if it is part of a longer chain of views, then all these views will compile into a large MAL plan - which might make commonTerms less effective - it really depends on the actual queries.

Related

Does using EXISTS instruction improves this query

Im learning how to improve some queries I have, for example I saw that using EXISTS over IN does better, so I did the following modification but im not 100% sure if im getting the same results I expect by using EXISTS, so far it gives me the same results when I execute with different periods but im still doubting, could some one clear for me if it does better performance EXISTS and also if its doing the same?
VARIABLE periodo STRING;
SELECT MO.STRMOVANOMES,
C.STRCLINOMBRE,
MO.STRCLINIT,
MO.STROBLOBLIGSARC,
MO.NUMPROCODIGO,
MO.NUMMOVTIPOCREDITO,
MO.NUMMOVTIPOGARANTIA,
MO.STROBLMODALIDAD,
MO.NUMMOVCALIFICACION,
MO.NUMMOVVLRCAPCREDITO,
MO.NUMMOVVLRINTCREDI,
MO.NUMMOVVLRCAPOTRO
FROM TBLMOVOBLIGACIONES MO,
TBLCLIENTES C
WHERE MO.STRMOVANOMES = :periodo
AND C.STRCLITIPOID ='N'
AND MO.NUMMOVTIPOCREDITO = 2
AND MO.NUMPROCODIGO IN (1,3,4,5,6,7,10,15,16,18,24,29,32,38,40,43,44,45,49,51,54,55,56,70,71,72,73,74,75,76,77,78,81,82,83,84,85)--ALL APPLICATIONS
AND C.STRCLINIT=MO.STRCLINIT
AND SUBSTR(MO.STRCLINIT,1,9) >= 600000000
AND SUBSTR(MO.STRCLINIT,1,9) <= 999999999;
-- USING EXISTS
SELECT MO.STRMOVANOMES,
C.STRCLINOMBRE,
MO.STRCLINIT,
MO.STROBLOBLIGSARC,
MO.NUMPROCODIGO,
MO.NUMMOVTIPOCREDITO,
MO.NUMMOVTIPOGARANTIA,
MO.STROBLMODALIDAD,
MO.NUMMOVCALIFICACION,
MO.NUMMOVVLRCAPCREDITO,
MO.NUMMOVVLRINTCREDI,
MO.NUMMOVVLRCAPOTRO
FROM TBLMOVOBLIGACIONES MO,
TBLCLIENTES C
WHERE MO.STRMOVANOMES = :periodo
AND C.STRCLITIPOID ='N'
AND MO.NUMMOVTIPOCREDITO = 2
AND EXISTS (SELECT NUMPROCODIGO FROM TBLMOVOBLIGACIONES)--ALL APPLICATIONS
AND C.STRCLINIT=MO.STRCLINIT
AND SUBSTR(MO.STRCLINIT,1,9) >= 600000000
AND SUBSTR(MO.STRCLINIT,1,9) <= 999999999;
First, some general info
In theory, the IN clause is to be used when there is a need to check some value against a list of values, so the DB has to loop through the list looking for a value given. Whereas EXIST is to let db to perform a quick query (which is db designed for) in order to check if there is at least one row.
If the values list you're checking against is in a table, it is better to use EXISTS in general. That was some theory.
In practice though, starting from version 10 (which is very old one) Oracle builds same exec plans for IN and EXISTS queries, here is a good explanation if you need some more experienced guy.
To check whether it affects query execution, check explain plans (google for it for your dev-tool) and compare overall "costs" (cost comparison it's not always a good thing to relay on, it is okay for such case though)
Now, back to your query.
The EXISTS in the way you are using it will always return true whenever table TBLMOVOBLIGACIONES has at least one row. Not sure this is what you were looking for.
I believe you need to use it somehow in this manner
AND EXISTS (SELECT 1 FROM TBLMOVOBLIGACIONES WHERE NUMPROCODIGO = MO.NUMPROCODIGO)
Thus it will check whether there is at least one record connected to the data you selected on previous step.
Next step I see here is the EXISTS clause can be easily converted to a regular table join, so instead of choosing between IN or EXISTS you might write the following
FROM TBLMOVOBLIGACIONES MO,
TBLCLIENTES C,
TBLMOVOBLIGACIONES M
WHERE MO.STRMOVANOMES = :periodo
AND C.STRCLITIPOID ='N'
AND MO.NUMMOVTIPOCREDITO = 2
AND M.NUMPROCODIGO = MO.NUMPROCODIGO--ALL APPLICATIONS
AND C.STRCLINIT=MO.STRCLINIT
AND SUBSTR(MO.STRCLINIT,1,9) >= 600000000
AND SUBSTR(MO.STRCLINIT,1,9) <= 999999999;
It is worth doing because joins are always better option allowing database to vary table joins ordering to achieve better performance.

SQL Query Performance with case statement

I have a simple select that is running very slow and have narrowed it down to one particular where statement.
I am not sure if you need to see the whole query, or maybe will be able to help me understand why the case is affecting performance so much. I feel like I found the problem, but can't seem to resolve it. I've worked with case statement before, and have never ran into such huge performance issues.
For this particular example. the declaration is as follows: Declare #lastInvOnly as int = 0
the problem where statement follow and runs for about 20 seconds:
AND ird.inventorydate = CASE WHEN #lastinvonly=0 THEN
-- get the last reported inventory in respect to the specified parameter
(SELECT MAX(ird2.inventorydate)
FROM irdate ird2
WHERE ird2.ris =r.ris AND
ird2.generateddata!='g' AND
ird2.inventorydate <= #inventorydate)
END
Removing the case makes it run in 1 second which is a HUGE difference. I can't understand why.
AND ird.inventorydate =
(SELECT MAX(ird2.inventorydate)
FROM irdate ird2
WHERE ird2.ris = r.ris AND
ird2.generateddata! = 'g' AND
ird2.inventorydate <= #inventorydate)
It should almost certainly be a derived table and you should join to it instead. Sub-selects tend to have poor performance and when used conditionally, even worse. Try this instead:
INNER JOIN (
select
ris
,max(inventorydate) AS [MaxInvDate]
from irdate
where s and generateddata!='g'
and inventorydate <= #inventorydate
GROUP BY ris
) AS MaxInvDate ON MaxInvDate.ris=r.ris
and ird.inventorydate=MaxInvDate.MaxInvDate
and #lastinvonly=0
I'm not 100% positive this logically works with the whole query as your question only provides a small part.
I can't tell for sure without seeing an execution plan but the branch in your filter is likely the cause of the performance problems. Theoretically, the optimizer can take the version without the case and apply an optimization that transforms the subquery in your filter into a join; when the case statement is added this optimization is no longer possible and the subquery is executed for every row. One can refactor the code to help the optimizer out, something like this should work:
outer apply (
select max(ird2.inventorydate) as maxinventorydate
from irdate ird2
where ird2.ris = r.ris
and ird2.generateddata <> 'g'
and ird2.inventorydate <= #inventorydate
and #lastinvonly = 0
) as ird2
where ird.inventorydate = ird2.maxinventorydate

Repeating operations vs multilevel queries

I was always bothered by how should I approach those, which solution is better. I guess the sample code should explain it better.
Lets imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, but with each of the calculation being based on a previous one, In other words something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
Name LIKE "A%"
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
Name LIKE "A%"
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tired a number of different combination of calculations in both variants. They always produced the same execution plan, so it could be assumed that there is no difference in the performance aspect. From the code usability perspective the first approach i obviously better as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the variable is defined before evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.

Multiple Joins + Lots of Data Optimization

I am working on a massive join at work and have very limited resources in terms of being able to add indexes and such as well as what I can do in the query itself due to the environment (i.e. I can only select data, no variables or table creations allowed). I have read somewhere that a subquery will automatically index the result, is this true? Also for my major join tables (3) each has ~140K rows. I have to join 2 extra tables to ensure filtering is correct. I have the query listed below which I currently have criteria on the JOIN clause. Another question is if I move my criteria to a where clause either in or out of the subquery will it benefit?
SELECT *
FROM (SELECT NULL AS A1,
DFS_ROHEADER.TECHID,
DFS_ROHEADER.RONUMBER,
DFS_ROHEADER.CUSTOMERNUMBER,
DFS_CUSTOMER.BNAME,
DFS_ROHEADER.UNITNUMBER,
DFS_ROHEADER.MILEAGE,
DFS_ROHEADER.OPENEDDATE,
DFS_ROHEADER.CLOSEDDATE,
DFS_ROHEADER.STATUS,
DFS_ROHEADER.PONUMBER,
DFS_TECH.REGION,
DFS_TECH.RSM,
DFS_ROPART.PARTID,
CONVERT(NVARCHAR(max), DFS_RODETAIL.STORY) AS STORY
FROM DFS_ROHEADER
LEFT JOIN DFS_CUSTOMER
ON DFS_ROHEADER.CUSTOMERNUMBER = DFS_CUSTOMER.CUST_NO
LEFT JOIN DFS_TECH
ON DFS_ROHEADER.TECHID = DFS_TECH.TECHID
INNER JOIN DFS_RODETAIL
ON DFS_ROHEADER.RONUMBER = DFS_RODETAIL.RONUMBER
INNER JOIN DFS_ROPART
ON DFS_RODETAIL.RONUMBER = DFS_ROPART.RONUMBER
AND DFS_RODETAIL.LINENUMBER = DFS_ROPART.LINENUMBER
AND DFS_ROHEADER.RONUMBER LIKE '%$FF_RONumber%'
AND DFS_ROHEADER.UNITNUMBER LIKE '%$FF_UnitNumber%'
AND DFS_ROHEADER.PONUMBER LIKE '%$FF_PONumber%'
AND ( DFS_CUSTOMER.BNAME LIKE '%$FF_Customer%'
OR DFS_CUSTOMER.BNAME IS NULL )
AND DFS_ROHEADER.TECHID LIKE '%$FF_TechID%'
AND DFS_ROHEADER.CLOSEDDATE BETWEEN
FF_ClosedBegin AND FF_ClosedEnd
AND ( DFS_TECH.REGION LIKE '%$FilterRegion%'
OR DFS_TECH.REGION IS NULL )
AND ( DFS_TECH.RSM LIKE '%$FF_RSM%'
OR DFS_TECH.RSM IS NULL )
AND DFS_RODETAIL.STORY LIKE '%$FF_Story%'
AND DFS_ROPART.PARTID LIKE '%$FF_PartID%'
WHERE DFS_ROHEADER.DELETED_BY < 0
AND DFS_RODETAIL.DELETED_BY < 0
AND DFS_ROPART.DELETED_BY < 0) T
ORDER BY T.RONUMBER
This query works; however, it can take forever to run, and can timeout. I have other queries that also run in the environment and I will take whatever you can give me in terms of suggestions and apply it to those. I am using SQLServer 2000, Thanks for the help.
EDIT:
Execution Plan:
https://dl.dropboxusercontent.com/u/99733863/ExecutionPlan.sqlplan
UPDATE:
I have come to the conclusion the environment I'm working in is the cause of the problem. My query works as intended and is not slow at all (1 sec. for 18,000 rows). As stated in the comments I have to fill grids with limited flexibility and I believe that these grids fill by first filling a temporary grid with the SQL statement and then copying row by row into the desired grid. There is a good chance that this is the cause of my issues. Thanks for the help.
I have come to the conclusion the environment I'm working in is the cause of the problem. My query works as intended and is not slow at all (1 sec. for 18,000 rows). As stated in the comments I have to fill grids with limited flexibility and I believe that these grids fill by first filling a temporary grid with the SQL statement and then copying row by row into the desired grid. There is a good chance that this is the cause of my issues. Thanks for the help everyone.
My 2 cents here.. In general LIKE is not very well optimized. In your case you also seem to be using LIKE with '%value%'. In that case the query optimizer has to scan the entire index. At a minimum I would see if there is a way to avoid using this.

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using :
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we got the worst case CROSS JOIN condition ever :
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow !
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on a using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache :
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache :
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?