SQL Optimization in Oracle

We are using Oracle 11 and I recently acquired the Dell SQL Optimizer (included with the Xpert Toad package). We had a statement this morning that was taking longer than normal to run, and after we eventually got it running (it was missing some conditions from when it was created) I was curious, having never used any SQL optimizer before, what it would change the statement to. It came back with over 150 variations of the same statement, but the one with the lowest cost simply made an addition to the following line.
AND o.curdate > 0 + UID * 0
We already had o.curdate > 0; only the "+ UID * 0" part was added. This decreased the runtime from over a minute to 3 seconds. I assume it has something to do with how Oracle translates and processes the conditions, but I was curious whether any of the Oracle gurus could provide some insight into how this addition to the greater-than-zero check cut the runtime roughly twenty-fold. Thanks!

The UID * 0 is used to hide the 0 from the optimizer. The optimizer uses its statistics to decide whether an index scan on o.curdate > 0 makes sense. As long as the optimizer knows the value in o.curdate > value, it will do so. But when the value is unknown (here because the UID function is only evaluated at execution time and folded into the value), the optimizer cannot foresee what percentage of rows will be accessed and therefore chooses an average best access method.
Example: You have a table with IDs 1 to 100. Asking for ID > 0 will result in a full table scan, whereas asking for ID > 99 will likely result in an index range scan. Asking for ID > 0 + UID * 0 suddenly makes the optimizer blind to the value, and it may choose the index plan rather than the full table scan.
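To see the effect yourself, here is a minimal sketch assuming a hypothetical table T with an indexed numeric column ID and reasonably fresh statistics (names are illustrative only):
-- Compare the two plans side by side.
EXPLAIN PLAN FOR
  SELECT * FROM t WHERE id > 0;              -- the optimizer sees the literal; expect a full scan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

EXPLAIN PLAN FOR
  SELECT * FROM t WHERE id > 0 + UID * 0;    -- the literal is hidden; an index plan may appear
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);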

Related

Does SQL Server Table-Scan Time depend on the Query?

I observed that a full table scan takes a different amount of time depending on the query. I believed that under similar conditions (same set of selected columns, same column data types) a table scan should take roughly the same time. It seems that is not the case, and I just want to understand the reason behind it.
I have used "CHECKPOINT" and "DBCC DROPCLEANBUFFERS" before querying to make sure there is no impact from the query cache.
Table:
10 Columns
10M rows
Each column has a different density, ranging from 0.1 to 0.000001
No indexes
Queries:
Query A: returned 100 rows, time took: ~ 900ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL07 = 50000
Query B: returned 910595 rows, time took: ~ 15000ms
SELECT [COL00]
FROM [TEST].[dbo].[Test]
WHERE COL01 = 5
Column COL07 was randomly populated with integers ranging from 0 to 100000 and column COL01 was randomly populated with integers ranging from 0 to 10.
Time Taken:
Query A: around 900 ms
Query B: around 18000 ms
What's the point I'm missing here?
Query A: (returned 100 rows, time took: ~ 900ms)
Query B: (returned 910595 rows, time took: ~ 15000ms)
I believe what you are missing is that there are roughly 9,000 times more rows to fetch in the second query. That alone could explain why it took about 20 times longer.
The two columns have different data densities.
Query A, COL07: 10000000 / 100000 = 100 rows per value
Query B, COL01: 10000000 / 10 = 1000000 rows per value
The fact that both search parameters are in the middle of the data range doesn't necessarily impact the speed of the search. It depends on the number of rows the engine has to read and return for the search predicate.
In order to see whether this is indeed the case, I would try the following:
COL04: 10000000 / 1000 = 10000 rows per value. Filter on WHERE COL04 = 500.
COL08: 10000000 / 10000 = 1000 rows per value. Filter on WHERE COL08 = 5000.
Considering the times from the initial test, you would expect to see COL04 at around 7200 ms and COL08 at around 3600 ms. A sketch of this test follows below.
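A minimal sketch of that follow-up test, reusing the CHECKPOINT / DBCC DROPCLEANBUFFERS steps from the question (the column choices assume the densities estimated above):
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL04 = 500;    -- roughly 10,000 rows expected

CHECKPOINT;
DBCC DROPCLEANBUFFERS;
SELECT [COL00] FROM [TEST].[dbo].[Test] WHERE COL08 = 5000;   -- roughly 1,000 rows expected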
An interesting article on a related topic: SQL Server COUNT() Function Performance Comparison.
Full Table Scan (also known as Sequential Scan) is a scan made on a database where each row of the table under scan is read in a sequential (serial) order
Reference
In your case, the full table scan reads rows sequentially (in order), so it does not need to scan the whole table to advance to the next record, because COL07 is ordered.
In Query B that is not the case: COL01 is randomly distributed, so the whole table has to be scanned.
Query A is an optimistic scan, whereas Query B is a pessimistic scan.

Return types of glob() and like() and failure of using index although 'LIKE optimization' applies

I'm writing up this finding because of auto-generated SQL created by EntityFramework (see the related question):
When returning the result of glob() (or like()), it appears that the return type of these functions is bit:
SELECT Name, glob('admin*', Name) as globresult
FROM Users
returns for example (it's really just an example, I'm NOT doing such user searches):
Name           globresult
Administrator  1
Springy        0
But when using it this way in a WHERE clause the query plan (extracted with EXPLAIN QUERY PLAN [..]) changes from good (=using index) to bad:
GOOD:
SELECT FolderID, Name
FROM Folders
WHERE glob('1_2_*', RootPath)
Query plan:
0 0 0 SEARCH TABLE Folders USING INDEX IX_RootPath (RootPath>? AND RootPath<?)
BAD: (only difference is the = 1 comparison)
SELECT FolderID, Name
FROM Folders
WHERE glob('1_2_*', RootPath) = 1
Query plan:
0 0 0 SCAN TABLE Folders
Does this qualify for a bug report, or is there a reason why it should be this way by design?
SQLite recognizes columns as candidates for index lookups only when they appear directly in an expression in the WHERE clause:
x = 5 AND y GLOB 'x*'
Any more complex expression (such as (x = 5) = 1 or even +x = 5) prevents the optimizer from recognizing a supported pattern (and this is documented).
While the meaning of these expressions is actually the same, the optimizer lacks the code to be able to prove it.
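A minimal sketch of that restriction (table and index names are invented): the unary + operator hides the column from the optimizer in much the same way the = 1 comparison does.
CREATE TABLE t(x INTEGER, y TEXT);
CREATE INDEX t_x ON t(x);
EXPLAIN QUERY PLAN SELECT y FROM t WHERE x = 5;    -- SEARCH TABLE t USING INDEX t_x (x=?)
EXPLAIN QUERY PLAN SELECT y FROM t WHERE +x = 5;   -- SCAN TABLE t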

Optimize complicated SQL Update

Somebody at work wrote this UPDATE some years ago and it works; the problem is that it takes almost 5 hours when called multiple times in a process. This is not a regular UPDATE: there is no 1-to-1 record matching between tables. It does an update based on an accumulated SUM of a particular field in the same table, and things get more complicated because that SUM is restricted by special conditions based on dates and another field.
I think this is effectively an (implicit) inner join with no 1-to-1 match, like ALL vs ALL, so with, for example, 7000 records in the table it will process 7000 * 7000 combinations, roughly 49 million. In my opinion cursors should have been used here, but now I need more speed and I don't think cursors will get me there.
My question is: is there any way to rewrite this and make it faster? Pay attention to the conditions on that SUM; this is not an easy UPDATE to follow (at least for me).
More info:
CodCtaCorriente and CodCtaCorrienteMon are primary keys on this table but, as I said before, there is no intention to make a 1-to-1 match here, which is why these keys are not used as join keys; CodCtaCorrienteMon is used in conditions but not as a join condition (ON).
UPDATE #POS
SET SaldoDespuesEvento =
    (SELECT SUM(Importe)
     FROM #POS CTACTE2
     WHERE CTACTE2.CodComitente = #POS.CodComitente
       AND CTACTE2.CodMoneda = #POS.CodMoneda
       AND CTACTE2.EstaAnulado = 0
       AND (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) > 0
            OR (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) = 0
                AND #POS.CodCtaCorrienteMon >= CTACTE2.CodCtaCorrienteMon)))
WHERE #POS.EstaAnulado = 0 AND #POS.EsSaldoAnterior = 0
From your query plan it looks like it's spending most of the time in the filter right after the index spool.
If you are going to run this query a few times, I would create an index on the CodComitente, CodMoneda, EstaAnulado, FechaLiquidacion, and CodCtaCorrienteMon columns.
I don't know much about the Index Spool iterator, but from what I understand it's used as a 'temporary' index created at query time. So if you are running this query multiple times, I would create that index once, then run the query as many times as you need.
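A hedged sketch of that index; the exact column order and the covering column are judgement calls, not something stated in the question:
CREATE INDEX IX_POS_RunningSum
    ON #POS (CodComitente, CodMoneda, EstaAnulado, FechaLiquidacion, CodCtaCorrienteMon)
    INCLUDE (Importe);  -- optional covering column so the SUM can be served from the index alone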
Also, I would try creating a variable to store the result of your sum operation, so you can avoid running it more often than necessary:
DECLARE @sumVal INT
-- Note: the outer #POS.* references below are correlated; to compute the sum
-- outside the UPDATE they must be replaced by values for a specific row.
SET @sumVal =
    (SELECT SUM(Importe)
     FROM #POS CTACTE2
     WHERE CTACTE2.CodComitente = #POS.CodComitente
       AND CTACTE2.CodMoneda = #POS.CodMoneda
       AND CTACTE2.EstaAnulado = 0
       AND (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) > 0
            OR (DATEDIFF(day, CTACTE2.FechaLiquidacion, #POS.FechaLiquidacion) = 0
                AND #POS.CodCtaCorrienteMon >= CTACTE2.CodCtaCorrienteMon)))
UPDATE #POS SET SaldoDespuesEvento = @sumVal
WHERE #POS.EstaAnulado = 0 AND #POS.EsSaldoAnterior = 0
It is hard to help much without the query plan, but I would assume that if there are not already indexes on the FechaLiquidacion and CodCtaCorrienteMon columns, then performance would improve by creating them, as long as database storage space is not an issue.
Found the solution; this is a common problem: running totals.
This is one of the few cases where CURSORS perform better. See this and more available solutions here (or browse Stack Overflow, there are many cases like this):
http://weblogs.sqlteam.com/mladenp/archive/2009/07/28/SQL-Server-2005-Fast-Running-Totals.aspx
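On SQL Server 2012 or later, a windowed running total is another option that avoids both the correlated subquery and a cursor. This is only a sketch: it assumes CodCtaCorrienteMon uniquely identifies a row in #POS and that FechaLiquidacion carries no time portion (the original compares whole days with DATEDIFF).
;WITH Totals AS (
    SELECT CodCtaCorrienteMon,
           SUM(Importe) OVER (PARTITION BY CodComitente, CodMoneda
                              ORDER BY FechaLiquidacion, CodCtaCorrienteMon
                              ROWS UNBOUNDED PRECEDING) AS RunningTotal
    FROM #POS
    WHERE EstaAnulado = 0
)
UPDATE P
SET SaldoDespuesEvento = T.RunningTotal
FROM #POS AS P
JOIN Totals AS T ON T.CodCtaCorrienteMon = P.CodCtaCorrienteMon
WHERE P.EstaAnulado = 0 AND P.EsSaldoAnterior = 0;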

SQL Server aggregate performance

I am wondering whether SQL Server knows to 'cache', if you like, aggregates within a query when they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again then divide it by 12?
Thanks
It calculates it once.
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
|--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = SUM
Expr1003 = some internal handling of the "no rows" case etc., but is basically Expr1011
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
Good question. I think the answer is no, it doesn't cache it.
I ran a test query with around 3000 counts in it, and it was much slower than one with only a few. I still want to test whether the query would be just as slow selecting plain columns.
Edit: OK, I just tried selecting a large number of columns versus just one, and the number of columns (when talking about thousands being returned) does affect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. If push comes to shove, you could always save the outcome to a variable and do the math after the fact, as sketched below.
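A small sketch of that variable approach, reusing the MyTable/Price names from the answer above (the data type is an assumption):
DECLARE @total DECIMAL(18, 2);
SELECT @total = SUM(Price) FROM MyTable;     -- aggregate computed once
SELECT @total AS Total, @total / 12 AS TotalOverTwelve;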

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero, since Postgres has a Top-N sort in which the result set is scanned while keeping up to date a small sort buffer holding only the top N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
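A minimal way to reproduce the effect on any PostgreSQL instance (the table name is invented):
CREATE TABLE bigtable AS SELECT generate_series(1, 1000000) AS x;
SELECT count(*) FROM bigtable;                 -- baseline full scan
SELECT * FROM bigtable ORDER BY x LIMIT 10;    -- Top-N sort: barely slower than the scan
SELECT * FROM bigtable ORDER BY x;             -- full sort: much slower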
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we get the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, Postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops once the LIMIT is reached.
Here's a little example:
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
Add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
Add columns of type BOX which will cache:
the bounding box of ST_Centroid(A.wkb_geometry)
the bounding box of ST_Buffer((B.wkb_geometry), 1000)
Add GiST indexes on the BOXes.
Add a BOX overlap test (using the && operator), which will use the indexes.
Keep your ST_Within, which will act as a final filter on the rows that pass. A sketch of this approach follows below.
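A hedged sketch of that approach: the column and index names are invented, PostGIS is assumed (older PostGIS versions may require AddGeometryColumn instead of a plain ALTER TABLE), and GiST indexes on geometry columns already operate on bounding boxes, so they also cover the BOX overlap idea.
ALTER TABLE global_site ADD COLUMN centroid geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_1000 geometry;

UPDATE global_site SET centroid = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_1000 = ST_Buffer(wkb_geometry, 1000);

-- Keep the cached columns current with triggers (not shown), then index them:
CREATE INDEX global_site_centroid_gist ON global_site USING gist (centroid);
CREATE INDEX na_utility_line_buffer_gist ON na_utility_line USING gist (buffer_1000);

-- In the query, use the indexed overlap as a coarse filter and keep ST_Within
-- as the exact final check:
--   AND A.centroid && B.buffer_1000
--   AND ST_Within(A.centroid, B.buffer_1000)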
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator; see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, ensuring your DB is being vacuumed, and increasing the shared_buffers and work_mem settings for your install.
The first thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know PostgreSQL that well, but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
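For example, a hedged sketch of expression (functional) indexes on the geometry calls from the query above; the index names are invented, and whether the planner can use them depends on the query containing exactly the same expressions:
CREATE INDEX global_site_centroid_expr_gist
    ON global_site USING gist (ST_Centroid(wkb_geometry));
CREATE INDEX na_utility_line_buffer_expr_gist
    ON na_utility_line USING gist (ST_Buffer(wkb_geometry, 1000));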