I'm trying to create a faster query; right now I have large databases. My tables are 5 columns by 530k rows and 300 columns by 4k rows (sadly I have zero control over the architecture, otherwise I wouldn't be having this silly problem with a poor db).
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1, table2
WHERE table2.foo_0 = table1.foo_0
AND table1.bar1 >= NOW()
AND foo_20="tada"
ORDER BY
date desc
LIMIT 0,10
I've indexed table2.foo_0 and table1.foo_0 along with foo_20 in hopes that it would allow for faster querying, but I'm still at nearly a 7 second load time. Is there something else I can do?
Cheers
I think an index on bar1 is the key. I always run into performance issues with dates because it has to compare each of the 530K rows.
Create the following indexes:
CREATE INDEX ix_table1_0_1 ON table1 (foo_1, foo_0)
CREATE INDEX ix_table2_20_0 ON table2 (foo_20, foo_0)
and rewrite your query like this:
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1
JOIN table2
ON table2.foo_0 = table1.foo_0
AND table2.foo_20 = "tada"
WHERE table1.bar1 >= NOW()
ORDER BY
table1.foo_1 DESC
LIMIT 0, 10
The first index will be used for ORDER BY, the second one will be used for JOIN.
You, though, may benefit more from creating the first index like this:
CREATE INDEX ix_table1_0_1 ON table1 (bar1, foo_0)
which may apply more restrictive filtering on bar1.
I have a blog post on this:
Choosing index
, which advises on how to choose which index to create for cases like this.
Indexing table1.bar1 may improve the >=NOW comparison.
A compound index on table2.foo_0 and table2.foo_20 will help.
An index on table2.foo_1 may help the sort.
Overall, pasting the output of your query with EXPLAIN prepended may also give some hints.
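Put as DDL, those three suggestions might look like this (index names are mine; MySQL syntax assumed):

```sql
-- Helps the range comparison table1.bar1 >= NOW()
CREATE INDEX ix_table1_bar1 ON table1 (bar1);

-- Compound index covering the join column plus the foo_20 equality
CREATE INDEX ix_table2_0_20 ON table2 (foo_0, foo_20);

-- May help the sort, since the ORDER BY is derived from foo_1
CREATE INDEX ix_table2_foo1 ON table2 (foo_1);
```

Which of these the optimizer actually uses depends on the data, so check with EXPLAIN.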
table2 needs a compound index on foo_0, foo_20, and bar1.
An index on table1.foo_0, table1.bar1 could help too, assuming that foo_20 belongs to table1.
See How to use MySQL indexes and Optimizing queries with explain.
Use compound indexes that correspond to your WHERE equalities (in general the leftmost columns in the index), WHERE comparisons to an absolute value (middle), and ORDER BY clause (rightmost, in the same order).
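As a generic illustration of that rule (the table and column names here are hypothetical):

```sql
-- WHERE status = 'active'      -- equality        → leftmost
--   AND created_at >= NOW()    -- range comparison → middle
-- ORDER BY priority            -- sort column     → rightmost
CREATE INDEX ix_tasks_demo ON tasks (status, created_at, priority);
```

Note that some engines cannot use the trailing sort column once a range predicate is involved, so verify the plan with EXPLAIN.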
I need to find a fast way to determine if records exist in a database table. The normal method of IF Exists (condition) is not fast enough for my needs. I've found something that is faster but does not work quite as intended.
The normal IF Exists (condition) which works but is too slow for my needs:
IF EXISTS (SELECT *
From dbo.SecurityPriceHistory
Where FortLabel = 'EP'
and TradeTime >= '2020-03-20 15:03:53.000'
and Price >= 2345.26)
My work around that doesn't work, but is extremely fast:
IF EXISTS (SELECT IIF(COUNT(*) = 0, null, 1)
From dbo.SecurityPriceHistory
Where FortLabel = 'EP'
and TradeTime >= '2020-03-20 15:03:53.000'
and Price >= 2345.26)
The issue with the second solution is that when the count(*) = 0, null is returned, but that causes IF EXISTS(null) to return true.
The second solution is fast because it doesn't read any data in the execution plan, while the first one does read data.
I suggest leaving the original code unchanged, but adding an index to cover one (or more) of the columns in the WHERE clause.
If I changed anything, I might limit the SELECT clause to a single non-null small column.
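A minimal sketch of that suggestion: keep the original predicates and shrink the SELECT list to a single small column (T-SQL; whether this actually changes the plan depends on your data and indexes):

```sql
IF EXISTS (SELECT Price  -- one small column, assumed non-null
           FROM dbo.SecurityPriceHistory
           WHERE FortLabel = 'EP'
             AND TradeTime >= '2020-03-20 15:03:53.000'
             AND Price >= 2345.26)
    PRINT 'rows exist';
```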
Switching to a column store index in my particular use case appears to solve my performance problem.
For this query:
IF EXISTS (SELECT *
From dbo.SecurityPriceHistory
Where FortLabel = 'EP' and
TradeTime >= '2020-03-20 15:03:53.000' and
Price >= 2345.26
)
You want an index on either:
SecurityPriceHistory(FortLabel, TradeTime, Price)
or:
SecurityPriceHistory(FortLabel, Price, TradeTime)
The difference is whether TradeTime or Price is more selective. A single column index is probably not sufficient for this query.
The third column in the index is just there so the index covers the query and doesn't have to reference the data pages.
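In T-SQL, the first of those variants might be created like this (the index name is mine):

```sql
CREATE INDEX ix_sph_label_time_price
    ON dbo.SecurityPriceHistory (FortLabel, TradeTime, Price);
```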
Query is
SELECT DISTINCT A.X1, A.X2, A.X3, TO_DATE(A.EVNT_SCHED_DATE,'DD-Mon-YYYY') AS EVNT_SCHED_DATE,
A.X4, A.MOVEMENT_TYPE, TRIM(A.EFFECTIVE_STATUS) AS STATUS, A.STATUS_TIME, A.TYPE,
A.LEG_NUMBER,
CASE WHEN A.EFFECTIVE_STATUS='BT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='NLT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='MKUP' THEN 'MKUP'
END AS STATUS
FROM PHASE1.DY_STATUS_ZONE A
WHERE A.LAST_LEG_FLAG='Y'
AND SCHLD_DATE>='01-Apr-2019'--TO_DATE(''||MNTH_DATE||'','DD-Mon-YYYY')
AND SCHLD_DATE<='20-Feb-2020'--TO_DATE(''||TILL_DATE||'','DD-Mon-YYYY')
AND A.MOVEMENT_TYPE IN ('P')
AND (EXCEPTIONAL_FLAG='N' OR EXCEPTION_TYPE='5') ---------SS
PHASE1.DY_STATUS_ZONE has 710,246 records in it. Please advise whether this query can be optimized.
You could try adding an index which covers the WHERE clause:
CREATE INDEX idx ON PHASE1.DY_STATUS_ZONE (LAST_LEG_FLAG, SCHLD_DATE, MOVEMENT_TYPE,
EXCEPTIONAL_FLAG, EXCEPTION_TYPE);
Depending on the cardinality of your data, the above index may or may not be used.
The problem might be the SELECT DISTINCT. This can be hard to optimize because it removes duplicates, and even if no rows are duplicated, Oracle still does the work. If it is not needed, remove it.
For your particular query, I would write it as:
WHERE A.LAST_LEG_FLAG = 'Y' AND
      SCHLD_DATE >= DATE '2019-04-01' AND
      SCHLD_DATE <= DATE '2020-02-20' AND
      A.MOVEMENT_TYPE = 'P' AND
      (EXCEPTIONAL_FLAG = 'N' OR EXCEPTION_TYPE = '5')
The date formats don't affect performance. Just readability and maintainability.
For this query, the optimal index is probably: (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG). The last two columns might be switched, if EXCEPTIONAL_FLAG is more selective than SCHLD_DATE.
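That suggestion written out as DDL (the index name is mine):

```sql
CREATE INDEX ix_dsz_leg_move_date_flag
    ON PHASE1.DY_STATUS_ZONE
       (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG);
```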
However, if this returns many rows, then the SELECT DISTINCT will be the gating factor for the query. And that is much more difficult to optimize.
I am new to this site, but please don't hold it against me. I have only used it once.
Here is my dilemma: I have moderate SQL knowledge but am no expert. The query below was created by a consultant a long time ago.
On most mornings it takes 1.5 hours to run because there is lots of data, BUT on other mornings it takes 4-6 hours. I have tried eliminating any jobs that are running at the same time. I am thoroughly confused as to how to find out what is causing this problem.
Any help would be appreciated.
I have already broken this query into 2 queries, but any tips on ways to help boost performance would be greatly appreciated.
This query builds back our inventory transactions to find what our stock on hand value was at any given point in time.
SELECT
ITCO, ITIM, ITLOT, Time, ITWH, Qty, ITITCD,ITIREF,
SellPrice, SellCost,
case
when Transaction_Cost is null
then Qty * (SELECT ITIACT
FROM (Select Top 1 B.ITITDJ, B.ITIREF, B.ITIACT
From OMCXIT00 AS B
Where A.ITCO = B.ITCO
AND A.ITWH = B.ITWH
AND A.ITIM = B.ITIM
AND A.ITLOT = B.ITLOT
AND ((A.ITITDJ > B.ITITDJ)
OR (A.ITITDJ = B.ITITDJ AND A.ITIREF <= B.ITIREF))
ORDER BY B.ITITDJ DESC, B.ITIREF DESC) as C)
else Transaction_Cost
END AS Transaction_Cost,
case when ITITCD = 'S' then ' Shipped - Stock' else null end as TypeofSale,
case when ititcd = 'S' then ITIREF else null end as OrderNumber
FROM
dbo.InvTransTable2 AS A
Here is the execution plan.
http://i.imgur.com/mP0Cu.png
Here is the DTA output, but I am unsure how to read it since the recommendations are blank. Shouldn't that say "Create"?
http://i.imgur.com/4ycIP.png
You can't do much with dbo.InvTransTable2, because you are selecting all records from it, so it will be left scanning all of them.
Make sure that you have a clustered index on OMCXIT00; it looks like it is a heap with no clustered index.
Make sure that the clustered index key is small but has many distinct values.
If OMCXIT00 does not have many records, it may be sufficient to create an index with key ITCO and the following included columns: (ITITDJ, ITIREF, ITWH, ITIM, ITLOT)
Index creation example:
CREATE INDEX IX_dbo_OMCXIT00
ON OMCXIT00 ([ITCO])
INCLUDE ( ITITDJ , ITIREF)
If that does not help, then you need to see which of the columns in the predicates you are searching on have more distinct values, create an index keyed on one or more of them, and possibly reorder the predicates in the WHERE clause accordingly.
A.ITCO = B.ITCO
AND A.ITWH = B.ITWH
AND A.ITIM = B.ITIM
AND A.ITLOT = B.ITLOT
Besides adding indexes to turn table scans into index seeks, ask yourself: "Do I really need this ORDER BY in this SQL code?" If you don't need the sorting, remove the ORDER BY from your SQL; there is a good chance your code will be faster.
I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we've got the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example :
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache:
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
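The steps above might be sketched like this (column and index names are mine; a trigger to keep the cached columns current is omitted):

```sql
-- Cache the expensive geometry results in real columns
ALTER TABLE global_site     ADD COLUMN centroid_geom geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_geom   geometry;

UPDATE global_site     SET centroid_geom = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_geom   = ST_Buffer(wkb_geometry, 1000);

-- GiST indexes so the bounding-box overlap test can use an index
CREATE INDEX ix_site_centroid ON global_site     USING gist (centroid_geom);
CREATE INDEX ix_line_buffer   ON na_utility_line USING gist (buffer_geom);

-- Indexed bounding-box prefilter (&&), with ST_Within as the exact final check
SELECT A.s_oid, A.area_acre
FROM global_site A
JOIN na_utility_line B
  ON A.centroid_geom && B.buffer_geom
WHERE ST_Within(A.centroid_geom, B.buffer_geom);
```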
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here :
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing off of peak hours though because this can be somewhat intensive with a large amount of data. One thing you will have to look at as well with indexes is rebuilding them on a schedule to ensure performance over time. Again this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off I would look at creating indexes , ensure your db is being vacuumed, increase the shared buffers for your db install, work_mem settings.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating Expression based indexes for the benefit of the hairier joins and where conditions?
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND RTRIM(LTRIM(lname)) = '.iq bzw'
AND TO_CHAR(dob,'MM/DD/YYYY')= '08/10/2004'
AND system_code = 'M';
This query is taking 153 seconds to run. There are millions of rows in VERIFICATION_TABLE.
I think query is taking long because of the functions in where clause. However, I need to do ltrim rtrim on the columns and also date has to be matched in MM/DD/YYYY format. How can I optimize this query?
Explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS 80604 1 59
SORT AGGREGATE 1 59
TABLE ACCESS FULL P181 VERIFICATION_TABLE 80604 1 59
Primary key:
VRFTN_PK Primary VERIFICATION_ID
Indexes:
N_VRFTN_IDX2 head, mbr, dob, lname, verification_id
N_VRFTN_IDX3 last_update_date
N_VRFTN_IDX4 mbr, lname, dob, verification_id
N_VRFTN_IDX4 verification_id
Though, in the explain plan I don't see the indexes or primary key being used. Is that the problem?
Try this:
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND TRIM(lname) = '.iq bzw'
AND TRUNC(dob) = TO_DATE('08/10/2004','MM/DD/YYYY')
AND system_code = 'M';
Remove that TRUNC() if dob doesn't already have a time component; from the looks of it (Date of Birth?) it may not. Past that, you need some indexing work. If you're querying in this style that often, I'd index mbr and head in a two-column index; knowing what the columns mean would help determine the best indexing here.
The only index that is a possible candidate for use in your query is N_VRFTN_IDX2, because it indexes four of the columns you use in your WHERE clause: HEAD, MBR, DOB and LNAME.
However, because you apply functions to both DOB and LNAME they are ineligible for consideration. The optimizer may then decide not to use that index because it thinks HEAD+MBR on their own are an insufficiently selective combination. If you removed the TO_CHAR() call from DOB then you have three leading columns on N_VRFTN_IDX2 which might make it more attractive to the optimizer. Likewise, is it necessary to TRIM() LNAME?
The other thing is, the need to look up SYSTEM_CODE means the query has to read from the table (because that column is not indexed). If N_VRFTN_IDX2 has a poor clustering factor, the optimizer may decide to go for a FULL TABLE SCAN because the indexed reads are an overhead. Whereas if you added SYSTEM_CODE to the index, the entire query could be satisfied by an INDEX RANGE SCAN, which would be a lot faster.
Finally, how fresh are your statistics? If your statistics are stale, that might lead the optimizer to make a duff decision. For instance, more accurate statistics might lead the optimizer to use the compound index even with just the two leading columns.
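If the statistics are indeed stale, refreshing them for the table (and, with cascade, its indexes) is straightforward:

```sql
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,                  -- schema owning the table
    tabname => 'VERIFICATION_TABLE',
    cascade => TRUE                   -- gather index statistics too
  );
END;
/
```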
You should turn the literal into a DATE and not the column into a VARCHAR2 like this:
AND dob = TO_DATE('08/10/2004','MM/DD/YYYY')
Or use the preferable ANSI date literal syntax:
AND dob = DATE '2004-08-10'
If the dob column contains time (a date of birth doesn't usually, except presumably in a hospital!) then you can do:
AND dob >= DATE '2004-08-10'
AND dob < DATE '2004-08-11'
Check the datatypes for HEAD and MBR.
The values "687422 and 23102" have the 'feel' of being quite selective. That is, if you have hundreds of thousands of values for head and millions of records in the table, it would seem that HEAD is quite selective. [That could be totally misleading though.]
Anyway, you may find that HEAD and/or MBR are actually stored as VARCHAR2 or CHAR fields rather than NUMBER. If so, comparing the character to a number would prevent the use of the index. Try the following (and I've included the conversion of the dob predicate with a date but added the explicit format mask).
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = '687422'
AND mbr = '23102'
AND RTRIM(LTRIM(lname)) = '.iq bzw'
AND TRUNC(dob) = TO_DATE('08/10/2004','MM/DD/YYYY')
AND system_code = 'M';
Please provide an EXPLAIN output on this query so we know where the slow-down occurs. Two thoughts:
change
AND TO_CHAR(dob,'MM/DD/YYYY')= '08/10/2004'
to
AND dob = <date here, not sure which oracle str2date function you need>
and use a function based index on
RTRIM(LTRIM(lname))
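That function-based index might be created like this (the index name is mine; the indexed expression must match the query's expression exactly for Oracle to use it):

```sql
CREATE INDEX ix_vrftn_lname_trimmed
    ON VERIFICATION_TABLE (RTRIM(LTRIM(lname)));
```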
Try this:
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND TRIM(lname) = '.iq bzw'
AND dob BETWEEN TO_DATE('08/10/2004','MM/DD/YYYY') AND TO_DATE('08/11/2004','MM/DD/YYYY')
AND system_code = 'M';
This way a possible index on dob will be used.
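For that to pay off, dob itself needs to appear in an index; one option (the name and leading columns are my assumption, combining the equality columns with dob):

```sql
CREATE INDEX ix_vrftn_head_mbr_dob
    ON VERIFICATION_TABLE (head, mbr, dob);
```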