Postgres SQL slow when using OR statement - sql

I have a query that takes 12 seconds (the journal table has 1.2m rows). The root cause is below:
select 1
from
myTable myTable
join myJournal Journal on (Journal.status=0 and myTable.id = Journal.myTableId)
join arrival Arrival on (myTable.id = Arrival.myTableId)
join calc Calc on (myTable.id = Calc.myTableId)
join ms ms on (myTable.id = ms.myTableId)
join perf Perf on (myTable.id = Perf.myTableId)
join ref Ref on (myTable.id = Ref.myTableId)
where
((myTable.name like 'cheese%' or
Journal.algorithm like 'cheese%' or --if this is removed, it's fine (<1 sec)
myTable.client like 'cheese%' or
myTable.something like 'cheese%'))
However the journal table performs ok. Doing
select * from myJournal where algorithm like 'cheese%' --takes < 1 sec.
The query also performs fine, if I remove four of the joins (that are not used in the where clause).
When I join on more than 3 tables the performance degrades dramatically/exponentially.

I would start by writing the query as:
select 1
from myTable t join
myJournal j
on t.id = j.myTableId
where j.status = 0 and
(t.name like 'cheese%' or
j.algorithm like 'cheese%' or
t.client like 'cheese%' or
t.something like 'cheese%'
)
Then the optimal indexes for this are probably myJournal(status, myTableId, algorithm) and myTable(id, name, client, something).
These indexes are primarily for the join and the first filter condition. They won't help much with the string comparisons. But, those are hard to optimize for due to the or conditions.
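A minimal sketch of the two suggested indexes, using an in-memory SQLite database as a stand-in for Postgres so that it can be run anywhere. The index names and sample rows are made up for illustration, and the plan SQLite prints will not match what Postgres chooses, so check EXPLAIN ANALYZE on the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE myTable (id INTEGER PRIMARY KEY, name TEXT, client TEXT, something TEXT);
    CREATE TABLE myJournal (id INTEGER PRIMARY KEY, myTableId INTEGER, status INTEGER, algorithm TEXT);

    -- The two composite indexes suggested above (index names are hypothetical).
    CREATE INDEX idx_journal_status_tbl_algo ON myJournal (status, myTableId, algorithm);
    CREATE INDEX idx_mytable_cover ON myTable (id, name, client, something);

    -- Made-up sample rows.
    INSERT INTO myTable VALUES (1, 'cheese board', 'acme', 'x'),
                               (2, 'bread', 'acme', 'y'),
                               (3, 'wine', 'cheese co', 'z');
    INSERT INTO myJournal VALUES (10, 1, 0, 'alpha'),
                                 (11, 2, 0, 'cheese-sort'),
                                 (12, 3, 1, 'beta'),
                                 (13, 2, 0, 'gamma');
""")

QUERY = """
    SELECT t.id, j.id
    FROM myTable t
    JOIN myJournal j ON t.id = j.myTableId
    WHERE j.status = 0
      AND (t.name LIKE 'cheese%' OR j.algorithm LIKE 'cheese%'
           OR t.client LIKE 'cheese%' OR t.something LIKE 'cheese%')
    ORDER BY t.id, j.id
"""

# Rows that pass the status filter and at least one of the OR-ed LIKEs.
matches = conn.execute(QUERY).fetchall()

# The plan shows which indexes this engine picked for the join and filter.
plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
for step in plan:
    print(step)
print(matches)
```

The row for journal 12 is dropped by the status filter even though its client matches, which is the part the myJournal index is meant to speed up.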

Related

SQL optimization JOIN before table scan?

I have a SQL query similar to
SELECT columnName
FROM
(SELECT columnName, someColumnWithXml
FROM _Table1
INNER JOIN _Activity ON _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType ON _Activity.activityType = _ActivityType.oid
--_ActivityType.forType is a string
WHERE _ActivityType.forType = '_Disclosure'
AND _Activity.emailRecipients IS NOT NULL) subquery
WHERE subquery.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%'
There are 15 million rows in _Table1 and the WHERE subquery.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%' results in an execution plan that performs a full table scan on all 15 million rows. The subquery results in only a few hundred thousand rows and those are all the rows that really need to have the LIKE run on them. Is there a way to make this more efficient by running the LIKE only on the results of the subquery rather than running a TABLE SCAN with a LIKE on 15,000,000 rows? The someColumnWithXML column is not indexed.
For this query:
SELECT columnName, someColumnWithXml
FROM _Table1 t1 INNER JOIN
_Activity a
ON a.oid = t1.columnName INNER JOIN
_ActivityType at
ON a.activityType = at.oid --_ActivityType.forType is a string
WHERE at.forType = '_Disclosure' AND
a.emailRecipients IS NOT NULL AND
t1.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%';
You have a challenge with optimizing this query. I don't know if the filtering conditions are particularly restrictive. If they are, then try indexes on:
_ActivityType(forType, oid)
_Activity(activityType, emailRecipients, oid)
_Table1(columnName)
If these don't help, then you might need an index on the XML column. Perhaps an XML index would work. Such an index would not really help for a generic LIKE, but that might not be needed if you parse the XML.
You could apply the filter directly in the subquery, avoiding the scan of rows that are not needed:
SELECT columnName, someColumnWithXml
FROM _Table1
INNER JOIN _Activity on _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType on _Activity.activityType = _ActivityType.oid
--_ActivityType.forType is a string
WHERE _ActivityType.forType = '_Disclosure'
AND _Activity.emailRecipients IS NOT NULL
AND someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%'

Why do multiple EXISTS break a query

I am attempting to include a new table with values that need to be checked and included in a stored procedure. Statement 1 is the existing table that needs to be checked against, while statement 2 is the new table to check against.
I currently have 2 EXISTS conditions that function independently and produce the results I am expecting. By this I mean if I comment out Statement 1, Statement 2 works and vice versa. When I put them together the query doesn't complete: there is no error, but it times out, which is unexpected because each statement only takes a few seconds.
I understand there is likely a better way to do this but before I do, I would like to know why I cannot seem to do multiple exists statements like this? Are there not meant to be multiple EXISTS conditions in the WHERE clause?
SELECT *
FROM table1 S
WHERE
--Statement 1
EXISTS
(
SELECT 1
FROM table2 P WITH (NOLOCK)
INNER JOIN table3 SA ON SA.ID = P.ID
WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
AND
(
SA.FILTER = ''
OR
(
SA.FILTER = 'bar'
AND
LOWER(S.OTHER) = 'foo'
)
)
)
OR
(
--Statement 2
EXISTS
(
SELECT 1
FROM table4 P WITH (NOLOCK)
INNER JOIN table5 SA ON SA.ID = P.ID
WHERE P.DATE = @Date
AND P.OTHER_ID = S.ID
AND LOWER(S.OTHER) = 'foo'
)
)
EDIT: I have included the query details. Table 1-5 represent different tables, there are no repeated tables.
Too long to comment.
Your query as written seems correct. The timeout can only be troubleshot from the execution plan, but here are a few things that could be happening or that you could benefit from.
Parameter sniffing on @Date. Try hard-coding this value and see if you still get the same slowness
No covering index on P.OTHER_ID or P.DATE or P.ID or SA.ID which would cause a table scan for these predicates
Indexes for the above columns which aren't optimal (including too many columns, etc)
Your query being serial when it may benefit from parallelism.
Using the LOWER function on a database which doesn't have a case sensitive collation (most don't, though this function doesn't slow things down that much)
You have a bad query plan in cache. Try adding OPTION (RECOMPILE) at the bottom so you get a new query plan. This is also done when comparing the speed of two queries to ensure they aren't using cached plans, or one isn't when another is which would skew the results.
Since your query is timing out, try capturing the estimated execution plan and post it for us at Paste The Plan
I found putting 2 EXISTS in the WHERE condition made the whole process take significantly longer. What I found fixed it was using UNION and keeping the EXISTS in separate queries. The final result looked like the following:
SELECT *
FROM table1 S
WHERE
--Statement 1
EXISTS
(
SELECT 1
FROM table2 P WITH (NOLOCK)
INNER JOIN table3 SA ON SA.ID = P.ID
WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
AND
(
SA.FILTER = ''
OR
(
SA.FILTER = 'bar'
AND
LOWER(S.OTHER) = 'foo'
)
)
)
UNION
--Statement 2
SELECT *
FROM table1 S
WHERE
EXISTS
(
SELECT 1
FROM table4 P WITH (NOLOCK)
INNER JOIN table5 SA ON SA.ID = P.ID
WHERE P.DATE = @Date
AND P.OTHER_ID = S.ID
AND LOWER(S.OTHER) = 'foo'
)
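To check that the UNION form returns the same rows as the two EXISTS joined by OR, here is a small sketch on an in-memory SQLite database (a stand-in for SQL Server, so NOLOCK is dropped and the date is a bound parameter). The sample rows are invented. Note that UNION removes duplicate rows, so the two forms only agree when table1 has no fully duplicated rows; selecting the key column, as done here, sidesteps that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER PRIMARY KEY, OTHER TEXT);
    CREATE TABLE table2 (ID INTEGER, OTHER_ID INTEGER, "DATE" TEXT);
    CREATE TABLE table3 (ID INTEGER, "FILTER" TEXT);
    CREATE TABLE table4 (ID INTEGER, OTHER_ID INTEGER, "DATE" TEXT);
    CREATE TABLE table5 (ID INTEGER);

    -- Made-up rows: table1 row 1 satisfies both statements.
    INSERT INTO table1 VALUES (1, 'Foo'), (2, 'baz'), (3, 'FOO'), (4, 'baz');
    INSERT INTO table2 VALUES (100, 1, '2020-01-01'), (101, 2, '2020-01-01'), (102, 4, '2020-01-01');
    INSERT INTO table3 VALUES (100, 'bar'), (101, ''), (102, 'bar');
    INSERT INTO table4 VALUES (200, 3, '2020-01-01'), (201, 1, '2020-01-01');
    INSERT INTO table5 VALUES (200), (201);
""")

STATEMENT_1 = """
    EXISTS (SELECT 1 FROM table2 P
            JOIN table3 SA ON SA.ID = P.ID
            WHERE P."DATE" = :d AND P.OTHER_ID = S.ID
              AND (SA."FILTER" = ''
                   OR (SA."FILTER" = 'bar' AND LOWER(S.OTHER) = 'foo')))
"""
STATEMENT_2 = """
    EXISTS (SELECT 1 FROM table4 P
            JOIN table5 SA ON SA.ID = P.ID
            WHERE P."DATE" = :d AND P.OTHER_ID = S.ID
              AND LOWER(S.OTHER) = 'foo')
"""

# Original shape: both EXISTS in one WHERE clause, joined by OR.
or_sql = f"SELECT S.ID FROM table1 S WHERE {STATEMENT_1} OR {STATEMENT_2} ORDER BY S.ID"

# Rewritten shape: one query per EXISTS, combined with UNION (which deduplicates).
union_sql = (
    f"SELECT S.ID FROM table1 S WHERE {STATEMENT_1} "
    f"UNION SELECT S.ID FROM table1 S WHERE {STATEMENT_2} ORDER BY 1"
)

params = {"d": "2020-01-01"}
or_ids = [row[0] for row in conn.execute(or_sql, params)]
union_ids = [row[0] for row in conn.execute(union_sql, params)]
print(or_ids, union_ids)
```

If duplicates in the result are acceptable or impossible, UNION ALL avoids the deduplication step, but then a row that satisfies both statements would appear twice.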

Refactoring slow SQL query

I currently have this very very slow query:
SELECT generators.id AS generator_id, COUNT(*) AS cnt
FROM generator_rows
JOIN generators ON generators.id = generator_rows.generator_id
WHERE
generators.id IN (SELECT "generators"."id" FROM "generators" WHERE "generators"."client_id" = 5212 AND ("generators"."state" IN ('enabled'))) AND
(
generators.single_use = 'f' OR generators.single_use IS NULL OR
generator_rows.id NOT IN (SELECT run_generator_rows.generator_row_id FROM run_generator_rows)
)
GROUP BY generators.id;
And I'm trying to refactor and improve it with this query:
SELECT g.id AS generator_id, COUNT(*) AS cnt
from generator_rows gr
join generators g on g.id = gr.generator_id
join lateral(select case when exists(select * from run_generator_rows rgr where rgr.generator_row_id = gr.id) then 0 else 1 end as noRows) has on true
where g.client_id = 5212 and "g"."state" IN ('enabled') AND
(g.single_use = 'f' OR g.single_use IS NULL OR has.norows = 1)
group by g.id
For some reason it doesn't quite work as expected (it returns 0 rows). I think I'm pretty close to the end result but can't get it to work.
I'm running on PostgreSQL 9.6.1.
This appears to be the query, formatted so I can read it:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id
WHERE gr.generator_id IN (SELECT g.id
FROM generators g
WHERE g.client_id = 5212 AND
g.state = 'enabled'
) AND
(g.single_use = 'f' OR
g.single_use IS NULL OR
gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
)
GROUP BY gr.generator_id;
I would be inclined to do most of this work in the FROM clause:
SELECT gr.generator_id, COUNT(*) AS cnt
FROM generators g JOIN
generator_rows gr
ON g.id = gr.generator_id JOIN
generators gg
on g.id = gg.id AND
gg.client_id = 5212 AND gg.state = 'enabled' LEFT JOIN
run_generator_rows rgr
ON gr.id = rgr.generator_row_id
WHERE g.single_use = 'f' OR
g.single_use IS NULL OR
rgr.generator_row_id IS NULL
GROUP BY gr.generator_id;
This does make two assumptions that I think are reasonable:
generators.id is unique
run_generator_rows.generator_row_id is unique
(It is easy to avoid these assumptions, but the duplicate elimination is more work.)
Then, some indexes could help:
generators(client_id, state, id)
run_generator_rows(generator_row_id)
generator_rows(generator_id)
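One thing worth knowing when replacing NOT IN with a LEFT JOIN ... IS NULL: the two only agree while the subquery column contains no NULLs. The sketch below uses an in-memory SQLite database with invented rows to show both forms side by side, then adds a NULL generator_row_id. It is not a claim about what the data in the question looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE generator_rows (id INTEGER PRIMARY KEY, generator_id INTEGER);
    CREATE TABLE run_generator_rows (id INTEGER PRIMARY KEY, generator_row_id INTEGER);
    -- Made-up rows: only generator_rows.id = 2 has been run.
    INSERT INTO generator_rows VALUES (1, 7), (2, 7), (3, 8);
    INSERT INTO run_generator_rows VALUES (1, 2);
""")

NOT_IN = """
    SELECT gr.id FROM generator_rows gr
    WHERE gr.id NOT IN (SELECT rgr.generator_row_id FROM run_generator_rows rgr)
    ORDER BY gr.id
"""
ANTI_JOIN = """
    SELECT gr.id FROM generator_rows gr
    LEFT JOIN run_generator_rows rgr ON rgr.generator_row_id = gr.id
    WHERE rgr.generator_row_id IS NULL
    ORDER BY gr.id
"""

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Without NULLs the two forms return the same unmatched rows.
not_in_before = ids(NOT_IN)
anti_join_before = ids(ANTI_JOIN)

# A single NULL in the subquery makes NOT IN evaluate to unknown for every
# non-matching row, so it filters everything out; the anti-join is unaffected.
conn.execute("INSERT INTO run_generator_rows VALUES (2, NULL)")
not_in_after = ids(NOT_IN)
anti_join_after = ids(ANTI_JOIN)

print(not_in_before, anti_join_before, not_in_after, anti_join_after)
```

NOT EXISTS behaves like the anti-join here, which is one reason it is often preferred over NOT IN on nullable columns.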
Generally avoid inner selects as in
WHERE ... IN (SELECT ...)
as they are usually slow.
As has already been shown for your problem, it's a good idea to think of SQL in terms of set theory.
You do NOT join tables on their identity alone:
Conceptually, SQL takes the set (that is, all rows) of the first table and "multiplies" it with the set of the second table, ending up with n times m rows.
Then the ON clause is used to reduce the result (often drastically) by evaluating its condition for each of those combinations as either true (keep) or false (drop). This way you can choose any arbitrary logic to select the combinations you want.
Things get trickier with LEFT JOIN and RIGHT JOIN, but one can think of them as keeping every row of one side. For each row of that side:
output the combinations for that row IF the logic yields true (at least once), exactly as JOIN does
otherwise output exactly ONE row, with 'the other side' (the right side for LEFT JOIN and vice versa) consisting of NULL in every column.
COUNT(*) is great too, but if things get complicated don't stick to it: use sub-selects for the keys only, and once all the hard work is done, join the fun stuff to it. Like in
SELECT sub.VALID_COUNT, sub.ID, details.*
FROM
(
SELECT SUM(CASE WHEN X THEN 1 ELSE 0 END) AS VALID_COUNT, ID
FROM ...
GROUP BY ID
) AS sub
JOIN ... AS details ON sub.id = details.id
The difference is that the inner query is executed only once. The outer query usually has no indexes left to work with and will be slow, but if the inner select doesn't make the data explode, this is usually many times faster than SELECT ... WHERE ... IN (SELECT ...) constructs.
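A runnable sketch of that pattern, with the placeholders filled in by hypothetical names: an items table with a flag column x plays the role of the inner FROM, and a details table supplies the extra columns. It runs on an in-memory SQLite database purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables standing in for the elided ones.
    CREATE TABLE items (id INTEGER, x INTEGER);
    CREATE TABLE details (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO items VALUES (1, 1), (1, 0), (1, 1), (2, 0), (2, 0);
    INSERT INTO details VALUES (1, 'first'), (2, 'second');
""")

rows = conn.execute("""
    SELECT sub.id, sub.valid_count, d.label
    FROM (
        -- Do the counting on the keys only, once.
        SELECT id, SUM(CASE WHEN x = 1 THEN 1 ELSE 0 END) AS valid_count
        FROM items
        GROUP BY id
    ) AS sub
    -- Join the descriptive columns after the aggregation is done.
    JOIN details AS d ON d.id = sub.id
    ORDER BY sub.id
""").fetchall()

print(rows)
```

The derived table yields one row per id, so the join to details cannot multiply the counts.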

Why does separate table perform significantly better than subquery?

I was trying to improve the performance of a SQL query and tried a few combinations.
Original Query
SELECT ALIAS_A.id1,
ALIAS_A.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
FROM db_A.table_A ALIAS_A
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON ALIAS_A.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON ALIAS_A.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(ALIAS_A.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
AND Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE
The above query consumes nearly 400k impactCPU
Optimized Query 1
SELECT New_sub_table.id1,
New_sub_table.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
--changed part start--
FROM ( sel * from db_A.table_A ALIAS_A WHERE Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE ) New_sub_table -- created a subquery
--changed part end--
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON New_sub_table.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON New_sub_table.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(New_sub_table.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
I thought I would filter the data first and then do the joins. When I checked the performance stats, it was consuming nearly 390k CPU. Not much of a difference.
Optimized Query 2
SELECT ALIAS_A.id1,
ALIAS_A.id2,
ALIAS_B.columnA,
ALIAS_C.columnB,
ALIAS_B.columnC
--changed part start--
FROM INTERMEDIATE_DB.INTERMEDIATE_TABLE ALIAS_A --CREATED AN INTERMEDIATE TABLE
--changed part end--
LEFT OUTER JOIN db_A.table_B ALIAS_B
ON ALIAS_A.id2 = ALIAS_B.id2
LEFT OUTER JOIN db_B.table_C ALIAS_C
ON ALIAS_B.columnA = ALIAS_C.item_num
LEFT OUTER JOIN db_A.table_D ALIAS_D
ON ALIAS_A.id2 = ALIAS_D.id2
INNER JOIN db_C.table_E ALIAS_E
ON Cast(ALIAS_A.column_date AS DATE) BETWEEN
ALIAS_E.column_startdate AND ALIAS_E.column_enddate
WHERE ALIAS_E.fiscalyear >= 2016
MACRO for loading data into intermediate table
INSERT INTO INTERMEDIATE_DB.INTERMEDIATE_TABLE
sel * from db_A.table_A ALIAS_A WHERE Cast(ALIAS_A.columnD AS DATE) BETWEEN
CURRENT_DATE - 5 AND CURRENT_DATE
What I did here was use an intermediate table instead of a subquery. The intermediate table gets loaded via the macro first and then the select query runs. It now consumes only 50k impactCPU (for the macro and the select query combined).
My question -
I am unable to reason why this is happening even though the logic behind both queries is the same (or so I think). What would be the best practice if this is the incorrect way?
Your main problem is the Cast(ALIAS_A.columnD AS DATE). When you check the Explain you will notice that the optimizer has no confidence for this step, and it is probably greatly overestimating the number of rows returned.
But when you materialize the Select, the number of rows is better known and the order of the joins changes.
You would probably get the same plan if you Collect Statistics on Cast(ALIAS_A.columnD AS DATE). Run DIAGNOSTIC HELPSTATS ON FOR SESSION; and the Explain should list this among the recommended stats.

Improve Performance of SQL query joining 14 tables

I am trying to join 14 tables, a few of which I need to join using a left join.
With the existing data, which is around 7000 records, it takes around 10 seconds to execute the query below. I am afraid of what will happen when there are more than a million records. Please help me improve the performance of the query below.
CREATE proc [dbo].[GetTodaysActualInvoiceItemSoldHistory]
@fromdate datetime,
@todate datetime
as
Begin
select SDID.InvoiceDate as [Sold Date],Cust.custCompanyName as [Sold To] ,
case SQBD.TransferNo when '0' then IVM.VendorName else SQBD.TransferNo end as [Purchase From],
SQBD.BatchSellQty as SoldQty,SQID.SellPrice,
SDID.InvoiceNo as [Sales Invoice No],INV.PRInvoiceNo as [PO Invoice No],INV.PRInvoiceDate as [PO Invoice Date],
SQID.ItemDesc as [Item Description],SQID.NetPrice,SDHM.DeliveryHeaderMasterName as DeliveryHeaderName,
SQID.ItemCode as [Item Code],
SQBD.BatchNo,SQBD.ExpiryDate,SQID.Amount,
SQID.Dept_ID as Dept_ID,
Dept_Name as [Department],SQID.Catg_ID as Catg_ID,
Category_Name as [Category],SQID.Brand_ID as Brand_ID,
BrandName as BrandName, SQID.Manf_Id as Manf_Id,
Manf.ManfName as [Manufacturer],
STM.TaxName, SQID.Tax_ID as Tax_ID,
INV.VendorID as VendorID,
SQBD.ItemID,SQM.Isdeleted,
SDHM.DeliveryHeaderMasterID,Cust.CustomerMasterID
from SD_QuotationMaster SQM
inner join SD_InvoiceDetails SDID on SQM.QuoteID = SDID.QuoteID
inner join SD_QuoteItemDetails SQID on SDID.QuoteID = SQID.QuoteID
inner join SD_QuoteBatchDetails SQBD on SDID.QuoteID = SQBD.QuoteID and SQID.ItemID=SQBD.ItemID
inner join INV_ProductInvoice INV on SQBD.InvoiceID=INV.ProductInvoiceID
inner jOIN INV_VendorMaster IVM ON INV.VendorID = IVM.VendorID
inner jOIN Sys_TaxMaster STM ON SQID.Tax_ID = STM.Tax_ID
inner join Cust_CustomerMaster Cust on SQM.CustomerMasterID = Cust.CustomerMasterID
left jOIN INV_DeptartmentMaster Dept ON SQID.Dept_ID = Dept.Dept_ID
left jOIN INV_BrandMaster BRD ON SQID.Brand_ID = BRD.Brand_ID
left jOIN INV_ManufacturerMaster Manf ON SQID.Manf_Id = Manf.Manf_Id
left join INV_CategoryMaster CAT ON SQID.Catg_ID = CAT.Catg_ID
left join SLRB_DeliveryCustomerMaster SDCM on SQM.CustomerMasterID=SDCM.CustomerMasterID and SQM.DeliveryHeaderMasterID=SDCM.DeliveryHeaderMasterID
left join SLRB_DeliveryHeaderMaster SDHM on SDCM.DeliveryHeaderMasterID=SDHM.DeliveryHeaderMasterID
where (SQM.IsDeleted=0) and SQBD.BatchSellQty > 0
and SDID.InvoiceDate between @fromdate and @todate
order by ItemDesc
End
Only the tables below contain more data, while the other tables have fewer than 20 records:
InvoiceDetails, QuoteMaster, QuoteItemDetails, QuoteBatchDetails, ProductInvoice
Below is the link for execution plan
http://jmp.sh/CSZc2x2
Thanks.
Let's start with an obvious error:
(isnull(SQBD.BatchSellQty,0) > 0)
That one is not indexable, so it should not happen. Seriously, BatchSellQty should not be unknown (nullable) in most cases, or you better handle null properly. That field should be indexed and I am not sure I would like that with an isNull - there are likely tons of batches. Also note that a filtered index (condition >0) may work here.
Second, check that you have all proper indices and the execution plan makes sense.
Third, you have to test with a ton of data. Index statistics may make a difference. Check where the time is spent: it may be tempdb, in which case you really need good tempdb IO speed, and it is not related to the input side.
You can try query hints to help the SQL Server optimizer build an optimal execution plan. For example, you can force the order in which the tables are joined with the FORCE ORDER hint, i.e. OPTION (FORCE ORDER). If you order your tables so that each join step produces the smallest possible result, the query may execute faster (you need to try it). Example:
We need to compute A join B join C
If A join B = 2000 records x 1000 records = ~400 records (we suspect this result)
And A join C = 2000 records x 10 records = ~3 records (and this)
And B join C = 1000 records x 10 records = 10 000 records (and this)
In this case optimal order will be
A join C join B = ~3 records x 1000 records = ~3000 records
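The reasoning in this example can be sketched as a tiny script: given guessed result sizes for each pair of tables, start with the pair expected to produce the fewest rows and join the remaining table last. The numbers are the estimates from the example above, not measurements, and the function name is made up for illustration.

```python
# Estimated row counts of each pairwise join, taken from the example above.
pair_estimates = {
    frozenset("AB"): 400,
    frozenset("AC"): 3,
    frozenset("BC"): 10_000,
}


def best_first_join(estimates):
    """Return the table pair whose join is expected to be the smallest."""
    return min(estimates, key=estimates.get)


first_pair = best_first_join(pair_estimates)

# The table left over after the first join is joined last.
remaining = ({"A", "B", "C"} - first_pair).pop()
order = sorted(first_pair) + [remaining]

print(" join ".join(order))
```

With OPTION (FORCE ORDER), SQL Server joins the tables in the order they are written in the FROM clause, so the query text would have to list them in this order for the hint to have the intended effect.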