Query taking too long - Optimization - SQL

I am having an issue with the following query returning results too slowly, and I suspect I am missing something basic. My initial guess is that the CASE statement is taking too long to process its result against the underlying data, but it could be something in the derived tables as well.
The question is: how can I speed this up? Are there any glaring errors in the way I am pulling the data? Am I running into a sorting or looping issue somewhere? The query runs for about 40 seconds, which seems quite long. C# is my primary expertise; SQL is a work in progress.
Note I am not asking "write my code" or "fix my code", just for a pointer in the right direction; I can't seem to figure out where the slowdown occurs. Each derived table runs very quickly (less than a second) by itself, the joins seem correct, and the result set is returning exactly what I need. It's just too slow, and I'm sure there are better SQL scripters out there ;) Any tips would be greatly appreciated!
SELECT
    hdr.taker
    , hdr.order_no
    , hdr.po_no AS display_po
    , cust.customer_name
    , hdr.customer_id
    , 'INCORRECT-LARGE ORDER' + CASE
          WHEN (ext_price_calc >= 600.01 AND ext_price_calc <= 800) AND fee_price.unit_price <> ROUND(ext_price_calc * -.01, 2)
              THEN '-1%: $' + CAST(CAST(ext_price_calc * -.01 AS decimal(18,2)) AS varchar(255))
          WHEN ext_price_calc >= 800.01 AND ext_price_calc <= 1000 AND fee_price.unit_price <> ROUND(ext_price_calc * -.02, 2)
              THEN '-2%: $' + CAST(CAST(ext_price_calc * -.02 AS decimal(18,2)) AS varchar(255))
          WHEN ext_price_calc > 1000 AND fee_price.unit_price <> ROUND(ext_price_calc * -.03, 2)
              THEN '-3%: $' + CAST(CAST(ext_price_calc * -.03 AS decimal(18,2)) AS varchar(255))
          ELSE 'OK'
      END AS Status
FROM
    (myDb_view_oe_hdr hdr
     LEFT OUTER JOIN myDb_view_customer cust
         ON hdr.customer_id = cust.customer_id)
    LEFT OUTER JOIN wpd_view_sales_territory_by_customer territory
        ON cust.customer_id = territory.customer_id
    LEFT OUTER JOIN
        (SELECT
             order_no,
             SUM(ext_price_calc) AS ext_price_calc
         FROM
             (SELECT
                  hdr.order_no,
                  line.item_id,
                  (line.qty_ordered - ISNULL(qty_canceled, 0)) * unit_price AS ext_price_calc
              FROM myDb_view_oe_hdr hdr
              LEFT OUTER JOIN myDb_view_oe_line line
                  ON hdr.order_no = line.order_no
              WHERE
                  line.delete_flag = 'N'
                  AND line.cancel_flag = 'N'
                  AND hdr.projected_order = 'N'
                  AND hdr.delete_flag = 'N'
                  AND hdr.cancel_flag = 'N'
                  AND line.item_id NOT IN ('LARGE-ORDER-1%', 'LARGE-ORDER-2%', 'LARGE-ORDER-3%', 'FUEL', 'NET-FUEL', 'CONVENIENCE-FEE')) AS line
         GROUP BY order_no) AS order_total
        ON hdr.order_no = order_total.order_no
    LEFT OUTER JOIN
        (SELECT
             order_no,
             COUNT(order_no) AS convenience_count
         FROM oe_line WITH (NOLOCK)
         LEFT OUTER JOIN inv_mast inv WITH (NOLOCK)
             ON oe_line.inv_mast_uid = inv.inv_mast_uid
         WHERE inv.item_id IN ('LARGE-ORDER-1%', 'LARGE-ORDER-2%', 'LARGE-ORDER-3%')
           AND oe_line.delete_flag <> 'Y'
         GROUP BY order_no) AS fee_count
        ON hdr.order_no = fee_count.order_no
    INNER JOIN
        (SELECT
             order_no,
             unit_price
         FROM oe_line line WITH (NOLOCK)
         WHERE line.inv_mast_uid IN (SELECT inv_mast_uid FROM inv_mast WITH (NOLOCK) WHERE item_id IN ('LARGE-ORDER-1%', 'LARGE-ORDER-2%', 'LARGE-ORDER-3%'))) AS fee_price
        ON fee_count.order_no = fee_price.order_no
WHERE
    hdr.projected_order = 'N'
    AND hdr.cancel_flag = 'N'
    AND hdr.delete_flag = 'N'
    AND hdr.completed = 'N'
    AND territory.territory_id = 'CUSTOMERTERRITORY'
    AND ext_price_calc > 600.00
    AND hdr.carrier_id <> '100004'
    AND fee_count.convenience_count IS NOT NULL
    AND CASE
            WHEN (ext_price_calc >= 600.01 AND ext_price_calc <= 800) AND fee_price.unit_price <> ROUND(ext_price_calc * -.01, 2)
                THEN '-1%: $' + CAST(CAST(ext_price_calc * -.01 AS decimal(18,2)) AS varchar(255))
            WHEN ext_price_calc >= 800.01 AND ext_price_calc <= 1000 AND fee_price.unit_price <> ROUND(ext_price_calc * -.02, 2)
                THEN '-2%: $' + CAST(CAST(ext_price_calc * -.02 AS decimal(18,2)) AS varchar(255))
            WHEN ext_price_calc > 1000 AND fee_price.unit_price <> ROUND(ext_price_calc * -.03, 2)
                THEN '-3%: $' + CAST(CAST(ext_price_calc * -.03 AS decimal(18,2)) AS varchar(255))
            ELSE 'OK'
        END <> 'OK'

Just as a clue to the right direction for optimization:
When you do an OUTER JOIN to a query with calculated columns, you are guaranteeing not only a full table scan, but that those calculations must be performed against every row in the joined table. It appears that you could actually do your join to oe_line without the column calculations (i.e. by filtering ext_price_calc to a specific range).
You don't need most of the subqueries in your query; the master query can be recrafted to use regular table join syntax. Joins to subqueries containing subqueries present a challenge that the SQL optimizer may not be able to meet. By using regular joins, the optimizer has a much better chance of identifying more efficient query strategies.
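For illustration, here is a hedged sketch of just the fee_price piece pulled out of its subquery and written as plain joins (table and column names are taken from the question; untested against the real schema):

SELECT hdr.order_no, fee.unit_price
FROM myDb_view_oe_hdr hdr
INNER JOIN oe_line fee
    ON fee.order_no = hdr.order_no
INNER JOIN inv_mast inv
    ON inv.inv_mast_uid = fee.inv_mast_uid
WHERE inv.item_id IN ('LARGE-ORDER-1%', 'LARGE-ORDER-2%', 'LARGE-ORDER-3%')

The same pattern applies to the other derived tables: once they are ordinary joins, the optimizer can push the outer WHERE predicates down and consider index-based strategies.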
You don't tag which SQL engine you're using. Every database has proprietary extensions that may allow for speedier or more efficient queries. It would be easier to provide useful feedback if you indicated whether you were using MySQL, SQL Server, Oracle, etc.
Regardless of the database you're using, reviewing the query plan is always a good place to start. This will tell you where most of the I/O and time in your query is being spent.
Just on general principle, make sure your statistics are up-to-date.
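On SQL Server, for example, a quick (if blunt) way to refresh statistics across the whole database is:

EXEC sp_updatestats;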

It may not be solvable by any of us without the real data to test with.
If that's the case and nobody else posts the answer, I can still help. Here is how to troubleshoot it:
(1) Take joins and pieces out one by one.
(2) This will cause errors. Remove or fake the references to get rid of them.
(3) See how that works.
(4) Put items back before you try taking something else out.
(5) Keep track...
(6) Also be aware of where removing something might drastically reduce the result set.
You might find you're missing an index or some other smoking gun.
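For example, a first pass on the query above might comment out one join and fake the references that depend on it (a sketch of the technique, not a fix):

SELECT hdr.taker, hdr.order_no
FROM myDb_view_oe_hdr hdr
LEFT OUTER JOIN myDb_view_customer cust
    ON hdr.customer_id = cust.customer_id
-- LEFT OUTER JOIN wpd_view_sales_territory_by_customer territory
--     ON cust.customer_id = territory.customer_id
WHERE hdr.projected_order = 'N'
    AND hdr.delete_flag = 'N'
    -- AND territory.territory_id = 'CUSTOMERTERRITORY'  -- faked out while its join is removed

Note that dropping the territory filter is exactly the kind of change from step (6) that can drastically grow the result set, so time it with that in mind.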

I was having the same problem and I was able to solve it by indexing one of the tables and setting a primary key.
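For reference, the kind of DDL involved looks roughly like this (the table, column, and index names here are purely illustrative, not from the question):

ALTER TABLE my_table ADD CONSTRAINT pk_my_table PRIMARY KEY (id);
CREATE INDEX ix_my_table_join_col ON my_table (join_col);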

I strongly suspect that the problem lies in the number of joins you're doing. A lot of databases do joins basically by systematically checking all possible combinations of the various tables as being valid - so if you're joining tables A and B on column C, and A looks like:
Name:C
Fred:1
Alice:2
Betty:3
While B looks like:
C:Pet
1:Alligator
2:Lion
3:T-Rex
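In SQL, the join being described is simply:

SELECT a.Name, b.Pet
FROM A a
JOIN B b ON a.C = b.C;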
When you do the join, it checks all 9 possibilities:
Fred:1:1:Alligator
Fred:1:2:Lion
Fred:1:3:T-Rex
Alice:2:1:Alligator
Alice:2:2:Lion
Alice:2:3:T-Rex
Betty:3:1:Alligator
Betty:3:2:Lion
Betty:3:3:T-Rex
And goes through and deletes the non-matching ones:
Fred:1:1:Alligator
Alice:2:2:Lion
Betty:3:3:T-Rex
... which means that, with three entries in each table, it creates nine temporary records, sorts through them all, and deletes six of them - all before it actually sorts through the results for what you're after (so if you are looking for Betty's pet, you only want one row in that final result).
... and you're doing how many joins and sub-queries?

Related

The below SELECT statement takes a long time to run

This SELECT statement takes a long time to run. After investigating, I found that the problem is in the subquery / stored procedure. I'd appreciate your help.
SELECT DISTINCT
COKE_CHQ_NUMBER,
COKE_PAY_SUPPLIER
FROM
apps.Q_COKE_AP_CHECKS_SIGN_STATUS_V
WHERE
plan_id = 40192
AND COKE_SIGNATURE__A = 'YES'
AND COKE_SIGNATURE__B = 'YES'
AND COKE_AUDIT = 'YES'
AND COKE_CHQ_NUMBER NOT IN (SELECT DISTINCT COKE_CHQ_NUMBER_DELIVER
FROM apps.Q_COKE_AP_CHECKS_DELIVERY_ST_V
WHERE UPPER(COKE_CHQ_NUMBER_DELIVER_STATUS) <> 'DELIVERED')
AND COKE_CHQ_NUMBER NOT IN (SELECT COKE_CHQ_NUMBER_DELIVER
FROM apps.Q_COKE_AP_CHECKS_DELIVERY_ST_V)
Well there are a few issues with your SELECT statement that you should address:
First let's look at this condition:
COKE_CHQ_NUMBER NOT IN (SELECT DISTINCT COKE_CHQ_NUMBER_DELIVER
FROM apps.Q_COKE_AP_CHECKS_DELIVERY_ST_V
WHERE UPPER(COKE_CHQ_NUMBER_DELIVER_STATUS) <> 'DELIVERED')
First you select DISTINCT cheque numbers with a not-delivered status, then you say you don't want those. Rather than saying "I don't want non-delivered", it is much more readable to say "I want delivered ones". This is not really an issue as such, but it would make your SELECT easier to read and understand.
Second let's look at your second cheque condition:
COKE_CHQ_NUMBER NOT IN (SELECT COKE_CHQ_NUMBER_DELIVER
FROM apps.Q_COKE_AP_CHECKS_DELIVERY_ST_V)
Here you want to exclude all cheques that have an entry in Q_COKE_AP_CHECKS_DELIVERY_ST_V. This makes your first condition redundant, as whatever cheque numbers it brings back would be rejected by this second condition anyway. I don't know if the Oracle SQL engine is clever enough to work out this redundancy, but it could contribute to your slowness, as SELECT DISTINCT can take longer to run.
In addition to this if you don't have them already I would recommend adding the following indexes:
CREATE INDEX index_1 ON q_coke_ap_checks_sign_status_v(coke_chq_number, coke_pay_supplier);
CREATE INDEX index_2 ON q_coke_ap_checks_sign_status_v(plan_id, coke_signature__a, coke_signature__b, coke_audit);
CREATE INDEX index_3 ON q_coke_ap_checks_delivery_st_v(coke_chq_number_deliver);
I named them index_1, 2, 3 for easy reading; obviously that is not a good naming convention.
With these in place, your SELECT should be optimized enough to retrieve your data with acceptable performance. But of course it all depends on the size and distribution of your data, which is hard to judge without performing specific data analysis.
Looking at your code, it seems you have a redundant WHERE condition: the second NOT IN implies the first, so you can drop the first one.
You could also transform your NOT IN clause into a MINUS clause: join the same query with an INNER JOIN to your NOT IN subquery.
And last, be careful to have a proper composite index on table Q_COKE_AP_CHECKS_SIGN_STATUS_V, on columns (plan_id, COKE_SIGNATURE__A, COKE_SIGNATURE__B, COKE_AUDIT, COKE_CHQ_NUMBER, COKE_PAY_SUPPLIER):
SELECT DISTINCT
COKE_CHQ_NUMBER,
COKE_PAY_SUPPLIER
FROM
apps.Q_COKE_AP_CHECKS_SIGN_STATUS_V
WHERE
plan_id = 40192
AND COKE_SIGNATURE__A = 'YES'
AND COKE_SIGNATURE__B = 'YES'
AND COKE_AUDIT = 'YES'
MINUS
SELECT DISTINCT
COKE_CHQ_NUMBER,
COKE_PAY_SUPPLIER
FROM apps.Q_COKE_AP_CHECKS_SIGN_STATUS_V
INNER JOIN (
SELECT COKE_CHQ_NUMBER_DELIVER
FROM apps.Q_COKE_AP_CHECKS_DELIVERY_ST_V
) T ON T.COKE_CHQ_NUMBER_DELIVER = apps.Q_COKE_AP_CHECKS_SIGN_STATUS_V.COKE_CHQ_NUMBER
WHERE
plan_id = 40192
AND COKE_SIGNATURE__A = 'YES'
AND COKE_SIGNATURE__B = 'YES'
AND COKE_AUDIT = 'YES'

Performance when adding a parameter to where clause

I'm trying to understand this behaviour. My statement below takes nearly half an hour to complete. However, when I replace the parameter @IsGazEnabled (in the CASE expression of the WHERE clause at the bottom) with a value of 1, it takes just a second.
Looking at the estimated execution plan when using the parameter (the 30-minute case), most of the cost (92%) lies with a Nested Loop (Left Anti Semi Join); there also seems to be some parallelism going on. I'm only just learning about execution plans and I'm left a bit confused; it is very different from the plan produced when not using the parameter.
So how does having a parameter, compared to not having one, make such a difference to the execution plan and performance?
declare @IsGazEnabled tinyint;
set @IsGazEnabled = 1;
select 'CT Ref: ' + accountreference + ' - Not synced due to missing property ref ' + t.PropertyReference
from CTaxAccountTemp t
where not exists (
select *
from ccaddress a
left join w2addresscrossref x on x.UPRN = a.UPRN
and x.appcode in (
select w2source
from GazSourceConfig
where GazSource in (
select GazSource
from GazSourceConfig
where W2Source = 'CTAX'
)
union all select 'URB'
)
where t.PropertyReference = case @IsGazEnabled when 1 then x.PropertyReference else a.PropertyReference end
);
This can occur because SQL Server (the query optimizer) uses the value of the provided parameter when creating the initial execution plan. If the values in some of your tables are not evenly distributed, the created plan may work really well for certain values of the parameter but really poorly for others. This is generally referred to as parameter sniffing. You can work around it using query hints (OPTIMIZE FOR x) or by recompiling the stored procedure before each run (WITH RECOMPILE). You should read up on these options thoroughly before implementing them, as they both have side effects.
See a couple of articles on Brent Ozar's site for more details.
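For illustration, here is a hedged sketch of both options applied to a simplified version of the statement above (the appcode filter is omitted for brevity; treat this as a shape to copy, not a drop-in fix):

declare @IsGazEnabled tinyint = 1;

select 'CT Ref: ' + accountreference + ' - Not synced due to missing property ref ' + t.PropertyReference
from CTaxAccountTemp t
where not exists (
    select 1
    from ccaddress a
    left join w2addresscrossref x on x.UPRN = a.UPRN
    where t.PropertyReference = case @IsGazEnabled when 1 then x.PropertyReference else a.PropertyReference end
)
option (optimize for (@IsGazEnabled = 1));
-- or, to compile a fresh plan on every execution:
-- option (recompile);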
I think you should rethink this query. Try to avoid the NOT EXISTS() for starters, as that's generally quite inefficient (I usually prefer a LEFT JOIN in these instances, with a corresponding WHERE x IS NULL, the x being something from the right-hand side).
The main cause of woe for you, though, is likely to be the CASE-based WHERE, as that is causing the inner query to be evaluated for EVERY ROW. I think you'd be better off left joining both sets of disqualifying criteria, including the parameter in the join conditions, and then checking that there is nothing on the right-hand side of either of the two left-joined criteria.
Here's how I think it could be rewritten:
declare @IsGazEnabled tinyint;
set @IsGazEnabled = 1;
select 'CT Ref: ' + accountreference + ' - Not synced due to missing property ref ' + t.PropertyReference
from CTaxAccountTemp t
left join ccaddress a2 ON t.PropertyReference = a2.PropertyReference and @IsGazEnabled = 0
left join
(
ccaddress a
join w2addresscrossref x on x.UPRN = a.UPRN
and x.appcode in ( -- could make this a join for efficiency....
select w2source
from GazSourceConfig
where GazSource in (
select GazSource
from GazSourceConfig
where W2Source = 'CTAX'
)
union all select 'URB'
)
) ON t.PropertyReference = x.PropertyReference AND @IsGazEnabled = 1
WHERE
a2.PropertyReference IS NULL
AND x.PropertyReference IS NULL
;

How can I optimize this SQL query? (Solarwinds Orion)

I'm very new to SQL, and still learning. I'm using a reporting tool called Solarwinds Orion, and I'm honestly not sure how specific the query I have written is to that program, so if there's anything in the query that's confusing, let me know and I'll try to figure out whether it's specific to the program or not.
The problem with the query I'm running is that it times out after a very long time (maybe an hour) of running. The database I'm using is huge. Unfortunately I don't really know how huge, but I've been told it's huge.
Is there anything I am doing wrong that would have a huge performance impact?
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM
((NetflowApplicationSummary
INNER JOIN Nodes ON (NetflowApplicationSummary.NodeID = Nodes.NodeID))
INNER JOIN InterfaceTraffic ON (Nodes.NodeID = InterfaceTraffic.InterfaceID))
INNER JOIN Interfaces ON (Nodes.NodeID = Interfaces.NodeID)
WHERE
( InterfaceTraffic.DateTime > (GetDate()-30) )
AND
(Nodes.WANCircuit = 1)
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
EDIT: I ran COUNT(*) on each of my tables with the below results.
SELECT COUNT(*) FROM NetflowApplicationSummary;  -- 50,671,011
SELECT COUNT(*) FROM Nodes;                      -- 898
SELECT COUNT(*) FROM InterfaceTraffic;           -- 18,000,166
SELECT COUNT(*) FROM Interfaces;                 -- 3,938
-- Total: 68,676,013
I really have no idea if 68 million items is a huge database to be honest.
A couple of notes:
The INNER JOIN operator is associative, so get rid of those parentheses in the FROM clause and let the optimizer figure out the best join order.
You may have an implied cursor from the getdate() function being called for every row. Store the value in a local variable and compare to that.
The resulting SQL should look like this:
DECLARE @Date as datetime = getdate() - 30;
SELECT TOP 10000
Nodes.Caption AS NodeName,
NetflowApplicationSummary.AppName AS Application_Name,
SUM(NetflowApplicationSummary.TotalBytes) AS SUM_of_Bytes_Transferred,
AVG(Case OutBandwidth
When 0 Then 0
Else (NetflowApplicationSummary.TotalBytes/OutBandwidth) * 100
End) AS TEST_PERCENT
FROM NetflowApplicationSummary
INNER JOIN Nodes ON NetflowApplicationSummary.NodeID = Nodes.NodeID
INNER JOIN InterfaceTraffic ON Nodes.NodeID = InterfaceTraffic.InterfaceID
INNER JOIN Interfaces ON Nodes.NodeID = Interfaces.NodeID
WHERE InterfaceTraffic.DateTime > @Date
AND Nodes.WANCircuit = 1
GROUP BY Nodes.Caption, NetflowApplicationSummary.AppName
Also, make sure you have an index on table InterfaceTraffic with a leading field of DateTime. If it doesn't exist, you may need to pay the one-time penalty of creating it.
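If it is missing, the DDL would look something like this (the index name and included column are illustrative):

CREATE INDEX IX_InterfaceTraffic_DateTime
    ON InterfaceTraffic (DateTime)
    INCLUDE (InterfaceID);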
If this doesn't help, then you may need to post the execution plan so it can be inspected.
Out of interest, also perform a COUNT(*) on all four tables and post the results, just so members here can make their own assessment of how big your database really is. It is amazing how many non-technical people still think a 1 or 10 GB database is huge, while I run that easily on my workstation!

Slow SQL query when joining tables

This query is very, very slow, and I'm not sure what I'm doing wrong to make it so.
I'm guessing it's something to do with the flight_prices table, because if I remove that join the query goes from 16 seconds to less than one.
SELECT * FROM OPENQUERY(mybook,
'SELECT wb.booking_ref
FROM web_bookings wb
LEFT JOIN prod_info pi ON wb.location = pi.location
LEFT JOIN flight_prices fp ON fp.dest_date = pi.dest_airport + '' '' + wb.sort_date
WHERE fp.dest_cheapest = ''Y''
AND wb.inc_flights = ''Y''
AND wb.customer = ''12345'' ')
Any ideas how I can speed up this join?
You're unlikely to get any index on flight_prices.dest_date to be used, as you're not actually joining to another column, which makes it hard for the optimiser.
If you can change the schema, I'd split flight_prices.dest_date into two columns, dest_airport and dest_date, as it currently appears to be a composite of airport and date. If you did that you could then join like this:
fp.dest_date = wb.sort_date and fp.dest_airport = pi.dest_airport
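With that schema change, the inner statement would become something like this (a sketch assuming the split columns exist):

SELECT wb.booking_ref
FROM web_bookings wb
LEFT JOIN prod_info pi ON wb.location = pi.location
LEFT JOIN flight_prices fp ON fp.dest_airport = pi.dest_airport
                          AND fp.dest_date = wb.sort_date
WHERE fp.dest_cheapest = 'Y'
  AND wb.inc_flights = 'Y'
  AND wb.customer = '12345'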
Try EXPLAIN PLAN and see what your database comes back with.
If you see TABLE SCAN, you might need to add indexes.
That second JOIN looks rather odd to me. I'd wonder if that could be rewritten.
Your statement reformatted gives me this
SELECT wb.booking_ref
FROM web_bookings wb
LEFT JOIN prod_info pi ON wb.location = pi.location
LEFT JOIN flight_prices fp ON fp.dest_date = pi.dest_airport + ' ' + wb.sort_date
WHERE fp.dest_cheapest = 'Y'
AND wb.inc_flights = 'Y'
AND wb.customer = '12345'
I would make sure that the following fields have indexes (a DDL sketch follows the list):
dest_cheapest
dest_date
location
customer, inc_flights, booking_ref (covering index)
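A hedged sketch of the corresponding DDL (index names are illustrative; adjust to your schema):

CREATE INDEX ix_fp_dest_cheapest ON flight_prices (dest_cheapest);
CREATE INDEX ix_fp_dest_date ON flight_prices (dest_date);
CREATE INDEX ix_pi_location ON prod_info (location);
CREATE INDEX ix_wb_covering ON web_bookings (customer, inc_flights, booking_ref);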

MySQL to PostgreSQL: GROUP BY issues

So I decided to try out PostgreSQL instead of MySQL, but I am having some slight conversion problems. This is a query of mine that samples data from four tables and spits it all out in one result.
I am at a loss as to how to convey this in PostgreSQL, and specifically in Django, but I am leaving that for another question, so bonus points if you can Django-fy it, but no worries if you just pure-SQL it.
SELECT links.id, links.created, links.url, links.title, user.username, category.title, SUM(votes.karma_delta) AS karma, SUM(IF(votes.user_id = 1, votes.karma_delta, 0)) AS user_vote
FROM links
LEFT OUTER JOIN `users` `user` ON (`links`.`user_id`=`user`.`id`)
LEFT OUTER JOIN `categories` `category` ON (`links`.`category_id`=`category`.`id`)
LEFT OUTER JOIN `votes` `votes` ON (`votes`.`link_id`=`links`.`id`)
WHERE (links.id = votes.link_id)
GROUP BY votes.link_id
ORDER BY (SUM(votes.karma_delta) - 1) / POW((TIMESTAMPDIFF(HOUR, links.created, NOW()) + 2), 1.5) DESC
LIMIT 20
The IF in the SELECT was where my first troubles began. It seems to be an IF condition THEN stuff ELSE other stuff END IF, yet I can't get the syntax right. I tried to use Navicat's SQL builder, but it constantly wanted me to place everything I had selected into the GROUP BY, and I think that is all kinds of wrong.
What I am looking for, in summary, is to make this MySQL query work in PostgreSQL. Thank you.
Current Progress
Just want to thank everybody for their help. This is what I have so far:
SELECT links_link.id, links_link.created, links_link.url, links_link.title, links_category.title, SUM(links_vote.karma_delta) AS karma, SUM(CASE WHEN links_vote.user_id = 1 THEN links_vote.karma_delta ELSE 0 END) AS user_vote
FROM links_link
LEFT OUTER JOIN auth_user ON (links_link.user_id = auth_user.id)
LEFT OUTER JOIN links_category ON (links_link.category_id = links_category.id)
LEFT OUTER JOIN links_vote ON (links_vote.link_id = links_link.id)
WHERE (links_link.id = links_vote.link_id)
GROUP BY links_link.id, links_link.created, links_link.url, links_link.title, links_category.title
ORDER BY links_link.created DESC
LIMIT 20
I had to make some table name changes and I am still working on my ORDER BY so till then we're just gonna cop out. Thanks again!
Have a look at the PostgreSQL documentation on GROUP BY:
When GROUP BY is present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions, since there would be more than one possible value to return for an ungrouped column.
You need to include all the select columns in the group by that are not part of the aggregate functions.
A few things:
Drop the backticks
Use a CASE expression instead of IF(): CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END
Change your TIMESTAMPDIFF to DATE_TRUNC('hour', now()) - DATE_TRUNC('hour', links.created) (you will then need to count the number of hours in the resulting interval - it would be much easier to compare timestamps; see the sketch after this list)
Fix your GROUP BY and ORDER BY
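On the last two points, here is a hedged sketch of the converted ORDER BY, using epoch seconds to count the hours in the interval (original table aliases kept; untested):

ORDER BY (SUM(votes.karma_delta) - 1)
         / POWER(EXTRACT(EPOCH FROM (now() - links.created)) / 3600 + 2, 1.5) DESC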
Try replacing the IF with a CASE:
SUM(CASE WHEN votes.user_id = 1 THEN votes.karma_delta ELSE 0 END)
You also have to explicitly name every column or calculated column you use in the GROUP BY clause.