SQL optimization: JOIN before table scan?

I have a SQL query similar to:
SELECT columnName
FROM
    (SELECT columnName, someColumnWithXml
     FROM _Table1
     INNER JOIN _Activity ON _Activity.oid = _Table1.columnName
     INNER JOIN _ActivityType ON _Activity.activityType = _ActivityType.oid
     --_ActivityType.forType is a string
     WHERE _ActivityType.forType = '_Disclosure'
     AND _Activity.emailRecipients IS NOT NULL) subquery
WHERE subquery.someColumnWithXml LIKE '%' + '9D62EE8855797448A7C689A09D193042' + '%'
There are 15 million rows in _Table1, and the outer predicate WHERE subquery.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%' results in an execution plan that performs a full table scan of all 15 million rows. The subquery produces only a few hundred thousand rows, and those are the only rows that really need the LIKE run on them. Is there a way to make this more efficient by running the LIKE only on the results of the subquery, rather than running a table scan with a LIKE on 15,000,000 rows? The someColumnWithXml column is not indexed.

For this query:
SELECT columnName, someColumnWithXml
FROM _Table1 t1
INNER JOIN _Activity a ON a.oid = t1.columnName
INNER JOIN _ActivityType at ON a.activityType = at.oid --_ActivityType.forType is a string
WHERE at.forType = '_Disclosure'
  AND a.emailRecipients IS NOT NULL
  AND t1.someColumnWithXml LIKE '%' + '9D62EE8855797448A7C689A09D193042' + '%';
You have a challenge optimizing this query. I don't know if the filtering conditions are particularly restrictive, but if they are, these indexes may help:
_ActivityType(forType, oid)
_Activity(activityType, emailRecipients, oid)
_Table1(columnName)
If these don't help, then you might want an index on the XML column. Perhaps an XML index would work. Such an index would not really help for a generic LIKE, but that might not be needed if you parse the XML.
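Roughly, those index suggestions translate to the following T-SQL; this is a sketch only (the '+' string concatenation in the question suggests SQL Server, and the index names are invented):
CREATE INDEX IX_ActivityType_forType ON _ActivityType (forType, oid);
-- If emailRecipients is a wide type such as varchar(max), it cannot be an
-- index key column and would have to go in an INCLUDE clause instead.
CREATE INDEX IX_Activity_activityType ON _Activity (activityType, emailRecipients, oid);
CREATE INDEX IX_Table1_columnName ON _Table1 (columnName);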

You could filter in the subquery directly, avoiding the scan of useless rows:
SELECT columnName, someColumnWithXml
FROM _Table1
INNER JOIN _Activity ON _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType ON _Activity.activityType = _ActivityType.oid
--_ActivityType.forType is a string
WHERE _ActivityType.forType = '_Disclosure'
  AND _Activity.emailRecipients IS NOT NULL
  AND _Table1.someColumnWithXml LIKE '%' + '9D62EE8855797448A7C689A09D193042' + '%'
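If the optimizer still pushes the LIKE into the 15-million-row scan, a common workaround (not part of the answers above, just a hedged sketch; the temp table name is invented) is to materialize the few hundred thousand pre-filtered rows first and run the LIKE only on those:
-- Materialize the pre-filtered rows into a temp table,
-- then apply the expensive LIKE only to that small set.
SELECT _Table1.columnName, _Table1.someColumnWithXml
INTO #candidates
FROM _Table1
INNER JOIN _Activity ON _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType ON _Activity.activityType = _ActivityType.oid
WHERE _ActivityType.forType = '_Disclosure'
  AND _Activity.emailRecipients IS NOT NULL;

SELECT columnName
FROM #candidates
WHERE someColumnWithXml LIKE '%' + '9D62EE8855797448A7C689A09D193042' + '%';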

Related

Why do multiple EXISTS break a query

I am attempting to include a new table with values that need to be checked and included in a stored procedure. Statement 1 is the existing table that needs to be checked against, while statement 2 is the new table to check against.
I currently have two EXISTS conditions that work independently and produce the results I expect: if I comment out Statement 1, Statement 2 works, and vice versa. When I put them together, the query doesn't complete. There is no error, but it times out, which is unexpected because each statement on its own only takes a few seconds.
I understand there is likely a better way to do this, but before I change it, I would like to know why I can't seem to use multiple EXISTS statements like this. Aren't multiple EXISTS conditions allowed in the WHERE clause?
SELECT *
FROM table1 S
WHERE
    --Statement 1
    EXISTS
    (
        SELECT 1
        FROM table2 P WITH (NOLOCK)
        INNER JOIN table3 SA ON SA.ID = P.ID
        WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
        AND
        (
            SA.FILTER = ''
            OR
            (
                SA.FILTER = 'bar'
                AND LOWER(S.OTHER) = 'foo'
            )
        )
    )
    OR
    (
        --Statement 2
        EXISTS
        (
            SELECT 1
            FROM table4 P WITH (NOLOCK)
            INNER JOIN table5 SA ON SA.ID = P.ID
            WHERE P.DATE = @Date
            AND P.OTHER_ID = S.ID
            AND LOWER(S.OTHER) = 'foo'
        )
    )
EDIT: I have included the query details. Tables 1-5 represent different tables; there are no repeated tables.
Too long for a comment.
Your query as written seems correct. The timeout can really only be troubleshot from the execution plan, but here are a few things that could be happening, or that you could benefit from:
Parameter sniffing on @Date. Try hard-coding this value and see if you still get the same slowness.
No covering index on P.OTHER_ID or P.DATE or P.ID or SA.ID, which would cause a table scan for these predicates.
Indexes on the above columns which aren't optimal (including too many columns, etc.).
Your query being serial when it may benefit from parallelism.
Using the LOWER function on a database which doesn't have a case-sensitive collation (most don't, and this function doesn't slow things down much anyway).
A bad query plan in cache. Try adding OPTION (RECOMPILE) at the bottom so you get a new query plan (see the sketch below). This is also worth doing when comparing the speed of two queries, to ensure neither is reusing a cached plan while the other is not, which would skew the results.
Since your query is timing out, try including the estimated execution plan and post it for us at Paste The Plan.
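A minimal sketch of where OPTION (RECOMPILE) goes, shown on a cut-down version of the query above; it must be the last thing in the statement:
SELECT *
FROM table1 S
WHERE EXISTS (SELECT 1 FROM table4 P WHERE P.OTHER_ID = S.ID)
OPTION (RECOMPILE); -- discard any cached plan and compile a fresh one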
I found that putting two EXISTS in the WHERE condition made the whole process take significantly longer. What fixed it was using UNION and keeping each EXISTS in a separate query. The final result looked like the following:
SELECT *
FROM table1 S
WHERE
    --Statement 1
    EXISTS
    (
        SELECT 1
        FROM table2 P WITH (NOLOCK)
        INNER JOIN table3 SA ON SA.ID = P.ID
        WHERE P.DATE = @Date AND P.OTHER_ID = S.ID
        AND
        (
            SA.FILTER = ''
            OR
            (
                SA.FILTER = 'bar'
                AND LOWER(S.OTHER) = 'foo'
            )
        )
    )
UNION
--Statement 2
SELECT *
FROM table1 S
WHERE
    EXISTS
    (
        SELECT 1
        FROM table4 P WITH (NOLOCK)
        INNER JOIN table5 SA ON SA.ID = P.ID
        WHERE P.DATE = @Date
        AND P.OTHER_ID = S.ID
        AND LOWER(S.OTHER) = 'foo'
    )

SQL query returning same row of data twice

This query returns the same row of data twice. Is there something wrong with my INNER JOIN or my WHERE clause?
SELECT
    transaction_details.transaction_number,
    transaction_details.transaction_id,
    transaction_details.product_id,
    Products3.ProductName
FROM transaction_details
INNER JOIN Products3 ON transaction_details.product_id = Products3.productID
INNER JOIN transaction_status ON transaction_details.transaction_id = transaction_status.transaction_id
WHERE transaction_details.transaction_id = 'tr-y9404'
  AND status_of_transaction = 'pending'
Here is the output (screenshot omitted; it showed the same row returned twice).
You might have multiple matching entries in one of the joined tables, for example two transaction_status rows with status 'pending' for the same transaction_id; each match produces another copy of the row. In that case you can use SELECT DISTINCT, which will remove the duplicates.
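A sketch of the query with DISTINCT applied (assuming the duplication really does come from the join rather than from duplicate source data):
SELECT DISTINCT
    transaction_details.transaction_number,
    transaction_details.transaction_id,
    transaction_details.product_id,
    Products3.ProductName
FROM transaction_details
INNER JOIN Products3 ON transaction_details.product_id = Products3.productID
INNER JOIN transaction_status ON transaction_details.transaction_id = transaction_status.transaction_id
WHERE transaction_details.transaction_id = 'tr-y9404'
  AND status_of_transaction = 'pending';
Note that DISTINCT hides the duplication rather than explaining it; if transaction_status can legitimately hold several matching rows per transaction, rewriting that join as an EXISTS check may be the cleaner fix.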

Is there any scope of improvement in my query to make it run faster?

I am new to writing queries and wanted to know if the execution of this query can be made faster:
select
p.attr_value product,
m.attr_value model,
u.attr_value usage,
t4.attr_value location
from table1 t1 join table2 t2 on t1.e_subid = t2.e_subid
join table4 t4 on t4.loc_id = t1.loc_id
join table3 p on t2.e_cid = p.e_cid
join table3 m on t2.e_cid = m.e_cid
join table3 u on t2.e_cid = u.e_cid
Where
t4.attr_name = 'Location'
and p.attr_name = 'Product'
and m.attr_name = 'Model'
and u.attr_name = 'Usage'
order by product,location;
A brute-force method to improve performance is to create indexes on the columns used in joins, the WHERE clause, and the ORDER BY.
In your case, create indexes on the below columns (a sketch follows the list):
table4.attr_name (used in where clause)
table3.attr_name (used in where clause)
table1.e_subid (used in join clause)
table2.e_subid (used in join clause)
table4.loc_id (used in join clause)
Note that indexes will give you faster SELECT performance but slow down UPDATE, INSERT, and DELETE statements.
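A minimal sketch of those indexes (Oracle syntax, since materialized views are suggested next; the index names are invented):
CREATE INDEX ix_table4_attr_name ON table4 (attr_name);
CREATE INDEX ix_table3_attr_name ON table3 (attr_name);
CREATE INDEX ix_table1_e_subid ON table1 (e_subid);
CREATE INDEX ix_table2_e_subid ON table2 (e_subid);
CREATE INDEX ix_table4_loc_id ON table4 (loc_id);
-- Since table3 is both filtered on attr_name and joined on e_cid, a composite
-- index such as (attr_name, e_cid) could serve both predicates at once.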
See if Oracle materialized views can fit your needs.
The alias 'l' you used in the projection is not defined in your query. You might want to check it.

SQL Server 2008 R2 query

I'm running the following query, but it is taking too long. Is there a way to make it faster or change the way the query is written?
Please help.
SELECT *
FROM ProductGroupLocUpdate WITH (nolock)
WHERE CmStatusFlag > 2
AND EngineID IN ( 0, 1 )
AND NOT EXISTS (SELECT DISTINCT APGV.LocationID
FROM CM_ST_ActiveProductGroupsView AS APGV WITH(nolock)
WHERE APGV.LocationID = ProductGroupLocUpdate.Locationid);
Try rewriting the query with a join:
SELECT PGLU.*
FROM ProductGroupLocUpdate PGLU WITH (NOLOCK)
LEFT JOIN CM_ST_ActiveProductGroupsView APGV WITH (NOLOCK)
    ON PGLU.LocationId = APGV.LocationID
WHERE APGV.LocationID IS NULL
  AND CmStatusFlag > 2
  AND EngineID IN (0, 1)
Depending on how much data is in your tables, check/add indexes on LocationId (in both tables), CmStatusFlag, and EngineID; a sketch follows.
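A hedged sketch of those indexes in T-SQL (index names invented; note that CM_ST_ActiveProductGroupsView looks like a view, so its LocationID index belongs on the underlying base table, or the view itself would have to be made an indexed view):
CREATE INDEX IX_PGLU_LocationId ON ProductGroupLocUpdate (LocationId);
CREATE INDEX IX_PGLU_Status ON ProductGroupLocUpdate (CmStatusFlag, EngineID);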

Optimize SQL with Interbase

I was inspired by the good answers from my previous question about SQL.
Now this SQL is run on a DB with Interbase 2009. It is about 21 GB in size.
SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and not AddrDistance.bold_id in (select bold_id from DistanceQueryTask)
Order By Created Desc
There are 840,000 rows in AddrDistance, 190,000 rows in Address, and 4 in DistanceQueryTask.
The question is: can this be done faster? I guess the same subquery, select bold_id from DistanceQueryTask, is run many times. Note that I'm not interested in stored procedures, just plain SQL :)
EDIT 1: Here is the current execution plan:
Statement: SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and not AddrDistance.bold_id in (select bold_id from DistanceQueryTask)
Order By Created Desc
PLAN (DISTANCEQUERYTASK INDEX (RDB$PRIMARY218))
PLAN SORT (JOIN (JOIN (ADDRDISTANCE NATURAL,ADDRESSFROM INDEX (RDB$PRIMARY234)),ADDRESSTO INDEX (RDB$PRIMARY234)))
And yes, DistanceQueryTask is meant to have a low number of rows in the database.
Using LEFT JOIN and subqueries will slow down any query.
You can get some improvement with the correct indexes (on Bold_Id, DistanceAsMeters, PseudoDistanceAsCostKm); remember that more indexes increase the size of the database.
I suppose bold_id is your key, and thus properly indexed.
Then replacing the subselect and the not...in by a join might help the optimizer.
SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
LEFT JOIN DistanceQueryTask ON AddrDistance.bold_id = DistanceQueryTask.bold_id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and DistanceQueryTask.bold_id is null
Order By Created Desc
Create an index for this part: (DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0), because the plan shows a (bad) full table scan for it: ADDRDISTANCE NATURAL.
And try to use the join instead of the subselect, as stated by Francois.
As Daniel and Andre suggest, an index helps a lot.
I would suggest the composite index (DistanceAsMeters, PseudoDistanceAsCostKm, Bold_Id): because the first two parts of the index are held constant by the WHERE clause, only a small portion of the index needs to be read (see the sketch below).
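A minimal sketch of that index (InterBase syntax; the index name is invented):
CREATE INDEX IX_AddrDistance_DistCostBold
    ON AddrDistance (DistanceAsMeters, PseudoDistanceAsCostKm, Bold_Id);
/* The WHERE clause pins the first two columns to 0, so the engine reads a
   narrow slice of this index instead of walking ADDRDISTANCE NATURAL. */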
If it is a fact that FromAddress and ToAddress always exist, you can change those LEFT JOINs to INNER JOINs; that is often faster because the query optimizer can make stronger assumptions. A sketch of that variant follows.
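This is only valid if every AddrDistance row really has both addresses; the anti-join against DistanceQueryTask stays a LEFT JOIN:
SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created,
       AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
INNER JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
INNER JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
LEFT JOIN DistanceQueryTask ON AddrDistance.Bold_Id = DistanceQueryTask.Bold_Id
WHERE DistanceAsMeters = 0 AND PseudoDistanceAsCostKm = 0
  AND DistanceQueryTask.Bold_Id IS NULL
ORDER BY Created DESC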