Improve performance of query with multiple joins - sql

I have a query with four joins that is taking a considerable amount of time to execute. Is there a way to optimize the query? I tried including the smaller PORTFOLIO table in the joins to try to speed up the process.
SELECT
A.*
, B.REPORTING_PERIOD
, D.HPI AS CURRENT_HPI
, E.USSWAP10
, B.DLQ_STATUS AS CURRENT_STATUS
, C.DLQ_STATUS AS NEXT_STATUS
FROM PORTFOLIO A
JOIN ALL_PERFORMANCE B ON
A.AGENCY = B.AGENCY
AND A.LOAN_ID = B.LOAN_ID
JOIN ALL_PERFORMANCE C ON
A.AGENCY = C.AGENCY
AND A.LOAN_ID = C.LOAN_ID
AND DATEADD(MONTH, 1, B.REPORTING_PERIOD) = C.REPORTING_PERIOD
LEFT JOIN CASE_SHILLER D ON
A.GEO_CODE = D.GEO_CODE
AND B.REPORTING_PERIOD = D.AS_OF_DATE
LEFT JOIN SWAP_10Y E ON
B.REPORTING_PERIOD = E.AS_OF_DATE

You can index the columns you join on

Make sure you have indexes on the column combinations you join on to increase performance. These are the indexes you should have (a CREATE INDEX sketch follows the list):
Index on (PORTFOLIO.AGENCY, PORTFOLIO.LOAN_ID)
Index on PORTFOLIO.GEO_CODE
Index on (ALL_PERFORMANCE.AGENCY, ALL_PERFORMANCE.LOAN_ID)
Index on (ALL_PERFORMANCE.AGENCY, ALL_PERFORMANCE.LOAN_ID, ALL_PERFORMANCE.REPORTING_PERIOD)
Index on ALL_PERFORMANCE.REPORTING_PERIOD
Index on (CASE_SHILLER.GEO_CODE, CASE_SHILLER.AS_OF_DATE)
Index on SWAP_10Y.AS_OF_DATE
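A hedged sketch of those indexes as plain CREATE INDEX statements (generic/T-SQL syntax; the index names are made up, so adapt them to your naming convention and RDBMS):
-- Sketch only: index names are hypothetical.
CREATE INDEX IX_PORTFOLIO_AGENCY_LOAN ON PORTFOLIO (AGENCY, LOAN_ID);
CREATE INDEX IX_PORTFOLIO_GEO ON PORTFOLIO (GEO_CODE);
-- This three-column index also covers lookups on (AGENCY, LOAN_ID) alone.
CREATE INDEX IX_ALLPERF_AGENCY_LOAN_PERIOD ON ALL_PERFORMANCE (AGENCY, LOAN_ID, REPORTING_PERIOD);
CREATE INDEX IX_ALLPERF_PERIOD ON ALL_PERFORMANCE (REPORTING_PERIOD);
CREATE INDEX IX_CASESHILLER_GEO_DATE ON CASE_SHILLER (GEO_CODE, AS_OF_DATE);
CREATE INDEX IX_SWAP10Y_DATE ON SWAP_10Y (AS_OF_DATE);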

Related

Too many columns in GROUP BY

I'm trying to aggregate some data, but I have a problem. Here's my query (it uses 3 tables):
SELECT
ufc.counter_id,
gcrvf.goal_id,
gcrvf.date_of_visit,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
max(gcrvf.last_update_time) AS last_update_time,
sum(gcrvf.conversions) AS conversions,
c.name AS counter_name,
c.owner_login AS owner_login,
c.status AS counter_status,
concat(g.goal_source,CAST('Goal','text')) AS metric_type,
multiIf(g.is_retargeting = 0,'non-retargeting',g.is_retargeting = 1,'retargeting',NULL) AS metric_key,
concat(g.name,' (',CAST(gcrvf.goal_id,'String'),')') AS metric_name
FROM connectors_yandex_metrika.goal_conversions_report_v_final AS gcrvf
INNER JOIN connectors_yandex_metrika.utm_for_collect AS ufc ON gcrvf.counter_id = ufc.counter_id
LEFT JOIN connectors_yandex_metrika.counter AS c ON gcrvf.counter_id = c.id
LEFT JOIN connectors_yandex_metrika.goal AS g ON gcrvf.goal_id = g.id
WHERE
((gcrvf.utm_campaign = ufc.utm_campaign) OR (ufc.utm_campaign IS NULL))
AND ((gcrvf.utm_source = ufc.utm_source) OR (ufc.utm_source IS NULL))
AND ((gcrvf.utm_medium = ufc.utm_medium) OR (ufc.utm_medium IS NULL))
AND ((gcrvf.utm_content = ufc.utm_content) OR (ufc.utm_content IS NULL))
AND ((gcrvf.utm_term = ufc.utm_term ) OR (ufc.utm_term IS NULL))
GROUP BY
ufc.counter_id,
gcrvf.date_of_visit,
gcrvf.goal_id,
ufc.utm_campaign,
ufc.utm_source,
ufc.utm_medium,
ufc.utm_content,
ufc.utm_term,
ufc.original_join_id,
c.name,
c.owner_login,
c.status,
metric_type,
metric_key,
metric_name
I have to GROUP BY almost all of the columns. Is that a real problem?
The columns ufc.original_join_id, c.name, c.owner_login, c.status, metric_type, metric_key and metric_name are not needed for the aggregation; I added them to the GROUP BY only because I need them in the output. So I want to ask: is there any way to make this more concise and keep the unnecessary columns out of the GROUP BY, or is it okay as it is?
And my second question: does ClickHouse cache the right-hand table when we use JOINs? Should I therefore always put the huge table on the left?
All of the selected columns are required in the GROUP BY; it is not possible to leave out columns that appear in the SELECT list without aggregating them.
Depending on your indexed columns you can improve the speed of the query; try to create an index on the key columns.
The database will handle the caching logic for you, depending on how often you execute the query.

SQL optimization JOIN before table scan?

I have a SQL query similar to
SELECT columnName
FROM
(SELECT columnName, someColumnWithXml
FROM _Table1
INNER JOIN _Activity ON _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType ON _Activity.activityType = _ActivityType.oid
--_ActivityType.forType is a string
WHERE _ActivityType.forType = '_Disclosure'
AND _Activity.emailRecipients IS NOT NULL) subquery
WHERE subquery.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%'
There are 15 million rows in _Table1 and the WHERE subquery.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%' results in an execution plan that performs a full table scan on all 15 million rows. The subquery results in only a few hundred thousand rows and those are all the rows that really need to have the LIKE run on them. Is there a way to make this more efficient by running the LIKE only on the results of the subquery rather than running a TABLE SCAN with a LIKE on 15,000,000 rows? The someColumnWithXML column is not indexed.
For this query:
SELECT columnName, someColumnWithXml
FROM _Table1 t1 INNER JOIN
_Activity a
ON a.oid = t1.columnName INNER JOIN
_ActivityType at
ON a.activityType = at.oid --_ActivityType.forType is a string
WHERE at.forType = '_Disclosure' AND
a.emailRecipients IS NOT NULL AND
t1.someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%';
You have a challenge with optimizing this query. I don't know if the filtering conditions are particularly restrictive. If they are, then the following indexes might help (a DDL sketch follows below):
_ActivityType(forType, oid)
_Activity(activityType, emailRecipients, oid)
_Table1(columnName)
If these don't help, then you might try an index on the XML column. Perhaps an XML index would work. Such an index would not really help with a generic LIKE, but that might not be needed if you parse the XML.
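A hedged sketch of those suggested indexes as DDL (T-SQL syntax; the index names are made up):
-- Sketch only: names are hypothetical.
CREATE INDEX IX_ActivityType_forType_oid ON _ActivityType (forType, oid);
CREATE INDEX IX_Activity_type_recipients_oid ON _Activity (activityType, emailRecipients, oid);
CREATE INDEX IX_Table1_columnName ON _Table1 (columnName);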
You could filter in the subquery directly, avoiding the scan of rows you don't need:
SELECT columnName, someColumnWithXml
FROM _Table1
INNER JOIN _Activity on _Activity.oid = _Table1.columnName
INNER JOIN _ActivityType on _Activity.activityType = _ActivityType.oid
--_ActivityType.forType is a string
WHERE _ActivityType.forType = '_Disclosure'
AND _Activity.emailRecipients IS NOT NULL
AND someColumnWithXml LIKE '%'+'9D62EE8855797448A7C689A09D193042'+'%'

SQL - faster to filter by large table or small table

I have the below query which takes a while to run, since ir_sales_summary is ~ 2 billion rows:
select c.ChainIdentifier, s.SupplierIdentifier, s.SupplierName, we.Weekend,
sum(sales_units_cy) as TY_unitSales, sum(sales_cost_cy) as TY_costDollars, sum(sales_units_ret_cy) as TY_retailDollars,
sum(sales_units_ly) as LY_unitSales, sum(sales_cost_ly) as LY_costDollars, sum(sales_units_ret_ly) as LY_retailDollars
from ir_sales_summary i
left join Chains c
on c.ChainID = i.ChainID
inner join Suppliers s
on s.SupplierID = i.SupplierID
inner join tmpWeekend we
on we.SaleDate = i.saledate
where year(i.saledate) = '2017'
group by c.ChainIdentifier, s.SupplierIdentifier, s.SupplierName, we.Weekend
(Worth noting, it takes roughly 3 hours to run since it is using a view that brings in data from a legacy service)
I'm thinking there's a way to speed up the filtering, since I just need the data from 2017. Should I be filtering from the big table (i) or be filtering from the much smaller weekending table (which gives us just the week ending dates)?
Try this. It might help; I believe that joining a static table, as the first table in the query, onto a fact/dynamic table impacts query performance.
SELECT c.ChainIdentifier
,s.SupplierIdentifier
,s.SupplierName
,i.Weekend
,sum(sales_units_cy) AS TY_unitSales
,sum(sales_cost_cy) AS TY_costDollars
,sum(sales_units_ret_cy) AS TY_retailDollars
,sum(sales_units_ly) AS LY_unitSales
,sum(sales_cost_ly) AS LY_costDollars
,sum(sales_units_ret_ly) AS LY_retailDollars
FROM Suppliers s
INNER JOIN (
SELECT we.Weekend
,supplierid
,chainid
,sales_units_cy
,sales_cost_cy
,sales_units_ret_cy
,sales_units_ly
,sales_cost_ly
,sales_units_ret_ly
FROM ir_sales_summary i
INNER JOIN tmpWeekend we
ON we.SaleDate = i.saledate
WHERE year(i.saledate) = '2017'
) i
ON s.SupplierID = i.SupplierID
INNER JOIN Chains c
ON c.ChainID = i.ChainID
GROUP BY c.ChainIdentifier
,s.SupplierIdentifier
,s.SupplierName
,i.Weekend
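One further tweak worth trying (a sketch, not part of the rewrite above): on SQL Server, wrapping the column in year() makes the predicate non-sargable, so an index or partition on saledate cannot be used for elimination. A range filter on the bare column may let the big table be filtered more cheaply, assuming saledate is a date/datetime column:
-- Sketch: equivalent filter for calendar year 2017, written sargably.
WHERE i.saledate >= '20170101'
  AND i.saledate <  '20180101'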

Query faster with top attribute

Why is this query faster in SQL Server 2008 R2 (Version 10.50.2806.0)
SELECT
MAX(AtDate1),
MIN(AtDate2)
FROM
(
SELECT TOP 1000000000000
at.Date1 AS AtDate1,
at.Date2 AS AtDate2
FROM
dbo.tab1 a
INNER JOIN
dbo.tab2 at
ON
a.id = at.RootId
AND CAST(GETDATE() AS DATE) BETWEEN at.Date1 AND at.Date2
WHERE
a.Number = 223889
)B
than
SELECT
MAX(AtDate1),
MIN(AtDate2)
FROM
(
SELECT
at.Date1 AS AtDate1,
at.Date2 AS AtDate2
FROM
dbo.tab1 a
INNER JOIN
dbo.tab2 at
ON
a.id = at.RootId
AND CAST(GETDATE() AS DATE) BETWEEN at.Date1 AND at.Date2
WHERE
a.Number = 223889
)B
?
The first statement, with the TOP attribute, is six times faster.
The count(*) of the inner subquery is 9280 rows.
Can I use a hint to make the SQL Server optimizer get this right?
I see you've now posted the plans. Just luck of the draw.
Your actual query is a 16 table join.
SELECT max(atDate1) AS AtDate1,
min(atDate2) AS AtDate2,
max(vtDate1) AS vtDate1,
min(vtDate2) AS vtDate2,
max(bgtDate1) AS bgtDate1,
min(bgtDate2) AS bgtDate2,
max(lftDate1) AS lftDate1,
min(lftDate2) AS lftDate2,
max(lgtDate1) AS lgtDate1,
min(lgtDate2) AS lgtDate2,
max(bltDate1) AS bltDate1,
min(bltDate2) AS bltDate2
FROM (SELECT TOP 100000 at.Date1 AS atDate1,
at.Date2 AS atDate2,
vt.Date1 AS vtDate1,
vt.Date2 AS vtDate2,
bgt.Date1 AS bgtDate1,
bgt.Date2 AS bgtDate2,
lft.Date1 AS lftDate1,
lft.Date2 AS lftDate2,
lgt.Date1 AS lgtDate1,
lgt.Date2 AS lgtDate2,
blt.Date1 AS bltDate1,
blt.Date2 AS bltDate2
FROM dbo.Tab1 a
INNER JOIN dbo.Tab2 at
ON a.id = at.Tab1Id
AND cast(Getdate() AS DATE) BETWEEN at.Date1 AND at.Date2
INNER JOIN dbo.Tab5 v
ON v.Tab1Id = a.Id
INNER JOIN dbo.Tab16 g
ON g.Tab5Id = v.Id
INNER JOIN dbo.Tab3 vt
ON v.id = vt.Tab5Id
AND cast(Getdate() AS DATE) BETWEEN vt.Date1 AND vt.Date2
LEFT OUTER JOIN dbo.Tab4 vk
ON v.id = vk.Tab5Id
LEFT OUTER JOIN dbo.VerkaufsTab3 vkt
ON vk.id = vkt.Tab4Id
LEFT OUTER JOIN dbo.Plu p
ON p.Tab4Id = vk.Id
LEFT OUTER JOIN dbo.Tab15 bg
ON bg.Tab5Id = v.Id
LEFT OUTER JOIN dbo.Tab7 bgt
ON bgt.Tab15Id = bg.Id
AND cast(Getdate() AS DATE) BETWEEN bgt.Date1 AND bgt.Date2
LEFT OUTER JOIN dbo.Tab11 b
ON b.Tab15Id = bg.Id
LEFT OUTER JOIN dbo.Tab14 lf
ON lf.Id = b.Id
LEFT OUTER JOIN dbo.Tab8 lft
ON lft.Tab14Id = lf.Id
AND cast(Getdate() AS DATE) BETWEEN lft.Date1 AND lft.Date2
LEFT OUTER JOIN dbo.Tab13 lg
ON lg.Id = b.Id
LEFT OUTER JOIN dbo.Tab9 lgt
ON lgt.Tab13Id = lg.Id
AND cast(Getdate() AS DATE) BETWEEN lgt.Date1 AND lgt.Date2
LEFT OUTER JOIN dbo.Tab10 bl
ON bl.Tab11Id = b.Id
LEFT OUTER JOIN dbo.Tab6 blt
ON blt.Tab10Id = bl.Id
AND cast(Getdate() AS DATE) BETWEEN blt.Date1 AND blt.Date2
WHERE a.Nummer = 223889) B
On both the good and bad plans the Execution Plan shows "Reason for Early Termination of Statement Optimization" as "Time Out".
The two plans have slightly different join orders.
The only join in the plans not satisfied by an index seek is that on Tab9. This has 63,926 rows.
The missing index details in the execution plan suggest that you create the following index.
CREATE NONCLUSTERED INDEX [missing_index]
ON [dbo].[Tab9] ([Date1],[Date2])
INCLUDE ([Tab13Id])
The problematic part of the bad plan can be clearly seen in SQL Sentry Plan Explorer
SQL Server estimates that 1.349174 rows will be returned from the previous joins coming into the join on Tab9. And therefore costs the nested loops join as if it will need to execute the scan on the inside table 1.349174 times.
In fact 2,600 rows feed into that join meaning that it does 2,600 full scans of Tab9 (2,600 * 63,926 = 164,569,600 rows.)
It just so happens that on the good plan the estimated number of rows coming into the join is 2.74319. This is still wrong by three orders of magnitude, but the slightly increased estimate means SQL Server favors a hash join instead. A hash join just does one pass through Tab9.
I would first try adding the missing index on Tab9.
Also/instead you might try updating the statistics on all tables involved (especially those with a date predicate, such as Tab2, Tab3, Tab7, Tab8, Tab6) and see if that goes some way to correcting the huge discrepancy between estimated and actual rows on the left of the plan.
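A hedged sketch of that step in T-SQL (WITH FULLSCAN is optional; it trades longer statistics builds for better estimates):
UPDATE STATISTICS dbo.Tab9 WITH FULLSCAN;
UPDATE STATISTICS dbo.Tab2 WITH FULLSCAN;
-- repeat for Tab3, Tab7, Tab8, Tab6 and the other tables in the query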
Also breaking the query up into smaller parts and materialising these into temporary tables with appropriate indexes might help. SQL Server can then use the statistics on these partial results to make better decisions for joins later in the plan.
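For example, a hedged sketch of materialising one early part of the query into an indexed temp table (the temp table and index names are made up; the choice of which part to materialise is illustrative, not prescriptive):
-- Sketch only: materialise the selective part first, then join the rest to it.
SELECT a.Id, at.Date1 AS AtDate1, at.Date2 AS AtDate2
INTO #CurrentTab2
FROM dbo.Tab1 a
INNER JOIN dbo.Tab2 at
    ON a.id = at.Tab1Id
   AND CAST(GETDATE() AS DATE) BETWEEN at.Date1 AND at.Date2
WHERE a.Nummer = 223889;

CREATE INDEX IX_CurrentTab2_Id ON #CurrentTab2 (Id);
-- The remaining joins can then reference #CurrentTab2 instead of Tab1/Tab2.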
Only as a last resort would I consider using query hints to try and force the plan with a hash join. Your options for doing that are either the USE PLAN hint, in which case you dictate exactly the plan you want including all join types and orders, or stating LEFT OUTER HASH JOIN tab9 .... This second option also has the side effect of fixing all join orders in the plan. Both mean that SQL Server will be severely limited in its ability to adjust the plan with changes in data distribution.
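A hedged sketch of that second option: the Tab9 join from the query above would become the following (remember this also fixes the join order for the whole query):
LEFT OUTER HASH JOIN dbo.Tab9 lgt
    ON lgt.Tab13Id = lg.Id
   AND cast(Getdate() AS DATE) BETWEEN lgt.Date1 AND lgt.Date2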
It's hard to answer without knowing the size and structure of your tables, and without seeing the entire execution plan. But the difference between the two plans is a Hash Match join for the "top n" query vs a Nested Loops join for the other one.
Hash Match is a resource-intensive join, because the server has to prepare hash buckets in order to use it, but it becomes much more effective for big tables. Nested Loops, which compares each row in one table to every row in the other, works great for small tables because no such preparation is needed.
What I think is that by selecting TOP 1000000000000 rows in the subquery you give the optimizer a hint that your subquery will produce a great amount of data, so it uses Hash Match. But in fact the output is small, so Nested Loops works better.
What I just said is based on shreds of information, so please have a heart when criticising my answer ;).

Optimize SQL with Interbase

I was inspired by the good answers from my previous question about SQL.
Now this SQL is run on a DB with Interbase 2009. It is about 21 GB in size.
SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and not AddrDistance.bold_id in (select bold_id from DistanceQueryTask)
Order By Created Desc
There are 840,000 rows in AddrDistance, 190,000 rows in Address, and 4 in DistanceQueryTask.
The question is: can this be done faster? I guess the same subquery, select bold_id from DistanceQueryTask, is run many times. Note that I'm not interested in stored procedures, just plain SQL :)
EDIT1 Here is the current execution plan:
Statement: SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and not AddrDistance.bold_id in (select bold_id from DistanceQueryTask)
Order By Created Desc
PLAN (DISTANCEQUERYTASK INDEX (RDB$PRIMARY218))
PLAN SORT (JOIN (JOIN (ADDRDISTANCE NATURAL,ADDRESSFROM INDEX (RDB$PRIMARY234)),ADDRESSTO INDEX (RDB$PRIMARY234)))
And yes, DistanceQueryTask is meant to have a low number of rows in the database.
Using LEFT JOIN and subqueries will slow down any query.
You can get some improvement with the correct indexes (on Bold_Id, DistanceAsMeters, and PseudoDistanceAsCostKm); remember that more indexes increase the size of the database.
I suppose bold_id is your key, and thus properly indexed.
Then replacing the subselect and the not...in by a join might help the optimizer.
SELECT DistanceAsMeters, AddrDistance.Bold_Id, AddrDistance.Created, AddressFrom.CityName_CO as FromCity, AddressTo.CityName_CO as ToCity
FROM AddrDistance
LEFT JOIN Address AddressFrom ON AddrDistance.FromAddress = AddressFrom.Bold_Id
LEFT JOIN Address AddressTo ON AddrDistance.ToAddress = AddressTo.Bold_Id
LEFT JOIN DistanceQueryTask ON AddrDistance.bold_id = DistanceQueryTask.bold_id
Where DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0
and DistanceQueryTask.bold_id is null
Order By Created Desc
Create an index for this part of the WHERE clause: (DistanceAsMeters = 0 and PseudoDistanceAsCostKm = 0),
because the plan currently does a (bad) full table scan for it: ADDRDISTANCE NATURAL.
And try to use the join instead of the subselect, as stated by Francois.
As Daniel and Andre suggest, an index helps a lot.
I would suggest the index (DistanceAsMeters, PseudoDistanceAsCostKm, Bold_Id): because the first two parts of the index are constant in the WHERE clause, only a small portion of the index needs to be read.
If it is a fact that FromAddress and/or ToAddress always exist, you can change the LEFT JOINs to INNER JOINs, because that is often faster (the query optimizer can make stronger assumptions).
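A hedged sketch of that index in InterBase SQL (the index name is made up; I am assuming the column is DistanceAsMeters, as in the query):
CREATE INDEX IDX_ADDRDISTANCE_FILTER
  ON AddrDistance (DistanceAsMeters, PseudoDistanceAsCostKm, Bold_Id);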