I have an SSIS package that has the following query as its OLE DB Source:
SELECT SESM.[Id]
,SE.SessionId
,SESM.[SegmentId]
,SE.[Id] as 'SessionEntryId'
,V.[Number] as 'VehicleNumber'
,SELS.[SessionEntryLapId]
,SEL.[LapNumber]
,SESM.[Name]
,SESM.[NotNulls]
,SESM.[OutOfRange]
,SESM.[Nulls]
,SESM.[Mean]
,SESM.[Variance]
,SESM.[Min]
,SESM.[Max]
,SESM.[P05]
,SESM.[P10]
,SESM.[P20]
,SESM.[P25]
,SESM.[P50]
,SESM.[P75]
,SESM.[P80]
,SESM.[P90]
,SESM.[P95]
,SESM.[Value]
,SESM.[Percentage]
,SESM.[Discriminator]
FROM [dbo].[SessionEntrySegmentMetrics] SESM
LEFT JOIN [SessionEntryLapSegments] SELS on SESM.SegmentId = SELS.Id
LEFT JOIN SessionEntryLaps SEL on SELS.SessionEntryLapId = SEL.Id
LEFT JOIN SessionEntries SE on SEL.SessionEntryId = SE.Id
LEFT JOIN Vehicles V on SE.VehicleId = V.Id
This query returns 140M rows, which the package pushes through a small data conversion and then transfers to our warehouse. The package is averaging 1M rows per hour, which is unacceptably slow.
Looking at the query in SSMS, this is the execution plan:
The two big cost points are the Index Scan on SessionEntrySegmentMetrics at 71% and the Hash Match at 16%. SessionEntrySegmentMetrics has 140M rows in it, and the index it's using is 70% fragmented with 60% page fullness.
The memory on the SQL box executing the SSIS package is pegged at 97%.
Besides the fragmentation issue, are there any ideas on how to eliminate that Hash Match and improve the performance of this query?
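For context, a minimal sketch of addressing the fragmentation described above; the index name IX_SESM_SegmentId is a placeholder for whichever index the scan actually uses, and the FILLFACTOR value is only illustrative, chosen against the 60% page fullness mentioned:

ALTER INDEX IX_SESM_SegmentId   -- placeholder name; substitute the real index
    ON dbo.SessionEntrySegmentMetrics
    REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON);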
Related
I have this SQL query, but I've found that it can take up to 11 seconds to run. I'm really confused because when I change the date selection to a 2018 date, it returns instantly.
Here's the query:
select
cv.table3ID, dm.Column1 ,dm.Column2, mm.Column1,
convert(varchar, cv.Date, 107) as Date,
mm.table2ID, dm.table1ID, mm.Column2,
count(ctt.table4ID) as Total
from
table1 dm
inner join
table2 mm on mm.table2ID = dm.table1ID
inner join
table3 cv on cv.table3ID = mm.table2ID
left join
table4 ct on ct.table4CVID = cv.table3ID
inner join
table4 ctt on ctt.table4MMID = mm.table2ID
where
ctt.table4Date >= '2019-01-19'
and ct.table4CVID is null
and dm.Column1 like '%Albert%'
and cv.Column1 = 39505
and cv.Status = 'A'
group by
cv.table3ID, dm.Column1 ,dm.Column2, mm.Column1,
cv.Date, mm.table2ID, dm.table1ID, mm.Column2
I've found that when I execute that query with ctt.table4Date >= '2018-01-19', the response is immediate. But with '2019-01-19', it takes 11 seconds.
Initially, when I found that the query took 11 seconds, I thought it had to be an indexing issue, but I'm not sure any more that it's got to do with the index, since it executes well for an older date.
I've looked at the execution plan for the query with the different dates and they look completely different.
Any thoughts on why this might be happening? Does it have anything to do with updating the statistics?
[Update]
The image below compares the execution plans between 2018 and 2019 for table4 ctt. According to the plans, this operator takes up 43% of the cost in 2018 and 45% in 2019.
Execution plan comparison of table4 ctt, 2019 vs. 2018. Top is 2019, bottom is 2018.
The image here is the execution plan comparison again, for table4 as ct. Same here: top is 2019 and bottom is 2018.
Execution plan comparison of table4 ct, 2019 vs. 2018. Top is 2019, bottom is 2018.
[Update 2]
Here are the SQL Execution Plans:
When using '2018-01-19' as the date: https://www.brentozar.com/pastetheplan/?id=SyUh8xXQV
When using '2019-01-19' as the date: https://www.brentozar.com/pastetheplan/?id=rkELW1Q7V
The problem is most likely that more rows are being returned from the other tables. The clustered index scan that you linked in your [update] just shows the clustered index seek.
You do, however, need to realise that the index seek is being invoked 144 times, and the actual number of rows read is in the 8-digit range, which is what is causing the slow response.
I'm guessing that when this works fine for you, the actual number of executions against this table is 1. The 144 executions are killing you here, given the poor seek predicates. If you know the query plan that works for you, and the indexes are already in place to back it up, you should force seeks (FORCESEEK) and give explicit hints to join in a particular order.
Edit
I took a look at the shared plans. Changing the date to 2018 works faster for you because SQL Server switches to a Hash Match in place of a nested loop join, given the amount of data being processed.
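For illustration only, here is a hedged sketch of those hints against the placeholder schema from the question: FORCESEEK pins the seek on table4 ctt and OPTION (FORCE ORDER) keeps the written join order, while swapping the OPTION clause for OPTION (HASH JOIN) would instead force the hash-join shape the faster 2018 plan uses.

SELECT cv.table3ID, dm.Column1, dm.Column2, mm.Column1,
       CONVERT(varchar, cv.Date, 107) AS Date,
       mm.table2ID, dm.table1ID, mm.Column2,
       COUNT(ctt.table4ID) AS Total
FROM table1 dm
INNER JOIN table2 mm ON mm.table2ID = dm.table1ID
INNER JOIN table3 cv ON cv.table3ID = mm.table2ID
LEFT JOIN table4 ct ON ct.table4CVID = cv.table3ID
INNER JOIN table4 ctt WITH (FORCESEEK) ON ctt.table4MMID = mm.table2ID  -- needs a usable index on table4MMID/table4Date
WHERE ctt.table4Date >= '2019-01-19'
  AND ct.table4CVID IS NULL
  AND dm.Column1 LIKE '%Albert%'
  AND cv.Column1 = 39505
  AND cv.Status = 'A'
GROUP BY cv.table3ID, dm.Column1, dm.Column2, mm.Column1,
         cv.Date, mm.table2ID, dm.table1ID, mm.Column2
OPTION (FORCE ORDER);  -- or OPTION (HASH JOIN) to test the 2018-style plan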
I have a query that cross joins two tables. TABLE_1 has 15,000 rows and TABLE_2 has 50,000 rows. A query very similar to this one has run in the past in roughly 10 minutes. Now it is running indefinitely with the same server situation (i.e. nothing else running), and the very similar query is also running indefinitely.
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0) >= 0.9
When I run the estimated execution plan for this query, the Nested Loops (Inner Join) node is estimated at 96% of the execution. The estimated number of rows is 218 million, even though cross joining the tables should result in 15,000 * 50,000 = 750 million rows. When I add INSERT INTO #temp_table to the beginning of the query, the estimated execution plan puts Insert Into at 97% and estimates the number of rows as 218 million. In reality, there should be less than 100 matches that have a similarity score above 0.9.
I have read that large differences in estimated vs. actual row counts can impact performance. What could I do to test/fix this?
Yes, this is true. It particularly affects optimizations involving join algorithms, aggregation algorithms, and indexes.
But it is not true for your query. Your query has to do a nested loops join with no indexes. All pairs of values in the two tables need to be compared. There is little algorithmic flexibility and (standard) indexes cannot really help.
For better performance, use the minScoreHint parameter. It lets the similarity calculation exit early and avoid doing the full computation for many pairs.
So this should run quicker:
SELECT A.KEY_1
,A.FULL_TEXT_1
,B.FULL_TEXT_2
,B.KEY_2
,MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) AS confidence
FROM #TABLE_1 A
CROSS JOIN #TABLE_2 B
WHERE MDS_DB.MDQ.SIMILARITY(A.FULL_TEXT_1,B.FULL_TEXT_2, 2, 0, 0, 0.9) >= 0.9
It is not clear from the docs whether results scoring exactly 0.9 would be included. If not, change 0.9 to 0.89.
The link provided by scsimon will help you prove whether it's statistics or not. Have the estimates changed significantly compared to when it was running fast?
Parallelism also springs to mind. If the query was going parallel but now isn't (e.g. if a server setting or the statistics have changed), that could cause significant performance degradation.
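A hedged way to check whether the server-level parallelism settings could be involved (standard catalog view, nothing specific to this system):

-- Instance-level settings that control whether a plan can go parallel.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('max degree of parallelism', 'cost threshold for parallelism');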
I hope this is concise. I am basically looking for a methodology for improving queries, after watching one of my colleagues speed up my query almost 10-fold with a quick change.
I had a query on two tables, t_item and t_action.
t_item is basically an item with characteristics, and t_action holds the events or actions performed on that item, with a timestamp for each action; each action also has an id.
My query joined the two tables on id. There were also some criteria on t_action.action_type, which is free text.
My simplified original query was like the one below:
SELECT *
FROM t_item
JOIN t_action
ON t_item.pk = t_action.fk
WHERE t_action.action_type LIKE ('%PURCHASE%')
AND t_item.location = 'DE'
This ran OK; it came back in roughly 8 minutes.
My colleague changed it so that the t_action.action_type condition ended up in the FROM (join) portion of the SQL. This reduced the time to 2 minutes:
SELECT *
FROM t_item
JOIN t_action
ON t_item.pk = t_action.fk
AND t_action.action_type LIKE ('%PURCHASE%')
WHERE t_item.location = 'DE'
My question is: generally, how do you know when to put conditions in the FROM (join) clause vs. in the WHERE clause?
I thought the Teradata SQL optimizer did this automatically.
Thank you for your help
In this case, you don't actually need to understand the plan. You just need to see if the two plans are the same. Teradata has a pretty good optimizer, so I would not expect there to be a difference between the two versions (there could be, but I would be surprised). Hence, caching is a possibility for explaining the difference in performance.
For this query:
SELECT *
FROM t_item JOIN
t_action
ON t_item.pk = t_action.fk
AND t_action.action_type LIKE '%PURCHASE%'
WHERE t_item.location = 'DE';
The best indexes are probably on t_item(location, pk) and t_action(action_type). However, you should try to get rid of the wildcards for a production query; they make the query harder to optimize, which in turn can have a large impact on performance.
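A hedged sketch of those secondary indexes in Teradata syntax (note that no index helps the leading '%' in LIKE '%PURCHASE%' itself):

CREATE INDEX (location, pk) ON t_item;      -- secondary index for the location filter plus join key
CREATE INDEX (action_type) ON t_action;     -- of limited use while the leading wildcard remains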
I tried to create a similar query but didn't see any difference in the explain plan, though the record counts were smaller: trans (15k) and accounts (10k), with indexes on Account_Number. It's probably what Gordon has suggested; try running the query at different times, and also check the explain plans for both queries to see if there is any difference.
Explain select * from trans t
inner join
ap.accounts a
on t.account_number = a.account_number
where t.trans_id like '%DEP%';
4) We do an all-AMPs JOIN step from ap.a by way of a RowHash match
scan with no residual conditions, which is joined to ap.t by way
of a RowHash match scan with a condition of ("ap.t.Trans_ID LIKE
'%DEP%'"). ap.a and ap.t are joined using a merge join, with a
join condition of ("ap.t.Account_Number = ap.a.Account_Number").
The result goes into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 11,996 rows (1,511,496 bytes). The estimated time for this
step is 0.25 seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.25 seconds.
Explain select * from trans t
inner join
ap.accounts a
on t.account_number = a.account_number
and t.trans_id like '%DEP%';
4) We do an all-AMPs JOIN step from ap.a by way of a RowHash match
scan with no residual conditions, which is joined to ap.t by way
of a RowHash match scan with a condition of ("ap.t.Trans_ID LIKE
'%DEP%'"). ap.a and ap.t are joined using a merge join, with a
join condition of ("ap.t.Account_Number = ap.a.Account_Number").
The result goes into Spool 1 (group_amps), which is built locally
on the AMPs. The size of Spool 1 is estimated with no confidence
to be 11,996 rows (1,511,496 bytes). The estimated time for this
step is 0.25 seconds.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.25 seconds.
The general order of query processing on Teradata is:
Where/And + Joins
Aggregate
Having
Olap/Window
Qualify
Sample/Top
Order By
Format
An easy way to remember is WAHOQSOF - as in Wax On, Wax Off :)
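As a small illustration of that order, assuming t_action has columns named action_id and action_ts (column names assumed, not from the original post): WHERE is applied first, the window function is computed next, QUALIFY filters on its result, and ORDER BY runs last.

SELECT fk,
       action_id,
       action_ts
FROM t_action
WHERE action_type LIKE '%PURCHASE%'                                       -- Where/And
QUALIFY ROW_NUMBER() OVER (PARTITION BY fk ORDER BY action_ts DESC) = 1   -- Olap/Window + Qualify
ORDER BY fk;                                                              -- Order By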
I have a SQL query which takes 14 seconds to execute for a single record.
SELECT CT.SYS_CHANGE_CONTEXT
FROM CHANGETABLE(CHANGES OrderDetail_MO_IN, 0) AS CT
LEFT OUTER JOIN dbo.[OrderDetail_MO_IN] AS a ON CT.[MOB_RECORDID] = a.[MOB_RECORDID]
WHERE CT.SYS_CHANGE_CONTEXT = CAST(N'11B1CE95-CD2B-4165-BCD6-090B83633573' AS varbinary(128))
When I look at the Execution Plan, it shows 92% cost on the Sort operation and a warning "Operator used tempdb to spill data during execution with spill level 1".
Can anybody please let me know why the query is taking a long time and how to optimize it?
Regards,
Adarsh
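One hedged observation, based on an assumption about the workload rather than anything stated above: CHANGETABLE(CHANGES ..., 0) enumerates every tracked change since version 0, so passing the last version that was actually synchronised keeps the change-table scan, and therefore the sort, much smaller. A sketch:

-- CHANGE_TRACKING_MIN_VALID_VERSION is used here only as a stand-in for the
-- application's stored last-sync version.
DECLARE @last_sync_version bigint =
    CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.OrderDetail_MO_IN'));

SELECT CT.SYS_CHANGE_CONTEXT
FROM CHANGETABLE(CHANGES dbo.OrderDetail_MO_IN, @last_sync_version) AS CT
LEFT OUTER JOIN dbo.[OrderDetail_MO_IN] AS a
    ON CT.[MOB_RECORDID] = a.[MOB_RECORDID]
WHERE CT.SYS_CHANGE_CONTEXT = CAST(N'11B1CE95-CD2B-4165-BCD6-090B83633573' AS varbinary(128));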
I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria, however I really want to know why my change fixed the problem.
The problem query:
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000'
ORDER BY actionhistory_id
)
WHERE rownum <= 30000;
The "fix"
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000' and actionhistory_id <= 'ACT100030000'
ORDER BY actionhistory_id
)
All of the _id columns are indexed sequences.
The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database.
Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column.
For the first query, Oracle has to read all the index blocks containing entries for records that come after 'ACT100010000', then match those entries back to the table to get the rows, and then pull 29999 records from the result set.
For the second query, Oracle only has to read the index blocks containing entries between 'ACT100010000' and 'ACT100030000', then fetch from the table only the records represented in those blocks. That is a lot less work in the step of fetching each row after finding its index entry than in the first query.
Regarding your last line about ids less than ACT100000000: it sounds to me like those records may all be in the same memory block (or in a contiguous set of blocks).
EDIT: Please also consider what Justin says - I was talking about actual performance, but he is pointing out that the id being a varchar greatly increases the number of potential values (as opposed to a number), and that the estimated plan may reflect a greater time than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function-based index on the id column, or you could make it a combination key, with the varchar portion in one column and the numeric portion in another.
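A hedged sketch of that function-based index idea, assuming the numeric portion of actionhistory_id always starts right after the 'ACT' prefix (i.e. at character 4):

CREATE INDEX actionhistory_id_num_ix
    ON actionhistory (TO_NUMBER(SUBSTR(actionhistory_id, 4)));

-- Queries must filter on the same expression to use it, e.g.:
-- WHERE TO_NUMBER(SUBSTR(actionhistory_id, 4)) BETWEEN 10000 AND 30000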
What are the plans for both queries?
Are the statistics on your tables up to date?
Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo?
Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times?
My guess, without that information, is that you appear to be using the wrong data type for your actionhistory_id column, which is hurting the Oracle optimizer's ability to generate appropriate cardinality estimates; that likely causes it to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000, but the optimizer is not that smart. It sees a 13-character string and isn't able to figure out that the last 5 characters are always going to be digits, so there are only 10 possible values per position rather than 256 (assuming 8-bit characters), or that the first 8 characters are always going to be the same constant value. If, on the other hand, actionhistory_id were defined as a NUMBER with values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.
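To illustrate that last point, a hypothetical alternative definition (not the actual table): with a numeric key, a range predicate has an obvious selectivity for the optimizer.

CREATE TABLE actionhistory_alt (
    actionhistory_num NUMBER(10) PRIMARY KEY   -- e.g. 1 .. 30000 instead of 'ACT...' strings
    -- remaining columns omitted
);

-- The optimizer can estimate this range directly from the column's min/max statistics:
-- SELECT * FROM actionhistory_alt WHERE actionhistory_num BETWEEN 10000 AND 30000;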