MariaDB SQL statement: force order of tables for the optimizer - sql

I have a rather complex query that takes some 5 seconds to run.
EDITED: complete code for better understanding:
SELECT
    AGEID, AGECLIID, CLINombre, CLIEstado, AGEOnline, AGEOnlineCancelacion,
    IF(CANID > 0, 'Y', 'N') AS AGECancelado, ACTIOrden, ACTIColor, ACTINombre,
    ACTICapacidad, 'Y' AS slotactivo
FROM
    ( agenda
      JOIN cliente            ON AGECLIID = CLIID AND CLICentro = 'Madrid'
      JOIN actividad          ON ACTIID = AGEACTIID
      LEFT JOIN agcambio      ON CAMAGEID = AGEID
      LEFT JOIN agcancelacion ON CANAGEID = AGEID
    ) RIGHT JOIN horario ON DIAOrden = COALESCE( CAMDia, AGEDia )
ORDER BY HORHora, MINMinuto, DIAOrden
NOTE: Originally, it was "SELECT FROM horario LEFT JOIN all_the_rest of the query".
This produces the following explain plan (I could only attach it as a link, not an inline picture):
[explain plan with right join]
It takes like 5 seconds to execute.
The key here is that the part of the query that retrieves busy slots, executed on its own, takes almost no time. The timeslots table (horario) has 336 rows (all possible time slots), so the RIGHT JOIN should not take that long.
The explain plan says that first it does an ALL access to timeslots, and after that all the other tables are accessed via INDEX.
So I wanted to change the table access order, forcing timeslots to be accessed last. I changed the order in the FROM clause, switching the JOIN to a RIGHT JOIN, but the explain plan stays the same (and the query still takes 5 seconds).
While testing, I tried STRAIGHT_JOIN to force the table access order, and it works: it took less than a second to return the rows (partly because it becomes an inner join, but it is still significantly faster). However, I cannot use STRAIGHT_JOIN because I need the RIGHT JOIN, and I've read that there is no way to combine STRAIGHT_JOIN with an OUTER JOIN.
The explain plan of the execution with STRAIGHT_JOIN is the following:
[explain plan with straight_join]
I have tried to create a view but the result was the same.
So,
a) Can I somehow force the optimizer to access the timeslots table last?
b) But the more basic question is why it takes so long, if there are only 336 rows to RIGHT JOIN against (about 150 rows of busy slots)... Maybe answering this could help me rewrite the query.
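One rewrite I'm experimenting with (not verified to give exactly the same rows in every edge case) is to wrap the busy-slot part in a derived table and LEFT JOIN it from horario, hoping the optimizer materialises it first; slot_dia is just an alias I made up, and MariaDB may merge the derived table back into the outer query unless derived merging is disabled (I believe SET optimizer_switch='derived_merge=off' controls this):
SELECT h.HORHora, h.MINMinuto, h.DIAOrden, b.*
FROM horario h
LEFT JOIN (
    SELECT AGEID, AGECLIID, CLINombre, CLIEstado, AGEOnline, AGEOnlineCancelacion,
           IF(CANID > 0, 'Y', 'N') AS AGECancelado,
           ACTIOrden, ACTIColor, ACTINombre, ACTICapacidad, 'Y' AS slotactivo,
           COALESCE(CAMDia, AGEDia) AS slot_dia
    FROM agenda
    JOIN cliente            ON AGECLIID = CLIID AND CLICentro = 'Madrid'
    JOIN actividad          ON ACTIID = AGEACTIID
    LEFT JOIN agcambio      ON CAMAGEID = AGEID
    LEFT JOIN agcancelacion ON CANAGEID = AGEID
) AS b ON b.slot_dia = h.DIAOrden
ORDER BY h.HORHora, h.MINMinuto, h.DIAOrden;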
Many thanks in advance for any hint... Meanwhile I'm rewriting the query again and again.... :)
Xavi.

Related

How can this change be making my query slow (OR vs UNION) and can I fix it?

I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause conditioning on either a property of 1 table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE @UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId) OR Bookings.MixedDealBroker in (@UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked and IN (@UserId) vs = @UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s records, and the filter cuts it down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas about how this comes about?
What (if anything) can I do to fix the performance without just re-writing the query as a series of UNIONs (which is not viable for a variety of reasons)?
=-=-=-=-=-=-=
Execution Plans: (not reproduced here)
This is a common limitation of SQL engines, not just SQL Server but other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that, for the most part, only one index can be seeked into per instance of a table object at a time, and in your specific case the OR predicate spans two different tables, along with other factors in how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, which can each be seeked separately in the most efficient way possible. That allows the SQL engine to pick a better execution plan to serve your query.
This is pretty much just one of those things that is the way it is; remember the UNION workaround for future encounters with this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
Essentially, a new execution plan is generated whenever one doesn't already exist in the plan cache for the given query. The engine determines whether a plan for a query is already cached by hashing the exact query text, so even an extra space character at the end of the query can result in a new plan being generated, and theoretically that plan can be different. So a differently written query (despite being logically the same) can certainly result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as different data and statistics of that data, in the tables referenced in the query between executions. But these reasons don't really apply to your question above.
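If you want to see this for yourself, a rough way to find the separately cached plans of the two phrasings is to query the standard SQL Server DMVs, something like the sketch below (the LIKE filter is just an example for the queries above):
SELECT st.text, cp.usecounts, cp.size_in_bytes, cp.plan_handle
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.text LIKE '%Bookings%'                    -- example filter
  AND st.text NOT LIKE '%dm_exec_cached_plans%';   -- exclude this probe itself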
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
LEFT JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select makes the UNION more efficient. The UNION can also be changed to a yet more efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> @UserId in the second part, to avoid overlapping results.
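For illustration, a hedged sketch of that UNION ALL variant (same tables and columns as above; the extra condition in the second branch keeps the two branches from returning the same row twice):
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE BookingPricings.[Owner] = @UserId
UNION ALL
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE Bookings.MixedDealBroker = @UserId
  AND (BookingPricings.[Owner] <> @UserId OR BookingPricings.[Owner] IS NULL)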

How to improve the performance of a 10 min running query?

I have a query which is taking approximately 10 mins to execute and produce the results. When I try to break it into parts and run it, it seems to run fine, within seconds.
I tried modifying the subselects in the top and the bottom portions of the query to determine whether one of them was causing the issue, but it was not; it returned results within 3 seconds.
I am trying to learn to read the Estimated Execution Plan, but it is confusing and hard for me to trace the issue.
Can anyone please point out mistakes I have made that are making the query run so long?
Select Distinct
    PostExtended.BatchNum,
    post.ControlNumStatus,
    post.AccountSeg,
    Post.PostDat
From
    Post
    Post Records
    join (Select Post, MAX(Dist) as Dist, COUNT(fkglDist) as RecordCount
          From PostExtend WITH (NOLOCK)
          Group By flPost) as PostExtender
        on Post.PK = PostExtender.flPost
    join glPostExtended WITH (NOLOCK)
        on glPostExtendedLimiter.Post = glPostExtended.Post
        and (PostExtendedLimiter.fkglDist = PostExtend.Dist or PostExtend.Dist is null)
    join (select lP.fkosControlNumberStatus, lP.SourceJENumber, AccountSegment,
                 sum(case
                 ............
          from Post WITH (NOLOCK)
          join AccountingPeriod WITH (NOLOCK) on AccountingPeriod.pk = lP.fkglAccountingPeriod
          join FiscalYear WITH (NOLOCK) on FiscalYear.pk = AccountingPeriod.FiscalYear
          join Account WITH (NOLOCK) on Account.pk = FiscalYear.Account
          where FiscalYear.Period = @Date
            and glP.fkMLSosCodeEntryType = 2202
          group by glP.fkosControlNumberStatus, glP.SourceNumber, AccountSeg) post
        on post.ControlNumStatus = Post.fkControlNumberStatus
        and postdata.SourceJENumber = glPost.SourceJENumber
    where post.AmountT <> 0)......
Group by
Subqueries are very often the source of the problem.
I would try to:
separate the postdata subquery from the main query,
save the result in a temporary table or even in a table variable,
put clustered index on fkosControlNumberStatus and SourceJENumber fields,
join this temporary table back to the main query.
Sometimes the result of these simple steps is pleasantly surprising; a rough sketch follows below.
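A rough sketch of those steps, using only the column names visible in the posted query (aliases normalised; the elided SUM(CASE ...) columns and any other hidden conditions would need to be carried over as well):
-- 1) Materialise the "post" subquery into a temp table
SELECT lP.fkosControlNumberStatus,
       lP.SourceJENumber,
       AccountSegment
       -- plus the SUM(CASE ...) columns from the original subquery
INTO #post
FROM Post AS lP WITH (NOLOCK)
JOIN AccountingPeriod WITH (NOLOCK) ON AccountingPeriod.pk = lP.fkglAccountingPeriod
JOIN FiscalYear WITH (NOLOCK) ON FiscalYear.pk = AccountingPeriod.FiscalYear
JOIN Account WITH (NOLOCK) ON Account.pk = FiscalYear.Account
WHERE FiscalYear.Period = @Date
  AND lP.fkMLSosCodeEntryType = 2202
GROUP BY lP.fkosControlNumberStatus, lP.SourceJENumber, AccountSegment;

-- 2) Put a clustered index on the columns used to join back
CREATE CLUSTERED INDEX IX_post ON #post (fkosControlNumberStatus, SourceJENumber);

-- 3) Join #post into the main query in place of the inline "post" subquery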
This is a fairly complex query. You are joining on Aggregate Queries (with GROUP BY).
The first thing I would do is see how long it takes to run each of the join queries. One of these may run very fast, while another may run very long. So, you may not really need to optimize the entire query--just one of the joined queries.
Another way to do it is to just start eliminating joins one by one, then run the entire query and see how fast it goes. When you see a really significant decrease in time, you've found the culprit.
Typically, one thing that can add a lot of CPU is comparisons. The sums with case statements might be the biggest suspect.
Have you used the Database Engine Tuning Advisor? If all else fails, go with that and see what it tells you.
So, maybe try this approach:
Take away the CASE Statements inside the SUM expressions on that last join.
Remove the last JOIN with all the sums.
Remove the first join with that GROUP BY and the MAX expression
That would be my strategy.

Force unique key in view to avoid merge join

I'm trying to optimize a query. Basically, there are 3 parts to a transaction that can be repeated. I log all communications, but want to get the "freshest" of the 3 parts. The 3 parts are all linked through a single intermediate table (unfortunately) which is what is slowing this whole thing down (too much normalization?).
There is the center of the "star" ("Transactions"), then the inner spokes (all represented by "TransactionDetails"), which refer to the hub using the "Transactions" primary key, then the outer spokes (PPGDetails, TicketDetails and CompletionDetails), all of which refer to "TransactionDetails" by its primary key.
Each of "PPGDetails", "TicketDetails" and "CompletionDetails" will have exactly one row in "TransactionDetails" that they link to, by primary key. There can be many of each of these pairs of objects per transaction.
So, in order to get the most recent TicketDetails for a transaction, I use this view:
CREATE VIEW [dbo].[TicketTransDetails] AS
select *
from TicketDetails tkd
join (select MAX(TicketDetail_ID) as TicketDetail_ID
from TicketDetails temp1
join TransactionDetails temp2
on temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
group by temp2.Transaction_ID) qq
on tkd.TicketDetail_ID = qq.TicketDetail_ID
join TransactionDetails td
on tkd.TransactionDetail_ID = td.TransactionDetail_ID
GO
The other 2 detail types have similar views.
Then, to get all of the transaction details I want, one row per transaction, I use:
select *
from Transactions t
join CompletionTransDetails cpd
on t.Transaction_ID = cpd.Transaction_ID
left outer join TicketTransDetails tkd
on t.Transaction_ID = tkd.Transaction_ID
left outer join PPGTransDetails ppd
on t.Transaction_ID = ppd.Transaction_ID
where cpd.DateAndTime between '2/1/2017' and '3/1/2017'
It is by design that I want ONLY transactions that have at least 1 "CompletionDetail", but 0 or more "PPGDetail" or "TicketDetail".
This query returns the correct results, but takes 40 seconds to execute, on decent server hardware, and a "Merge Join (Left Outer Join)" immediately before the "SELECT" returns takes 100% of the execution plan time.
If I take out the join to either PPGTransDetails or TicketTransDetails in the final query, it brings the execution time down to ~20 seconds, so a marked improvement, but still doing a Merge Join over a significant number of records (many extraneous, I assume).
When just a single transaction is selected (via the WHERE clause), the query only takes about 4 seconds, and its final step is then a "Nested Loops" operator, which also takes a large portion of the time (96%). I would like this query to take less than a second.
Since the views don't have a primary key, I assume that is what causes the Merge Join to be chosen. That said, I am having trouble creating a query that emulates this functionality - much less one that is more efficient.
Can anyone help me recognize what I may be missing?
Thanks!
--mobrien118
Edit: Adding more info -
Here is the effective data model:
Essentially, for a single transaction there can be MANY PPGDetails, TicketDetails and CompletionDetails, but each one will have its own TransactionDetails row (they are one-to-one, but not enforced in the model, just in software).
There are currently:
1,619,307 "Transactions"
3,564,518 "TransactionDetails"
512,644 "PPGDetails"
1,471,826 "TicketDetails"
1,580,043 "CompletionDetails"
There are currently no foreign key constraints or indexes set up on these items.
First a quick remark:
which also takes a large portion of the time (96%).
This is a bit of a (common) misconception. The 96% there is an estimate of how much resources that 'block' will need. It by no means indicates that 96% of the time inside the query was spent on it. I've had situations where an operator that took over half of the query time-wise was attributed virtually no cost.
Additionally, you seem to be assuming that when you query/join to the view, the system will first prepare the data from the view and only later use that result to 'work out' the rest of the query. This is not the case: the system will 'expand' the view and optimize the combined query, taking everything into account.
For us to understand what's going on you'll need to provide us with the query plan (.sqlplan if you use SqlSentry Plan Explorer), it's that or a full explanation on the table layout, indexes, foreign keys, etc... and a bit of explanation on the data (total rows, expected matches between tables, etc...)
PS: even though everybody seems to be touting 'hash joins' as the solution to everything, nested loops and merge joins often are more efficient.
(trying to understand your queries, is this view equivalent to your view?)
[edit: incorrect view removed to avoid confusion]
Second try: (think I have it right this time)
CREATE VIEW [dbo].[TicketTransDetails] AS
SELECT td.Transaction_ID, tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
JOIN (SELECT MAX(TicketDetail_ID) as max_TicketDetail_ID, temp2.Transaction_ID
FROM TicketDetails temp1
JOIN TransactionDetails temp2
ON temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
GROUP BY temp2.Transaction_ID) qq
ON qq.max_TicketDetail_ID = tkd.TicketDetail_ID
AND qq.Transaction_ID = td.Transaction_ID
It might not be any faster when querying the entire table, but it should be when fetching specific records from the Transactions table.
Indexing-wise you probably want a unique index on TicketDetails (TransactionDetail_ID, TicketDetail_ID)
You'll need similar constructs for the other tables of course.
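For what it's worth, a sketch of that index (the index name is my assumption; the column names come from the view above):
CREATE UNIQUE INDEX IX_TicketDetails_TransDetail_Ticket
    ON TicketDetails (TransactionDetail_ID, TicketDetail_ID);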
Thinking it through a bit further I think this would work too:
CREATE VIEW [dbo].[TicketTransDetails]
AS
SELECT *
FROM (
SELECT td.Transaction_ID,
TicketDetail_ID_rownr = ROW_NUMBER() OVER (PARTITION BY td.Transaction_ID ORDER BY tkd.TicketDetail_ID DESC),
tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
) xx
WHERE TicketDetail_ID_rownr = 1 -- we want the "first one from the end" only
It looks quite a bit more readable, but I'm not sure whether it would be faster or not... you'll have to compare timings and query plans.
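Either way, the view is used exactly like your original one, e.g. (the Transaction_ID value is obviously made up):
SELECT *
FROM TicketTransDetails
WHERE Transaction_ID = 12345;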

Why would a sub query perform better than a literal value in a WHERE clause with multiple joins?

Take the following query:
SELECT *
FROM FactALSAppSnapshot AS LB
LEFT OUTER JOIN MLALSInfoStage AS LA ON LB.ProcessDate = LA.ProcessDate AND
LB.ALSAppID = LA.ALSNumber
LEFT OUTER JOIN MLMonthlyIncomeStage AS LC ON LB.ProcessDate = LC.ProcessDate AND
LB.ALSAppID = LC.ALSNumber
LEFT OUTER JOIN DimBranchCategory AS LI on LB.ALSAppBranchKey = LI.Branch
WHERE LB.ProcessDate=(SELECT TOP 1 LatestProcessDateKey
FROM DimDate)
Notice that the WHERE condition is a scalar sub query. The runtime for this is 0:54, returning 367,853 records.
However, if I switch the WHERE clause to the following:
WHERE LB.ProcessDate=20161116
This somehow causes the query runtime to jump up to 57:33, still returning the same 367,853 records. What is happening behind the scenes that would cause this huge jump in runtime? I would have expected the sub query version to take longer, not the literal integer value.
The table aliased as LI (the last join in the list) seems to be the only table that isn't indexed on its join key, and the query performs much closer to the first version if I remove that join while still using the integer value instead of the sub query.
SQL Server 11
The real answer to your question lies in the execution plan for the query. You can see the actual plan in SSMS.
Without the plan, the rest is speculation. Based on my experience, what changes is the way the joins are processed: queries slow down considerably when the optimizer switches to nested loop joins. This is at the whim of the optimizer, which, when given a constant, decides that is the best way to run the query.
I'm not sure why this would be the case. Perhaps an index on FactALSAppSnapshot(ProcessDate, ALSAppID, ALSAppBranchKey) would speed up both versions of the query.
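If you want to try that, a hedged sketch of such an index (nonclustered and the name are my assumptions; adjust to your schema):
CREATE NONCLUSTERED INDEX IX_FactALSAppSnapshot_ProcessDate
    ON FactALSAppSnapshot (ProcessDate, ALSAppID, ALSAppBranchKey);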

Aggregating two selects with a group by in SQL is really slow

I am currently working with a query in MSSQL that looks like:
SELECT
...
FROM
(SELECT
...
)T1
JOIN
(SELECT
...
)T2
GROUP BY
...
The inner selects are relatively fast, but the outer select aggregates the inner selects and takes an incredibly long time to execute, often timing out. Removing the group by makes it run somewhat faster and changing the join to a LEFT OUTER JOIN speeds things up a bit as well.
Why would doing a group by on a select which aggregates two inner selects cause the query to run so slow? Why does an INNER JOIN run slower than a LEFT OUTER JOIN? What can I do to troubleshoot this further?
EDIT: What makes this even more perplexing is that the two inner queries are date limited, and the overall query only runs slow when looking at date ranges between the start of July and any other day in July; if the date range is anything from before July 1 up to today, then it runs fine.
Without more detail of your query it's impossible to offer any hints as to what may speed it up. A possible guess is that the two inner queries prevent the use of any indexes which might have helped perform the join, resulting in large scans, but there are probably many other possible reasons.
To check where the time is used in the query check the execution plan, there is a detailed explanation here
http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx
The basic run down is run the query, and display the execution plan, then look for any large percentages - they are what is slowing your query down.
Try rewriting your query without the nested SELECTs, which are rarely necessary. When using nested SELECTs - except for trivial cases - the inner SELECT resultsets are not indexed, which makes joining them to anything slow.
As Tetraneutron said, post details of your query -- we may help you rewrite it in a straight-through way.
Have you given a join predicate? I.e. JOIN tableB ON tableA.ColA = tableB.ColB. If you don't give a predicate then SQL may be forced to use nested loops, so if you have a lot of rows in that range it would explain the query slowdown.
Have a look at the plan in SQL Server Management Studio if you have MS SQL Server to play with.
After your t2 statement add a join condition on t1.joinfield = t2.joinfield
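In other words, something along these lines ("joinfield" is just a placeholder, not a real column from your schema):
SELECT ...
FROM
    (SELECT ...) T1
JOIN
    (SELECT ...) T2
    ON T1.joinfield = T2.joinfield
GROUP BY ...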
The issue was with fragmented data. After the data was defragmented the query started running within reasonable time constraints.
A JOIN without a predicate is a Cartesian product: every row from one input is paired with every row from the other. It is slow because the inner queries are fine when querying their separate tables, but once they hit a predicate-less join it becomes a Cartesian product and is far more expensive to manage. This happens at the outer select statement.
Have a look at INNER JOINs as Tetraneutron recommended.