Speed up Oracle multi-table query with ORDER BY clause - sql

I have the following three tables (there are actually many more fields, but this should give an idea of what I'm trying to achieve):
log (
eventId INTEGER,
objectId INTEGER,
PRIMARY KEY (eventId)
)
objects (
objectId INTEGER,
typeId INTEGER,
PRIMARY KEY (objectId, typeId)
)
statusBits (
typeId INTEGER,
bitNumber INTEGER
)
The log table contains a very large number of records (500,000+), while the other tables are quite small. I can join the tables using the following query:
SELECT l.eventId, o.typeId, s.bitNumber
FROM log l, objects o, statusBits s
WHERE (l.objectId = o.objectId) AND (o.typeId = s.typeId)
This query runs nice and fast. It also runs fast when I add an ORDER BY eventId clause at the end. However, when I add ORDER BY eventId, bitNumber (thus sorting by two fields rather than one) it becomes painfully slow.
How can I optimise my query so that it runs faster? I am running Oracle 10g XE, if that makes any difference.
UPDATE:
I've already tried creating an index on statusBits(bitNumber) but it doesn't seem to have much effect.

First of all, I'll refactor your query as follows:
SELECT L.eventId
,O.typeId
,S.bitNumber
FROM log L
INNER JOIN objects O ON O.objectId = L.objectId
INNER JOIN statusBits S ON S.typeId = O.typeId
This probably won't help your execution time, but the query is much more readable, and using explicit INNER JOIN syntax is a best practice.
Then, in order to optimise your execution time, the first solution that comes to mind is to create an index, but you've already tested that. It may help to try a concatenated index instead of a single-column index:
CREATE INDEX statusBits_typeId_bitNumber_ix ON statusBits (typeId, bitNumber);
Hope this will help.
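If you want to check whether the new index is actually being picked up, you can look at the execution plan (a minimal sketch using Oracle's built-in EXPLAIN PLAN and DBMS_XPLAN; the query is the one from the question):
EXPLAIN PLAN FOR
SELECT L.eventId, O.typeId, S.bitNumber
FROM log L
INNER JOIN objects O ON O.objectId = L.objectId
INNER JOIN statusBits S ON S.typeId = O.typeId
ORDER BY L.eventId, S.bitNumber;
-- Display the plan; a SORT ORDER BY step over the full result set is the expensive part you are trying to cheapen
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);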

Related

How to optimize a conditional select query that involves inner joins?

I have an SQL query with a lot of INNER JOINs between tables.
I know that joins can be used to optimize, but as you can see, this query already uses joins. I was thinking of adding GROUP BY to the statement, which might make it run better. Does selecting many columns from different tables make the query slow? If so, how can I optimize it? Below is my code:
SELECT /*+ PARALLEL(4) */
s.task_seq_num,
s.group_seq_num AS grp_seq_num,
g.source_type_cd,
d.doc_id,
d.doc_ref_id AS doc_ref_id,
dm.doc_priority_num AS doc_priority,
s.doc_seq_num AS doc_seq_num,
s.case_num AS case_num,
dm.doc_title_name AS doc_title,
s.task_status_cd AS task_status_cd,
d.received_dt AS received_dt,
nvl(b.first_name,d.first_name) AS first_name,
nvl(b.mid_name,d.mid_name) AS mid_name,
nvl(b.last_name,d.last_name) AS last_name,
tg.content_tag_cd AS content_tag_cd,
d.app_num AS app_num,
e.head_of_household_sw AS head_of_household_sw,
f.user_id AS user_id
FROM
dm_task_status s
INNER JOIN dm_task_tag tg ON s.task_seq_num = tg.task_seq_num
INNER JOIN dm_doc_group g ON g.group_seq_num = s.group_seq_num
INNER JOIN dm_doc d ON d.doc_seq_num = s.doc_seq_num
INNER JOIN dm_doc_master dm ON dm.doc_ref_id = d.doc_ref_id
LEFT JOIN mo_employees f ON f.emp_id = s.emp_id
LEFT JOIN ( dc_case_individual e
INNER JOIN dc_indv b ON b.indv_id = e.indv_id
AND e.head_of_household_sw = 'Y' ) ON e.case_num = s.case_num
WHERE
s.office_num =38
AND s.eff_end_tms IS NULL
AND d.delete_sw IS NULL
ORDER BY s.group_seq_num ASC;
Any ideas are appreciated
First,
AND d.delete_sw IS NULL
should be up in the join and not in the WHERE clause. I don't THINK that matters for performance, but don't trust me on that one. I do know that it's good policy to have the final WHERE clause contain only conditions on the primary table itself.
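In other words, only that one predicate moves; everything else stays as posted (a fragment for illustration, not the full statement):
FROM
dm_task_status s
INNER JOIN dm_task_tag tg ON s.task_seq_num = tg.task_seq_num
INNER JOIN dm_doc_group g ON g.group_seq_num = s.group_seq_num
INNER JOIN dm_doc d ON d.doc_seq_num = s.doc_seq_num
AND d.delete_sw IS NULL -- moved up from the WHERE clause
-- remaining joins unchanged
WHERE
s.office_num = 38
AND s.eff_end_tms IS NULL
ORDER BY s.group_seq_num ASC;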
Second, since we don't know the data itself or the structure of the data, I would say on EACH join, ensure that you're using a table index wherever possible to prevent full table scans. It seems like something that might be insulting to suggest, but I've made the mistake in the past many times without realizing that there was in fact a place where I was NOT using an indexed field to limit the data scanned.
For example, are all of the fields I list below part of the primary key of their respective tables? If so, are they the FULL primary key? If they are not the full primary key, is there any way that you can get the rest of the primary key values from the table(s) that you're "coming from"? SQL will try to do the best job it can to make the most efficient plan, but it can only go so far; we always have to try to ensure we're using indexed fields, and preferably primary keys, so the database doesn't have to work harder than it has to.
Fields in question:
dm_task_tag.task_seq_num
dm_doc_group.group_seq_num
dm_doc.doc_seq_num
dm_doc_master.doc_ref_id
mo_employees.emp_id
dc_case_individual.indv_id
dc_case_individual.case_num
I'm VERY suspicious of that second LEFT join, but I can't state much about that without knowing the tables. I'm also especially curious if the doc_ref_id is in fact the primary key of the dm_doc_master, or if you have a seq_num that you're missing...
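A quick way to check those fields (a sketch using Oracle's data dictionary view, since the query looks like Oracle; adjust the table list to match your schema) is to list the indexed columns on each joined table:
-- Which columns on the joined tables are indexed, and in what position?
SELECT table_name, index_name, column_name, column_position
FROM user_ind_columns
WHERE table_name IN ('DM_TASK_TAG', 'DM_DOC_GROUP', 'DM_DOC', 'DM_DOC_MASTER',
'MO_EMPLOYEES', 'DC_CASE_INDIVIDUAL', 'DC_INDV')
ORDER BY table_name, index_name, column_position;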

Why does breaking out this correlated subquery vastly improve performance?

I tried running this query against two tables which were very different sizes - #temp was about 15,000 rows, and Member is about 70,000,000, about 68,000,000 of which do not have the ID 307.
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id as varchar) NOT IN (
SELECT IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307)
This query ran for 18 hours, before I killed it and tried something else, which was:
SELECT IndividualID
INTO #source
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS VARCHAR) NOT IN (
SELECT IndividualID
FROM #source)
And this ran for less than a second before giving me a result.
I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code, asking for the exact same data as in the second attempt, these would be roughly equivalent.
But that's obviously wrong. I can't look at the execution plan for my original query to see what SQL Server is trying to do. So can someone kindly explain why splitting the data out into a temp table is so much faster?
EDIT: Table schemas and indexes
The #temp table has two columns, Individual_ID int and Source_Code varchar(50)
Member and Person are more complex. They have 29 and 13 columns respectively, so I don't really want to post them all in full. PersonID is an int and is the PK on Person and an FK on Member. IndividualID is a column on Person - this is not clear in the query as written.
I tried using a LEFT JOIN instead of NOT IN before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
I also added an index on #source, just like on the original table, so the performance impact should be identical.
First, your query has two faux pas that really stick out. You are converting to varchar(), but you do not include a length argument. This should not be allowed! The default length varies by context and you need to be explicit.
Second, you are matching two keys in different tables and they seemingly have different types. Foreign key references should always have the same type. This can have a very big impact on performance. If you are dealing with tables that have millions of rows, then you need to pay some attention to the data structure.
To understand the difference in performance, you need to understand execution plans. The two queries have very different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm. The second version is using a more sophisticated algorithm. In your case, this would be due to the ability of SQL Server to maintain statistics on tables. So, instantiating the intermediate results actually helps the optimizer produce a better query plan.
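(Incidentally, if you want to see which plan the optimizer would choose for the slow version without waiting hours, you can request the estimated plan only. A minimal sketch using standard SQL Server options, run as separate batches in SSMS; the varchar length is added here per the point above:)
SET SHOWPLAN_XML ON;
GO
-- The statement below is compiled and its plan returned, but it is not executed
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS varchar(20)) NOT IN (
SELECT IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307);
GO
SET SHOWPLAN_XML OFF;
GO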
The subject of how best to write this logic has been investigated a lot. Here is a very good discussion on the subject by Aaron Bertrand.
I do agree with Aaron on the preference for not exists in this case:
SELECT COUNT(*)
FROM #temp t
WHERE NOT EXISTS (SELECT 1
FROM Member m
JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307 AND p.IndividualID = t.individual_id
);
However, I don't know if this will have better performance in this particular case.
This line is probably what kills the first query
WHERE CAST(individual_id as varchar) NOT IN
My guess would be that this forces a table scan rather than using any indexes.
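If the two ID columns can be compared without the conversion, dropping the CAST keeps the predicate sargable. A sketch only; it assumes individual_id and IndividualID have compatible types and that IndividualID is never NULL, neither of which the question confirms:
SELECT COUNT(*)
FROM #temp t
WHERE t.individual_id NOT IN (
SELECT p.IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE m.CompanyID <> 307);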

SubQuery vs TempTable before Merge

I have a complex query that I want to use as the Source of a Merge into a table. This will be executed over millions of rows. Currently I am trying to apply constraints to the data by inserting it into a temp table before the merge.
The operations are:
Filter out duplicate data.
Join some tables to pull in additional data
Insert into the temp table.
Here is the query.
-- Get all Orders that aren't in the system
WITH Orders AS
(
SELECT *
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact ofct
ON ofct.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND ofct.DataSetId = o.DataSetId
AND ofct.IsDelete = 0
)
)
INSERT INTO #VendorBOrders
(
CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,BillingProviderId
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,RelatedOrderId
,DataSetId
)
SELECT
vc.CustomerId
,OrderId
,OrderTypeId
,TypeCode
,LineNumber
,FromDate
,ThruDate
,LineFromDate
,LineThruDate
,PlaceOfService
,RevenueCode
,bp.Id
,Cost
,AdjustmentTypeCode
,PaymentDenialCode
,EffectiveDate
,IDRLoadDate
,ro.Id
,o.DataSetId
FROM
Orders o
-- Join related orders to match orders sharing same instance
JOIN Maps.VendorBRelatedOrder ro
ON ro.OrderControlNumber = o.OrderControlNumber
AND ro.EquitableCustomerId = o.EquitableCustomerId
AND ro.DataSetId = o.DataSetId
JOIN BillingProvider bp
ON bp.ProviderNPI = o.ProviderNPI
-- Join on customers and fail if the customer doesn't exist
LEFT OUTER JOIN [Maps].VendorBCustomer vc
ON vc.ExtenalCustomerId = o.ExtenalCustomerId
AND vc.VendorId = o.VendorId;
I am wondering if there is anything I can do to optimize it for time. I have tried using the DB Engine Tuner, but this query takes 100x more CPU Time than the other queries I am running. Is there anything else that I can look into or can the query not be improved further?
CTE is just syntax
That CTE is evaluated (run) when it is referenced in that join
First just run it as a select statement (no insert)
If the select is slow then:
Move that CTE to a #TEMP so it is evaluated once and materialized
Put an index (PK if applicable) on the three join columns
If the select is not slow then it is insert time on #VendorBOrders
First, only create the PK and sort the insert by the PK so as not to fragment that clustered index
Then AFTER the insert is complete build any other necessary indexes
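A minimal sketch of the approach above (the table and column names follow the query; the indexed columns are an assumption based on the joins that come after the CTE):
-- Materialize the CTE once instead of evaluating it inside the big INSERT ... SELECT
SELECT *
INTO #Orders
FROM [Staging].Orders o
WHERE NOT EXISTS
(
SELECT 1
FROM Maps.VendorBOrders vbo
JOIN OrderFact ofct
ON ofct.Id = vbo.OrderFactId
AND InternalOrderId = o.InternalOrderId
AND ofct.DataSetId = o.DataSetId
AND ofct.IsDelete = 0
);
-- Index the columns used by the later joins before reusing the data
CREATE INDEX IX_Orders_Join ON #Orders (OrderControlNumber, EquitableCustomerId, DataSetId);
-- Then reference #Orders in place of the Orders CTE in the INSERT ... SELECT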
Generally when I do speed testing, I check the individual parts of the SQL to see where the problem lies. Turn on the 'Execution plan' and see where a lot of the time is going. Also, if you want the quick and dirty check, highlight just your CTE and run that. Is it fast? Yes? Move on.
I have at times found that a single missing index throws off a whole complex set of joins, merely because the database has to do one part of something large the hard way; finding that piece fixes it.
Another idea: if you have a fast tempdb on your production environment or the like, dump your CTE to a temp table as well. Put an index on that and see if it speeds things up. Sometimes CTEs, table variables, and temp tables lose some performance at joins. I have found that creating an index on a partial object will improve performance at times, but you are also putting more load on tempdb to do this, so keep that in mind.

sql, query optimisation with an inner join?

I'm trying to optimise my query; it has an inner join and a coalesce.
The join table is simply a table with one integer field, and I've added a unique key to it.
For my where clause I've created a key on the three fields.
But when I look at the plan it still says it's using a table scan.
Where am I going wrong?
Here's my query
select date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) as due
from billsndeposits a
inner join util_nums b on date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) <= coalesce(a.enddate, date('2013-02-26'))
where not (intervaltype = 'once' or interval = 0) and factid = 1
order by due, pid;
Most likely your JOIN expression cannot use any index, so it is evaluated by doing a full scan and computing date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) for every row.
BTW: That is a really weird join condition in itself. I suggest you find a better way to join billsndeposits to util_nums (if that is actually needed).
I think I understand what you are trying to achieve. But this kind of join is a recipe for slow performance. Even if you remove date computations and the coalesce (i.e. compare one date against another), it will still be slow (compared to integer joins) even with an index. And because you are creating new dates on the fly you cannot index them.
I suggest creating a temp table with two columns: (1) pid (or whatever id you use in billsndeposits) and (2) recurrence_dt.
Populate the new table using this query:
INSERT INTO TEMP
SELECT PID, date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype)
FROM billsndeposits a, util_nums b;
Then create an index on the recurrence_dt column and update the statistics. Now your select statement can look like this:
SELECT recurrence_dt
FROM temp t, billsndeposits a
WHERE t.pid = a.pid
AND recurrence_dt <= coalesce(a.enddate, date('2013-02-26'))
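For reference, the index-and-statistics step mentioned above might look like this (a sketch; the table name temp follows the INSERT above, and ANALYZE is how SQLite refreshes optimizer statistics, which the date() modifier syntax suggests you are using):
CREATE INDEX idx_temp_recurrence_dt ON temp (recurrence_dt);
ANALYZE; -- refresh optimizer statistics so the new index gets considered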
You can add an exp_ts column on this new table, and expire temporary data afterwards.
I know this adds more work to your original query, but this is a guaranteed performance improvement, and should fit naturally in a script that runs frequently.
Regards,
Edit
Another thing I would do is give enddate a default value of date('2013-02-26'), unless that would affect other code and/or not make business sense. That way you don't have to work with coalesce.

SQL Server search filter and order by performance issues

We have a table-valued function that returns a list of people you may access, and we have a relation between a search and a person called a search result.
What we want to do is select all the people from the search and present them.
The query looks like this
SELECT qm.PersonID, p.FullName
FROM QueryMembership qm
INNER JOIN dbo.GetPersonAccess(1) ON GetPersonAccess.PersonID = qm.PersonID
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234
There are only 25 rows with QueryID=1234 but there are almost 5 million rows total in the QueryMembership table. The person table has about 40K people in it.
QueryID is not a PK, but it is an index. The query plan tells me 97% of the total cost is spent doing a "Key Lookup" with the seek predicate
QueryMembershipID = Scalar Operator (QueryMembership.QueryMembershipID as QM.QueryMembershipID)
Why is the PK in there when it's not used in the query at all? and why is it taking so long time?
The number of people totals 25; with the index, this should be a scan of all the QueryMembership rows that have QueryID=1234 and then a JOIN on the 25 people that exist in the table-valued function, which btw only has to be evaluated once and completes in less than 1 second.
If you want to avoid the "key lookup", use a covering index:
create index ix_QueryMembership_NameHere on QueryMembership (QueryID)
include (PersonID);
Add any other columns that you are going to select to the INCLUDE list.
As for why the key lookup on the PK is working so slowly, try DBCC FREEPROCCACHE, ALTER INDEX ALL ON QueryMembership REBUILD, or ALTER INDEX ALL ON QueryMembership REORGANIZE.
This may help if your PK's index is disabled or the cache is keeping the wrong plan.
You should define indexes on the tables you query. In particular on columns referenced in the WHERE and ORDER BY clauses.
Use the Database Tuning Advisor to see what SQL Server recommends.
For specifics, of course you would need to post your query and table design.
But I have to make a couple of points here:
You've already jumped to the conclusion that the slowness is a result of the ORDER BY clause. I doubt it. The real test is whether or not removing the ORDER BY speeds up the query, which you haven't done. Dollars to donuts, it won't make a difference.
You only get the "log n" in your big-O claim when the optimizer actually chooses to use the index you defined. That may not be happening because your index may not be selective enough. The thing that makes your temp table solution faster than the optimizer's solution is that you know something about the subset of data being returned that the optimizer does not (specifically, that it is a really small subset of data). If your indexes are not selective enough for your query, the optimizer can't always reasonably assume this, and it will choose a plan that avoids what it thinks could be a worst-case scenario of tons of index lookups, followed by tons of seeks and then a big sort. Oftentimes, it chooses to scan and hash instead. So what you did with the temp table is often a way to solve this problem. Often you can narrow down your indexes or create an indexed view on the subset of data you want to work against. It all depends on the specifics of your query.
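One concrete form of "narrowing down your indexes" here is a filtered index (a sketch only, assuming SQL Server 2008 or later; the names follow the question, and whether hard-coding the QueryID filter is sensible depends on how many different QueryID values you actually run this for):
-- Covers the join column for just the small, frequently queried slice
CREATE INDEX IX_QueryMembership_Query1234
ON QueryMembership (QueryID, PersonID)
WHERE QueryID = 1234;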
You need indexes on your WHERE and ORDER BY clauses. I am not an expert, but I would bet it is doing a table scan for each row. Since your speed issue is resolved by removing the INNER JOIN or the ORDER BY, I bet the issue is specifically with the join. I bet it is doing the table scan on your joined table because of the sort. By putting an index on the columns in your WHERE clause first, you will be able to see if that is in fact the case.
Have you tried restructuring the query into a CTE to separate the TVF call? So, something like:
With QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QMP.PersonId
EDIT: Btw, I'm assuming that there is an index on PersonId in both the QueryMembership and the Person table.
EDIT: What about two common table expressions, like so:
With
QueryMembershipPerson As
(
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
)
, PersonAccess As
(
Select PersonId
From dbo.GetPersonAccess(1)
)
Select PersonId, Fullname
From QueryMembershipPerson As QMP
Join PersonAccess As PA
On PA.PersonId = QMP.PersonId
Yet another solution would be a derived table like so:
Select ...
From (
Select QM.PersonId, P.Fullname
From QueryMembership As qm
Join Person As P
On P.PersonId = QM.PersonId
Where QM.QueryId = 1234
) As QueryMembershipPerson
Join dbo.GetPersonAccess(1) As PA
On PA.PersonId = QueryMembershipPerson.PersonId
If pushing some of the query into a temp table and then joining on that works, I'd be surprised that you couldn't combine that concept into a CTE or a query with a derived table.
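For what it's worth, the temp table version of that idea might look like this (a sketch only; it assumes the same tables and TVF from the question and that FullName is the only extra column you need):
-- Push the cheap, selective part into a temp table first
SELECT qm.PersonID, p.FullName
INTO #QueryMembershipPerson
FROM QueryMembership qm
INNER JOIN Person p ON p.PersonID = qm.PersonID
WHERE qm.QueryID = 1234;
-- Then join the 25-ish rows against the table-valued function
SELECT t.PersonID, t.FullName
FROM #QueryMembershipPerson t
INNER JOIN dbo.GetPersonAccess(1) pa ON pa.PersonID = t.PersonID;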