Hint for SQL query joining tables with a million records

I have the query below, which on average takes more than 5 seconds to fetch data in a transaction that is triggered innumerable times by the application. I am looking for a hint that could help reduce the time taken every time it is fired. My constraints are that I cannot add any indexes or change any application settings for this query, so Oracle hints or restructuring the query are my only options. Please find my query below.
SELECT SUM(c.cash_flow_amount)
FROM CM_CONTRACT_DETAIL a, CM_CONTRACT b, CM_CONTRACT_CASHFLOW c
WHERE a.country_code = Ip_country_code
AND a.company_code = ip_company_code
AND a.dealer_bp_id = ip_bp_id
AND a.contract_start_date >= ip_start_date
AND a.contract_start_date <= ip_end_date
AND a.version_number = b.current_version
AND a.status_code IN ('00','10')
AND a.country_code = b.country_code
AND a.company_code = b.company_code
AND a.contract_number = b.contract_number
AND a.country_code = c.country_code
AND a.company_code = c.company_code
AND a.contract_number = c.contract_number
AND a.version_number = c.version_number
AND c.cash_flow_type_code IN ('07','13');
The things to know about these tables are that they are all transactional tables whose data changes every day. Each holds between roughly 100,000 and 1,000,000 (1 to 10 lakh) records.
This is the explain plan currently on the query:
Operation Object Name Rows Bytes Cost Object Node In/Out PStart PStop
SELECT STATEMENT Hint=RULE
  SORT AGGREGATE
    TABLE ACCESS BY INDEX ROWID CM_CONTRACT_CASHFLOW
      NESTED LOOPS
        NESTED LOOPS
          TABLE ACCESS BY INDEX ROWID CM_CONTRACT_DETAIL
            INDEX RANGE SCAN XIF760CT_CONTRACT_DETAIL
          TABLE ACCESS BY INDEX ROWID CM_CONTRACT
            INDEX UNIQUE SCAN XPKCM_CONTRACT
        INDEX RANGE SCAN XPKCM_CONTRACT_CASHFLOW
Indexes on CM_CONTRACT_DETAIL:
XPKCM_CONTRACT_DETAIL is a composite unique index on country_code, company_code, contract_number and version_number
XIF760CT_CONTRACT_DETAIL is a non unique index on dealer_bp_id
Indexes on CM_CONTRACT:
XPKCM_CONTRACT is a composite unique index on country_code, company_code, contract_number
Indexes on CM_CONTRACT_CASHFLOW:
XPKCM_CONTRACT_CASHFLOW is a composite unique index on country_code, company_code, contract_number, version_number, supply_sequence_number, cash_flow_type_code and payment_date.
Could you please help me improve this query? Let me know if anything else about the tables is needed. Stats are not gathered on these tables either.

Your query plan says Hint=RULE. Why is that? Is this the standard setting in your DBMS? Why not make use of the cost-based optimizer? You can use /*+ CHOOSE */ for that. This may be all that's needed. (Why are there no stats on the tables, though?)
EDIT: The above was nonsense. By not gathering any statistics you prevent the optimizer from doing its work: it will always fall back to the good old rules, because it has no basis on which to calculate costs and find a better plan. It is strange that you voluntarily keep the DBMS from making your queries fast. You can of course use hints in your queries, but be careful to always re-check and adjust them when the table data changes significantly. Better to gather statistics and let the optimizer do this work.
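For reference, gathering statistics could look something like this (a sketch; the exact parameters depend on your environment, and you may need a DBA to schedule it):

BEGIN
  -- Gather statistics for each table involved; cascade => TRUE includes their indexes
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CM_CONTRACT_DETAIL', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CM_CONTRACT', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'CM_CONTRACT_CASHFLOW', cascade => TRUE);
END;
/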
As for useful hints, my feeling says: with that many criteria on CM_CONTRACT_DETAIL, that should be the driving table. You can force that with /*+ LEADING(a) */. Maybe even use a full table scan on that table with /*+ FULL(a) */, which you can still speed up with parallel execution: /*+ PARALLEL(a,4) */.
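Combined, the hinted query might look like this (a sketch only; whether FULL and PARALLEL actually help depends on your data, so test each hint separately against real volumes):

SELECT /*+ LEADING(a) FULL(a) PARALLEL(a,4) */ SUM(c.cash_flow_amount)
FROM CM_CONTRACT_DETAIL a, CM_CONTRACT b, CM_CONTRACT_CASHFLOW c
WHERE ... -- the rest of the predicates stay unchanged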
Good luck :-)

Related

What Columns Should I Index to Improve Performance in SQL

In my query I have a temp table of keys that will be joined to multiple tables later on.
I want to create an index on my temp table to improve performance, because the query takes a couple of minutes to run.
SELECT DISTINCT
k.Id, k.Name, a.Address, a.City, a.State, a.Zip, p.Phone, p.Fax, ...
FROM
#tempKeys k
INNER JOIN
dbo.Address a ON a.AddrId = k.AddrId
INNER JOIN
dbo.Phone p ON p.PhoneId = a.PhoneId
...
My question is: should I create a separate index for each column being joined
CREATE NONCLUSTERED INDEX ... (AddrId ASC)
CREATE NONCLUSTERED INDEX ... (PhoneId ASC)
or can I create one index that includes all the joined columns?
CREATE NONCLUSTERED INDEX ... (AddrId ASC, PhoneId ASC)
Also, are there other ways I can improve performance in this scenario?
As @DaleK says, this is a complex topic. In general, though, an index is only usable when all of its leading columns are used. Your suggested composite index will likely not work: the indexed value of PhoneId cannot be used independently of AddrId. (The index would be fine for AddrId on its own.)
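To illustrate the leading-column rule (a hypothetical sketch; it assumes #tempKeys carries both AddrId and PhoneId, as the composite index in the question implies):

-- Suppose this composite index exists:
CREATE NONCLUSTERED INDEX IX_tempKeys ON #tempKeys (AddrId ASC, PhoneId ASC);

-- A predicate on the leading column can seek the index:
SELECT * FROM #tempKeys WHERE AddrId = 42;

-- A predicate on PhoneId alone cannot seek it; at best the engine scans the whole index:
SELECT * FROM #tempKeys WHERE PhoneId = 7;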
The best approach is to have a test database with representative data and volumes, then check the query plan and the index suggestions. Don't forget that every index you add slows down inserts.
Another factor is that without a WHERE clause, or with larger result sets (I think over 5-10% of the table), the optimiser will often decide it's faster not to use indexes anyway.
And I'd rethink using temp tables at all, let alone indexed ones. They're rarely necessary: a single, large query usually runs faster (and, depending on your isolation model, has better data integrity) than one split into chunks.

Does the index still get used in these situations?

I have some questions about indexes.
First, if I use an indexed column in a WITH clause, does this column still work as an indexed column in the main query?
For example,
WITH TEST AS (
SELECT EMP_ID
FROM EMP_MASTER
)
SELECT *
FROM TEST
WHERE EMP_ID >= '2000'
'EMP_ID' in the 'EMP_MASTER' table is the PK, and the index for EMP_MASTER consists of EMP_ID.
In this situation, does an 'Index Scan' happen in the main query?
Second, if I join two tables and then use the two indexed columns from each table in WHERE, does an 'Index Scan' happen?
For example,
SELECT *
FROM A, B
WHERE A.COL1 = B.COL1 AND
A.COL1 > 200 AND
B.COL1 > 100
The index for table A consists of 'COL1' and the index for table B consists of 'COL1'.
In this situation, does an 'Index Scan' happen on each table before the join?
Any advice would be much appreciated.
First, SQL is a declarative language, not a procedural language. That is, a SQL query describes the result set, not the specific processing. The SQL engine uses the optimizer to determine the best execution plan to generate the result set.
Second, Oracle has a reasonable optimizer.
Hence, Oracle will probably use the indexes in these situations. However, you should look at the execution plan to see what Oracle really does.
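For example, you can check the plan yourself with standard Oracle commands (using the table from the question):

EXPLAIN PLAN FOR
WITH TEST AS (
  SELECT EMP_ID FROM EMP_MASTER
)
SELECT * FROM TEST WHERE EMP_ID >= '2000';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);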
First, if I use an indexed column in a WITH clause, does this column still work as an indexed column in the main query?
Yes. A CTE (the WITH part) is a query just like any other, and if a query references a physical table column covered by an index, the engine will use the index if it thinks that's a good idea.
In this situation, does an 'Index Scan' happen in the main query?
We can't tell from the limited information you've provided. An engine will scan or seek an index based on its heuristics about the distribution of data in the index (e.g. STATISTICS objects) and other information it has, such as cached query execution plans.
In this situation, does an 'Index Scan' happen on each table before the join?
As it's a range query, it probably would make sense for the engine to use an index scan rather than an index seek, but it could also do a table scan and ignore the index if the index isn't selective enough. Also factor in query flags that force reading uncommitted data (e.g. for performance and to avoid locking).

SQL Join with GROUP BY query optimisation

I'm trying to optimise the following query.
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category
AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;
Categories is a small table with 20 records.
Items is a large table with over 50,000 records.
Bids is an even larger table with over 600,000 records.
I have an index on
Categories(name, id), Items(category), and Bids(item_id, id).
The PRIMARY KEY for each table is: Items(id), Categories(id), Bids(id)
Is there any way to optimise this query? Much appreciated.
Without EXPLAIN (ANALYZE, BUFFERS) output, this is guesswork.
The query is so simple that nothing can be optimized in it.
Make sure that you have correct table statistics; check EXPLAIN (ANALYZE) to see if PostgreSQL's estimates are correct.
Increase shared_buffers so that the whole database fits into RAM (if you can).
Increase work_mem so that all hashes and sorts are performed in memory.
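For reference, the diagnostics mentioned above could be captured like this (standard PostgreSQL; the work_mem value is illustrative, not a recommendation):

-- Show the actual plan, row-count estimates and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id), COUNT(B.id)
FROM Categories C, Items I, Bids B
WHERE C.id = I.category AND I.id = B.item_id
GROUP BY C.name
ORDER BY 2 DESC, 3 DESC;

-- Refresh the planner's statistics
ANALYZE Categories; ANALYZE Items; ANALYZE Bids;

-- Give this session more memory for hashes and sorts
SET work_mem = '256MB';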
Not really; you are scanning all records anyway.
How many of the Items records are hit by the data from Bids? I would imagine all tables are full-scanned and hash-joined, with the indexes disregarded.
Your query is really boilerplate, and I am sure that with the size of your tables, any server that isn't on really low-end hardware can run it in a heartbeat. But you can always make things better. Here's a list of optimizations that should, theoretically, boost your query's performance:
Theoretically speaking, your biggest inefficiency here is that you are computing the cross product of your tables instead of joining them explicitly. You can rewrite the query with joins like:
...
FROM Items I
INNER JOIN Bids B
ON I.id = B.item_id
INNER JOIN Categories C
ON C.id = I.category
...
If we are considering everything performance-wise, your index on the category column of the Items table is inefficient, since it maps only 20 distinct values to 50K rows; you may even get better performance without this index. However, from a practical point of view there is a lot else to consider here, so this may not actually be a big deal.
You have no explicit index on the id column of the Items table, and having an index on that column would speed up your first join. (However, PostgreSQL creates a default index on primary key columns, so this is not a big deal either.)
Also, adding EXPLAIN ANALYZE to the beginning of your query shows you the plan the PostgreSQL query planner uses to run your queries. If you know a thing or two about query plans, I suggest you take a look at those results too to find any remaining inefficiencies.

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an IN clause on an id and then also select other columns, the IN is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SQL Fiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows).
Otherwise the query plans shown in SQL Fiddle are identical; the only difference is that the second query's RID Lookup in SQL Fiddle is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to help determine why it uses a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Details column, the performance of both queries is very close in production. It is only after adding other columns like Details that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used a RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details, though, and I can experiment with that more, but I would like to understand what is going on now before I start shooting in the dark with random indexes.
What if you changed your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID Lookup nor a table scan would be needed, since your query could be satisfied from the index itself.
The differences in the query plans depend on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the Details column is included. This is an almost sure sign that either the Details column is not part of an index, or, if it is, the data in that column is mostly unique, such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than a perhaps fragmented read of an index.
To influence the path chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make index access more efficient, but this should be done with care, as it can adversely affect other queries as well as degrade insert performance.
The other important thing you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency appropriate to the rate of change of the frequency distribution in the table data.
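For example (standard T-SQL; how often to run it and whether to use FULLSCAN depends on your environment):

-- Rebuild the statistics for the table from a full scan of its rows
UPDATE STATISTICS SupportContacts WITH FULLSCAN;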
If it's true that 99% of the rows would be omitted when the query is performed via the relevant index + RID Lookup, then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignId IN (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics: you could use a table hint. If you know which index your plan should be using (one containing the Id and ForeignId columns), stick that in your query as a hint and force the optimiser to use that index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
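For example (a sketch; the index name here is hypothetical, so substitute whichever of your indexes covers ForeignId and Id):

SELECT Id, ForeignId, Details
FROM SupportContacts WITH (INDEX(IX_SupportContacts_ForeignId_Id))
WHERE ForeignId IN (1, 2, 3, 5);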
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

How should tables be indexed to optimise this Oracle SELECT query?

I've got the following query in Oracle10g:
select *
from DATA_TABLE DT,
LOOKUP_TABLE_A LTA,
LOOKUP_TABLE_B LTB
where DT.COL_A = LTA.COL_A (+)
and DT.COL_B = LTA.COL_B (+)
and LTA.COL_C = LTB.COL_C
and LTA.COL_B = LTB.COL_B
and ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
and DT.CREATED_DATE between :startDate and :endDate
And was wondering whether you've got any hints for optimising the query.
Currently I've got the following indices:
IDX1 on DATA_TABLE (REF_TXT, CREATED_DATE)
IDX2 on DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
LOOKUP_A_PK on LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_A_IDX1 on LOOKUP_TABLE_A (COL_C, COL_B)
LOOKUP_B_PK on LOOKUP_TABLE_B (COL_C, COL_B)
Note, the LOOKUP tables are very small (<200 rows).
EDIT:
Explain plan:
Query Plan
SELECT STATEMENT Cost = 8
  FILTER
    NESTED LOOPS
      NESTED LOOPS
        TABLE ACCESS BY INDEX ROWID DATA_TABLE
          BITMAP CONVERSION TO ROWIDS
            BITMAP OR
              BITMAP CONVERSION FROM ROWIDS
                SORT ORDER BY
                  INDEX RANGE SCAN IDX1
              BITMAP CONVERSION FROM ROWIDS
                SORT ORDER BY
                  INDEX RANGE SCAN IDX2
        TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_A
          INDEX UNIQUE SCAN LOOKUP_A_PK
      TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_B
        INDEX UNIQUE SCAN LOOKUP_B_PK
EDIT2:
The data looks like this:
There will be tens of thousands of distinct REF_TXT values, with 10-100s of CREATED_DATEs for each. ALT_REF_TXT will mostly be NULL, but there will be 100s-1000s of rows where it differs from REF_TXT.
EDIT3: Fixed the description of what ALT_REF_TXT actually contains.
The execution plan you're currently getting looks pretty good. There's no obvious improvement to be made.
As others have noted, you have some outer join indicators, but then you essentially prevent the outer join by requiring equality on other columns in the two outer tables. As you can see from the execution plan, no outer join is happening. If you don't want an outer join, remove the (+) operators; they're just confusing the issue. If you do want an outer join, rewrite the query as shown by @Dems.
If you're unhappy with the current performance, I would suggest running the query with the gather_plan_statistics hint, then using DBMS_XPLAN.DISPLAY_CURSOR(?,?,'ALLSTATS LAST') to view the actual execution statistics. This will show the elapsed time attributed to each step in the execution plan.
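For example (standard Oracle; run the query once with the hint, then display the cursor's statistics):

SELECT /*+ gather_plan_statistics */ *
FROM DATA_TABLE DT, LOOKUP_TABLE_A LTA, LOOKUP_TABLE_B LTB
WHERE ... -- same predicates and binds as in the question

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));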
You might get some benefit from converting one or both of the lookup tables into index-organized tables.
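An index-organized lookup table would look something like this (a sketch; the column types are hypothetical):

CREATE TABLE lookup_table_a (
  col_a NUMBER,
  col_b NUMBER,
  col_c NUMBER,
  CONSTRAINT lookup_a_pk PRIMARY KEY (col_a, col_b)
) ORGANIZATION INDEX;  -- rows are stored in the primary key index itself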
Your 2 index range scans on IDX1 and IDX2 will produce at most 100 rows, so your BITMAP CONVERSION TO ROWIDS will produce at most 200 rows. And from there on, it's only indexed access by rowids, leading to a likely sub-second execution. So are you really experiencing performance problems? If so, how long does it take exactly?
If you are experiencing performance problems, then please follow Dave Costa's advice and get the real plan, because in that case it's likely that you are getting a different plan at runtime, possibly due to certain bind variable values or different optimizer environment settings.
Regards,
Rob.
This is one of those cases where it makes very little sense to try to optimize the DBMS performance without knowing what your data means.
Do you have many, many distinct CREATED_DATE values and a few rows in your DT for each date? If so, you want an index on CREATED_DATE, as it will be the primary way for the DBMS to reject rows it doesn't want to process.
On the other hand, do you have only a handful of dates, and many distinct values of REF_TXT or ALT_REF_TXT? In that case you probably have the correct compound index choices.
The presence of OR in your query complicates things greatly, and throws most guesswork out the window. You must look at EXPLAIN PLAN to see what's going on.
If you have tens of millions of distinct REF_TXT and ALT_REF_TXT values, you may want to consider denormalizing this schema.
Edit.
Thanks for the additional info. Your explain plan contains no smoking guns that I can see. Some things to try next if you're not happy with the performance yet:
Flip the order of the columns in the compound indexes on your data table. Maybe that will get you simpler index range scans instead of all the bitmap monkey business.
Exchange your SELECT * for the names of the columns you actually need in the query result set. That's good programming practice in any case, and it MAY allow the optimizer to avoid some work.
If things are still too slow, try recasting this as a UNION of two queries rather than using OR. That MAY allow the ALT_REF_TXT part of your query, which is made a little more complex by all the NULL values in that column, to be optimized separately.
This may be the query you want, using more up-to-date syntax
(and without inner joins breaking outer joins):
select
*
from
DATA_TABLE DT
left outer join
(
LOOKUP_TABLE_A LTA
inner join
LOOKUP_TABLE_B LTB
on LTA.COL_C = LTB.COL_C
and LTA.COL_B = LTB.COL_B
)
on DT.COL_A = LTA.COL_A
and DT.COL_B = LTA.COL_B
where
( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
and DT.CREATED_DATE between :startDate and :endDate
The indexes that I'd have are...
LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_TABLE_B (COL_B, COL_C)
DATA_TABLE (REF_TXT, CREATED_DATE)
DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
Note: the first condition in the WHERE clause contains an OR that will likely prevent the use of indexes. In such cases I have seen performance benefits from UNIONing two queries together...
<your query>
where
DT.REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
UNION
<your query>
where
DT.ALT_REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
Provide the output of this query with "set autot trace" so we can see how many blocks it is pulling. The explain plan looks good; it should be very fast. If you need more, denormalize the lookup table info into DT. That violates third normal form, but it will make your query faster by eliminating the joins. In a situation where milliseconds count, everything is in buffers, and you need this query to run 1000 times per second, it can help by driving down the number of blocks examined per row. It is the ultimate way to boost read performance, but it complicates your app (and ruins your lovely ER diagram).
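For reference, in SQL*Plus that would be (standard commands; "autot" is the usual abbreviation of AUTOTRACE):

SET AUTOTRACE TRACEONLY
-- run the query, then look at "consistent gets" in the statistics section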