I'm having issues optimizing a complex view in MS SQL

I'm attempting to create a view that joins hierarchical data to a normalized dataset in SQL using a profileID field.
The issue I'm having is that my company's hierarchy data is, for lack of a better term, gapped: there are startdate and enddate fields that need to be considered in the join.
Currently I'm working with something like the following -
SELECT *
FROM dbo.datatable dt
INNER JOIN dbo.hierarchy h
    ON dt.profileid = h.profileid
    AND dt.date >= h.startdate
    AND dt.date < h.enddate
I've got a clustered index on dt that includes date and profileid, and a clustered index on h that includes startdate, enddate, and profileid. I've also added a couple of indexes that SSMS suggested, which include many of the data fields.
I cannot change the format of the hierarchy, but the view is absurdly slow when I try to pull a large number of days in a SQL query. This dataset is end-user facing, so it has to be fast and usable.
Any tips are greatly appreciated!

In both indexes, put profileid first, because it is the column tested with =.
Alas, that is probably the only optimization you can make. A range test over the dates degenerates into using only one of the two inequality conditions, which leads to scanning up to half the table, or, after the profileid test, up to half the rows for that profileid.
If the start-end ranges never overlap, there may be a trick to make things faster, but it will involve changes to the schema and code; a sketch of the idea follows.
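For what it's worth, here is a sketch of the kind of trick being hinted at. It assumes that, for any given profileid, the hierarchy ranges never overlap, and that dbo.hierarchy has (or can get) an index on (profileid, startdate); the APPLY rewrite itself is an illustration, not something from the question.
SELECT dt.*, hx.*
FROM dbo.datatable AS dt
CROSS APPLY (
    SELECT TOP (1) h.*
    FROM dbo.hierarchy AS h
    WHERE h.profileid = dt.profileid
      AND h.startdate <= dt.date       -- single seek on (profileid, startdate)
    ORDER BY h.startdate DESC          -- latest range that has started
) AS hx
WHERE dt.date < hx.enddate;            -- discard rows that fall into a gap
For each row in dt this does one index seek plus a TOP (1) instead of an open-ended range scan, which is the kind of schema-and-code change the answer alludes to.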

Related

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer decides that the query will return a large fraction of the records in your table, it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records fall in 2017 out of all records in the table? That calculation gives you the cardinality of your query, which in turn tells you whether an index is likely to be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
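For example, a function-based index matching that predicate might look like the following; the table name is a placeholder, since the question never shows it, and whether you can create an index at all is addressed below.
-- Hypothetical function-based index on the year part of the date column;
-- "events" stands in for the real (undisclosed) table name.
CREATE INDEX ix_events_perf_year
    ON events (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));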
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
Applying a function to the column can also cause slowness given the number of records involved. I'm not sure whether a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, try adding one and populating it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also consider table partitioning for data this large (a rough sketch follows the link). For starters, see the link below:
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
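For illustration, a range-partitioned version of such a table might be declared like this; all names are made up, and as noted above you would need the vendor's cooperation to restructure their table.
-- Sketch only: yearly range partitions on the date column.
CREATE TABLE events_by_year (
    event_code          NUMBER,
    performed_date_time DATE
)
PARTITION BY RANGE (performed_date_time) (
    PARTITION p2016 VALUES LESS THAN (DATE '2017-01-01'),
    PARTITION p2017 VALUES LESS THAN (DATE '2018-01-01'),
    PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
A query filtered on PERFORMED_DATE_TIME would then only touch the p2017 partition (partition pruning).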
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require creating a table in another database that holds all the date values plus the PK of the original table. You would query that shadow to find the relevant PK values and then JOIN them back to your original table to fetch whatever you need. The biggest problem with this is keeping the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)', then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
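A very rough sketch of that shadow idea, assuming the vendor table has a single-column primary key and is reachable over a database link; every name here (the link, the tables, the PK column) is invented for illustration.
-- One-time (heavy) load of the shadow table in your own database.
CREATE TABLE event_shadow AS
SELECT event_pk, performed_date_time
FROM   vendor_events@vendor_link;

CREATE INDEX ix_event_shadow_date ON event_shadow (performed_date_time);

-- Day-to-day querying: find the keys locally, then join back to the vendor table.
SELECT v.*
FROM   event_shadow s
JOIN   vendor_events@vendor_link v ON v.event_pk = s.event_pk
WHERE  s.performed_date_time >= DATE '2017-01-01'
AND    s.performed_date_time <  DATE '2018-01-01';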
Maybe this could be useful, because you avoid functions (a cause of context switching), and if you have an index on your date field, it could be used:
with dt as (
    select
        to_date('01/01/2017', 'DD/MM/YYYY') as d1,
        to_date('31/01/2017', 'DD/MM/YYYY') as d2
    from dual
),
dates as (
    -- one row per calendar day between d1 and d2
    select dt.d1 + rownum - 1 as d
    from dt
    connect by dt.d1 + rownum - 1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
-- note: the equality join assumes PERFORMED_DATE_TIME carries no time-of-day component
Keep the column unmodified and move the date literals to the right-hand side of the comparison:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option for creating indexes in third-party databases is to script in the index and then, before any vendor upgrade, run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

How to improve select performance from huge tables

I currently have a process that is taking too long (an hour, give or take).
The process basically does this :
First, it left joins one table to a VIEW -
SELECT * FROM
STG_CRM, V_CRM
WHERE
STG_CRM.CRM_CASE_ID=V_CRM.CASE_ID(+)
The view DDL:
create or replace view stg_admin.v_crm as
select t.case_id
from crm_case t, dim_crm x
where t.case_id=x.crm_case_id;
STG_CRM - 200k records - no indexes.
DIM_CRM - 90MIL records - indexed (crm_case_id - unique).
CRM_CASE - 200k records - no indexes.
Up to this point nothing is too heavy (about 2-3 minutes), but then there is a left join to another VIEW, which is currently the heaviest select (just select * from the view takes 10 minutes).
View DDL - I'm currently weighing two different queries:
select t.crm_case_id,s.customer_key
from stg_crm t, stg_scd s
where t.account_number=s.account_number
and t.case_create_date between s.start_date and s.end_date;
Or:
select t.crm_case_id,
(select min(s.customer_key) keep (dense_rank first order by s.end_date asc)
from stg_scd s
where t.account_number = s.account_number and
t.case_create_date <= s.end_date
) as customer_key
from stg_crm t
Table stg_scd - 500MIL records indexed (customer_key,start_date,end_date) - UNIQUE partitioned by end_date daily.
Currently both of these queries take a very long time, the second a bit longer. My guess is that this is because the index is not being used, since start_date is not being used as a filter, but I have no idea how to add it.
My question is: how can I make this faster? If I add an index on STG_CRM on case_create_date, will it help, given that this is the small table? (I don't even know if the DBA will allow it.)
LIMITATIONS :
I can't change indexes on the big table (STG_SCD)
I may be able to add an index on the other tables, but only for a good reason, because it can hurt the performance of other processes that use these tables.
The implicit join syntax is generated by my program, so no comments needed there.
Thanks a lot in advance!
P.S. The first select left joined to the second select takes about 30-60 minutes .
Since you are aggregating data from a very large table (as you are doing in your second query option), you might gain some performance benefits from using query_rewrite to calculate the results of these values in advance. Here's a paper that covers the use of this feature -
http://gerardnico.com/wiki/database/oracle/query_rewriting
You can also find many other examples of best practices for using this feature; it is highly valuable in tuning data warehouse queries.
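A minimal sketch of the feature, not tailored to the exact query in the question (the view name and the choice of aggregates are just an illustration):
-- Summary materialized view over the big table; ENABLE QUERY REWRITE lets the
-- optimizer substitute it for matching aggregate queries.
CREATE MATERIALIZED VIEW mv_scd_summary
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT account_number,
       MIN(end_date) AS min_end_date,
       COUNT(*)      AS row_cnt
FROM   stg_scd
GROUP BY account_number;

-- Query rewrite must also be enabled for the session (or instance).
ALTER SESSION SET query_rewrite_enabled = TRUE;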
Good luck!
I've managed to solve this problem!
I was missing two things:
1) Statistics on stg_scd, which we had disabled after adding partitions to the table; we forgot to gather statistics after each new partition was added.
2) Adding an index on stg_crm(account_number, case_create_date) - roughly as sketched below.
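Roughly what that fix looks like in SQL - the schema name comes from the view DDL above, and the exact DBMS_STATS parameters are my assumptions:
-- Re-gather statistics on the partitioned table (cascade also refreshes its indexes).
BEGIN
    DBMS_STATS.GATHER_TABLE_STATS(
        ownname => 'STG_ADMIN',
        tabname => 'STG_SCD',
        cascade => TRUE);
END;
/

-- Index supporting the join on account_number plus the date-range filter.
CREATE INDEX ix_stg_crm_acct_date
    ON stg_crm (account_number, case_create_date);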
Thanks for all your attempts :)

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor, 1)) as total
from table1
left join table2
    on table1.id = table2.id
    and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading storage space for execution time, to ideally achieve something that runs in under a minute?
I am not a database expert and have been reading the Oracle performance tuning docs, but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table were fixed and I simply needed to apply different filters to the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
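For what it's worth, an index-organized version of table1 might look something like the sketch below; the column types are guesses, and the date column is renamed value_date because DATE is a reserved word in Oracle.
-- Index-organized table: the whole row lives in the primary key index, so a
-- lookup by (id, date) needs no separate table access.
CREATE TABLE table1_iot (
    id         NUMBER NOT NULL,
    value_date DATE   NOT NULL,
    value      NUMBER,
    CONSTRAINT table1_iot_pk PRIMARY KEY (id, value_date)
)
ORGANIZATION INDEX;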
Apart from indexes, also try the following. My two cents!
1) Try running this query with the PARALLEL option to employ multiple processors: /*+ PARALLEL(table1,4) */.
2) NVL is evaluated for millions of rows, and that will have some impact; is there any way the data can be organised to avoid it?
3) If you know the dates in advance, you could split this query into two chunks: fetch the ids from TABLE2 using the start date and end date, then JOIN that result to TABLE1 using a view or temp table. That way the index (with id as the leading edge) is used optimally.
A sketch of the first suggestion follows below.
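Here is the first suggestion applied to the question's aggregate query; the degree of 4 is an arbitrary example, and the date column is renamed value_date in the sketch because DATE is a reserved word in Oracle.
-- Parallel full scan of table1 with degree 4 (the hint references the alias t1).
SELECT /*+ PARALLEL(t1, 4) */
       t1.value_date,
       SUM(t1.value * NVL(t2.factor, 1)) AS total
FROM   table1 t1
LEFT JOIN table2 t2
       ON  t1.id = t2.id
       AND t1.value_date BETWEEN t2.start_date AND t2.end_date
GROUP BY t1.value_date;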
Thanks!

Index In SQL Server 2012

I have this table in my database where HusbandPersonId and WifePersonId are foreign keys to another table called Person, and the start/end dates indicate when the marriage started and when it ended.
And I have this query :
SELECT
DISTINCT A.WifePersonId
FROM
Couple A
INNER JOIN Couple B
ON A.WifePersonId = B.WifePersonId
AND A.HusbandPersonId <> B.HusbandPersonId
AND A.StartDate < B.EndDate
AND A.EndDate > B.StartDate;
which returns any wife who is married to more than one person at the same time.
Now I would like to add an index to improve the search speed of this query.
Which index would be best, and what is the execution plan of the query before and after the index has been added?
This is a homework exercise; I have searched a lot but didn't find any helpful topic.
Can anyone help with this?
General indexing rules suggest including all the fields in your join condition. The order of these fields may have an effect. Again, general rules suggest ordering the fields by increasing cardinality (# of unique values for each field).
So you might try:
CREATE INDEX Couple__multi ON Couple(StartDate, EndDate, WifePersonId, HusbandPersonId)
This assumes that the # of distinct start/end dates is less than the number of unique wife/husband PersonIds.
These rules are basically Indexing 101 type things. Most of the time they will get you an acceptable level of performance. It depends highly on your data and application if this is sufficient for your purposes.
Personally, I have not thought much of the SQL Performance Analyzer-suggested indexes, but I last took a serious look at them in SQL2005. I'm sure there has been some improvement since.
Hope this helps.

How should tables be indexed to optimise this Oracle SELECT query?

I've got the following query in Oracle10g:
select *
from DATA_TABLE DT,
LOOKUP_TABLE_A LTA,
LOOKUP_TABLE_B LTB
where DT.COL_A = LTA.COL_A (+)
and DT.COL_B = LTA.COL_B (+)
and LTA.COL_C = LTB.COL_C
and LTA.COL_B = LTB.COL_B
and ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
and DT.CREATED_DATE between :startDate and :endDate
And was wondering whether you've got any hints for optimising the query.
Currently I've got the following indices:
IDX1 on DATA_TABLE (REF_TXT, CREATED_DATE)
IDX2 on DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
LOOKUP_A_PK on LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_A_IDX1 on LOOKUP_TABLE_A (COL_C, COL_B)
LOOKUP_B_PK on LOOKUP_TABLE_B (COL_C, COL_B)
Note, the LOOKUP tables are very small (<200 rows).
EDIT:
Explain plan:
Query Plan
SELECT STATEMENT  Cost = 8
  FILTER
    NESTED LOOPS
      NESTED LOOPS
        TABLE ACCESS BY INDEX ROWID DATA_TABLE
          BITMAP CONVERSION TO ROWIDS
            BITMAP OR
              BITMAP CONVERSION FROM ROWIDS
                SORT ORDER BY
                  INDEX RANGE SCAN IDX1
              BITMAP CONVERSION FROM ROWIDS
                SORT ORDER BY
                  INDEX RANGE SCAN IDX2
        TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_A
          INDEX UNIQUE SCAN LOOKUP_A_PK
      TABLE ACCESS BY INDEX ROWID LOOKUP_TABLE_B
        INDEX UNIQUE SCAN LOOKUP_B_PK
EDIT2:
The data looks like this:
There will be 10000s of distinct REF_TXT values, with 10s-100s of CREATED_DATE values for each. ALT_REF_TXT will mostly be NULL, but there are going to be 100s-1000s of rows where it differs from REF_TXT.
EDIT3: Fixed what ALT_REF_TXT actually contains.
The execution plan you're currently getting looks pretty good. There's no obvious improvement to be made.
As others have noted, you have some outer join indicators, but then you essentially prevent the outer join by requiring equality on other columns in the two outer tables. As you can see from the execution plan, no outer join is happening. If you don't want an outer join, remove the (+) operators; they're just confusing the issue. If you do want an outer join, rewrite the query as shown by @Dems.
If you're unhappy with the current performance, I would suggest running the query with the gather_plan_statistics hint, then using DBMS_XPLAN.DISPLAY_CURSOR(?,?,'ALLSTATS LAST') to view the actual execution statistics. This will show the elapsed time attributed to each step in the execution plan.
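Concretely, that looks something like the following: run the query once with the hint added, then pull the actual run-time statistics for that cursor in the same session.
SELECT /*+ gather_plan_statistics */ *
FROM   DATA_TABLE DT,
       LOOKUP_TABLE_A LTA,
       LOOKUP_TABLE_B LTB
WHERE  DT.COL_A = LTA.COL_A (+)
AND    DT.COL_B = LTA.COL_B (+)
AND    LTA.COL_C = LTB.COL_C
AND    LTA.COL_B = LTB.COL_B
AND    ( DT.REF_TXT = :refTxt OR DT.ALT_REF_TXT = :refTxt )
AND    DT.CREATED_DATE BETWEEN :startDate AND :endDate;

-- Actual rows and elapsed time per plan step for the last cursor executed:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));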
You might get some benefit from converting one or both of the lookup tables into index-organized tables.
Your 2 index range scans on IDX1 and IDX2 will produce at most 100 rows, so your BITMAP CONVERSION TO ROWIDS will produce at most 200 rows. And from there on, it's only indexed access by rowids, leading to a likely sub-second execution. So are you really experiencing performance problems? If so, how long does it take exactly?
If you are experiencing performance problems, then please follow Dave Costa's advice and get the real plan, because in that case it's likely that you are using another plan runtime, possibly due to certain bind variable values or different optimizer environment settings.
Regards,
Rob.
This is one of those cases where it makes very little sense to try to optimize the DBMS performance without knowing what your data means.
Do you have many, many distinct CREATED_DATE values and a few rows in your DT for each date? If so, you want an index on CREATED_DATE, as it will be the primary way for the DBMS to reject rows it doesn't want to process.
On the other hand, do you have only a handful of dates, and many distinct values of REF_TXT or ALT_REF_TXT? In that case you probably have the correct compound index choices.
The presence of OR in your query complicates things greatly, and throws most guesswork out the window. You must look at EXPLAIN PLAN to see what's going on.
If you have tens of millions of distinct REF_TXT and ALT_REF_TXT values, you may want to consider denormalizing this schema.
Edit.
Thanks for the additional info. Your explain plan contains no smoking guns that I can see. Some things to try next if you're not happy with performance yet.
Flip the order of the columns in your compound indexes on your data tables. Maybe that will get you simpler index range scans instead of all the bitmap monkey business.
Exchange your SELECT * for the names of the columns you actually need in the query resultset. That's good programming practice in any case, and it MAY allow the optimizer to avoid some work.
If things are still too slow, try recasting this as a UNION of two queries rather than using OR. That MAY allow the alt_ref_txt part of your query, which is made a little more complex by all the NULL values in that column, to be optimized separately.
This may be the query you want, using more up-to-date syntax (and without inner joins breaking outer joins):
select *
from DATA_TABLE DT
left outer join
(
    LOOKUP_TABLE_A LTA
    inner join LOOKUP_TABLE_B LTB
        on  LTA.COL_C = LTB.COL_C
        and LTA.COL_B = LTB.COL_B
)
    on  DT.COL_A = LTA.COL_A
    and DT.COL_B = LTA.COL_B
where
    ( DT.REF_TXT = :refTxt or DT.ALT_REF_TXT = :refTxt )
    and DT.CREATED_DATE between :startDate and :endDate
INDEXes that I'd have are...
LOOKUP_TABLE_A (COL_A, COL_B)
LOOKUP_TABLE_B (COL_B, COL_C)
DATA_TABLE (REF_TXT, CREATED_DATE)
DATA_TABLE (ALT_REF_TXT, CREATED_DATE)
Note: the first condition in the WHERE clause above contains an OR that will likely prevent effective use of indexes. In such cases I have seen performance benefits from UNIONing two queries together...
<your query>
where
DT.REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
UNION
<your query>
where
DT.ALT_REF_TXT = :refTxt
and DT.CREATED_DATE between :startDate and :endDate
Provide the output of this query with "set autot trace". Let's see how many blocks it is pulling. The explain plan looks good; it should be very fast. If you need more, denormalize the lookup table info into DT. That violates third normal form, but it will make your query faster by eliminating the joins. In a situation where milliseconds count, everything is in buffers, and you need the query to run 1000 times per second, it can help by driving down the number of blocks looked at per row. It is the ultimate way to boost read performance, but it complicates your app (and ruins your lovely ER diagram).
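For reference, the autotrace request above is roughly this in SQL*Plus (a sketch; the options can be varied):
-- Show the execution plan and I/O statistics without printing the result rows.
SET AUTOTRACE TRACEONLY EXPLAIN STATISTICS
-- ...run the query from the question here, then:
SET AUTOTRACE OFF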