This question might sound odd since my knowledge of SQL is limited.
I am trying to optimize a database containing millions of records, but I can't really tell when there is a performance gain. I have been playing around with the Explain Plan in SQL Developer, but those numbers really don't mean a lot to me.
For instance, I have a table with the following columns:
ID | Creation_date | End_date | Critical_date | Flag
I run a query on this table that looks like this:
SELECT COUNT(ID) FROM TABLE
WHERE FLAG = 1
  AND Creation_date < :p_d_one
  AND End_date > :p_d_two
  AND Critical_date < :p_d_three
This gives me a COST of 64. So I decided to add indexes.
Now, the thing I do not understand is that when I create an index for Creation_date, the cost drops drastically (64 -> 9):
CREATE INDEX TABLE_INDEX_CREATION_DATE ON TABLE (Creation_date desc)
but when I create an index for the other fields, the cost doesn't change at all (even though they are also dates).
Any idea whether it is still worth including them in my index?
Should I still create indexes for other queries if they don't have any influence on the cost?
Is the cost calculated using live stats from the database (cardinality, ...)?
You could create an index over all the fields mentioned in the query (flag, creation_date, end_date, critical_date, id in that order) so your query becomes index-only, i.e. the RDBMS does not have to read the table at all, it can get everything it needs from the index.
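For illustration, the covering index could look something like this (the index name is arbitrary, and TABLE stands in for your real table name, as in the question):
CREATE INDEX TABLE_INDEX_COVERING
    ON TABLE (Flag, Creation_date, End_date, Critical_date, ID)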
For a particular query, Oracle will typically use at most one index on a given table. Having more than one index is mainly useful to support different queries.
We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the right way to add a date filter so that Oracle can begin streaming the results back as fast as possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer decides that the query will return a large share of all the records in your table, then it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records fall in 2017, out of all the records in the table? This calculation gives you the cardinality of your query, which then gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
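For example, a function-based index matching that predicate might look like the statement below (index and table names are placeholders, since the real names aren't given in the question):
CREATE INDEX ix_performed_year
    ON your_table (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));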
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
Applying a function to that many records can also cause slowness. I'm not sure whether a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, add one and populate it with the code below.
-- The column has to be added first (note: this also modifies the vendor's table)
ALTER TABLE table ADD (year NUMBER);
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider partitioning the table for data this big. For starters, see this link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
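A rough sketch of that shadow approach, assuming the vendor table is reachable over a database link and has a numeric, increasing primary key (every name below is made up):
-- Shadow table in a database you control: just the PK and the date
CREATE TABLE shadow_performed (
  pk_id               NUMBER PRIMARY KEY,
  performed_date_time DATE
);

-- Periodic sync: pull only rows added since the last load
INSERT INTO shadow_performed (pk_id, performed_date_time)
SELECT t.pk_id, t.performed_date_time
FROM   vendor_table@vendor_link t
WHERE  t.pk_id > (SELECT NVL(MAX(pk_id), 0) FROM shadow_performed);
COMMIT;
You would then find the 2017 PKs in shadow_performed and join them back to the vendor table for the rest of the columns.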
Maybe this could be useful (it avoids functions, which cause context switching, and if you have an index on your date field, it could be used):
with dt as
(
  -- boundaries for the whole of 2017
  select
    to_date('01/01/2017', 'DD/MM/YYYY') as d1,
    to_date('31/12/2017', 'DD/MM/YYYY') as d2
  from dual
),
dates as
(
  -- one row per day between d1 and d2
  select dt.d1 + rownum - 1 as d
  from dt
  connect by dt.d1 + rownum - 1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME  -- exact match assumes no time-of-day component
Keep the column bare and move the date literals to the right-hand side:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
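Putting it together (the question never shows the FROM clause, so the table name below is a placeholder):
SELECT *
FROM   your_table
WHERE  Event_Code = 102225120
AND    PERFORMED_DATE_TIME >= DATE '2017-01-01'
AND    PERFORMED_DATE_TIME <  DATE '2018-01-01';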
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option for creating indexes in a third-party database is to script the index in and then, before any vendor upgrade, run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.
I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on (id, date).
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor, 1)) as total
from table1
left join table2
  on table1.id = table2.id
  and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps by trading storage space for execution time, to ideally get it running in under a minute?
I am not a database expert and have been reading Oracle performance tuning docs but was not able to find anything appropriate for this. The most promising idea I found were OLAP cubes but I understand this would help only if my second table was fixed and I simply needed to apply different filters on the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
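For illustration only, an index-organized version of table1 might be created along these lines (column names and types are guesses, and since DATE is a reserved word in Oracle the date column would need a different or quoted name):
CREATE TABLE table1_iot (
  id       NUMBER,
  obs_date DATE,
  value    NUMBER,
  CONSTRAINT table1_iot_pk PRIMARY KEY (id, obs_date)
) ORGANIZATION INDEX;
With an index-organized table the rows are stored in the primary key index itself, so a range condition on (id, date) can be answered without a separate table access.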
Apart from indexes, also try the points below. My two cents!
1) Try running the query with the PARALLEL hint so it can employ multiple processors: /*+ PARALLEL(table1, 4) */.
2) NVL is evaluated for millions of rows, and that has some impact; is there any way the data can be organised to avoid it?
3) When you know the dates in advance, you could split the query into two chunks: fetch the ids from TABLE2 using the start date and end date, then JOIN that result to TABLE1 using a view or a temp table. That way the index (with id as the leading edge) is used optimally.
Thanks!
I am not clear on the following:
Assume I have a table A and I have created an index on column X.
If I do a SELECT based on that column, the result will be sorted, right?
But if for some reason I do a SELECT followed by an ORDER BY X (e.g. I was unaware that the column was indexed), will the SQL server perform the sort with sequential access, or will it go and use the index?
If you don't specify an ORDER BY in your SELECT query, then there's no guaranteed, deterministic ordering of any kind.
The data will be returned - but there's no guarantee of any order (or that it will be returned in the same order next time you query for it)
So if you need an order - then you must specify an ORDER BY in your SELECT - it's as simple as that.
And yes - if there's an index in place, and if it makes sense - then the SQL Server query optimizer will indeed use that index. But there are a lot of ifs and buts involved - it might - or might not - use the index - that's entirely up to the optimizer to decide (and this depends on a lot of factors - too many to list here).
(this specifically applies to SQL Server which I know well - I don't know how other RDBMS handle this situation)
Whether or not the RDBMS will use the index depends on the predicates in your SQL query, what other indexes are on the table, and how the optimiser chooses to execute your query.
So if you have
SELECT A, X
FROM T
WHERE A = some_value
ORDER BY X
and there is an index on A, then the RDBMS may (emphasis on MAY) choose to access the data via that index rather than the index on X. This means the SELECT will be followed by a sort.
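If avoiding that sort matters for this particular query, one option is a composite index covering both columns, so the rows for a given A are already stored in X order (the optimizer still MAY decide not to use it):
CREATE INDEX IX_T_A_X ON T (A, X);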
As with most database questions, the answer is "It depends"
Without an ORDER BY in your SELECT, there is no deterministic ordering of any kind.
Indexing plus sorting is quite a complex story. It depends on the RDBMS, the RDBMS version, and the data in the table.
Take a comments table with 1,000,000 rows:
id - int, unique number from a sequence/auto_increment etc.
created - long, indexed
updated - long, indexed
title - varchar(50), indexed
body - text
Selects:
SELECT id FROM comments;
Oracle, MySQL and PostgreSQL return records in what is effectively random order. You can have the illusion of order; however, after some administrative work (VACUUM, table ANALYZE, OPTIMIZE TABLE), everything can change.
SELECT * FROM comments ORDER BY created DESC;
Postgresql and Oracle will perform full table scan.
SELECT created FROM comments ORDER BY created;
Oracle will perform full index scan (only index data) but Postgresql will perform full table scan.
SELECT * FROM comments WHERE created > sysdate-7 ORDER BY created DESC;
Postgresql and Oracle will perform index range scan and read data from table.
SELECT created FROM comments WHERE created > sysdate-7 ORDER BY created;
Oracle will perform index range scan (no read from table) but Postgresql will perform index range scan and read data from table.
SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical; the only difference is that the RID Lookup the second query gets in SQL Fiddle becomes a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to help determine why it is using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but I would like to have an understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make an index access more efficient, but this should be done with care as it can adversely affect other queries as well as possibly degrading insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data.
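In SQL Server, a straightforward way to do that for this table is shown below (WITH FULLSCAN gives the most accurate statistics at the cost of a longer run; sampled updates are also an option):
UPDATE STATISTICS SupportContacts WITH FULLSCAN;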
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint: if you know which index your plan should be using (one that contains the Id and ForeignId columns), then stick that in your query as a hint and force the SQL optimiser to use that index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
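For example, something along these lines - the index name below is just a placeholder for whichever existing index actually contains the Id and ForeignId columns:
SELECT [Id], [ForeignId], [Details]
FROM SupportContacts WITH (INDEX (IX_SupportContacts_ForeignId_Id))
WHERE ForeignId IN (1, 2, 3, 5);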
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);
How to improve the performance of ad hoc queries against tables having hundreds of high cardinality columns and millions of records?
In my case, I have a table with one indexed DATE column SDATE, one VARCHAR2 column NE and 750 numeric columns most of them high cardinality columns with values in the range of 0 to 100. The table is updated with almost 20000 new records every hour. The queries against this table look like:
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND V1 > :V1 AND V3 < :V3
or
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND NE = :NE AND V4 > :V4
etc.
So far, I have always advised users not to enter big date intervals, so as to limit the number of records returned by the date index access path; however, from time to time it becomes necessary to specify bigger intervals.
If V1, V2, ..., V750 were all low cardinality columns, I would have been able to utilize bitmap indexes. Unfortunately they are not.
What's the advice on this? How should I tackle this problem?
Thanks.
I assume you're stuck with the design, so a few thoughts that I'd probably look at -
1) use partitions - if you have partitioning option
2) use some triggers to denormalise (or normalise in this case) a query table which is more optimised for the query usage
3) make some snapshots
4) look at having a current table or set of tables which holds the day's records (or some suitable subset), and roll them over to a big table to store history.
It depends on usage patterns and all the other constraints the system has - this may get you started, if you have more details a better solution is probably out there.
I think the big problem would be the inserts. You have an index on sdate which slows the inserts and speeds up the selects. But, returning to your problem:
If users specify an interval which is large (let's say returning more than 5% of the rows), it is better to have the table partitioned by sdate in a daily, weekly or monthly manner.
Oracle partitioning docs
(If you partition the table, don't forget to also partition the index. And if you want to do it live, use exchange partition.)
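A rough sketch of such a partitioned table, assuming Oracle 11g or later so interval partitioning can create the weekly partitions automatically (names, types and the starting boundary are illustrative only):
CREATE TABLE tab_part (
  sdate DATE,
  ne    VARCHAR2(100),
  v1    NUMBER,
  v2    NUMBER
  -- ... v3 through v750 ...
)
PARTITION BY RANGE (sdate)
INTERVAL (NUMTODSINTERVAL(7, 'DAY'))
(PARTITION p_initial VALUES LESS THAN (DATE '2015-01-01'));
Queries that filter on sdate can then prune to the relevant partitions instead of scanning the whole table.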
Also, as a workaround, if you have a powerful machine, you may use parallel queries.
Oracle Parallel docs
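For instance, one of the problem queries could be hinted roughly like this (the degree of parallelism, 8, is just an example and should be matched to the machine):
SELECT /*+ PARALLEL(t, 8) */ *
FROM   tab t
WHERE  t.sdate BETWEEN :sdate AND :edate
AND    t.v1 > :v1
AND    t.v3 < :v3;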