Optimize Oracle SELECT on large dataset

Optimize Oracle SELECT on large dataset - sql

I am new in Oracle (working on 11gR2). I have a table TABLE with something like ~10 millions records in it, and this pretty simple query :
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1
AND t.col11 = val2
AND t.col12 = val3
AND t.col13 = val4
The query is currently taking about 30s/1min.
My question is: how can I improve performance ? After a lot of research, I am aware of the most classical ways to improve performance but I have some problems :
Partitioning: can't really, the table is used in an other project and it would be too impactful. Plus it only delay the problem given the number of rows inserted in the table every day.
Add an index: The thing is, the columns used in the WHERE clause are not the one returned by the query (except for one). Thus, I have not been able to find an appropriate index yet. As far as I know, setting an index on 12~13 columns does not make a lot of sense (or does it?).
Materialized views: I must say I never used them, but I understood the maintenance cost is pretty high and my table is updated quite often.
I think the best way to do this would be to add an appropriate index, but I can't find the right columns on which it should be created.

An index makes sense provided that your query results in a small percentage of all rows. You would create one index on all four columns used in the WHERE clause.
If too many records match, then a full table scan will be done. You may be able to speed this up by having this done in parallel threads using the PARALLEL hint:
SELECT /*+parallel(t,4)*/
t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1 AND t.col11 = val2 AND t.col12 = val3 AND t.col13 = val4;

Table with 10 millions records is quite little table. You just need to create an appropriate index. Which column select for index - depends on content of them. For example, if you have column that contains only "1" and "0", or "yes" and "no", you shouldn't index it. The more different values contains column - the more effect gives index. Also you can make index on two or three (and more) columns, or function-based index (in this case index contains results of your SQL function, not columns values). Also you can create more than one index on table.
And in any case, if your query selects more then 20 - 30% of all table records, index will not help.
Also you said that table is used by many people. In this case, you need to cooperate with them to avoid duplicating indexes.

Indexes on each of the columns referenced in the WHERE clause will help performance of a query against a table with a large number of rows, where you are seeking a small subset, even if the columns in the WHERE clause are not returned in the SELECT column list.
The downside of course is that indexes impede insert/update performance. So when loading the table with large numbers of records, you might need to disable/drop the indexes prior to loading and then re-create/enable them again afterwards.

Related

Sybase ASE 15 : Creating an index on multiple columns which values are inequally discriminating

I have a table mytable of 5 million records and a query that looks like
select *
from mytable
where column1 = 'value1' and column2 = 'value2' and column3 = 'value3'
So I thought about creating an index based on the 3 columns but my problem is that I have no best column to put in the first position of the index because there is no column that is really discrimating compared to the others.
Therefore I would like to build something similar to the hash tables with a hash code based on these 3 columns. I tried a function-based index based on the concatenation of those 3 columns but it's taking so long to create that I never got it created and I believe it's the wrong way to achieve what I want. What is the correct way to achieve this ?

Just create an index with three columns:
create idx_mytable_col1_col2_col3 on mytable(col1, col2, col3)
You have equality comparisons. The order of the columns in the index does not matter in this case.
Let the database do the work for you.

ASE's indexes are generally stored as b-trees, and while there's some hashing 'magic' that takes place during index searching, there's still a bit of traversal/searching required; if the first column of an index is not very selective then you can see some degradation in index search performance when compared to an index where the more selective column(s) is listed first; the difference in performance is really going to depend on the selectivity of the column(s) in question and the sheer size of the index (ie, number of index levels and pages that have to be read/processed).
If you're running ASE 15.0.3+, and you're running ASE on linux, you may want to take a look at virtually-hashed tables. In a nutshell ... ASE stores the PK index as a hash instead of the normal b-tree, with the net result being that index search times are reduced. There are quite a few requirements/restrictions on virtually-hashed tables so I suggest you take a look at your Transact-SQL User's Guide for more details.
Obviously (?) there's a good bit more to table/index design than can be discussed here; certainly not something that can be addressed by looking at a single, generic query. ("Duh, Mark!" ?)

Oracle: How can I find tablespace fragmentation?

I've a JOIN beween two tables. It's really really slow and I can't find why.
The query takes hours in a PRODUCTION environment on a very big Client.
Can you ask me what you need to understand why it doesn't work well?
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand record. Because of the following condition:
f.eif_campo1 != c.fornitura AND and f.field29 = 'New'
Infact it should be always verified for all 18 million records
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following record expressed as number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan [See large image][1]
![enter image description here][2]
Extra details:
I have relaxed one contition to see how many records are taken. 300 thousand.
![enter image description here][7]
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';

Your explain plan shows the following.
The database uses an index to retrieve rows from ENI_FLUSSI_HUB where
flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC')
It then winnows the rows
where flh_flag_ann = 'N'
This produces a result set which is used to access
rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id =
c.idde_identif_dati_ext_id
Finally those rows are filtered on the basis of the
remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective
column: that is, if it contains hundreds of different values, or if the values in
your list are relatively rare. It might even be a good path of the flag column
identifies relatively few columns with a value of 'N'. So you need to understand
both the distribution of your data - how many distinct values you have - and its
skew - which values appear very often or hardly at all. The overall
performance suggests that the distribution and/or skew of the
flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion, and use full
table scans. If you have an Enterprise Edition licence and plenty of CPU capacity
you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on
eni_flussi_hub(flh_tipo_processo_cod,flh_flag_ann,idde_identif_dati_ext_id,
flh_fornitura,flh_id_messaggio) would avoid the need to read that table. Whether
this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other
activity against the table. You might be able to benefit from index compression.
All the columns in the query projection are referenced in the WHERE clause. So
you could also use a composite index on the other table to avoid table reads. Agsin you need to understand the distribution and skew of the data. But you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). Probably this is a new index. It's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint)..
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.

The simple answer seems to be your explain plan. You're accessing both tables by index rowid. Whilst to select a single row you cannot - to my knowledge - get faster, in your case you're selecting a lot more than a single row.
This means that for every single row you, you're going into both tables one row at a time, which when you're looking a significant proportion of a table or index is not what you want to do.
My suggestion would be to force a full scan of one or both of your tables. Try to use the smaller as a driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c,
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes seem to be separate column indexes as well. For this, and if possible, I would have indexes on:
flows of id_messaggio, extra_id, f
and on flet of field19, extra_id, field1.
This will only really matter if you do not use as full scan. Or, if you have everything you're returning and selecting is in one index.

Decision when to create Index on table column in database?

I am not db guy. But I need to create tables and do CRUD operations on them. I get confused should I create the index on all columns by default
or not? Here is my understanding which I consider while creating index.
Index basically contains the memory location range ( starting memory location where first value is stored to end memory location where last value is
stored). So when we insert any value in table index for column needs to be updated as it has got one more value but update of column
value wont have any impact on index value. Right? So bottom line is when my column is used in join between two tables we should consider
creating index on column used in join but all other columns can be skipped because if we create index on them it will involve extra cost of
updating index value when new value is inserted in column.Right?
Consider this scenario where table mytable contains two three columns i.e col1,col2,col3. Now we fire this query
select col1,col2 from mytable
Now there are two cases here. In first case we create the index on col1 and col2. In second case we don't create any index.** As per my understanding
case 1 will be faster than case2 because in case 1 we oracle can quickly find column memory location. So here I have not used any join columns but
still index is helping here. So should I consider creating index here or not?**
What if in the same scenario above if we fire
select * from mytable
instead of
select col1,col2 from mytable
Will index help here?

Don't create Indexes in every column! It will slow things down on insert/delete/update operations.
As a simple reminder, you can create an index in columns that are common in WHERE, ORDER BY and GROUP BY clauses. You may consider adding an index in colums that are used to relate other tables (through a JOIN, for example)
Example:
SELECT col1,col2,col3 FROM my_table WHERE col2=1
Here, creating an index on col2 would help this query a lot.
Also, consider index selectivity. Simply put, create index on values that has a "big domain", i.e. Ids, names, etc. Don't create them on Male/Female columns.

but update of column value wont have any impact on index value. Right?
No. Updating an indexed column will have an impact. The Oracle 11g performance manual states that:
UPDATE statements that modify indexed columns and INSERT and DELETE
statements that modify indexed tables take longer than if there were
no index. Such SQL statements must modify data in indexes and data in
tables. They also create additional undo and redo.
So bottom line is when my column is used in join between two tables we should consider creating index on column used in join but all other columns can be skipped because if we create index on them it will involve extra cost of updating index value when new value is inserted in column. Right?
Not just Inserts but any other Data Manipulation Language statement.
Consider this scenario . . . Will index help here?
With regards to this last paragraph, why not build some test cases with representative data volumes so that you prove or disprove your assumptions about which columns you should index?

In the specific scenario you give, there is no WHERE clause, so a table scan is going to be used or the index scan will be used, but you're only dropping one column, so the performance might not be that different. In the second scenario, the index shouldn't be used, since it isn't covering and there is no WHERE clause. If there were a WHERE clause, the index could allow the filtering to reduce the number of rows which need to be looked up to get the missing column.
Oracle has a number of different tables, including heap or index organized tables.
If an index is covering, it is more likely to be used, especially when selective. But note that an index organized table is not better than a covering index on a heap when there are constraints in the WHERE clause and far fewer columns in the covering index than in the base table.
Creating indexes with more columns than are actually used only helps if they are more likely to make the index covering, but adding all the columns would be similar to an index organized table. Note that Oracle does not have the equivalent of SQL Server's INCLUDE (COLUMN) which can be used to make indexes more covering (it's effectively making an additional clustered index of only a subset of the columns - useful if you want an index to be unique but also add some data which you don't want to be considered in the uniqueness but helps to make it covering for more queries)
You need to look at your plans and then determine if indexes will help things. And then look at the plans afterwards to see if they made a difference.

Unique index on two columns plus separate index on each one?

I don't know much about database optimization, but I'm trying to understand this case.
Say I have the following table:
cities
===========
state_id integer
name varchar(32)
slug varchar(32)
Now, say I want to perform queries like this:
SELECT * FROM cities WHERE state_id = 123 AND slug = 'some_city'
SELECT * FROM cities WHERE state_id = 123
If I want the "slug" for a city to be unique within its particular state, I'd add a unique index on state_id and slug.
Is that index enough? Or should I also add another on state_id so the second query is optimized? Or does the second query automatically use the unique index?
I'm working on PostgreSQL, but I feel this case is so simple that most DBMS work similarly.
Also, I know this surely doesn't make a difference on small tables, but my example is a simple one. Think of 200k+ rows tables.
Thanks!

A single unique index on (state_id, slug) should be sufficient. To be sure, of course, you'll need to run EXPLAIN and/or ANALYZE (perhaps with the help of something like http://explain.depesz.com/), but ultimately what indexes are appropriate depends very closely on what kind of queries you will be running. Remember, indexes make SELECTs faster and INSERTs, UPDATEs, and DELETEs slower, so you ideally want only as many indexes as are actually necessary.
Also, PostgreSQL has a smart query optimizer: it will use radically different search plans for queries on small tables and huge tables. If the table is small, it will just do a sequential scan and not even bother with any indexes, since the overhead of working with them is higher than just brute-force sifting through the table. This changes to a different plan once the table size passes a threshold, and may change again if the table gets larger again, or if you change your SELECT, or....
Summary: you can't trust the results of EXPLAIN and ANALYZE on datasets much smaller or different than your actual data. Make it work, then make it fast later (if you need to).

[EDIT: Misread the question... Hopefully my answer is more relevant now!]
In your case, I'd suggest 1 index on (state_id, slug). If you ever need to search just by slug, add an index on just that column. If you have those, then adding another index on state_id is unnecessary as the first index already covers it.
An index can be used whenever an initial segment of its columns are used in a WHERE clause. So e.g. an index on columns A, B and C will optimise queries containing WHERE clauses involving A, B and C, WHERE clauses with just A and B, or WHERE clauses with just A. Note that the order that columns appear in the index definition is very important -- this example index cannot be used for WHERE clauses involving just B and/or C.
(Of course it's up to the query optimiser whether or not a particular index actually gets used, but in your case with 200k rows, you can guarantee that a simple search by state_id or slug or both will use one of the indices.)

Any decent optimizer will see an index on three columns - say:
CREATE INDEX idx_1 ON SomeTable(Col1, Col2, Col3);
and will use that index for any of the following conditions:
WHERE Col1 = ...something...
WHERE Col1 = ...something... AND Col2 = ...otherthing...
WHERE Col3 = ....whatnot....
AND Col1 = ...something....
AND Col2 = ...otherthing...
That is, it will use the index if there are conditions applied to any contiguous leading subset of the columns of the index. Although I used equality, it can also apply to ranges (open - just greater than, for example) or closed (between two values).

To do optimization use EXPLAIN http://www.postgresql.org/docs/7.4/static/sql-explain.html and see for your self.
But optimization is not the most important reason to make those indexes; first it is a constraint inhibiting a database from not being logical.

What is a Covered Index?

I've just heard the term covered index in some database discussion - what does it mean?

A covering index is an index that contains all of, and possibly more, the columns you need for your query.
For instance, this:
SELECT *
FROM tablename
WHERE criteria
will typically use indexes to speed up the resolution of which rows to retrieve using criteria, but then it will go to the full table to retrieve the rows.
However, if the index contained the columns column1, column2 and column3, then this sql:
SELECT column1, column2
FROM tablename
WHERE criteria
and, provided that particular index could be used to speed up the resolution of which rows to retrieve, the index already contains the values of the columns you're interested in, so it won't have to go to the table to retrieve the rows, but can produce the results directly from the index.
This can also be used if you see that a typical query uses 1-2 columns to resolve which rows, and then typically adds another 1-2 columns, it could be beneficial to append those extra columns (if they're the same all over) to the index, so that the query processor can get everything from the index itself.
Here's an article: Index Covering Boosts SQL Server Query Performance on the subject.

Covering index is just an ordinary index. It's called "covering" if it can satisfy query without necessity to analyze data.
example:
CREATE TABLE MyTable
(
ID INT IDENTITY PRIMARY KEY,
Foo INT
)
CREATE NONCLUSTERED INDEX index1 ON MyTable(ID, Foo)
SELECT ID, Foo FROM MyTable -- All requested data are covered by index
This is one of the fastest methods to retrieve data from SQL server.

Covering indexes are indexes which "cover" all columns needed from a specific table, removing the need to access the physical table at all for a given query/ operation.
Since the index contains the desired columns (or a superset of them), table access can be replaced with an index lookup or scan -- which is generally much faster.
Columns to cover:
parameterized or static conditions; columns restricted by a parameterized or constant condition.
join columns; columns dynamically used for joining
selected columns; to answer selected values.
While covering indexes can often provide good benefit for retrieval, they do add somewhat to insert/ update overhead; due to the need to write extra or larger index rows on every update.
Covering indexes for Joined Queries
Covering indexes are probably most valuable as a performance technique for joined queries. This is because joined queries are more costly & more likely then single-table retrievals to suffer high cost performance problems.
in a joined query, covering indexes should be considered per-table.
each 'covering index' removes a physical table access from the plan & replaces it with index-only access.
investigate the plan costs & experiment with which tables are most worthwhile to replace by a covering index.
by this means, the multiplicative cost of large join plans can be significantly reduced.
For example:
select oi.title, c.name, c.address
from porderitem poi
join porder po on po.id = poi.fk_order
join customer c on c.id = po.fk_customer
where po.orderdate > ? and po.status = 'SHIPPING';
create index porder_custitem on porder (orderdate, id, status, fk_customer);
See:
http://literatejava.com/sql/covering-indexes-query-optimization/

Lets say you have a simple table with the below columns, you have only indexed Id here:
Id (Int), Telephone_Number (Int), Name (VARCHAR), Address (VARCHAR)
Imagine you have to run the below query and check whether its using index, and whether performing efficiently without I/O calls or not. Remember, you have only created an index on Id.
SELECT Id FROM mytable WHERE Telephone_Number = '55442233';
When you check for performance on this query you will be dissappointed, since Telephone_Number is not indexed this needs to fetch rows from table using I/O calls. So, this is not a covering indexed since there is some column in query which is not indexed, which leads to frequent I/O calls.
To make it a covered index you need to create a composite index on (Id, Telephone_Number).
For more details, please refer to this blog:
https://www.percona.com/blog/2006/11/23/covering-index-and-prefix-indexes/

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas