Role of null in performance tuning - sql

This query is regarding performance tuning of a query.
I have a table TEST1, which has 200,000 rows.
the table structure is as below.
ACCOUNT_NUMBER VARCHAR2(16)
BRANCH VARCHAR2(10)
ACCT_NAME VARCHAR2(100)
BALANCE NUMBER(20,5)
BANK_ID VARCHAR2(10)
SCHM_CODE VARCHAR2(10)
CUST_ID VARCHAR2(10)
And the indexes are as below.
fields Index Name Uniquness
ACCOUNT_NUMBER IDX_TEST_ACCT UNIQUE
SCHM_CODE,BRANCH IDX_TEST_SCHM_BR NONUNIQUE
Also I have one more table STATUS,
ACCOUNT_NUMBER VARCHAR2(16)
STATUS VARCHAR2(2)
ACCOUNT_NUMBER IDX_STATUS_ACCT UNIQUE
When I write a query joining to table tables like below, execution too much time and is costly query.
SELECT ACCOUNT_NUMBER,STATUS
FROM TEST,STATUS
where TEST.ACCOUNT_NUMBER = STATUS.ACCOUNT_NUMBER
AND TEST.BRANCH = '1000';
There is query return by product team to fetch the same details has ||null
in where condition, the query is returning the same results but the
performance is very good compared to my query.
SELECT ACCOUNT_NUMBER,STATUS
FROM TEST,STATUS
where TEST.ACCOUNT_NUMBER = STATUS.ACCOUNT_NUMBER
AND TEST.BRANCH||NULL = '1000';
Can anyone explain me how ||null in the where condition made that difference.
I am writing this because, I want to know how it made the difference and want to use wherever it is possible.

If you turn on autotrace and get the execution plans of both queries, I would guess that your query is trying to use the index IDX_TEST_SCHM_BR and the other query cannot use the index due to the clause TEST.BRANCH||NULL and cannot use the index since that clause prevents the optimizer from using the index.
Normally, the use of a function on a table column prevents Oracle from using the index, and in your case appending a null to a table column with the || operator is like invoking the function concat(TEST.BRANCH||NULL). To make your query run faster, you can
Add a hint to ignore the index SELECT /*+ NOINDEX(TEST1 IDX_TEST_SCHM_BR */ ACCOUNT_NUMBER, ... (Not recommended)
Create a new index with BRANCH as the only column (Recommended)
As #symcbean noted, if an index is not very selective, (IE: the query returns a lot of rows in the table), then a full table scan would probably be faster. In this case, since the BRANCH column is not the first column in the index, Oracle has to skip through the index to find the rows that match the join criteria. A general rule of thumb is if the query is returning more than around 20% of the rows, a full table scan is quicker. In this case due to the index definition, Oracle has to read through several index entries, skipping along until it finds the next new BRANCH value, so in this case much less than 5% probably
Also ensure your tables have current statistics gathered, and if any of your columns are not null, you should specify that in the table definition to help Oracle optimizer avoid issues like you are having.

Related

Performance impact of view on aggregate function vs result set limiting

The problem
Using PostgreSQL 13, I ran into a performance issue selecting the highest id from a view that joins two tables, depending on the select statement I execute.
Here's a sample setup:
CREATE TABLE test1 (
id BIGSERIAL PRIMARY KEY,
joincol VARCHAR
);
CREATE TABLE test2 (
joincol VARCHAR
);
CREATE INDEX ON test1 (id);
CREATE INDEX ON test1 (joincol);
CREATE INDEX ON test2 (joincol);
CREATE VIEW testview AS (
SELECT test1.id,
test1.joincol AS t1charcol,
test2.joincol AS t2charcol
FROM test1, test2
WHERE test1.joincol = test2.joincol
);
What I found out
I'm executing two statements which result in completely different execution plans and runtimes. The following statement executes in less than 100ms. As far as I understand the execution plan, the runtime is independent of the rowcount, since Postgres iterates the rows one by one (starting at the highest id, using the index) until a join on a row is possible and immediately returns.
SELECT id FROM testview ORDER BY ID DESC LIMIT 1;
However, this one takes over 1 second on average (depending on rowcount), since the two tables are "joined completely", before Postgres uses the index to select the highest id.
SELECT MAX(id) FROM testview;
Please refer to this sample on dbfiddle to check the explain plans:
https://www.db-fiddle.com/f/bkMNeY6zXqBAYUsprJ5eWZ/1
My real environment
On my real environment test1 contains only a hand full of rows (< 100), having unique values in joincol. test2 contains up to ~10M rows, where joincol always matches a value of test1's joincol. test2's joincol is not nullable.
The actual question
Why does Postgres not recognize that it could use an Index Scan Backward on row basis for the second select? Is there anything I could improve on the tables/indexes?
Queries not strictly equivalent
why does Postgres not recognize that it could use a Index Scan Backward on row basis for the second select?
To make the context clear:
max(id) excludes NULL values. But ORDER BY ... LIMIT 1 does not.
NULL values sort last in ascending sort order, and first in descending. So an Index Scan Backward might not find the greatest value (according to max()) first, but any number of NULL values.
The formal equivalent of:
SELECT max(id) FROM testview;
is not:
SELECT id FROM testview ORDER BY id DESC LIMIT 1;
but:
SELECT id FROM testview ORDER BY id DESC NULLS LAST LIMIT 1;
The latter query doesn't get the fast query plan. But it would with an index with matching sort order: (id DESC NULLS LAST).
That's different for the aggregate functions min() and max(). Those get a fast plan when targeting table test1 directly using the plain PK index on (id). But not when based on the view (or the underlying join-query directly - the view is not the blocker). An index sorting NULL values in the right place has hardly any effect.
We know that id in this query can never be NULL. The column is defined NOT NULL. And the join in the view is effectively an INNER JOIN which cannot introduce NULL values for id.
We also know that the index on test.id cannot contain NULL values.
But the Postgres query planner is not an AI. (Nor does it try to be, that could get out of hands quickly.) I see two shortcomings:
min() and max() get the fast plan only when targeting the table, regardless of index sort order, an index condition is added: Index Cond: (id IS NOT NULL)
ORDER BY ... LIMIT 1 gets the fast plan only with the exactly matching index sort order.
Not sure, whether that might be improved (easily).
db<>fiddle here - demonstrating all of the above
Indexes
Is there anything I could improve on the tables/indexes?
This index is completely useless:
CREATE INDEX ON "test" ("id");
The PK on test.id is implemented with a unique index on the column, that already covers everything the additional index might do for you.
There may be more, waiting for the question to clear up.
Distorted test case
The test case is too far away from actual use case to be meaningful.
In the test setup, each table has 100k rows, there is no guarantee that every value in joincol has a match on the other side, and both columns can be NULL
Your real case has 10M rows in table1 and < 100 rows in table2, every value in table1.joincol has a match in table2.joincol, both are defined NOT NULL, and table2.joincol is unique. A classical one-to-many relationship. There should be a UNIQUE constraint on table2.joincol and a FK constraint t1.joincol --> t2.joincol.
But that's currently all twisted in the question. Standing by till that's cleaned up.
This is a very good problem, and good testcase.
I tested it in postgres 9.3 perhaps 13 is can it more more fast.
I used Occam's Razor and i excluded some possiblities
View (without view is slow to)
JOIN can filter some rows (unfortunatly in your test not, but more length md5 5-6 yes)
Other basic equivalent select statements not solve yout problem (inner query or exists)
I achieved to use just index, but because the tables isn't bigger than indexes it was not the solution.
I think
CREATE INDEX on "test" ("id");
is useless, because PK!
If you change this
CREATE INDEX on "test" ("joincol");
to this
CREATE INDEX ON TEST (joincol, id);
Than the second query use just indexes.
After you run this
REINDEX table test;
REINDEX table test2;
VACUUM ANALYZE test;
VACUUM ANALYZE test2;
you can achive some performance tuning. Because you created indexes before inserts.
I think the reason is the two aim of DB.
First aim optimalize just some row. So run Nested Loop. You can force it with limit x.
Second aim optimalize whole table. Run this query fast for whole table.
In this situation postgres optimalizer didn't notice that simple MAX can run with NESTED LOOP. Or perhaps postgres cannot use limit in aggregate clause (can run on whole partial select, what is filtered with query).
And this is very expensive. But you have possiblities to write there other aggregates, like SUM, MIN, AVG stb.
Perhaps can help you the Window functions too.

Indexing not working with column using range operations in oracle

I have created index on timestamp column for my table, but when I am querying and checking the explain plan in oracle it is doing the full table scan rather that range scan
Below is the DDL script for the table
CREATE TABLE EVENT (
event_id VARCHAR2(100) NOT NULL,
status VARCHAR2(50) NOT NULL,
timestamp NUMBER NOT NULL,
action VARCHAR2(50) NOT NULL
);
ALTER TABLE EVENT ADD CONSTRAINT PK_EVENT PRIMARY KEY ( event_id ) ;
CREATE INDEX IX_EVENT$timestamp ON EVENT (timestamp);
Below is the explain plan query used to get the explain plan -
EXPLAIN PLAN SET STATEMENT_ID = 'test3' for select * from EVENT where timestamp between 1620741600000 and 1621900800000 and status = 'CANC';
SELECT * FROM PLAN_TABLE WHERE STATEMENT_ID = 'test3';
Here is the explain plan that oracle returned -
I am not sure why the index is not working here, rather it is still doing the full table scan even after creating the index on the timestamp column.
Can someone please help me with this.
Gordon is correct. You need this index to speed up the query you showed us.
CREATE INDEX IX_EVENT$timestamp ON EVENT (status, timestamp);
Why? Your query requires an equality match on status and then a range scan on timestamp. Without the possibility of using the index for the equality match, Oracle's optimizer seems to have decided it's cheaper to scan the table than the index.
Why did it decide that?
Who knows? Hundreds of programmers have been working on the optimizer for many decades.
Who cares? Just use the right index for the query.
The optimizer is cost based. So conceptually the optimizer will evaluate all the available plans, estimate the cost, and pick the one that has the lowest estimated cost. The costs are estimated based on statistics. The index statistics are automatically collected when an index is built. However your table statistics may not reflect real life.
An Active SQL Monitor report will help you diagnose the issue.

Oracle efficient way of updating non-indexed and non-partioned table?

Is there an efficient way to update rows of a table that has no indexes and no partitions (and ~50millions rows)?
I have a date field LOAD_DTTM and values of this field for rows that require update (around 2000 distinct dates).
WIll update be faster if i specify a date in a WHERE clause along with the UNIQUE_ID of a row?
If you want to update all, or a large number, of the rows then the quickest way is:
create table my_table_copy as
select ... -- all the columns, updating values as required
from my_table;
drop table my_table;
rename my_table_copy to my_table;
If your table had any indexes, constraints or triggers you would now need to re-add them - but it seems you don't have that issue!
You could create a PL/SQL procedure that loops and update and commit the table every n row count -- Say every 20.000 rows. I do not advise to update the full table as it will create a lock for a looong time and expose you to data loss in case of external factors.
The answer is NO.
Even if you specify both conditions in your WHERE clause as you stated, it won't help you to avoid a full scan of your table.
Even if one of your criteria will uniquely identify the row, it still won't help.
There is a real example tested on Oracle 12C ver.2 similar to your case. No indexes, no partitions, nothing. Just plain table with 4 columns
I have a table with 18mn records.
I also have CUSTOMER_ID which is a UNIQUE identifier for a row.
I also have ORDER_DATE column there.
Even if I do the query that you mentioned
update hit set status = 1 where customer_id = 408518625844 and order_date = '09-DEC-19';
it won't help me to avoid a full table scan. See below Execution Plan. Therefore under conditions, you've specified, you will be always getting the slowest execution time possible. Full Table Scan on 50mn rows is actually the worst-case scenario.
And pay attention to that Cost, it is 26539 on 18mn rows.
So if you have 50mn rows we can easily expect much more Cost for your query

Getting RID Lookup instead of Table Scan?

SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical, the only difference being that for the second query the RID Lookup in SQL Fiddle, is a Table Scan in production :(
I would like to understand possibilities that would cause this behavior? What kinds of things would you look at to help determine the cause of it using a table scan here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details though, and I can experiment with that more, but would like to have a understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chose by the optimiser, you would need to look at possible indexes that could be added/modified to make an index access more efficient, but this should be done with care as it can adversely affect other queries as well as possibly degrading insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data
If it's true that 99% of the rows would be omitted if it performed the query using the relevant index + RID then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignID in (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);

Index for nullable column

I have an index on a nullable column and I want to select all it's values like this:
SELECT e.ename
FROM emp e;
In the explain plan I see a FULL TABLE SCAN (even a hint didn't help)
SELECT e.ename
FROM emp e
WHERE e.ename = 'gdoron';
Does use the index...
I googled and found out there are no null entries in indexes, thus the first query can't use the index.
My question is simple: why there aren't null entries in indexes?
By default, relational databases ignore NULL values (because the relational model says that NULL means "not present"). So, Index does not store NULL value, consequently if you have null condition in SQL statement, related index is ignored (by default).
But you can suprass this problem, check THIS or THIS article.
If you're getting all of the rows from the table, why do you think it should use the index? A full table scan is the most efficient means to return all of the values. It has nothing to do with the nulls not being in the index and everything to do with the optimizer choosing the most efficient means of retrieving the data.
#A.B.Cade: It's possible that the optimizer could choose to use the index, but not likely. Let's say you've got a table with an indexed table with 100 rows, but only 10 values. If the optimizer uses the index, it has to get the 10 rows from the index, then expand it to 100 rows, whereas, with the full-table scan, it gets all 100 rows from the get-go. Here's an example:
create table test1 (blarg varchar2(10));
create index ak_test1 on test1 (blarg);
insert into test1
select floor(level/10) from dual connect by level<=100;
exec dbms_stats.gather_table_stats('testschema','test1');
exec dbms_stats.gather_index_stats('testschema','ak_test1');
EXPLAIN PLAN FOR
select * from test1;
My point is largely that this question is based largely on a flawed premise: that index-scans are intrinsically better that full-table scans. That is not always true, as this scenario demonstrates.
Should be noted, Bitmap-Indexes include rows that have NULL values.
But you should not create Bitmap-Index just because you like to have NULL values in
your index. Bitmap-Index indexes are intended for their use-case (see documentation)!
If you use them wrong, then your over-all performance may suffer significantly.
I am not sure the first query is pertinent in terms of index usage, at least the second could.
Anyway, while it is true that you cannot index a column containing a null value, there are ways to do it like for example:
create index MY_INDEX on emp(ename, 1);
notice the , 1) at the end which does the trick.