Optimize Oracle Between Date Statement - sql

I have an Oracle SQL query that selects entries for the current day, like so:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE)
AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
Here, the EVT_END field is of type DATE and T.TYPE is a NUMBER(15,0).
I'm sure that as the table grows (and time goes on), the date constraint will narrow the result set by a much larger factor than the type constraint, since there are only a very limited number of types.
So the basic question arising is: what's the best index to make the selection on the current date faster? I especially wonder what the advantages and disadvantages of a functional index on TRUNC(T.EVT_END) compared to a normal index on T.EVT_END would be. With a functional index the query would look something like this:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE)
AND T.TYPE = 123
Because other queries use the mentioned date constraints without the additional type selection (or maybe with some other fields), multicolumn indexes wouldn't help me a lot.
Thanks, I'd appreciate your hints.

Your index should be TYPE, EVT_END.
CREATE INDEX PIndex
ON MY_TABLE (TYPE, EVT_END)
The optimizer plan will first go through this index to find the TYPE=123 section. Then, within TYPE=123, it will have the EVT_END timestamps sorted, so it can search the b-tree for the first date in the range and walk through the dates sequentially until one falls out of range.
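To verify that the optimizer actually chooses this access path, you could check the plan with DBMS_XPLAN (a sketch; substitute your real column list for *):
EXPLAIN PLAN FOR
SELECT *
FROM MY_TABLE T
WHERE T.TYPE = 123
AND T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
--expect an INDEX RANGE SCAN on PINDEX in the output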

Based on the query above, a functional index will provide no value. For a functional index to be used, the predicate in the query would need to be written as follows:
SELECT [fields]
FROM MY_TABLE T
WHERE TRUNC(T.EVT_END) BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400
AND T.TYPE = 123
The functional index on the column EVT_END is being ignored. It would be better to have a normal index on the EVT_END date. For a functional index to be used, the left-hand side of the condition must match the declaration of the functional index. I would probably write the query as:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE+1)
AND T.TYPE = 123
And I would create the following index:
CREATE INDEX bla on MY_TABLE( EVT_END )
This is assuming you are trying to find the events that ended within a day.
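Note that BETWEEN is inclusive at both ends, so TRUNC(SYSDATE+1) also matches events falling exactly at midnight of the following day. A half-open predicate avoids that edge case and can still use a plain index on EVT_END:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END >= TRUNC(SYSDATE)
AND T.EVT_END < TRUNC(SYSDATE) + 1
AND T.TYPE = 123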

Results
If your index is cached, a function-based index performs best. If your index is not cached, a compressed function-based index performs best.
Below are the relative times generated by my test code. Lower is better. You cannot compare the numbers between cached and non-cached, they are totally different tests.
                  In cache    Not in cache
Regular              120          139
FBI                  100          138
Compressed FBI       126          100
I'm not sure why the FBI performs better than the regular index. (Although it's probably related to what you said about equality predicates versus range. You can see that the regular index has an extra "FILTER" step in its explain plan.) The compressed FBI has some additional overhead to uncompress the blocks. This small amount of extra CPU time is relevant when everything is already in memory, and CPU waits are most important. But when nothing is cached, and IO is more important, the reduced space of the compressed FBI helps a lot.
Assumptions
There seems to be a lot of confusion about this question. The way I read it, you only care about this one specific query, and you want to know whether a function-based index or a regular index will be faster.
I assume you do not care about other queries that may benefit from this index, additional time spent to maintain the index, if the developers remember to use it, or whether or not the optimizer chooses the index. (If the optimizer doesn't choose the index, which I think is unlikely, you can add a hint.) Let me know if any of these assumptions are wrong.
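For reference, such a hint might look like this (using the table and index names from the test code below):
SELECT /*+ INDEX(T MY_TABLE1_IDX) */ COUNT(*)
FROM MY_TABLE1 T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400;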
Code
--Create tables. 1 = regular, 2 = FBI, 3 = Compressed FBI
create table my_table1(evt_end date, type number) nologging;
create table my_table2(evt_end date, type number) nologging;
create table my_table3(evt_end date, type number) nologging;
--Create 1K days, each with 100K values
begin
for i in 1 .. 1000 loop
insert /*+ append */ into my_table1
select sysdate + i - 500 + (level * interval '1' second), 1
from dual connect by level <= 100000;
commit;
end loop;
end;
/
insert /*+ append */ into my_table2 select * from my_table1;
insert /*+ append */ into my_table3 select * from my_table1;
--Create indexes
create index my_table1_idx on my_table1(evt_end);
create index my_table2_idx on my_table2(trunc(evt_end));
create index my_table3_idx on my_table3(trunc(evt_end)) compress;
--Gather statistics
begin
dbms_stats.gather_table_stats(user, 'MY_TABLE1');
dbms_stats.gather_table_stats(user, 'MY_TABLE2');
dbms_stats.gather_table_stats(user, 'MY_TABLE3');
end;
/
--Get the segment size.
--This shows the main advantage of a compressed FBI, the lower space.
select segment_name, bytes/1024/1024/1024 GB
from dba_segments
where segment_name like 'MY_TABLE__IDX'
order by segment_name;
SEGMENT_NAME    GB
MY_TABLE1_IDX   2.0595703125
MY_TABLE2_IDX   2.0478515625
MY_TABLE3_IDX   1.1923828125
--Test block.
--Uncomment different lines to generate 6 different test cases.
--Regular, Function-based, and Function-based compressed. Both cached and not-cached.
declare
v_count number;
v_start_time number;
v_total_time number := 0;
begin
--Uncomment two lines to test the server when it's "cold", and nothing is cached.
for i in 1 .. 10 loop
execute immediate 'alter system flush buffer_cache';
--Uncomment one line to test the server when it's "hot", and everything is cached.
--for i in 1 .. 1000 loop
v_start_time := dbms_utility.get_time;
SELECT COUNT(*)
INTO V_COUNT
--#1: Regular
FROM MY_TABLE1 T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) AND TRUNC(SYSDATE) + 86399/86400;
--#2: Function-based
--FROM MY_TABLE2 T
--WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE);
--#3: Compressed function-based
--FROM MY_TABLE3 T
--WHERE TRUNC(T.EVT_END) = TRUNC(SYSDATE);
v_total_time := v_total_time + (dbms_utility.get_time - v_start_time);
end loop;
dbms_output.put_line('Seconds: '||v_total_time/100);
end;
/
Test Methodology
I ran each block at least 5 times, alternated between run types (in case something was running on my machine only part of the time), threw out the high and the low run times, and averaged them. The code above does not include all that logic, since it would take up 90% of this answer.
Other Things to Consider
There are still many other things to consider. My code assumes the data is inserted in a very index-friendly order. Things will be totally different if this is not true, as compression may not help at all.
Probably the best solution to this problem is to avoid it completely with partitioning. For reading the same amount of data, a full table scan is much faster than an index read because it uses multi-block IO. But there are some downsides to partitioning, like the large amount of money required to buy the option, and extra maintenance tasks: creating partitions ahead of time, or using interval partitioning (which has some other weird issues), gathering stats, deferred segment creation, etc.
Ultimately, you will need to test this yourself. But remember that testing even such a simple choice is difficult. You need realistic data, realistic tests, and a realistic environment. Realistic data is much harder than it sounds. With indexes, you cannot simply copy the data and build the indexes all at once. A create table my_table1 as select ... followed by create index ... will produce a different index than creating the table and performing a bunch of inserts and deletes in a specific order.

@S1lence:
I believe a considerable amount of thought went into asking this question, and I took my time posting my answer here, as I don't like posting guesses as answers.
I would like to share my web research on this choice of a normal index on a date column versus an FBI.
Based on my understanding of the link below, if you are definitely going to use the TRUNC function, then you can strike out the option of a normal index, as that consulting site says:
Even though the column may have an index, the trunc built-in function will invalidate the index, causing sub-optimal execution with unnecessary I/O.
I suppose that clears it up: you have to go with an FBI if you are going to use TRUNC for sure. Please let me know if my reply makes sense.
Oracle SQL Tuning with function-based indexes
Cheers,
Lakshmanan C.

The decision over whether or not to use a function-based index should be driven by how you plan to write your queries. If all your queries against the date column will be in the form TRUNC(EVT_END), then you should use the FBI. However, in general it will be better to create an index on just EVT_END, for the following reasons:
It will be more reusable. If you ever have queries checking particular times of the day then you can't use TRUNC (see the sketch after this list).
There will be more distinct keys in the index using just the date. If you have 1,000 different times inserted during a day, EVT_END will have 1,000 distinct keys, whereas TRUNC(EVT_END) will only have 1 (this assumes that you're storing the time component and not just midnight for all the dates - in the second case both will have 1 distinct key per day). This matters because the more distinct values an index has, the higher its selectivity and the more likely it is to be used by the optimizer (see this)
The clustering factor is likely to be different, but in the case of using trunc it's more likely to go up, not down as stated in some other comments. This is because the clustering factor represents how closely the order of the values in the index match the physical storage of the data. If all your data is inserted in date order then a plain index will have the same order as the physical data. However, with TRUNC all times on a single day will map to the same value, so the order of rows in the index may be completely different to the physical data. Again, this means the trunc index is less likely to be used. This will entirely depend on your database's insertion/deletion patterns however.
Developers are more likely to write queries against where trunc isn't applied to the column (in my experience). Whether this holds true for you will depend upon your developers and the quality controls you have around deployed SQL.
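To illustrate the reusability point from the first item above: a query restricted to part of a day can use a plain index on EVT_END, but not an index on TRUNC(EVT_END). A sketch:
SELECT [fields]
FROM MY_TABLE T
WHERE T.EVT_END BETWEEN TRUNC(SYSDATE) + 9/24 AND TRUNC(SYSDATE) + 17/24
AND T.TYPE = 123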
Personally, I would go with Marlin's answer of TYPE, EVT_END as a first pass. You need to test this in your environment, however, and see how it affects this query and all the others using the TYPE and EVT_END columns.

Related

Oracle query: date filter gets really slow

I have this Oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche@fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche@fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically I have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoices have not received any update in the last 15 days. I can do that by joining the notifiche table to itself (n2), getting the most recent record for each invoice, and evaluating the difference between the update date (data_ricezione) and the current date (sysdate).
When I add the commented condition, the query takes practically infinite time to complete (I mean hours; I never saw the end of it...).
How is it possible that this simple condition makes the query so slow?
How can I improve the performance?
Try to keep data_ricezione alone; if there's an index on it, it might help.
So: switch from
and sysdate - data_ricezione > 15
to
and -data_ricezione > 15 - sysdate --> then multiply both sides by (-1), flipping the inequality
to
and data_ricezione < sysdate - 15
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche@fe_engine2fe_gateway n
...
Use an analytic function to avoid a self-join over a database link. The query below only reads the table once, divides the rows into windows, finds the MAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at first, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezione
from
(
select id_sdi, data_ricezione, max(data_ricezione) over (partition by id_sdi) max_date
from notifiche@fe_engine2fe_gateway
)
where sysdate - max_date > 15;
As for why adding a simple condition can make the query slow - it's all about cardinality estimates. Cardinality, the number of rows, drives most of the database optimizer's decision. The best way to join a small amount of data may be very different than the best way to join a large amount of data. Oracle must always guess how many rows are returned by an operation, to know which algorithm to use.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to guess the number of rows filtered out by sysdate-data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value for the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a break down of how many rows are in different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
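For reference, the manual gather and the hint mentioned above might be sketched like this (the gather has to run on the database that owns the table, and level 4 for the sampling is an arbitrary choice):
BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'NOTIFICHE');
END;
/
SELECT /*+ dynamic_sampling(n 4) */ TRUNC(SYSDATE - n.data_ricezione) AS delay
FROM notifiche@fe_engine2fe_gateway n
WHERE n.data_ricezione < SYSDATE - 15;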

What are the performance implications of Oracle IN Clause with no joins?

I have a query in this form that will on average have ~100 IN-clause elements, and at some rare times > 1000 elements. If there are more than 1000 elements, we chunk the IN clause into groups of 1000 (an Oracle maximum).
The SQL is in the form of
SELECT * FROM tab WHERE PrimaryKeyID IN (1,2,3,4,5,...)
The tables I am selecting from are huge and will contain millions more rows than what is in my in clause. My concern is that the optimizer may elect to do a table scan (our database does not have up to date statistics - yeah - I know ...)
Is there a hint I can pass to force the use of the primary key - WITHOUT knowing the index name of the primary Key, perhaps something like ... /*+ DO_NOT_TABLE_SCAN */?
Are there any creative approaches to pulling back the data such that
We perform the least number of round-trips
We read the least number of blocks (at the logical IO level)?
Will this be faster?
SELECT * FROM tab WHERE PrimaryKeyID = 1
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 2
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 3
UNION ....
If the statistics on your table are accurate, it should be very unlikely that the optimizer would choose to do a table scan rather than using the primary key index when you only have 1000 hard-coded elements in the WHERE clause. The best approach would be to gather (or set) accurate statistics on your objects since that should cause good things to happen automatically rather than trying to do a lot of gymnastics in order to work around incorrect statistics.
If we assume that the statistics are inaccurate to the degree that the optimizer would be led to believe that a table scan would be more efficient than using the primary key index, you could potentially add in a DYNAMIC_SAMPLING hint that would force the optimizer to gather more accurate statistics before optimizing the statement or a CARDINALITY hint to override the optimizer's default cardinality estimate. Neither of those would require knowing anything about the available indexes, it would just require knowing the table alias (or name if there is no alias). DYNAMIC_SAMPLING would be the safer, more robust approach but it would add time to the parsing step.
If you are building up a SQL statement with a variable number of hard-coded parameters in an IN clause, you're likely going to be creating performance problems for yourself by flooding your shared pool with non-sharable SQL and forcing the database to spend a lot of time hard parsing each variant separately. It would be much more efficient if you created a single sharable SQL statement that could be parsed once. Depending on where your IN clause values are coming from, that might look something like
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM global_temporary_table);
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM TABLE( nested_table ));
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM some_other_source);
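The TABLE( nested_table ) variant assumes a SQL collection type that the application can bind; a minimal sketch, with made-up names:
CREATE TYPE number_list AS TABLE OF NUMBER;
/
SELECT *
FROM table_name
WHERE primary_key IN (SELECT column_value
                      FROM TABLE(CAST(:ids AS number_list)));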
If you got yourself down to a single sharable SQL statement, then in addition to avoiding the cost of constantly re-parsing the statement, you'd have a number of options for forcing a particular plan that don't involve modifying the SQL statement. Different versions of Oracle have different options for plan stability-- there are stored outlines, SQL plan management, and SQL profiles among other technologies depending on your release. You can use these to force particular plans for particular SQL statements. If you keep generating new SQL statements that have to be re-parsed, however, it becomes very difficult to use these technologies.

resource busy while rebuilding an index

There is a table T with column a:
CREATE TABLE T (
id_t integer not null,
text varchar2(100),
a integer
)
/
ALTER TABLE T ADD CONSTRAINT PK_T PRIMARY KEY (ID_T)
/
Index was created like this:
CREATE INDEX IDX_T$A ON T(a);
Also there's such a check constraint:
ALTER TABLE T ADD CHECK (a is null or a = 1);
Most of the records in T have a null value of a, so a query using the index works really fast if the index is in a consistent state and its statistics are up to date.
But the problem is that the values of a in some rows change really frequently (some rows get a null value, some get 1), so I need to rebuild the index, let's say, every hour.
However, quite often the job that tries to rebuild the index gets an exception:
ORA-00054: resource busy and acquire with NOWAIT specified
Can anybody help me deal with this issue?
Index rebuild is not needed in most cases. Of course newly created indexes are efficient and their efficiency decreases over time. But this process stops after some time - it simply converges to some level.
If you really need to optimize indexes try to use less invasive DDL command "ALTER INDEX SHRINK SPACE COMPACT".
PS: I would also recommend using a smaller block size (4K or 8K) for your tablespace storage.
Have you tried adding "ONLINE" to that index rebuild statement?
Edit: If online rebuild is not available then you might look at a fast refresh on commit materialised view to store the rowid's or primary keys of rows that have a 1 for column A.
Start with a look at the documentation:
http://docs.oracle.com/cd/B28359_01/server.111/b28326/repmview.htm
http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_6002.htm#SQLRF01302
You'd create a materialised view log on the table, and then a materialised view.
Think in particular about the resource requirements for this: changes to the master table require a change vector to be written to the materialised view log, which is effectively an additional insert for every change. Then the changes have to be propagated to another table (the materialised view storage table) with additional queries. It is by no means a low-impact option.
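A rough, untested sketch of that approach (names invented; the exact fast-refresh requirements vary by Oracle version):
CREATE MATERIALIZED VIEW LOG ON t WITH ROWID (a);

CREATE MATERIALIZED VIEW mv_t_a1
  REFRESH FAST ON COMMIT
AS
SELECT rowid AS rid, id_t
FROM t
WHERE a = 1;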
Rebuilding for Performance
Most Oracle experts are skeptical of frequently rebuilding indexes. For example, a quick glance at the presentation Rebuilding the Truth will show you that indexes do not behave in the naive way many people assume they do.
One of the relevant points in that presentation is "fully deleted blocks are recycled and are not generally problematic". If your values completely change, then your index should not grow infinitely large. Although your indexes are used in a non-typical way, that behavior is probably a good thing.
Here's a quick example. Create 1 million rows and index 100 of them.
--Create table, constraints, and index.
CREATE TABLE T
(
id_t integer primary key,
text varchar2(100),
a integer check (a is null or a = 1)
);
CREATE INDEX IDX_T$A ON T(a);
--Insert 1M rows, with 100 "1"s.
insert into t
select level, level, case when mod(level, 10000) = 0 then 1 else null end
from dual connect by level <= 1000000;
commit;
--Initial sizes:
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME   MB
T              19
IDX_T$A        0.0625
Now completely shuffle the index rows around 1000 times.
--Move the 1s around 1000 times. Takes about 6 minutes.
begin
for i in 9000 .. 10000 loop
update t
set a = case when mod(id_t, i) = 0 then 1 else null end
--Don't update if the value is the same
where nvl(a,-1) <> nvl(case when mod(id_t,i) = 0 then 1 else null end,-1);
commit;
end loop;
end;
/
The index segment size is still the same.
--The index size is the same.
select segment_name, bytes/1024/1024 MB
from dba_segments
where segment_name in ('T', 'IDX_T$A');
SEGMENT_NAME   MB
T              19
IDX_T$A        0.0625
Rebuilding for Statistics
It's good to worry about the statistics of objects whose data changes so dramatically. But again, although your system is unusual, it may work fine with the default Oracle behavior. Although the rows indexed may completely change, the relevant statistics may stay the same. If there are always 100 rows indexed, the number of rows, blocks, and distinctness will stay the same.
Perhaps the clustering factor will significantly change, if the 100 rows shift from being completely random to being very close to each other. But even that may not matter. If there are millions of rows, but only 100 indexed, the optimizer's decision will probably be the same regardless of the clustering factor. Reading 1 block (awesome clustering factor) or reading 100 blocks (worst-case clustering factor) will still look much better than doing a full table scan of millions of rows.
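If you want to watch those statistics yourself, they are all visible in the data dictionary (index name from the example above):
SELECT index_name, num_rows, distinct_keys, clustering_factor
FROM user_indexes
WHERE index_name = 'IDX_T$A';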
But statistics are complicated; I'm surely over-simplifying things. If you need to keep your statistics a specific way, you may want to lock them. Unfortunately you can't lock just an index, but you can lock the table and its dependent indexes.
begin
dbms_stats.lock_table_stats(ownname => user, tabname => 'T');
end;
/
Rebuilding anyway
If a rebuild is still necessary, @Robe Eleckers' idea to retry should work. Although instead of retrying on the exception, it would be easier to set DDL_LOCK_TIMEOUT.
alter session set ddl_lock_timeout = 500;
The session will still need to get an exclusive lock on the table, but this will make it much easier to find the right window of opportunity.
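For completeness, the retry idea might be sketched like this (DBMS_SESSION.SLEEP assumes a recent Oracle release; on older versions it is DBMS_LOCK.SLEEP):
DECLARE
  resource_busy EXCEPTION;
  PRAGMA EXCEPTION_INIT(resource_busy, -54);  --ORA-00054
BEGIN
  LOOP
    BEGIN
      EXECUTE IMMEDIATE 'ALTER INDEX idx_t$a REBUILD ONLINE';
      EXIT;
    EXCEPTION
      WHEN resource_busy THEN
        DBMS_SESSION.SLEEP(1);  --wait a second, then try again
    END;
  END LOOP;
END;
/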
Since the field in question has very low cardinality I would suggest using a bitmap index and skipping the rebuilds altogether.
CREATE BITMAP INDEX IDX_T$A ON T(a);
Note (as mentioned in comments): transactional performance is very low for bitmap indexes so this would only work well if there are very few overlapping transactions doing updates to the table.

Fast way to discover the row count of a table in PostgreSQL

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact, as seems to be the case here.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM run often enough - where "enough" is defined by the level of write activity on your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different from relpages then reltuples is scaled accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count passes my constant value, it will stop the counting (and not wait to finish the counting to inform me that the row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
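To package the bounded count together with the cap, a small function might look like this (a sketch, using the token table from the question):
CREATE OR REPLACE FUNCTION bounded_count(_limit bigint)
  RETURNS bigint
  LANGUAGE sql STABLE AS
$func$
SELECT count(*) FROM (SELECT 1 FROM token LIMIT _limit) sub;
$func$;

-- usage: returns the exact count up to 500000, else 500000
SELECT bounded_count(500000);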
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate it was completely worthless, even to know within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
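The same technique can be wrapped in a function, along the lines of the count_estimate() example on the Postgres wiki (a sketch):
CREATE OR REPLACE FUNCTION count_estimate(query text)
  RETURNS integer
  LANGUAGE plpgsql AS
$func$
DECLARE
  rec  record;
  rows integer;
BEGIN
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    -- pull the rows=... figure out of the first plan line that has one
    rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN rows IS NOT NULL;
  END LOOP;
  RETURN rows;
END
$func$;

SELECT count_estimate('SELECT 1 FROM foo');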
Reference taken from this blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column (see the sketch after this list).
Again, this is to decrease the workload (scan through a narrow column index)
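A sketch of the hash idea (generated columns need Postgres 12+, and hashtext() is an internal but long-available function; note that distinct texts can collide on the same hash):
ALTER TABLE token
  ADD COLUMN text_hash integer GENERATED ALWAYS AS (hashtext(text)) STORED;

CREATE INDEX ON token (text_hash);

-- group on the narrow hash instead of the wide text column
SELECT text_hash, count(*)
FROM token
GROUP BY text_hash;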
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is
In Oracle, you could use rownum to limit the number of rows returned. I am guessing a similar construct exists in other SQL dialects as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.

Checking for the presence of text in a text column efficiently

I have a table with about 2,000,000 rows. I need to query one of the columns to retrieve the rows where a string exists as part of the value.
When I run the query I will know the position of the string, but not beforehand. So a view which takes a substring is not an option.
As far as I can see I have three options
using like '% %'
using instr
using substr
I do have the option of creating a function based index, if I am nice to the dba.
At the moment all queries are taking about two seconds. Does anyone have experience of which of these options will work best, or if there is another option? The select will be used for deletes every few seconds; it will typically select about 10 rows.
edit with some more info
The problem comes about as we are using a table for storing objects with arbitrary keys and values. The objects come from outside our system, so we have limited scope to control them; the text column is something like 'key1=abc,key2=def,keyn=ghi'. I know this is horribly denormalised, but as we don't know what the keys will be (to some extent) it is a reliable way to store and retrieve values. Retrieving a row by the whole value of the column is fairly fast, as the column is indexed. But the performance is not good if we want to retrieve the rows with key2=def.
We may be able to create a table with columns for the most common keys, but I was wondering if there was a way to improve performance with the existing set up.
In Oracle 10:
CREATE TABLE test (tst_test VARCHAR2(200));
CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
SELECT *
FROM TEST
WHERE REGEXP_REPLACE(REGEXP_SUBSTR(TST_TEST, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1') = 'TEST'
This will use the newly created index.
You will need as many indexes as there are KEYs in your data.
The presence of an INDEX, of course, impacts insert performance, but it depends very little on a REGEXP being there:
SQL> CREATE INDEX ix_test ON test (tst_test)
2 /
Index created
Executed in 0,016 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 47,781 seconds
SQL> TRUNCATE TABLE test
2 /
Table truncated
Executed in 2,546 seconds
SQL> DROP INDEX ix_test
2 /
Index dropped
Executed in 0 seconds
SQL> CREATE INDEX ix_re_1 ON test(REGEXP_REPLACE(REGEXP_SUBSTR(tst_test, 'KEY1=[^,]*'), 'KEY1=([^,]*)', '\1'))
2 /
Index created
Executed in 0,015 seconds
SQL> INSERT
2 INTO test (tst_test)
3 SELECT 'KEY1=' || level || ';KEY2=' || (level + 10000)
4 FROM dual
5 CONNECT BY
6 LEVEL <= 1000000
7 /
1000000 rows inserted
Executed in 53,375 seconds
As you can see, on my not very fast machine (Core2 4300, 1 GB RAM) you can insert 20,000 records per second into an indexed field, and this rate almost does not depend on the type of INDEX being used: plain or function-based.
You can use Tom Kyte's runstats package to compare the performance of different implementations - running each say 1000 times in a loop. For example, I just compared LIKE with SUBSTR and it said that LIKE was faster, taking about 80% of the time of SUBSTR.
Note that "col LIKE '%xxx%'" is different from "SUBSTR(col,5,3) = 'xxx'". The equivalent LIKE would be:
col LIKE '____xxx%'
using one '_' for each leading character to be ignored.
I think whichever way you do it, the results will be similar - it always involves a full table (or perhaps full index) scan. A function-based index would only help if you knew the offset of the substring at the time of creating the index.
I am rather concerned when you say that "The select will be used for deletes every few seconds". This does rather suggest a design flaw somewhere, but without knowing the requirements it's hard to say.
UPDATE:
If your column values are like 'key1=abc,key2=def,keyn=ghi' then perhaps you could consider adding another table like this:
create table key_values
( main_table_id number references main_table
, key_value varchar2(50)
, primary key (main_table_id, key_value)
);
create index key_values_idx on key_values (key_value);
Split the key values up and store them in this table like this:
main_table_id   key_value
123             key1=abc
123             key2=def
123             key3=ghi
(This could be done in an AFTER INSERT trigger on main_table for example)
Then your delete could be:
delete main_table
where id in (select main_table_id from key_values
where key_value = 'key2=def');
Can you provide a bit more information?
Are you querying for an arbitrary substring of a string column, or is there some syntax on the strings store in the columns that would allow for some preprocessing to minimize repeated work?
Have you already done any timing tests on your three options to determine their relative performance on the data you're querying?
I suggest reconsidering your logic.
Instead of looking for where a string exists, it may be faster to check if it has a length of >0 and is not a string.
You can use the TRANSLATE function in Oracle to convert all non-string characters to nulls and then check whether the result is null.
Separate answer to comment on the table design.
Can't you at least have a KEY/VALUE structure, so instead of storing in a single column, 'key1=abc,key2=def,keyn=ghi' you would have a child table like
KEY    VALUE
key1   abc
key2   def
key3   ghi
Then you can create a single index on key and value and your queries are much simpler (since I take it you are actually looking for an exact match on a given key's value).
Some people will probably comment that this is a horrible design, but I think it's better than what you have now.
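A sketch of that structure in Oracle (all names invented):
CREATE TABLE object_attributes
( object_id  NUMBER        NOT NULL
, key_name   VARCHAR2(50)  NOT NULL
, key_value  VARCHAR2(100)
, CONSTRAINT object_attributes_pk PRIMARY KEY (object_id, key_name)
);
CREATE INDEX object_attributes_kv ON object_attributes (key_name, key_value);

-- an exact-match lookup then becomes a simple indexed query:
SELECT object_id
FROM object_attributes
WHERE key_name = 'key2'
AND key_value = 'def';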
If you're always going to be looking for the same substring, then using INSTR and a function-based index makes sense to me. You could also do this if you have a small set of constant substrings you will be looking for, creating one FBI for each one.
Quassnoi's REGEXP idea looks promising too. I haven't used regular expressions inside Oracle yet.
I think that Oracle Text would be another way to go. Info on that here
Not sure about improving the existing setup, but Lucene (a full-text search library, ported to many platforms) can really help. There's the extra burden of synchronizing the index with the DB, but if you have anything that resembles a service layer in some programming language this becomes an easy task.
Similar to Anton Gogolev's response, Oracle does incorporate a text search engine documented here
There's also extensible indexing, so you can build your own index structures, documented here
As you've agreed, this is a very poor data structure, and I think you will struggle to achieve the aim of deleting stuff every few seconds. Depending on how this data gets input, I'd look at properly structuring the data on load, at least to the extent of having rows of "parent_id", "key_name", "key_value".