Oracle query: date filter gets really slow - sql

I have this oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche@fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche@fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically I have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoices have not received any update in the last 15 days. I can do it by joining the notifiche n2 table, getting the most recent record for each invoice, and evaluating the difference between the update date (data_ricezione) and the current date (sysdate).
When I add the commented condition, the query then takes forever to complete (I mean hours; I never saw the end of it...).
How is it possible that this simple condition makes the query so slow?
How can I improve the performance?

Try to keep data_ricezione alone; if there's an index on it, it might help.
So: switch from
and sysdate - data_ricezione > 15
via the intermediate step (plain algebra, not valid SQL; multiplying both sides by -1 flips the comparison)
and -data_ricezione > 15 - sysdate
to
and data_ricezione < sysdate - 15
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche@fe_engine2fe_gateway n
...
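Putting the two suggestions together, a minimal sketch of the full query might look like the following. It assumes the table is reached through the database link fe_engine2fe_gateway (table@link syntax) and that an index on data_ricezione exists on the remote side; whether either change helps depends on the actual execution plan.

```sql
-- Sketch only: the query from the question plus both suggestions.
SELECT /*+ driving_site(n) */
       TRUNC(sysdate - n.data_ricezione) AS delay
FROM   notifiche@fe_engine2fe_gateway n
WHERE  NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
       (SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
        FROM   notifiche@fe_engine2fe_gateway n2
        WHERE  n.id_sdi = n2.id_sdi)
AND    n.data_ricezione < sysdate - 15;  -- column left bare so an index on it can be considered
```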

Use an analytic function to avoid a self-join over a database link. The query below only reads from the table once, divides the rows into windows, finds the MAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at first, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezione
from
(
select id_sdi, data_ricezione, max(data_ricezione) over (partition by id_sdi) max_date
from notifiche@fe_engine2fe_gateway
)
where sysdate - max_date > 15;
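If only one row per invoice is needed, rather than every notification row of each stale invoice, a plain aggregate is another option. This is a different technique from the analytic function, shown here only as a sketch; it also reads the table once. Note that invoices whose data_ricezione values are all NULL are not returned.

```sql
-- One row per invoice whose latest notification is older than 15 days.
SELECT id_sdi,
       MAX(data_ricezione)                  AS last_update,
       TRUNC(sysdate - MAX(data_ricezione)) AS delay
FROM   notifiche@fe_engine2fe_gateway
GROUP  BY id_sdi
HAVING MAX(data_ricezione) < sysdate - 15;
```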
As for why adding a simple condition can make the query slow - it's all about cardinality estimates. Cardinality, the number of rows, drives most of the database optimizer's decisions. The best way to join a small amount of data may be very different than the best way to join a large amount of data. Oracle must always guess how many rows are returned by an operation, to know which algorithm to use.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to guess the number of rows filtered out by sysdate-data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value for the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a breakdown of how many rows are in different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
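As a rough illustration of two of those options, the sketch below gathers statistics manually and then uses a dynamic sampling hint. The table name NOTIFICHE, the current schema (USER), and sampling level 4 are assumptions; the statistics call has to run on the database that actually owns the table, which is the remote one when the table is accessed over a link.

```sql
-- Manually refresh optimizer statistics for one table.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => USER,         -- assumed: table lives in the current schema
    tabname => 'NOTIFICHE'   -- assumed table name
  );
END;
/

-- Alternatively, let the optimizer sample the table at parse time.
SELECT /*+ dynamic_sampling(n 4) */ COUNT(*)
FROM   notifiche n
WHERE  n.data_ricezione < sysdate - 15;
```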

Related

How to utilize table partition in oracle database in effective manner?

I have created a partitioned table as
CREATE TABLE orders_range(order_id NUMBER
,client_id NUMBER
,order_date DATE)
PARTITION BY RANGE(order_date)
(PARTITION orders2011 VALUES LESS THAN (to_date('1/1/2012','dd/mm/yyyy'))
,PARTITION orders2012 VALUES LESS THAN (to_date('1/1/2013','dd/mm/yyyy'))
,PARTITION orders2013 VALUES LESS THAN (MAXVALUE));
when I am selecting the records using
SELECT * FROM ORDERS_RANGE partition(orders2011);
In the explain plan, the CPU cost is 75,
but when I run a normal query with a WHERE clause, the CPU cost is only 6. So what is the advantage of table partitioning when it comes to performance?
Can anyone explain this to me in detail?
Thanks in advance.
First, you generally can't directly compare the cost of two different plans running against two different objects. It is entirely possible that one plan with a cost of 10,000 will run much more quickly than a different plan with a cost of 10. You can compare the cost of two different plans for a single SQL statement within a single 10053 trace (so long as you remember that these are estimates and if the optimizer estimates incorrectly, many cost values are incorrect and the optimizer is likely to pick a less efficient plan). It may make sense to compare the cost between two different queries if you are trying to work out the algorithm the optimizer is using for a particular step but that's pretty unusual.
Second, in your example, you're not inserting any data. Generally, if you're going to partition a table, you're doing so because you have multiple GB of data in that table. If you compare something like
SELECT *
FROM unpartitioned_table_with_1_billion_rows
vs
SELECT *
FROM partitioned_table_with_1_billion_rows
WHERE partition_key = date '2014-04-01' -- Restricts the data to only 10 million rows
the partitioned approach will, obviously, be more efficient not least of all because you're only reading the 10 million rows in the April 1 partition rather than the 1 billion rows in the table.
If the table has no data, it's possible that the query against the partitioned table would be a tiny bit less efficient since you've got to do more things in the course of parsing the query. But reading 0 rows from a 0 row table is going to take essentially no time either way so the difference in parse time is likely to be irrelevant.
In general, you wouldn't ever use the ORDERS_RANGE partition(orders2011) syntax to access data. Hard-coding the partition name means that you'd often resort to dynamic SQL to assemble the query, that you'd do a lot more hard parsing, that you'd put more pressure on the shared pool, and that you'd risk making a mistake if someone changed the partitioning on the table. It makes far more sense to supply a predicate on the partition key and to let Oracle work out how to appropriately prune the partitions. In other words
SELECT *
FROM orders_range
WHERE order_date < date '2012-01-01'
would be a much more sensible query.
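To check that Oracle really prunes partitions for such a predicate, one option is to look at the plan. The sketch below uses EXPLAIN PLAN together with DBMS_XPLAN.DISPLAY against the orders_range table from the question.

```sql
EXPLAIN PLAN FOR
SELECT *
FROM   orders_range
WHERE  order_date < DATE '2012-01-01';

-- The Pstart and Pstop columns of the plan output show which partitions are read.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```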

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading off storage space and execution time, ideally, to achieve something running in under a minute?
I am not a database expert and have been reading Oracle performance tuning docs but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table was fixed and I simply needed to apply different filters on the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
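A minimal sketch of what an index-organized version of table1 could look like follows. All names are hypothetical: DATE is a reserved word in Oracle, so the columns are called trade_date and series_value here. It also assumes that (id, trade_date) is unique, because an index-organized table requires a primary key.

```sql
-- Hypothetical index-organized copy of table1; rows are stored in primary key order.
CREATE TABLE table1_iot (
  id           NUMBER NOT NULL,
  trade_date   DATE   NOT NULL,
  series_value NUMBER,
  CONSTRAINT table1_iot_pk PRIMARY KEY (id, trade_date)
)
ORGANIZATION INDEX;
```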
Apart from indexes, also try the following. My two cents!
1) Try running this query in parallel across multiple processors, with a hint such as /*+ PARALLEL(table1,4) */.
2) NVL is applied to millions of rows, which will have some impact. Is there any way the data can be organised to avoid it?
3) If you know the dates in advance, you could split the query into two steps: first fetch the ids from TABLE2 using the start date and end date, then join that result to TABLE1 through a view or temp table. That way the index (with id as its leading edge) is used optimally.
Thanks!

Ad hoc queries against high cardinality columns

How to improve the performance of ad hoc queries against tables having hundreds of high cardinality columns and millions of records?
In my case, I have a table with one indexed DATE column SDATE, one VARCHAR2 column NE and 750 numeric columns most of them high cardinality columns with values in the range of 0 to 100. The table is updated with almost 20000 new records every hour. The queries against this table look like:
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND V1 > :V1 AND V3 < :V3
or
SELECT * FROM TAB WHERE SDATE BETWEEN :SDATE AND :EDATE AND NE = :NE AND V4 > :V4
etc.
So far, I have always advised users not to enter large date intervals, so as to limit the number of records returned by the date index access path; however, from time to time it becomes necessary to specify bigger intervals.
If V1, V2, ..., V750 were all low cardinality columns, I would have been able to utilize bitmap indexes. Unfortunately they are not.
What's the advice on this? How should I tackle this problem?
Thanks.
I assume you're stuck with the design, so a few thoughts that I'd probably look at -
1) use partitions - if you have partitioning option
2) use some triggers to denormalise (or normalise in this case) a query table which is more optimised for the query usage
3) make some snapshots
4) look at having a current table or set of tables which has the day's records (or some suitable subset), and roll them over to a big table to store history.
It depends on usage patterns and all the other constraints the system has - this may get you started, if you have more details a better solution is probably out there.
I think the big problem would be the inserts. You have an index on sdate, which slows the inserts and speeds up the selects. But, returning to your problem:
If users specify an interval which is large (let's say >5%), it is better to have the table partitioned by sdate in a daily, weekly, or monthly manner.
Oracle partitioning docs
(If you partition the table, don't forget to partition the index as well. And if you want to do it live, use exchange partition.)
Also, as a workaround, if you have a powerful machine, you may use parallel queries.
Oracle Parallel docs
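As a sketch of the partitioning suggestion, the statements below create a daily range-partitioned copy of the table with a locally partitioned index. The names tab_part, p_initial, and tab_part_sdate_ix, the column sizes, and the starting date are all assumptions; INTERVAL partitioning needs Oracle 11g or later and the Partitioning option.

```sql
-- Hypothetical copy of TAB, range-partitioned by day on SDATE.
CREATE TABLE tab_part (
  sdate DATE NOT NULL,
  ne    VARCHAR2(100),
  v1    NUMBER,
  v3    NUMBER
  -- remaining V columns omitted from this sketch
)
PARTITION BY RANGE (sdate)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p_initial VALUES LESS THAN (DATE '2020-01-01'));

-- LOCAL partitions the index the same way as the table.
CREATE INDEX tab_part_sdate_ix ON tab_part (sdate) LOCAL;
```

With this layout, a query that filters on SDATE BETWEEN :SDATE AND :EDATE only has to touch the partitions covering that date range.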

Oracle: How can I find tablespace fragmentation?

I've a JOIN between two tables. It's really really slow and I can't find why.
The query takes hours in a PRODUCTION environment on a very big Client.
Please ask for any details you need to understand why it doesn't work well.
I can add indexes, partition the table, etc. It's Oracle 10g.
I expect a few thousand records because of the following condition:
f.eif_campo1 != c.fornitura AND f.field29 = 'New'
In fact it should always be verified for all 18 million records
SELECT c.id_messaggio
,f.campo1
,c.f
FROM
flows c,
tab f
WHERE
f.field198 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field1 != c.ExampleF
and f.field29 = 'New'
and c.processtype in ('Example1')
and c.flag_ann = 'N';
Selectivity for the following record expressed as number of distinct values:
COUNT (DISTINCT extra_id) =>17*10^6,
COUNT (DISTINCT (extra_id || field20)) =>17*10^6,
COUNT (DISTINCT field198) =>36*10^6,
COUNT (DISTINCT (field19 || field20)) =>45*10^6,
COUNT (DISTINCT (field1)) =>18*10^6,
COUNT (DISTINCT (field20)) =>47
This is the execution plan: [image: execution plan]
Extra details:
I have relaxed one condition to see how many records are taken. 300 thousand.
--03:57 mins with parallel execution /*+ parallel(c 8) parallel(f 24) */
--395.358 rows
SELECT count(1)
FROM
flows c,
flet f
WHERE
f.field19 = c.id_messaggio
AND f.extra_id = c.extra_id
and f.field20 = 'ExampleF'
and c.process_type in ('ExampleP')
and c.flag_ann = 'N';
Your explain plan shows the following.
The database uses an index to retrieve rows from ENI_FLUSSI_HUB where
flh_tipo_processo_cod in ('VT','VOLTURA_ENI','CC')
It then winnows the rows
where flh_flag_ann = 'N'
This produces a result set which is used to access
rows from ETL_ELAB_INTERF_FLAT on the basis of f.idde_identif_dati_ext_id =
c.idde_identif_dati_ext_id
Finally those rows are filtered on the basis of the
remaining parts of the WHERE clause.
Now, the starting point is a good one if flh_tipo_processo_cod is a selective
column: that is, if it contains hundreds of different values, or if the values in
your list are relatively rare. It might even be a good path if the flag column
identifies relatively few rows with a value of 'N'. So you need to understand
both the distribution of your data - how many distinct values you have - and its
skew - which values appear very often or hardly at all. The overall
performance suggests that the distribution and/or skew of the
flh_tipo_processo_cod and flh_flag_ann columns is not good.
So what can you do? One approach is to follow Ben's suggestion, and use full
table scans. If you have an Enterprise Edition licence and plenty of CPU capacity
you could try parallel query to improve things. That might still be too slow, or it might be too disruptive for other users.
An alternative approach would be to use better indexes. A composite index on
eni_flussi_hub(flh_tipo_processo_cod,flh_flag_ann,idde_identif_dati_ext_id,
flh_fornitura,flh_id_messaggio) would avoid the need to read that table. Whether
this would be a new index or a replacement for ENI_FLK_IDX3 depends on the other
activity against the table. You might be able to benefit from index compression.
All the columns in the query projection are referenced in the WHERE clause. So
you could also use a composite index on the other table to avoid table reads. Again you need to understand the distribution and skew of the data. But you should probably lead with the least selective columns. Something like etl_elab_interf_flat(etl_elab_interf_flat,eif_campo200,dde_identif_dati_ext_id,eif_campo1,eif_campo198). Probably this is a new index. It's unlikely you would want to replace ETL_EIF_FK_IDX4 with this (especially if that really is an index on a foreign key constraint).
Of course these are just guesses on my part. Tuning is a science and to do it properly requires lots of data. Use the Wait Interface to investigate where the database is spending its time. Use the 10053 event to understand why the Optimizer makes the choices it does. But above all, don't implement partitioning unless you really know the ramifications.
The simple answer seems to be your explain plan. You're accessing both tables by index rowid. Whilst to select a single row you cannot - to my knowledge - get faster, in your case you're selecting a lot more than a single row.
This means that, for every single row, you're going into both tables one row at a time, which, when you're looking at a significant proportion of a table or index, is not what you want to do.
My suggestion would be to force a full scan of one or both of your tables. Try to use the smaller as a driver first:
SELECT /*+ full(c) */ c.flh_id_messaggio
, f.eif_campo1
, c.f
FROM flows c
JOIN flet f
ON f.field19 = c.flh_id_messaggio
AND f.extra_id = c.extra_id
AND f.field1 <> c.f
WHERE ...
But you may have to change /*+ full(c) */ to /*+ full(c) full(f) */.
Your indexes seem to be separate column indexes as well. For this, and if possible, I would have indexes on:
flows of id_messaggio, extra_id, f
and on flet of field19, extra_id, field1.
This will only really matter if you do not use a full scan, or if everything you're returning and selecting is in one index.
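The two composite indexes described above could be created roughly as follows. The index names are made up for illustration; the column lists are the ones given in the text.

```sql
-- Composite index on flows: join columns first, then the compared column.
CREATE INDEX flows_msg_extra_f_ix ON flows (id_messaggio, extra_id, f);

-- Matching composite index on flet.
CREATE INDEX flet_f19_extra_f1_ix ON flet (field19, extra_id, field1);
```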

Fast way to discover the row count of a table in PostgreSQL

I need to know the number of rows in a table to calculate a percentage. If the total count is greater than some predefined constant, I will use the constant value. Otherwise, I will use the actual number of rows.
I can use SELECT count(*) FROM table. But if my constant value is 500,000 and I have 5,000,000,000 rows in my table, counting all rows will waste a lot of time.
Is it possible to stop counting as soon as my constant value is surpassed?
I need the exact number of rows only as long as it's below the given limit. Otherwise, if the count is above the limit, I use the limit value instead and want the answer as fast as possible.
Something like this:
SELECT text,count(*), percentual_calculus()
FROM token
GROUP BY text
ORDER BY count DESC;
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact like it seems to be in your case.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
Typically, the estimate is very close. How close, depends on whether ANALYZE or VACUUM are run enough - where "enough" is defined by the level of write activity to your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
This does not work for partitioned tables because relpages is always -1 for the parent table (while reltuples contains an actual estimate covering all partitions) - tested in Postgres 14.
You have to add up estimates for all partitions instead.
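One way to add up the per-partition estimates is sketched below. It assumes Postgres 12 or later, where pg_partition_tree() is available, and it skips partitions whose reltuples is negative because they have never been analyzed, so the result can undercount.

```sql
-- Sum the stored row estimates of all leaf partitions of a partitioned table.
SELECT sum(c.reltuples)::bigint AS estimate
FROM   pg_partition_tree('myschema.mytable') p
JOIN   pg_class c ON c.oid = p.relid
WHERE  p.isleaf
AND    c.reltuples >= 0;  -- skip partitions never vacuumed or analyzed
```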
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the
table. The planner then fetches the actual current number of pages in
the table (this is a cheap operation, not requiring a table scan). If
that is different from relpages then reltuples is scaled
accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size defined in src/backend/optimizer/util/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:
If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
The Postgres Wiki FAQ
The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
No autovacuum running.
Immediately after a large INSERT / UPDATE / DELETE.
TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error, your pick. Accuracy depends on more factors:
Distribution of row size. If a given block happens to hold wider than usual rows, the count is lower than usual etc.
Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, so you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
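How the capped count feeds into the percentage is not spelled out in the question, so the following is only a guess at the intended calculation: each group's count divided by the capped total. The CTE name capped and the column aliases are made up for the sketch.

```sql
WITH capped AS (
  -- exact count up to 500000, and 500000 otherwise
  SELECT count(*) AS n
  FROM  (SELECT 1 FROM token LIMIT 500000) t
)
SELECT text,
       count(*) AS cnt,
       round(100.0 * count(*) / capped.n, 2) AS pct
FROM   token, capped
GROUP  BY text, capped.n
ORDER  BY cnt DESC;
```

The GROUP BY still has to read the whole token table; only the total is capped.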
I did this once in a postgres app by running:
EXPLAIN SELECT * FROM foo;
Then examining the output with a regex, or similar logic. For a simple SELECT *, the first line of output should look something like this:
Seq Scan on uids (cost=0.00..1.21 rows=8 width=75)
You can use the rows=(\d+) value as a rough estimate of the number of rows that would be returned, then only do the actual SELECT COUNT(*) if the estimate is, say, less than 1.5x your threshold (or whatever number you deem makes sense for your application).
Depending on the complexity of your query, this number may become less and less accurate. In fact, in my application, as we added joins and complex conditions, it became so inaccurate that it was completely worthless, even for knowing within a power of 100 how many rows we'd have returned, so we had to abandon that strategy.
But if your query is simple enough that Pg can predict within some reasonable margin of error how many rows it will return, it may work for you.
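The regex step can also be done inside the database. The PL/pgSQL function below is a sketch of that idea; the name count_estimate and its variables are assumptions, not something from the application described above. It concatenates the query text into an EXPLAIN statement, so it should only be given trusted input.

```sql
CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
  rec      record;
  estimate integer;
BEGIN
  -- Run EXPLAIN on the given query and scan the plan lines.
  FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
    -- Take the first "rows=" figure, which belongs to the top plan node.
    estimate := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
    EXIT WHEN estimate IS NOT NULL;
  END LOOP;
  RETURN estimate;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

-- Usage:
SELECT count_estimate('SELECT * FROM foo');
```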
Reference taken from this Blog.
You can use the queries below to find the row count.
Using pg_class:
SELECT reltuples::bigint AS EstimatedCount
FROM pg_class
WHERE oid = 'public.TableName'::regclass;
Using pg_stat_user_tables:
SELECT
schemaname
,relname
,n_live_tup AS EstimatedCount
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
How wide is the text column?
With a GROUP BY there's not much you can do to avoid a data scan (at least an index scan).
I'd recommend:
If possible, changing the schema to remove duplication of text data. This way the count will happen on a narrow foreign key field in the 'many' table.
Alternatively, creating a generated column with a HASH of the text, then GROUP BY the hash column.
Again, this is to decrease the workload (scan through a narrow column index)
Edit:
Your original question did not quite match your edit. I'm not sure if you're aware that the COUNT, when used with a GROUP BY, will return the count of items per group and not the count of items in the entire table.
You can also just SELECT MAX(id) FROM <table_name>; change id to whatever the PK of the table is. This only approximates the row count, and only when the PK is an ascending integer with few gaps.
In Oracle, you could use rownum to limit the number of rows returned. I am guessing similar construct exists in other SQLs as well. So, for the example you gave, you could limit the number of rows returned to 500001 and apply a count(*) then:
SELECT (case when cnt > 500000 then 500000 else cnt end) myCnt
FROM (SELECT count(*) cnt FROM table WHERE rownum<=500001)
For SQL Server (2005 or above) a quick and reliable method is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('MyTableName')
AND (index_id=0 or index_id=1);
Details about sys.dm_db_partition_stats are explained in MSDN
The query adds rows from all parts of a (possibly) partitioned table.
index_id=0 is an unordered table (Heap) and index_id=1 is an ordered table (clustered index)
Even faster (but unreliable) methods are detailed here.