Vertica and joins - sql

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.
A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.
Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.
The EXPLAIN output looks like this:
EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT session.id) as visits FROM owa_session as session JOIN owa_referer AS referer_via_ ON session.referer_id = referer_via_.id WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123' AND session.site_id = '49' GROUP BY referer_via_.url ORDER BY visits DESC LIMIT 250;
Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
| Output Only: 250 tuples
| Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| | Order: count(DISTINCT "session".id) DESC
| | Output Only: 250 tuples
| | Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | | Aggregates: count(DISTINCT "session".id)
| | | Group By: referer_via_.url
| | | Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | | Group By: referer_via_.url, "session".id
| | | | Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT)
| | | | | Join Cond: ("session".referer_id = referer_via_.id)
| | | | | Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (STALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | | Projection: public.owa_session_projection
| | | | | | Materialize: "session".id, "session".referer_id
| | | | | | Filter: ("session".site_id = '49')
| | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session".yyyymmdd <= 20121123))
| | | | | | Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows: 26M] (PATH ID: 6)
| | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_20121122_Potency_20121122
| | | | | | Materialize: referer_via_.id, referer_via_.url
| | | | | | Execute on: All Nodes

To speed up the join:
Partition the session table on column "yyyymmdd". This will enable partition pruning.
Add a condition on column "yyyymmdd" to referer_via_ and partition on it, if that is possible (most likely not).
Place column site_id as close as possible to the beginning of the ORDER BY list of the (super)projection used for session.
Segment the two tables on referer_id and id respectively.
Adding more nodes to the cluster also helps.

My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
I guess the amount affected would vary depending on data sets and structures you are working with. But, since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.
Someone please correct me if any of the following is wrong. I'm going by memory and by information picked up with conversations with others.
You can speed up your joins without a pre-join projection in a few ways. I believe that segmenting the projections for both tables on the join predicate columns (in this case, the referer ID) would help, as would anything you can do to filter the data earlier.
Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.
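To make the hash-versus-merge distinction concrete, here is a minimal Python sketch of the two textbook algorithms. It is not Vertica's implementation, and the function names and sample rows are invented for illustration. The point it shows is that a merge join only works when both inputs already arrive sorted on the join key, which is why the sort order of the projections matters.

```python
from collections import defaultdict

def hash_join(outer, inner, outer_key, inner_key):
    """Build a hash table on the inner input, then probe it with each outer row."""
    table = defaultdict(list)
    for row in inner:
        table[row[inner_key]].append(row)
    return [(o, i) for o in outer for i in table.get(o[outer_key], [])]

def merge_join(outer, inner, outer_key, inner_key):
    """Join two inputs that are ALREADY sorted on their join keys."""
    result, i, j = [], 0, 0
    while i < len(outer) and j < len(inner):
        ok, ik = outer[i][outer_key], inner[j][inner_key]
        if ok < ik:
            i += 1
        elif ok > ik:
            j += 1
        else:
            # Emit every inner row sharing this key, then advance the outer side.
            k = j
            while k < len(inner) and inner[k][inner_key] == ok:
                result.append((outer[i], inner[k]))
                k += 1
            i += 1
    return result

# Hypothetical rows, both lists sorted on the join key.
sessions = [
    {"id": 1, "referer_id": 10},
    {"id": 2, "referer_id": 10},
    {"id": 3, "referer_id": 30},
]
referers = [
    {"id": 10, "url": "a"},
    {"id": 20, "url": "b"},
    {"id": 30, "url": "c"},
]

hashed = hash_join(sessions, referers, "referer_id", "id")
merged = merge_join(sessions, referers, "referer_id", "id")
print(len(hashed), hashed == merged)
```

The hash join needs memory for the whole inner input, while the merge join streams both sides in one pass but depends on the sort order being there already.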
Lastly, I would like to know via the explain plan or through system tables if your query is actually using the projections Database Designer has recommended. If not, explicitly specify them in your query and see if that helps.

You seem to have a lot of STALE STATISTICS.
Responding to STALE STATISTICS is important, because stale statistics are likely the reason your queries are slow. Without statistics about the underlying data, Vertica's query optimizer cannot choose the best execution plan. Note that refreshing statistics only improves SELECT performance, not update performance.
If you update your tables regularly, remember that there are additional things you have to consider in Vertica. Please check the answer that I posted to this question.
I hope that should help improve your update speed.
Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows in a table later, it is often a good idea to not keep them around. There are ways to keep only the latest epoch version of the data. Or manually purge deleted data.
Let me know how it goes.

I think your query could be more explicit. Also, don't use that devil BETWEEN. Try this:
EXPLAIN SELECT
referer_via_.url as referralPageUrl,
COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_
ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd >= '20121123'
AND session.yyyymmdd <= '20121123'
AND session.site_id = '49'
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;
I'm also perplexed as to why you would use the same date for both bounds of BETWEEN; you may want to look into that.

This is my view, coming from an academic background working with column databases, including Vertica (recent PhD graduate in database systems).
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
Yes, updating projections is very slow and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).
A single row insert requires adding one value (one attribute) to each column in the projection. For example, a single row insert in a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value in a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimizations for updates but cannot completely hide the cost.
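The read / decompress / update / sort / compress cycle can be sketched with a toy model in Python. This is only an illustration under simplifying assumptions: the projection is a dict of run-length-encoded columns, the helper names are made up, and real systems batch and optimize this work heavily.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a list: [49, 49, 50] -> [(49, 2), (50, 1)]."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand run-length encoded pairs back into a flat list."""
    return [value for value, count in runs for _ in range(count)]

def insert_row(projection, sort_key, row):
    """Naive single-row insert into a sorted, compressed projection."""
    columns = list(projection)
    decoded = {c: rle_decode(projection[c]) for c in columns}   # read + decompress
    rows = [dict(zip(columns, vals)) for vals in zip(*decoded.values())]
    rows.append(row)                                            # update
    rows.sort(key=lambda r: r[sort_key])                        # sort
    # Every column of the projection is rewritten, not just one.
    return {c: rle_encode([r[c] for r in rows]) for c in columns}  # compress

# Hypothetical two-column projection sorted on site_id.
projection = {
    "site_id": rle_encode([49, 49, 50]),
    "yyyymmdd": rle_encode([20121122, 20121123, 20121123]),
}
updated = insert_row(projection, "site_id", {"site_id": 49, "yyyymmdd": 20121123})
print(updated)
```

Each extra projection repeats this work on its own copy of the columns, which is why loading in large batches amortizes the cost far better than many small inserts.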
Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-Tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. The reasons are multiple: no need to access heap data as for non-clustered indexes, smaller size due to compression, etc. The flip side is that they are much more difficult to update. Tradeoffs...

Related

Why Query_parallelism affects the result of a join between two UUID columns

I'm running the following test on Ignite 2.10.0.
I have 2 tables created with a query_parallelism=1 and without affinity key.
When I join the 2 following tables I have the result as expected.
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_A_id", "source_id" AS "_A_source_id" FROM PUBLIC."source_ml_blue";
+--------------------------------------+--------------------------------------+
| _A_id | _A_source_id |
+--------------------------------------+--------------------------------------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 |
+--------------------------------------+--------------------------------------+
1 row selected (0.004 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT "id" AS "_B_id", "flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_nprobe_tcp_blue";
+--------------------------------------+-----------+
| _B_id | _B_src_ip |
+--------------------------------------+-----------+
| 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1 |
+--------------------------------------+-----------+
1 row selected (0.003 seconds)
0: jdbc:ignite:thin://localhost:10800> SELECT _A."id" AS "_A_id", _A."source_id" AS "_A_source_id", _B."id" AS "_B_id", _B."flx_src_ip_text" AS "_B_src_ip" FROM PUBLIC."source_ml_blue" AS "_A" INNER JOIN PUBLIC."source_nprobe_tcp_blue" AS "_B" ON "_A"."source_id"="_B"."id";
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| _A_id | _A_source_id | _B_id | _B_src_ip |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
| 86c068cd-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 86c068cc-da89-11eb-a185-3da86c6c6bb3 | 1.1.1.1 |
+--------------------------------------+--------------------------------------+--------------------------------------+-----------+
1 row selected (0.005 seconds)
If I drop and recreate the same tables with query_parallelism = 8, I do not get a SQL error (the parallelism is equal on the 2 tables), BUT the result of the join is empty.
Any idea why I get this behavior?
You observe this behaviour because of optimisations for parallel query execution. Most likely your records landed in different partitions (each handled by a different thread). If you increase the number of records in both tables you will see a subset of this join as a result.
The most elegant option here is to let "_A"."source_id" and "_B"."id" be affinity keys. Enabling ignite.jdbc.distributedJoins will most likely affect performance in a clustered installation. Affinity collocation makes items with matching "_A"."source_id" and "_B"."id" reside in the same partition, which avoids cross-partition interaction (in clustered environments that interaction would mean additional network hops).
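The effect described above can be reproduced with a toy model in Python. This is not how Ignite assigns rows internally: the modulo "hash", the helper names, and the small integers standing in for UUIDs are all assumptions made for illustration. Each table is split into segments by one column, and a non-distributed join only compares segment i of one table with segment i of the other.

```python
def partition_rows(rows, key, parallelism):
    """Assign each row to a segment by one column (toy hash: value % parallelism)."""
    segments = [[] for _ in range(parallelism)]
    for row in rows:
        segments[row[key] % parallelism].append(row)
    return segments

def local_join(a_segments, b_segments):
    """Join segment i of A only with segment i of B, with no cross-segment lookups."""
    return [
        (a, b)
        for a_seg, b_seg in zip(a_segments, b_segments)
        for a in a_seg
        for b in b_seg
        if a["source_id"] == b["id"]
    ]

# Small integers stand in for the UUIDs in the question.
table_a = [{"id": 2, "source_id": 1}]
table_b = [{"id": 1, "src_ip": "1.1.1.1"}]

def run(parallelism, a_key):
    """Partition A on a_key and B on its id, then join segment by segment."""
    return local_join(
        partition_rows(table_a, a_key, parallelism),
        partition_rows(table_b, "id", parallelism),
    )

print(len(run(1, "id")))         # 1: a single segment sees every row
print(len(run(8, "id")))         # 0: the two rows sit in segments 2 and 1
print(len(run(8, "source_id")))  # 1: co-locating on the join key restores the match
```

The last call corresponds to the affinity-key suggestion: once A is placed by the column it is joined on, matching rows always share a segment.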
The problem comes from the SQL client: it has to be aware of the parallelism.
In DBeaver, I had to enable ignite.jdbc.distributedJoins in the connection properties to make the request work properly.

Is comparing two tables faster by importing them into a sql database or by using jdbc?

Background
I need to compare two tables in two different datacenters to make sure they're the same. The tables can be hundreds of millions, even a billion lines.
An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.
The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.
Here's an example input and output:
Input
table1:
Name | Age |
Alice| 25.0|
Bob | 49 |
Jim | 45 |
Cal | 52 |
table2:
Name | Age |
Bob | 49 |
Cal | 42 |
Alice| 25 |
Output:
table1 missing rows (empty):
Name | Age |
| |
table2 missing rows:
Name | Age |
Jim | 45 |
mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25 | table2|
Cal | 52 | table1|
Cal | 42 | table2|
Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.
Question
Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?
OR
Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?
Edits:
The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL, there are no other uses.
You should definitely start with the database option. Especially if the databases are connected with a database link, you can easily set up the transfer of the data.
Such a comparison often leads to a full outer join of the two sources, and experience tells us that DIY joins are notoriously less performant than the native database implementation (where you can, for example, use a parallel option).
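As a rough illustration of the database-side comparison, the sketch below loads the question's sample data into an in-memory SQLite database (a stand-in for whatever common SQL environment is used) and produces the three result sets. Since full outer join support varies between engines, it uses two anti-joins plus an inner join, which together cover the same ground. The column is declared without a type so that 25.0 and 25 keep their distinct storage types; comparing `typeof` is an SQLite-specific way to catch the datatype mismatch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (name TEXT PRIMARY KEY, age);
    CREATE TABLE table2 (name TEXT PRIMARY KEY, age);
    INSERT INTO table1 VALUES ('Alice', 25.0), ('Bob', 49), ('Jim', 45), ('Cal', 52);
    INSERT INTO table2 VALUES ('Bob', 49), ('Cal', 42), ('Alice', 25);
""")

# Rows present in table1 but absent from table2 (anti-join on the primary key).
missing_from_table2 = conn.execute("""
    SELECT t1.name, t1.age FROM table1 t1
    LEFT JOIN table2 t2 ON t1.name = t2.name
    WHERE t2.name IS NULL
""").fetchall()

# Rows present in table2 but absent from table1.
missing_from_table1 = conn.execute("""
    SELECT t2.name, t2.age FROM table2 t2
    LEFT JOIN table1 t1 ON t1.name = t2.name
    WHERE t1.name IS NULL
""").fetchall()

# Same key on both sides, but a different value or a different storage type.
mismatching = conn.execute("""
    SELECT t1.name, t1.age, typeof(t1.age), t2.age, typeof(t2.age)
    FROM table1 t1 JOIN table2 t2 ON t1.name = t2.name
    WHERE t1.age <> t2.age OR typeof(t1.age) <> typeof(t2.age)
    ORDER BY t1.name
""").fetchall()

print(missing_from_table1)
print(missing_from_table2)
print(mismatching)
```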
Alternatively, you may try to implement a more sophisticated algorithm that can do the comparison without having to transfer the whole table.
One example is based on Merkle trees, where you first scan both sources in their own location to recognise which parts are identical (and can be ignored), and then transfer and compare only the parts that differ.
So if you expect the tables to be nearly identical and they have keys that allow some hierarchy, such an approach could turn out better than a brute-force full compare.
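A single-level version of that idea can be sketched in Python: each datacenter hashes its rows into buckets and only the per-bucket digests are exchanged. A real Merkle tree would add further levels on top of these leaf digests. The bucket count, the function names, and the use of CRC32 and SHA-256 are choices made for this illustration, not something prescribed above.

```python
import hashlib
import zlib

def bucket_of(key, n_buckets):
    """Map a primary key to a bucket; crc32 is stable across processes and machines."""
    return zlib.crc32(key.encode()) % n_buckets

def bucket_digests(rows, n_buckets):
    """rows maps primary key -> tuple of column values; returns one digest per bucket."""
    hashers = [hashlib.sha256() for _ in range(n_buckets)]
    for key in sorted(rows):  # fixed order so both sides hash identically
        hashers[bucket_of(key, n_buckets)].update(repr((key, rows[key])).encode())
    return [h.hexdigest() for h in hashers]

def differing_buckets(digests1, digests2):
    """Indexes of buckets whose content differs between the two sites."""
    return [i for i, (d1, d2) in enumerate(zip(digests1, digests2)) if d1 != d2]

# Sample data from the question; each side would compute its digests locally.
table1 = {"Alice": (25.0,), "Bob": (49,), "Jim": (45,), "Cal": (52,)}
table2 = {"Bob": (49,), "Cal": (42,), "Alice": (25,)}

N_BUCKETS = 16
to_transfer = differing_buckets(
    bucket_digests(table1, N_BUCKETS), bucket_digests(table2, N_BUCKETS)
)
print(to_transfer)
```

Only the rows falling into the listed buckets need to cross the network for a row-by-row comparison; buckets with equal digests are skipped.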
The faster solution is to load both tables into variables (memory) in your programming language and then compare them with your favorite algorithm.
Copying them to a new table first more than doubles the time spent on disk read/write operations, especially the writes.

Query a dynamic deduplicated table

I am using BigQuery to give my colleagues access to aggregated data in our system.
I have a raw_orders table where I store orders data. The thing is that the lines in this table are subject to change across time. When a change occurs, I add a new line in this table. So my table looks like this:
+-----+-------+---------------------+---------------------+
| id | total | created_at | updated_at |
+-----+-------+---------------------+---------------------+
| ABC | 15.76 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| ABC | 12.43 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
| DEF | 19.03 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| DEF | 12.03 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
+-----+-------+---------------------+---------------------+
To allow my collaborators to query on a deduplicated table easily, I created a view of deduplicated lines using:
CREATE OR REPLACE VIEW xxx.orders as
select ro2.*
from (
select ro.id, max(ro.updated_at) max_updated_at
from xxx.raw_orders ro
group by ro.id
) tmp inner join xxx.raw_orders ro2 on ro2.id = tmp.id and ro2.updated_at = tmp.max_updated_at
order by ro2.created_at desc
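As a sanity check of that view's logic, the sketch below recreates it in an in-memory SQLite database with the sample rows from the table above. SQLite stands in for BigQuery here, so it says nothing about cost. It also assumes the outer select is meant to read from the joined raw_orders alias (ro2) and that the join conditions are combined with AND.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id TEXT, total REAL, created_at TEXT, updated_at TEXT);
    INSERT INTO raw_orders VALUES
        ('ABC', 15.76, '2020-01-01 12:56:32', '2020-01-02 14:58:43'),
        ('ABC', 12.43, '2020-01-01 12:56:32', '2020-01-01 12:56:32'),
        ('DEF', 19.03, '2020-01-01 12:56:32', '2020-01-02 14:58:43'),
        ('DEF', 12.03, '2020-01-01 12:56:32', '2020-01-01 12:56:32');

    -- Keep, for each id, only the row carrying the latest updated_at.
    CREATE VIEW orders AS
    SELECT ro2.*
    FROM (
        SELECT ro.id, MAX(ro.updated_at) AS max_updated_at
        FROM raw_orders ro
        GROUP BY ro.id
    ) tmp
    INNER JOIN raw_orders ro2
        ON ro2.id = tmp.id AND ro2.updated_at = tmp.max_updated_at;
""")

latest = conn.execute("SELECT id, total FROM orders ORDER BY id").fetchall()
print(latest)  # [('ABC', 15.76), ('DEF', 19.03)]
```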
This works great, but I feel that I am spending too much budget on simple requests like:
SELECT * FROM rubee.orders WHERE created_at > '2020-11-01 00:00:00';
If I understand well, because of the view step, big query must use a lot of storage to deduplicate lines before responding a single result.
Am I doing something wrong here? How do you give access to deduplicated data without spending too much storage? Would you have a better strategy for what I try to do?
Ideally, you will use a materialized view for the purpose, but right now BigQuery has limited support on materialized view. You cannot create a mview to replace the view you were using.
It is possible to create a materialized view for the inner query, which may make the whole query less expensive but please read on.
Cost. There is no simple answer whether you are "spending too much budget" on the query.
If you're on a pay-per-query plan and charged by "processed bytes", then although the query is more expensive for BigQuery to process, you're charged no more than for scanning the whole table once (although technically the table was scanned more than once). In other words, deduplication is free. However, if your query pattern would allow you to cluster/partition your table to avoid scanning the whole table, then this "self-join" view does prevent you from saving budget.
If you have reservation on slots, then you will benefit from making the query faster.
Suggestions. Given that the situation differs case by case, the general suggestions are:
If possible, separate the data into "archived" and "active" so that the "archived" data stays deduplicated (and partitioned/clustered to allow efficient search), and you only need a view to dedup the "active" data.
Creating a materialized view (on the inner "GROUP BY" query) may speed up the query a bit but will not necessarily make it "cheaper"; you may be charged for the size of the base table + mview.

Index is not being used by optimizer

I have a query which is performing very badly due to a full scan of a table. I have checked the statistics and rebuilt the indexes, but it's not working.
SQL Statement:
select distinct NA_DIR_EMAIL d, NA_DIR_EMAIL r
from gcr_items , gcr_deals
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
order by 1
Execution Plan :
Plan hash value: 3180018891
-------------------------------------------------------------------------
| Id | Operation | Name | Rows | Time |
-------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 | 00:11:42 |
| 1 | SORT ORDER BY | | 8 | 00:11:42 |
| 2 | HASH UNIQUE | | 8 | 00:11:42 |
|* 3 | HASH JOIN | | 7385 | 00:11:42 |
|* 4 | VIEW | index$_join$_002 | 10462 | 00:00:05 |
|* 5 | HASH JOIN | | | |
|* 6 | INDEX RANGE SCAN | GCR_DEALS_IDX12 | 10462 | 00:00:01 |
| 7 | INDEX FAST FULL SCAN| GCR_DEALS_IDX1 | 10462 | 00:00:06 |
|* 8 | TABLE ACCESS FULL | GCR_ITEMS | 7386 | 00:11:37 |
-------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GCR_DEALS"."GCR_DEALS_ID"="GCR_ITEMS"."GCR_DEALS_ID")
4 - filter("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
5 - access(ROWID=ROWID)
6 - access("GCR_DEALS"."BU_ID"=TO_NUMBER(:P0_BU_ID))
8 - filter(DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER("NA_ORG_OWNER_EMAI
L")))=DECODE(:P55_DIRECT,'ALL','Y',TRIM(UPPER(:P55_DIRECT))))
To begin with, the part of the condition in the WHERE clause that uses decode must be decomposed (or "decompiled", or "reengineered") into a simpler form without the decode function, a form that the query optimizer can understand:
AND
decode(:P55_DIRECT,'ALL','Y',trim(upper(NA_ORG_OWNER_EMAIL)))=
decode(:P55_DIRECT,'ALL','Y',trim(upper(:P55_DIRECT)))
into:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
To find rows in the table based on values stored in the index, Oracle uses an access method named Index scan, see this link for details:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i52300
One of the most common access methods is the Index Range Scan, see here:
https://docs.oracle.com/cd/B19306_01/server.102/b14211/optimops.htm#i45075
The documentation says (in the latter link) that:
The optimizer uses a range scan when it finds one or more leading
columns of an index specified in conditions, such as the following:
col1 = :b1
col1 < :b1
col1 > :b1
AND combination of the preceding conditions for leading columns in the
index
col1 like 'ASD%' wild-card searches should not be in a leading
position otherwise the condition col1 like '%ASD' does not result in a
range scan.
The above means that the optimizer is able to use the index to find rows only for query conditions that contain basic comparison operators: = < > <= >= LIKE, which are used to compare simple values with plain column names. What the documentation doesn't clearly say, and what you need to deduce by reading between the lines, is that when some function is used in the condition, in the form function( column_name ) or function( expression_involving_column_names ), then the index range scan cannot be used. In this case the query optimizer must evaluate this expression individually for each row in the table, and thus must read all rows (perform a full table scan).
A short conclusion and a rule of thumb:
Functions in the WHERE clause can prevent the optimizer from using
indexes
If you see some function somewhere in the WHERE clause, then it is a sign that you are
running the red light
STOP immediately and think three times how
this function impact the query optimizer and the performance of your
query, and try to rewrite the condition to a form that the optimizer
is able to understand.
Now take a look at our rewritten condition:
AND (
:P55_DIRECT = 'ALL'
OR
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
)
and STOP - there are still two functions trim and upper applied to a column named NA_ORG_OWNER_EMAIL. We need to think how they can impact the query optimizer.
I assume that you have created a plain index on a single column: CREATE INDEX somename ON GCR_ITEMS( NA_ORG_OWNER_EMAIL ). If so, then the index contains only plain values of NA_ORG_OWNER_EMAIL.
But the query is trying to find trim(upper(NA_ORG_OWNER_EMAIL)) values, which are not stored in the index, so this index cannot be used in this case.
This condition requires a function based index:
https://docs.oracle.com/cd/E11882_01/appdev.112/e41502/adfns_indexes.htm#ADFNS00505
CREATE INDEX somename ON GCR_ITEMS( trim( upper( NA_ORG_OWNER_EMAIL )))
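The difference between a plain index and an index on the expression can be observed in a small experiment. The sketch below uses SQLite from Python only because it is easy to run anywhere: SQLite is not Oracle, its planner and plan output differ, and the table definition and index names are made up. It only illustrates the general principle that the indexed expression has to match the expression in the WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gcr_items (
        id INTEGER PRIMARY KEY,
        na_org_owner_email TEXT,
        na_dir_email TEXT
    );
    -- Plain index: stores the raw column values only.
    CREATE INDEX plain_idx ON gcr_items (na_org_owner_email);
""")

QUERY = "SELECT na_dir_email FROM gcr_items WHERE trim(upper(na_org_owner_email)) = ?"

def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("A@B.COM",)).fetchall()
    return " | ".join(row[-1] for row in rows)

before = plan(QUERY)

# Index on the same expression that the WHERE clause uses.
conn.execute("CREATE INDEX fn_idx ON gcr_items (trim(upper(na_org_owner_email)))")
after = plan(QUERY)

print("plain index only:     ", before)
print("with expression index:", after)
```

With only the plain index the plan reports a scan of the table, and once the expression index exists the plan refers to `fn_idx`.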
Unfortunately, even the function-based index will still not help, because the condition in the query is too general: if the value of :P55_DIRECT is ALL, the query must retrieve all rows from the table (perform a full table scan); otherwise it should use the index to search for the value.
This is because the query is planned (think of it as "compiled") by the query optimizer only once, during its first execution. The plan is then stored in the cache and used for all further executions of the query. The value of the parameter is not known in advance, so the plan must cover every possible case, and thus will always perform a full table scan.
In 12c there is a new feature, "Adaptive Query Optimization":
https://docs.oracle.com/database/121/TGSQL/tgsql_optcncpt.htm#TGSQL94982
where the query optimizer analyses the parameters of the query on each run, is able to detect that the plan is not optimal for some runtime parameters, and chooses better "subplans" depending on the actual parameter values ... but you must use 12c, and additionally pay for Enterprise Edition, because only this edition includes that feature. And it's still not certain whether the adaptive plan will work in this case or not.
What you can do without paying for 12c EE is to DIVIDE this general query into two separate variants, one for a case where :P55_DIRECT = ALL, and the other for remaining cases, and run an appropriate variant in the client (your application) depending on the value of this parameter.
A version for :P55_DIRECT = ALL, that will perform a full table scan
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
order by 1
and a version for other cases, that will use the function based index:
where gcr_deals.GCR_DEALS_ID=gcr_items.GCR_DEALS_ID
and
gcr_deals.bu_id=:P0_BU_ID
and
trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))
order by 1
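On the client side, choosing between the two variants could look like the following Python sketch. The function name and the way binds are returned as a dict are assumptions made for illustration; the SQL text is the query from the question with the condition split as described.

```python
# Part shared by both variants of the query.
BASE = """
select distinct NA_DIR_EMAIL d, NA_DIR_EMAIL r
from gcr_items, gcr_deals
where gcr_deals.GCR_DEALS_ID = gcr_items.GCR_DEALS_ID
and gcr_deals.bu_id = :P0_BU_ID
"""

# Extra predicate, written to match the function-based index expression.
FILTER = "and trim(upper(:P55_DIRECT)) = trim(upper(NA_ORG_OWNER_EMAIL))\n"

def build_query(bu_id, direct):
    """Return (sql, binds) for the variant matching the value of P55_DIRECT."""
    if direct == "ALL":
        # No e-mail filter at all: a full scan is the right plan here.
        return BASE + "order by 1", {"P0_BU_ID": bu_id}
    # Selective filter: this statement gets its own plan and can use the index.
    return BASE + FILTER + "order by 1", {"P0_BU_ID": bu_id, "P55_DIRECT": direct}

sql_all, binds_all = build_query(7, "ALL")
sql_one, binds_one = build_query(7, "someone@example.com")
print(sql_one)
```

Because the two statements have different text, each gets its own cached plan, so neither has to cover the other's case. The returned pair can be passed to any database cursor that accepts named bind variables.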

Determine Oracle query execution time and proposed datasize without actually executing query

In Oracle, is there any way to determine how long a SQL query will take to fetch all of its records, and how large the result will be, without actually executing it and waiting for the entire result?
I am repeatedly asked to download and provide data to users using a normal Oracle SQL select (not datapump/import etc.). Sometimes the row count is in the millions.
The actual run time is not known until you run the query, but you can try to estimate it.
First, you can run an explain plan only. This will NOT run the query; based on your current stats it will show you more or less how the query will be executed.
It will not include the actual time and effort needed to read the data from the data blocks. For that, consider:
Do you have a large block size?
Is this schema normalized or de-normalized for querying/reporting?
How large is a row? Does it fit in a single block so that only 1 fetch is needed?
The number of rows you are expecting.
The amount of data and your network latency.
Based on this you can try to estimate the time.
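As a back-of-the-envelope illustration of combining those factors, here is a small Python sketch. The formula (bytes over bandwidth plus one round trip per fetch batch) is a simplifying assumption, and every input value is hypothetical; it ignores the time the database needs to produce the rows.

```python
import math

def estimate_fetch_seconds(rows, avg_row_bytes, bandwidth_bytes_per_s,
                           fetch_size, round_trip_s):
    """Rough estimate: time to move the bytes plus one round trip per fetch batch."""
    transfer = rows * avg_row_bytes / bandwidth_bytes_per_s
    round_trips = math.ceil(rows / fetch_size)
    return transfer + round_trips * round_trip_s

# Hypothetical inputs: 5 million rows of ~200 bytes over ~100 MB/s,
# fetched 1,000 rows at a time with a 1 ms network round trip.
estimate = estimate_fetch_seconds(5_000_000, 200, 100_000_000, 1_000, 0.001)
print(round(estimate, 1))  # 15.0
```

The row count and average row size can be taken from the explain plan estimates, and the fetch size and latency depend on the client and network in use.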
This requires good statistics, explain plan for ..., adjusting sys.aux_stats$, and then adjusting your expectations.
Good statistics: The explain plan estimates are based on optimizer statistics. Make sure that tables and indexes have up-to-date statistics. On 11g this usually means sticking with the default settings and tasks, and only manually gathering statistics after large data loads.
Explain plan for ...: Use a statement like this to create and store the explain plan for any SQL statement. This even works for creating indexes and tables.
explain plan set statement_id = 'SOME_UNIQUE_STRING' for
select * from dba_tables cross join dba_tables;
This is usually the best way to visualize an explain plan:
select * from table(dbms_xplan.display);
Plan hash value: 2788227900
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12M| 5452M| 00:00:19 |
|* 1 | HASH JOIN RIGHT OUTER | | 12M| 5452M| 00:00:19 |
| 2 | TABLE ACCESS FULL | SEG$ | 7116 | 319K| 00:00:01 |
...
The raw data is stored in PLAN_TABLE. The first row of the plan usually sums up the estimates for the other steps:
select cardinality, bytes, time
from plan_table
where statement_id = 'SOME_UNIQUE_STRING'
and id = 0;
CARDINALITY BYTES TIME
12934699 5717136958 19
Adjust sys.aux_stats$: The time estimate is based on system statistics stored in sys.aux_stats$. These are numbers for metrics like CPU speed, single-block I/O read time, etc. For example, on my system:
select * from sys.aux_stats$ order by sname
SNAME PNAME PVAL1 PVAL2
SYSSTATS_INFO DSTART 09-11-2014 11:18
SYSSTATS_INFO DSTOP 09-11-2014 11:18
SYSSTATS_INFO FLAGS 1
SYSSTATS_INFO STATUS COMPLETED
SYSSTATS_MAIN CPUSPEED
SYSSTATS_MAIN CPUSPEEDNW 3201.10192837466
SYSSTATS_MAIN IOSEEKTIM 10
SYSSTATS_MAIN IOTFRSPEED 4096
SYSSTATS_MAIN MAXTHR
SYSSTATS_MAIN MBRC
SYSSTATS_MAIN MREADTIM
SYSSTATS_MAIN SLAVETHR
SYSSTATS_MAIN SREADTIM
The numbers can be automatically gathered by dbms_stats.gather_system_stats. They can also be manually modified. It's a SYS table but relatively safe to modify. Create some sample queries, compare the estimated time with the actual time, and adjust the numbers until they match.
Discover you probably wasted a lot of time
Predicting run time is theoretically impossible to get right in all cases, and in practice it is horribly difficult to forecast for non-trivial queries. Jonathan Lewis wrote a whole book about those predictions, and that book only covers the "basics".
Complex explain plans are typically "good enough" if the estimates are off by one or two orders of magnitude. But that kind of difference is typically not good enough to show to a user, or use for making any important decisions.