When Oracle is estimating the 'Cost' for certain queries, does it actually look at the amount of data (rows) in a table?
For example:
If I'm doing a full table scan of employees for name='Bob', does it estimate the cost by counting the number of existing rows, or is it always a set cost?
The cost-based optimizer uses segment (table and index) statistics as well as system (CPU and I/O performance) statistics for its estimates. Although it depends on how your database is configured, from 10g onwards the segment statistics are usually recomputed once per day by a job that calls the DBMS_STATS package.
In the default configuration, Oracle will check the table statistics (which you can look at by querying the ALL_TABLES view - see the column NUM_ROWS). Normally an Oracle job is run periodically to re-gather these statistics by querying part or all of the table.
If the statistics haven't been gathered (yet), the optimizer will (depending on the optimizer_dynamic_sampling parameter) run a quick sample query on the table in order to calculate an estimate for the number of rows in that table.
(To be more accurate, the cost of scanning a table is calculated not from the number of rows, but the number of blocks in the table (which you can see in the BLOCKS column in ALL_TABLES). It takes this number and divides it by a factor related to the multi-block read count to calculate the cost of that part of the plan.)
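For example, you can look at the figures the optimizer works from with something like this (EMPLOYEES is just a placeholder table name):
-- Row and block counts used for the full-scan cost estimate
SELECT num_rows, blocks, last_analyzed
FROM   all_tables
WHERE  table_name = 'EMPLOYEES';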
As a general rule of thumb, is there a maximum/sensible number of statistics for an individual table on MS SQL Server?
I have a DB which I've noticed has over 100 statistics on a table with 6 indexes.
Auto-created statistics (names beginning with "_WA_Sys_") are created for columns used as WHERE clause predicates that are not already indexed, to help estimate row counts for better execution plans.
The "dta" stats are generated SSMS advisor tools. These hypothetical stats may have been left behind from a tuning exercise that wasn't completed. See this page from Brent Ozar's sp_Blitz for a script to identify and drop these.
The max number of stats per table is 30,000 according to the maximum capacity specifications documentation.
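As a quick sanity check, a sketch like the following lists what is on a table (the table name is a placeholder); names starting with _WA_Sys_ are the auto-created stats and names starting with _dta_ come from the tuning advisor:
SELECT s.name,
       s.auto_created,
       s.user_created,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.YourTable');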
I'm doing simple tests on Redshift to try and speed up the insertion of data into a Redshift table. One thing I noticed today is that doing something like this
CREATE TABLE a (x int) DISTSTYLE key DISTKEY (x) SORTKEY (x);
INSERT INTO a (x) VALUES (1), (2), (3), (4);
VACUUM a; ANALYZE a;
EXPLAIN SELECT MAX(x) FROM a;
yields
QUERY PLAN
XN Aggregate (cost=0.05..0.05 rows=1 width=4)
-> XN Seq Scan on a (cost=0.00..0.04 rows=4 width=4)
I know this is only 4 rows, but it still shouldn't be doing a full table scan to find the max value of a pre-sorted column. Isn't that metadata included in the work done by ANALYZE?
And just as a sanity check, the EXPLAIN for SELECT x FROM a WHERE x > 3 only scans 2 rows instead of the whole table.
Edit: I inserted 1,000,000 more rows into the table with random values from 1 to 10,000. Did a vacuum and analyze. The query plan still says it has to scan all 1,000,004 rows.
Analyzing query plans in a tiny data set does not yield any practical insight on how the database would perform a query.
The optimizer has thresholds, and when the cost difference between different plans is small enough it stops considering alternative plans. The idea is that, for simple queries, the time spent searching for the "perfect" execution plan can exceed the total execution time of a less optimal plan.
Redshift was developed from the ParAccel DB code base. ParAccel has literally hundreds of parameters that can be changed or adjusted to tune the database for different workloads and situations.
Since Redshift is a "managed" offering, it has these settings preset at levels deemed optimal by Amazon engineers given an "expected" workload.
In general, Redshift and ParAccel are not that great for single slice queries. These queries tend to be run in all slices anyway, even if they are only going to find data in a single slice.
Once a query is executing in a slice, the minimum amount of data read is a block. Depending on block size this can mean hundreds of thousands of rows.
Remember, Redshift does not have indexes. So you are not going to have a simple record lookup that will read a few entries off an index and then go laser focused on a single page on the disk. It will always read at least an entire block for that table, and it will do that in every slice.
How to have a meaningful data set to be able to evaluate a query plan?
The short answer is that your table needs a "large number" of data blocks per slice.
How many blocks per slice is my table going to require? That depends on several factors:
Number of nodes in your cluster
Type of node in the cluster - Number of slices per node
Data Type - How many bytes each value requires.
The type of compression encoding for the column involved in the query - the optimal encoding depends on data demographics
So let's start at the top.
Redshift is an MPP database, where processing is spread across multiple nodes. See Redshift's architecture here.
Each node is further sub-divided in slices, which are dedicated data partitions and corresponding hardware resources to process queries on that partition of the data.
When a table is created in Redshift, and data is inserted, Redshift will allocate a minimum of one block per slice.
Here is a simple example:
If you created a cluster with two ds1.8xlarge nodes, you would have 16 slices per node times two nodes for a total of 32 slices.
Let's say we are querying, and the column in the WHERE clause is something like "ITEM_COUNT", an integer. An integer consumes 4 bytes.
Redshift uses a block size of 1MB.
So in this scenario, your ITEM_COUNT column would have available to it a minimum of 32 blocks (one per slice) times the 1MB block size, which equates to 32MB of storage.
With 32MB of storage and each entry consuming only 4 bytes, you can store more than 8 million entries, and they would all still fit within that minimum of one block per slice (each 1MB block holds roughly 262,000 uncompressed 4-byte values).
In this example in the Amazon Redshift documentation they load close to 40 million rows to evaluate and compare different encoding techniques. Read it here.
But wait.....
There is also compression: with a 75% compression rate, even 32 million records would still fit into that single block per slice.
What is the bottom line?
In order to get a meaningful query plan you need tables whose columns span several blocks. In the example above, even 32 million rows would still occupy a single block per slice.
This means that, in the configuration above and with all the stated assumptions, a table with a single record would most likely have the same query plan as a table with 32 million records, because in both cases the database only needs to read a single block per slice.
If you want to understand how your data is distributed across slices and how many blocks are being used you can use the queries below:
How many rows per slice:
Select trim(name) as table_name, id, slice, sorted_rows, rows
from stv_tbl_perm
where name like '<<your-tablename>>'
order by slice;
How to count how many blocks:
select trim(name) as table_name, col, b.slice, b.num_values, count(b.slice)
from stv_tbl_perm a, stv_blocklist b
where a.id = b.tbl
and a.slice = b.slice
and name like '<<your-tablename>>'
group by 1,2,3,4
order by col, slice;
At the following link
http://www.programmerinterview.com/index.php/database-sql/selectivity-in-sql-databases/
the author has written that since the "SEX" column has only two possible values, its selectivity for 10,000 records would be, according to the formula given, 0.02%.
But my question is: how does a database system come to know that this particular column has this many unique values? Wouldn't the database system need to scan the entire table at least once, or is there some other way it comes to know about those unique values?
First, you are applying the formula wrong. The selectivity for sex (in the example given) would be 50% not 0.02%. That means that each value appears about 50% of the time.
The general way that databases keep track of this is using something called "statistics". These are measures that are kept about all tables and used by the optimizer. Sometimes, the information can also be provided by an index on the column.
Coming back to your actual question: yes, the database periodically scans all the table data and saves some statistics (e.g. max value, min value, number of distinct keys, number of rows in the table, etc.) in internal tables. These statistics are used to estimate the size of your query's result (or of other DML operations) in order to evaluate the optimal execution plan. You can trigger statistics gathering manually by running EXEC DBMS_STATS.GATHER_DATABASE_STATS; or one of the other DBMS_STATS procedures. You can also tell Oracle to read only a sample of the data (e.g. 10% of all rows).
Usually the data content does not change drastically, so it does not matter if the numbers are not absolutely exact; they are (usually) sufficient for estimating an execution plan.
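For example, a minimal sketch of gathering statistics for a single table from a roughly 10% sample (the schema and table names are placeholders):
-- gather statistics for one table, sampling about 10% of its rows
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname => 'HR', tabname => 'EMPLOYEES', estimate_percent => 10);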
Oracle has many processes related to calculating the number of distinct values (NDV).
Manual Statistics Gathering: Statistics gathering can be triggered manually, through many different procedures in DBMS_STATS.
AUTOTASK: Since 10g Oracle has a default AUTOTASK job, "auto optimizer stats collection". It will only gather statistics if the current stats are stale.
Bulk Load: In 12c statistics can be gathered during a bulk load.
Sample: The NDV can be computed from 100% of the data or can be estimated based on a sample. The sample can be either based on blocks or rows.
One-pass distinct sampling: 11g introduced a new AUTO_SAMPLE_SIZE algorithm. It scans the entire table but only uses one pass. It's much faster to scan the whole table than to have to sort even a small part of it. There are several more in-depth descriptions of the algorithm, such as this one.
Incremental Statistics: For partitioned tables Oracle can store extra information about the NDV, called a synopsis. With this information, if only a single partition is modified, only that one partition needs to be analyzed to generate both partition and global statistics.
Index NDV: Index statistics are created by default when an index is created. Also, the information can be periodically re-gathered from DBMS_STATS.GATHER_INDEX_STATS or the cascade option in other procedures in DBMS_STATS.
Custom Statistics: The NDV can be manually set with DBMS_STATS.SET_* or ASSOCIATE STATISTICS.
Dynamic Sampling: Right before a query is executed, Oracle can automatically sample a small number of blocks from the table to estimate the NDV. This usually only happens when statistics are missing.
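Whichever of these mechanisms gathered the statistics, the resulting NDV figures end up in the data dictionary and can be inspected, for example (the table name is a placeholder):
-- column-level number of distinct values after statistics gathering
SELECT column_name, num_distinct, num_nulls, last_analyzed
FROM   user_tab_col_statistics
WHERE  table_name = 'EMPLOYEES';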
The database scans the data set in a table so it can use the most efficient method to retrieve data. It measures the uniqueness of values using the following formula:
Index Selectivity = number of distinct values / the total number of values
The result will be between zero and one. An index selectivity at or near zero means there are hardly any distinct values; in these cases indexes actually reduce performance, so the database uses sequential scanning instead of seek operations.
For more information on indexes read https://dba.stackexchange.com/questions/42553/index-seek-vs-index-scan
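As a rough illustration of the formula, you can compute the ratio yourself with a query like this (the table and column names are made up):
-- selectivity = number of distinct values / total number of values
SELECT COUNT(DISTINCT status) * 1.0 / COUNT(*) AS selectivity
FROM   orders;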
E.g. suppose we have a table with 4 million rows, which has a STATUS field that can take one of the following values: TO_WORK, BLOCKED or WORKED_CORRECTLY.
Would you partition on a field which will change just once (most of the time from TO_WORK to WORKED_CORRECTLY)? How many partitions would you create?
The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.
For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.
Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.
The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.
So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - will end up in the WORKED_CORRECTLY partition. So that partition will become enormous compared to the others, and the chance of benefiting from partitioning becomes even more remote.
Finally, your proposed scheme is not elastic. At the current volume each partition would hold 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.
So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.
This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.
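For illustration only, a minimal sketch of such a fact table, range-partitioned by month on its creation date (all names and the starting boundary are assumptions):
CREATE TABLE sales_fact (
  sale_id      NUMBER,
  date_created DATE,
  status       VARCHAR2(20)
)
PARTITION BY RANGE (date_created)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION p_start VALUES LESS THAN (DATE '2010-01-01')
);
New monthly partitions are then created automatically as data arrives, and old months can be archived or dropped as a unit.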
The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.
What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?
Assuming you are trying to improve query performance: do all your queries have predicates on the STATUS column? Are they doing single-row lookups? Or would you want your queries to scan an entire partition?
In an application I need to query a Postgres DB where I expect tens or even hundreds of millions of rows in the result set. I might do this query once a day, or even more frequently. The query itself is relatively simple, although may involve a few JOINs.
My question is: How smart is Postgres with respect to avoiding having to seek around the disk for each row of the result set? Given the time required for a hard disk seek, this could be extremely expensive.
If this isn't an issue, how does Postgres avoid it? How does it know how to lay out data on the disk such that it can be streamed out in an efficient manner in response to this query?
When PostgreSQL analyzes your data, one of the statistics calculated and used by the query planner is the correlation between the ordering of values in your field or index and the order on disk.
Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is NULL if the column data type does not have a < operator.)
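After an ANALYZE you can inspect that statistic yourself, for example (the table name is a placeholder):
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  tablename = 'my_big_table';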
The index cost estimation functions also calculate a correlation:
The indexCorrelation should be set to the correlation (ranging between -1.0 and 1.0) between the index order and the table order. This is used to adjust the estimate for the cost of fetching rows from the parent table.
I don't know for sure, but I assume the planner uses these correlation values when deciding whether the required rows can be read at lower cost by a sequential table scan (possibly piggybacking on another concurrent scan of the same table) followed by filtering, or by an index scan with its resulting seeks.
PostgreSQL doesn't keep tables sorted according to any particular key, but they can periodically be recreated in a particular index order using the CLUSTER command (which will be slow, with a disk seek per row, if the data to cluster has low correlation to the index values order).
PostgreSQL is able to effectively collect a set of disk blocks that need retrieving, then obtain them in physical order to reduce seeking. It does this through Bitmap Scans. Release Notes for 8.1 say:
Bitmap scans are useful even with a single index, as they reduce the amount of random access needed; a bitmap index scan is efficient for retrieving fairly large fractions of the complete table, whereas plain index scans are not.
Edit: I meant to mention the planner cost constants seq_page_cost and random_page_cost, which inform the planner of the relative costs of fetching a disk page as part of a series of sequential fetches versus fetching a non-sequential disk page.
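Both are ordinary settings you can inspect and, if your storage makes random reads comparatively cheap, adjust; the value below is purely illustrative:
SHOW seq_page_cost;
SHOW random_page_cost;
-- e.g. for the current session only
SET random_page_cost = 1.1;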