Confused about the cost of SQL in PL/SQL Developer

Select pn.pn_level_id From phone_number pn Where pn.phone_number='15183773646'
Select pn.pn_level_id From phone_number pn Where pn.phone_number=' 15183773646'
Do you think they are the same? No, they are not the same in PL/SQL Developer.
I'm wondering why the second query's cost is lower than the first one's.
Any help would be appreciated!

The cost is not the same because the planner takes the available statistics into account when calculating it. Among other things, the statistics contain the values that appear most frequently in a column, together with their frequencies. This lets the planner estimate more accurately how many rows will be fetched and decide how best to get the data (e.g. via a sequential scan or via an index).
In your case the value 15183773646 is probably among the most frequently appearing values, which is why the planner's estimate is different for the query that uses it (the planner has a better estimate of the number of such rows) compared with other values, for which it essentially has to guess.
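If this is an Oracle database (the question mentions PL/SQL Developer), a minimal sketch of how you could check for such a histogram and gather one yourself; the table and column names follow the question, and the bucket size is just an illustrative choice:

-- Does the optimizer have a histogram on the column?
SELECT column_name, num_distinct, histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'PHONE_NUMBER';

-- Gather column statistics including a histogram (bucket count is illustrative)
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'PHONE_NUMBER',
    method_opt => 'FOR COLUMNS phone_number SIZE 254');
END;
/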

The optimizer evaluates all the ways to extract the data. Since your database has no rows with pn.phone_number = ' 15183773646', that value is not stored in the index, so Oracle can skip one I/O operation; this is why the second cost is lower.

Related

What is the efficiency of a query + subquery that finds the minimum parameter of a table in SQL?

I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what is its efficiency:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
             FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, making the query O(n + n)?
Or does the subquery run every time a customer's age is checked, which would make it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution/explain plan, which almost every RDBMS makes available.
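For example (a sketch only; the exact syntax depends on the RDBMS, and the table name comes from the question):

EXPLAIN                              -- PostgreSQL / MySQL
SELECT * FROM Customers
WHERE  Age = (SELECT MIN(Age) FROM Customers);

EXPLAIN PLAN FOR                     -- Oracle
SELECT * FROM Customers
WHERE  Age = (SELECT MIN(Age) FROM Customers);
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

The plan shows whether the engine computes MIN(Age) once or re-evaluates the subquery per row, which answers the O(n) vs O(n^2) question more reliably than guessing.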
As noted in the comments, you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, i.e. to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a column equals a particular value. You have chosen to find that value by first looking for the minimum age; you would only have to do this once, as it's a single scalar value, so it's reasonable to assume (but not guaranteed) that the database engine would do the same.
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
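For instance, on SQL Server that alternative could look like the sketch below (WITH TIES keeps every customer sharing the minimum age, so the result matches the original query):

SELECT TOP (1) WITH TIES *
FROM   Customers
ORDER BY Age;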
Ultimately there is no black and white answer.

Oracle ORDERED hint cost vs speed

So, a few weeks ago, I asked about Oracle execution plan cost vs speed in relation to the FIRST_ROWS(n) hint. I've run into a similar issue, but this time around the ORDERED hint. When I use the hint, my execution time improves dramatically (upwards of 90%), but the EXPLAIN PLAN for the query reports an enormous cost increase. In this particular query, the cost goes from 1500 to 24000.
The query is parameterized for pagination, and joins 19 tables to get the data out. I'd post it here, but it is 585 lines long and is written for a vendor's messy, godawful schema. Unless you happened to be intimately familiar with the product it is used for, it wouldn't be much help to see it. However, I gathered the schema stats at 100% shortly before starting work on tuning the query, so the CBO is not working in the dark here.
I'll try to summarize what the query does. The query essentially returns objects and their children in the system, and is structured as a large subquery block joined directly to several tables. The first part returns object IDs and is paginated inside its query block, before the joins to other tables. Then, it is joined to several tables that contain child IDs.
I know that the CBO is not all-knowing or infallible, but it really bothers me to see an execution plan this costly perform so well; it goes against a lot of what I've been taught. With the FIRST_ROWS hint, the solution was to provide a value n such that the optimizer could reliably generate the execution plan. Is there a similar kind of thing happening with the ORDERED hint for my query?
The reported cost is for the execution of the complete query, not just the first set of rows. (PostgreSQL does the costing slightly differently, in that it provides the cost for the initial return of rows and for the complete set).
For some plans the majority of the cost is incurred prior to returning the first rows (e.g. where a sort-merge is used), and for others the initial cost is very low but the cost per row is relatively high thereafter (e.g. a nested loop join).
So if you are optimising for the return of the first few rows and joining 19 tables, you may get a very low cost for the return of the first 20 with a nested loop-based plan. However, for the complete set of rows, the cost of that plan might be very much higher than that of others that are optimised for returning all rows at the expense of a delay in returning the first.
You should not rely on the execution cost to optimize a query. What matters is the execution time (and in some cases resource usage).
From the Concepts Guide:
The cost is an estimated value proportional to the expected resource use needed to execute the statement with a particular plan.
When the estimation is off, most often it is because the statistics available to the optimizer are misleading. You can correct that by giving the optimizer more accurate statistics. Check that the statistics are up to date. If they are, you can gather additional statistics, for example by enabling dynamic statistics gathering or by manually creating a histogram on a data-skewed column.
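For example, in Oracle you might let the optimizer sample the data at parse time with a dynamic sampling hint (a sketch; the table, column, and sampling level are illustrative):

SELECT /*+ dynamic_sampling(t 4) */ t.*
FROM   my_big_table t
WHERE  t.skewed_column = :some_value;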
Another factor that can explain the disparity between relative cost and execution time is that the optimizer is built upon simple assumptions. For example:
Without a histogram, every value in a column is assumed to be uniformly distributed
An equality operator will select 5% of the rows (without histogram or dynamic stats)
The data in each column is independent of the data in every other column
Furthermore, for queries with bind variables, a single cost is computed and reused for subsequent executions (even if the bind values change, possibly modifying the cardinality of the query)
...
These assumptions are made so that the optimizer can return an execution cost that is a single figure (and not an interval). For most queries these approximations don't matter much and the result is good enough.
However, you may find that sometimes the situation is simply too complex for the optimizer and even gathering extra statistics doesn't help. In that case you'll have to manually optimize the query, either by adding hints yourself, by rewriting the query or by using Oracle tools (such as SQL profiles).
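As a sketch of the hint route (the table names below are hypothetical, not the vendor schema from the question): ORDERED joins the tables in the order they appear in the FROM clause, while LEADING names only the table(s) to start from.

SELECT /*+ ORDERED USE_NL(c) */
       o.object_id, c.child_id
FROM   objects  o,
       children c
WHERE  c.object_id = o.object_id;

-- Fix only the starting table, leaving the rest of the join order to the optimizer
SELECT /*+ LEADING(o) */
       o.object_id, c.child_id
FROM   objects  o
JOIN   children c ON c.object_id = o.object_id;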
If Oracle could devise a way to accurately determine the execution cost, we would never need to optimize a query manually in the first place!

Tips and Tricks to speed up an SQL [duplicate]

Possible Duplicate:
Does the order of columns in a WHERE clause matter?
These are the basic SQL functions and keywords.
Are there any tips or tricks to speed up your SQL?
For example; I have a query with a lot of keywords. (AND, GROUP BY, ORDER BY, IN, BETWEEN, LIKE... etc.)
Which keyword should come first in my query?
How can I decide?
Example;
Where NUMBER IN (156, 646)
AND DATE BETWEEN '01/01/2011' AND '01/02/2011'
OR
Where DATE BETWEEN '01/01/2011' AND '01/02/2011'
AND NUMBER IN (156, 646)
Which one is faster? What does it depend on?
Don't use functions on columns in the WHERE clause, because the query engine must execute the function for every single row.
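For example (a sketch reusing the question's column names; the exact date functions and quoting rules depend on the RDBMS):

-- Function applied to the column: evaluated for every row, an index on DATE cannot be used
WHERE YEAR(DATE) = 2011

-- Equivalent range predicate: an index on DATE remains usable
WHERE DATE >= '2011-01-01' AND DATE < '2012-01-01'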
There are no "tricks".
Given the competition between the database vendors about which one is "faster", any "trick" that is always true would be implemented in the database itself. (The tricks are implemented in the part of the database called "optimizer").
There are only things to be aware of, but they typically can't be reduced into:
Use feature X
Avoid feature Y
Model like this
Never model like that
Look at all the raging questions/discussions about indexes, index types, index strategies, clustering, single column keys, compound keys, referential integrity, access paths, joins, join mechanisms, storage engines, optimizer behaviour, datatypes, normalization, query transformations, denormalization, procedures, buffer cache, resultset cache, application cache, modeling, aggregation, functions, views, indexed views, set processing, procedural processing and the list goes on.
All of them were invented to attack a specific problem area. Variations on that problem make the "trick" more or less suitable. Very often the tricks have zero effect, and sometimes they are flat-out horrible. Why? Because when we don't understand why something works, we are basically just throwing features at the problem until it goes away.
The key point here is that there is a reason why something makes a query go faster, and the understanding of what that something is, is crucial to the process of understanding why a different unrelated query is slow, and how to deal with it. And it is never a trick, nor magic.
We (humans) are lazy, and we want to be thrown that fish when what we really need is to learn how to catch it.
Now, what specific fish do YOU want to catch?
Edited for comments:
The placement of your predicates in the WHERE clause makes no difference, since the order in which they are processed is determined by the database. Some of the things which will affect that order (for your example) are:
Whether or not the query can be rewritten against an indexed view
What indexes are available that cover one or both of the columns NUMBER and DATE, and in what order those columns exist in the index
The estimated selectivity of your predicates, which basically means the estimated percentage of rows matched by your predicate. The lower the %, the more likely the optimizer is to use your index efficiently.
The clustering factor (or whatever the name is in SQL Server), if SQL Server factors that into the query cost. This has to do with how the order of the index entries aligns with the physical order of the table rows. Better alignment = reduced cost for a higher % of rows fetched via that index.
Now, if the only values you have in column NUMBER are 156, 646 and they are pretty much evenly spread, an index would be useless. A full scan would be a better alternative.
On the other hand, if those are unique order numbers (backed by a unique index), the optimizer will pick that index and drive the query from there. Similarly, if the rows having a DATE between the first and second of January 2011 make up a small enough % of the rows, an index leading with DATE will be considered.
Or if you include ORDER BY NUMBER, DATE, another parameter comes into the equation: the cost of sorting. An index on (NUMBER, DATE) will now seem more attractive to the optimizer, because even though it might not be the most efficient way of acquiring the rows, the sorting (which is expensive) can be skipped.
Or, if your query included a join to another table (say customer) on customer_id and you also had a filter on customer.ssn, again the equation changes, because (since you did a good job with foreign keys and a backing index) you will now have a very efficient access path into your first table, without using the indexes on NUMBER or DATE. Unless you only have one customer and all of the 10 million orders were his...
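To make the index part of this concrete, a sketch of the composite index discussed above (the index and table names are made up; whether NUMBER or DATE should lead depends on which predicate is more selective in your data, and these column names may need quoting in some databases):

CREATE INDEX ix_orders_number_date ON orders (NUMBER, DATE);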
Read about sargable queries (ones which can use an index versus ones which can't). Avoid correlated subqueries, functions in WHERE clauses, cursors and while loops. Don't use SELECT *, especially if you have joins, and never return more than the data you need.
Actually there are whole books written on performance tuning; get one and read it for the database you are using, as the techniques vary from database to database.
Learn to use indexes properly.
http://Use-The-Index-Luke.com/

Are SQL Execution Plans based on Schema or Data or both?

I hope this question is not too obvious...I have already found lots of good information on interpreting execution plans but there is one question I haven't found the answer to.
Is the plan (and more specifically the relative CPU cost) based on the schema only, or also the actual data currently in the database?
I am trying to do some analysis of where indexes are needed in my product's database, but am working with my own test system, which does not have close to the amount of data a product in the field would have. I am seeing some odd things, like the estimated CPU cost actually going slightly UP after adding an index, and am wondering if this is because my data set is so small.
I am using SQL Server 2005 and Management Studio to look at the plans.
It will be based on both Schema and Data. The Schema tells it what indexes are available, the Data tells it which is better.
The answer can change in small degrees depending on the DBMS you are using (you have not stated), but they all maintain statistics against indexes to know whether an index will help. If an index breaks 1000 rows into 900 distinct values, it is a good index to use. If an index only results in 3 different values for 1000 rows, it is not really selective so it is not very useful.
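A quick way to gauge that selectivity yourself (a sketch; the table and column names are placeholders):

SELECT COUNT(DISTINCT some_column) AS distinct_values,
       COUNT(*)                    AS total_rows
FROM   some_table;

The closer distinct_values is to total_rows, the more selective an index on that column will be.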
SQL Server uses a 100% cost-based optimizer. Other RDBMS optimizers are usually a mix of cost-based and rules-based, but SQL Server, for better or worse, is entirely cost driven. A rules-based optimizer would be one that can say, for example, that the order of the tables in the FROM clause determines the driving table in a join. There are no such rules in SQL Server. See SQL Statement Processing:
The SQL Server query optimizer is a cost-based optimizer. Each possible execution plan has an associated cost in terms of the amount of computing resources used. The query optimizer must analyze the possible plans and choose the one with the lowest estimated cost. Some complex SELECT statements have thousands of possible execution plans. In these cases, the query optimizer does not analyze all possible combinations. Instead, it uses complex algorithms to find an execution plan that has a cost reasonably close to the minimum possible cost.
The SQL Server query optimizer does not choose only the execution plan with the lowest resource cost; it chooses the plan that returns results to the user with a reasonable cost in resources and that returns the results the fastest. For example, processing a query in parallel typically uses more resources than processing it serially, but completes the query faster. The SQL Server optimizer will use a parallel execution plan to return results if the load on the server will not be adversely affected.
The query optimizer relies on distribution statistics when it estimates the resource costs of different methods for extracting information from a table or index. Distribution statistics are kept for columns and indexes. They indicate the selectivity of the values in a particular index or column. For example, in a table representing cars, many cars have the same manufacturer, but each car has a unique vehicle identification number (VIN). An index on the VIN is more selective than an index on the manufacturer. If the index statistics are not current, the query optimizer may not make the best choice for the current state of the table. For more information about keeping index statistics current, see Using Statistics to Improve Query Performance.
Both schema and data.
It takes the statistics into account when building a query plan, using them to approximate the number of rows returned by each step in the query (as this can have an effect on the performance of different types of joins, etc).
A good example of this is the fact that it doesn't bother to use indexes on very small tables, as performing a table scan is faster in this situation.
I can't speak for all RDBMS systems, but Postgres specifically uses estimated table sizes as part of its efforts to construct query plans. As an example, if a table has two rows, it may choose a sequential table scan for the portion of the JOIN that uses that table, whereas if it has 10000+ rows, it may choose to use an index or hash scan (if either of those are available.) Incidentally, it used to be possible to trigger poor query plans in Postgres by joining VIEWs instead of actual tables, since there were no estimated sizes for VIEWs.
Part of how Postgres constructs its query plans depends on tunable parameters in its configuration file. More information on how Postgres constructs its query plans can be found on the Postgres website.
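For example, Postgres prints its estimated row counts next to the actual ones when you run a query under EXPLAIN ANALYZE (a sketch; the tables here are placeholders):

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
WHERE  c.customer_name = 'Acme';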
For SQL Server, there are many factors that contribute to the final execution plan. On a basic level, statistics play a very large role; they are based on the data, but not always all of the data. Statistics are also not always up to date. When creating or rebuilding an index, the statistics should be based on a FULL / 100% sample of the data. However, the sample rate for automatic statistics refreshing is much lower than 100%, so it is possible to sample a range that is in fact not representative of much of the data. The estimated number of rows for the operation also plays a role, which can be based on the number of rows in the table or on the statistics of a filtered operation. So out-of-date (or incomplete) statistics can lead the optimizer to choose a less-than-optimal plan, just as a table with only a few rows can cause it to ignore indexes entirely (because a scan can be more efficient).
As mentioned in another answer, the more unique (i.e. selective) the data is, the more useful the index will be. But keep in mind that the only column guaranteed to have statistics is the leading (or "left-most" or "first") column of the Index. SQL Server can, and does, collect statistics for other columns, even some not in any Indexes, but only if the AutoCreateStatistics DB option is set (and it is by default).
Also, the existence of Foreign Keys can help the optimizer when those fields are in a query.
But one area not considered in the question is that of the Query itself. A query, slightly changed but still returning the same results, can have a radically different Execution Plan. It is also possible to invalidate the use of an Index by using:
LIKE '%' + field
or wrapping the field in a function, such as:
WHERE DATEADD(DAY, -1, field) < GETDATE()
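Where possible, move the function to the other side of the comparison so the column itself stays untouched; the following is an equivalent, index-friendly form of that predicate:

WHERE field < DATEADD(DAY, 1, GETDATE())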
Now, keep in mind that read operations are (ideally) faster with Indexes but DML operations (INSERT, UPDATE, and DELETE) are slower (taking more CPU and Disk I/O) as the Indexes need to be maintained.
Lastly, the "estimated" CPU, etc. values for cost are not always to be relied upon. A better test is to do:
SET STATISTICS IO ON;
-- run your query here
SET STATISTICS IO OFF;
and focus on "logical reads". If you reduce Logical Reads then you should be improving performance.
You will, in the end, need a set of data that comes somewhat close to what you have in Production in order to performance tune with regards to both Indexes and the Queries themselves.
Oracle specifics:
The stated cost is actually an estimated execution time, but it is given in a somewhat arcane unit of measure that has to do with estimated time for block reads. It's important to realize that the calculated cost doesn't say much about the runtime anyway, unless each and every estimate made by the optimizer was 100% perfect (which is never the case).
The optimizer uses the schema for a lot of things when deciding what transformations/heuristics can be applied to the query. Some examples of schema things that matter a lot when evaluating xplans:
Foreign key constraints (can be used for table elimination)
Partitioning (exclude entire ranges of data)
Unique constraints (index unique vs range scans for example)
Not null constraints (anti-joins are not available with NOT IN() on nullable columns)
Data types (type conversions, specialized date arithmetics)
Materialized views (for rewriting a query against an aggregate)
Dimension Hierarchies (to determine functional dependencies)
Check constraints (the constraint is injected if it lowers cost)
Index types (b-tree(?), bitmap, joined, function based)
Column order in index (a = 1 on {a,b} = range scan, {b,a} = skip scan or FFS)
The core of the estimates comes from using the statistics gathered on actual data (or cooked). Statistics are gathered for tables, columns, indexes, partitions and probably something else too.
The following information is gathered:
Nr of rows in table/partition
Average row/col length (important for costing full scans, hash joins, sorts, temp tables)
Number of nulls in col (is_president = 'Y' is pretty much unique)
Distinct values in col (last_name is not very unique)
Min/max value in col (helps unbounded range conditions like date > x)
...to help estimate the nr of expected rows/bytes returned when filtering data. This information is used to determine what access paths and join mechanisms are available and suitable given the actual values from the SQL query compared to the statistics.
On top of all that, there is also the physical row order, which affects how "good" or attractive an index becomes vs a full table scan. For indexes this is called the "clustering factor" and is a measure of how much the row order matches the order of the index entries.
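In Oracle that measure is exposed directly; a CLUSTERING_FACTOR close to the number of table blocks means the index order matches the physical row order, while a value close to the number of rows means it does not (a sketch; the table name is a placeholder):

SELECT i.index_name, i.clustering_factor, i.num_rows, t.blocks
FROM   user_indexes i
JOIN   user_tables  t ON t.table_name = i.table_name
WHERE  i.table_name = 'ORDERS';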

TableCardinality

What does TableCardinality mean in an execution plan?
I am looking at database performance tuning.
Thanks
Cardinality is the number of UNIQUE values for a given table/column (so it's not surprising that it is equal to the number of entries in the primary key index, as that index is most likely clustered). The cardinality of the index or table is useful to SQL Server as it allows the query optimiser to make educated guesses about the possible best plans to use when referencing that table.
When the optimiser has to take your SQL code and work out what to do with it, it will consider alternative plans before settling on the one that it will use to actually retrieve the data. In most real-world cases the number of possible plans is too large for SQL Server to calculate the absolute best plan by sampling all of them, so the optimiser will use statistical analysis to determine a "good enough" plan.
Cardinality is one of the metrics it uses to work out such a plan, as it may use it to determine which physical join to use (hash or nested loops, for example).
From this article
SQL Server keeps track of table cardinality when the query plan is compiled. It does so because it will trigger an automatic recompile if the actual cardinality is substantially different from the compile time cardinality.
It would seem a reasonable guess that that is what the TableCardinality in the plan XML is (and shown in the properties window) but I haven't found anything to confirm that.
Table Cardinality, in the context of the Execution Plan, is normally the same as the number of rows in the table per the latest statistics on that table.
Source: see http://support.microsoft.com/kb/195565 and search for "table cardinality" - you will see this note: "NOTE: In this strictest sense, SQL Server counts cardinality as the number of rows in the table."
Also, I checked a number of tables in various queries, and in each case, TableCardinality was equal to the number of rows in the table, or in some cases, the rowcnt value in sys.sysindexes for the primary key for that table.
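One way to make that comparison yourself on a modern SQL Server (a sketch; sys.partitions supersedes sysindexes.rowcnt, and dbo.YourTable is a placeholder):

SELECT OBJECT_NAME(p.object_id) AS table_name,
       SUM(p.rows)              AS row_count
FROM   sys.partitions AS p
WHERE  p.index_id IN (0, 1)            -- heap or clustered index
  AND  p.object_id = OBJECT_ID('dbo.YourTable')
GROUP BY p.object_id;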
Isn't it the same as Estimated Number of Rows?