The function "PROBE" on column may be causing a table scan - sql

I ran SQL Doctor from Idera against my database. It generated a report, and in the "Query Optimization" category I got the finding "The function "PROBE" on column may be causing a table scan". The tool provided the link http://sqldoctor.idera.com/query-optimization/implicit-conversion-recommendation/ but I can't find anything there related to PROBE.
Does anyone know what it stands for and where I can find the exact details for it?

I don't normally like to do all-link answers, but you asked for "what it stands for and where [you can] find the exact details for it."
Here's a nice summary explanation: Probe Residual on Hash Match
Here's a long Microsoft explanation: Interpreting Execution Plans Containing Bitmap Filters.
And here's one that I think might be the most helpful: Probe Residual when you have a Hash Match – a hidden cost in execution plans
And here's my two cents as well. Without seeing your queries, tables, or execution plan, I'm mostly guessing, but I would say that the fact that you were directed to that page in the documentation suggests that you are doing a join that requires an implicit conversion. Since PROBE is associated with hash matches, I infer your join is one of those.
So my guess is that you are joining on two or more fields that have mismatched data types, and that the conversion this requires means that the indexes on one of your tables can't be used. Without a usable index, the query engine needs to do a table scan, a very expensive operation (particularly if you have a large table).
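For illustration, here is a hedged sketch of the kind of join that triggers this (all table and column names are made up): joining an INT key to a VARCHAR key forces an implicit conversion, which can prevent an index seek on one side and leave you with a hash match and a scan.
-- customers.customerid is INT, orders.customerref is VARCHAR:
select c.name, o.total
from customers c
join orders o on o.customerref = c.customerid;
-- The implicit conversion on orders.customerref can defeat its index,
-- so the engine scans orders and applies the conversion as a probe residual.
-- Making the two columns the same type lets the join seek on an index instead.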

Related

How many indices will be used per query in SQL/sqlite3?

I found that my query runs extremely slowly when two tables are inner joined on a large dataset, while for a small dataset it was fine. I was told that sqlite3 can only use one index per table in a query, but I am not sure about that; does anyone know where this information can be found? Thanks a lot!
Much of the information about how sqlite uses indexes can be found at the following documentation URLs:
https://www.sqlite.org/queryplanner.html
https://www.sqlite.org/optoverview.html
https://www.sqlite.org/queryplanner-ng.html
Basically, each time a table is used in a query, at most one index of the table can normally be used (a notable exception is that OR conditions in a WHERE or ON might allow multiple indexes, one per term). The query planner tries to pick the one that will be most appropriate and fastest. Running ANALYZE on a populated database generates statistics that can be used to improve the selection when there are multiple possible indexes.
Understanding EXPLAIN QUERY PLAN reports is essential to figuring out how a particular query ends up being resolved and how and what indexes are involved.
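As a minimal sketch (hypothetical table and index names), you can create two candidate indexes, gather statistics, and then see which single index SQLite picks for the table:
create index idx_orders_customer on orders(customerid);
create index idx_orders_date on orders(orderdate);
analyze; -- gathers statistics the planner uses to choose between the two indexes
explain query plan
select * from orders where customerid = 42 and orderdate > '2020-01-01';
-- The report shows something like "SEARCH orders USING INDEX idx_orders_customer (customerid=?)",
-- i.e. only one of the two indexes is used for this table.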

Does SQLite have an eval command?

I have a query referenced in Why is SQLite refusing to use available indexes when adding a JOIN? that is a compound query. When the segments of the query are evaluated individually, the query plan generated applies the relevant indices and runs smoothly. However, when they are run together (via a JOIN) it fails to do so. Therefore, I was wondering if there was a way to create a query that runs 'eval' on the subquery and passes that to the outer query, to force SQLite to use the query plans that would have been generated had they been done individually.
The answer to your other question tells you why already: indexes are not used when they're not useful.
In essence:
If it's cheapest to hop back and forth on disk pages to fetch a handful of rows that match a query, an index gets used.
If it's cheapest to just read the entire mess and filter out unneeded rows, an index is not used.
Some databases (e.g. Postgres) offer an intermediary level between the two in the form of a bitmap index scan: it amounts to the second with a pre-flight check based on the index, to avoid visiting disk pages that contain no matching rows.
That's all there is to it, really: a few rows, index; lots of rows, no index.
Naturally, poorly written queries don't use indexes either, but that's for different reasons: they just confuse the query planner, and while smart, the latter is not all-knowing. Joining on a union or an aggregate, in particular, is a prime recipe for not using indexes. (And that is what you are doing.)
As usual, you should write your queries and indexes in such a way that SQLite's query optimizer recognizes the optimal indexes and simply uses them.
But as your question in this case is more specific, it seems you are looking for an equivalent of SQL Server's FORCE(INDEX) hint.
From what I have read, SQLite has the INDEXED BY clause for this, though the SQLite community's opinions about it seem to be split (probably because of what I mentioned in my first sentence).
Link 1: sqlite.org's documentation about it
Link 2: a tutorial on that
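A minimal sketch of INDEXED BY (hypothetical table and index names); note that SQLite raises an error if the named index cannot be used to satisfy the query:
select * from orders indexed by idx_orders_customer
where customerid = 42;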

How DBMS Process different types of joins?

How do database engines process SQL joins? Do they apply different techniques to process different types of joins? An explanation with examples would be appreciated.
Query evaluation is very complex. I recommend that you pick up a database textbook and read the query-evaluation portion of your favourite DBMS's documentation.
In a nutshell, there are 3 main types of algorithms: single-pass, loop-based, and sort/merge-based. Which one is used depends on the number of tuples in the tables to join, the expected number of joined tuples, the size of memory and the disk speed (if properly tuned), the existence of indexes, and how good the DBMS's planner is.
Single-pass joins happen when the table to be joined fits in memory.
Loop-based joins are usually done when one table fits completely in memory (they can be index-based or hash-based).
Multiple passes are required for sort/merge-based joins. The sketch below shows how to see which strategy a planner actually picked.
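As a hedged sketch (PostgreSQL syntax, since the link below uses PostgreSQL; table and column names are hypothetical), EXPLAIN shows which of these strategies the planner chose for a given join:
explain
select c.name, o.total
from customers c
join orders o on o.customerid = c.customerid;
-- The plan contains a node such as "Hash Join", "Merge Join", or "Nested Loop",
-- telling you which algorithm was selected for this data and these indexes.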
This URL has some good examples:
http://etutorials.org/SQL/Postgresql/Part+I+General+PostgreSQL+Use/Chapter+4.+Performance/Understanding+How+PostgreSQL+Executes+a+Query/
--dmg

How to Optimize Queries in a Database - The Basics

It seems that all questions regarding this topic are very specific, and while I value specific examples, I'm interested in the basics of SQL optimization. I am very comfortable working in SQL, and have a background in hardware/low level software.
What I want are the tools, both tangible software and a method, to look at the mysql databases I work with on a regular basis and understand what difference the ordering of join statements and where statements makes.
I want to know why an index helps, like, exactly why. I want to know specifically what happens differently, and I want to know how I can actually look at what is happening. I don't need a tool that will break down every step of my SQL; I just want to be able to poke around, and if someone can't tell me what column to index, I want to be able to get out a sheet of paper and, within some period of time, come up with the answers.
Databases are complicated, but they aren't THAT complicated, and there must be some great material out there for learning the basics, so that you know how to find the answers to the optimization problems you encounter, even if you could hunt down the exact answer on a forum.
Please recommend some reading that is concise, intuitive, and not afraid to get down to the low level nuts and bolts. I prefer online free resources, but if a book recommendation demolishes the nail head it hits I'd consider accepting it.
You have to do a look-up for every WHERE condition and for every JOIN ... ON condition. The two work the same way.
Suppose we write
select name
from customer
where customerid=37;
Somehow the DBMS has to find the record or records with customerid=37. If there is no index, the only way to do this is to read every record in the table comparing the customerid to 37. Even when it finds one, it has no way of knowing there is only one, so it has to keep looking for others.
If you create an index on customerid, the DBMS has ways to search the index very quickly. It's not a sequential search but, depending on the database, a binary search or some other efficient method. Exactly how doesn't matter; just accept that it's much faster than sequential. The index then takes it directly to the appropriate record or records. Furthermore, if you specify that the index is "unique", then the database knows that there can only be one, so it doesn't waste time looking for a second. (And the DBMS will prevent you from adding a second.)
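A minimal sketch of that index (hypothetical index name):
create unique index idx_customer_customerid on customer (customerid);
-- With this in place, the lookup for customerid=37 is fast and can stop after the first match.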
Now consider this query:
select name
from customer
where city='Albany' and state='NY';
Now we have two conditions. If you have an index on only one of those fields, the DBMS will use that index to find a subset of the records, then sequentially search those. For example, if you have an index on state, the DBMS will quickly find the first record for NY, then sequentially search looking for city='Albany', and stop looking when it reaches the last record for NY.
If you have an index that includes both fields, i.e. "create index on customer (state, city)", then the DBMS can immediately zoom to the right records.
If you have two separate indexes, one on each field, the DBMS will have various rules that it applies to decide which index to use. Again, exactly how this is done depends on the particular DBMS you are using, but basically it tries to keep statistics on the total number of records, the number of different values, and the distribution of values. Then it will search those records sequentially for the ones that satisfy the other condition. In this case the DBMS would probably observe that there are many more cities than there are states, so by using the city index it can quickly zoom to the 'Albany' records. Then it will sequentially search these, checking the state of each against 'NY'. If you have records for Albany, California these will be skipped.
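For the two-separate-index case above, the setup would look like this (hypothetical index names):
create index idx_customer_state on customer (state);
create index idx_customer_city on customer (city);
-- The planner would likely pick idx_customer_city here, since city values are more selective than state values,
-- and then check state='NY' on each row it fetches.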
Every join requires some sort of look-up.
Say we write
select customer.name
from transaction
join customer on transaction.customerid=customer.customerid
where transaction.transactiondate='2010-07-04' and customer.type='Q';
Now the DBMS has to decide which table to read first, select the appropriate records from there, and then find the matching records in the other table.
If you had an index on transaction.transactiondate and customer.customerid, the best plan would likely be to find all the transactions with this date, and then for each of those find the customer with the matching customerid, and then verify that the customer has the right type.
If you don't have an index on customer.customerid, then the DBMS could quickly find the transaction, but then for each transaction it would have to sequentially search the customer table looking for a matching customerid. (This would likely be very slow.)
Suppose instead that the only indexes you have are on transaction.customerid and customer.type. Then the DBMS would likely use a completely different plan. It would probably scan the customer table for all customers with the correct type, then for each of these find all transactions for this customer, and sequentially search them for the right date.
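As a hedged sketch, the first plan described above would be supported by an index like this (hypothetical name), together with the unique index on customer (customerid) shown earlier:
create index idx_transaction_date on transaction (transactiondate);
-- find the day's transactions via this index, then look up each matching customer by customerid.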
The most important key to optimization is to figure out what indexes will really help and create those indexes. Extra, unused indexes are a burden on the database because it takes work to maintain them, and if they're never used this is wasted effort.
You can tell what indexes the DBMS will use for any given query with the EXPLAIN command. I use this all the time to determine if my queries are being optimized well or if I should be creating additional indexes. (Read the documentation on this command for an explanation of its output.)
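A minimal sketch of checking the plan for the query above (MySQL syntax; output columns vary by version):
explain
select customer.name
from transaction
join customer on transaction.customerid=customer.customerid
where transaction.transactiondate='2010-07-04' and customer.type='Q';
-- The "key" column shows which index (if any) was chosen for each table,
-- and "rows" shows roughly how many rows the optimizer expects to examine.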
Caveat: Remember that I said that the DBMS keeps statistics on the number of records and the number of different values and so on in each table. EXPLAIN may give you a completely different plan today than it gave yesterday if the data has changed. For example, if you have a query that joins two tables and one of these tables is very small while the other is large, it will be biased toward reading the small table first and then finding matching records in the large table. Adding records to a table can change which is larger, and thus lead the DBMS to change its plan. Thus, you should attempt to do EXPLAINS against a database with realistic data. Running against a test database with 5 records in each table is of far less value than running against a live database.
Well, there's much more that could be said, but I don't want to write a book here.
Let's say you're looking for a friend in another city. One way would be to go from door to door and ask whether this is the house you're looking for. Another way is to look at the map.
The index is the map to a table. It can tell the DB engine exactly where the thing you're looking for is. Thus, you index every column that you think you will have to search for, and leave out the columns that you are just reading data from, and never searching for.
Good technical reading about indices and about ORDER BY optimization. And if you want to see what exactly is happening, you want the EXPLAIN statement.
Don't think about optimizing databases. Think about optimizing queries.
Generally, you optimize one case at the expense of others. You just have to decide which cases you're interested in.
I don't know about MySql tools, but in MS SqlServer you have a tool that shows all of the operations a query would perform and how much of the entire query's processing time each one would take.
Using this tool helped me understand how queries are optimized by the query optimizer much more than I think any book could, because what the optimizer does is often not easy to understand. By tweaking the query, and possibly the underlying database, I could see how each change affected the query plan. There are certain key points in writing queries, but to me it looks like you already have an idea of those, so optimizing in your case is much more about this than about any general rules. After a few years of db development I did look at a few books specifically aimed at database optimization on SQL Server and found very little useful info.
Quick googling came up with this: http://www.mysql.com/products/enterprise/query.html which sounds like a similar tool.
This was of course at the query level; database-level optimizations are again a different kettle of fish, but there you are looking at parameters such as how your database is divided across the hard drives, etc. At least in SqlServer you can choose to divide tables across different HDDs and even disk platters, and this can have a big effect because the drives and drive heads can work in parallel. Another is how you can build your queries so that the database can run them in several threads and on several processors in parallel, but both of these issues again depend on the database engine and even the version you are using.
[Caution: Most of this Answer does not apply to MySQL. I bring this up because the OP tagged the Question with mysql.]
"I'm interested particularly in how indices will affect joins"
As an example, I'll take the case of an equijoin (SELECT ... FROM A, B WHERE A.x = B.y).
If there are no indexes at all (which is possible in theory, but I think not in SQL), then basically the only way to compute the join is to take the entire table A and partition it over x, take the entire table B and partition it over y, then match the partitions, and finally, for each pair of matching partitions, compute the result rows. That's costly (or even outright impossible due to memory restrictions) for all but the smallest tables.
Same story if indexes do exist on A and/or B, but none of them has x (respectively y) as its first attribute.
If there does exist an index on x but not on y (or conversely), then another possibility opens up: scan table B, for each row pick the value of y, look that value up in the index, and fetch the corresponding A rows to compute the join. Note that this still won't win you much if no further restrictions apply (AND z = ...) - except in the case where there are only few matches between x and y values.
If ordered indexes (hash-based indexes are not ordered) exist on both x and y, then a third possibility opens up: do a matching scan on the indexes themselves (they are likely to be smaller than the tables, so scanning them will take less time), and for the matching x/y values, compute the join of the corresponding rows.
That's the baseline. Variations arise for joins on x>y etc.
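A minimal sketch of that third possibility (hypothetical index names; both are ordered B-tree indexes):
create index idx_a_x on A (x);
create index idx_b_y on B (y);
-- With ordered indexes on both join columns, the engine can merge-scan the two indexes
-- and only visit table rows for matching x/y values.
select A.*, B.* from A join B on A.x = B.y;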

How do you interpret a query's explain plan?

When attempting to understand how a SQL statement is executing, it is sometimes recommended to look at the explain plan. What is the process one should go through in interpreting (making sense) of an explain plan? What should stand out as, "Oh, this is working splendidly?" versus "Oh no, that's not right."
I shudder whenever I see comments that full tablescans are bad and index access is good. Full table scans, index range scans, fast full index scans, nested loops, merge join, hash joins etc. are simply access mechanisms that must be understood by the analyst and combined with a knowledge of the database structure and the purpose of a query in order to reach any meaningful conclusion.
A full scan is simply the most efficient way of reading a large proportion of the blocks of a data segment (a table or a table (sub)partition), and, while it often can indicate a performance problem, that is only in the context of whether it is an efficient mechanism for achieving the goals of the query. Speaking as a data warehouse and BI guy, my number one warning flag for performance is an index based access method and a nested loop.
So, for the mechanism of how to read an explain plan the Oracle documentation is a good guide: http://download.oracle.com/docs/cd/B28359_01/server.111/b28274/ex_plan.htm#PFGRF009
Have a good read through the Performance Tuning Guide also.
Also have a google for "cardinality feedback", a technique in which an explain plan can be used to compare the estimations of cardinality at various stages in a query with the actual cardinalities experienced during the execution. Wolfgang Breitling is the author of the method, I believe.
So, bottom line: understand the access mechanisms. Understand the database. Understand the intention of the query. Avoid rules of thumb.
This subject is too big to answer in a question like this. You should take some time to read Oracle's Performance Tuning Guide.
The two examples below show a FULL scan and a FAST scan using an INDEX.
It's best to concentrate on your Cost and Cardinality. Looking at the examples, the use of the index reduces the Cost of running the query.
It's a bit more complicated (and I don't have a 100% handle on it), but basically the Cost is a function of CPU and IO cost, and the Cardinality is the number of rows Oracle expects to parse. Reducing both of these is a good thing.
Don't forget that the Cost of a query can be influenced by your query and the Oracle optimiser model (e.g. COST, CHOOSE, etc.) and how often you run your statistics.
Example 1, a plan with a FULL scan (image): http://docs.google.com/a/shanghainetwork.org/File?id=dd8xj6nh_7fj3cr8dx_b
Example 2, a plan using an INDEX (image): http://docs.google.com/a/fukuoka-now.com/File?id=dd8xj6nh_9fhsqvxcp_b
And as already suggested, watch out for TABLE SCAN. You can generally avoid these.
Looking for things like sequential scans can be somewhat useful, but the reality is in the numbers... except when the numbers are just estimates! What is usually far more useful than looking at a query plan is looking at the actual execution. In Postgres, this is the difference between EXPLAIN and EXPLAIN ANALYZE. EXPLAIN ANALYZE actually executes the query, and gets real timing information for every node. That lets you see what's actually happening, instead of what the planner thinks will happen. Many times you'll find that a sequential scan isn't an issue at all, instead it's something else in the query.
The other key is identifying what the actual expensive step is. Many graphical tools will use different sized arrows to indicate how much different parts of the plan cost. In that case, just look for steps that have thin arrows coming in and a thick arrow leaving. If you're not using a GUI you'll need to eyeball the numbers and look for where they suddenly get much larger. With a little practice it becomes fairly easy to pick out the problem areas.
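A minimal sketch in PostgreSQL (hypothetical table and column names):
explain analyze
select * from my_table where created_at > now() - interval '1 day';
-- Each node now reports "actual time=... rows=..." alongside the planner's estimates,
-- so you can see where the estimates and reality diverge.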
Really, for issues like these, the best thing to do is ASKTOM. In particular, his answer to that question contains links to the online Oracle docs, where a lot of those sorts of rules are explained.
One thing to keep in mind, is that explain plans are really best guesses.
It would be a good idea to learn to use sqlplus, and experiment with the AUTOTRACE command. With some hard numbers, you can generally make better decisions.
But you should ASKTOM. He knows all about it :)
The output of the explain tells you how long each step is expected to take (or, when the query is actually executed, how long it really took). The first thing is to find the steps that take a long time and understand what they mean. Things like a sequential scan tell you that you need better indexes - it is mostly a matter of research into your particular database and experience.
One "Oh no, that's not right" is often in the form of a table scan. Table scans don't utilize any special indexes and can contribute to purging of every useful in memory caches. In postgreSQL, for example, you will find it looks like this.
Seq Scan on my_table (cost=0.00..15558.92 rows=620092 width=78)
Sometimes table scans are preferable to, say, using an index to query the rows. However, this is one of those red-flag patterns that you seem to be looking for.
Basically, you take a look at each operation and see if the operations "make sense" given your knowledge of how it should be able to work.
For example, if you're joining two tables, A and B on their respective columns C and D (A.C=B.D), and your plan shows a clustered index scan (SQL Server term -- not sure of the oracle term) on table A, then a nested loop join to a series of clustered index seeks on table B, you might think there was a problem. In that scenario, you might expect the engine to do a pair of index scans (over the indexes on the joined columns) followed by a merge join. Further investigation might reveal bad statistics making the optimizer choose that join pattern, or an index that doesn't actually exist.
Look at the percentage of time spent in each subsection of the plan, and consider what the engine is doing. For example, if it is scanning a table, consider putting an index on the field(s) it is scanning for.
I mainly look for index or table scans. This usually tells me I'm missing an index on an important column that's in the where statement or join statement.
From http://www.sql-server-performance.com/tips/query_execution_plan_analysis_p1.aspx:
If you see any of the following in an execution plan, you should consider them warning signs and investigate them for potential performance problems. Each of them is less than ideal from a performance perspective.
* Index or table scans: May indicate a need for better or additional indexes.
* Bookmark Lookups: Consider changing the current clustered index, consider using a covering index, limit the number of columns in the SELECT statement.
* Filter: Remove any functions in the WHERE clause, don't include views in your Transact-SQL code, may need additional indexes.
* Sort: Does the data really need to be sorted? Can an index be used to avoid sorting? Can sorting be done at the client more efficiently?
It is not always possible to avoid these, but the more you can avoid them, the faster query performance will be.
Rules of Thumb
(you probably want to read up on the details too: Oracle Docs, ASKTOM, SQL Server Docs)
Bad
Table Scans of Several Large Tables
Good
Using a unique index
Index includes all required fields
Most Common Win
In about 90% of performance problems I have seen, the easiest win is to break up a query with lots (4 or more) of tables into 2 smaller queries and a temporary table.
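A hedged sketch of that pattern (SQL Server style temporary table; every table and column name here is hypothetical):
-- Step 1: reduce part of the wide join to a small intermediate result
select o.orderid, o.customerid, o.orderdate
into #recent_orders
from orders o
join orderstatus s on s.orderid = o.orderid
where o.orderdate >= '2015-01-01';
-- Step 2: join the small temporary table to the remaining tables
select c.name, r.orderid, r.orderdate, g.regionname
from #recent_orders r
join customer c on c.customerid = r.customerid
join region g on g.regionid = c.regionid;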