Order of tables in INNER JOIN - sql

I'm going through the book Learning SQL by Alan Beaulieu. On the topic of inner joins, it says that whatever the order of the tables in an INNER JOIN, the results are the same, and gives the following reason:
If you are confused about why all three versions of the account/employee/customer query
yield the same results, keep in mind that SQL is a nonprocedural language, meaning
that you describe what you want to retrieve and which database objects need to be
involved, but it is up to the database server to determine how best to execute your
query. Using statistics gathered from your database objects, the server must pick one
of three tables as a starting point (the chosen table is thereafter known as the driving
table), and then decide in which order to join the remaining tables. Therefore, the order
in which tables appear in your from clause is not significant.
So does it imply that if statistics gathered from database objects change, then results would also change?

No. The same query will always produce the same results (provided, of course, that the underlying data is the same). What the author is explaining is that the database may choose one strategy or another to process the query (starting from one table or another, using one algorithm or another to join the rows, and so on). That decision is made based on many factors, some of which come from the information available in the statistics.
The key point is that SQL is a declarative language, not a procedural language: you don't get to choose how the database handles the query, you just tell it what result you want.
However, regardless of the algorithm that the database chooses, the result is guaranteed to be consistent.
Note that there are edge cases where the database does not guarantee that results are the same for consecutive executions of the same query (like a query with a row-limiting clause but without an order by): it's the responsibility of the client to provide a query whose results are properly defined (the language does give you enough rope to hang yourself, if you really want to).
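As a rough illustration of the point above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table and column names are invented for the example (they are not taken from the book), and SQLite is only one engine; the sketch simply checks that three orderings of the same inner join return the same set of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, loosely modelled on account/employee/customer.
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY,
                           cust_id INTEGER, open_emp_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO employee VALUES (10, 'Teller A'), (11, 'Teller B');
    INSERT INTO account  VALUES (100, 1, 10), (101, 1, 11), (102, 2, 10);
""")

select_list = "SELECT a.account_id, c.name, e.name "
queries = [
    # account first
    select_list + "FROM account a "
    "JOIN customer c ON a.cust_id = c.cust_id "
    "JOIN employee e ON a.open_emp_id = e.emp_id",
    # customer first
    select_list + "FROM customer c "
    "JOIN account a ON a.cust_id = c.cust_id "
    "JOIN employee e ON a.open_emp_id = e.emp_id",
    # employee first
    select_list + "FROM employee e "
    "JOIN account a ON a.open_emp_id = e.emp_id "
    "JOIN customer c ON a.cust_id = c.cust_id",
]

# Sort in Python so that only the set of rows is compared, not their order.
results = [sorted(conn.execute(q).fetchall()) for q in queries]
same_rows = all(r == results[0] for r in results)
print(same_rows)
```

The rows are sorted before comparing because, without an ORDER BY, the order in which each variant returns them is not part of the guarantee.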

Related

SQL - Order of records returned in join by default [duplicate]

As far as I know from relational database theory, a select statement without an order by clause should be considered to have no particular order. But in practice, in SQL Server and Oracle (I've tested on those 2 platforms), if I query a table without an order by clause multiple times, I always get the results in the same order. Can this behavior be relied on? Can anyone help explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. Simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be primary key order, the order they were created, or some other arbitrary order. More complex queries, such as select * from foo where bar < 10, may instead be returned in the order of a different column, based on an index read, or in table order, for a table scan. Even more elaborate queries, with multiple where conditions, group by clauses, or unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because data has changed between those queries. A "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it: an RDBMS has the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk and over the network to send data to you), minimizing CPU, and keeping the size of its working set small (using methods that require minimal temporary storage).
Without an ORDER BY clause, you have not asked for any particular order, and so the RDBMS will give you those rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH, use ORDER BY and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns, you can get different results. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
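The paging problem described above can be sketched as follows. This is a minimal example with Python's sqlite3 module, where the item table, its columns, and the fetch_page helper are all hypothetical. The sort column (category) has many duplicates, so the unique id is added as a tie-breaker to make every page well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, category TEXT)")
# Odd ids get category 'A', even ids get category 'B'.
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(i, "A" if i % 2 else "B") for i in range(1, 11)],
)

def fetch_page(page, page_size=3):
    """Return one page; 'id' breaks ties between rows with equal category."""
    return conn.execute(
        "SELECT id, category FROM item "
        "ORDER BY category, id "      # category alone would leave ties undefined
        "LIMIT ? OFFSET ?",
        (page_size, page * page_size),
    ).fetchall()

pages = [fetch_page(p) for p in range(4)]
all_ids = [row[0] for page in pages for row in page]
print(all_ids)
```

With ORDER BY category alone, the database would be free to arrange rows within each category differently on each call, which is how a row can show up on two pages or on none.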
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I got an answer from Tom Kyte and am posting it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer to the question is kept below only for the sake of history. It is the wrong answer. The right answer is above.)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered
collection of rows. These rows will come out in a seemingly random
order, and depending on other options being used (parallel query,
different optimizer modes and so on), they may come out in a different
order with the same query. Do not ever count on the order of rows from
a query unless you have an ORDER BY statement on your query!
But note that he only talks about heap-organized tables. There are also index-organized tables. In that case you can rely on the order of the select without ORDER BY, because the order is implicitly defined by the primary key. This is true for Oracle.
In SQL Server, clustered indexes (index-organized tables) are created by default. PostgreSQL can also store data physically ordered according to an index. More information can be found here
UPDATE:
I see that my answer is being voted down, so I will try to explain my point a little.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of the index, all data is stored in a specific order. I believe the same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me, please give me a link to the documentation. I'll be happy to know that there is something for me to learn.

Is the order of the result of SELECT DISTINCT ... WHERE ... "random"?

I have an SQL query that reads
SELECT DISTINCT [NR] AS K_ID
FROM [DB].[staging].[TABLE]
WHERE [N]=1 and [O]='XXX' and [TYPE] in ('1_P', '2_I')
Since I'm saving the result in a CSV file (via Python Pandas) which is under version control I've noticed that the order of the result changes every time I run the query. In order to eliminate the Python part here I ran the query in MS SQL Server Management Studio, where I'm also observing a different order with every attempt.
It doesn't matter in my case, but: Is it correct, that the result of the query can be ordered differently with every execution? And if so, is there a way to make the order "deterministic"?
SQL databases are based on relational algebra and set theory, where what you think of as tables are more formally called unordered relations. Unless you specify an ORDER BY, the database is free to return the data in whatever order is convenient.
This order might match an index, rather than the order on disk. It might also start in the middle of the data, if the database can take advantage of work already in progress for another query to reduce total reads between the two (SQL Server Enterprise Edition will do this).
Worse, even the order on disk might change. If there's no primary key, the database can even move a page around to help things run more efficiently.
In other words, if the order matters (and it usually does), specify an ORDER BY clause.
SQL queries return results as an unordered set, unless the outermost query has an order by.
On smaller amounts of data, the results look repeatable. However, on larger systems -- and particularly on parallel systems -- the ordering may be based on hashing algorithms, when nodes complete, and congestion on the network (among other factors). So, you can in fact see different orderings each time you run.
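To make the result of the query in the question reproducible (for example, for a CSV file under version control), the ORDER BY can be applied to the DISTINCT column itself. The following is only a sketch with Python's sqlite3 module: the table name staging_table and the sample rows are made up, and SQLite stands in for SQL Server, so the bracketed three-part names from the question are not used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_table (NR INTEGER, N INTEGER, O TEXT, TYPE TEXT)")
conn.executemany(
    "INSERT INTO staging_table VALUES (?, ?, ?, ?)",
    [
        (3, 1, "XXX", "1_P"),
        (1, 1, "XXX", "2_I"),
        (3, 1, "XXX", "2_I"),   # duplicate NR, removed by DISTINCT
        (2, 1, "XXX", "3_Z"),   # filtered out by TYPE
        (5, 0, "XXX", "1_P"),   # filtered out by N
        (4, 1, "YYY", "1_P"),   # filtered out by O
    ],
)

rows = conn.execute(
    "SELECT DISTINCT NR AS K_ID "
    "FROM staging_table "
    "WHERE N = 1 AND O = 'XXX' AND TYPE IN ('1_P', '2_I') "
    "ORDER BY K_ID"             # the only thing that fixes the output order
).fetchall()
k_ids = [row[0] for row in rows]
print(k_ids)
```

Because K_ID is unique after DISTINCT, sorting on it alone is enough to pin down the order completely.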

If you join two tables in the SELECT statement, all indexes on the table columns can no longer be used?

Let's say we have:
SELECT *
FROM Pictures
JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
WHERE Pictures.UserId = #UserId
ORDER BY Pictures.UploadDate DESC
In this case, the database first joins the two tables and then works on the derived table, which I think would mean the indexes on the individual tables would be of no use, unless you can come up with an index that is bound to some column in the derived table?
You have a fundamental misunderstanding of how SQL works. The SQL language specifies what result set should be returned. It says nothing about how the database should achieve those results.
It is up to the database engine to parse the statement and come up with an execution plan (hopefully an efficient one) that will produce the correct results. Many modern relational databases have sophisticated query optimizers that completely pull apart the statement and derive execution plans that seem to have no relationship with the original query. (At least not to the untrained eye)
The execution plan for the same query can even change over time if the engine uses a cost based optimizer. A cost based optimizer makes decisions based on statistics that have been gathered about data and indexes. As the statistics change, the execution plan can also change.
With your simple query you assume that the database has to join the tables and create a temporary result set before it applies the where clause. That might be how you think about the problem, but the database is free to implement it entirely differently. I doubt there are many (if any) databases that would create a temporary result set for your simple query.
This is not to say that you cannot ever predict when an index may or may not be used. But it takes practice and experience to get a feel for how a database might execute a query.
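One way to check rather than guess is to ask the engine for its plan. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN. The column types and the index name idx_pictures_user are assumptions made for the example, and other databases have their own EXPLAIN syntax and output format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the question does not show the real table definitions.
conn.executescript("""
    CREATE TABLE Categories (CategoryId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Pictures (
        PictureId  INTEGER PRIMARY KEY,
        CategoryId INTEGER,
        UserId     INTEGER,
        UploadDate TEXT
    );
    CREATE INDEX idx_pictures_user ON Pictures (UserId);
""")

plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT *
    FROM Pictures
    JOIN Categories ON Categories.CategoryId = Pictures.CategoryId
    WHERE Pictures.UserId = ?
    ORDER BY Pictures.UploadDate DESC
    """,
    (42,),
).fetchall()

# The last column of each plan row is the human-readable description.
plan_steps = [row[-1] for row in plan]
for step in plan_steps:
    print(step)
```

In this setup the plan mentions idx_pictures_user, which shows that the index on the base table is still usable even though the query contains a join.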
This will join the tables, giving you all the category information whenever a picture's CategoryId matches a CategoryId in the Categories table (and no result for a particular picture if there is no such category).
This query will likely return several rows of data. The indexes of either table will be useful no matter which table you would like to access.
Normally your program would loop through the result set.
CategoryId will give you the row in Categories with all the relevant fields in that Category and 'Picture.Id' (assuming there is such a field) will give you a reference to that exact picture row in the database.
You can then manipulate either table by using the relevant index
"UPDATE Categories SET .... WHERE CategoryId = " +
"UPDATE Pictures ..... WHERE PictureId =" +
or some such depending on your programming environment.
Whether an index is used is up to the optimizer, and depends on what is occurring in the query. For the query posted, there's nothing obvious to stop an index from being used. However, not all databases operate the same -- MySQL generally uses at most one index per table in a SELECT, index merge optimizations aside (check the query plan, because the optimizer might interpret the JOIN so that another index may be used).
What is likely to ensure that an index cannot be used is any function or operation applied to the indexed column, e.g. getting the month out of a date, or wildcarding the left side of a LIKE pattern...
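The effect of wrapping an indexed column in a function can be seen in a query plan. This is a sketch with Python's sqlite3 module; the Pictures columns, the index name idx_pictures_date, and the plan_for helper are assumptions for the example, and the exact behaviour differs between engines (some support indexes on expressions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Pictures (
        PictureId  INTEGER PRIMARY KEY,
        UserId     INTEGER,
        UploadDate TEXT,          -- ISO dates such as '2010-07-04'
        Title      TEXT
    );
    CREATE INDEX idx_pictures_date ON Pictures (UploadDate);
""")

def plan_for(sql):
    """Return the descriptive text of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Function applied to the indexed column: the index on UploadDate cannot help.
func_plan = plan_for(
    "SELECT * FROM Pictures WHERE strftime('%Y-%m', UploadDate) = '2010-07'"
)

# Same month expressed as a range on the bare column: the index is usable.
range_plan = plan_for(
    "SELECT * FROM Pictures "
    "WHERE UploadDate >= '2010-07-01' AND UploadDate < '2010-08-01'"
)

print(func_plan)
print(range_plan)
```

Rewriting the condition as a range on the unmodified column is a common way to keep such a filter index-friendly.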

Methods of visualizing joins

Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which are the columns you need
Try to figure out the minimum number of tables which will be needed to do it.
Write your FROM part with the table which will give max number of columns. eg FROM Teams T
Add each join one by one on a new line. Decide at each step whether you need an INNER, LEFT, RIGHT, or FULL OUTER JOIN.
Usually works for me. Keep in mind that it is Structured query language. Always break your query into logical lines and it's much easier.
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
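A cheap way to apply this advice is to compare row counts before and after a join, and to check whether the join key is really unique on the side you expect. The sketch below uses Python's sqlite3 module with two made-up tables (orders and customer_phone); the scalar helper is also just for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE customer_phone (customer_id INTEGER, phone TEXT);
    INSERT INTO orders VALUES (1, 7), (2, 7), (3, 8);
    -- customer 7 has two phone rows, so customer_id is NOT unique here
    INSERT INTO customer_phone VALUES
        (7, '555-0100'), (7, '555-0101'), (8, '555-0102');
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

orders_count = scalar("SELECT COUNT(*) FROM orders")
joined_count = scalar(
    "SELECT COUNT(*) FROM orders o "
    "JOIN customer_phone p ON p.customer_id = o.customer_id"
)

# Keys that occur more than once on the 'lookup' side explain the blow-up.
duplicate_keys = conn.execute(
    "SELECT customer_id, COUNT(*) FROM customer_phone "
    "GROUP BY customer_id HAVING COUNT(*) > 1"
).fetchall()

print(orders_count, joined_count, duplicate_keys)
```

If the join was meant to be one row to one row, a joined count larger than the original count is the first sign that the join condition does not identify a unique row.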
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to turn these plans into graphical trees, some of them in SQL browsers (e.g., Oracle SQL Developer, or whatever SQL Server's GUI client is called). If you don't have a tool, most plan text output includes a "depth" column, which you can use to indent the lines.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use CTEs, views, or some other carefully organized subqueries to break it into logical pieces, so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is efficiency, then SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you very good ideas of where problems lie, if you are using MS SQL Server.

How to Optimize Queries in a Database - The Basics

It seems that all questions regarding this topic are very specific, and while I value specific examples, I'm interested in the basics of SQL optimization. I am very comfortable working in SQL, and have a background in hardware/low level software.
What I want is the tools, both tangible software and a method, to look at the MySQL databases I work with on a regular basis and understand what difference the order of join statements and where statements makes.
I want to know why an index helps, like, exactly why. I want to know specifically what happens differently, and I want to know how I can actually look at what is happening. I don't need a tool that will breakdown every step of my SQL, I just want to be able to poke around and if someone can't tell me what column to index, I will be able to get out a sheet of paper and within some period of time be able to come up with the answers.
Databases are complicated, but they aren't THAT complicated, and there must be some great material out there for learning the basics so that you know how to find the answers to optimization problems you encounter, even if you could hunt down the exact answer on a forum.
Please recommend some reading that is concise, intuitive, and not afraid to get down to the low level nuts and bolts. I prefer online free resources, but if a book recommendation demolishes the nail head it hits I'd consider accepting it.
You have to do a look up for every where condition and for every join...on condition. The two work the same.
Suppose we write
select name
from customer
where customerid=37;
Somehow the DBMS has to find the record or records with customerid=37. If there is no index, the only way to do this is to read every record in the table comparing the customerid to 37. Even when it finds one, it has no way of knowing there is only one, so it has to keep looking for others.
If you create an index on customerid, the DBMS has ways to search the index very quickly. It's not a sequential search, but, depending on the database, a binary search or some other efficient method. Exactly how doesn't matter, except that it's much faster than sequential. The index then takes it directly to the appropriate record or records. Furthermore, if you specify that the index is "unique", then the database knows that there can only be one, so it doesn't waste time looking for a second. (And the DBMS will prevent you from adding a second.)
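The difference between the two cases can be observed directly. This is a small sketch with Python's sqlite3 module: the index name idx_customer_id and the plan_for helper are made up, and the plan wording is SQLite's own (SCAN for reading every row, SEARCH for an index lookup).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# customerid is deliberately NOT declared as a key, so no index exists yet.
conn.execute("CREATE TABLE customer (customerid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(i, f"name{i}") for i in range(1, 101)],
)

def plan_for(sql):
    """Return the descriptive text of each EXPLAIN QUERY PLAN row."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT name FROM customer WHERE customerid = 37"

plan_before = plan_for(query)          # no index: every row is read
conn.execute("CREATE UNIQUE INDEX idx_customer_id ON customer (customerid)")
plan_after = plan_for(query)           # index lookup instead of a full read

# A unique index also stops a second row with the same customerid.
try:
    conn.execute("INSERT INTO customer VALUES (37, 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(plan_before, plan_after, duplicate_rejected)
```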
Now consider this query:
select name
from customer
where city='Albany' and state='NY';
Now we have two conditions. If you have an index on only one of those fields, the DBMS will use that index to find a subset of the records, then sequentially search those. For example, if you have an index on state, the DBMS will quickly find the first record for NY, then sequentially search looking for city='Albany', and stop looking when it reaches the last record for NY.
If you have an index that includes both fields, i.e. "create index on customer (state, city)", then the DBMS can immediately zoom to the right records.
If you have two separate indexes, one on each field, the DBMS will have various rules that it applies to decide which index to use. Again, exactly how this is done depends on the particular DBMS you are using, but basically it keeps statistics on the total number of records, the number of different values, and the distribution of values, and uses them to pick the index that should narrow the search the most. It then sequentially searches the records found through that index for the ones that satisfy the other condition. In this case the DBMS would probably observe that there are many more cities than there are states, so by using the city index it can quickly zoom to the 'Albany' records. Then it will sequentially search these, checking the state of each against 'NY'. If you have records for Albany, California these will be skipped.
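For the composite-index case described above, a query plan shows both columns being used for the lookup. This is a sketch with Python's sqlite3 module; the index name idx_customer_state_city and the sample rows are invented, and the plan text is specific to SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customerid INTEGER PRIMARY KEY,
        name  TEXT,
        city  TEXT,
        state TEXT
    );
    CREATE INDEX idx_customer_state_city ON customer (state, city);
    INSERT INTO customer VALUES
        (1, 'Alice', 'Albany',  'NY'),
        (2, 'Carl',  'Albany',  'CA'),
        (3, 'Dee',   'Buffalo', 'NY');
""")

query = "SELECT name FROM customer WHERE city = 'Albany' AND state = 'NY'"

# Both equality conditions are matched against the single two-column index.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
names = [row[0] for row in conn.execute(query)]

print(plan)
print(names)
```

Note that the order of the conditions in the WHERE clause does not have to match the order of the columns in the index.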
Every join requires some sort of look-up.
Say we write
select customer.name
from transaction
join customer on transaction.customerid=customer.customerid
where transaction.transactiondate='2010-07-04' and customer.type='Q';
Now the DBMS has to decide which table to read first, select the appropriate records from there, and then find the matching records in the other table.
If you had an index on transaction.transactiondate and customer.customerid, the best plan would likely be to find all the transactions with this date, and then for each of those find the customer with the matching customerid, and then verify that the customer has the right type.
If you don't have an index on customer.customerid, then the DBMS could quickly find the transactions, but then for each transaction it would have to sequentially search the customer table looking for a matching customerid. (This would likely be very slow.)
Suppose instead that the only indexes you have are on transaction.customerid and customer.type. Then the DBMS would likely use a completely different plan. It would probably scan the customer table for all customers with the correct type, then for each of these find all transactions for this customer, and sequentially search them for the right date.
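The two index configurations above can be tried side by side. The sketch below uses Python's sqlite3 module; the table is called txn here because TRANSACTION is a keyword in SQLite, and the index names, sample data, and run helper are assumptions for the example. Which table SQLite actually reads first is up to its planner, so the plans are printed rather than assumed; what is certain is that the rows returned are the same.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE customer (customerid INTEGER, name TEXT, type TEXT);
    CREATE TABLE txn (txnid INTEGER, customerid INTEGER, transactiondate TEXT);
    INSERT INTO customer VALUES (1, 'Ann', 'Q'), (2, 'Bob', 'R'), (3, 'Cy', 'Q');
    INSERT INTO txn VALUES
        (10, 1, '2010-07-04'), (11, 2, '2010-07-04'),
        (12, 3, '2010-07-05'), (13, 1, '2010-07-04');
"""

QUERY = """
    SELECT customer.name
    FROM txn
    JOIN customer ON txn.customerid = customer.customerid
    WHERE txn.transactiondate = '2010-07-04' AND customer.type = 'Q'
"""

def run(index_ddl):
    """Build a fresh database with the given indexes; return (plan, rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for ddl in index_ddl:
        conn.execute(ddl)
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    rows = sorted(conn.execute(QUERY).fetchall())
    conn.close()
    return plan, rows

# Configuration A: indexes on the transaction date and on customer.customerid.
plan_a, rows_a = run([
    "CREATE INDEX idx_txn_date ON txn (transactiondate)",
    "CREATE INDEX idx_customer_id ON customer (customerid)",
])

# Configuration B: indexes on txn.customerid and on customer.type.
plan_b, rows_b = run([
    "CREATE INDEX idx_txn_customer ON txn (customerid)",
    "CREATE INDEX idx_customer_type ON customer (type)",
])

print(plan_a)
print(plan_b)
print(rows_a == rows_b)
```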
The most important key to optimization is to figure out what indexes will really help and create those indexes. Extra, unused indexes are a burden on the database because it takes work to maintain them, and if they're never used this is wasted effort.
You can tell what indexes the DBMS will use for any given query with the EXPLAIN command. I use this all the time to determine if my queries are being optimized well or if I should be creating additional indexes. (Read the documentation on this command for an explanation of its output.)
Caveat: Remember that I said that the DBMS keeps statistics on the number of records and the number of different values and so on in each table. EXPLAIN may give you a completely different plan today than it gave yesterday if the data has changed. For example, if you have a query that joins two tables and one of these tables is very small while the other is large, it will be biased toward reading the small table first and then finding matching records in the large table. Adding records to a table can change which is larger, and thus lead the DBMS to change its plan. Thus, you should attempt to do EXPLAINS against a database with realistic data. Running against a test database with 5 records in each table is of far less value than running against a live database.
Well, there's much more that could be said, but I don't want to write a book here.
Let's say you're looking for a friend in another city. One way would be to go from door to door and ask whether this is the house you're looking for. Another way is to look at the map.
The index is the map to a table. It can tell the DB engine exactly where the thing you're looking for is. Thus, you index every column that you think you will have to search for, and leave out the columns that you are just reading data from, and never searching for.
Good technical reading about indices and about ORDER BY optimization. And if you want to see what exactly is happening, you want the EXPLAIN statement.
Don't think about optimizing databases. Think about optimizing queries.
Generally, you optimize one case at the expense of others. You just have to decide which cases you're interested in.
I don't know about MySQL tools, but MS SQL Server has a tool that shows all of the operations a query would perform and how much of the entire query's processing time each one would take.
Using this tool helped me to understand how queries are optimized by the query optimizer much more than I think any book could, because what the optimizer does is often not easy to understand. By tweaking the query and possibly the underlying database, I could see how each change affected the query plan. There are certain key points in writing queries, but to me it looks like you already have an idea of those, so optimizing in your case is much more about this than about any general rules. After a few years of db development I did look at a few books specifically aimed at database optimization on SQL Server and found very little useful info.
Quick googling came up with this: http://www.mysql.com/products/enterprise/query.html which sounds like a similar tool.
This was of course on a query level. Database-level optimizations are a different kettle of fish again, but there you are looking at parameters such as how your database is divided across the hard drives. At least in SQL Server you can choose to place tables on different drives and even disk platters, and this can have a big effect because the drives and drive heads can work in parallel. Another is how you can build your queries so that the database can run them in several threads and processors in parallel, but both of these issues again depend on the database engine and even the version you are using.
[Caution: Most of this Answer does not apply to MySQL. I bring this up because the OP tagged the Question with mysql.]
"I'm interested particularly in how indices will affect joins"
As an example, I'll take the case of equijoin (SELECT FROM A,B WHERE A.x = B.y).
If there are no indexes at all (which is possible in theory but I think not in SQL), then basically the only way to compute the join is to take the entire table A and partition it over x, take the entire table B and partition it over y, then match the partitions, and finally for each pair of matching partitions compute the result rows. That's costly (or even outright impossible due to memory restrictions) for all but the smallest tables.
Same story if there do exist indexes on A and/or B, but not any of them has x resp. y as its first attribute.
If there does exist an index on x, but not on y (or conversely), then another possibility opens up : scan table B, for each row pick value y, lookup that value in the index and fetch the corresponding A rows to compute the join. Note that this still won't win you much if no other further restrictions apply (AND z = ...) - except in the case where there are only few matches between x and y values.
If ordered indexes (hash-based indexes are not ordered) exist on both x and y, then a third possibility opens up : do a matching scan on the indexes themselves (the indexes themselves are likely to be smaller than the tables themselves, so scanning the index itself will take a shorter time), and for the matching x/y values, compute the join of the corresponding rows.
That's the baseline. Variations arise for joins on x>y etc.
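The three strategies above can be mimicked in a few lines of plain Python. This is a toy sketch, not how any particular engine implements joins: rows are (key, payload) tuples, a dict stands in for an existing index on x, and sorted lists stand in for ordered indexes on x and y. All function names are invented for the illustration.

```python
from collections import defaultdict

def partition_join(a_rows, b_rows):
    """No indexes: partition both tables by join key, then match partitions."""
    a_parts, b_parts = defaultdict(list), defaultdict(list)
    for a in a_rows:
        a_parts[a[0]].append(a)
    for b in b_rows:
        b_parts[b[0]].append(b)
    return [(a, b)
            for key in a_parts.keys() & b_parts.keys()
            for a in a_parts[key]
            for b in b_parts[key]]

def index_lookup_join(index_on_x, b_rows):
    """Index on x only: scan B and look each y value up in the index."""
    return [(a, b) for b in b_rows for a in index_on_x.get(b[0], [])]

def merge_join(a_sorted, b_sorted):
    """Ordered indexes on x and y: walk both in key order in one pass."""
    out, i, j = [], 0, 0
    while i < len(a_sorted) and j < len(b_sorted):
        x, y = a_sorted[i][0], b_sorted[j][0]
        if x < y:
            i += 1
        elif x > y:
            j += 1
        else:
            # Find the run of rows with this key on each side, then pair them.
            i_end, j_end = i, j
            while i_end < len(a_sorted) and a_sorted[i_end][0] == x:
                i_end += 1
            while j_end < len(b_sorted) and b_sorted[j_end][0] == x:
                j_end += 1
            out += [(a, b) for a in a_sorted[i:i_end] for b in b_sorted[j:j_end]]
            i, j = i_end, j_end
    return out

A = [(2, "a2"), (1, "a1"), (4, "a4"), (2, "a2b")]
B = [(4, "b4"), (3, "b3"), (2, "b2"), (4, "b4b")]

index_on_x = defaultdict(list)          # stand-in for a pre-built index on A.x
for row in A:
    index_on_x[row[0]].append(row)

r1 = sorted(partition_join(A, B))
r2 = sorted(index_lookup_join(index_on_x, B))
r3 = sorted(merge_join(sorted(A), sorted(B)))
print(r1 == r2 == r3, len(r1))
```

All three produce the same pairs; they differ only in how much of each table has to be read and held in memory to get there.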