Is there ALWAYS a "base table" in any database query? - sql

Ok this is slightly theoretical so it would be great if an unbiased database enthusiast gave an opinion.
For the sake of argument let's agree that there is such a concept as a "base table" w.r.t. a query, where one table drives the majority of the information in the result set. Imagine a query where there are three relations - TableA, TableB, and TableC.
Let's say TableA has a cardinality of 1 million records, TableB has 500 records, and TableC has 10,000.
Let's say the query is like so -
SELECT A.Col1
, A.Col2
, A.Col3
, A.Col4
, A.Col5
FROM TableA A
LEFT JOIN TableB B ON B.ID = A.TableBID
LEFT JOIN TableC C ON C.ID = A.TableCID
Ok, clearly TableA is the base relation above. It is the biggest table, it is driving the result set by being joined "from", and visually the columns are even on the "left side" of the result set. (The left-side thing actually was a criterion for my colleague.)
Now, let's assume that TableA has 1 million rows again, TableB is a "junction" or "bridge" table and has like 500,000 rows and TableC has 1,000,000 rows. So assume the query is just an outer join to get all columns in TableA and TableC where a relationship exists like below...
SELECT A.*
, C.*
FROM TableC C
FULL OUTER JOIN TableB B ON C.ID = B.TableCID
FULL OUTER JOIN TableA A ON A.ID = B.TableAID
Ok so given the last query, can anyone tell me what the "base relation" is? I don't think there is one, but was hoping for another database person's opinion.

The term "base table" has a definition and it has nothing to do with what you describe. A "base table" is pretty much just a "table". That is, it is not a view, it is not a table valued function, it is not the result of a query. It is what gets stored in the database as an explicit table.
What you seem to be grasping for is more related to optimization strategies. I have used similar terminology -- in the context of optimization -- to describe the "driving table" being accessed by the optimizer. The purpose of this is to distinguish between different execution plans.
Consider the query:
select * from t1 join t2 using (col)
There are multiple different execution plans. Here are some methods and what might be considered the "driving table" (if any) for them:
for each row in t1
for each row in t2
compare col
--> t1 is the "driving table"
for each row in t2
for each row in t1
compare col
--> t2 is the "driving table"
for each row in t1
look up t2 value using index on t2(col)
--> t1 is the "driving table"
sort t1 by col
sort t2 by col
compare the rows in the two sorted sets
--> no "driving table"
hash t1 by col
hash t2 by col
compare the hash maps
--> no "driving table"
In other words, the "driving" table has little to do with the query structure. It is based on the optimization strategies used for the query. That said, left joins and right joins limit the optimization paths. So, in a nested loop or index-lookup situation, the "first" (or "last") table would be the driving table.
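If you want to see which strategy the optimizer actually chose (and therefore which table, if any, is "driving"), look at the execution plan. A minimal sketch, using PostgreSQL-style EXPLAIN and assuming hypothetical tables t1 and t2 that share a col column:
EXPLAIN
SELECT *
FROM t1
JOIN t2 USING (col);
-- A "Nested Loop" node whose outer input scans t1 makes t1 the driving table;
-- a "Hash Join" or "Merge Join" node has no single driving table.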

The concept of a "driving" table is really an assumption about how the DBMS is expected to execute a query internally. A rule-based query optimizer, in the absence of any index-related preferences, may treat the ordering of tables and joins in a query as significant when choosing the execution plan. Under a cost-based optimizer, there is no significance to the order of tables and joins, so nothing about the structure of the query itself will tell you which table gets read first or in what order the join conditions get evaluated.
When conceptualizing a query, it may help to have a mental image of one table being the starting point for the query, but I think the answer to the question here must be no. Logically speaking, there is no such thing as a driving table.
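For instance, under a cost-based optimizer these two textually different inner-join queries will typically compile to the same plan (a sketch reusing the TableA/TableC schema from the question; EXPLAIN as in PostgreSQL or MySQL):
EXPLAIN SELECT * FROM TableA A JOIN TableC C ON C.ID = A.TableCID;
EXPLAIN SELECT * FROM TableC C JOIN TableA A ON C.ID = A.TableCID;
-- The written order of the tables does not determine which one is read first.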

A base table is a given named table-valued variable--a database table. That's it. In a query expression its name is a leaf expression denoting its value. "Given table variable" would be more descriptive. A query can use literal notation for a table. It would be reasonable for a given named table-valued constant to also be called "base". It's nothing about some kind of "main" table.
The relational model is founded on a table holding the rows that make a true proposition (statement) from its (characteristic) predicate (statement template parameterized by column names). We give base table rows & get query expression rows.
A query expression that is a base table name comes with a predicate given by the designer.
/* (liker, liked) rows where [liker] likes [liked] */
/* (liker, liked) rows where Likes(liker, liked) */
SELECT * FROM Likes
A query expression that is a table literal has a certain predicate in terms of columns being equal to values.
/* (person) rows where
person = 'Bob'
*/
SELECT * FROM (VALUES ('Bob')) dummy (person)
Otherwise a query expression has a predicate built from its constituent table expression predicates according to its relation operator.
Every algebra operator corresponds to a certain logic operator.
NATURAL JOIN & AND
RESTRICT θ & AND θ
UNION & OR
MINUS & AND NOT
PROJECT all but C & EXISTS C
etc.
/* (person) rows where
(FOR SOME liked, Likes(person, liked))
OR person = 'Bob'
*/
SELECT liker AS person
FROM Likes
UNION
VALUES ('Bob')
/* (person, liked) rows where
FOR SOME [values for] l1.*, l2.*,
person = l1.liker AND liked = l2.liked
AND Likes(l1.liker, l1.liked)
AND Likes(l2.liker, l2.liked)
AND l1.liked = l2.liker
AND person = 'Bob'
AND NOT Likes(l1.liked, 'Ed')
*/
SELECT l1.liker AS person, l2.liked
FROM Likes l1 INNER JOIN Likes l2
ON l1.liked = l2.liker
WHERE l1.liker = 'Bob'
AND NOT (l1.liked, 'Ed') IN (SELECT * FROM Likes)
There's no difference in how a base, literal or operator-call query expression is used in determining a containing query expression's predicate.
See also:
Is there any rule of thumb to construct SQL query from a human-readable description?
Relational algebra - recode column values

Let me suggest a perspective where the base table is the first one in the FROM clause (i.e. not a JOINed table). In the case where a statement can equally be written with either one table or another as the base table, we would say that there are two (or more) base tables.
In your first query, the base table is TableA. If you swap TableA and TableC in the query, you are not guaranteed to get the same results, because of the LEFT JOIN.
In the second query, since you are using FULL JOINs, all 3 tables could be reordered without changing the result, so this is indeed a use case of a query where all tables are base tables.
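To make the LEFT JOIN point concrete: if some TableA row has a TableCID that matches nothing in TableC, the two orderings disagree (a hypothetical sketch):
-- Keeps every TableA row, NULL-extending the TableC columns:
SELECT A.Col1, C.ID FROM TableA A LEFT JOIN TableC C ON C.ID = A.TableCID;
-- Keeps every TableC row instead; the unmatched TableA row disappears:
SELECT A.Col1, C.ID FROM TableC C LEFT JOIN TableA A ON C.ID = A.TableCID;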

Related

Join table vs Join Subquery - which is more efficient?

I was recently going through a lot of SQL code where the JOIN sections were filled with complex subqueries, and started wondering whether there is any benefit to joining a subquery with a limited column selection vs. joining the entire table and selecting only the necessary columns.
To illustrate:
Let's say we have 2 tables, Table1 and Table2, each with columns (PK, FK, a, b, c, d, e, f).
I want to join Table1 with Table2, but retrieve only a few fields from Table2.
Which is more efficient, what are the benefits of each?
SELECT
Table1.*,
Table2.a,
Table2.b
FROM Table1
LEFT JOIN Table2 ON Table1.PK = Table2.FK
OR
SELECT
Table1.*,
Table2sub.*
FROM Table1
LEFT JOIN (SELECT FK, a, b FROM Table2) AS Table2sub ON Table1.PK = Table2sub.FK
SQL is a descriptive language, not a procedural language. That is, a SQL query describes what the result set looks like, not how the result is produced.
In fact, what the engine runs is called a directed acyclic graph (DAG) -- and that looks nothing like a query. The SQL engine first parses the query, then compiles it, then optimizes it to produce the DAG.
SQL Server has a good optimizer. It is not going to be confused by subqueries. Some SQL compilers are not quite as smart and will materialize the subquery -- which could have a big impact on performance.
If you look at the execution plans, you will see that they are the same in this case.
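One way to verify this yourself in SQL Server (a sketch, assuming the Table1/Table2 schema from the question) is to run both statements with plan output enabled and compare:
SET STATISTICS PROFILE ON;  -- or use Ctrl+M in SSMS for the actual plan
SELECT Table1.*, Table2.a, Table2.b
FROM Table1
LEFT JOIN Table2 ON Table1.PK = Table2.FK;
SELECT Table1.*, Table2sub.*
FROM Table1
LEFT JOIN (SELECT FK, a, b FROM Table2) AS Table2sub ON Table1.PK = Table2sub.FK;
SET STATISTICS PROFILE OFF;
-- The two plans should come out identical: the optimizer flattens the subquery.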

Why does FULL JOIN order make a difference in these queries?

I'm using PostgreSQL. Everything I read here suggests that in a query using nothing but full joins on a single column, the order of tables joined basically doesn't matter.
My intuition says this should also go for multiple columns, so long as every common column is listed in the query where possible (that is, wherever both joined tables have the column in common). But this is not the case, and I'm trying to figure out why.
Simplified to three tables a, b, and c.
Columns in table a: id, name_a
Columns in table b: id, id_x
Columns in table c: id, id_x
This query:
SELECT *
FROM a
FULL JOIN b USING(id)
FULL JOIN c USING(id, id_x);
returns a different number of rows than this one:
SELECT *
FROM a
FULL JOIN c USING(id)
FULL JOIN b USING(id, id_x);
What I want/expect is hard to articulate, but basically, I'd like a "complete" full merger. I want no null fields anywhere unless that is unavoidable.
For example, whenever there is a non-null id, I want the corresponding name_a column to always be populated and not null. Instead, one of those example queries returns semi-redundant results, with one row having a name_a but no id, and another having an id but no name_a, rather than a single merged row.
When the joins are listed in the other order, I do get that desired result (but I'm not sure what other problems might occur, because future data is unknown).
Your queries are different.
In the first, you are doing a full join to b using a single column, id.
In the second, you are doing a full join to b using two columns.
Although the two queries could return the same results under some circumstances, there is no reason to think that the results would be comparable.
Argument order matters in OUTER JOINs, except that FULL NATURAL JOIN is symmetric. They return what an INNER JOIN (ON, USING or NATURAL) does but also the unmatched rows from the left (LEFT JOIN), right (RIGHT JOIN) or both (FULL JOIN) tables extended by NULLs.
USING returns the single shared value for each specified column in INNER JOIN rows; in NULL-extended rows another common column can have NULL in one table's version and a value in the other's.
Join order matters too. Even FULL NATURAL JOIN is not associative, since with multiple tables each pair of tables (either operand being an original or a join result) can have a unique set of common columns, i.e. in general (A ⟗ B) ⟗ C ≠ A ⟗ (B ⟗ C); see the sketch below.
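Here is a minimal sketch of that non-associativity, using the question's three tables with hypothetical one-row contents (the output column order is id, id_x, name_a):
INSERT INTO a VALUES (1, 'Alice');
INSERT INTO b VALUES (1, 10);
INSERT INTO c VALUES (2, 10);
SELECT * FROM a FULL JOIN b USING (id) FULL JOIN c USING (id, id_x);
-- 2 rows: (1, 10, 'Alice') and (2, 10, NULL)
SELECT * FROM a FULL JOIN c USING (id) FULL JOIN b USING (id, id_x);
-- 3 rows: (1, NULL, 'Alice'), (2, 10, NULL) and (1, 10, NULL)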
There are a lot of special cases where certain additional identities hold. E.g. FULL JOIN USING all common column names and OUTER JOIN ON equality of same-named columns are symmetric. Some cases involve CKs (candidate keys), FKs (foreign keys) and other constraints on the arguments.
Your question doesn't make clear exactly what input conditions you are assuming or what output conditions you are seeking.

SQL Join between tables with conditions

I'm thinking about which should be the best way (considering the execution time) of doing a join between 2 or more tables with some conditions. I got these three ways:
FIRST WAY:
select * from
TableA A inner join TableB B on A.KEY = B.KEY
where
B.PARAM = VALUE
SECOND WAY:
select * from
TableA A inner join TableB B on A.KEY = B.KEY
and B.PARAM = VALUE
THIRD WAY:
select * from
TableA A inner join (select * from TableB B where B.PARAM = VALUE) J on A.KEY = J.KEY
Consider that the tables have more than 1 million rows.
What's your opinion? Which is the right way, if one exists?
Usually putting the condition in the where clause or in the join condition makes no noticeable difference for inner joins.
If you are using outer joins, putting the condition in the where clause improves query time, because rows which don't meet the condition are removed from the result set, and the result set becomes smaller.
But if you put the condition in the join clause of a left outer join, no rows are removed, and the result set is bigger in comparison to using the condition in the where clause.
For clarification, follow the example below.
create table A
(
  ano   NUMBER,
  aname VARCHAR2(10),
  rdate DATE
);
----A data
insert into A
select 1,'Amand',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 2,'Alex',to_date('20130101','yyyymmdd') from dual;
commit;
insert into A
select 3,'Angel',to_date('20130201','yyyymmdd') from dual;
commit;
create table B
(
  bno   NUMBER,
  bname VARCHAR2(10),
  rdate DATE
);
insert into B
select 3,'BOB',to_date('20130201','yyyymmdd') from dual;
commit;
insert into B
select 2,'Br',to_date('20130101','yyyymmdd') from dual;
commit;
insert into B
select 1,'Bn',to_date('20130101','yyyymmdd') from dual;
commit;
First of all, we have a normal query which joins the 2 tables with each other:
select * from a inner join b on a.ano=b.bno
The result set has 3 records.
Now please run the queries below:
select * from a inner join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
select * from a inner join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
As you see, the row counts of the above results do not differ, and in my experience there is no noticeable performance difference for data in large volumes.
Now please run the queries below:
select * from a left outer join b on a.ano=b.bno and a.rdate=to_date('20130101','yyyymmdd')
In this case, the count of output records will be equal to the number of records in table A.
select * from a left outer join b on a.ano=b.bno where a.rdate=to_date('20130101','yyyymmdd')
In this case, records of A which didn't meet the condition are deleted from the result set and, as I said, the result set will have fewer records (in this case 2 records).
From the above examples we can draw the following conclusions:
1 - In the case of inner joins, there is no real difference between putting the condition in the where clause or the join clause, but please try to order the tables in the from clause so as to have minimal intermediate result row counts:
(http://www.dba-oracle.com/art_dbazine_oracle10g_dynamic_sampling_hint.htm)
2 - In the case of outer joins, whenever you don't care about the exact result row count (you don't mind losing the records of table A which have no paired records in table B, for which the fields of table B would be null in the result set), put the condition in the where clause to remove the rows which don't meet the condition and obviously improve query time by decreasing the result row count.
But in special cases you HAVE TO put the condition in the join part. For example, if you want your result row count to be equal to table A's row count (this case is common in ETL processes), you HAVE TO put the condition in the join clause.
3 - Avoiding subqueries is recommended by lots of reliable resources and expert programmers. They can increase query time; use a subquery only when its result data set is small.
I hope this will be useful :)
1M rows really isn't that much - especially if you have sensible indexes. I'd start off with making your queries as readable and maintainable as possible, and only start optimizing if you notice a performance problem with the query (and as Gordon Linoff said in his comment - it's doubtful there would even be a difference between the three).
It may be a matter of taste, but to me, the third way seems clumsy, so I'd cross it out. Personally, I prefer using JOIN syntax for the joining logic (i.e., how A and B's rows are matched) and WHERE for filtering (i.e., once matched, which rows interest me), so I'd go for the first way. But again, it really boils down to personal taste and preferences.
You need to look at the execution plans for the queries to judge which is the most computationally efficient. As pointed out in the comments, you may find they are equivalent. Here is some information on Oracle execution plans. Depending on what editor / IDE you use, there may be a shortcut for this, e.g. F5 in PL/SQL Developer.
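For example, in Oracle you can display the plan like this (a sketch, using the a/b tables from the earlier answer):
EXPLAIN PLAN FOR
select * from a inner join b on a.ano = b.bno where a.rdate = to_date('20130101','yyyymmdd');
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- Repeat for each variant and compare; identical plans mean the engine treats them as equivalent.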

Is there some equivalent to subquery correlation when making a derived table?

I need to flatten out 2 rows in a vertical table (and then join to a third table). I generally do this by making a derived table for each field I need. There are only two fields, so I figure this isn't that unreasonable.
But I know that the rows I want back in the derived table are the subset that's in my join with my third table.
So I'm trying to figure out the best derived tables to make so that the query runs most efficiently.
I figure the more restrictive I make the derived table's where clause, the smaller the derived table will be, the better response I'll get.
Really what I want is to correlate the where clause of the derived table with the join with the 3rd table, but you can't do that in SQL, which is too bad. But I'm no SQL master; maybe there's some trick I don't know about.
The other option is just to make the derived table(s) with no where clause, so it ends up joining the entire table twice (once for each field), and when I do my join against them the join filters everything out.
So really what I'm asking, I guess, is: what's the best way to make a derived table where I know pretty much specifically which rows I want, but SQL won't let me get at them?
An example:
table1
------
id tag value
-- ----- -----
1 first john
1 last smith
2 first sally
2 last smithers
table2
------
id occupation
-- ----------
1 carpenter
2 homemaker
select table2.occupation, firsttable.first, lasttable.last from
table2, (select id, value as first from table1 where tag = 'first') firsttable,
(select id, value as last from table1 where tag = 'last') lasttable
where table2.id = firsttable.id and table2.id = lasttable.id
What I want to do is make the firsttable where clause where tag='first' and id = table2.id
Derived tables are not there to store intermediate results as you expect; they are just a way to make code simpler. Using a derived table doesn't mean that the derived table expression will be executed first and its output then used to join with the remaining tables. The optimizer will automatically flatten derived tables in most cases.
However, there are cases where the optimizer might want to store the results of the subquery and thus materialize instead of flattening. That usually happens when you have some kind of aggregate function or the like. But in your case the query is too simple, so the optimizer will flatten it.
Also, storing the derived table expression won't make your query fast; it could in turn make it worse. Your real problem is too much normalization. Fix that and the query will be just a join of two tables.
Why do you have this kind of normalization? Why are you storing column values as rows? Try to denormalize table1 so that it has two columns, first and last (see the sketch after the query below). That will be the best solution for this.
Also, do you have proper indexes on the id and tag columns? If yes, then a merge join is quite good for your query.
Please provide the index details on these tables and the plan generated by your query.
Your query will be used like an inner join query:
select table2.occupation, first.value as first, last.value as last
from table2
inner join table1 first
  on first.tag = 'first'
  and first.id = table2.id
inner join table1 last
  on last.tag = 'last'
  and table2.id = last.id
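For reference, the denormalized shape suggested above might look like this (a hypothetical sketch; the table and column names are illustrative):
create table table1_flat (id int, first varchar(50), last varchar(50));
-- The query then collapses to a single join:
select t2.occupation, t1.first, t1.last
from table2 t2
inner join table1_flat t1 on t1.id = t2.id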
I think what you're asking for is a COMMON TABLE EXPRESSION. If your platform doesn't implement them, then a temporary table may be the best alternative.
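A sketch of the common table expression approach, assuming the table1/table2 schema from the question (standard SQL, supported by most modern platforms):
with firsttable as (select id, value as first from table1 where tag = 'first'),
     lasttable as (select id, value as last from table1 where tag = 'last')
select table2.occupation, firsttable.first, lasttable.last
from table2
join firsttable on table2.id = firsttable.id
join lasttable on table2.id = lasttable.id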
I'm a little confused. Your query looks okay . . . although it looks better with proper join syntax.
select table2.occupation, firsttable.first, lasttable.last
from table2 join
     (select id, value as first from table1 where tag = 'first') firsttable
     on table2.id = firsttable.id join
     (select id, value as last from table1 where tag = 'last') lasttable
     on table2.id = lasttable.id
This query does what you are asking it to do. SQL is a declarative language, not a procedural language. This means that you describe the result set and rely on the database SQL compiler to turn it into the right set of commands. (That said, sometimes how a query is structured does make it easier or harder for some engines to produce efficient query plans.)

Difference between "and" and "where" in joins

What's the difference between
SELECT DISTINCT field1
FROM table1 cd
JOIN table2
ON cd.Company = table2.Name
and table2.Id IN (2728)
and
SELECT DISTINCT field1
FROM table1 cd
JOIN table2
ON cd.Company = table2.Name
where table2.Id IN (2728)
both return the same result and both have the same explain output
Firstly, there is a semantic difference. When you have a join, you are saying that the relationship between the two tables is defined by that condition. So in your first example you are saying that the tables are related by cd.Company = table2.Name AND table2.Id IN (2728). When you use the WHERE clause, you are saying that the relationship is defined by cd.Company = table2.Name and that you only want the rows where the condition table2.Id IN (2728) applies. Even though these give the same answer, they mean very different things to a programmer reading your code.
In this case, the WHERE clause is almost certainly what you mean so you should use it.
Secondly, there is actually a difference in the result in the case that you use a LEFT JOIN instead of an INNER JOIN. If you include the second condition as part of the join, you will still get a result row if the condition fails - you will get values from the left table and nulls for the right table. If you include the condition as part of the WHERE clause and that condition fails, you won't get the row at all.
Here is an example to demonstrate this.
Query 1 (WHERE):
SELECT DISTINCT field1
FROM table1 cd
LEFT JOIN table2
ON cd.Company = table2.Name
WHERE table2.Id IN (2728);
Result:
field1
200
Query 2 (AND):
SELECT DISTINCT field1
FROM table1 cd
LEFT JOIN table2
ON cd.Company = table2.Name
AND table2.Id IN (2728);
Result:
field1
100
200
Test data used:
CREATE TABLE table1 (Company NVARCHAR(100) NOT NULL, Field1 INT NOT NULL);
INSERT INTO table1 (Company, Field1) VALUES
('FooSoft', 100),
('BarSoft', 200);
CREATE TABLE table2 (Id INT NOT NULL, Name NVARCHAR(100) NOT NULL);
INSERT INTO table2 (Id, Name) VALUES
(2727, 'FooSoft'),
(2728, 'BarSoft');
SQL comes from relational algebra.
One way to look at the difference is that JOINs are operations on sets that can produce more or fewer records in the result than you had in the original tables. On the other hand, WHERE will always restrict the number of results.
The rest of the text is extra explanation.
When I said that the where condition will always restrict the results, you have to take into account that when we are talking about queries on two (or more) tables, you have to somehow pair the records from these tables even if there is no JOIN keyword.
So in SQL if the tables are simply separated by a comma, you are actually using a CROSS JOIN (cartesian product) which returns every row from one table for each row in the other.
And since this is the maximum number of combinations of rows from the two tables, the results of any WHERE on cross-joined tables can be expressed as a JOIN operation (see the sketch below).
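A sketch of that equivalence, with hypothetical tables t1 and t2 sharing a col column:
-- These two produce the same rows (an inner join):
SELECT * FROM t1, t2 WHERE t1.col = t2.col;
SELECT * FROM t1 JOIN t2 ON t1.col = t2.col;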
But hold, there are exceptions to this maximum when you introduce LEFT, RIGHT and FULL OUTER joins.
LEFT JOIN will join records from the left table on the given criteria with records from the right table. BUT if the join criteria, looking at a row from the left table, is not satisfied for any record in the right table, the LEFT JOIN will still return a record from the left table, with NULLs in the columns that would come from the right table (RIGHT JOIN works similarly but from the other side; FULL OUTER works like both at the same time).
Since the default cross join does NOT return those records, you cannot express these join criteria with a WHERE condition and you are forced to use JOIN syntax (Oracle was an exception to this, with its proprietary (+) extension to the = operator, but this was not accepted by other vendors or the standard).
Also, joins usually, but not always, coincide with existing referential integrity and suggest relationships between entities, but I would not put that much weight on this, since the where conditions can do the same (except in the previously mentioned case) and to a good RDBMS it will not make a difference where you specify your criteria.
The join is used to reflect the entity relations
the where clause filters down results.
So the join clauses are 'static' (unless the entity relations change), while the where clauses are use-case specific.
For an inner join there is no difference: "ON" is like a synonym for "WHERE", so the second kind of reads like:
JOIN table2 WHERE cd.Company = table2.Name AND table2.Id IN (2728)
There is no difference when the query optimisation engine breaks it down to its relevant query operators.