Does a SQL JOIN's ON imply WHERE? - sql

When joining a very large table to a small table, I try to be as specific as possible in my join query. Am I going overboard, however?
Let's say I have SmallTable with one column and just three values: "Peter", "Paul", and "Mary". I'll end up joining a bunch of huge tables to this. Should I put a WHERE statement in my join in order to narrow the join's select statement? Or does a join imply the where condition?
SELECT
Username,
click.TotalClicks,
otherjoin.SneezePercent,
anotherjoin.Coats
FROM
SmallTable
LEFT JOIN (
SELECT
Person,
SUM(Clicks) AS TotalClicks
FROM
HugeTable
WHERE
Person LIKE 'Peter' OR Person LIKE 'Paul' OR Person LIKE 'Mary'
) click
ON click.Person = Username
LEFT JOIN (
...

I think the version you currently have is the optimal one, because your WHERE restriction will save the database from aggregating over names whose results you ultimately will be discarding anyway in the join, in the outer query. Your current use of LIKE might preclude an index, but the database also might be able to use an index in that WHERE clause, for even better performance.
The alternative to this, namely relying on the join with the small table, would filter out names you don't want, but by then the aggregation would have already been done on the entire large table.

Related

Semi-join vs Subqueries

What is the difference between semi-joins and a subquery? I am currently taking a course on this on DataCamp and i'm having a hard time making a distinction between the two.
Thanks in advance.
A join or a semi join is required whenever you want to combine two or more entities records based on some common conditional attributes.
Unlike, Subquery is required whenever you want to have a lookup or a reference on same table or other tables
In short, when your requirement is to get additional reference columns added to existing tables attributes then go for join else when you want to have a lookup on records from the same table or other tables but keeping the same existing columns as o/p go for subquery
Also, In case of semi join it can act/used as a subquery because most of the times we dont actually join the right table instead we mantain a check via subquery to limit records in the existing hence semijoin but just that it isnt a subquery by itself
I don't really think of a subquery and a semi-join as anything similar. A subquery is nothing more interesting than a query that is used inside another query:
select * -- this is often called the "outer" query
from (
select columnA -- this is the subquery inside the parentheses
from mytable
where columnB = 'Y'
)
A semi-join is a concept based on join. Of course, joining tables will combine both tables and return the combined rows based on the join criteria. From there you select the columns you want from either table based on further where criteria (and of course whatever else you want to do). The concept of a semi-join is when you want to return rows from the first table only, but you need the 2nd table to decide which rows to return. Example: you want to return the people in a class:
select p.FirstName, p.LastName, p.DOB
from people p
inner join classes c on c.pID = p.pID
where c.ClassName = 'SQL 101'
group by p.pID
This accomplishes the concept of a semi-join. We are only returning columns from the first table (people). The use of the group by is necessary for the concept of a semi-join because a true join can return duplicate rows from the first table (depending on the join criteria). The above example is not often referred to as a semi-join, and is not the most typical way to accomplish it. The following query is a more common method of accomplishing a semi-join:
select FirstName, LastName, DOB
from people
where pID in (select pID
from class
where ClassName = 'SQL 101'
)
There is no formal join here. But we're using the 2nd table to determine which rows from the first table to return. It's a lot like saying if we did join the 2nd table to the first table, what rows from the first table would match?
For performance, exists is typically preferred:
select FirstName, LastName, DOB
from people p
where exists (select pID
from class c
where c.pID = p.pID
and c.ClassName = 'SQL 101'
)
In my opinion, this is the most direct way to understand the semi-join. There is still no formal join, but you can see the idea of a join hinted at by the usage of directly matching the first table's pID column to the 2nd table's pID column.
Final note. The last 2 queries above each use a subquery to accomplish the concept of a semi-join.

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have query in query? Is it "inner query"? "Sub-query"? Does it counts as three queries (my example)? If its bad to do so... how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but as you have it performance can vary, especially depending on your RDBMS.
It varies from database to database, especially if the columns compared are
indexed or not
nullable or not
..., but generally if your query is not using columns from the table joined to -- you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id )
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong in using them, and can make your life easier. However, performance can often be improved by rewriting queries w/ subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.contact_id OR events.user_id = contacts.user_id)
WHERE events.user_id = '1'
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.

INNER JOIN vs multiple table names in "FROM" [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
INNER JOIN versus WHERE clause — any difference?
What is the difference between an INNER JOIN query and an implicit join query (i.e. listing multiple tables after the FROM keyword)?
For example, given the following two tables:
CREATE TABLE Statuses(
id INT PRIMARY KEY,
description VARCHAR(50)
);
INSERT INTO Statuses VALUES (1, 'status');
CREATE TABLE Documents(
id INT PRIMARY KEY,
statusId INT REFERENCES Statuses(id)
);
INSERT INTO Documents VALUES (9, 1);
What is the difference between the below two SQL queries?
From the testing I've done, they return the same result. Do they do the same thing? Are there situations where they will return different result sets?
-- Using implicit join (listing multiple tables)
SELECT s.description
FROM Documents d, Statuses s
WHERE d.statusId = s.id
AND d.id = 9;
-- Using INNER JOIN
SELECT s.description
FROM Documents d
INNER JOIN Statuses s ON d.statusId = s.id
WHERE d.id = 9;
There is no reason to ever use an implicit join (the one with the commas). Yes for inner joins it will return the same results. However, it is subject to inadvertent cross joins especially in complex queries and it is harder for maintenance because the left/right outer join syntax (deprecated in SQL Server, where it doesn't work correctly right now anyway) differs from vendor to vendor. Since you shouldn't mix implicit and explict joins in the same query (you can get wrong results), needing to change something to a left join means rewriting the entire query.
If you do it the first way, people under the age of 30 will probably chuckle at you, but as long as you're doing an inner join, they produce the same result and the optimizer will generate the same execution plan (at least as far as I've ever been able to tell).
This does of course presume that the where clause in the first query is how you would be joining in the second query.
This will probably get closed as a duplicate, btw.
The nice part of the second method is that it helps separates the join condition (on ...) from the filter condition (where ...). This can help make the intent of the query more readable.
The join condition will typically be more descriptive of the structure of the database and the relation between the tables. e.g., the salary table is related to the employee table by the EmployeeID column, and queries involving those two tables will probably always join on that column.
The filter condition is more descriptive of the specific task being performed by the query. If the query is FindRichPeople, the where clause might be "where salaries.Salary > 1000000"... thats describing the task at hand, not the database structure.
Note that the SQL compiler doesn't see it that way... if it decides that it will be faster to cross join and then filter the results, it will cross join and filter the results. It doesn't care what is in the ON clause and whats in the WHERE clause. But, that typically wont happen if the on clause matches a foreign key or joins to a primary key or indexed column. As far as operating correctly, they are identical; as far as writing readable, maintainable code, the second way is probably a little better.
there is no difference as far as I know is the second one with the inner join the new way to write such statements and the first one the old method.
The first one does a Cartesian product on all record within those two tables then filters by the where clause.
The second only joins on records that meet the requirements of your ON clause.
EDIT: As others have indicated, the optimization engine will take care of an attempt on a Cartesian product and will result in the same query more or less.
A bit same. Can help you out.
Left join vs multiple tables in SQL (a)
Left join vs multiple tables in SQL (b)
In the example you've given, the queries are equivalent; if you're using SQL Server, run the query and display the actual exection plan to see what the server's doing internally.

How to avoid Cartesian product in an INNER JOIN query?

I have 6 tables, let's call them a,b,c,d,e,f. Now I want to search all the colums (except the ID columns) of all tables for a certain word, let's say 'Joe'. What I did was, I made INNER JOINS over all the tables and then used LIKE to search the columns.
INNER JOIN
...
ON
INNER JOIN
...
ON.......etc.
WHERE a.firstname
~* 'Joe'
OR a.lastname
~* 'Joe'
OR b.favorite_food
~* 'Joe'
OR c.job
~* 'Joe'.......etc.
The results are correct, I get all the colums I was looking for. But I also get some kind of cartesian product, I get 2 or more lines with almost the same results.
How can i avoid this? I want so have each line only once, since the results should appear on a web search.
UPDATE
I first tried to figure out if the SELECT DISTINCT thing would work by using this statement: pastie.org/970959 But it still gives me a cartesian product. What's wrong with this?
try SELECT DISTINCT?
On what condition do you JOIN this tables? Do you have foreign keys or something?
Maybe you should find that word on each table separately?
What kind of server are you using? Microsoft SQL Server has a full-text index feature (I think others have something like this too) which lets you search for keywords in a much less resource-intensive way.
Also consider using UNION instead of joining the tables.
Without seeing your tables, I can only really assume what's going on here is you have a one-to-many relationship somewhere. You probably want to do everything in a subquery, select out the distinct IDs, then get the data you want to display by ID. Something like:
SELECT a.*, b.*
FROM (SELECT DISTINCT a.ID
FROM ...
INNER JOIN ...
INNER JOIN ...
WHERE ...) x
INNER JOIN a ON x.ID = a.ID
INNER JOIN b ON x.ID = b.ID
A couple of things to note, however:
This is going to be sloooow and you probably want to use full-text search instead (if your RDBMS supports it).
It may be faster to search each table separately rather than to join everything in a Cartesian product first and then filter with ORs.
If your tables are entity type tables, for example a being persons and b being companies, I don't think you can avoid a cartesian product if you search for the results in this way (single query).
You say you want to search all the tables for a certain word, but you probably want to separate the results into the corresponding types. Right? Otherwise a web search would not make much sense.
So if you seach for 'Joe', you want to see persons containing the name 'Joe' and for example the company named 'Joe's gym'. Since you are searching for different entities so you should split the search into different queries.
If you really want to do this in one query, you will have to change your database structure to accommodate. You will need some form of 'search table' containing an entity ID (PK) and entity type, and a list of keywords you want that entity to be found with. For example:
EntityType, EntityID, Keywords
------------------------------
Person, 4, 'Joe', 'Doe'
Company, 12, 'Joe''s Gym', 'Gym'
Something like that?
However it's different when your search returns only one type of entity, say a Person, and you want to return the Persons for which you get a hit on that keyword (in any related table to that Person). Then you will need to select all the fields you want to show and group by them, leaving out the fields in which you are searching. Including them inevitably leads to a cartesian product.
I'm just brainstorming here, by the way. It hope it's helpful.

How to optimize an SQL query with many thousands of WHERE clauses

I have a series of queries against a very mega large database, and I have hundreds-of-thousands of ORs in WHERE clauses. What is the best and easiest way to optimize such SQL queries? I found some articles about creating temporary tables and using joins, but I am unsure. I'm new to serious SQL, and have been cutting and pasting results from one into the next.
SELECT doc_id, language, author, title FROM doc_text WHERE language='fr' OR language='es'
SELECT doc_id, ref_id FROM doc_ref WHERE doc_id=1234567 OR doc_id=1234570 OR doc_id=1234572 OR doc_id=1234596 OR OR OR ...
SELECT ref_id, location_id FROM ref_master WHERE ref_id=098765 OR ref_id=987654 OR ref_id=876543 OR OR OR ...
SELECT location_id, location_display_name FROM location
SELECT doc_id, index_code, FROM doc_index WHERE doc_id=1234567 OR doc_id=1234570 OR doc_id=1234572 OR doc_id=1234596 OR OR OR x100,000
These unoptimized query can take over 24 hours each. Cheers.
I think I just answered my own question... NESTED TABLES!
SELECT doc_text.doc_id, doc_text.language, doc_text.author, doc_text.title, doc_ref.ref_id, ref_master.location_id, location.location_display_name, doc_index.doc_id, doc_index.display_heading
FROM DOC_TEXT, DOC_REF, REF_MASTER, LOCATION, DOC_INDEX
WHERE
    doc_text.language='fr' OR doc_text.language='es'
AND
    doc_text.doc_id=doc_ref.doc_id
AND
    doc_ref.doc_id=ref_master.ref_id
AND
    ref_master.location_id=location.location_id
AND
    doc_text.doc_id=doc_index.doc_id
The easiest way to get that done is this:
Make indexes on the columns that are being filtered on (language, ref_id, doc_id, etc), at least double check their existence. Make them clustered if they are the primary index of the table.
Create helper tables that contain the conditions (add/delete conditions through INSERT/DELETE statements), index them too.
instead of 1000 "OR" components, make an INNER JOIN:
So...
SELECT doc_id, language, author, title
FROM doc_text
WHERE language='fr' OR language='es'
becomes
INSERT language_search (language) VALUES ('fr')
INSERT language_search (language) VALUES ('es')
/* and 50 more */
SELECT dt.doc_id, dt.language, dt.author, dt.title
FROM doc_text dt
INNER JOIN language_search ls ON dt.language = ls.language
Instead of having a lot of conditions on the same field, you can use the in keyword:
SELECT doc_id, ref_id FROM doc_ref WHERE doc_id in (1234567, 1234570, 1234572, 1234596, ...)
This will make the queries shorter, but it's not certain that the performance will differ much. You should make sure that you have indexes on the relevant fields, that usually makes a huge difference for the performance.
Edit
However, it seems that the reason that you have a lot of values to compare is that you are using the result from one query to create the next. This should of course be solved with a join instead of a dynamic query:
select
doc_text.doc_id, doc_text.language, doc_text.author, doc_text.title,
doc_ref.ref_id, ref_master.location_id, location.location_display_name,
doc_index.doc_id, doc_index.display_heading
from DOC_TEXT
inner join DOC_REF on doc_text.doc_id = doc_ref.doc_id
inner join REF_MASTER on doc_ref.doc_id = ref_master.ref_id
inner join LOCATION on ref_master.location_id = location.location_id
inner join DOC_INDEX on doc_text.doc_id = doc_index.doc_id
where
doc_text.language in ('fr', 'es')
I think your real problem is that you are not JOINing tables.
this is a guess, but I'll bet that you run a query and then get all the IDs in your application and then run another query WHERE all the rows match from the previous query. You would greatly improve performance by writing a query with a join:
SELECT
*
FROM YourTableA a
INNER JOIN YourTableB b ON a.ID=b.ID
WHERE a. .....
then process the single result set in your application.