How to avoid Cartesian product in an INNER JOIN query? - sql

I have 6 tables; let's call them a, b, c, d, e, f. I want to search all the columns (except the ID columns) of all tables for a certain word, let's say 'Joe'. What I did was INNER JOIN all the tables and then use pattern matching to search the columns:
INNER JOIN
...
ON
INNER JOIN
...
ON.......etc.
WHERE a.firstname ~* 'Joe'
OR a.lastname ~* 'Joe'
OR b.favorite_food ~* 'Joe'
OR c.job ~* 'Joe'.......etc.
The results are correct: I get all the columns I was looking for. But I also get some kind of Cartesian product, with 2 or more lines of almost the same results.
How can I avoid this? I want to have each line only once, since the results should appear in a web search.
UPDATE
I first tried to figure out if the SELECT DISTINCT thing would work by using this statement: pastie.org/970959 But it still gives me a cartesian product. What's wrong with this?

Try SELECT DISTINCT?

On what condition do you JOIN these tables? Do you have foreign keys or something?
Maybe you should find that word on each table separately?

What kind of server are you using? Microsoft SQL Server has a full-text index feature (I think others have something like this too) which lets you search for keywords in a much less resource-intensive way.
Also consider using UNION instead of joining the tables.
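A minimal sketch of the UNION idea, run through Python's sqlite3 module so it is easy to try. The tables a and b, their columns, and the a_id link are assumptions (the real schema was never posted), and SQLite's case-insensitive LIKE stands in for the ~* operator from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: b rows point back to a person in a via a_id.
    CREATE TABLE a (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id),
                    favorite_food TEXT);
    INSERT INTO a VALUES (1, 'Joe', 'Doe'), (2, 'Ann', 'Lee');
    INSERT INTO b VALUES (1, 1, 'Sloppy Joe'), (2, 1, 'Joe''s chili'), (3, 2, 'Pizza');
""")

# Each branch searches one table and returns only the person ID.
# UNION (unlike UNION ALL) removes duplicate IDs across the branches.
matching_ids = [row[0] for row in conn.execute("""
    SELECT id FROM a WHERE firstname LIKE :p OR lastname LIKE :p
    UNION
    SELECT a_id FROM b WHERE favorite_food LIKE :p
    ORDER BY 1
""", {"p": "%joe%"})]

print(matching_ids)
```

Person 1 matches in both tables (and twice in b) but is returned only once; the display columns can then be fetched by ID.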

Without seeing your tables, I can only really assume what's going on here is you have a one-to-many relationship somewhere. You probably want to do everything in a subquery, select out the distinct IDs, then get the data you want to display by ID. Something like:
SELECT a.*, b.*
FROM (SELECT DISTINCT a.ID
FROM ...
INNER JOIN ...
INNER JOIN ...
WHERE ...) x
INNER JOIN a ON x.ID = a.ID
INNER JOIN b ON x.ID = b.ID
A couple of things to note, however:
This is going to be sloooow and you probably want to use full-text search instead (if your RDBMS supports it).
It may be faster to search each table separately rather than to join everything in a Cartesian product first and then filter with ORs.
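Here is a small runnable sketch of the "distinct IDs first" idea, using Python's sqlite3 module. The two-table schema is made up for illustration, and LIKE replaces ~* because SQLite has no such operator. Note that the outer query here joins back only to a; joining the "many" side again would bring the duplicates back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical one-to-many schema: each person in a has several rows in b.
    CREATE TABLE a (id INTEGER PRIMARY KEY, firstname TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER, favorite_food TEXT);
    INSERT INTO a VALUES (1, 'Joe'), (2, 'Ann');
    INSERT INTO b VALUES (1, 1, 'Pizza'), (2, 1, 'Pasta'), (3, 2, 'Sloppy Joe');
""")

# Plain join: Joe has two rows in b, so he appears twice in the result.
joined = conn.execute("""
    SELECT a.id, a.firstname
    FROM a INNER JOIN b ON b.a_id = a.id
    WHERE a.firstname LIKE '%joe%' OR b.favorite_food LIKE '%joe%'
    ORDER BY a.id
""").fetchall()

# Collect the distinct matching IDs in a subquery, then fetch display data by ID.
deduplicated = conn.execute("""
    SELECT a.id, a.firstname
    FROM (SELECT DISTINCT a.id AS id
          FROM a INNER JOIN b ON b.a_id = a.id
          WHERE a.firstname LIKE '%joe%' OR b.favorite_food LIKE '%joe%') x
    INNER JOIN a ON a.id = x.id
    ORDER BY a.id
""").fetchall()

print(len(joined), len(deduplicated))
```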

If your tables are entity type tables, for example a being persons and b being companies, I don't think you can avoid a cartesian product if you search for the results in this way (single query).
You say you want to search all the tables for a certain word, but you probably want to separate the results into the corresponding types. Right? Otherwise a web search would not make much sense.
So if you search for 'Joe', you want to see persons whose name contains 'Joe' and, for example, the company named 'Joe's gym'. Since you are searching for different entities, you should split the search into different queries.
If you really want to do this in one query, you will have to change your database structure to accommodate. You will need some form of 'search table' containing an entity ID (PK) and entity type, and a list of keywords you want that entity to be found with. For example:
EntityType, EntityID, Keywords
------------------------------
Person, 4, 'Joe', 'Doe'
Company, 12, 'Joe''s Gym', 'Gym'
Something like that?
However it's different when your search returns only one type of entity, say a Person, and you want to return the Persons for which you get a hit on that keyword (in any related table to that Person). Then you will need to select all the fields you want to show and group by them, leaving out the fields in which you are searching. Including them inevitably leads to a cartesian product.
I'm just brainstorming here, by the way. I hope it's helpful.
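To make the "search table" idea a bit more concrete, here is a rough sketch using Python's sqlite3 module. The table name search_index and its columns are hypothetical, and it stores one keyword per row rather than a list per entity, which is a swapped-in design choice that keeps the lookup a simple single-table query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical search table: one row per (entity, keyword) pair.
    CREATE TABLE search_index (
        entity_type TEXT    NOT NULL,
        entity_id   INTEGER NOT NULL,
        keyword     TEXT    NOT NULL,
        PRIMARY KEY (entity_type, entity_id, keyword)
    );
    INSERT INTO search_index VALUES
        ('Person',  4,  'Joe'),
        ('Person',  4,  'Doe'),
        ('Company', 12, 'Joe''s Gym'),
        ('Company', 12, 'Gym');
""")

# One query over one table; DISTINCT collapses entities that match on
# several keywords, and entity_type says which table to load each hit from.
hits = conn.execute("""
    SELECT DISTINCT entity_type, entity_id
    FROM search_index
    WHERE keyword LIKE ?
    ORDER BY entity_type, entity_id
""", ("%joe%",)).fetchall()

print(hits)
```

The application would then load each hit from its own entity table, which keeps results separated by type as described above.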

Related

Does a SQL JOIN's ON imply WHERE?

When joining a very large table to a small table, I try to be as specific as possible in my join query. Am I going overboard, however?
Let's say I have SmallTable with one column and just three values: "Peter", "Paul", and "Mary". I'll end up joining a bunch of huge tables to this. Should I put a WHERE statement in my join in order to narrow the join's select statement? Or does a join imply the where condition?
SELECT
Username,
click.TotalClicks,
otherjoin.SneezePercent,
anotherjoin.Coats
FROM
SmallTable
LEFT JOIN (
SELECT
Person,
SUM(Clicks) AS TotalClicks
FROM
HugeTable
WHERE
Person LIKE 'Peter' OR Person LIKE 'Paul' OR Person LIKE 'Mary'
) click
ON click.Person = Username
LEFT JOIN (
...
I think the version you currently have is the optimal one, because your WHERE restriction will save the database from aggregating over names whose results you ultimately will be discarding anyway in the join, in the outer query. Your current use of LIKE might preclude an index, but the database also might be able to use an index in that WHERE clause, for even better performance.
The alternative to this, namely relying on the join with the small table, would filter out names you don't want, but by then the aggregation would have already been done on the entire large table.
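A toy comparison of the two variants, run through Python's sqlite3 module. The table contents are invented, and GROUP BY Person is an assumption because the query in the question is cut off before it. Both variants return the same rows; the difference is only in how many names get aggregated before the join throws the extras away. Whether a given optimizer narrows this down on its own is something to check in its execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SmallTable (Username TEXT PRIMARY KEY);
    CREATE TABLE HugeTable (Person TEXT, Clicks INTEGER);
    INSERT INTO SmallTable VALUES ('Peter'), ('Paul'), ('Mary');
    INSERT INTO HugeTable VALUES
        ('Peter', 3), ('Peter', 4), ('Paul', 10), ('Zed', 99), ('Zed', 1);
""")

QUERY = """
    SELECT s.Username, click.TotalClicks
    FROM SmallTable s
    LEFT JOIN (SELECT Person, SUM(Clicks) AS TotalClicks
               FROM HugeTable
               {where}
               GROUP BY Person) click ON click.Person = s.Username
    ORDER BY s.Username
"""

# Filter inside the derived table: only the three wanted names are aggregated.
filtered_first = conn.execute(
    QUERY.format(where="WHERE Person IN ('Peter', 'Paul', 'Mary')")
).fetchall()

# No filter: every name (including 'Zed') is aggregated, then the join drops it.
joined_only = conn.execute(QUERY.format(where="")).fetchall()

print(filtered_first)
```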

Short Circuit Intersect / Inner Join feature in SQL on a condition, not field (PostgreSQL)

I am looking for the feature to join two tables on one condition (not field) in a "short-circuit way", provided the join operation is absolutely expensive (a.field::VARCHAR is contained within extensive b.field::TEXT).
I don't need duplicates; it's more of a "get the rows of words.'word' that are contained in any books.'content' field". If the first book contains the word, skip checking whether the other 2000-page books also contain it.
If I am not wrong, neither INNER JOIN nor INTERSECT is useful in my situation:
For INTERSECT, I cannot intersect on a concrete condition like CONTAINS, so I need to retrieve all rows from both places, do the cartesian product and then filter with WHERE
For INNER JOIN, as it returns duplicates, I infer the logic is not short-circuit, and it will check whether my word is contained in each of the books' entries
IN and EXISTS also seem not to work on custom conditions
Is there any way to do what I need in an optimal, performant way?
EXPLAIN ANALYZE
SELECT * FROM words w
WHERE EXISTS ( SELECT 1 FROM books b
WHERE POSITION( w.word IN b.book) > 0
);
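For what it's worth, EXISTS does accept an arbitrary condition, and it returns each outer row at most once. A small sketch of the difference from a plain join, using Python's sqlite3 module with invented sample data; instr() stands in for POSITION(... IN ...), which SQLite does not have. Whether the planner really stops at the first matching book is something the EXPLAIN ANALYZE output above would have to confirm on PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (word TEXT);
    CREATE TABLE books (book TEXT);
    INSERT INTO words VALUES ('whale'), ('ship'), ('robot');
    INSERT INTO books VALUES
        ('the whale and the ship'), ('a ship at sea'), ('whale songs');
""")

# Join on the condition: one output row per (word, book) pair that matches.
join_rows = conn.execute("""
    SELECT w.word
    FROM words w INNER JOIN books b ON instr(b.book, w.word) > 0
    ORDER BY w.word
""").fetchall()

# EXISTS: a word qualifies as soon as any book contains it, and is listed once.
exists_rows = conn.execute("""
    SELECT w.word
    FROM words w
    WHERE EXISTS (SELECT 1 FROM books b WHERE instr(b.book, w.word) > 0)
    ORDER BY w.word
""").fetchall()

print(len(join_rows), len(exists_rows))
```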

SQL Server: select over several tables / conditions

Okay, I'm relatively new to the more advanced uses of SQL Server.
I have several tables that I need to gather information from, and several of these tables link to other tables where I need specific information. As a result, I just want one row with all the information, preferably named with aliases.
For example:
Tab_Transcoders:
ID, VideoCamID, InputStreamID, OutputStreamID.
Here InputStreamID links to another table from which I need the row with the matching ID; that row in turn contains other IDs (e.g. StreamType_ID, which belongs to a third table containing ID_StreamType, Description, etc.)
Same with OutputStreamID, same with VideoCamID.
In the end, I need a row containing for example:
ID, VideoCamID, InputStreamID, InputStreamType, InputStreamTypeDesc,
OutputStreamID, OutputStreamType, OutputStreamDesc, VideoCamID, etc. etc. etc.
It is important for me that I can set aliases, as for example InputStreamID & OutputStreamID link to the same table where all my streams are listed (with IPs, descriptions...)
I can accomplish this by doing something like 100 SELECTs and subselects, but I don't think that's an appropriate way.
I have read about things like CURSOR, UNION, FETCH, JOIN, etc., but I don't know which one I have to use for my purpose.
I think you want something like the following....
-- IS is a reserved word in SQL Server, so the Streams aliases are ins / outs
Select
    t.ID,
    t.VideoCamID,
    i.InputStreamID,
    ins.StreamType as InputStreamType,
    ins.StreamDesc as InputStreamDesc,
    o.OutputStreamID,
    outs.StreamType as OutputStreamType,
    outs.StreamDesc as OutputStreamDesc,
    v.VideoCamID
from
    Tab_Transcoders t
    inner join InputStreams i on i.InputStreamID=t.InputStreamId
    inner join Streams ins on ins.StreamId=i.StreamId
    inner join OutputStreams o on o.OutputStreamId=t.OutputStreamId
    inner join Streams outs on outs.StreamID=o.StreamId
    inner join VideoCams v on v.VideoCamId=t.VideoCamID
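The key trick is joining the same table twice under two different aliases. A stripped-down, runnable sketch using Python's sqlite3 module; the Streams columns and the sample rows are guesses at the schema, and the aliases ins and outs are arbitrary (IS itself is a reserved word, so it cannot be used as an alias).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: both stream columns point at the same Streams table.
    CREATE TABLE Streams (StreamID INTEGER PRIMARY KEY,
                          StreamType TEXT, StreamDesc TEXT);
    CREATE TABLE Tab_Transcoders (ID INTEGER PRIMARY KEY,
                                  InputStreamID INTEGER, OutputStreamID INTEGER);
    INSERT INTO Streams VALUES (1, 'RTSP', 'camera feed'), (2, 'HLS', 'web player');
    INSERT INTO Tab_Transcoders VALUES (10, 1, 2);
""")

# Streams is joined twice: once for the input side, once for the output side.
# Column aliases keep the two sets of columns apart in the single result row.
row = conn.execute("""
    SELECT t.ID,
           ins.StreamType  AS InputStreamType,
           ins.StreamDesc  AS InputStreamDesc,
           outs.StreamType AS OutputStreamType,
           outs.StreamDesc AS OutputStreamDesc
    FROM Tab_Transcoders t
    INNER JOIN Streams ins  ON ins.StreamID  = t.InputStreamID
    INNER JOIN Streams outs ON outs.StreamID = t.OutputStreamID
""").fetchone()

print(row)
```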
If there is a defined relationship between your tables, then use a JOIN.
E.g. Customer and Order:
each order will have a customer ID.
Select o.ID, o.Quantity, o.CustomerId, c.FullName, c.Address
From Orders o
Join
Customer c
On
o.CustomerId = c.CustomerId
Start by getting data from two tables using the join; once that works as required, add the next table to the join.
Read about SQL JOINs. It is fairly simple.
I would also recommend reading some of the articles about CTEs, aka Common Table Expressions.
Refer to http://msdn.microsoft.com/en-us/library/ms190766%28v=sql.105%29.aspx.
Apart from this never use subqueries. Try to use inner join / any other join if possible.

Why is selecting specified columns, and all, wrong in Oracle SQL?

Say I have a select statement that goes..
select * from animals
That gives a query result of all the columns in the table.
Now suppose the 42nd column of the table animals is is_parent, and I want to return it in my results, just after gender, so I can see it more easily. But I also want all the other columns.
select is_parent, * from animals
This returns ORA-00936: missing expression.
The same statement will work fine in Sybase, and I know that you need to add a table alias to the animals table to get it to work ( select is_parent, ani.* from animals ani), but why must Oracle need a table alias to be able to work out the select?
Actually, it's easy to solve the original problem. You just have to qualify the *.
select is_parent, animals.* from animals;
should work just fine. Aliases for the table names also work.
There is no merit in doing this in production code. We should explicitly name the columns we want rather than using the SELECT * construct.
As for ad hoc querying, get yourself an IDE - SQL Developer, TOAD, PL/SQL Developer, etc - which allows us to manipulate queries and result sets without needing extensions to SQL.
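To see what the qualified star from this answer actually returns, here is a quick sketch using Python's sqlite3 module. It does not reproduce ORA-00936 (SQLite is more permissive than Oracle here); it only shows the shape of the result. The four-column animals table is a stand-in for the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in table; the real one has at least 42 columns.
    CREATE TABLE animals (id INTEGER PRIMARY KEY, name TEXT,
                          gender TEXT, is_parent INTEGER);
    INSERT INTO animals VALUES (1, 'Rex', 'M', 1);
""")

# The chosen column comes first, followed by every column of the table,
# so is_parent appears twice in the result set.
cursor = conn.execute("SELECT is_parent, animals.* FROM animals")
column_names = [description[0] for description in cursor.description]

print(column_names)
```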
Good question, I've often wondered this myself but have then accepted it as one of those things...
Similar problem is this:
sql>select geometrie.SDO_GTYPE from ngg_basiscomponent
ORA-00904: "GEOMETRIE"."SDO_GTYPE": invalid identifier
where geometrie is a column of type mdsys.sdo_geometry.
Add an alias and the thing works.
sql>select a.geometrie.SDO_GTYPE from ngg_basiscomponent a;
Lots of good answers so far on why select * shouldn't be used and they're all perfectly correct. However, I don't think any of them answer the original question on why the particular syntax fails.
Sadly, I think the reason is... "because it doesn't".
I don't think it's anything to do with single-table vs. multi-table queries:
This works fine:
select *
from
person p inner join user u on u.person_id = p.person_id
But this fails:
select p.person_id, *
from
person p inner join user u on u.person_id = p.person_id
While this works:
select p.person_id, p.*, u.*
from
person p inner join user u on u.person_id = p.person_id
It might be some historical compatibility thing with 20-year old legacy code.
Another for the "but why!!!" bucket, along with why can't you group by an alias?
The use case for the alias.* format is as follows
select parent.*, child.col
from parent join child on parent.parent_id = child.parent_id
That is, selecting all the columns from one table in a join, plus (optionally) one or more columns from other tables.
The fact that you can use it to select the same column twice is just a side-effect. There is no real point to selecting the same column twice and I don't think laziness is a real justification.
Select * in the real world is only dangerous when referring to columns by index number after retrieval rather than by name. The bigger problem is inefficiency when not all columns are required in the result set (network traffic, CPU and memory load).
Of course, if you're adding columns from other tables (as is the case in this example), it can be dangerous, as these tables may over time gain columns with matching names; select *, x would in that case fail if a column x is added to a table that previously didn't have it.
why must Oracle need a table alias to be able to work out the select
Teradata requires the same. As both are quite old (maybe better to call them mature :-) DBMSes, this might be for historical reasons.
My usual explanation is: an unqualified * means everything/all columns and the parser/optimizer is simply confused because you request more than everything.

SELECT with ORs including table joins

I've got a database with three tables: Books (with book details, PK is CopyID), Keywords (list of keywords, PK is ID) and KeywordsLink which is the many-many link table between Books and Keywords with the fields ID, BookID and KeywordID.
I'm trying to make an advanced search form in my app where you can search on various criteria. At the moment I have it working with Title, Author and Publisher (all from the Book table). It produces SQL like:
SELECT * FROM Books WHERE Title Like '%Software%' OR Author LIKE '%Spolsky%';
I want to extend this search to also search using tags - basically to add another OR clause to search the tags. I've tried to do this by doing the following
SELECT *
FROM Books, Keywords, Keywordslink
WHERE Title LIKE '%Joel%'
OR (Name LIKE '%good%' AND BookID=Books.CopyID AND KeywordID=Keywords.ID)
I thought using the brackets might separate the 2nd part into its own kinda clause, so the join was only evaluated in that part - but it doesn't seem to be so. All it gives me is a long list of multiple copies of the one book that satisfies the Title LIKE '%Joel%' bit.
Is there a way of doing this using pure SQL, or would I have to use two SQL statements and combine them in my app (removing duplicates in the process)?
I'm using MySQL at the moment if that matters, but the app uses ODBC and I'm hoping to make it DB agnostic (might even use SQLite eventually or have it so the user can choose what DB to use).
You need to join the 3 tables together, which gives you a tabular result set. You can then check any columns you like, and make sure you get distinct results (i.e. no duplicates).
Like this (column names taken from your description):
select distinct b.*
from books b
left join keywordslink kl on kl.bookid = b.copyid
left join keywords k on kl.keywordid = k.id
where b.title like '%assd%'
or k.name like '%asdsad%'
You should also try to avoid starting your LIKE values with a percent sign (%), as a leading wildcard generally means the database can't use an index on that column and has to perform a full (and slow) table scan. Dropping the leading % turns your query into a "starts with" query, which can use an index.
Maybe consider full-text search too, if your database offers it.
What you've done here is made a cartesian result set by having the tables joined with the commas but not having any join criteria. Switch your statements to use outer join statements and that should allow you to reference the keywords. I don't know your schema, but maybe something like this would work:
SELECT
*
FROM
Books
LEFT OUTER JOIN KeywordsLink ON KeywordsLink.BookID = Books.CopyID
LEFT OUTER JOIN Keywords ON Keywords.ID = KeywordsLink.KeywordID
WHERE Books.Title LIKE '%JOEL%'
OR Keywords.Name LIKE '%GOOD%'
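To make the difference visible, here is a small runnable comparison using Python's sqlite3 module. The table and column names follow the question; the sample rows are invented. With only 2 books, 2 keywords and 2 link rows, the comma join already returns the 'Joel' book four times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Books (CopyID INTEGER PRIMARY KEY, Title TEXT, Author TEXT);
    CREATE TABLE Keywords (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE KeywordsLink (ID INTEGER PRIMARY KEY, BookID INTEGER, KeywordID INTEGER);
    INSERT INTO Books VALUES (1, 'Joel on Software', 'Spolsky'),
                             (2, 'Code Complete', 'McConnell');
    INSERT INTO Keywords VALUES (1, 'good'), (2, 'classic');
    INSERT INTO KeywordsLink VALUES (1, 2, 1), (2, 2, 2);
""")

# Comma join from the question: the Title branch of the OR has no join
# condition, so the matching book pairs with every Keywords x KeywordsLink row.
comma_join_count = conn.execute("""
    SELECT COUNT(*)
    FROM Books, Keywords, KeywordsLink
    WHERE Title LIKE '%Joel%'
       OR (Name LIKE '%good%' AND BookID = Books.CopyID AND KeywordID = Keywords.ID)
""").fetchone()[0]

# Outer joins with explicit ON conditions, plus DISTINCT on the book columns.
left_join_rows = conn.execute("""
    SELECT DISTINCT Books.CopyID, Books.Title
    FROM Books
    LEFT OUTER JOIN KeywordsLink ON KeywordsLink.BookID = Books.CopyID
    LEFT OUTER JOIN Keywords ON Keywords.ID = KeywordsLink.KeywordID
    WHERE Books.Title LIKE '%Joel%' OR Keywords.Name LIKE '%good%'
    ORDER BY Books.CopyID
""").fetchall()

print(comma_join_count, left_join_rows)
```

The LEFT joins matter here: book 1 has no keywords at all, and an INNER JOIN would drop it before the Title condition was ever checked.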
Use UNION.
(SELECT Books.* FROM <first kind of search>)
UNION
(SELECT Books.* FROM <second kind of search>)
The point is that you could write two (or more) simple and efficient queries instead of one complicated query that tries to do everything at once.
If the number of resulting rows is low, then UNION will have very little overhead (and you can use the faster UNION ALL if you don't have duplicates or don't care about them).
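One way the two placeholders might be filled in, as a runnable sketch with Python's sqlite3 module. Table and column names follow the question; the sample data is made up so that one book matches both searches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Books (CopyID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Keywords (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE KeywordsLink (ID INTEGER PRIMARY KEY, BookID INTEGER, KeywordID INTEGER);
    INSERT INTO Books VALUES (1, 'Joel on Software'), (2, 'Code Complete'), (3, 'Dull Book');
    INSERT INTO Keywords VALUES (1, 'good'), (2, 'classic');
    INSERT INTO KeywordsLink VALUES (1, 1, 1), (2, 2, 1), (3, 3, 2);
""")

TITLE_SEARCH = "SELECT CopyID, Title FROM Books WHERE Title LIKE '%Joel%'"
KEYWORD_SEARCH = """
    SELECT b.CopyID, b.Title
    FROM Books b
    JOIN KeywordsLink kl ON kl.BookID = b.CopyID
    JOIN Keywords k ON k.ID = kl.KeywordID
    WHERE k.Name LIKE '%good%'
"""

# UNION removes the book that is found by both searches.
union_rows = conn.execute(
    f"{TITLE_SEARCH} UNION {KEYWORD_SEARCH} ORDER BY 1"
).fetchall()

# UNION ALL skips the duplicate check and keeps that book twice.
union_all_rows = conn.execute(
    f"{TITLE_SEARCH} UNION ALL {KEYWORD_SEARCH} ORDER BY 1"
).fetchall()

print(union_rows, len(union_all_rows))
```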
SELECT * FROM books WHERE title LIKE '%Joel%' OR copyid IN
(SELECT bookid FROM keywordslink WHERE keywordid IN
(SELECT id FROM keywords WHERE name LIKE '%good%'))
Beware that older versions of MySQL didn't like subselects. I think they've fixed that.
You must also limit the product of the join by specifying something like
Books.FK1 = Keywords.FK1 and
Books.FK2 = Keywordslink.FK2 and
Keywords.FK3 = Keywordslink.FK3
But I don't know your exact data model, so your solution may be slightly different.
I'm not aware of any way to accomplish a "conditional join" in SQL. I think you'll be best served with executing the two statements separately and combining them in the application. This approach is also more likely to stay DB-agnostic.
It looks like Neil Barnwell has covered the answer that I would have given, but I'll add one thing...
Books can have more than one author. If your data model is really designed as your query implies you might want to consider changing it to accommodate that fact.