Are joins the proper way to do cross table queries? - sql

I have a few tables in their third normal form and I need to do some cross table queries to get the information I need.
I looked at joins but it seems like it will create a new table. Is this the proper way to perform such queries? Or should I just do nested queries ? I guess it might make sense if I have to do these queries alot? I'm really not sure how well optimize these operations are. I'm using the sequelize ORM and I'm not sure I see any clear solution.

It seems to me you are asking about joins vs subqueries. These are to some extent different. But let's start with a couple of points.
A join creates a new relvar, not a new table. A relvar is a variable standing in for the relation output by the join operation. It is transient (as opposed to a view which would be persistent).
Joins and subqueries are not always perfect substitutes. Sometimes you will need both.
Your query output is also a relvar.
The above being said, generally where possible I think joins are preferable. The major reason is that a SQL query that can be written using the structure below is far easier (as you master the language) to both understand and debug than most alternatives, and also subqueries in column lists necessarily perform badly:
SELECT [column_list]
FROM [initial_table]
[join list]
WHERE [filters]
GROUP BY [grouping list]
HAVING [post-aggregation filters]
LIMIT [limit and offset]
If your query fits the above structure then you can usually expect that specific kinds of problems will occur in logic in specific parts of the query. On the other hand, with subqueries, you have to check these independently.

Related

SQL - Join Aggregated query or Aggregate/Sum after join?

I have a hard time figuring out what is best, or if there is difference at all,
however i have not found any material to help my understanding of this,
so i will ask this question, if not for me, then for others who might end up in the same situation.
Aggregating a sub-query before or after a join, in my specific situation the sub-query is rather slow due to fragmented data and bad normalization procedure,
I got a main query that is highly complex and a sub-query that is built from 3 small queries that is combined using union (will remove duplicate records)
i only need a single value from this sub-query (for each line), so at some point i will end up summing this value, (together with grouping the necessary control data with it so i can join)
what will have the greatest impact?
To sum sub-query before the join and then join with the aggregated version
To leave the data raw, and then sum the value together with the rest of the main query
remember there are thousands of records that will be summed for each line,
and the data is not native but built, and therefore may reside in memory,
(that is just a guess from the query optimizers perspective)
Usually I keep the group-by inside the subquery (referred as "inline view" in Oracle lingo).
This way the query is much more simple and clear.
Also I believe the execution plan is more efficient, because the data set to be aggregated is smaller and the resulting set of join keys is also smaller.
This is not a definitive answer though. If the row source that you are joining to the inline view has few matching rows, you may find that a early join reduces the aggregation effort.
The right anwer is: benchmark the queries for your particular data set.
I think in such a general way there is no right or wrong way to do it. The performance from a query like the one that you describe depends on many different factors:
what kind of join are you actually doing (and what algorithm is used in the background)
is the data to be joined small enough to fit into the memory of the machine joining it?
what query optimizations are you using, i.e. what DBMS (Oracle, MsSQL, MySQL, ...)
...
For your case I simply suggest benchmarking. I'm sorry if that does not seem like a satisfactory answer, but it is the way to go in many performance questions...
So set up a simple test using both your approaches and some test data, then pick whatever is faster.

Performance concerns with Joining to ad-hock sub query

When creating reports I find myself frequently joining to ad-hock sub queries like this:
SELECT * FROM Customer c
JOIN (SELECT CustomerId,ContractId FROM Contract WHERE StartDate > GetDate()) co ON co.CustomerId = c.CustomerId
Obviously this is just a qucik demo sql, but it illustrates the point.
I am wondering if there are some performance concerns with these types of queries. Are there other good alternatives? Typically the sub queries are more complicated with aggregates etc..
I know in this case it would make more sense to just join to the contact table directly, but I am just trying to illustrate the general idea without having to provide a convoluted example.
Any feedback would be appreciated :-)
Is this a common practice?
Not to be pedantic, but that's not technically an ad-hoc query, which would be defined as a query created by the end-users of the system. Instead, your example is really just a sub-query.
To address your question, if you find yourself joining to the same sub-query more than once, you should create a view, and then join to that view instead. This would allow you to simplify your code, and make any future changes to the sub-query in a single location.
If you only use these sub-queries in a single report, and there's nothing common between them across reports that could be placed in a view, then it's really just another way of writing your query. There's nothing necessarily bad about it, and it's not uncommon, but you might consider alternate approaches, such as replacing your subquery with additional joins - this may offer you a better execution plan, but that will vary on several other factors, including which RBDMS this is for.

SQL: multiple queries vs joins (specific case)

This question seems to have been asked a lot and the answer seems to be "it depends on the details". so I am asking for my specific case: Is it better for me to have multiple queries or use joins?
The details are as follows:
"products" table -- probably around 2000 rows, 15 or so columns
"tags" table -- probably around 10 rows, 3 columns
"types" table -- probably around 10 rows, 3 columns
I need the "tags" and "types" table to get the tag/type-id that is in the product table.
My gut says that if i join the tables i end up searching a much much larger set so its better to do multiple queries, but i am not really sure...
Thoughts?
No, join will probably outperform multiple queries. Your tables are extremely small.
Ask yourself what extra work would be involved in doing multiple queries... I don't know what you need this data for, but I assume you would, at some point, need to correlate the results - match Tags and Types to Products, wouldn't you? If you don't do that with a join, you just have to do it elsewhere with some other mechanism.
Further, your conception of this overlooks the fact that databases are designed for join scenarios. If you perform three isolated queries, the database has no opportunity to optimize its querying behavior across the results you're looking for. If you do it in one query with a join, it does have that opportunity.
Leave the problem of producing a resultset of ~2000 * 10 * 10 records and then filtering it up to the database, in my opinion - that's what it's good at. :)
The amount of data is too small to demonstrate one over the other, but multiple separate queries will use more with respect to transferring over the wire than a single query. There is packet overhead, and separate data sets risks difference if the data set changes between queries if not in the same transaction.
JOINs specifically might not be necessary, EXISTS or IN can be used if the supporting tables don't expose columns in the resultset. A JOIN between tables that are parent & child, and there can be more than one child to a parent will inflate the rows searched -- not necessarily the rows returned.
Assuming that everything has indexes on the primary keys (you should be doing that), then joins will be very efficient. The only case where joins would be worse is if you had some kind of external caching of query results (as some ORMs will do for you), your products table was much bigger, and you were querying at a sufficient rate to keep the results of the two smaller queries (but not the third) in cache. In that scenario multiple queries becomes faster because you're only making one of the three queries. But the difference is going to be hard to measure.
If the database is not on localhost but accessed over a network it's better to send one request, let the database do the work and retrieve the data at once. This will give you less network delay. So joins are preferred.

Methods of visualizing joins

Just wondering if anyone has any tricks (or tools) they use to visualize joins. You know, you write the perfect query, hit run, and after it's been running for 20 minutes, you realize you've probably created a cartesian join.
I sometimes have difficulty visualizing what's going to happen when I add another join statement and wondered if folks have different techniques they use when trying to put together lots of joins.
Always keep the end in mind.
Ascertain which are the columns you need
Try to figure out the minimum number of tables which will be needed to do it.
Write your FROM part with the table which will give max number of columns. eg FROM Teams T
Add each join one by one on a new line. Ensure whether you'll need OUTER, INNER, LEFT, RIGHT JOIN at each step.
Usually works for me. Keep in mind that it is Structured query language. Always break your query into logical lines and it's much easier.
Every join combines two resultsets into one. Each may be from a single database table or a temporary resultset which is the result of previous join(s) or of a subquery.
Always know the order that joins are processed, and, for each join, know the nature of the two temporary result sets that you are joining together. Know what logical entity each row in that resultset represents, and what attributes in that resultset uniquely identify that entity. If your join is intended to always join one row to one row, these key attributes are the ones you need to use (in join conditions) to implement the join. If your join is intended to generate some kind of cartesian product, then it is critical to understand the above to understand how the join conditions (whatever they are) will affect the cardinality of the new joined resultset.
Try to be consistent in the use of outer join directions. I try to always use Left Joins when I need an outer join, as I "think" of each join as "joining" the new table (to the right) to whatever I have already joined together (on the left) of the Left Join statement...
Run an explain plan.
These are always hierarchical trees (to do this, first I must do that). Many tools exist to make these plans into graphical trees, some in SQL browsers, (e.g, Oracle SQLDeveloper, whatever SQlServer's GUI client is called). If you don't have a tool, most plan text ouput includes a "depth" column, which you can use to indent the line.
What you want to look for is the cost of each row. (Note that for Oracle, though, higher costs can mean less time, if it allows Oracle to do a hash join rather than nested loops, and if the final result set has high cardinality (many, many rows).)
I have never found a better tool than thinking it through and using my own mind.
If the query is so complicated that you cannot do that, you may want to use either CTE's, views, or some other carefully organized subqueries to break it into logical pieces so you can easily understand and visualize each piece even if you cannot manage the whole.
Also, if your concern is effeciency, then SQL Server Management Studio 2005 or later lets you get estimated query execution plans without actually executing the query. This can give you very good ideas of where problems lie, if you are using MS SQL Server.

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast at their problem space. Joining tables is one of those and you'll see more of a performance degradation from the second query then the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing is the only thing that will tell you that. MySQL with moderately large tables seems to be susceptable to this, IME. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that in one database hit you have all the data you need having one single SQL statement would be better performance 99% of the time. Not sure if the connections is being creating dynamically in this case or not but if so doing so is expensive. Even if the process if reusing existing connections the DBMS is not getting optimize the queries be best way and not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query that had so a lot of joins in it and it would take the SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically fixed the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would lead in more performance as the SQL server (Which sometimes doesn't share the same location) just needs to handle one request, if you would use multiple SQL queries then you introduce a lot of overhead:
Executing more CPU instructions,
sending a second query to the server,
create a second thread on the server,
execute possible more CPU instructions
on the sever, destroy a second thread
on the server, send the second results
back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
You should always try to minimize the number of query to the database when you can. Your example is perfect for only 1 query. This way you will be able later to cache more easily or to handle more request in same time because instead of always using 2-3 query that require a connexion, you will have only 1 each time.
There are many cases that will require different solutions and it isn't possible to explain all together.
Join scans both the tables and loops to match the first table record in second table. Simple select query will work faster in many cases as It only take cares for the primary/unique key(if exists) to search the data internally.