I would like to know if there's a real performance gain between these two options:
Option 1:
I run one SQL query with a join to select all Users and their Ranks.
Option 2:
I run one SQL query to select all Users.
I fetch all the users and run another SQL query for each one to get the Ranks of that User.
In code, option two is easier for me to implement, but only because of the way I designed my persistence layer.
So I would like to know what the impact on performance is. Beyond what limit should I consider taking Option 1 instead of Option 2?
Generally speaking, the DB server is always faster at joining than application code. Remember you will have to do an extra query with a network round trip for each join. However, if your first result set is small and your indexes are well tuned, this model can work fine.
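To make the two options concrete, here is a minimal sketch using Python's built-in sqlite3 module. The users/ranks schema and the function names are assumptions made up for illustration; the question does not show its real tables. Both functions return the same rows, but the second one issues one extra query per user.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ranks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO ranks VALUES (1, 1, 'admin'), (2, 1, 'editor'), (3, 2, 'reader');
""")


def ranks_with_join(conn):
    """Option 1: a single query; the database performs the join."""
    rows = conn.execute(
        "SELECT u.name, r.title FROM users u "
        "JOIN ranks r ON r.user_id = u.id ORDER BY u.id, r.id"
    )
    return list(rows)


def ranks_with_loop(conn):
    """Option 2: one query for the users, then one more query per user."""
    result = []
    users = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    for user_id, name in users:
        titles = conn.execute(
            "SELECT title FROM ranks WHERE user_id = ? ORDER BY id", (user_id,)
        )
        result.extend((name, title) for (title,) in titles)
    return result
```

With N users, Option 2 runs 1 + N statements, each of which is a network round trip on a real client/server database.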
If you are only doing this to re-use your ORM solution, then you may be fighting a losing battle. I have invariably found that I need read-only datasets that can only be produced with SQL, so I now use ORM for per-object CRUD operations and regular SQL for searches, reports, aggregates etc.
If ranks are static values, consider caching them in your application.
If you need users frequently and ranks only rarely, consider lazy-loading the ranks (e.g., separate queries, but the second query gets used only occasionally).
Use the join if you're always going to need both sets of data, and they have to be current copies of the database.
Prototype any likely choices, and run performance tests.
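As a rough sketch of the lazy-loading idea, assuming a hypothetical User class and users/ranks tables (none of these names come from the question): the ranks query only runs the first time the property is read, and the result is kept on the object afterwards. Keeping the result also means it can go stale, which is the trade-off against the "current copies" point above.

```python
import sqlite3

# Hypothetical schema and class, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ranks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO ranks VALUES (1, 1, 'admin'), (2, 1, 'editor'), (3, 2, 'reader');
""")


class User:
    def __init__(self, conn, user_id, name):
        self._conn = conn
        self.id = user_id
        self.name = name
        self._ranks = None  # None means "not loaded yet"

    @property
    def ranks(self):
        # The second query runs only on first access; later reads reuse it.
        if self._ranks is None:
            rows = self._conn.execute(
                "SELECT title FROM ranks WHERE user_id = ? ORDER BY id",
                (self.id,),
            )
            self._ranks = [title for (title,) in rows]
        return self._ranks


def load_users(conn):
    """Load users only; no ranks query is issued here."""
    rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
    return [User(conn, user_id, name) for user_id, name in rows]


users = load_users(conn)
```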
EDIT: Further thoughts on your persistence layer, because I'm facing this one myself. Consider adding "persistence-like" classes that handle joins as their basic query, and are read-only. Whether this fits your particular scenario is for you to decide, but a lot of database access for many apps is based on joins, which can be rather large and complex. If you can handle these in a consistent manner with your persistent, updatable objects, it can be a big win for your overall architecture. Conceptually, it's a lot like having a view in the database, and querying the view instead of writing a join, but you're doing it all in code.
It depends upon how many users you anticipate. Option one will definitely be faster, but with a reasonable amount of data, the difference will be negligible.
In 99% of situations, the join will be faster.
However, there is one rare situation where it can be slower: a one-to-many join on a table with a large row size, when you are hitting the network bandwidth limit.
For example, say T1 has a blob column of 1MB, and you join it to T2, which has 100 rows for each T1 row. The result set will have 100 times as many rows as T1, and each of those rows repeats the T1 columns.
So if you query one T1 row with the join, the result set is about 100MB. If you fetch the T1 row (1MB) and then do a separate select to fetch its 100 T2 rows, the total is only about 1MB plus the small T2 rows.
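The arithmetic can be checked with a small sketch. The blob is scaled down from 1MB to 1000 bytes so it runs instantly, and the t1/t2 column names are assumptions; only the ratio matters.

```python
import sqlite3

BLOB_SIZE = 1000  # scaled-down stand-in for the 1MB blob
CHILDREN = 100    # T2 rows per T1 row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, payload BLOB)")
conn.execute("CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER, label TEXT)")
conn.execute("INSERT INTO t1 VALUES (1, ?)", (bytes(BLOB_SIZE),))
conn.executemany(
    "INSERT INTO t2 (t1_id, label) VALUES (1, ?)",
    [(f"row{i}",) for i in range(CHILDREN)],
)

# Join: the blob is repeated on every one of the 100 joined rows.
joined = conn.execute(
    "SELECT t1.payload, t2.label FROM t1 "
    "JOIN t2 ON t2.t1_id = t1.id WHERE t1.id = 1"
).fetchall()
join_blob_bytes = sum(len(payload) for payload, _label in joined)

# Separate selects: the blob is fetched exactly once.
(parent_payload,) = conn.execute("SELECT payload FROM t1 WHERE id = 1").fetchone()
labels = conn.execute("SELECT label FROM t2 WHERE t1_id = 1").fetchall()
split_blob_bytes = len(parent_payload)
```

Here the join carries 100 copies of the blob while the two-query version carries one, the same 100:1 ratio as in the 100MB vs 1MB example.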
Consider a case where we have 2 tables. The requirement is to implement a function that selects the top 10 records (ordered by some rules) from TABLE_A and TABLE_B, where table_a.id == table_b.a_id == X. There are two options:
Using a JOIN in the SQL query;
Making 2 selection queries against the db: SELECT * FROM table_a WHERE id = X and SELECT * FROM table_b WHERE a_id = X, fetching 10 records from each query (let's assume the ordering is correct in this case) into memory, then joining them in code (using a for loop and a hashtable or something like that).
I've heard that JOIN might lower system performance (this said "db performance" before, but that was wrong; see the follow-up below for reference). Besides, in this case we only query for 10 results at most, so it is acceptable to load them all into memory and join them there.
My question is: is there a general guideline in the industry for when to use JOIN in the database layer instead of joining in memory, and when to do the opposite?
============
Follow up:
So here are some reasons/scenarios I've read for "moving JOIN from the database layer to the service layer":
If we are joining multiple tables, they will all be locked at once. And if the operation takes time and the service requires low response time, it might block other executions;
Hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
In complicated systems there might be historical reasons why data was migrated to or created in different dbs (or db systems, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
To answer simply, it depends.
Generally, it is preferable to do data operations close to the data, instead of pulling the data up into higher layers and handling it there. You can see many PL/SQL-based implementations that do their operations close to the data. Languages like PL/SQL (Oracle) or T-SQL (SQL Server) are designed for complex data operations.
But if you have an application that brings in data from disparate systems and has to join across them, you have to do the join in memory.
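For the disparate-systems case, the in-memory join usually boils down to a hash join. A minimal sketch, assuming both sources have already been fetched into lists of dicts (the hash_join helper and the sample rows are made up for illustration):

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dicts in memory using a hash table."""
    # Build phase: index the right side by its join key.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    # Probe phase: look up each left row's key in the index.
    joined = []
    for left in left_rows:
        for right in index.get(left[left_key], []):
            joined.append({**left, **right})
    return joined


# Pretend these came from two different systems.
table_a = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
table_b = [
    {"a_id": 1, "value": 10},
    {"a_id": 1, "value": 20},
    {"a_id": 3, "value": 30},
]
result = hash_join(table_a, table_b, "id", "a_id")
```

This is fine for small inputs like the 10-row case in the question, but it means both sides have to fit in memory and be shipped to the application first.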
If we are joining multiple tables, they will all be locked at once. And if the operation takes time and the service requires low response time, it might block other executions;
Readers do not block other readers. They take what is called a shared lock, and once the read operation is over, the shared lock is released. As @TimBiegeleisen noted, you can create indexes to speed up read operations, based on need. It is always preferable to read only the needed columns (projection) and the needed rows (filtering).
Hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
As long as you select only the needed columns instead of using SELECT *, you should not have issues. If many changes are coming, you can consider creating a schema-bound view (WITH SCHEMABINDING in SQL Server) to prevent breaking schema changes to the underlying tables.
In complicated systems there might be historical reasons why data was migrated to or created in different dbs (or db systems, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
Design the application for the current need. Don't assume that something like that will happen in the future, design for it, and compromise on current application performance. If there is a definite future need, go for in-memory operations. Otherwise, it is better to go for database JOINs.
This question seems to have been asked a lot, and the answer seems to be "it depends on the details", so I am asking about my specific case: is it better for me to have multiple queries or to use joins?
The details are as follows:
"products" table -- probably around 2000 rows, 15 or so columns
"tags" table -- probably around 10 rows, 3 columns
"types" table -- probably around 10 rows, 3 columns
I need the "tags" and "types" tables to look up the tag/type ids that are stored in the products table.
My gut says that if I join the tables I end up searching a much, much larger set, so it's better to do multiple queries, but I am not really sure...
Thoughts?
No, join will probably outperform multiple queries. Your tables are extremely small.
Ask yourself what extra work would be involved in doing multiple queries... I don't know what you need this data for, but I assume you would, at some point, need to correlate the results - match Tags and Types to Products, wouldn't you? If you don't do that with a join, you just have to do it elsewhere with some other mechanism.
Further, your conception of this overlooks the fact that databases are designed for join scenarios. If you perform three isolated queries, the database has no opportunity to optimize its querying behavior across the results you're looking for. If you do it in one query with a join, it does have that opportunity.
In my opinion, leave the problem of producing a result set of ~2000 * 10 * 10 records and then filtering it to the database; that's what it's good at. :)
The amount of data is too small to demonstrate an advantage of one over the other, but multiple separate queries will transfer more over the wire than a single query. There is packet overhead for each query, and separate data sets risk being inconsistent if the data changes between queries, unless they run in the same transaction.
JOINs specifically might not be necessary; EXISTS or IN can be used if the supporting tables don't expose columns in the result set. A JOIN between parent and child tables, where a parent can have more than one child, will inflate the rows searched, though not necessarily the rows returned.
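A small sketch of the difference, using hypothetical parent/child tables (not the products/tags/types tables from the question). When only parent columns are selected, a plain JOIN hands back the parent once per matching child unless you de-duplicate it, while EXISTS returns each parent at most once.

```python
import sqlite3

# Hypothetical parent/child tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (id INTEGER PRIMARY KEY, parent_id INTEGER, note TEXT);
    INSERT INTO parent VALUES (1, 'p1'), (2, 'p2');
    INSERT INTO child VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c');
""")

# JOIN: one output row per matching child, so 'p1' appears three times.
join_rows = conn.execute(
    "SELECT p.name FROM parent p "
    "JOIN child c ON c.parent_id = p.id ORDER BY p.id"
).fetchall()

# EXISTS: only asks whether at least one child exists, so 'p1' appears once.
exists_rows = conn.execute(
    "SELECT p.name FROM parent p "
    "WHERE EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id) "
    "ORDER BY p.id"
).fetchall()
```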
Assuming that everything has indexes on the primary keys (you should be doing that), then joins will be very efficient. The only case where joins would be worse is if you had some kind of external caching of query results (as some ORMs will do for you), your products table was much bigger, and you were querying at a sufficient rate to keep the results of the two smaller queries (but not the third) in cache. In that scenario multiple queries becomes faster because you're only making one of the three queries. But the difference is going to be hard to measure.
If the database is not on localhost but accessed over a network it's better to send one request, let the database do the work and retrieve the data at once. This will give you less network delay. So joins are preferred.
Say tableA has 1 row to be returned but will have 100 columns returned while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key for table A.
Will a left join of tableA to tableB return 100*100 cells of data while 2 separate queries return 100 + 100 cells of data (50 times less data), or is that a misunderstanding of how it works?
Is it ever more efficient to use many simple queries rather than fewer more complex ones?
First and foremost, I would question a table with 100 columns, and suggest that there is possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned with one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that the design should be reviewed.
However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:
1 - What is the speed of the link between the db server and the application server? If it is very slow, then you are probably better off minimizing the amount of data returned rather than the number of queries you run. If it is not slow, then you will likely spend more time executing two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.
2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows columns 2 through 101 are duplicated for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own environment to know for sure.
As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.
What is best at joining? A database engine or client code? Saying that, I use both techniques: it depends on the client and how data will be used.
Where the data requires some processing to, say, render on a web page I'd probably split header and details recordsets. We do use this because we have some business logic between DB and HTML
Where it's consumed simply and linearly, I'd join in the database to avoid unnecessary processing. For example, simple reports or exports
It depends. If you only take SQL efficiency into account, obviously several simpler queries with smaller results will be more efficient.
But you need to take the whole process into account: if the join would otherwise be done on the client, or you need to filter results after the join, then the DBMS will probably be more efficient than doing it in your code.
Coding is always a tradeoff between different systems (DB vs. client, RAM vs. CPU...); you need to be conscious of this and try to find the best solution.
In this case 2 queries probably outperform 1, but that is not a general solution.
I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.
As for the row numbers returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie together rows on the given conditions, and for outer joins preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
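The numbers from the question can be checked directly. This sketch builds a 100-column tableA with one row and a tableB with 100 matching rows (the column names c1..c99, a_id and val are assumptions), then counts rows and cells for the left join versus two separate queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# tableA: id plus c1..c99 = 100 columns, one row.
extra_cols = ", ".join(f"c{i} INTEGER" for i in range(1, 100))
conn.execute(f"CREATE TABLE tableA (id INTEGER PRIMARY KEY, {extra_cols})")
conn.execute("CREATE TABLE tableB (a_id INTEGER REFERENCES tableA(id), val INTEGER)")
conn.execute("INSERT INTO tableA (id) VALUES (1)")
conn.executemany("INSERT INTO tableB VALUES (1, ?)", [(i,) for i in range(100)])

# One query: every joined row repeats all 100 tableA columns.
joined = conn.execute(
    "SELECT a.*, b.val FROM tableA a LEFT JOIN tableB b ON b.a_id = a.id"
).fetchall()
join_cells = sum(len(row) for row in joined)

# Two queries: tableA's columns are returned once.
a_rows = conn.execute("SELECT * FROM tableA").fetchall()
b_rows = conn.execute("SELECT val FROM tableB").fetchall()
split_cells = sum(len(r) for r in a_rows) + sum(len(r) for r in b_rows)
```

So the left join returns 100 rows, not 100*100 rows, but each row is 101 cells wide, which gives 10100 cells against 200 for the two separate queries. Whether that difference matters depends on the link speed and protocol points raised in the other answer.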
For very query-intensive applications, performance may be better with less normalized tables. However, as always with optimizations, I'd only consider going in that direction after seeing real, measurable problems (e.g. with a profiling tool).
In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.
Generally, fewer queries make for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number of queries.
There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returned, such as what the fields actually contain, i.e. the actual amount of data.
As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
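The query-per-record pattern and a two-query alternative can be compared with a small counting wrapper. Everything here (the CountingDb class, the orders/items tables) is hypothetical and only meant to show how the statement count grows with the number of records.

```python
import sqlite3


class CountingDb:
    """Thin wrapper that counts how many statements are sent."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY);
        CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
        INSERT INTO orders VALUES (1), (2), (3);
        INSERT INTO items VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 3, 'C');
    """)
    return CountingDb(conn)


def related_per_record(db):
    """One query for the records, then one more query for each record."""
    out = {}
    for (order_id,) in db.query("SELECT id FROM orders ORDER BY id"):
        rows = db.query(
            "SELECT sku FROM items WHERE order_id = ? ORDER BY id", (order_id,)
        )
        out[order_id] = [sku for (sku,) in rows]
    return out


def related_batched(db):
    """Two queries in total, regardless of how many records there are."""
    ids = [oid for (oid,) in db.query("SELECT id FROM orders ORDER BY id")]
    out = {oid: [] for oid in ids}
    placeholders = ", ".join("?" for _ in ids)
    rows = db.query(
        f"SELECT order_id, sku FROM items "
        f"WHERE order_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    for order_id, sku in rows:
        out[order_id].append(sku)
    return out


loop_db, batch_db = make_db(), make_db()
loop_result = related_per_record(loop_db)
batch_result = related_batched(batch_db)
```

With three records the loop sends four statements and the batched version sends two; with a thousand records that becomes 1001 versus two.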
“Is it ever more efficient to use many simple queries rather than fewer more complex ones?”
The query that requires the least amount of data to traverse, and gives you no more than what you need is the more efficient one. Beyond this, there can be RDBMS specific conditions that can be more efficient on one RDBMS system than another. At the very low level, when you deal with less data, then your results can be retrieved much quicker, so efficient queries are queries that only work with the least amount of data needed to get you the result you are looking for.
I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
    other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. DBs are very fast within their problem space, and joining tables is part of it; you'll see more performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, the two-query pattern is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But only experience and testing will tell you that. MySQL with moderately large tables seems to be susceptible to this, IME. Performing three queries on one table each can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it were an ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs. non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something From foo a Left Join foo2 b On a.foreign_key = b.key Where a.Key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
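Here is a runnable sketch of that LEFT JOIN against an in-memory SQLite database. The foo/foo2 tables follow the pseudocode in the question; the sample rows and the fetch_with_related name are made up. A missing related row shows up as None in the last column, which is what the application would branch on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "key" is quoted because it is also an SQL keyword.
conn.executescript('''
    CREATE TABLE foo2 ("key" INTEGER PRIMARY KEY, something TEXT);
    CREATE TABLE foo (
        "key" INTEGER PRIMARY KEY, blah1 TEXT, blah2 TEXT, foreign_key INTEGER
    );
    INSERT INTO foo2 VALUES (10, 'related');
    INSERT INTO foo VALUES (1, 'a', 'b', 10), (2, 'c', 'd', 0);
''')


def fetch_with_related(conn, bar):
    """Return (blah1, blah2, something); something is None if unmatched."""
    return conn.execute(
        '''SELECT a.blah1, a.blah2, b.something
           FROM foo a
           LEFT JOIN foo2 b ON a.foreign_key = b."key"
           WHERE a."key" = ?''',
        (bar,),
    ).fetchone()


with_match = fetch_with_related(conn, 1)
without_match = fetch_with_related(conn, 2)
```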
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that one database hit gets you all the data you need, a single SQL statement will perform better 99% of the time. I'm not sure whether the connections are being created dynamically in this case, but if so, doing that is expensive. Even if the process is reusing existing connections, the DBMS does not get the chance to optimize the queries in the best way and is not really making use of the relationships.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is large and is only needed in some cases. But in the sample you describe, it just grabs the data whenever it exists, so this is not the case and there is no performance gain.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the original query I was inheriting consisted of a single query with so many joins in it that it would take SQL a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables) and broke the query down into a lot of the smaller single select type statements and constructed the final result set in this manner.
This update dramatically improved the response time, down to a few seconds, because it was easier to do a lot of simple "one shots" to retrieve the necessary data.
I'm not trying to object for objection's sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query would perform better, as the SQL server (which sometimes isn't in the same location) just needs to handle one request. If you use multiple SQL queries, you introduce a lot of overhead:
executing more CPU instructions,
sending a second query to the server,
creating a second thread on the server,
possibly executing more CPU instructions on the server,
destroying the second thread on the server,
sending the second result back.
There might be exceptional cases where the performance could be better, but for simple things you can't reach better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables to ensure you aren't scanning the disk for all records, that is the biggest performance hit a database gets (in my limited experience)
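One way to check whether an index covers a query is to ask the database for its plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in; other databases have their own equivalents, and the exact plan wording differs between engines and versions. The table, index and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo2 (id INTEGER PRIMARY KEY, fk INTEGER, something TEXT, notes TEXT)"
)
# The index holds both the lookup column and the selected column,
# so the query below can be answered from the index alone.
conn.execute("CREATE INDEX idx_foo2_fk_something ON foo2 (fk, something)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT something FROM foo2 WHERE fk = ?", (1,)
).fetchall()
# The human-readable description is the fourth column of each plan row.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

If the plan mentions a table scan instead of the index, the query is reading every record, which is the case the answer above warns about.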
You should always try to minimize the number of queries to the database when you can. Your example is perfect for a single query. This way you will later be able to cache more easily, or to handle more requests at the same time, because instead of always using 2-3 queries that each require a connection, you will have only 1 each time.
There are many cases that will require different solutions, and it isn't possible to explain them all together.
A join scans both tables and loops to match each record of the first table against the second table. A simple select query will work faster in many cases, as it only needs the primary/unique key (if one exists) to search the data internally.
What is better as far as performance goes?
There is only one way to know: Time it.
In general, I think a single join enables the database to do a lot of optimizations, as it can see all the tables it needs to scan, overhead is reduced, and it can build up the result set locally.
Recently, I had about 100 select-statements which I changed into a JOIN in my code. With a few indexes, I was able to go from 1 minute running time to about 0.6 seconds.
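A minimal timing harness along those lines, assuming a made-up users/ranks schema and helper names. It runs against an in-memory SQLite database, so there is no network round trip at all and the numbers will not transfer to a real client/server setup; the point is only the shape of the measurement.

```python
import sqlite3
import time

# Hypothetical schema and data volume, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE ranks (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.execute("CREATE INDEX idx_ranks_user ON ranks (user_id)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)", [(i, f"user{i}") for i in range(200)]
)
conn.executemany(
    "INSERT INTO ranks (user_id, title) VALUES (?, ?)",
    [(i, f"rank{j}") for i in range(200) for j in range(3)],
)


def with_join():
    return conn.execute(
        "SELECT u.id, r.title FROM users u "
        "JOIN ranks r ON r.user_id = u.id ORDER BY u.id, r.id"
    ).fetchall()


def with_many_selects():
    out = []
    for (user_id,) in conn.execute("SELECT id FROM users ORDER BY id").fetchall():
        rows = conn.execute(
            "SELECT title FROM ranks WHERE user_id = ? ORDER BY id", (user_id,)
        )
        out.extend((user_id, title) for (title,) in rows)
    return out


def best_time(fn, repeats=5):
    """Return (best wall-clock seconds over `repeats` runs, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return best, result


join_time, join_rows = best_time(with_join)
loop_time, loop_rows = best_time(with_many_selects)
print(f"join: {join_time:.6f}s, many selects: {loop_time:.6f}s")
```

Check that both variants return the same rows before comparing the times, and repeat the measurement against your real database and network.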
Do not try to write your own join loop as a bunch of selects. Your database server has many clever algorithms for doing joins. Further, your database server can use statistics and estimated cost of access to dynamically pick a join algorithm.
The database server's join algorithm is -- usually -- better than anything you might concoct. They know more about physical I/O, caching and what-not.
This allows you to focus on your problem domain.
A single join will usually outperform multiple single selects. However, there are too many different cases that fit your question. It isn't wise to lump them together under a single simple rule.
More important, a single join will usually be easier for the next programmer to understand and to revise, provided that you and the next programmer "speak the same language" when you use SQL. I'm talking about the language of sets of tuples.
And equally important is that database physical design and query design need to focus first on the questions that will result in a ten for one speed improvement, not on a 10% speed improvement. If you were doing thousands of simple selects versus a single join, you might get a ten for one advantage. If you are doing three or four simple selects, you won't see a big improvement one way or the other.
One thing to consider besides what has been said is that the selects will probably return more data through the network than the join will. If the network connection is already a bottleneck, this could make it much worse, especially if this is done frequently. That said, your best bet in any performance situation is to test, test, test.
It all depends on how the database will optimize the joins, and the use of indexes.
I had a slow and complex query with lots of joins. Then I subdivided it into 2 or 3 less complex queries. The performance gain was astonishing.
But in the end, "it depends"; you have to know where the bottleneck is.
As has been said before, there is no right answer without context.
The answer to this depends on (off the top of my head):
the amount of joining
the type of joining
indexing
the amount of re-use you could have for any of the separate pieces to be joined
the amount of data to be processed
the server setup
etc.
If you are using SQL Server (I am not sure if this is available with other RDBMSs), I would suggest that you bundle an execution plan with your query results. This will give you the ability to see exactly how your queries are being executed and what is causing any bottlenecks.
Until you know what SQL Server is actually doing I wouldn't hazard a guess about which query is better.
If your database has lots of data and there are multiple joins, then please use indexing for better performance.
If there are left/right outer joins in this case, then use multiple selects.
It all depends on your db size, your query, and the indexes (which include primary and foreign keys too). One cannot reach a yes/no conclusion on your question.