For example, we have a web application that uses PostgreSQL. The application has an AuthorService that implements CRUD operations for the Author entity. AuthorService uses the "authors" table in the database.
Now we need to implement a BookService, which should fetch data from the "books" table. BookService must also join in the Author entity.
If we use an SQL JOIN in BookService, then we need to repeat some logic (code) from AuthorService in BookService, since AuthorService contains the access control logic for the Author entity and the logic for generating the URLs of the authors' photos (S3 signed URLs).
OR we can use AuthorService inside BookService to fetch the data and then join the data in the application instead of in PostgreSQL (we can write a loop that joins the entities), but in this case we may have performance problems.
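For illustration, the first option might look roughly like this (the books.author_id column and the selected columns are just placeholders for this example):

-- Option 1: let PostgreSQL do the join (books.author_id is an assumed foreign key)
SELECT b.id, b.title, a.id AS author_id, a.name, a.photo_key
FROM books AS b
INNER JOIN authors AS a ON a.id = b.author_id;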
Which option is better?
I feel the right place to do the JOIN is in the database, even if, as you said, it might mean some extra code on the application side.
Joining inside the application layer would throw away all the optimizations the database optimizer could have applied had the join been done inside the database. The optimizer chooses how to return records based on statistics on the tables, columns, and histogram values, plus a whole lot of other optimizations.
Take looping logic as an example. If we have a small table called dept and a large table called emp, and we join the two in the database, it is most likely going to use a nested loop, which can be efficient since the large table needs to be traversed just once to get all matching records. And if the dept table is wide (many columns), the optimizer can choose to use an index and produce the same output efficiently.
If both tables are large, the optimizer may choose a hash join or a sort-merge join.
Consider the alternative: if you were to join in your application, you would be using only looping logic all the time (mostly a nested loop), or, if you were to implement a sophisticated algorithm for doing the "join", you would be duplicating all of the effort that has gone into building the database.
So the best option, in my humble opinion: use the database for any set-related operations (JOIN, FILTER, AGGREGATION).
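As a rough sketch of what I mean (the emp and dept column names here are my own assumptions), a single statement lets the database pick the join strategy and do the filtering and aggregation in one pass:

-- Join, filter and aggregate in the database (assumed columns: emp.dept_id, emp.salary, dept.name)
SELECT d.name, COUNT(*) AS headcount, AVG(e.salary) AS avg_salary
FROM emp AS e
INNER JOIN dept AS d ON d.id = e.dept_id
WHERE d.name = 'Engineering'
GROUP BY d.name;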
I'm going through the book Learning SQL by Alan Beaulieu. On the topic of inner joins, it says that whatever the order of tables in an INNER JOIN, the results are the same, and gives the following reason:
If you are confused about why all three versions of the account/employee/customer query
yield the same results, keep in mind that SQL is a nonprocedural language, meaning
that you describe what you want to retrieve and which database objects need to be
involved, but it is up to the database server to determine how best to execute your
query. Using statistics gathered from your database objects, the server must pick one
of three tables as a starting point (the chosen table is thereafter known as the driving
table), and then decide in which order to join the remaining tables. Therefore, the order
in which tables appear in your from clause is not significant.
So does it imply that if statistics gathered from database objects change, then results would also change?
So does it imply that if statistics gathered from database objects change, then results would also change?
No. The same query will always produce the same results (provided, of course, that the underlying data is the same). What the author is explaining is that the database may choose one strategy or another to process the query (starting from one table or another, using this or that algorithm to join the rows, and so on). That decision is made based on many factors, some of them relying on information that is available in the statistics.
The key point is that SQL is a declarative language, not a procedural language: you don't get to choose how the database handles the query, you just tell it what result you want.
However, regardless of the algorithm that the database chooses, the result is guaranteed to be consistent.
Note that there are edge cases where the database does not guarantee that results are the same for consecutive executions of the same query (like a query with a row-limiting clause but without an ORDER BY): it's the responsibility of the client to provide a query whose results are properly defined (the language does give you enough rope to hang yourself, if you really want to).
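A small sketch of those two points (the table and column names are made up for illustration): the first two queries return the same rows no matter which table is listed first in the FROM clause, but only the last one has a guaranteed row order.

-- Same result set regardless of table order in FROM
SELECT a.account_id, c.name FROM account AS a JOIN customer AS c ON c.cust_id = a.cust_id;
SELECT a.account_id, c.name FROM customer AS c JOIN account AS a ON c.cust_id = a.cust_id;

-- Row order is only defined when you ask for it
SELECT a.account_id, c.name
FROM account AS a JOIN customer AS c ON c.cust_id = a.cust_id
ORDER BY a.account_id;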
Consider this case where we have two tables. The requirement is to implement a function to select the top 10 records (ordered by some rules) from TABLE_A and TABLE_B, where table_a.id == table_b.a_id == X. There are two options:
Using a JOIN in the SQL query (a sketch follows this list);
Making two selection queries against the db: SELECT * FROM table_a WHERE id = X and SELECT * FROM table_b WHERE a_id = X, fetching 10 records from each query (let's assume the ordering is correct in this case) into memory, then joining them in the code (using a for loop and a hashtable, or something like that).
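For reference, the first option would look something like this (the ordering column and the LIMIT syntax are placeholders; LIMIT assumes something like Postgres or MySQL):

-- Option 1: join in the database and return only the top 10
SELECT table_a.*, table_b.*
FROM table_a
JOIN table_b ON table_b.a_id = table_a.id
WHERE table_a.id = X
ORDER BY table_b.some_rank   -- "ordered by some rules"
LIMIT 10;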
I've heard that JOIN might lower system performance (I originally wrote "db performance" here, but that was wrong; see the follow-up below for reference). Besides, in this case we only query for 10 results at most, so it is acceptable to load them all into memory and then join them there.
My question is: is there a general guideline in the industry that says under what circumstances we would recommend using JOIN in the database layer instead of doing it in memory, and when to do the opposite?
============
Follow up:
So here are some reasons/scenarios I've read for "moving the JOIN from the database layer to the service layer":
If we are joining multiple tables, they will all be locked at once. And if the operation takes time while the service requires a low response time, it might block other executions;
It is hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
There might be some historical reason in those complicated systems: data might have been migrated to or created in a different db (or db system, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
To answer simply, it depends.
Generally, it is preferable to do data operations closer to the data, instead of bringing the data higher up in the layers and handling the operations there. You can see many PL/SQL-based implementations, where they do operations close to the data. Languages like PL/SQL (Oracle) or T-SQL (SQL Server) are designed to do complex data operations.
But if you have an application that brings data from disparate systems and has to join between them, you have to do the join in memory.
If we are joining multiple tables, they will all be locked at once. And if the operation takes time while the service requires a low response time, it might block other executions;
Readers do not block other readers. They take something called a shared lock, and once the read operation is over, the shared lock is released. As @TimBiegeleisen said, you can create indexes to speed up read operations, based on need. It is always preferable to read only the needed columns (projection) and the needed rows (filtering).
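A minimal sketch of that advice, reusing the table names from the question (the index name and the chosen columns are assumptions):

-- Index the join/filter column so the read stays cheap
CREATE INDEX ix_table_b_a_id ON table_b (a_id);

-- Project only the needed columns and filter only the needed rows
SELECT b.col1, b.col2
FROM table_a AS a
JOIN table_b AS b ON b.a_id = a.id
WHERE a.id = X;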
It is hard to maintain in a big system. Changes to the tables involved in the JOIN might break the query.
As long as you are selecting only the needed columns, instead of SELECT *, you should not have issues. If many changes are coming, you can consider creating a SCHEMABINDING view, to prevent schema changes to the underlying tables from breaking it.
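In SQL Server a schema-bound view looks roughly like this (the names are illustrative); WITH SCHEMABINDING requires two-part table names and explicit column lists, and the referenced tables can then no longer be altered in a way that would break the view:

CREATE VIEW dbo.v_a_with_b
WITH SCHEMABINDING
AS
SELECT a.id, a.col1, b.col1 AS b_col1
FROM dbo.table_a AS a
JOIN dbo.table_b AS b ON b.a_id = a.id;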
There might be some historical reason in those complicated systems: data might have been migrated to or created in a different db (or db system, say one table in DynamoDB and the other one in Postgres), which makes a JOIN in the database layer impossible.
Design the application for the current need. Don't assume that something like that will happen in the future, design for it, and compromise on current application performance. If there is a definite future need, go for in-memory operations. Otherwise, it is better to go for database JOINs.
I have a large database for which I want to set up a generalized query method for sub-tables (or a join between sub-tables). However, the tables I'm interested in are sub-tables of a parent table, an unknown number of relationships deep from that parent table, depending on the table I'm querying.
Is there a means by which you can get SQL to automatically join all of the intermediate tables between the two tables of interest? Or to narrow a query to only a subset of the parent table?
For example this set of relationships:
Folder_Table->System_Table->Items_Table->Items_Class->Items_attributes->Items_Methods->Method_Data->Method_History
I want to be able to generically do searches or joins on any of the sub-tables, where the results are for only a single folder of Folder_Table, without having to do a series of explicit joins X table levels deep... which would significantly increase the complexity of building generic query interfaces at runtime.
No, there is not.
What you're asking for is the famous "figure out what I want done and do it" function, which would be the golden panacea of programming languages or databases.
SQL is explicit. You need to specify the path by explicitly listing the tables to join and how to join them.
Now, could you make such a function for your specific case? Sure. You would build into it the knowledge of either your specific table structures, or the way to obtain the information needed to automatically find the path between table A and table B. However, there is no such built-in function that already exists, just waiting for you to use it. So if you want such a function, you're going to have to write it yourself.
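If you do write it yourself, the foreign-key metadata is one place to start. Here is a sketch against the standard INFORMATION_SCHEMA views (available in SQL Server and several other databases) that lists each child table together with its parent; you could walk this list recursively to build the join path between two tables:

SELECT child.TABLE_NAME  AS child_table,
       parent.TABLE_NAME AS parent_table
FROM INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS AS rc
JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS child
  ON  child.CONSTRAINT_SCHEMA = rc.CONSTRAINT_SCHEMA
  AND child.CONSTRAINT_NAME   = rc.CONSTRAINT_NAME
JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS parent
  ON  parent.CONSTRAINT_SCHEMA = rc.UNIQUE_CONSTRAINT_SCHEMA
  AND parent.CONSTRAINT_NAME   = rc.UNIQUE_CONSTRAINT_NAME;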
Bonus question:
What if there are multiple paths between A and B?
I'm working with an architecture that requires mixing large (static) tables joined with very frequently changing data.
To emphasize the point, imagine the SO website's online-user data inner joined to its user database:
SELECT * FROM UserProfile INNER JOIN OnlineUser ON UserProfile.id = OnlineUser.id
where UserProfile resides in a large SQL table and OnlineUser is dynamic data on the web server.
Putting everything in memory takes up a lot of room, and putting everything in the database would really tax the server (I shudder to think of it). Is there a better way of doing this?
God Jon Skeet says LINQ can't cope with doing a join between an in-memory collection and a database table. He suggests a contains clause or list, both of which wouldn't be appropriate in this case.
Edit:
A pinned in-memory table (DBCC PINTABLE) in SQL Server could do this. Since that feature has been deprecated, is it safe to assume SQL Server 2008 will figure out on its own to keep it in memory to reduce IO?
You have to do the actual join operation somewhere. Either you bring the OnlineUser data to the SQL Server and perform the join there, or you bring the UserProfile data into memory and perform the join there.
Usually it's better to let the database server sort out the data needed and only read that into memory. One way to solve the problem is to create a temp table on the SQL Server, where you put the data fields required for the join operation (in your example OnlineUser.id).
Then execute a SQL query to get the required data into memory. Make sure that you have indexes that speed up the filtering. Once retrieved into memory, match the two collections (e.g., by having them sorted on the same key and then using the LINQ Zip operator).
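A sketch of that temp-table approach in T-SQL (assuming the online-user ids are integers pushed from the web tier):

-- Push just the ids of the currently online users to the server
CREATE TABLE #online_user (id INT PRIMARY KEY);
-- ... insert the ids from the web tier here (bulk copy or parameterized inserts) ...

-- Let SQL Server do the join and return only the matching profiles
SELECT up.*
FROM UserProfile AS up
JOIN #online_user AS ou ON ou.id = up.id;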
I have a data warehouse containing typical star schemas, and a whole bunch of code which does stuff like this (obviously a lot bigger, but this is illustrative):
SELECT cdim.x
,SUM(fact.y) AS y
,dim.z
FROM fact
INNER JOIN conformed_dim AS cdim
ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim
ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim AS ddim
ON ddim.date_id = fact.date_id
WHERE fact.date_id = #date_id
GROUP BY cdim.x
,dim.z
I'm thinking of replacing it with a view (MODEL_SYSTEM_1, say), so that it becomes:
SELECT m.x
,SUM(m.y) AS y
,m.z
FROM MODEL_SYSTEM_1 AS m
WHERE m.date_id = #date_id
GROUP BY m.x
,m.z
But the view MODEL_SYSTEM_1 would have to contain unique column names, and I'm also concerned about performance with the optimizer if I go ahead and do this: I want all the items in the WHERE clause across the different facts and dimensions to still be optimized, since the view would span a whole star, and views cannot be parameterized (boy, wouldn't that be cool!).
So my questions are -
Is this approach OK, or is it just going to be an abstraction which hurts performance and doesn't give me anything but a lot nicer syntax?
What's the best way to code-gen these views, eliminating duplicate column names (even if the view later needs to be tweaked by hand), given that all the appropriate PKs and FKs are in place? Should I just write some SQL to pull it out of INFORMATION_SCHEMA, or is there a good example already available?
Edit: I have tested it, and the performance seems the same, even on the bigger processes - even joining multiple stars which each use these views.
The automation is mainly because there are a number of these stars in the data warehouse, and the FK/PK has been done properly by the designers, but I don't want to have to pick through all the tables or the documentation. I wrote a script to generate the view (it also generates abbreviations for the tables), and it works well to generate the skeleton automagically from INFORMATION_SCHEMA, and then it can be tweaked before committing the creation of the view.
If anyone wants the code, I could probably publish it here.
I've used this technique on several data warehouses I look after. I have not noticed any performance degradation when running reports based on the views versus querying the tables directly, but I have never performed a detailed analysis.
I created the views using the designer in SQL Server management studio and did not use any automated approach. I can’t imagine the schema changing often enough that automating it would be worthwhile anyhow. You might spend as long tweaking the results as it would have taken to drag all the tables onto the view in the first place!
To remove ambiguity, a good approach is to preface the column names with the name of the dimension they belong to. This is helpful to the report writers and to anyone running ad hoc queries.
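For example, the view might expose its columns like this (a sketch reusing the tables from the question; the chosen column names are only illustrative):

CREATE VIEW MODEL_SYSTEM_1 AS
SELECT cdim.x        AS conformed_dim_x,
       dim.z         AS nonconformed_dim_z,
       ddim.date_id  AS date_dim_date_id,
       fact.y        AS fact_y,
       fact.date_id  AS fact_date_id
FROM fact
INNER JOIN conformed_dim    AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN nonconformed_dim AS dim  ON dim.ncdim_dim_id = fact.ncdim_dim_id
INNER JOIN date_dim         AS ddim ON ddim.date_id     = fact.date_id;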
Make the view or views into one or more summary fact tables and materialize them. These only need to be refreshed when the main fact table is refreshed. The materialized views will be faster to query, and this can be a win if you have a lot of queries that can be satisfied by the summary.
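If this is SQL Server, the materialization could be an indexed view; a rough sketch reusing the question's names, and assuming fact.y is declared NOT NULL (indexed views require SUM over non-nullable expressions):

CREATE VIEW dbo.v_summary_system_1
WITH SCHEMABINDING
AS
SELECT cdim.x,
       dim.z,
       fact.date_id,
       SUM(fact.y)  AS y,
       COUNT_BIG(*) AS row_count   -- required for indexed views with GROUP BY
FROM dbo.fact
INNER JOIN dbo.conformed_dim    AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
INNER JOIN dbo.nonconformed_dim AS dim  ON dim.ncdim_dim_id = fact.ncdim_dim_id
GROUP BY cdim.x, dim.z, fact.date_id;
GO
CREATE UNIQUE CLUSTERED INDEX ix_v_summary_system_1
    ON dbo.v_summary_system_1 (x, z, date_id);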
You can use the data dictionary or information schema views to generate SQL to create the tables if you have a large number of these summaries or wish to change them about frequently.
However, I would guess that it's not likely that you would change these very often so auto-generating the view definitions might not be worth the trouble.
If you happen to use MS SQL Server, you could try an Inline UDF which is as close to a parameterized view as it gets.
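A minimal sketch of such an inline table-valued function, reusing the names from the question (the @date_id type is an assumption):

CREATE FUNCTION dbo.fn_model_system_1 (@date_id INT)
RETURNS TABLE
AS
RETURN
(
    SELECT cdim.x,
           SUM(fact.y) AS y,
           dim.z
    FROM fact
    INNER JOIN conformed_dim    AS cdim ON cdim.cdim_dim_id = fact.cdim_dim_id
    INNER JOIN nonconformed_dim AS dim  ON dim.ncdim_dim_id = fact.ncdim_dim_id
    WHERE fact.date_id = @date_id
    GROUP BY cdim.x, dim.z
);
-- usage: SELECT * FROM dbo.fn_model_system_1(@some_date_id);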