Best practices for multiple table joins: UNION vs JOIN - SQL

I'm using a query which returns ~74 fields from different database tables.
The query consists of 10 FULL JOINS and 15 LEFT JOINS.
Each join condition is based on different fields.
The query fetches data from a main table in which almost 90% of the columns are foreign keys.
I'm using those joins to resolve the foreign keys, but some of the data doesn't actually require all of them, because by its type (in the business logic) it doesn't use that information.
Let me give an example:
Each employee can have multiple tasks. There are four task types (1, 2, 3, 4).
Each task type has a different meaning. When running the query, I get data for all those task types and then apply some logic to show them separately.
My question is: would it be better to use UNION ALL and split the 4 different cases into separate queries? That way each branch of the union could use only the joins it actually needs.
Thanks,

I would think it depends strongly on the size (row count) of the main table and of, e.g., the task tables.
Say your main table has tens of millions of rows and the task tables are smaller; then a UNION ALL of four separate queries has to scan the main table once per branch, whereas a single query joining the smaller task tables can do the same work with one scan of the main table.
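To make the two shapes concrete, here is a minimal sketch, assuming a hypothetical employees/tasks schema with per-type detail tables (none of these names come from the question):

-- Variant 1: a single query; the main table is scanned once,
-- but every join is present even for rows that don't need it
SELECT e.id, t.task_type, d1.info, d2.info
FROM employees e
JOIN tasks t ON t.employee_id = e.id
LEFT JOIN type1_details d1 ON d1.task_id = t.id  -- only meaningful when task_type = 1
LEFT JOIN type2_details d2 ON d2.task_id = t.id; -- only meaningful when task_type = 2

-- Variant 2: UNION ALL; each branch carries only the joins it needs,
-- but the employees/tasks tables are scanned once per branch
SELECT e.id, t.task_type, d1.info AS info
FROM employees e
JOIN tasks t ON t.employee_id = e.id AND t.task_type = 1
LEFT JOIN type1_details d1 ON d1.task_id = t.id
UNION ALL
SELECT e.id, t.task_type, d2.info
FROM employees e
JOIN tasks t ON t.employee_id = e.id AND t.task_type = 2
LEFT JOIN type2_details d2 ON d2.task_id = t.id;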

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables. The first, subscriptions, has a field called type, which can only have 2 values:
FREE
PREMIUM.
The other two tables are called premium_users and free_users. I'd like to perform a LEFT JOIN, starting from the subscriptions table, but the thing is that depending on the value of the type field I will ONLY find the matching row in one table or the other, i.e. if type equals 'FREE', then the matching row will ONLY be in the free_users table, and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINing both tables and then using a COALESCE function to get the non-null value, or a UNION of two different queries, each using an INNER JOIN, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know is that I'm joining on the user_id field, which is the PK in both free_users and premium_users.
So, my question is: which would be the most performant way to do a JOIN that, depending on the value of the type column, will match one table or the other? Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.
What is best in terms of performance? Well, you should try both on your data and your systems.
My recommendation is two left joins:
select s.*,
       coalesce(fu.name, pu.name) as name
from subscriptions s
left join free_users fu
    on fu.free_id = s.subscription_id
    and s.type = 'free'
left join premium_users pu
    on pu.premium_id = s.subscription_id
    and s.type = 'premium';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because those ids should be the primary keys of their tables.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.
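For comparison, the UNION ALL version the question describes would look roughly like this (a sketch; s.* abbreviates the column list):

select s.*, fu.name as name
from subscriptions s
join free_users fu
    on fu.free_id = s.subscription_id
where s.type = 'free'
union all
select s.*, pu.name as name
from subscriptions s
join premium_users pu
    on pu.premium_id = s.subscription_id
where s.type = 'premium';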

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of joining on each and every column, I could create a concatenated column in both projections (CA_CONCAT -> A+B+C+D) and join on that, just to check which method performs better.
It was working faster at first, but in a few calculation views this method is sometimes slower, especially when filtering!
Can any one suggest which is an efficient method?
I don't think JOIN conditions on concatenated fields will perform better.
Although we say that in general there is no need for indexes on column tables in a HANA database, column tables have a structure that effectively works like an index on every column.
So if you concatenate 4 columns into a new calculated field, you first of all lose the option to use those per-column structures on the 4 columns and on the corresponding join columns.
I did not check the execution plan, but it will probably do a full scan on these columns.
In fact, I'm surprised you mention that it worked faster at all and caused problems only in a few cases, because concatenating, or applying any function to, a database column is extra workload on top of the SELECT all by itself. It might involve an implicit type cast, which can add more overhead than expected.
First, I would suggest setting your table to column store and checking the new performance.
After that, I would suggest separating the JOIN into multiple JOINs if you are using an OR condition in your join; see the sketch below.
Third, an INNER JOIN will give you better performance compared to a LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance: you are better off joining on PRIMARY KEYS and not on every column.
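As a sketch of that second point, a join with an OR condition can often be split into two plain joins combined with UNION (t1/t2 and the column names here are hypothetical, not from the question):

-- one join with OR in the condition: hard for the engine to optimize
-- select t1.id, t2.id from t1 join t2 on t1.a = t2.a or t1.b = t2.b;

-- split into two simple joins; UNION also removes duplicate matches
select t1.id, t2.id
from t1
join t2 on t1.a = t2.a
union
select t1.id, t2.id
from t1
join t2 on t1.b = t2.b;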
For me, both times the join on multiple fields performed faster than the join on the concatenated field. For the filtering scenario, PlanViz shows that when I join on multiple fields, the filter gets pushed down to both tables. On the other hand, when I join on the concatenated field, only one table gets filtered.
However, if you put a filter on both fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'
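Expressed as plain SQL for brevity (the calculation-view projections behave analogously), the two join shapes being compared look roughly like this; Tab1/Tab2 stand for the two data sources, and everything else is assumed:

-- Join on the individual columns: the per-column structures
-- can be used, and filters can be pushed to both sides
SELECT *
FROM Tab1 t1
JOIN Tab2 t2
  ON t1.A = t2.A AND t1.B = t2.B AND t1.C = t2.C AND t1.D = t2.D;

-- Join on a concatenated calculated column: the per-column structures
-- can no longer be used, and the concatenation may force implicit casts
SELECT *
FROM Tab1 t1
JOIN Tab2 t2
  ON (t1.A || t1.B || t1.C || t1.D) = (t2.A || t2.B || t2.C || t2.D);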

Comparing two partitions' data in Hive

I have 9 million records in each of my partitions in Hive, and I have two partitions. The table has 20 columns. Now I want to compare the datasets between the two partitions based on an id column. Which is the best way to do it, considering the fact that a self join with 9 million records will create performance issues?
Can you try the SMB (sort-merge-bucket) join? It's mostly like merging two sorted lists. However, in this case you will need to create two more tables.
Another option would be to write a UDF to do the same, but that would be a project by itself. The first option is easier.
Did you try the self join and have it fail? I don't think it should be an issue as long as you specify the join condition correctly. 9 million rows is actually not that much for Hive. It can handle large joins by using the join condition as a reduce key, so it doesn't actually do the full cartesian product.
-- Filter each side in a subquery: if the partition filters stay in the
-- outer WHERE clause, rows that exist in only one partition are dropped,
-- because their columns from the missing side are NULL.
select a.foo, b.foo
from (select * from my_table where partition = 'x') a
full outer join (select * from my_table where partition = 'y') b
on a.id <=> b.id
To do a full comparison of 2 tables (or of 2 partitions of the same table), my experience has shown me that using some checksum mechanism is a more effective and reliable solution than joining the tables (which gives performance problems, as you mentioned, and also causes difficulties when keys are repeated, for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq.
In your case, you would use that program specifying that the "2 tables to compare" are the same and using the "--source-where" and "--destination-where" to indicate which partitions you want to compare. The "--group-by-column" option might also be useful to specify the "id" column.
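As a minimal sketch of the checksum idea in HiveQL (the column names are placeholders; this detects that the partitions differ, not which rows differ, and hash collisions can in principle mask a difference):

-- one aggregate fingerprint per partition; if the two sums
-- differ, the partitions' contents differ somewhere
select sum(hash(id, col1, col2)) as partition_checksum
from my_table
where partition = 'x';

select sum(hash(id, col1, col2)) as partition_checksum
from my_table
where partition = 'y';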

JOIN VIEWs containing UNION ALL: performance disaster?

Consider the following standard table/subtable relationship in a MS SQL Server DB:
PARENT
-ID (PK)
-OTHER_FIELD
CHILD
-ID (PK)
-PARENT_ID (FK)
-OTHER_FIELD
Now also consider that there exist other versions of these tables in the DB (with the same structure): PARENT_2/CHILD_2, PARENT_3/CHILD_3, PARENT_4/CHILD_4.
I want to create a standard mechanism to select data from all of the tables combined in a simple query. The first thing that came to mind is:
PARENT_VIEW = PARENT UNION ALL PARENT_2 UNION ALL PARENT_3 UNION ALL PARENT_4
CHILD_VIEW = CHILD UNION ALL CHILD_2 UNION ALL CHILD_3 UNION ALL CHILD_4
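Concretely, the view definitions would look something like this (a sketch with the column lists written out):

CREATE VIEW PARENT_VIEW AS
    SELECT ID, OTHER_FIELD FROM PARENT
    UNION ALL
    SELECT ID, OTHER_FIELD FROM PARENT_2
    UNION ALL
    SELECT ID, OTHER_FIELD FROM PARENT_3
    UNION ALL
    SELECT ID, OTHER_FIELD FROM PARENT_4;
GO
CREATE VIEW CHILD_VIEW AS
    SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD
    UNION ALL
    SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_2
    UNION ALL
    SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_3
    UNION ALL
    SELECT ID, PARENT_ID, OTHER_FIELD FROM CHILD_4;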
But I'm concerned that queries containing JOINs against these views will be horrendous, because each row in PARENT_VIEW would have to be checked against each row in CHILD_VIEW for every JOIN. There are actually multiple different CHILD tables (resulting in multiple JOINs), and I'll be dealing with large amounts of data in due course. I don't think this will scale.
Ideally I want to preserve the structure of the tables (opposed to flattening them out).
Queries made against the combined tables would always be filtered using WHERE conditions.
I'm aware of the alternative approach of writing my queries for each PARENT/CHILD table pair separately and applying UNION ALL to the result sets, so that each JOIN would be much more efficient; roughly like the sketch below.
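A sketch of that per-pair approach (@filter stands in for whatever WHERE conditions actually apply):

SELECT p.ID, p.OTHER_FIELD, c.OTHER_FIELD AS CHILD_FIELD
FROM PARENT p
JOIN CHILD c ON c.PARENT_ID = p.ID
WHERE p.OTHER_FIELD = @filter
UNION ALL
SELECT p.ID, p.OTHER_FIELD, c.OTHER_FIELD
FROM PARENT_2 p
JOIN CHILD_2 c ON c.PARENT_ID = p.ID
WHERE p.OTHER_FIELD = @filter
-- ... and likewise for the _3 and _4 pairs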
I also considered a conditional join, but have not yet explored this further.
Any suggestions how to proceed (ideally with some DBMS science to back it up)?

SQL Query with multiple possible joins (or condition in join)

I have a problem where I have to find people who have old accounts with an outstanding balance but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match them even if they were primary at first but are now secondary, etc.
Here was my first attempt; I'm just counting for now to get the joins and conditions right, and I'll select the actual data later. Basically the personal table is joined once to active accounts, and a second copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways the SSNs could be related.
select count(*)
from personal pa
join consumer c
    on c.cust_nbr = pa.cust_nbr
    and c.per_acct = pa.acct
join personal pu
    on pu.ssn = pa.ssn
    or pu.ssn = pa.addl_ssn
    or pu.addl_ssn = pa.ssn
    or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
    on u.cust_nbr = pu.cust_nbr
    and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no matter how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.
If you run separate joins and then union them, you might have problems: what if the same record pair fulfills at least two of the conditions? You will then have duplicates in your result.
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS may have to check up to A * B * C * D row combinations in the worst case. If you have many records in your database, this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, since you are joining four tables (personal, consumer, personal again, and uncol_acct), you might do something like this:
Write a query that contains two subqueries, named t1 and t2. The first subquery joins personal and consumer; the second joins the second occurrence of personal with uncol_acct, and carries the where clause inside it. Your main query then joins t1 to t2. This way you optimize, as the main join only has to consider the valid pairings of t1 and t2.
Also, if your where clause stays outside, as in your example query, then the four-table join is executed first and the where is applied only afterwards. That is why the where clause should go inside the second subquery, so it runs before the main join. If the condition is rarely fulfilled, you can even nest a further subquery inside the second one to evaluate it as early as possible.
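A sketch of that rewrite against the tables from the question (the selected columns are abbreviated; the point is the placement of the filter):

select count(*)
from (
    -- t1: personal joined with consumer (the active accounts)
    select pa.ssn, pa.addl_ssn
    from personal pa
    join consumer c
        on c.cust_nbr = pa.cust_nbr
        and c.per_acct = pa.acct
) t1
join (
    -- t2: personal joined with uncol_acct, with the balance
    -- filter applied before the main join
    select pu.ssn, pu.addl_ssn
    from personal pu
    join uncol_acct u
        on u.cust_nbr = pu.cust_nbr
        and u.per_acct = pu.acct
    where u.curr_bal > 0
) t2
    on t2.ssn = t1.ssn
    or t2.ssn = t1.addl_ssn
    or t2.addl_ssn = t1.ssn
    or t2.addl_ssn = t1.addl_ssn;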
Cheers!