SQL Query with multiple possible joins (or condition in join) - sql

I have a problem where I have to try to find people who have old accounts with an outstanding balance, but who have created a new account. I need to match them by comparing SSNs. The problem is that we have primary and additional contacts, so 2 potential SSNs per account. I need to match it even if they where primary at first, but now are secondary etc.
Here was my first attempt, I'm just counting now to get the joins and conditions down. I'll select actual data later. Basically the personal table is joined once to active accounts, and another copy to delinquent accounts. The two references to the personal table are then compared based on the 4 possible ways SSNs could be related.
select count(*)
from personal pa
join consumer c
on c.cust_nbr = pa.cust_nbr
and c.per_acct = pa.acct
join personal pu
on pu.ssn = pa.ssn
or pu.ssn = pa.addl_ssn
or pu.addl_ssn = pa.ssn
or pu.addl_ssn = pa.addl_ssn
join uncol_acct u
on u.cust_nbr = pu.cust_nbr
and u.per_acct = pu.acct
where u.curr_bal > 0
This works, but it takes 20 minutes to run. I found this question Is having an 'OR' in an INNER JOIN condition a bad idea? so I tried re-writing it as 4 queries (one per ssn combination) and unioning them. This took 30 minutes to run.
Is there a better way to do this, or is it just a really inefficient process no mater how you do it?
Update: After playing with some options here, and some other experimenting I think I found the problem. Our software vendor encrypts the SSNs in the database and provides a view that decrypts them. Since I have to work from that view it takes a really long time to decrypt and then compare.

If you run separate joins and then union then, then you might have problems. What if the same record pair fulfills at least two conditions? You will have duplicates in your result then.
I believe your first approach is feasible, but do not forget that you are joining four tables. If the number of rows is A, B, C, D in the respective tables, then the RDBMS will have to check a maximum of A * B * C * D records. If you have many records in your database, then this will take a lot of time.
Of course, you can optimize your query by adding indexes to some columns and that would be a good idea if they are not indexed already. But do not forget that if you add an index to a column, then the RDBMS will be quicker to read from there, but slower to write there. If your operations are mostly reads (select), then you should index your columns, but not blindly, study indexing a bit before you start doing it.
Also, if you are joining four tables, personal, consumer, personal (again) and uncol_acct, then you might do something like this:
Write a query, which contains two subqueries, each of them named as t1 and t2, respectively. The first subquery joins personal and consumer and will name the result as t1. The second query will join the second occurrence of personal with uncol_acct and the where clause will be inside your second join. As described before, your query will contain two subqueries, named t1 and t2, respectively. Your query will join t1 and t2. This way you opimise, as your main query will consider only the pairing of valid t1 and t2.
Also, if your where clause is outside as in your example query, then the 4-dimensional join will be executed and only after that will the where be taken into consideration. This is why the where clause should be inside the second sub-query, so the where clause will run before the main join. Also, you can create a subquery inside the second subquery to calculate the where if the condition is fulfilled rarely.
Cheers!

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables, subscriptions that has a field called type, which can only have 2 values;
FREE
PREMIUM.
The other two tables are called premium_users and free_users. I'd like to perfom a LEFT JOIN, starting from the subscriptions table but the thing is that depending on the value of the field type I will ONLY find the matching row in one or the other table, i.e. if type equals 'FREE', then the matching row will ONLY be in free_users table and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINING both tables and then using a COALESCE function the get the non null value, or with a UNION, with two different queries using a INNER JOIN on both queries, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know, is that I'm joining by user_id field, which is PK in both free_users and premium_users
So, my question is: which would be the most performant way to do a JOIN that depending on the value of type column will match to one table or another. Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.
What is the best in terms of performance? Well, you should try on your data and your systems.
My recommendation is two left joins:
select s.*,
coalesce(fu.name, pu.name) as name
from subscriptions s left join
free_users fu
on fu.free_id = s.subscription_id and
s.type = 'free' left join
premium_users pu
on pu.premium_id = s.suscription_id and
s.type = 'premium';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because these ids should be the primary keys in the table.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of connecting each and every column in join I should make a concatenated column in both projection(CA_CONCAT-> A+B+C+D) and make a join on this, Just to check on which method performance is better.
It was working faster earlier but in few CV's this method is slower sometimes, especially at time of filtering!
Can any one suggest which is an efficient method?
I don't think the JOIN conditions with concatenated fields will work better in performance.
Although we say in general there is not a need for index on column tables on HANA database, the column tables have a structure that works with an index on every column.
So if you concatenate 4 columns and produce a new calculated field, first you loose the option to use these index on 4 columns and the corresponding joining columns
I did not check the execution plan, but it will probably make a full scan on these columns
In fact I'm surprised you have mentioned that it worked faster, and experienced problems only on a few
Because concatenation or applying a function on a database column is even only by itself a workload over the SELECT process. It might include implicit type cast operation, which might bring additional workload more than expected
First I would suggest considering setting your table to column store and check the new performance.
After that I would suggest to separate the JOIN to multiple JOINs if you are using OR condition in your join.
Third, INNER JOIN will give you better performance compare to LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance, you better use them on PRIMARY KEYS and not on each column.
For me, both the time join with multiple fields is performing faster than join with concatenated fields. For filtering scenario, planviz shows when I join with multiple fields, filter gets pushed down to both the tables. On the other hand, when I join with concatenated field only one table gets filtered.
However, if you put filter on both the fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both the tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'

Poor performance with stacked joins

I'm not sure I can provide enough details for an answer, but my company is having a performance issue with an older mssql view. I've narrowed it down to the right outer joins, but I'm not familiar with the structure of joins following joins without a "ON" with each one, as in the code snippet below.
How do I write the joins below to either improve performance or to the simpler format of Join Tablename on Field1 = field2 format ?
FROM dbo.tblObject AS tblObject_2
JOIN dbo.tblProspectB2B PB ON PB.Object_ID = tblObject_2.Object_ID
RIGHT OUTER JOIN dbo.tblProspectB2B_CoordinatorStatus
RIGHT OUTER JOIN dbo.tblObject
INNER JOIN dbo.vwDomain_Hierarchy
INNER JOIN dbo.tblContactUser
INNER JOIN dbo.tblProcessingFile WITH ( NOLOCK )
LEFT OUTER JOIN dbo.enumRetentionRealization AS RR ON RR.RetentionRealizationID = dbo.tblProcessingFile.RetentionLeadTypeID
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1 WITH ( NOLOCK ) ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID ON dbo.tblContactUser.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.vwDomain_Hierarchy.Object_ID = tblObject_1.Domain_ID ON dbo.tblObject.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.tblProspectB2B_CoordinatorStatus.Object_ID = dbo.tblLoan.ReferralSourceContactID ON tblObject_2.Object_ID = dbo.tblLoan.ReferralSourceContactID
Your last INNER JOIN has a number of ON statements. Per this question and answer, such syntax is equivalent to a nested subquery.
That is one of the worst queries I have ever seen. Since I cannot figure out how it is supposed to work without the underlying data, this is what I suggest to you.
First find a good sample loan and write a query against this view to return where loan_id = ... Now you have a data set you chan check you changes against more easily than the, possibly, millions of records this returns. Make sure these results make sense (that right join to tbl_objects is bothering me as it makes no sense to return all the objects records)
Now start writing your query with what you think should be the first table (I would suggest that loan is the first table, if it not then the first table is Object left joined to loan)) and the where clause for the loan id.
Check your results, did you get the same loan information as teh view query with the where clause added?
Then add each join one at a time and see how it affects the query and whether the results appear to be going off track. Once you have figured out a query that gives the same results with all the tables added in, then you can try for several other loan ids to check. Once those have checked out, then run teh whole query with no where clause and check against the view results (if it is a large number you may need to just see if teh record counts match and visually check through (use order by on both things in order to make sure your results are in the same order). In the process try to use only left joins and not that combination of right and left joins (its ok to leave teh inner ones alone).
I make it a habit in complex queries to do all the inner joins first and then the left joins. I never use right joins in production code.
Now you are ready to performance tune.
I woudl guess the right join to objects is causing a problem in that it returns teh whole table and the nature of that table name and teh other joins to the same table leads me to believe that he probably wanted a left join. Without knowing the meaning of the data, it is hard to be sure. So first if you are returning too many records for one loan id, then consider if the real problem is that as tables have grown, returning too many records has become problematic.
Also consider that you can often take teh view and replace it with code to get the same results. Views calling views are a poor technique that often leads to performance issues. Often the views on top of the other views call teh same tables and thus you end up joining to them multiple times when you don;t need to.
Check your Explain plan or Execution plan depending on what database backend you have. Analysis of this should show where you might have missing indexes.
Also make sure that every table in the query is needed. This especially true when you join to a view. The view may join to 12 other tables but you only need the data from one of them and it can join to one of your tables. MAke sure that you are not using select * but only returning teh fields the view actually needs. You have inner joins so, by definition, select * is returning fields you don't need.
If your select part of teh view has a distinct in it, then consider if you can weed down the multiple records you get that made distinct needed by changing to a derived table or adding a where clause. To see what is causing the multiples, you may need to temporarily use select * to see all the columns and find out which one is not uniques and is causing the issue.
This whole process is not going to be easy or fun. Just take it slowly, work carefully and methodically and you will get there and have a query that is understandable and maintainable in the end.

Optimize Joins (more than 2 tables, where filter)

Couple of Sybase database query questions:
If I do an join and have a where clause, would the filter be applied prior to the actual join? In other words, is it faster than join without any where conditions?
I have an example involving 3 tables (with columns listed below):
A: O1,....
B: E1,E2,...
C: O1, E2, E2
So my join looks like:
select A.*, B* from B,C,A
where
C.E1=B.E1 and C.E2=B.E2 and C.O1=A.O1
and A.O2 in (...)
and B.E3 in (...)
Would my joins be any significantly faster if I eliminated C and added O1 to table B instead?
B:E1,E2,O1....
First, you should use proper join syntax:
select A.*, B.*
from B join C
on C.E1=B.E1 and C.E2=B.E2 join
A
on C.O1=A.O1
where B.E3 in (...)
The "," means cross join and it is prone to problems since it is easily missed. Your query becomes much different if you say "from B C, A". Also, it gives you access to outer joins, and makes the statement much more readable.
The answer to your question is "yes". Your query will run faster if there are fewer tables being joined. If you are doing a join on a primary key, and the tables fit into memory, then the join is not expensive. However, it is still faster to just find the data in the record in B.
As I say this, there could be some boundary cases in some databases where this is not necessarily true. For instance, if there is only one value in the column, and the column is a long string, then adding the column onto pages could increase the number of pages needed for B. Such extreme cases are unlikely, and you should see performance improvements.
Speed will generally depends on number or rows the SQL Server has to read.
I don't think it makes any difference using a where clause or a join
Depends .. on how many rows are in the eliminated table
It can depend on the order you add the joins or where clauses in, e.g. if there are only a few rows in C, and you add that first as a table or where, it immediately cuts down on the number of matches that are possible in. If, however there are millions of rows in C, then you have to read millions to find the matches.
Modern optimizers can rearrange your query to be more efficient but dont rely on it.
What can really cut down the number of rows read is adding indexes to the join columns - if you have an index on A.O1 AND on C.O1 then it can cut down massivley on the number of reads.

What is going on inside when I run left join?

I have a question regarding left join on SQL: I would like to know how SQL servers perform left join?
Let's say I have two tables.
PEOPLE
id
name
PHONE
id
person_id
phone
When I execute:
select name, phone
from people
left join phone on people.id = phone.person_id
...I would like to know how SQL servers process the query string.
My guess is:
select all rows of people
start matching phone rows with on condition. In this case, people.id = phone_person_id.
display 'phone' value as null if not found since it is left join.
Am I correct??
In addition, what books should I read to get this kind of information?
What happens (at least in postgresql, but probably other are similar) is that the query planner will take your query and generate multiple strategies for execution, then it will rank those strategy depending on the knowledge it has of the tables (through statistics), and then choose the best execution strategy and execute it.
Look here for several very common strategies : http://en.wikipedia.org/wiki/Join_%28SQL%29
Basically:
Nested Loop: For each people, do an independent lookup in phone (like a lot of small inner SELECTs)
Merge Join: Scan both table in // if they have index allowing that
Hash Join: First take the entire phone table content and load it in a hash table, then for each people inspect that table and get the corresponding phone from that temporary hashtable.
For all joins, first the query processor looks at the two sets that are being joined, and based on the data in those tables (as represented by cached database statictics), it combines the rows in both tables, using one of the available merge techniques ( Merge Join, Hash Join, or nested Loops). When done the combined set has a row in it for every combination of a row from one side with a row fropm the other side, which satisfies the join condition.
Then, if it's an outer join, all the rows from the 'Inner' side of the join which did not satisfy the join condition (there was no matching record on the 'Outer' Side) are added back into the combined set, with null values for all columns which would have come from the outer side..
Execute your query in SQL Server Management Studio and under the Query menu, choose the option to get the Actual Execution Plan.
After your execute your query, the execution plan will come back as a diagram on a separate tab. It reads from left to right. This plan will show you what happens under the hood.
Read more here
http://msdn.microsoft.com/en-us/library/aa178423%28SQL.80%29.aspx
http://www.simple-talk.com/sql/performance/execution-plan-basics/