Tips or tricks for translating sql joins from literal language to SQL syntax? - sql

I often know exactly what I want, and know how tables are related in order to get it, but I have a real hard time translating that literal language knowledge to SQL syntax when it comes to joins. Do you have any tips or tricks you can share that have worked for you in the past?
This is a basic, but poor example:
"I have Categories, which have one-to-many Products, which have one-to-many Variants, which have one-to-many Sources. I need all Sources that belong to Category XYZ."
I imagine doing something where you cross out certain language terms and replace them with SQL syntax. Can you share how you formulate your queries based upon some concept similar to that? Thanks!

Use SQL Query Designer to easily buid Join queries from the visual table collection right there, then if you want to learn how it works, simply investigate it, that's how I learned it.
You won't notice how charming it is till you try it.
Visual Representation of SQL Joins - A walkthrough explaining SQL JOINs.
Complete ref of SQL-Server Join, Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, in SQL-Server 2005 (View snapshot bellow).
ToTraceString of Entity Frameork' ObjectQuery (that you add Include shapings to it) is also a good way to learn it.
SQL-Server Join types (with detailed examples for each join type):
INNER JOIN - Match rows between the two tables specified in the INNER JOIN statement based on one or more columns having matching data. Preferably the join is based on referential integrity enforcing the relationship between the tables to ensure data integrity.
Just to add a little commentary to the basic definitions above, in general the INNER JOIN option is considered to be the most common join needed in applications and/or queries. Although that is the case in some environments, it is really dependent on the database design, referential integrity and data needed for the application. As such, please take the time to understand the data being requested then select the proper join option.
Although most join logic is based on matching values between the two columns specified, it is possible to also include logic using greater than, less than, not equals, etc.
LEFT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the left table. On the right table, the matching data is returned in addition to NULL values where a record exists in the left table, but not in the right table.
Another item to keep in mind is that the LEFT and RIGHT OUTER JOIN logic is opposite of one another. So you can change either the order of the tables in the specific join statement or change the JOIN from left to right or vice versa and get the same results.
RIGHT OUTER JOIN - Based on the two tables specified in the join clause, all data is returned from the right table. On the left table, the matching data is returned in addition to NULL values where a record exists in the right table but not in the left table.
Self Join - In this circumstance, the same table is specified twice with two different aliases in order to match the data within the same table.
CROSS JOIN - Based on the two tables specified in the join clause, a Cartesian product is created if a WHERE clause does filter the rows. The size of the Cartesian product is based on multiplying the number of rows from the left table by the number of rows in the right table. Please heed caution when using a CROSS JOIN.
FULL JOIN - Based on the two tables specified in the join clause, all data is returned from both tables regardless of matching data.

I think most people approach its:
Look for substantives, as they can point to potential tables
Look for adjectives, cause they probably are fields
Relationships between substantives gives JOIN rules
Nothing better than to draw these structures in a paper sheet.

Write and debug a query which returns the fields from the table having the majority of—or the most important—data. Add constraints which depend only on that table, or which are independent of all tables.
Add a new where term which relates another table.
Repeat 2 until done.
I've yet to use the join operator in a query, even after 20+ years of writing SQL queries. One can almost always write them in the form
select field, field2, field3, <etc.>
from table
where field in (select whatever from table2 where whatever) and
field2 in (select whatever from table2 where whatever) and ...
or
select field, field2, field3, <etc.>
from table1, table2, ...
where table1.field = table2.somefield and
table1.field2 = table3.someotherfield and ...
Like someone else wrote, just be bold and practice. It will be like riding a bicycle after 4 or 5 times creating such a query.

One word: Practice.
Open up the query manager and start running queries until you get what you want. Look up similar examples and adapt them to your situation. You will always have to do some trial and error with the queries to get them right.

SQL is very different from imperative programming.
1) To design tables, consider Entities (the real THINGS in the world of concern), Relationships (between the Entities), and Attributes (the values associated with an Entity).
2) to write a Select statement consider a plastic extrusion press:
a) you put in raw records From tables, Where conditions exist in the records
b) you may have to join tables to get at the data you need
c) you craft extrusion nozzles to make the plastic into the shapes you want. These are the individual expressions of the select List.
d) you may want the n-ary sets of list data to come to you in a certain order, you can apply an Order By clause.
3) crafting the List expressions is the most like imperative programming after you discover the if(exp,true-exp,false-exp) function.

Look at the ERD.
Logical or physical version, it will show what tables are related to one another. This way, you can see what table(s) you need to join to in order to get from point/table a to point/table b, and what criteria.

Related

Performance over PostgreSQL conditional join - Query optimization

Let's assume I have three tables, subscriptions that has a field called type, which can only have 2 values;
FREE
PREMIUM.
The other two tables are called premium_users and free_users. I'd like to perfom a LEFT JOIN, starting from the subscriptions table but the thing is that depending on the value of the field type I will ONLY find the matching row in one or the other table, i.e. if type equals 'FREE', then the matching row will ONLY be in free_users table and vice versa.
I'm thinking of some ways to do this, such as LEFT JOINING both tables and then using a COALESCE function the get the non null value, or with a UNION, with two different queries using a INNER JOIN on both queries, but I'm not quite sure which would be the best way in terms of performance. Also, as you would guess, the free_users table is almost five times larger than the premium_users table. Another thing you should know, is that I'm joining by user_id field, which is PK in both free_users and premium_users
So, my question is: which would be the most performant way to do a JOIN that depending on the value of type column will match to one table or another. Would this solution be any different if instead of two tables there were three, or even more?
Disclaimer: This DB is a PostgreSQL and is already up and running in production and as much as I'd like to have a single users table it won't happen in the short term.
What is the best in terms of performance? Well, you should try on your data and your systems.
My recommendation is two left joins:
select s.*,
coalesce(fu.name, pu.name) as name
from subscriptions s left join
free_users fu
on fu.free_id = s.subscription_id and
s.type = 'free' left join
premium_users pu
on pu.premium_id = s.suscription_id and
s.type = 'premium';
You want indexes on free_users(free_id) and premium_users(premium_id). These are probably "free" because these ids should be the primary keys in the table.
If you use union all, then the optimizer may not use indexes for the joins. And not using indexes could have a dastardly impact on performance.

Performance of JOINS in SAP HANA Calculation View

For Example:
I have 4 columns (A,B,C,D).
I thought that instead of connecting each and every column in join I should make a concatenated column in both projection(CA_CONCAT-> A+B+C+D) and make a join on this, Just to check on which method performance is better.
It was working faster earlier but in few CV's this method is slower sometimes, especially at time of filtering!
Can any one suggest which is an efficient method?
I don't think the JOIN conditions with concatenated fields will work better in performance.
Although we say in general there is not a need for index on column tables on HANA database, the column tables have a structure that works with an index on every column.
So if you concatenate 4 columns and produce a new calculated field, first you loose the option to use these index on 4 columns and the corresponding joining columns
I did not check the execution plan, but it will probably make a full scan on these columns
In fact I'm surprised you have mentioned that it worked faster, and experienced problems only on a few
Because concatenation or applying a function on a database column is even only by itself a workload over the SELECT process. It might include implicit type cast operation, which might bring additional workload more than expected
First I would suggest considering setting your table to column store and check the new performance.
After that I would suggest to separate the JOIN to multiple JOINs if you are using OR condition in your join.
Third, INNER JOIN will give you better performance compare to LEFT JOIN or LEFT OUTER JOIN.
Another thing about JOINs and performance, you better use them on PRIMARY KEYS and not on each column.
For me, both the time join with multiple fields is performing faster than join with concatenated fields. For filtering scenario, planviz shows when I join with multiple fields, filter gets pushed down to both the tables. On the other hand, when I join with concatenated field only one table gets filtered.
However, if you put filter on both the fields (like PRODUCT from Tab1 and MATERIAL from Tab2), then you can push the filter down to both the tables.
Like:
Select * from CalculationView where PRODUCT = 'A' and MATERIAL = 'A'

Poor performance with stacked joins

I'm not sure I can provide enough details for an answer, but my company is having a performance issue with an older mssql view. I've narrowed it down to the right outer joins, but I'm not familiar with the structure of joins following joins without a "ON" with each one, as in the code snippet below.
How do I write the joins below to either improve performance or to the simpler format of Join Tablename on Field1 = field2 format ?
FROM dbo.tblObject AS tblObject_2
JOIN dbo.tblProspectB2B PB ON PB.Object_ID = tblObject_2.Object_ID
RIGHT OUTER JOIN dbo.tblProspectB2B_CoordinatorStatus
RIGHT OUTER JOIN dbo.tblObject
INNER JOIN dbo.vwDomain_Hierarchy
INNER JOIN dbo.tblContactUser
INNER JOIN dbo.tblProcessingFile WITH ( NOLOCK )
LEFT OUTER JOIN dbo.enumRetentionRealization AS RR ON RR.RetentionRealizationID = dbo.tblProcessingFile.RetentionLeadTypeID
INNER JOIN dbo.tblLoan
INNER JOIN dbo.tblObject AS tblObject_1 WITH ( NOLOCK ) ON dbo.tblLoan.Object_ID = tblObject_1.Object_ID ON dbo.tblProcessingFile.Loan_ID = dbo.tblLoan.Object_ID ON dbo.tblContactUser.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.vwDomain_Hierarchy.Object_ID = tblObject_1.Domain_ID ON dbo.tblObject.Object_ID = dbo.tblLoan.ContactOwnerID ON dbo.tblProspectB2B_CoordinatorStatus.Object_ID = dbo.tblLoan.ReferralSourceContactID ON tblObject_2.Object_ID = dbo.tblLoan.ReferralSourceContactID
Your last INNER JOIN has a number of ON statements. Per this question and answer, such syntax is equivalent to a nested subquery.
That is one of the worst queries I have ever seen. Since I cannot figure out how it is supposed to work without the underlying data, this is what I suggest to you.
First find a good sample loan and write a query against this view to return where loan_id = ... Now you have a data set you chan check you changes against more easily than the, possibly, millions of records this returns. Make sure these results make sense (that right join to tbl_objects is bothering me as it makes no sense to return all the objects records)
Now start writing your query with what you think should be the first table (I would suggest that loan is the first table, if it not then the first table is Object left joined to loan)) and the where clause for the loan id.
Check your results, did you get the same loan information as teh view query with the where clause added?
Then add each join one at a time and see how it affects the query and whether the results appear to be going off track. Once you have figured out a query that gives the same results with all the tables added in, then you can try for several other loan ids to check. Once those have checked out, then run teh whole query with no where clause and check against the view results (if it is a large number you may need to just see if teh record counts match and visually check through (use order by on both things in order to make sure your results are in the same order). In the process try to use only left joins and not that combination of right and left joins (its ok to leave teh inner ones alone).
I make it a habit in complex queries to do all the inner joins first and then the left joins. I never use right joins in production code.
Now you are ready to performance tune.
I woudl guess the right join to objects is causing a problem in that it returns teh whole table and the nature of that table name and teh other joins to the same table leads me to believe that he probably wanted a left join. Without knowing the meaning of the data, it is hard to be sure. So first if you are returning too many records for one loan id, then consider if the real problem is that as tables have grown, returning too many records has become problematic.
Also consider that you can often take teh view and replace it with code to get the same results. Views calling views are a poor technique that often leads to performance issues. Often the views on top of the other views call teh same tables and thus you end up joining to them multiple times when you don;t need to.
Check your Explain plan or Execution plan depending on what database backend you have. Analysis of this should show where you might have missing indexes.
Also make sure that every table in the query is needed. This especially true when you join to a view. The view may join to 12 other tables but you only need the data from one of them and it can join to one of your tables. MAke sure that you are not using select * but only returning teh fields the view actually needs. You have inner joins so, by definition, select * is returning fields you don't need.
If your select part of teh view has a distinct in it, then consider if you can weed down the multiple records you get that made distinct needed by changing to a derived table or adding a where clause. To see what is causing the multiples, you may need to temporarily use select * to see all the columns and find out which one is not uniques and is causing the issue.
This whole process is not going to be easy or fun. Just take it slowly, work carefully and methodically and you will get there and have a query that is understandable and maintainable in the end.

What is going on inside when I run left join?

I have a question regarding left join on SQL: I would like to know how SQL servers perform left join?
Let's say I have two tables.
PEOPLE
id
name
PHONE
id
person_id
phone
When I execute:
select name, phone
from people
left join phone on people.id = phone.person_id
...I would like to know how SQL servers process the query string.
My guess is:
select all rows of people
start matching phone rows with on condition. In this case, people.id = phone_person_id.
display 'phone' value as null if not found since it is left join.
Am I correct??
In addition, what books should I read to get this kind of information?
What happens (at least in postgresql, but probably other are similar) is that the query planner will take your query and generate multiple strategies for execution, then it will rank those strategy depending on the knowledge it has of the tables (through statistics), and then choose the best execution strategy and execute it.
Look here for several very common strategies : http://en.wikipedia.org/wiki/Join_%28SQL%29
Basically:
Nested Loop: For each people, do an independent lookup in phone (like a lot of small inner SELECTs)
Merge Join: Scan both table in // if they have index allowing that
Hash Join: First take the entire phone table content and load it in a hash table, then for each people inspect that table and get the corresponding phone from that temporary hashtable.
For all joins, first the query processor looks at the two sets that are being joined, and based on the data in those tables (as represented by cached database statictics), it combines the rows in both tables, using one of the available merge techniques ( Merge Join, Hash Join, or nested Loops). When done the combined set has a row in it for every combination of a row from one side with a row fropm the other side, which satisfies the join condition.
Then, if it's an outer join, all the rows from the 'Inner' side of the join which did not satisfy the join condition (there was no matching record on the 'Outer' Side) are added back into the combined set, with null values for all columns which would have come from the outer side..
Execute your query in SQL Server Management Studio and under the Query menu, choose the option to get the Actual Execution Plan.
After your execute your query, the execution plan will come back as a diagram on a separate tab. It reads from left to right. This plan will show you what happens under the hood.
Read more here
http://msdn.microsoft.com/en-us/library/aa178423%28SQL.80%29.aspx
http://www.simple-talk.com/sql/performance/execution-plan-basics/

union with join to common table

I've got a query that presently looks something like this (I'll simplify it to get to the point):
select shipment.f1, type1detail.f2
from shipment
join type1detail using (shipmentid)
where shipment.otherinfo='foo'
union
select shipment.f1, type2detail.f2
from shipment
join type2detail using (shipmentid)
where shipment.otherinfo='foo'
That is, we have two types of detail that can be on a shipment, and I need to find all the records for both types for all the shipments meeting a given condition. (Imagine that the two detail tables have some fields with matching data. In real life there are several more joins in there; as I say I'm trying to simplify the query to get to the question I'm trying to ask today.)
But the "where" clause requires a full-file scan, so it's pretty inefficient to do this full file scan twice. Logically, the database engine should be able to find the shipment records that meet the condition once, and then find all the records from type1detail and type2detail with shipmentid matching those values.
But how do I say this in SQL?
select shipment.f1, detail.f2
from shipment
join
(select shipmentid, f2
from type1detail
union
selection shipmdentid, f2
from type2detail
) d using (shipmentid)
where shipment.otherinfo='foo'
would required a full file scan of type1detail, union'ed with a full file scan of type2detail, and then join this to shipment, which would be worse than reading shipment twice. (As each of the detail tables are bigger than shipment.)
This seems like a straightforward thing to want to do, but I just can't think of way that I could express it.
I'm using Postgres if you know a solution particular to that engine, but I was hoping for a generic SQL solution.
I would use a temp table to dump shipmentid's into that match your "otherinfo" where clause. Then use that temp table in your joins (not replacing any existing join, but adding another one) so that the join criteria is only using shipmentids and no where clause is needed (unless you are also filtering on columns not in the shipment table, but the point is you still scanned that table only once).
If you run this query more than occasionally, you're probably better off indexing the otherinfo column to avoid the full scans altogether. I don't work much with Postgres, but it looks like there is at least one full text indexing option already.