SAS Enterprise: left join and right join difference? - sql

I joined a new company that uses SAS Enterprise Guide.
I have 2 tables, table A has 100 row, and table B has over 30M rows (50-60 columns).
I tried to do a right join from A (100) to B (30M), it took over 2 hours and no result come back. I want to ask, will it help if I do a left join? I used the GUI and created the following query.
30M Record <- 100 Record ?
or
100 Record -> 30M Record ?
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.EMPGRPCOM,
t2.SEQINVNUM,
t2.SBSID,
t2.SBSLASTNAME,
t2.SBSFIRSTNAME,
t2.PMTDUEDATE,
t2.PREMAMT,
t2.ITEMDESC,
t2.EFFDATE,
t2.PAYAMT,
t2.MCAIDRATECD,
t2.REBILLIND,
t2.BILLTYPE
FROM WORK.'CASE NUMBER'n t1
LEFT JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD' AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt AND t2.SITEID = '0001';
QUIT;

Left join and Right join, all other things aside, are equivalent - if you implement them the same way, anyway. I.E.,
select a.*
from a
left join
b
on a.id=b.id
;
vs
select a.*
from b
right join
a
on b.id=a.id
;
Same exact query, no difference, same time used. SQL is an interpreted language, meaning the SQL interpreter looks at what you send it and figures out what the best way to do it is - so it sees both queries and knows in both cases to do the same thing.
You can read about this in all sorts of articles, this one is a good starting point, or if that link ages just search for "right join vs left join".
Now, what you might want to consider is writing this in a different way, namely not using SQL; this kind of query SQL should be good at but sometimes isn't for some reason. I would write it as a hash table search, where the smaller case_number dataset is loaded to memory, then data step iterate over the larger table and check if it's found in the smaller dataset - if so, then great, return it.
I'd also think about whether left/right join is what you want, vs. inner join. Seems to me that if you're returning solely t2 values, right/left join isn't correct (when t1 is the "primary"): you'll just get empty rows for the non-matches. Either return a t1 variable, or use inner join.

Related

Snowflake inner join bad performance over left join

I have 2 equivalent queries in Snowflake - one with left join and the other with inner join:
SELECT *
FROM A
INNER JOIN B ON a.id=b.id;
SELECT *
FROM A
LEFT JOIN B ON a.id=b.id
WHERE b.id IS NOT NULL;
The inner join does not finish after an hour while the left join takes only few seconds. Why would it happen?
EDIT:
There are two possibilities.
One is that there is really a difference in performance between the queries. That would occur because Snowflake chooses very different execution plans for the two queries. Of course the queries are not the same, but I might expect the execution plans to be similar. You can check explain to investigate this.
The second is that the queries actually have quite similar performance, but you start seeing results from the first query quickly. Why? Because every row in the first table is going to be in the result set -- even if there is no match. By contrast, if there is no match at all between the two tables, the second query has to process all the data before it knows there is no match.

Optimizing OUTER JOIN queries using filters from WHERE clause.(Query Planner)

I am writing a distributed SQL query planner(Query Engine). Data will be fetched from RDBMS(PostgreSQL) nodes involving network I/O.
I want to optimize JOIN queries.
Logical Order of Execution is:
Do JOIN(make use of ON clause)
Apply WHERE clause on the joined result.
I was thinking about applying Filter(WHERE clause specific to a table) first itself, and then do join.
In what cases would that result in wrong results?
Example:
SELECT *
FROM tableA
LEFT JOIN tableB ON(tableA.col1 = tableB.col1)
LEFT JOIN tableC ON(tableB.col2 = tableC.col1)
WHERE tableA.colY < 100 AND tableB.colX > 50
Logical Execution:
joinResult = (tableA left join tableB ON() ) left join tableC ON()
Filter joinResult using given WHERE clause.
Proposed Execution:
filteredA = tableA WHERE tableA.colY < 100
filteredB = tableB WHERE tableB.colX > 50
Result = (filteredA left join filteredB ON(..))left join tableC ON(..)
Can I optimize any query like this? That is filtering the table first and then applying join above that.
Edit:
Some people are confusing and talking about this specific example. I am not talking about this specific example query, I am writing a query planner and I want to handle all type of queries
Please note that, each of the tables is sharded and stored in different machines, and the current execution model is to fetch each of the tables and then do join locally. So if I apply the WHERE filter before fetching, it would be better.
This is actually a complex topic.
We can filter the table in some cases. We can also reorder outer joins and then push the filter quals inside.
I was going through a research paper regarding this, but I haven't completed it yet(may not complete it also).
So for now, for those who are looking for answers, you could probably go through this research paper particularly section 2.2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf
For now I'm relying on PostgreSQL's planner and taking its output and reconstructing the query for my requirements.

Is the order of joining tables indifferent as long as we chose proper join types?

Can we achieve desired results of joining tables by executing joins in whatever order? Suppose we want to left join two tables A and B (order AB). We can get the same results with right join of B and A (BA).
What about 3 tables ABC. Can we get whatever results by only changing order and joins types? For example A left join B inner join C. Can we get it with BAC order? What about if we have 4 or more tables?
Update.
The question Does the join order matter in SQL? is about inner join type. Agreed that then the order of join doesn't matter. The answer provided in that question does not answer my question whether it is possible to get desired results of joining tables with whatever original join types (join types here) by choosing whatever order of tables we like, and achieve this goal only by manipulating with join types.
In an inner join, the ordering of the tables in the join doesn't matter - the same rows will make up the result set regardless of the order they are in the join statement.
In either a left or right outer join, the order DOES matter. In A left join B, your result set will contain one row for every record in table A, irrespective of whether there is a matching row in table B. If there are non matching rows, this is likely to be a different result set to B left join A.
In a full outer join, the order again doesn't matter - rows will be produced for each row in each joined table no matter what their order.
Regarding A left join B vs B right join A - these will produce the same results. In simple cases with 2 tables, swapping the tables and changing the direction of the outer join will result in the same result set.
This will also apply to 3 or more tables if all of the outer joins are in the same direction - A left join B left join C will give the same set of results as C right join B right join A.
If you start mixing left and right joins, then you will need to start being more careful. There will almost always be a way to make an equivalent query with re-ordered tables, but at that point sub-queries or bracketing off expressions might be the best way to clarify what you are doing.
As another commenter states, using whatever makes your purpose most clear is usually the best option. The ordering of the tables in your query should make little or no difference performance wise, as the query optimiser should work this out (although the only way to be sure of this would be to check the execution plans for each option with your own queries and data).

multiple sql joins not producing desired results

I'm new to sql and trying to tweak someone else's huge stored procedure to get a subset of the results. The code below is maybe 10% of the whole procedure. I added the lp.posting_date, last left join, and the where clause. Trying to get records where the posting date is between the start date and the end date. Am I doing this right? Apparently not because the results are unaffected by the change. UPDATE: I CHANGED THE LAST JOIN. The results are correct if there's only one area allocation term. If there is more than one area allocation term, the results are duplicated for each term.
SELECT Distinct
l.lease_id ,
l.property_id as property_id,
l.lease_number as LeaseNumber,
l.name as LeaseName,
lty.name as LeaseType,
lst.name as LeaseStatus,
l.possession_date as PossessionDate,
l.rent as RentCommencementDate,
l.store_open_date as StoreOpenDate,
msr.description as MeasureUnit,
l.comments as Comments ,
lat.start_date as atStartDate,
lat.end_date as atEndDate,
lat.rentable_area as Rentable,
lat.usable_area as Usable,
laat.start_date as aatStartDate,
laat.end_date as aatEndDate,
MK.Path as OrgPath,
CAST(laa.percentage as numeric(9,2)) as Percentage,
laa.rentable_area as aaRentable,
laa.usable_area as aaUsable,
laa.headcounts as Headcount,
laa.area_allocation_term_id,
lat.area_term_id,
laa.area_allocation_id,
lp.posting_date
INTO #LEASES FROM la_tbl_lease l
INNER JOIN #LEASEID on l.lease_id=#LEASEID.lease_id
INNER JOIN la_tbl_lease_term lt on lt.lease_id=l.lease_id and lt.IsDeleted=0
LEFT JOIN la_tlu_lease_type lty on lty.lease_type_id=l.lease_type_id and lty.IsDeleted=0
LEFT JOIN la_tlu_lease_status lst on lst.status_id= l.status_id
LEFT JOIN la_tbl_area_group lag on lag.lease_id=l.lease_id
LEFT JOIN fnd_tlu_unit_measure msr on msr.unit_measure_key=lag.unit_measure_key
LEFT JOIN la_tbl_area_term lat on lat.lease_id=l.lease_id and lat.isDeleted=0
LEFT JOIN la_tbl_area_allocat_term laat on laat.area_term_id=lat.area_term_id and laat.isDeleted=0
LEFT JOIN dbo.la_tbl_area_allocation laa on laa.area_allocation_term_id=laat.area_allocation_term_id and laa.isDeleted=0
LEFT JOIN vw_FND_TLU_Menu_Key MK on menu_type_id_key=2 and isActive=1 and id=laa.menu_id_key
INNER JOIN la_tbl_lease_projection lp on lp.lease_projection_id = #LEASEID.lease_projection_id
where lp.posting_date <= laat.end_date and lp.posting_date >= laat.start_date
As may have already been hinted at you should be careful when using the WHERE clause with an OUTER JOIN.
The idea of the OUTER JOIN is to optionally join that table and provide access to the columns.
The JOINS will generate your set and then the WHERE clause will run to restrict your set. If you are using a condition in the WHERE clause that says one of the columns in your outer joined table must exist / equal a value then by the nature of your query you are no longer doing a LEFT JOIN since you are only retrieving rows where that join occurs.
Shorten it and copy it out as a new query in ssms or whatever you are using for testing. Use an inner join unless you want to preserve the left side set even when there is no matching lp.lease_id. Try something like
if object_id('tempdb..#leases) is not null
drop table #leases;
select distinct
l.lease_id
,l.property_id as property_id
,lp.posting_date
into #leases
from la_tbl_lease as l
inner join la_tbl_lease_projection as lp on lp.lease_id = l.lease_id
where lp.posting_date <= laat.end_date and lp.posting_date >= laat.start_date
select * from #leases
drop table #leases
If this gets what you want then you can work from there and add the other left joins to the query (getting rid of the select * and 'drop table' if you copy it back into your proc). If it doesn't then look at your Boolean date logic or provide more detail for us. If you are new to sql and its procedural extensions, try using the object explorer to examine the properties of the columns you are querying, and try selecting the top 1000 * from the tables you are using to get a feel for what the data looks like when building queries. -Mike
You can try the BETWEEN operator as well
Where lp.posting_date BETWEEN laat.start_date AND laat.end_date
Reasoning: You can have issues wheres there is no matching values in a table. In that instance on a left join the table will populate with null. Using the 'BETWEEN' operator insures that all returns have a value that is between the range and no nulls can slip in.
As it turns out, the problem was easier to solve and it was in a different place in the stored procedure. All I had to do was add one line to one of the cursors to include area term allocations by date.

Successive LEFT SQL joins back to original

I need to redo sql statement in legacy Foxpro application and don't understand whether it is meaningful at all. Syntax is a bit specific - it extracts data from temporary table into the same temporary table ( overwriting) with some joins.
SELECT aa.*,b.spa_date FROM (ALIAS()) aa INNER JOIN jobs ON aa.seq=jobs.seq ;
LEFT JOIN job2 ON jobs.job_no=job2.rucjob;
left join jobs b on b.job_no=job2.job_no;
WHERE jobs.qty1<>0 INTO CURSOR (ALIAS())
Since only one field is added from joined tables ( spa_date ) is there any point in 2 left joins or I am missing something. Isn't it equivalent to
SELECT aa.*,jobs.spa_date FROM (ALIAS()) aa INNER JOIN jobs ON aa.seq=jobs.seq ;
WHERE jobs.qty1<>0 INTO CURSOR (ALIAS())
They are different because b.spa_date come from the second left join. You may be missing filtered rows without both left joins.
You would need to know the intent of the original query and perhaps rewrite it to make more sense but I'd say the two queries are different.