Spark left join doesn't work unless persist is called - apache-spark-sql

When I use a Spark left join on two DataFrames, the join doesn't work unless I call persist on both DataFrames. Without persist, the second table doesn't even appear in the SQL plan, and the column values that should come from the second table are simply copies of the values from the first table.

Related

Why am I not getting data after applying a join query to a table that doesn't have data?

I have applied a join query on two tables.
One table has data but the other one does not.
When I try to join the table that does not have data, the query does not return anything.
I expected data to come at least from the table which has data.
$allTours = DB::table("virtual_tours")
->join('destinations','virtual_tours.destination_id','=','destinations.id')
->join('virtual_tour_comments','virtual_tours.id','=','virtual_tour_comments.virtual_tour_id')
->get();
The virtual_tour_comments table doesn't have data.
You're using a regular join, which is an inner join, meaning rows without a match on your conditions get filtered out. Inner joins are meant to be used to find matches.
Join explanations
What you want is a left join, which always keeps every row from the left table; where no matching row exists on the right, the right-hand columns simply become NULL, so you still get all results.
Since you're using laravel, here's the reference for it:
Laravel Joins
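To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Laravel. The table names follow the question, but the columns (name, title, body) and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical miniature versions of the tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE destinations (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE virtual_tours (id INTEGER PRIMARY KEY, destination_id INTEGER, title TEXT);
    CREATE TABLE virtual_tour_comments (id INTEGER PRIMARY KEY, virtual_tour_id INTEGER, body TEXT);
    INSERT INTO destinations VALUES (1, 'Paris'), (2, 'Rome');
    INSERT INTO virtual_tours VALUES (10, 1, 'Louvre'), (11, 2, 'Colosseum');
    -- virtual_tour_comments is intentionally left empty
""")

# Inner join against the empty table: every tour is filtered out.
inner_rows = conn.execute("""
    SELECT t.title, c.body
    FROM virtual_tours t
    JOIN destinations d ON t.destination_id = d.id
    JOIN virtual_tour_comments c ON t.id = c.virtual_tour_id
""").fetchall()

# Left join against the empty table: tours are kept, comment columns are NULL.
left_rows = conn.execute("""
    SELECT t.title, c.body
    FROM virtual_tours t
    JOIN destinations d ON t.destination_id = d.id
    LEFT JOIN virtual_tour_comments c ON t.id = c.virtual_tour_id
    ORDER BY t.id
""").fetchall()

print(inner_rows)
print(left_rows)
```

In the Laravel query builder, the equivalent change would be to replace the second join(...) call with leftJoin(...), keeping the same arguments.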

SAS SQL join criteria

I am working on a project where I have inherited an SQL Join that uses join
criteria in a format I have not seen before. The basic format of the join
is this:
Proc Sql;
create table mytest as
select t1.var1,
t1.var2,
t1.var3
from mysource1 t1
left join mysource2 t2 on
(t1.var1 = t2.var1), myparam t3;
quit;
The bit I am confused about is why myparam is included as a join
condition within the ON statement of the LEFT JOIN. The contents of
'myparam' is derived from the SAS Parameter File we have defined on our
system and contains just one row, with two columns. One contains month
start date, the other month end date.
None of the columns in this parameter file are in the other two source
tables and none of the columns in the parameter file appear in the final
output (they aren't referenced in the SELECT statement, so they can't).
I'm guessing that including the 'myparam' dataset in this context is
somehow using the date values within it to cut the data in mysource1 and
mysource2, but could someone please provide confirmation that this is the
case and the exact mechanism at work please?
Thanks
This is an unusual construction for a join in SAS, but it's basically a Cartesian product. The myparam table isn't part of the LEFT JOIN condition; the comma introduces a new table and starts a new join. Any table included using a comma and no join condition is joined so that every row from one side is paired with every row from the other. This can be dangerous when two large tables are used (as the number of rows is multiplied), but in your case the myparam table has one row, so it's only 1 x n.
However, the query you have come across doesn't select any values from myparam (or from mysource2, for that matter), so I don't see why these tables are being joined at all. Provided myparam contains exactly one row and var1 has at most one match in mysource2, I'm fairly certain the following query would return the same rows:
proc sql;
select var1,var2,var3
from mysource1;
quit;
I'm aware this answer might come across as incomplete, so please feel free to comment...
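The row multiplication described above can be checked with a small sketch. It uses Python's sqlite3 module instead of SAS PROC SQL, on the assumption that the comma behaves the same way there (as a cross join). The column names month_start and month_end, the sample rows, and the count_rows helper are all made up for illustration.

```python
import sqlite3

def count_rows(param_rows):
    """Run the comma-join query with `param_rows` rows in myparam; return the row count."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE mysource1 (var1 INTEGER, var2 TEXT, var3 TEXT);
        CREATE TABLE mysource2 (var1 INTEGER);
        CREATE TABLE myparam (month_start TEXT, month_end TEXT);
        INSERT INTO mysource1 VALUES (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z');
        INSERT INTO mysource2 VALUES (1), (2);
    """)
    # Fill the parameter table with the requested number of (identical) rows.
    conn.executemany(
        "INSERT INTO myparam VALUES (?, ?)",
        [("2024-01-01", "2024-01-31")] * param_rows,
    )
    rows = conn.execute("""
        SELECT t1.var1, t1.var2, t1.var3
        FROM mysource1 t1
        LEFT JOIN mysource2 t2 ON (t1.var1 = t2.var1), myparam t3
    """).fetchall()
    conn.close()
    return len(rows)

one = count_rows(1)   # one parameter row: same row count as mysource1
two = count_rows(2)   # two parameter rows: every row is duplicated
zero = count_rows(0)  # empty parameter table: the result is empty
print(one, two, zero)
```

The last case is worth noting: if the parameter table were ever empty, the comma join would return no rows at all, which the simplified query would not reproduce.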

Why am I losing table data from adding an additional join statement in SQL Server?

I have table X with, let's say, 50 rows of data in it, and it is being joined to another table with only 2 rows in it. When I do a "normal" JOIN, only the overlapping data shows, i.e. the two rows that are found in the larger collection of 50. In my case I want all 50 rows to persist, so I use a LEFT JOIN and get all 50 rows back as planned.
Now I want to start joining in other tables to get additional data about these rows and display it together; wherever there is no data, I am fine with getting NULL. The new JOIN I'm adding is only going to find data for those two rows and not for the other 48, which is perfectly fine, and I would like NULL to be displayed where no match is found. My problem is that instead of displaying NULL, it removes the 48 rows with partially null data entirely and shows only the two that match. Why?
I can provide code if needed, I thought this may be easier to understand.
If you inner join the third table, there are only 2 rows that can match (I assume you don't match on nulls). After the first left join, only 2 rows carry non-null values from the second table, so only those 2 can produce a result if you do an inner join with the third table.
You will need another left join in this case. This will return all the rows from the second table, including those that are all null (as the result of the first left join).
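A minimal sketch of this effect, using Python's sqlite3 module instead of SQL Server. The tables x, y and z and their columns are hypothetical stand-ins for the 50-row table, the 2-row table and the newly added table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (id INTEGER PRIMARY KEY);       -- the 50-row table
    CREATE TABLE y (x_id INTEGER, y_key INTEGER);  -- only 2 rows
    CREATE TABLE z (y_key INTEGER, detail TEXT);   -- the newly added table
    INSERT INTO y VALUES (1, 100), (2, 200);
    INSERT INTO z VALUES (100, 'first'), (200, 'second');
""")
conn.executemany("INSERT INTO x VALUES (?)", [(i,) for i in range(1, 51)])

# Inner join on the third table: y.y_key is NULL for 48 rows, so they cannot match.
inner_third = conn.execute("""
    SELECT x.id, z.detail
    FROM x
    LEFT JOIN y ON y.x_id = x.id
    JOIN z ON z.y_key = y.y_key
""").fetchall()

# Left join on the third table: all 50 rows survive, with NULL where nothing matches.
left_third = conn.execute("""
    SELECT x.id, z.detail
    FROM x
    LEFT JOIN y ON y.x_id = x.id
    LEFT JOIN z ON z.y_key = y.y_key
""").fetchall()

print(len(inner_third), len(left_third))
```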

avoid null tables in left join

If I join two tables together with left join and one of the tables is completely empty, I get a bunch of empty columns in the joined table.
Here is a fiddle demonstrating what I mean.
I would like the resulting joined table not to contain all those null columns.
The number of columns that a query returns is fixed by the query itself. It cannot change depending on whether a table is empty or not. So the answer is no.
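A small sketch of this point, using Python's sqlite3 module with two made-up tables a and b: the same query reports the same columns whether b is empty or not, and only the values change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (a_id INTEGER, note TEXT);
    INSERT INTO a VALUES (1, 'one'), (2, 'two');
""")

query = """
    SELECT a.id AS id, a.name AS name, b.note AS note
    FROM a LEFT JOIN b ON b.a_id = a.id
    ORDER BY a.id
"""

# First run: b is completely empty.
cur = conn.execute(query)
columns_when_empty = [d[0] for d in cur.description]
rows_when_empty = cur.fetchall()

# Second run: b has one matching row.
conn.execute("INSERT INTO b VALUES (1, 'hello')")
cur = conn.execute(query)
columns_when_filled = [d[0] for d in cur.description]
rows_when_filled = cur.fetchall()

print(columns_when_empty, rows_when_empty)
print(columns_when_filled, rows_when_filled)
```

If the NULL columns are unwanted, the only way to drop them is to leave them out of the SELECT list; the database will not do it based on the data.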

When is a good situation to use a full outer join?

I'm always discouraged from using one, but is there a circumstance when it's the best approach?
It's rare, but I have a few cases where it's used. Typically in exception reports or ETL or other very peculiar situations where both sides have data you are trying to combine.
The alternative is to use an INNER JOIN, a LEFT JOIN (with right side IS NULL) and a RIGHT JOIN (with left side IS NULL) and do a UNION - sometimes this approach is better because you can customize each individual join more obviously (and add a derived column to indicate which side is found or whether it's found in both and which one is going to win).
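The UNION approach can be sketched as follows with Python's sqlite3 module. The employee and department tables and the found_in column are hypothetical. One swap to be aware of: instead of a RIGHT JOIN, the third branch uses a LEFT JOIN with the tables reversed, because older SQLite versions do not support RIGHT JOIN; the result is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, dept_id INTEGER);
    CREATE TABLE department (id INTEGER, dept_name TEXT);
    INSERT INTO employee VALUES ('Ann', 1), ('Bob', NULL);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Legal');
""")

rows = conn.execute("""
    -- rows present on both sides
    SELECT e.name, d.dept_name, 'both' AS found_in
    FROM employee e JOIN department d ON e.dept_id = d.id
    UNION ALL
    -- rows only on the left side (right side IS NULL)
    SELECT e.name, NULL, 'left only'
    FROM employee e LEFT JOIN department d ON e.dept_id = d.id
    WHERE d.id IS NULL
    UNION ALL
    -- rows only on the right side (tables reversed in place of a RIGHT JOIN)
    SELECT NULL, d.dept_name, 'right only'
    FROM department d LEFT JOIN employee e ON e.dept_id = d.id
    WHERE e.dept_id IS NULL
""").fetchall()

for row in rows:
    print(row)
```

Because each branch is a separate SELECT, it is easy to add branch-specific columns such as found_in, which is the customization the answer refers to.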
I noticed that the Wikipedia page provides an example:
"For example, this allows us to see each employee who is in a department and each department that has an employee, but also see each employee who is not part of a department and each department which doesn't have an employee."
Note that I never encountered the need of a full outer join in practice...
I've used full outer joins when attempting to find mismatched, orphaned data, from both of my tables and wanted all of my result set, not just matches.
Just today I had to use Full Outer Join. It is handy in situations where you're comparing two tables. For example, the two tables I was comparing were from different systems so I wanted to get following information:
Table A has any rows that are not in Table B
Table B has any rows that are not in Table A
Duplicates in either Table A or Table B
For matching rows, whether values are different (Example: Table A and Table B both have Acct# 12345, LoanID abc123, but Interest Rate or Loan Amount is different)
In addition, I created an additional field in SELECT statement that uses a CASE statement to 'comment' why I am flagging this row. Example: Interest Rate does not match / The Acct doesn't exist in System A, etc.
Then saved it as a view. Now, I can use this view to either create a report and send it to users for data correction/entry or use it to pull specific population by 'comment' field I created using a CASE statement (example: all records with non-matching interest rates) in my stored procedure and automate correction, etc.
If you want to see an example, let me know.
The rare times I have used it have been for testing for NULLs on both sides of the join, in case I think data is missing from the initial INNER JOIN used in the SQL I'm testing.
They're handy for finding orphaned data, but I rarely use them in production code. I wouldn't be "always discouraged from using one", but I think in the real world they are less frequently the best solution compared to inners and left/right outers.
On the rare occasions that I have used a Full Outer Join, it was for data analysis and comparison purposes, such as comparing two customer tables from different databases to find duplicates in each table, comparing the structures of the two tables, finding null values in one table compared to the other, or finding information missing from one table compared to the other.
For example, suppose you have two tables: one containing customer data and another containing order data. A full outer join would allow you to see all customers and all orders, even if some customers have no orders or some orders have no corresponding customer. This can help you identify any gaps in the data and ensure that all relevant information is included in the result set.
It's important to note that a full outer join can produce a huge result set since it includes all rows from both tables. This can be inefficient in terms of performance, so it's best to use a full outer join only when it is necessary to include all rows from both tables.
SELECT *
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
This will return all rows from both table1 and table2, filling in NULL values for missing matches on either side.