I am using Apache Spark 1.6.1, and am trying to perform a simple left outer join between two tables. One of these tables are loaded from an HDFS folder directly, and the other by directly querying a Hive table.
This is what I have done:
data = sqlContext.read.format("parquet").option("basePath","/path_to_folder").load("path_to_folder/partition_date=20170501
print data.count()
data2 = sqlContext.sql("select * from hive_table")
print data2.count()
join5 = data.join(data2,data["session_id"] == data2["key"],"left_outer")
print join5.count()
The first two counts return the correct count number (around 2 million)
But the count after the join, returns around 2 billion.
In a left join, I should never have more records than the table on the left side. I assume this join is behaving like a cross join.
How is this possible?
Thank you.
Related
I have a SQL query Like below -
A left JOIN B Left Join C Left JOIN D
Say table A is a big table whereas tables B, C, D are small.
Will Spark join will execute like-
A with B and subsequent results will be joined with C then D
or,
Spark will automatically optimize i.e it will join B, C and D and then
results will be joined with A.
My question is what is order of execution or join evaluation? Does it go left to right or right to left?
Spark can optimize join order, if it has access to information about cardinialities of those joins.
For example, if those are parquet tables or cached dataframes, then it has estimates on total counts of the tables, and can reorder join order to make it less expensive. If a "table" is a jdbc dataframe, Spark may not have information on row counts.
Spark Query Optimizer can also choose a different join type in case if it has statistics (e.g. it can broadcast all smaller tables, and run broadcast hash join instead of sort merge join).
If statistics aren't available, then it'll will just follow the order as in the SQL query, e.g. from left to right.
Update:
I originally missed that all the joins in your query are OUTER joins (left is equivalent to left outer).
Normally outer joins can't be reordered, because this would change result of the query. I said "normally" because sometimes Spark Optimizer can convert an outer join to an inner join (e.g. if you have a WHERE clause that filters out NULLs - see conversion logic here).
For completeness of the answer, reordering of joins is driven by two different codepaths, depending is Spark CBO is enabled or not (spark.sql.cbo.enabled first appeared in Spark 2.2 and is off by default). If spark.sql.cbo.enabled=true and spark.sql.cbo.joinReorder.enabled=true (also off by default), and statistics are available/collected manually through ANALYZE TABLE .. COMPUTE STATISTICS then reordering is based on estimated cardinality of the join I mentioned above.
Proof that reordering only works for INNER JOINS is here (on example of CBO).
Update 2: Sample queries that show that reordering of outer joins produce different results, so outer joins are never reordered :
The order of interpretation of joins does not matter for inner joins. However, it can matter for outer joins.
Your logic is equivalent to:
FROM ((A LEFT JOIN
B
) ON . . . LEFT JOIN
C
ON . . . LEFT JOIN
)
D
ON . . .
The simplest way to think about chains of LEFT JOIN is that they keep all rows in the first table and columns from matching rows in the subsequent tables.
Note that this is the interpretation of the code. The SQL optimizer is free to rearrange the JOINs in any order to arrive at the same result set (although with outer joins this is generally less likely than with inner joins).
I want to join three tables in Spark using inner joins only. I can do it in two ways:
Way 1:
Step1: dataframeA = TableA inner join TableB on [condition] inner join TableC on [condition]
Step2: dataframeA.saveAsTable
Way 2:
Step1: dataframeA = TableA inner join TableB on [condition]
Step2: TableC -> convert to Dataframe -> dataframeB
Step3: dataframeA join dataframeA on [condition].saveAsTable
Which way is faster and will it make any difference if I join tables based on their sizes (like first join bigger tables than join smaller one)?
I am pretty sure spark is going to evaluate the statements the same.
To confirm you can compare the df.explain() between the 2 methods.
For this example this comes down to coding style and readability.
Your optimization here comes in how you are storing the upstream files to reduce reads ( File Type, Partition Strategy, File Size, Incremental Loading)
I joined a new company that uses SAS Enterprise Guide.
I have 2 tables, table A has 100 row, and table B has over 30M rows (50-60 columns).
I tried to do a right join from A (100) to B (30M), it took over 2 hours and no result come back. I want to ask, will it help if I do a left join? I used the GUI and created the following query.
30M Record <- 100 Record ?
or
100 Record -> 30M Record ?
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.EMPGRPCOM,
t2.SEQINVNUM,
t2.SBSID,
t2.SBSLASTNAME,
t2.SBSFIRSTNAME,
t2.PMTDUEDATE,
t2.PREMAMT,
t2.ITEMDESC,
t2.EFFDATE,
t2.PAYAMT,
t2.MCAIDRATECD,
t2.REBILLIND,
t2.BILLTYPE
FROM WORK.'CASE NUMBER'n t1
LEFT JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD' AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt AND t2.SITEID = '0001';
QUIT;
Left join and Right join, all other things aside, are equivalent - if you implement them the same way, anyway. I.E.,
select a.*
from a
left join
b
on a.id=b.id
;
vs
select a.*
from b
right join
a
on b.id=a.id
;
Same exact query, no difference, same time used. SQL is an interpreted language, meaning the SQL interpreter looks at what you send it and figures out what the best way to do it is - so it sees both queries and knows in both cases to do the same thing.
You can read about this in all sorts of articles, this one is a good starting point, or if that link ages just search for "right join vs left join".
Now, what you might want to consider is writing this in a different way, namely not using SQL; this kind of query SQL should be good at but sometimes isn't for some reason. I would write it as a hash table search, where the smaller case_number dataset is loaded to memory, then data step iterate over the larger table and check if it's found in the smaller dataset - if so, then great, return it.
I'd also think about whether left/right join is what you want, vs. inner join. Seems to me that if you're returning solely t2 values, right/left join isn't correct (when t1 is the "primary"): you'll just get empty rows for the non-matches. Either return a t1 variable, or use inner join.
I have a question which is similar but not quite the same to others I've seen in on stackoverflow
Let's say I have the following query
select *
from BobsFriends bf
left outer join JennysFriends jf on bf.friendID = jf.friendID
does bf.friendID = jf.friendID and jf.friendID = bf.friendID return different results? I would think that these would return the same result but
select *
from JennysFriends bf
left outer join BobsFriends jf on bf.friendID = jf.friendID
would return different ones.
The order of the columns when comparing for the JOIN doesn't matter, but when doing an OUTER JOIN the order of the tables definitely matters.
For a LEFT OUTER JOIN you are going to always get the results from the first table, with data from the second table, if available.
For a RIGHT OUTER JOIN you are always going to get the results from the second table, with data from the first table, if available.
Left outer join produces a complete set of records from Table A (the "left side" table), with the matching records (where available) in Table B (the "right side" one). If there is no match, the right side will contain null.
This question already has answers here:
SQL left join vs multiple tables on FROM line?
(12 answers)
Closed 8 years ago.
I'm curious as to why we need to use LEFT JOIN since we can use commas to select multiple tables.
What are the differences between LEFT JOIN and using commas to select multiple tables.
Which one is faster?
Here is my code:
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT no as nonvs,
owner,
owner_no,
vocab_no,
correct
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
...and:
SELECT *
FROM vocab_stats vs,
mst_words mw
WHERE mw.no = vs.vocab_no
AND vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111
First of all, to be completely equivalent, the first query should have been written
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT *
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
So that mw.* and nvs.* together produce the same set as the 2nd query's singular *. The query as you have written can use an INNER JOIN, since it includes a filter on nvs.correct.
The general form
TABLEA LEFT JOIN TABLEB ON <CONDITION>
attempts to find TableB records based on the condition. If the fails, the results from TABLEA are kept, with all the columns from TableB set to NULL. In contrast
TABLEA INNER JOIN TABLEB ON <CONDITION>
also attempts to find TableB records based on the condition. However, when fails, the particular record from TableA is removed from the output result set.
The ANSI standard for CROSS JOIN produces a Cartesian product between the two tables.
TABLEA CROSS JOIN TABLEB
-- # or in older syntax, simply using commas
TABLEA, TABLEB
The intention of the syntax is that EACH row in TABLEA is joined to EACH row in TABLEB. So 4 rows in A and 3 rows in B produces 12 rows of output. When paired with conditions in the WHERE clause, it sometimes produces the same behaviour of the INNER JOIN, since they express the same thing (condition between A and B => keep or not). However, it is a lot clearer when reading as to the intention when you use INNER JOIN instead of commas.
Performance-wise, most DBMS will process a LEFT join faster than an INNER JOIN. The comma notation can cause database systems to misinterpret the intention and produce a bad query plan - so another plus for SQL92 notation.
Why do we need LEFT JOIN? If the explanation of LEFT JOIN above is still not enough (keep records in A without matches in B), then consider that to achieve the same, you would need a complex UNION between two sets using the old comma-notation to achieve the same effect. But as previously stated, this doesn't apply to your example, which is really an INNER JOIN hiding behind a LEFT JOIN.
Notes:
The RIGHT JOIN is the same as LEFT, except that it starts with TABLEB (right side) instead of A.
RIGHT and LEFT JOINS are both OUTER joins. The word OUTER is optional, i.e. it can be written as LEFT OUTER JOIN.
The third type of OUTER join is FULL OUTER join, but that is not discussed here.
Separating the JOIN from the WHERE makes it easy to read, as the join logic cannot be confused with the WHERE conditions. It will also generally be faster as the server will not need to conduct two separate queries and combine the results.
The two examples you've given are not really equivalent, as you have included a sub-query in the first example. This is a better example:
SELECT vs.*, mw.*
FROM vocab_stats vs, mst_words mw
LEFT JOIN vocab_stats vs ON mw.no = vs.vocab_no
WHERE vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111