Theta join in Hive

I have a theta join in SAS that needs to be translated into Hive.
SAS:
select a.id,b.name from employee a
left outer join company b
on ( a.id=b.id and a.joindate>=b.joindate and a.releasedate < b.releasedate)
Since this is not an inner join, I do not get the right results if I move the non-equi-join conditions into the where clause (all unmatched records from the left table go missing).
I tried the following in Hive:
select a.id,b.name from employee a
left outer join company b
on ( a.id=b.id)
where a.joindate>=b.joindate and a.releasedate < b.releasedate
Any suggestions?

Just as you may have realized, left join keeps all the items from the Preserved Row Table (employee), whereas your where filters out those items if a.joindate<b.joindate or a.releasedate >= b.releasedate.
Those on conditions are logically interpreted as:
For each item li from the left table, whenever an item ri from the right table is found to meet the on conditions, make a new item ni whose column values are the combination of li and ri. Then put ni in the result set. This may duplicate rows from the left table.
If li fails to find a match, make one copy of it, fill the right columns with nulls. Put this item in the result set too.
So we can emulate this behavior by:
Loosen the on conditions by keeping just equality conditions within the on clause;
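Filter out extra rows introduced by the loosened on conditions, i.e., rows that fail to meet conditions other than equality conditions.
The resultant query may look like this (note that it still drops a left row whose id matches some right row when none of those right rows meets the other conditions):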
select
a.id,
b.name
from employee a
left outer join company b
on (a.id=b.id)
where (b.id is null
or (a.joindate>=b.joindate and a.releasedate<b.releasedate))
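
To compare the three placements of the range conditions, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (SQLite accepts non-equality conditions in the on clause, so it can act as a reference). The sample rows and company names are made up for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER, joindate TEXT, releasedate TEXT);
CREATE TABLE company  (id INTEGER, name TEXT, joindate TEXT, releasedate TEXT);
INSERT INTO employee VALUES
  (1, '2020-02-01', '2020-06-01'),  -- id matches, dates inside the range
  (2, '2019-01-01', '2019-06-01'),  -- id matches, dates outside the range
  (3, '2020-02-01', '2020-06-01');  -- no company row with this id
INSERT INTO company VALUES
  (1, 'Acme',   '2020-01-01', '2021-01-01'),
  (2, 'Globex', '2020-01-01', '2021-01-01');
""")

BASE = "SELECT a.id, b.name FROM employee a LEFT OUTER JOIN company b ON "
RANGE = "a.joindate >= b.joindate AND a.releasedate < b.releasedate"

def run(tail):
    """Run the left join with the given ON/WHERE tail, ordered by employee id."""
    return conn.execute(BASE + tail + " ORDER BY a.id").fetchall()

theta_on = run("a.id = b.id AND " + RANGE)             # SAS-style theta join
plain_where = run("a.id = b.id WHERE " + RANGE)        # the attempt from the question
null_check = run("a.id = b.id WHERE b.id IS NULL OR (" + RANGE + ")")

print(theta_on)     # [(1, 'Acme'), (2, None), (3, None)]
print(plain_where)  # [(1, 'Acme')]
print(null_check)   # [(1, 'Acme'), (3, None)]
```

Employee 3 survives the null-check rewrite, but employee 2 does not, so the rewrite only matches the SAS query when every id match also satisfies the range conditions. If your Hive version accepts non-equality conditions in the on clause (the Hive join documentation describes this for 2.2.0 and later), the SAS form can be used unchanged.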

Related

Get a full join in Google BigQuery keeping all frequency combinations: BigQuery gives me only a left join for every kind of join

I am trying to join 2 tables on Id such that the frequencies of all combinations are present. But whether I use a left or a right join, I still get the output of an inner join or a left join.
[Tables a and b and the expected output were shown as images in the original post.]
I tried the following code:
select
act.action,
dvr.dateofdel,
dvr.output
FROM internal.actions as act
Right join internal.deliveries as dvr
ON dvr.id= act.id
I tried multiple joins but still got the same outcome.

Your query does not match your sample data.
But based on your problem statement, I suspect that you want:
select
act.ID_log,
act.ID_send_message,
act.action_date,
act.action,
act.ID_email,
dvr.delivery_date,
act.email
from internal.actions as act
left join internal.deliveries as dvr
on dvr.ID_send_message= act.ID_send_message
and dvr.delivery_date >= '2017-01-01'
and dvr.delivery_date < '2018-01-01'
where act.ID_send_message != 0
This will bring all records from act that satisfy the condition in the where clause, along with information coming from dvr; when there is no match in dvr, the corresponding columns will show null values. The important part in the query is that all conditions on the left joined table should be listed in the on clause of the join (rather than in the where clause).
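
The difference between filtering the left-joined table in the on clause and in the where clause can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than BigQuery, and the two tables are cut down to a few made-up rows and columns, so treat it as an illustration only.

```python
import sqlite3

# Hypothetical, simplified versions of the actions and deliveries tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actions    (ID_send_message INTEGER, action TEXT);
CREATE TABLE deliveries (ID_send_message INTEGER, delivery_date TEXT);
INSERT INTO actions VALUES (1, 'open'), (2, 'click'), (3, 'open');
INSERT INTO deliveries VALUES
  (1, '2017-03-15'),   -- inside the 2017 window
  (2, '2018-02-01');   -- outside the window; message 3 has no delivery
""")

# Date filter in the ON clause: every action row is preserved.
in_on = conn.execute("""
SELECT act.ID_send_message, dvr.delivery_date
FROM actions AS act
LEFT JOIN deliveries AS dvr
  ON dvr.ID_send_message = act.ID_send_message
 AND dvr.delivery_date >= '2017-01-01'
 AND dvr.delivery_date <  '2018-01-01'
ORDER BY act.ID_send_message
""").fetchall()

# Same filter in the WHERE clause: rows with NULL delivery_date are removed.
in_where = conn.execute("""
SELECT act.ID_send_message, dvr.delivery_date
FROM actions AS act
LEFT JOIN deliveries AS dvr
  ON dvr.ID_send_message = act.ID_send_message
WHERE dvr.delivery_date >= '2017-01-01'
  AND dvr.delivery_date <  '2018-01-01'
ORDER BY act.ID_send_message
""").fetchall()

print(in_on)     # [(1, '2017-03-15'), (2, None), (3, None)]
print(in_where)  # [(1, '2017-03-15')]
```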

Why are the two queries different (left join on ... and ... as opposed to using where clause)

I'm wondering why the following two queries produce different results (the first query has more rows than second).
SELECT * FROM A
JOIN ...
JOIN ...
JOIN C ON ...
LEFT JOIN B ON B.id = A.id AND B.otherId = C.otherId
As opposed to:
SELECT * FROM A
JOIN ...
JOIN ...
JOIN C ON ...
LEFT JOIN B ON B.id = A.id
WHERE B.otherId = C.otherId
Please help me understand. In the second query, the left join has only 1 condition, so shouldn't it include all the results from the first query and more (where the extra rows have unmatched otherId)? Then the WHERE clause should ensure that the otherId matches, like in the first query. Why are they different?

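Logically, the WHERE clause is applied after the JOIN has produced its rows. In the second query, any row of A without a matching B gets NULL in B.otherId, so B.otherId = C.otherId is not true and the WHERE removes that row; the same happens to rows where B matched on id but has a different otherId.
The query engine may still evaluate a filter before the join when doing so cannot change the result, since there is no point in joining rows that will be filtered out later.
The query engines are pretty good at optimizing the query you write.
Also you will see this effect only in OUTER JOINs. In inner joins both WHERE and JOIN conditions behave the same.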

The second query returns fewer rows because your where clause filters records out after the join, which essentially changes the query from a left outer join to an inner join. So you need to be careful about where you place your filters, although it does not matter for an inner join.
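
A small sketch can make the difference concrete. It uses Python's built-in sqlite3 module; only the A, C and B part of the queries is modelled (the elided joins are left out), and the column C.a_id and all sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical schema: C.a_id links C to A; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER);
CREATE TABLE C (a_id INTEGER, otherId INTEGER);
CREATE TABLE B (id INTEGER, otherId INTEGER);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO C VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO B VALUES
  (1, 10),   -- matches A on id and C on otherId
  (2, 99);   -- matches A on id only; A.id = 3 has no B row at all
""")

# Both conditions in ON: unmatched rows are kept with NULLs from B.
both_in_on = conn.execute("""
SELECT A.id, C.otherId, B.otherId
FROM A
JOIN C ON C.a_id = A.id
LEFT JOIN B ON B.id = A.id AND B.otherId = C.otherId
ORDER BY A.id
""").fetchall()

# Second condition in WHERE: NULL = value is not true, so those rows go.
one_in_where = conn.execute("""
SELECT A.id, C.otherId, B.otherId
FROM A
JOIN C ON C.a_id = A.id
LEFT JOIN B ON B.id = A.id
WHERE B.otherId = C.otherId
ORDER BY A.id
""").fetchall()

print(both_in_on)    # [(1, 10, 10), (2, 20, None), (3, 30, None)]
print(one_in_where)  # [(1, 10, 10)]
```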

You've received correct answers, but allow me to delve a little deeper into the difference between join criteria and filtering criteria. Take a simple query with a left join:
select a.Key, a.NonKey1, b.NonKey2
from a
left join b on b.Key = a.Key;
This lists all of the NonKey1 values from table a, along with the NonKey2 values from table b for matching keys, or NULL where there is no match. A common variant is to look at only those rows in a that do not have a match in b:
select a.Key, a.NonKey1, b.NonKey2
from a
left join b on b.Key = a.Key
where b.Key is null;
Careful! If you accidentally write where b.Key is not null you've just changed your outer join into a regular inner join. Do that sometime and see if QA can catch it. On second thought, don't. (Also, having b.NonKey2 in the selection list is meaningless as it can only ever be NULL, but let's leave it there for the moment.) The join is based on the key fields of both tables matching. After the joining is complete, all rows with a successful join are discarded and only the results without a match remain. That means b.Key in the join criteria cannot be NULL and in the filtering criteria must be NULL for a row to be added to the result set. Fine, that's what we wanted. But consider what would happen if we moved the check to become part of the join criteria.
select a.Key, a.NonKey1, b.NonKey2
from a
left join b on b.Key = a.Key and b.Key is null;
The result is everything from a with nothing at all from b. Probably not what we wanted. If you think about it, you will see we could just as well have written on 0 = 1 and gotten the same result. What we've done is move a value from one context where NULL means one thing (success) to a context where NULL means something entirely different (failure).
So, in computer languages as in human languages, be careful of context. It can completely change the meaning of what you're trying to say.
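
The four variants discussed above can be run side by side. This is a minimal sketch with Python's built-in sqlite3 module; the key column is named k here (rather than Key) to stay clear of SQL keywords, and the rows are made up.

```python
import sqlite3

# Hypothetical tables a and b; column k plays the role of Key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (k INTEGER, NonKey1 TEXT);
CREATE TABLE b (k INTEGER, NonKey2 TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b VALUES (1, 'b1'), (3, 'b3');
""")

SELECT = "SELECT a.k, a.NonKey1, b.NonKey2 FROM a "

def run(tail):
    """Append a join/filter tail to the common SELECT and order by a.k."""
    return conn.execute(SELECT + tail + " ORDER BY a.k").fetchall()

plain_left = run("LEFT JOIN b ON b.k = a.k")
anti_join = run("LEFT JOIN b ON b.k = a.k WHERE b.k IS NULL")
not_null = run("LEFT JOIN b ON b.k = a.k WHERE b.k IS NOT NULL")
inner = run("INNER JOIN b ON b.k = a.k")
null_in_on = run("LEFT JOIN b ON b.k = a.k AND b.k IS NULL")
never_true = run("LEFT JOIN b ON 0 = 1")

print(plain_left)   # [(1, 'a1', 'b1'), (2, 'a2', None), (3, 'a3', 'b3')]
print(anti_join)    # [(2, 'a2', None)]
print(not_null == inner)         # True
print(null_in_on)   # [(1, 'a1', None), (2, 'a2', None), (3, 'a3', None)]
print(null_in_on == never_true)  # True
```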

Left join or select from multiple table using comma (,) [duplicate]

I'm curious as to why we need to use LEFT JOIN since we can use commas to select multiple tables.
What are the differences between LEFT JOIN and using commas to select multiple tables.
Which one is faster?
Here is my code:
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT no as nonvs,
owner,
owner_no,
vocab_no,
correct
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
...and:
SELECT *
FROM vocab_stats vs,
mst_words mw
WHERE mw.no = vs.vocab_no
AND vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111

First of all, to be completely equivalent, the first query should have been written
SELECT mw.*,
nvs.*
FROM mst_words mw
LEFT JOIN (SELECT *
FROM vocab_stats
WHERE owner = 1111) AS nvs ON mw.no = nvs.vocab_no
WHERE (nvs.correct > 0 )
AND mw.level = 1
So that mw.* and nvs.* together produce the same set as the 2nd query's singular *. The query as you have written can use an INNER JOIN, since it includes a filter on nvs.correct.
The general form
TABLEA LEFT JOIN TABLEB ON <CONDITION>
attempts to find TableB records based on the condition. If that fails, the results from TABLEA are kept, with all the columns from TableB set to NULL. In contrast
TABLEA INNER JOIN TABLEB ON <CONDITION>
also attempts to find TableB records based on the condition. However, when that fails, the particular record from TableA is removed from the output result set.
The ANSI standard for CROSS JOIN produces a Cartesian product between the two tables.
TABLEA CROSS JOIN TABLEB
-- # or in older syntax, simply using commas
TABLEA, TABLEB
The intention of the syntax is that EACH row in TABLEA is joined to EACH row in TABLEB. So 4 rows in A and 3 rows in B produces 12 rows of output. When paired with conditions in the WHERE clause, it sometimes produces the same behaviour of the INNER JOIN, since they express the same thing (condition between A and B => keep or not). However, it is a lot clearer when reading as to the intention when you use INNER JOIN instead of commas.
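
The row counts and the equivalence with INNER JOIN can be checked with a short sketch. It uses Python's built-in sqlite3 module; the table and column names (table_a, table_b, a_id) and the rows are hypothetical.

```python
import sqlite3

# Hypothetical tables: 4 rows in table_a, 3 rows in table_b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER);
CREATE TABLE table_b (a_id INTEGER);
INSERT INTO table_a VALUES (1), (2), (3), (4);
INSERT INTO table_b VALUES (1), (1), (2);
""")

def rows(sql):
    return conn.execute(sql).fetchall()

cross = rows("SELECT * FROM table_a CROSS JOIN table_b")
comma = rows("SELECT * FROM table_a, table_b")

# Comma notation plus a WHERE condition versus an explicit INNER JOIN.
comma_where = rows(
    "SELECT a.id, b.a_id FROM table_a a, table_b b "
    "WHERE a.id = b.a_id ORDER BY a.id"
)
inner = rows(
    "SELECT a.id, b.a_id FROM table_a a INNER JOIN table_b b "
    "ON a.id = b.a_id ORDER BY a.id"
)

print(len(cross), len(comma))  # 12 12
print(comma_where == inner)    # True
```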
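Performance-wise, most DBMS will plan an INNER JOIN the same way in either notation, and an INNER JOIN is usually no slower than a LEFT JOIN, since the optimizer is free to reorder the tables. The comma notation can cause database systems to misinterpret the intention and produce a bad query plan - so another plus for SQL92 notation.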
Why do we need LEFT JOIN? If the explanation of LEFT JOIN above is still not enough (keep records in A without matches in B), then consider that to achieve the same, you would need a complex UNION between two sets using the old comma-notation to achieve the same effect. But as previously stated, this doesn't apply to your example, which is really an INNER JOIN hiding behind a LEFT JOIN.
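
As a rough illustration of that UNION, the sketch below emulates a LEFT JOIN with the comma notation plus a second branch for the unmatched rows. It uses Python's built-in sqlite3 module; the tables, columns and rows are hypothetical, and NOT EXISTS is just one possible way to write the second branch.

```python
import sqlite3

# Hypothetical tables; id 2 has no row in table_b.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER);
CREATE TABLE table_b (a_id INTEGER, val TEXT);
INSERT INTO table_a VALUES (1), (2), (3);
INSERT INTO table_b VALUES (1, 'x'), (1, 'y'), (3, 'z');
""")

left_join = conn.execute("""
SELECT a.id, b.val
FROM table_a a
LEFT JOIN table_b b ON b.a_id = a.id
ORDER BY 1, 2
""").fetchall()

# Comma notation: matched pairs, plus unmatched left rows padded with NULL.
union_emulation = conn.execute("""
SELECT a.id, b.val
FROM table_a a, table_b b
WHERE b.a_id = a.id
UNION ALL
SELECT a.id, NULL
FROM table_a a
WHERE NOT EXISTS (SELECT 1 FROM table_b b WHERE b.a_id = a.id)
ORDER BY 1, 2
""").fetchall()

print(left_join)                     # [(1, 'x'), (1, 'y'), (2, None), (3, 'z')]
print(left_join == union_emulation)  # True
```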
Notes:
The RIGHT JOIN is the same as LEFT, except that it starts with TABLEB (right side) instead of A.
RIGHT and LEFT JOINS are both OUTER joins. The word OUTER is optional, i.e. it can be written as LEFT OUTER JOIN.
The third type of OUTER join is FULL OUTER join, but that is not discussed here.

Separating the JOIN from the WHERE makes it easy to read, as the join logic cannot be confused with the WHERE conditions. It will also generally be faster as the server will not need to conduct two separate queries and combine the results.
The two examples you've given are not really equivalent, as you have included a sub-query in the first example. This is a better example:
SELECT vs.*, mw.*
FROM mst_words mw
LEFT JOIN vocab_stats vs ON mw.no = vs.vocab_no
WHERE vs.correct > 0
AND mw.level = 1
AND vs.owner = 1111

SQL DB Question

Question about a SQL view. I am trying to develop a view from two tables. The two tables have the same primary keys, except that the 1st table has all of them and the 2nd has some, but not all. When I INNER JOIN them, I get a recordset, but it is not complete because the 2nd table doesn't have all the records in it. Is there a way in my view to write logic stating that if the key isn't in table #2, a zero is inserted so that the entire record set is shown in the view? I want to show ALL the records in the view even if there's nothing to inner join.
My example below:
SELECT dbo.Baan_view1b.Number, dbo.Baan_view1b.description, dbo.Baan_view1b.system, dbo.Baan_view1b.Analyst, dbo.Baan_view1b.[User],
dbo.Baan_view1b.[Date Submitted], dbo.Baan_view1b.category, dbo.Baan_view1b.stage, MAX(dbo.notes.percent_developed) AS Expr1
FROM dbo.Baan_view1b INNER JOIN
dbo.notes ON dbo.Baan_view1b.Number = dbo.notes.note_number
GROUP BY dbo.Baan_view1b.Number, dbo.Baan_view1b.description, dbo.Baan_view1b.system, dbo.Baan_view1b.Analyst, dbo.Baan_view1b.[User],
dbo.Baan_view1b.[Date Submitted], dbo.Baan_view1b.category, dbo.Baan_view1b.stage
HAVING (NOT (dbo.Baan_view1b.stage LIKE 'Closed'))

What you are looking for is a left join (left outer join), not an inner join:
SELECT dbo.Baan_view1b.Number, dbo.Baan_view1b.description, dbo.Baan_view1b.system, dbo.Baan_view1b.Analyst,
dbo.Baan_view1b.[User], dbo.Baan_view1b.[Date Submitted], dbo.Baan_view1b.category, dbo.Baan_view1b.stage,
MAX(dbo.notes.percent_developed) AS Expr1
FROM dbo.Baan_view1b
LEFT OUTER JOIN dbo.notes
ON dbo.Baan_view1b.Number = dbo.notes.note_number
WHERE NOT dbo.Baan_view1b.stage LIKE 'Closed'
GROUP BY dbo.Baan_view1b.Number, dbo.Baan_view1b.description, dbo.Baan_view1b.system, dbo.Baan_view1b.Analyst,
dbo.Baan_view1b.[User], dbo.Baan_view1b.[Date Submitted], dbo.Baan_view1b.category, dbo.Baan_view1b.stage
Also, changing the HAVING Clause to a WHERE clause makes the query more efficient.

Yes, you can do this. Assuming that baan_view1b has all the records and notes has only some, change
FROM dbo.Baan_view1b INNER JOIN dbo.notes
to say
FROM dbo.Baan_view1b LEFT OUTER JOIN dbo.notes
INNER JOIN (or just plain JOIN) tells the database engine to take records from Baan_view1b, match them up with records in notes, and include a row in the output for every pair of records that match. As you have seen, it excludes records from Baan_view1b that don't have matches in the notes table.
LEFT OUTER JOIN instead tells the engine to take ALL the records from Baan_view1b (because it's on the left side of the JOIN keywords). Then, it will match up records from notes wherever it can. However, you are guaranteed a row in the output for every row in the left-hand table regardless of whether it can be matched.
If, as is usual, you asked for column values from both tables, the columns from the table on the right-hand side of the JOIN will have NULL values in the missing rows.
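
Here is a minimal sketch of the change, using Python's built-in sqlite3 module instead of SQL Server. The tables are reduced to hypothetical requests and notes tables with made-up rows, and wrapping the aggregate in COALESCE to show a zero instead of NULL is an addition that the answers above do not mention.

```python
import sqlite3

# Hypothetical stand-ins for Baan_view1b (requests) and notes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (number INTEGER, stage TEXT);
CREATE TABLE notes    (note_number INTEGER, percent_developed INTEGER);
INSERT INTO requests VALUES (1, 'Open'), (2, 'Open'), (3, 'Closed');
INSERT INTO notes VALUES (1, 40), (1, 75);   -- only request 1 has notes
""")

inner_rows = conn.execute("""
SELECT r.number, MAX(n.percent_developed)
FROM requests r
INNER JOIN notes n ON r.number = n.note_number
WHERE r.stage NOT LIKE 'Closed'
GROUP BY r.number
ORDER BY r.number
""").fetchall()

# LEFT OUTER JOIN keeps request 2; COALESCE turns its NULL into 0.
left_rows = conn.execute("""
SELECT r.number, COALESCE(MAX(n.percent_developed), 0)
FROM requests r
LEFT OUTER JOIN notes n ON r.number = n.note_number
WHERE r.stage NOT LIKE 'Closed'
GROUP BY r.number
ORDER BY r.number
""").fetchall()

print(inner_rows)  # [(1, 75)]
print(left_rows)   # [(1, 75), (2, 0)]
```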

Change the inner join to a left outer join.
(Or a right outer join or a full outer join if you feel fancy.)

You need an outer join. This shows all records that have a matching key as well as the ones that don't. The inner join only shows records that have matching join keys.
Enjoy!

You need to do a Left Outer Join as other posters have already mentioned. More information can be found here.

How can I exclude values from a third query (Access)

I have a query that shows me a listing of ALL opportunities in one query
I have a query that shows me a listing of EXCLUSION opportunities, ones we want to eliminate from the results
I need to produce a query that will take everything from the first query minus the second query...
SELECT DISTINCT qryMissedOpportunity_ALL_Clients.*
FROM qryMissedOpportunity_ALL_Clients INNER JOIN qryMissedOpportunity_Exclusions ON
([qryMissedOpportunity_ALL_Clients].[ClientID] <> [qryMissedOpportunity_Exclusions].[ClientID])
AND
([qryMissedOpportunity_Exclusions].[ClientID] <> [qryMissedOpportunity_Exclusions].[BillingCode])
The initial query works as intended and exclusions successfully lists all the hits, but I get the full listing when I query with the above which is obviously wrong. Any tips would be appreciated.
EDIT - Two originating queries
qryMissedOpportunity_ALL_Clients (1)
SELECT MissedOpportunities.MOID, PriceList.BillingCode, Client.ClientID, Client.ClientName, PriceList.WorkDescription, PriceList.UnitOfWork, MissedOpportunities.Qty, PriceList.CostPerUnit AS Our_PriceList_Cost, ([MissedOpportunities].[Qty]*[PriceList].[CostPerUnit]) AS At_Cost, MissedOpportunities.fBegin
FROM PriceList INNER JOIN (Client INNER JOIN MissedOpportunities ON Client.ClientID = MissedOpportunities.ClientID) ON PriceList.BillingCode = MissedOpportunities.BillingCode
WHERE (((MissedOpportunities.fBegin)=#10/1/2009#));
qryMissedOpportunity_Exclusions
SELECT qryMissedOpportunity_ALL_Clients.*, MissedOpportunity_Exclusions.Exclusion, MissedOpportunity_Exclusions.Comments
FROM qryMissedOpportunity_ALL_Clients INNER JOIN MissedOpportunity_Exclusions ON (qryMissedOpportunity_ALL_Clients.BillingCode = MissedOpportunity_Exclusions.BillingCode) AND (qryMissedOpportunity_ALL_Clients.ClientID = MissedOpportunity_Exclusions.ClientID)
WHERE (((MissedOpportunity_Exclusions.Exclusion)=True));
One group needs to see everything; the other needs to see the things they haven't deemed a "valid" missed opportunity, as in, we've seen it, verified why it's there, and don't need to bother critiquing it every single month.

Generally you can exclude a table by doing a left join and comparing against null:
SELECT t1.* FROM t1 LEFT JOIN t2 on t1.id = t2.id where t2.id is null;
Should be pretty easy to adapt this to your situation.
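
Applied to a two-column key, the pattern might look like the sketch below. It uses Python's built-in sqlite3 module rather than Access, and all_clients and exclusions are hypothetical, cut-down stand-ins for the two saved queries, with made-up rows.

```python
import sqlite3

# Hypothetical stand-ins for the ALL_Clients and Exclusions queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE all_clients (ClientID INTEGER, BillingCode TEXT);
CREATE TABLE exclusions  (ClientID INTEGER, BillingCode TEXT);
INSERT INTO all_clients VALUES (1, 'A'), (1, 'B'), (2, 'A');
INSERT INTO exclusions  VALUES (1, 'B');
""")

# Join on equality of both key columns, then keep only the rows
# for which no exclusion was found (the right side is all NULL).
remaining = conn.execute("""
SELECT c.ClientID, c.BillingCode
FROM all_clients c
LEFT JOIN exclusions e
  ON c.ClientID = e.ClientID
 AND c.BillingCode = e.BillingCode
WHERE e.ClientID IS NULL
ORDER BY c.ClientID, c.BillingCode
""").fetchall()

print(remaining)  # [(1, 'A'), (2, 'A')]
```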

Looking at your query rewritten to use table aliases so I can read it...
SELECT DISTINCT c.*
FROM qryMissedOpportunity_ALL_Clients c
JOIN qryMissedOpportunity_Exclusions e
ON c.ClientID <> e.ClientID
AND e.ClientID <> e.BillingCode
This query will produce a Cartesian product of sorts: every row in qryMissedOpportunity_ALL_Clients will join with every row in qryMissedOpportunity_Exclusions whose ClientID does not match. Is this what you want? Generally, join conditions are based on a column in one table being equal to a column in the other table. Joining where they are not equal is unusual.
Second, the second inequality in the join conditions is between two columns in the same table (qryMissedOpportunity_Exclusions). Are you sure this is what you want? If it is, it is not a join condition but a WHERE clause condition.
Third, your question mentions two queries, but there is only the one query (above) in your question. Where is the second one?
