How to do a conditional join in hive

I have two hive tables and I want to do a join only if both tables have data in them. I don't want the join to happen if one of the tables is empty.
I tried exploring the case statement with the intention of doing something like
select count(*) as val
case
when val > 0 then <do join of table1 and table2 here>
else
<do nothing>
end
from table2
However, it looks like hive won't allow an evaluation like this within the case statement, so this approach does not work. Does anyone have any input on how to do this in hive?

select *
from TableA as a
left join TableB as b
on b.A_Id = a.A_Id
where
b.A_Id is not null or
not exists (select top 1 A_Id from TableB)
Here is the Source which I came across.
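To see how the query above behaves, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite is only a stand-in (it is not Hive, and Hive may restrict subqueries differently), the sample rows and the val column are made up, and the subquery is written as SELECT 1 because SQLite has no TOP keyword.

```python
import sqlite3

# The answer's query, adapted for SQLite (SELECT 1 instead of SELECT TOP 1).
QUERY = """
SELECT a.A_Id, a.val, b.val
FROM TableA AS a
LEFT JOIN TableB AS b ON b.A_Id = a.A_Id
WHERE b.A_Id IS NOT NULL
   OR NOT EXISTS (SELECT 1 FROM TableB)
ORDER BY a.A_Id
"""

def run(rows_a, rows_b):
    """Load the hypothetical sample rows into a fresh database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TableA (A_Id INTEGER, val TEXT)")
    conn.execute("CREATE TABLE TableB (A_Id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO TableA VALUES (?, ?)", rows_a)
    conn.executemany("INSERT INTO TableB VALUES (?, ?)", rows_b)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result

rows_a = [(1, "a1"), (2, "a2"), (3, "a3")]

# TableB has data: only the matching rows come back (acts like an inner join).
matched = run(rows_a, [(1, "b1"), (3, "b3")])
print(matched)     # [(1, 'a1', 'b1'), (3, 'a3', 'b3')]

# TableB is empty: every TableA row comes back, with NULLs for TableB.
unfiltered = run(rows_a, [])
print(unfiltered)  # [(1, 'a1', None), (2, 'a2', None), (3, 'a3', None)]
```

Note that when TableB is empty this still returns the TableA rows rather than returning nothing, so it may or may not match what the question is after.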

Related

SQL antijoin with multiple keys

I'd like to implement an antijoin on two table but using two keys so that the result is all rows in Table A that do not contain the combinations of [key_1, key_2] found in Table B. How can I write this query in SQL?

If you want an anti-left join, the logic is:
select a.*
from tablea a
left join tableb b on b.key_1 = a.key_1 and b.key_2 = a.key_2
where b.key_1 is null
As for me, I like to implement such logic with not exists, because I find that it is more expressive about the intent:
select a.*
from tablea a
where not exists (
select 1 from tableb b where b.key_1 = a.key_1 and b.key_2 = a.key_2
)
The not exists query would take advantage of an index on tableb(key_1, key_2).

select a.*
from table_a a
left anti join table_b b on a.key_1 = b.key_1 and a.key_2 = b.key_2;
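As a quick check that the LEFT JOIN ... IS NULL form and the NOT EXISTS form select the same rows, here is a small sketch using Python's sqlite3 module. The sample rows and the payload column are invented for illustration, and the LEFT ANTI JOIN form is not run here because SQLite does not accept that syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (key_1 INTEGER, key_2 TEXT, payload TEXT);
CREATE TABLE tableb (key_1 INTEGER, key_2 TEXT);
INSERT INTO tablea VALUES (1, 'x', 'only key_1 appears in tableb'),
                          (1, 'y', 'both keys match'),
                          (2, 'x', 'only key_2 appears in tableb'),
                          (3, 'z', 'both keys match');
INSERT INTO tableb VALUES (1, 'y'), (3, 'z'), (9, 'x');
""")

# Anti-join via LEFT JOIN: keep rows of tablea that found no partner.
left_join_sql = """
SELECT a.key_1, a.key_2
FROM tablea a
LEFT JOIN tableb b ON b.key_1 = a.key_1 AND b.key_2 = a.key_2
WHERE b.key_1 IS NULL
ORDER BY a.key_1, a.key_2
"""

# Anti-join via NOT EXISTS: same intent, stated directly.
not_exists_sql = """
SELECT a.key_1, a.key_2
FROM tablea a
WHERE NOT EXISTS (
    SELECT 1 FROM tableb b WHERE b.key_1 = a.key_1 AND b.key_2 = a.key_2
)
ORDER BY a.key_1, a.key_2
"""

via_left_join = conn.execute(left_join_sql).fetchall()
via_not_exists = conn.execute(not_exists_sql).fetchall()
print(via_left_join)   # [(1, 'x'), (2, 'x')]
print(via_not_exists)  # [(1, 'x'), (2, 'x')]
```

Only the full [key_1, key_2] combination counts: the rows (1, 'x') and (2, 'x') survive even though each key value appears somewhere in tableb on its own.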

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows
When running a plain JOIN query (which is an inner join):
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a total of 10,896 rows or 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get a total of 10,896 rows, but I was expecting all 10,910 rows from table1.
I am wondering if there is an issue with my query syntax.

Since you are using EACH, it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL, the COUNT(DISTINCT) function is probabilistic: it gives a statistical approximation and is not guaranteed to be exact.
You can use the EXACT_COUNT_DISTINCT() function instead. It gives you the exact number but is a little more expensive on the back end.
An even better option is to just use Standard SQL.
For your specific query you only need to remove the EACH keyword and it should work like a charm:
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id

I added the original query as a subquery, counted the ids, and got the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a_id)
FROM
(SELECT a.id AS a_id,
b.id AS b_id
FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id)

That is because in both cases you are counting the non-null values for table a by using count(distinct a.id).
Use count(*) instead and it should work.

You will have to add coalesce. BigQuery, unlike traditional SQL, does not recognize fields unless they are used explicitly:
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query will now take full effect of full outer join :)
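To make concrete what each of the counts in this thread measures when the count is exact, here is a small sketch in Python with sqlite3. It is not BigQuery: the ids are made up, and the full outer join is emulated with two LEFT JOINs and UNION ALL so that it runs on older SQLite versions too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
# Hypothetical data: ids 1-5 in table1, ids 4-8 in table2 (4 and 5 are shared).
conn.executemany("INSERT INTO table1 VALUES (?)", [(i,) for i in range(1, 6)])
conn.executemany("INSERT INTO table2 VALUES (?)", [(i,) for i in range(4, 9)])

# Emulated FULL OUTER JOIN: all table1 rows, plus table2 rows with no partner.
FULL_OUTER = """
SELECT a.id AS a_id, b.id AS b_id
FROM table1 a LEFT JOIN table2 b ON a.id = b.id
UNION ALL
SELECT a.id AS a_id, b.id AS b_id
FROM table2 b LEFT JOIN table1 a ON a.id = b.id
WHERE a.id IS NULL
"""

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Inner join: only ids present in both tables.
shared = scalar(
    "SELECT COUNT(DISTINCT a.id) FROM table1 a JOIN table2 b ON a.id = b.id")
# Full outer join, counting a.id: every id of table1 (NULLs are not counted).
from_table1 = scalar(f"SELECT COUNT(DISTINCT a_id) FROM ({FULL_OUTER})")
# Full outer join, counting COALESCE: every id found in either table.
from_either = scalar(
    f"SELECT COUNT(DISTINCT COALESCE(a_id, b_id)) FROM ({FULL_OUTER})")

print(shared, from_table1, from_either)  # 2 5 8
```

With exact counting, COUNT(DISTINCT a.id) over the full outer join gives the number of ids in table1, while the COALESCE version gives the number of ids across both tables, so the two are answers to different questions.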

Is it possible to use IF or CASE in sql FROM statement

I have a long stored procedure and I would like to make a slight modification to the procedure without having to create a new one (for maintenance purposes).
Is it possible to use an IF or CASE in the FROM statement of the select statement to join other tables?
Like this:
from tableA a
join tableB b on a.indexed = b.indexed
IF @Param='Y'
BEGIN
join tableC c on a.indexed = c.indexed
END
It didn't seem to work for me. But I am wondering if this is even possible and/or if this even makes sense to do.
Thanks.

No, it is not possible. You can only accomplish this through the use of dynamic SQL.
The Curse and Blessings of Dynamic SQL
An Intro to Dynamic SQL
I would not advise using dynamic SQL. There are most likely better ways to perform this operation, but you would have to provide more info.

You can achieve something like it if you have a left outer join
Consider
declare @param bit = 1
select a.*, b.*, c.* from a
inner join b on a.id = b.a_id
left outer join c on b.id = c.b_id and @param = 1
This will return all columns from a, b, c.
Now try with
declare @param bit = 0
This will return all columns from a and b, and nulls for columns of c.
It won't work if both joins are inner.
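The effect of putting the parameter test in the ON clause can be tried out with a short sketch in Python using sqlite3. This is not T-SQL: the tables hold one made-up row each, and the named placeholder :param plays the role of the declared variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER, a_id INTEGER);
CREATE TABLE c (id INTEGER, b_id INTEGER);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (10, 1);
INSERT INTO c VALUES (100, 10);
""")

# The parameter is part of the LEFT JOIN condition, so when it is 0 no row
# of c can match, and the outer join fills c's columns with NULL.
SQL = """
SELECT a.id, b.id, c.id
FROM a
INNER JOIN b ON a.id = b.a_id
LEFT OUTER JOIN c ON b.id = c.b_id AND :param = 1
"""

def fetch(param):
    """Run the query with the given value bound to :param."""
    return conn.execute(SQL, {"param": param}).fetchall()

with_c = fetch(1)
without_c = fetch(0)
print(with_c)     # [(1, 10, 100)]
print(without_c)  # [(1, 10, None)]
```

The row from a and b is returned either way; only the columns of c switch between real values and NULL, which is why the trick depends on the join to c being an outer join.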

No, this is not possible. Your best bet would probably be to select from both tables and only include the data you care about. If you provide an example of what you are trying to do, I can provide a better answer.
Attempt at an example:
SELECT t1.id, COALESCE(t2.name, t3.name)
FROM Table1 as t1
LEFT JOIN Table2 as t2
ON t1.id = t2.id
LEFT JOIN Table3 as t3
ON t1.id = t3.id

While what you proposed is not possible, you can play with your join conditions:
from tableA a
inner join tableB b ON a.indexed = b.indexed
left join tableC c ON a.indexed = c.indexed AND 1 = CASE @Param WHEN 'Y' THEN 1 ELSE 0 END
It would be more performant to just use one big IF:
IF @Param='Y' THEN
from tableA a
inner join tableB b ON a.indexed = b.indexed
left join tableC c ON a.indexed = c.indexed
ELSE
from tableA a
inner join tableB b ON a.indexed = b.indexed

You haven't revealed your SELECT clause. The essence of what you want is as follows:
SELECT indexed
FROM tableA
INTERSECT
SELECT indexed
FROM tableB
INTERSECT
SELECT indexed
FROM tableC
WHERE @Param = 'Y'
Then use this table expression as dictated by your SELECT clause e.g. say you only want to project tableA:
WITH T
AS
(
SELECT indexed
FROM tableA
INTERSECT
SELECT indexed
FROM tableB
INTERSECT
SELECT indexed
FROM tableC
WHERE @Param = 'Y'
)
SELECT *
FROM tableA
WHERE indexed IN ( SELECT indexed FROM T );

SQL Case statement to check for NULLS and Non-existent records

I am doing a join between two tables and want to select the columns based on whether they have a record or not. I'm trying to avoid having multiple of the same field and am trying to condense them into single columns. Something like:
Select
id = (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END),
name = (CASE WHEN a.name IS NULL THEN b.name ELSE a.name END)
From Table1 a
Left Join Table2 b
On a.id = b.id
Where a.id = @id
I'd like id to populate from Table1 if a record exists, but if not pull from Table2. The previous code returns no records because there are no NULL values in Table1 so my question is how do I run a check to see if any records even exist? Also if anyone knows of a better way to accomplish what I am trying to do I appreciate guidance and constructive criticism.
EDIT
It looks like COALESCE will work for what I'm trying to accomplish. I'd like to give a little more info on exactly what I am working with and get some advice on whether I am using the best method.
I have a bloated table Table2 and it is in production. I'm working on building new web applications for this system but can't justify a complete database redesign so I am trying to do one "on the fly". I've created a new table Table1 and I am writing stored procedures for the following methods Get(Select), Set(Update), Add(Insert), Remove(Delete). This way, to my code, it will seem that I am working with a single table that is not bloated. My code will simply call one of the SP methods and then the stored procedure will handle the data between the old table and the new. I am currently working on the Get method and I need to check the old table Table2 for a record if it doesn't exist in Table1.
Thanks to the suggestions here my query currently looks like this:
Select
id = coalesce(a.id, b.student_number),
first_name = coalesce(a.first_name, b.first_name),
last_name = coalesce(a.last_name, b.last_name),
-- etc
From Table1 a
Full Outer Join Table2 b
On a.id = b.student_number
Where (a.id = @id Or b.student_number = @id)
This works for what I'm trying to accomplish, I'd like to throw it out there to the experienced crowd for any tips or suggestions if there are better or more correct ways to go about this.
Thanks

I suspect your problem may come from doing a left join. Try again using a full outer join, like this:
Select
id = coalesce(a.id, b.id),
name = coalesce(a.name, b.name)
From Table1 a
full outer Join Table2 b
On a.id = b.id
Where a.id = @id

Select id = coalesce(a.id, b.id),
name = coalesce(a.name, b.name)
From Table2 b
Left Join Table1 a On a.id = b.id
Where b.id = @id
You may need to use ISNULL or CASE instead of COALESCE depending on your database platform.

First, you don't need a case statement for that:
Select ISNULL(a.id,b.id) AS id, ISNULL(a.name,b.name) AS name
From Table1 a
Left Join Table2 b
On a.id = b.id
Where a.id = @id
Second, if I get it right, the id field can contain nulls, and in that case you are screwed. I mean, the ID is a unique value that identifies a row; if it can be null, you can't identify that row.
But if what you want is getting records from Table1 and Table2 and avoid duplicates, a simple UNION will work fine, since it discards duplicates:
select id, name
from Table1
where id = @id
union
select id, name
from Table2
where id = @id

You could do something like:
select id, name from Table1 a where a.id not in (select id from Table2)
UNION
select id, name from Table2 b
This would give you all the records from table1 that didn't have a corresponding match in table2 plus all of table2's records. The union would then combine the results.

In your first CASE statement, a.id and b.id will always be the same value, except for instances in which a.id has a value and b.id is NULL because of the LEFT JOIN. There will never be a row in the result set with a NULL a.id value and a non-NULL b.id value. You could just use a.id for this column.
For the second CASE statement, you may find the name column in either or both tables with a value (and, of course, the values may be different). You said you want to "condense" these column values; the SQL function for that is COALESCE:
COALESCE(a.id, b.id)
which returns the first non-NULL value (a.id if it isn't NULL, otherwise b.id). It won't tip you off to different names in the two tables.
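Here is a minimal sketch of the COALESCE fallback for the "check the old table if the record is not in the new one" lookup, written in Python with sqlite3. The rows are invented, and it assumes that every id exists in the old table Table2, which is why the query starts from Table2 and left joins Table1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (id INTEGER, name TEXT);  -- new table
CREATE TABLE Table2 (id INTEGER, name TEXT);  -- old table
INSERT INTO Table1 VALUES (1, 'Ann (new)');
INSERT INTO Table2 VALUES (1, 'Ann (old)'), (2, 'Bob (old)');
""")

# COALESCE returns its first non-NULL argument, so values from Table1 win
# whenever a Table1 row exists, and Table2 is used only as a fallback.
SQL = """
SELECT COALESCE(a.id, b.id)     AS id,
       COALESCE(a.name, b.name) AS name
FROM Table2 b
LEFT JOIN Table1 a ON a.id = b.id
WHERE b.id = ?
"""

def get(record_id):
    """Return (id, name) for record_id, or None if no table has it."""
    return conn.execute(SQL, (record_id,)).fetchone()

print(get(1))  # (1, 'Ann (new)')  present in both tables, Table1 wins
print(get(2))  # (2, 'Bob (old)')  only in Table2, fallback is used
print(get(3))  # None              not in Table2 at all
```

As noted above, the query silently prefers the Table1 name for id 1 and gives no hint that the two tables disagree.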

SQL LEFT outer join with only some rows from the right?

I have two tables TABLE_A and TABLE_B having the joined column as the employee number EMPNO.
I want to do a normal left outer join. However, TABLE_B has certain records that are soft-deleted (status='D'), and I want these to be included. Just to clarify: TABLE_B could have active records (status = null/a/anything) as well as deleted records for an employee, and in that case I don't want that employee in my result. If, however, there are only deleted records for the employee in TABLE_B, I want the employee to be included in the result. I hope I'm making my requirement clear. (I could do a lengthy qrslt kind of thingy and get what I want, but I figure there has to be a more optimized way of doing this using the join syntax.) I would appreciate any suggestions (even without the join). His newbness is trying the following query without the desired result:
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON TABLE_A.EMPNO = TABLE_B.EMPNO AND TABLE_B.STATUS<>'D'
Much appreciate any help.

Just to clarify -- all records from TABLE_A should appear, unless there are rows in table B with statuses other than 'D'?
You'll need at least one non-null column on B (I'll use 'B.ID' as an example, and this approach should work):
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON
(TABLE_A.EMPNO = TABLE_B.EMPNO)
AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
WHERE
TABLE_B.ID IS NULL
That is, reverse the logic you might think -- join onto TABLE_B only where you have rows that would exclude TABLE_A entries, and then use the IS NULL at the end to exclude those. This means that only those which didn't match (those with no row in TABLE_B, or with only 'D' rows) get included.
An alternative might be
SELECT TABLE_A.EMPNO
FROM TABLE_A
WHERE NOT EXISTS (
SELECT * FROM TABLE_B
WHERE TABLE_B.EMPNO = TABLE_A.EMPNO
AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
)
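Both queries in this answer can be checked against a few sample employees with the following sketch in Python using sqlite3. The employee numbers, the ID column values and the status 'A' are made up; the point is only to confirm that the two forms agree and which employees they keep.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (EMPNO INTEGER);
CREATE TABLE TABLE_B (ID INTEGER NOT NULL, EMPNO INTEGER, STATUS TEXT);
INSERT INTO TABLE_A VALUES (1), (2), (3), (4), (5);
-- employee 1: no rows in TABLE_B
INSERT INTO TABLE_B VALUES (1, 2, 'D');                  -- 2: only deleted
INSERT INTO TABLE_B VALUES (2, 3, 'A');                  -- 3: only active
INSERT INTO TABLE_B VALUES (3, 4, 'D'), (4, 4, 'A');     -- 4: mixed
INSERT INTO TABLE_B VALUES (5, 5, 'D'), (6, 5, NULL);    -- 5: deleted + NULL
""")

# Join only onto the rows that should exclude an employee, then keep the
# employees for which no such row was found.
left_join_sql = """
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON
    (TABLE_A.EMPNO = TABLE_B.EMPNO)
    AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
WHERE TABLE_B.ID IS NULL
ORDER BY TABLE_A.EMPNO
"""

not_exists_sql = """
SELECT TABLE_A.EMPNO
FROM TABLE_A
WHERE NOT EXISTS (
    SELECT * FROM TABLE_B
    WHERE TABLE_B.EMPNO = TABLE_A.EMPNO
      AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
)
ORDER BY TABLE_A.EMPNO
"""

via_left_join = [row[0] for row in conn.execute(left_join_sql)]
via_not_exists = [row[0] for row in conn.execute(not_exists_sql)]
print(via_left_join)   # [1, 2]
print(via_not_exists)  # [1, 2]
```

Employees 3, 4 and 5 drop out because each has at least one row whose status is not 'D'; the explicit IS NULL test is what catches employee 5, since NULL <> 'D' is not true on its own.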

The following query will get you the employee records that aren't deleted, or those where the employee only has deleted records.
select
a.*
from
table_a a
left join table_b b on
a.empno = b.empno
where
b.status <> 'D'
or (b.status = 'D' and
(select count(distinct status) from table_b where empno = a.empno) = 1)
This is in ANSI SQL, but if I knew your RDBMS, I could give a more specific solution that may be a bit more elegant.

ah crud, this apparently works ><
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON TABLE_A.EMPNO = TABLE_B.EMPNO
where TABLE_B.STATUS<>'D'
If you guys have any extra info to chime in with though, please feel free.
UPDATE:
Saw this question after some time and thought I'd add more helpful info. This link has good info regarding ANSI syntax - http://www.oracle-base.com/articles/9i/ANSIISOSQLSupport.php
In particular this part from the linked page is informative:
Extra filter conditions can be added to the join to using AND to form a complex join. These are often necessary when filter conditions are required to restrict an outer join. If these filter conditions are placed in the WHERE clause and the outer join returns a NULL value for the filter column the row would be thrown away. if the filter condition is coded as part of the join the situation can be avoided.

SELECT A.*, B.*
FROM
Table_A A
INNER JOIN Table_B B
ON A.EmpNo = B.EmpNo
WHERE
NOT EXISTS (
SELECT *
FROM Table_B X
WHERE
A.EmpNo = X.EmpNo
AND X.Status <> 'D'
)
I think this does the trick. The left join is not needed because you only want to include employees with all (and at least one) deleted rows.

This is how I understand the question. You need to include only those employees for which either of the following is true:
an employee has only (soft-)deleted rows in TABLE_B;
an employee has only non-deleted rows in TABLE_B;
an employee has no rows in TABLE_B at all.
In other words, if an employee has both deleted and non-deleted rows in TABLE_B, omit that employee, otherwise include them.
This is how I think it could be solved:
SELECT DISTINCT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b1 ON a.EMPNO = b1.EMPNO
LEFT JOIN TABLE_B b2 ON b1.EMPNO = b2.EMPNO
AND (b1.STATUS = 'D' AND (b2.STATUS <> 'D' OR b2 IS NULL) OR
b2.STATUS = 'D' AND (b1.STATUS <> 'D' OR b1 IS NULL))
WHERE b2.EMPNO /* or whatever non-nullable column there is */ IS NULL
Alternatively, though, you could use grouping:
SELECT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b ON a.EMPNO = b.EMPNO
GROUP BY a.EMPNO
HAVING 0 IN (COUNT(CASE b.STATUS WHEN 'D' THEN 1 ELSE NULL END),
COUNT(CASE b.STATUS WHEN 'D' THEN NULL ELSE 1 END))
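The grouping approach can be tried out with a short sketch in Python using sqlite3. The sample employees and the status 'A' are invented, and the HAVING clause is spelled as two comparisons joined by OR, which is an equivalent way of writing the 0 IN (...) test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (EMPNO INTEGER);
CREATE TABLE TABLE_B (EMPNO INTEGER, STATUS TEXT);
INSERT INTO TABLE_A VALUES (1), (2), (3), (4);
-- employee 1: no rows in TABLE_B
INSERT INTO TABLE_B VALUES (2, 'D'), (2, 'D');  -- 2: only deleted rows
INSERT INTO TABLE_B VALUES (3, 'A');            -- 3: only non-deleted rows
INSERT INTO TABLE_B VALUES (4, 'D'), (4, 'A');  -- 4: both kinds, omitted
""")

# Count deleted and non-deleted rows per employee and keep the employees
# for which at least one of the two counts is zero.
SQL = """
SELECT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b ON a.EMPNO = b.EMPNO
GROUP BY a.EMPNO
HAVING COUNT(CASE b.STATUS WHEN 'D' THEN 1 ELSE NULL END) = 0
    OR COUNT(CASE b.STATUS WHEN 'D' THEN NULL ELSE 1 END) = 0
ORDER BY a.EMPNO
"""

kept = [row[0] for row in conn.execute(SQL)]
print(kept)  # [1, 2, 3]
```

Employee 1 is kept because its single joined row has a NULL status, so the count of 'D' rows is zero; employee 4 is the only one with both counts above zero.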
