Join Tables on Date Range in Hive - sql

I need to join tableA to tableB on employee_id, and the cal_date from tableA needs to be between date_start and date_end from tableB. I ran the query below and received the error message below. Would you please help me correct the query? Thank you for your help!
Both left and right aliases encountered in JOIN 'date_start'.
select a.*, b.skill_group
from tableA a
left join tableB b
on a.employee_id= b.employee_id
and a.cal_date >= b.date_start
and a.cal_date <= b.date_end

RTFM - quoting LanguageManual Joins
Hive does not support join conditions that are not equality conditions
as it is very difficult to express such conditions as a map/reduce
job.
You may try to move the BETWEEN filter to a WHERE clause, resulting in a lousy partially-cartesian-join followed by a post-processing cleanup. Yuck. Depending on the actual cardinality of your "skill group" table, it may work fast - or take whole days.
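A minimal sketch of that WHERE-clause workaround, assuming made-up sample rows. The SQL is run through Python's sqlite3 module only so the example is self-contained; Hive itself is not involved, and column types are simplified to text dates. The join keeps just the equality condition and the range test moves to WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (employee_id INTEGER, cal_date TEXT);
CREATE TABLE tableB (employee_id INTEGER, skill_group TEXT,
                     date_start TEXT, date_end TEXT);
INSERT INTO tableA VALUES (1, '2020-01-15'), (1, '2020-03-10'), (2, '2020-01-20');
INSERT INTO tableB VALUES
    (1, 'sales',   '2020-01-01', '2020-01-31'),
    (1, 'support', '2020-02-01', '2020-02-29');
""")

# Equality-only join condition; the date range is applied afterwards in WHERE.
range_in_where = conn.execute("""
SELECT a.employee_id, a.cal_date, b.skill_group
FROM tableA a
LEFT JOIN tableB b ON a.employee_id = b.employee_id
WHERE a.cal_date >= b.date_start
  AND a.cal_date <= b.date_end
ORDER BY a.employee_id, a.cal_date
""").fetchall()
print(range_in_where)
```

In this sample only the row for employee 1 on 2020-01-15 survives. Rows of tableA with no covering range are filtered out, because the WHERE test on b's columns fails for them, so the result behaves like an inner join rather than a left join.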

If your situation allows, do it in two queries.
First, an inner join written with the range condition in the WHERE clause; then an outer join back onto that result, matching on all the key columns, with a WHERE clause that keeps only the rows where one of the joined fields is null.
Ex:
create table tableC as
select a.*, b.skill_group
from tableA a
, tableB b
where a.employee_id= b.employee_id
and a.cal_date >= b.date_start
and a.cal_date <= b.date_end;
with c as (select * from TableC)
insert into tableC
select a.*, cast(null as string) as skill_group
from tableA a
left join c
on (a.employee_id= c.employee_id
and a.cal_date = c.cal_date)
where c.employee_id is null ;

MarkWusinich had a great solution, but with one major issue. If table A has the same employee_id twice within the date range, table C will also have that employee_id twice (assuming B is unique; more if it is not), creating 4 records after the join. So if A is not unique on employee_id, a GROUP BY is necessary. Corrected below:
with c as
(select a.employee_id, a.cal_date, b.skill_group
from tableA a
, tableB b
where a.employee_id= b.employee_id
and a.cal_date >= b.date_start
and a.cal_date <= b.date_end
group by a.employee_id, a.cal_date, b.skill_group
)
select a.*, c.skill_group
from tableA a
left join c
on a.employee_id = c.employee_id
and a.cal_date = c.cal_date;
Please note: If B was somehow intentionally not distinct on (employee_id, skill_group), then my query above would also have to be modified to appropriately reflect that.
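A small sketch of the duplication issue described above, using made-up rows and Python's sqlite3 module in place of Hive. The CTE here carries cal_date so that the final join can match on it, which is an assumption about the intended query; the helper name joined_rows is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (employee_id INTEGER, cal_date TEXT);
CREATE TABLE tableB (employee_id INTEGER, skill_group TEXT,
                     date_start TEXT, date_end TEXT);
-- employee 1 appears twice on the same date; employee 2 has no range in tableB
INSERT INTO tableA VALUES (1, '2020-01-15'), (1, '2020-01-15'), (2, '2020-01-20');
INSERT INTO tableB VALUES (1, 'sales', '2020-01-01', '2020-01-31');
""")

MATCHES = """
SELECT a.employee_id, a.cal_date, b.skill_group
FROM tableA a, tableB b
WHERE a.employee_id = b.employee_id
  AND a.cal_date >= b.date_start
  AND a.cal_date <= b.date_end
"""

def joined_rows(matches_sql):
    # Left join tableA back onto the range matches held in the CTE c.
    return conn.execute(f"""
        WITH c AS ({matches_sql})
        SELECT a.employee_id, a.cal_date, c.skill_group
        FROM tableA a
        LEFT JOIN c ON a.employee_id = c.employee_id
                   AND a.cal_date = c.cal_date
        ORDER BY a.employee_id
    """).fetchall()

without_group_by = joined_rows(MATCHES)
with_group_by = joined_rows(
    MATCHES + " GROUP BY a.employee_id, a.cal_date, b.skill_group")
print(len(without_group_by), len(with_group_by))
```

Without the GROUP BY, the two identical rows of employee 1 each match two rows in c, giving four rows plus the unmatched employee 2. With the GROUP BY, c holds one row per key and the result has one output row per row of tableA.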

Related

SQL - select * given count from another table

I'm trying to select * from two tables (a and b) using a join (column a.id and b.id), given that the count of a column (b.owner) in b is lower than 3, i.e. a person's name can occur at most twice.
I've tried:
SELECT a.*, COUNT(b.owner) AS b_count
FROM a LEFT JOIN b on a.id = b.id
GROUP BY b.owner HAVING COUNT(b_count) <3
As I'm pretty new to SQL, I'm pretty stuck here. How can I resolve this issue? The result should be all columns for owners who do not appear more than twice in the data.
The query you are trying to run does not work because columns are missing from the GROUP BY clause.
As you are outputting all columns from table a (with SELECT a.*), you need to include all those columns in the GROUP BY clause, so that the database understands which fields to group by when performing the aggregation (in your case COUNT(b.owner)).
Example
Considering that your table a has 3 columns below:
CREATE TABLE persons (
id INTEGER,
name VARCHAR(50),
birthday DATE,
PRIMARY KEY (id)
);
.. and your table b the following and referencing the first table as below:
CREATE TABLE sales (
id INTEGER,
person_id INTEGER,
sale_value DECIMAL,
PRIMARY KEY (id),
FOREIGN KEY (person_id) REFERENCES persons(id)
);
.. you should query it aggregating the COUNT() by those 3 columns:
SELECT a.id, a.name, a.birthday, COUNT(b.person_id) AS b_count
FROM persons a
LEFT JOIN sales b ON a.id = b.person_id
GROUP BY a.id, a.name, a.birthday
HAVING COUNT(b.person_id) < 3
Alternative
In case the total number of records in the 2nd table is not important to you, you could use a different "strategy" here to avoid performing the JOIN between the tables (useful when joining two huge tables) and to avoid repeating all the columns from a in the SELECT and GROUP BY.
First, identify the records that have fewer than 3 occurrences:
SELECT b.person_id
FROM sales b
GROUP BY b.person_id
HAVING COUNT(b.id) < 3;
.. and using it in the WHERE clause to retrieve all the columns from the 1st table only for the ids that resulted from the previous query:
SELECT a.*
FROM persons a
WHERE a.id IN (....other query here....);
.. the steps then run in a more sequential order, which is perhaps easier to visualize while you get more familiar with SQL:
SELECT a.*
FROM persons a
WHERE a.id IN (SELECT b.person_id
FROM sales b
GROUP BY b.person_id
HAVING COUNT(b.id) < 3);
DB Fiddle here
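One difference between the two forms above is worth checking on real data: a person with no rows in sales has a count of 0 in the LEFT JOIN version and is returned, but never appears in the subquery's result and so is not returned by the IN version. A minimal sketch with made-up rows, run through Python's sqlite3 module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, name VARCHAR(50), birthday DATE);
CREATE TABLE sales (id INTEGER PRIMARY KEY, person_id INTEGER, sale_value DECIMAL,
                    FOREIGN KEY (person_id) REFERENCES persons(id));
INSERT INTO persons VALUES (1, 'Ann', '1990-01-01'), (2, 'Bob', '1985-05-05'),
                           (3, 'Cy', '1970-07-07');
-- Ann has 3 sales, Bob has 1, Cy has none
INSERT INTO sales VALUES (1, 1, 10), (2, 1, 20), (3, 1, 30), (4, 2, 5);
""")

# LEFT JOIN + GROUP BY: persons without sales get a count of 0.
join_names = [r[0] for r in conn.execute("""
SELECT a.name
FROM persons a
LEFT JOIN sales b ON a.id = b.person_id
GROUP BY a.id, a.name, a.birthday
HAVING COUNT(b.person_id) < 3
ORDER BY a.id
""")]

# IN subquery: only person_ids that occur in sales can qualify.
subquery_names = [r[0] for r in conn.execute("""
SELECT a.name
FROM persons a
WHERE a.id IN (SELECT b.person_id
               FROM sales b
               GROUP BY b.person_id
               HAVING COUNT(b.id) < 3)
ORDER BY a.id
""")]
print(join_names, subquery_names)
```

Here the join version returns Bob and Cy, while the IN version returns only Bob. Which behaviour is wanted depends on whether persons with zero sales should count as "fewer than 3".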
In Standard SQL, you can use:
SELECT a.*, COUNT(b.owner) AS b_count
FROM a LEFT JOIN
b
ON a.id = b.id
GROUP BY a.id
HAVING COUNT(b.owner) < 3;
This may not work in all databases (and it assumes that a.id is unique/primary key). An alternative would be to use a correlated subquery:
SELECT a.*
FROM (SELECT a.*,
(SELECT COUNT(*)
FROM b
WHERE a.id = b.id
) as b_count
FROM a
) a
WHERE b_count < 3;

Is there a way to print all of the rows from two tables using full outer join?

There are two tables, Table A and Table B. I tried joining them with a full outer join to get all of the rows from both tables (the resultant_table), but it isn't working; the screenshot at the end shows the error I get when I run the query. I want the output shown in the resultant table.
Here is the script that I used:
SELECT table_b.date,
table_b.student,
table_b.location,
table_b.sub_division,
table_a.part_time_pay,
table_b.days_worked
FROM table_a
FULL OUTER JOIN table_b
ON table_a.date = table_b.date
AND table_a.student = table_b.student;
It is doing exactly what you specify. Use coalesce() to combine values from the two tables:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
b.location, b.sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student;
I'm not sure how you want to handle LOCATION, and SUBDIVISION. What if they have different values? I might think you want to put them in the JOIN conditions and then:
SELECT COALESCE(a.date, b.date) as date,
COALESCE(a.student, b.student) as student,
COALESCE(a.location, b.location) as location,
COALESCE(a.sub_division, b.sub_division) as sub_division,
a.part_time_pay, b.days_worked
FROM table_a a FULL JOIN
table_b b
ON a.date = b.date AND
a.student = b.student AND
a.location = b.location AND
a.sub_division = b.sub_division;

SQL select record right after a particular date, compare NULL with date

I would like to keep all records in tableA that come right after my target date.
Main table A
Table B
SELECT *
FROM tableA a
LEFT JOIN tableB b on b.customerID = a.customerID and b.target_date = a.sell_date
WHERE a.sell_date > b.target_date
Unfortunately my code above doesn't work since SQL can't compare NULL with date.
My expected output is
The inequality between target_date and sell_date could go in the join condition of the FROM clause. This way the WHERE clause could be eliminated.
SELECT *
FROM tableA a
LEFT JOIN tableB b on b.customerID=a.customerID
and b.target_date <= a.sell_date;
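To see what moving the comparison into the join condition changes, here is a small sketch with made-up rows, run through Python's sqlite3 module. With the comparison in WHERE, rows where b.target_date is NULL (no match) are discarded; with it in ON, every row of tableA is kept and unmatched rows simply carry a NULL target_date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (customerID INTEGER, sell_date TEXT);
CREATE TABLE tableB (customerID INTEGER, target_date TEXT);
INSERT INTO tableA VALUES (1, '2021-01-05'), (1, '2021-02-10'), (2, '2021-03-01');
-- customer 2 has no target date at all
INSERT INTO tableB VALUES (1, '2021-02-01');
""")

filter_in_where = conn.execute("""
SELECT a.customerID, a.sell_date, b.target_date
FROM tableA a
LEFT JOIN tableB b ON b.customerID = a.customerID
WHERE b.target_date <= a.sell_date
ORDER BY a.customerID, a.sell_date
""").fetchall()

filter_in_on = conn.execute("""
SELECT a.customerID, a.sell_date, b.target_date
FROM tableA a
LEFT JOIN tableB b ON b.customerID = a.customerID
                  AND b.target_date <= a.sell_date
ORDER BY a.customerID, a.sell_date
""").fetchall()
print(filter_in_where)
print(filter_in_on)
```

Note that the ON version also keeps the sale made before the target date (with a NULL target_date), so an additional filter would still be needed if those rows should be dropped.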

SQL function to create a one-to-one match between two tables?

I am trying to join 2 tables. Table_A has ~145k rows whereas Table_B has ~205k rows.
They have two columns in common (i.e. ISIN and date). However, when I execute this query:
SELECT A.*,
B.column_name
FROM Table_A
JOIN
Table_B ON A.date = B.date
WHERE A.isin = B.isin
I get a table with more than 147k rows. How is it possible? Shouldn't it return a table with at most ~145k rows?
What you are seeing indicates that, for some of the records in Table_A, there are several records in Table_B that satisfy the join conditions (equality on the (date, isin) tuple).
To exhibit these records, you can do:
select B.date, B.isin
from Table_A A
join Table_B B on A.date = B.date and A.isin = B.isin
group by B.date, B.isin
having count(*) > 1
It's up to you to define how to handle those duplicates. For example:
if the duplicates have different values in column column_name, then you can decide to pull out the maximum or minimum value
or use another column to filter on the top or lower record within the duplicates
if the duplicates are true duplicates, then you can use select distinct in a subquery to dedup them before joining
... other solutions are possible ...
If you want one row per table A, then use outer apply:
SELECT A.*,
B.column_name
FROM Table_A a OUTER APPLY
(SELECT TOP (1) b.*
FROM Table_B b
WHERE A.date = B.date AND A.isin = B.isin
ORDER BY ? -- you can specify *which* row you want when there are duplicates
) b;
OUTER APPLY implements a lateral join. The TOP (1) ensures that at most one row is returned. The OUTER (as opposed to CROSS) ensures that nothing is filtered out. In this case, you could also phrase it as a correlated subquery.
All that said, your data does not seem to be what you really expect. You should figure out where the duplicates are coming from. The place to start is:
select b.date, b.isin, count(*)
from tableb b
group by b.date, b.isin
having count(*) >= 2;
This will show you the duplicates, so you can figure out what to do about them.
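A minimal sketch of how a duplicated key inflates the join, with made-up rows and a hypothetical price column, run through Python's sqlite3 module. Table_A has two rows, yet the join returns three because one (isin, date) pair occurs twice in Table_B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table_A (isin TEXT, date TEXT, price REAL);
CREATE TABLE Table_B (isin TEXT, date TEXT, column_name TEXT);
INSERT INTO Table_A VALUES ('X1', '2020-01-01', 1.0), ('X2', '2020-01-01', 2.0);
-- ('X1', '2020-01-01') occurs twice in Table_B
INSERT INTO Table_B VALUES ('X1', '2020-01-01', 'u'), ('X1', '2020-01-01', 'v'),
                           ('X2', '2020-01-01', 'w');
""")

joined = conn.execute("""
SELECT a.isin, a.date, b.column_name
FROM Table_A a
JOIN Table_B b ON a.date = b.date AND a.isin = b.isin
""").fetchall()

# Same duplicate check as above: keys that occur more than once in Table_B.
duplicates = conn.execute("""
SELECT b.date, b.isin, COUNT(*)
FROM Table_B b
GROUP BY b.date, b.isin
HAVING COUNT(*) >= 2
""").fetchall()
print(len(joined), duplicates)
```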
The possibility of duplicates has already been discussed.
When millions of records are used in a join, a poor cardinality estimate can often produce a bad plan
(this affects performance, though, not the number of rows returned).
For this, put both conditions in the join:
SELECT A.*,
B.column_name
FROM Table_A A
JOIN
Table_B B ON A.isin = B.isin
and
A.date = B.date
Also create a nonclustered index on both tables.
Create NonClustered index isin_date_table_A on Table_A(isin,date)include(*Table_A)
*Table_A = comma-separated list of the Table_A columns required in the result set
Create NonClustered index isin_date_table_B on Table_B(isin,date)include(column_nameA)
Update STATISTICS Table_A
Update STATISTICS Table_B
If you keep the DATE columns of both tables in the same format in the JOIN condition, you should get the expected result.
Select A.*, B.column_name
from Table_A A
join Table_B B on to_date(A.date,'DD-MON-YY') = to_date(B.date,'DD-MON-YY')
where A.isin = B.isin

SQL LEFT outer join with only some rows from the right?

I have two tables, TABLE_A and TABLE_B, joined on the employee number EMPNO.
I want to do a normal left outer join. However, TABLE_B has certain records that are soft-deleted (status='D'), and I want these to be included. Just to clarify: TABLE_B could have active records (status = null/a/anything) as well as deleted records, and in that case I don't want that employee in my result. If, however, there are only deleted records for the employee in TABLE_B, I want the employee to be included in the result. I hope I'm making my requirement clear. (I could do a lengthy qrslt kind of thingy and get what I want, but I figure there has to be a more optimized way of doing this using the join syntax.) I would appreciate any suggestions (even without the join). His newbness is trying the following query without the desired result:
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON TABLE_A.EMPNO = TABLE_B.EMPNO AND TABLE_B.STATUS<>'D'
Much appreciate any help.
Just to clarify -- all records from TABLE_A should appear, unless there are rows in TABLE_B with statuses other than 'D'?
You'll need at least one non-null column on B (I'll use 'B.ID' as an example, and this approach should work):
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON
(TABLE_A.EMPNO = TABLE_B.EMPNO)
AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
WHERE
TABLE_B.ID IS NULL
That is, reverse the logic you might think -- join onto TABLE_B only where you have rows that would exclude TABLE_A entries, and then use the IS NULL at the end to exclude those. This means that only those which didn't match (those with no row in TABLE_B, or with only 'D' rows) get included.
An alternative might be
SELECT TABLE_A.EMPNO
FROM TABLE_A
WHERE NOT EXISTS (
SELECT * FROM TABLE_B
WHERE TABLE_B.EMPNO = TABLE_A.EMPNO
AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
)
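The two queries above can be checked side by side on a small made-up data set. This is only a sketch, run through Python's sqlite3 module; the ID column is the assumed non-null column mentioned above, and the sample statuses are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (EMPNO INTEGER);
CREATE TABLE TABLE_B (ID INTEGER PRIMARY KEY, EMPNO INTEGER, STATUS TEXT);
INSERT INTO TABLE_A VALUES (1), (2), (3), (4);
-- 1: only deleted rows, 2: deleted and active rows,
-- 3: one active row with NULL status, 4: no rows in TABLE_B
INSERT INTO TABLE_B (EMPNO, STATUS) VALUES
    (1, 'D'), (1, 'D'), (2, 'D'), (2, 'A'), (3, NULL);
""")

# Join only onto the rows that should exclude an employee, then keep non-matches.
left_join_is_null = [r[0] for r in conn.execute("""
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON
    (TABLE_A.EMPNO = TABLE_B.EMPNO)
    AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
WHERE TABLE_B.ID IS NULL
ORDER BY TABLE_A.EMPNO
""")]

not_exists = [r[0] for r in conn.execute("""
SELECT TABLE_A.EMPNO
FROM TABLE_A
WHERE NOT EXISTS (
    SELECT * FROM TABLE_B
    WHERE TABLE_B.EMPNO = TABLE_A.EMPNO
      AND (TABLE_B.STATUS <> 'D' OR TABLE_B.STATUS IS NULL)
)
ORDER BY TABLE_A.EMPNO
""")]
print(left_join_is_null, not_exists)
```

Both forms return employees 1 and 4 here: the one with only deleted rows and the one with no rows in TABLE_B.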
The following query will get you the employee records that aren't deleted, or those where the employee has only deleted records.
select
a.*
from
table_a a
left join table_b b on
a.empno = b.empno
where
b.status <> 'D'
or (b.status = 'D' and
(select count(distinct status) from table_b where empno = a.empno) = 1)
This is in ANSI SQL, but if I knew your RDBMS, I could give a more specific solution that may be a bit more elegant.
ah crud, this apparently works ><
SELECT TABLE_A.EMPNO
FROM TABLE_A
LEFT OUTER JOIN TABLE_B ON TABLE_A.EMPNO = TABLE_B.EMPNO
where TABLE_B.STATUS<>'D'
If you guys have any extra info to chime in with though, please feel free.
UPDATE:
Saw this question again after some time and thought I'd add more helpful info: this link has good info regarding ANSI syntax - http://www.oracle-base.com/articles/9i/ANSIISOSQLSupport.php
In particular this part from the linked page is informative:
Extra filter conditions can be added to the join to using AND to form a complex join. These are often necessary when filter conditions are required to restrict an outer join. If these filter conditions are placed in the WHERE clause and the outer join returns a NULL value for the filter column the row would be thrown away. if the filter condition is coded as part of the join the situation can be avoided.
SELECT A.*, B.*
FROM
Table_A A
INNER JOIN Table_B B
ON A.EmpNo = B.EmpNo
WHERE
NOT EXISTS (
SELECT *
FROM Table_B X
WHERE
A.EmpNo = X.EmpNo
AND X.Status <> 'D'
)
I think this does the trick. The left join is not needed because you only want to include employees with all (and at least one) deleted rows.
This is how I understand the question. You need to include only those employees for which either of the following is true:
an employee has only (soft-)deleted rows in TABLE_B;
an employee has only non-deleted rows in TABLE_B;
an employee has no rows in TABLE_B at all.
In other words, if an employee has both deleted and non-deleted rows in TABLE_B, omit that employee, otherwise include them.
This is how I think it could be solved:
SELECT DISTINCT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b1 ON a.EMPNO = b1.EMPNO
LEFT JOIN TABLE_B b2 ON b1.EMPNO = b2.EMPNO
AND (b1.STATUS = 'D' AND (b2.STATUS <> 'D' OR b2 IS NULL) OR
b2.STATUS = 'D' AND (b1.STATUS <> 'D' OR b1 IS NULL))
WHERE b2.EMPNO /* or whatever non-nullable column there is */ IS NULL
Alternatively, though, you could use grouping:
SELECT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b ON a.EMPNO = b.EMPNO
GROUP BY a.EMPNO
HAVING 0 IN (COUNT(CASE b.STATUS WHEN 'D' THEN 1 ELSE NULL END),
COUNT(CASE b.STATUS WHEN 'D' THEN NULL ELSE 1 END))
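A quick sketch of how the grouping version behaves, using made-up rows and Python's sqlite3 module. It follows this answer's reading of the requirement, so employees with only non-deleted rows are kept as well; only an employee with both kinds of rows is dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TABLE_A (EMPNO INTEGER);
CREATE TABLE TABLE_B (EMPNO INTEGER, STATUS TEXT);
INSERT INTO TABLE_A VALUES (1), (2), (3), (4);
-- 1: only deleted rows, 2: deleted and active rows,
-- 3: only an active row (NULL status), 4: no rows at all
INSERT INTO TABLE_B VALUES (1, 'D'), (1, 'D'), (2, 'D'), (2, 'A'), (3, NULL);
""")

# Keep an employee when either the 'D' count or the non-'D' count is zero.
kept = [r[0] for r in conn.execute("""
SELECT a.EMPNO
FROM TABLE_A a
LEFT JOIN TABLE_B b ON a.EMPNO = b.EMPNO
GROUP BY a.EMPNO
HAVING 0 IN (COUNT(CASE b.STATUS WHEN 'D' THEN 1 ELSE NULL END),
             COUNT(CASE b.STATUS WHEN 'D' THEN NULL ELSE 1 END))
ORDER BY a.EMPNO
""")]
print(kept)
```

Employee 2 is the only one omitted in this sample, since it has both a deleted and a non-deleted row.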