Hive / SQL - Left join with fallback - sql

In Apache Hive I have two tables that I would like to left-join, keeping all the data from the left table and adding data from the right table where possible.
The join is based on two fields (a material_id and a location_id), so it uses two join conditions.
This works fine with a traditional left join:
SELECT
a.*,
b.*
FROM a
LEFT JOIN (some more complex select) b
ON a.material_id=b.material_id
AND a.location_id=b.location_id;
For the location_id the database only contains two distinct values, say 1 and 2.
We now have the requirement that if there is no "perfect match", i.e. the material_id can be joined but the b-table has no row with the right combination of material_id and location_id (e.g. material_id=100 and location_id=1), the join should "default" or "fall back" to the other possible value of the location_id (e.g. material_id=100 and location_id=2), and vice versa. This fallback should apply only to the location_id.
We have already looked into all the answers we could find, also with CASE etc., but to no avail. A setup like
...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;
is something we tried, but we could not figure out how to do it in the Hive query language.
Thank you for your help! Maybe somebody has a smart idea.
Here is some sample data:
Table a
| material_id | location_id | other_column_a |
| 100 | 1 | 45 |
| 101 | 1 | 45 |
| 103 | 1 | 45 |
| 103 | 2 | 45 |
Table b
| material_id | location_id | other_column_b |
| 100 | 1 | 66 |
| 102 | 1 | 76 |
| 103 | 2 | 88 |
Left - Join Table
| material_id | location_id | other_column_a | other_column_b
| 100 | 1 | 45 | 66
| 101 | 1 | 45 | NULL (mat. not in b)
| 103 | 1 | 45 | DEFAULT TO where location_id=2 (88)
| 103 | 2 | 45 | 88
PS: As stated here, EXISTS etc. does not work as a sub-query in the ON clause.

The solution is to left join without the a.location_id = b.location_id condition, number the joined rows in order of preference, and then filter by row_number. In the code below the join first duplicates rows, because every b row with a matching material_id is joined. The row_number() function then numbers the candidates for each row of a: an exact match (a.location_id = b.location_id) gets 1 if it exists, and a row with a different location_id gets 1 only when there is no exact match. b.location_id is added to the order by in row_number() so that, when there is no exact match, rows with the lower b.location_id are preferred. Note that the partition contains both a.material_id and a.location_id; with material_id alone, rows of a that share a material_id (such as 103/1 and 103/2) would compete for a single rn=1. I hope you have caught the idea.
select * from
(
SELECT
a.*,
b.*,
row_number() over(partition by a.material_id, a.location_id
order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
FROM a
LEFT JOIN (some more complex select) b
ON a.material_id=b.material_id
)s
where rn=1
;
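To check the idea against the sample data from the question, here is a minimal sketch that runs the same pattern in an in-memory SQLite database through Python's sqlite3 module. SQLite stands in for Hive purely for illustration (an assumption, and window functions need SQLite 3.25 or newer); the plain table b replaces the "more complex select", and the columns are listed explicitly so the derived table has no duplicate column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (material_id INTEGER, location_id INTEGER, other_column_a INTEGER);
CREATE TABLE b (material_id INTEGER, location_id INTEGER, other_column_b INTEGER);
INSERT INTO a VALUES (100, 1, 45), (101, 1, 45), (103, 1, 45), (103, 2, 45);
INSERT INTO b VALUES (100, 1, 66), (102, 1, 76), (103, 2, 88);
""")

query = """
SELECT material_id, location_id, other_column_a, other_column_b
FROM (
    SELECT a.material_id, a.location_id, a.other_column_a, b.other_column_b,
           -- one partition per row of a; exact location matches sort first
           ROW_NUMBER() OVER (
               PARTITION BY a.material_id, a.location_id
               ORDER BY CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END,
                        b.location_id
           ) AS rn
    FROM a
    LEFT JOIN b ON a.material_id = b.material_id
) s
WHERE rn = 1
ORDER BY material_id, location_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Row 103/1 picks up 88 from the location_id=2 row of b, and 101 keeps a NULL because the material is missing from b.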

Maybe this is helpful for somebody in the future:
We also came up with a different approach.
First, we create another table to calculate averages from the table b based on material_id over all (!) locations.
Second, in the join table we create three columns:
c1 - the value where material_id and location_id both match (the result of a left join of table a with table b). This column is NULL if there is no perfect match.
c2 - the value from the averages (fallback) table for this material_id, regardless of the location.
c3 - the "actual value" column, where a CASE statement decides: when c1 is NULL (there is no perfect match of material and location), we use the value from c2 (the average over all the other locations for the material) for the further calculations; otherwise we use c1.
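The three columns can be sketched as follows, again on the question's sample data in an in-memory SQLite database via Python (an illustration only, not the authors' actual Hive job). The names b_avg and avg_b are made up for this sketch, and the averages table is written as a CTE rather than a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (material_id INTEGER, location_id INTEGER, other_column_a INTEGER);
CREATE TABLE b (material_id INTEGER, location_id INTEGER, other_column_b INTEGER);
INSERT INTO a VALUES (100, 1, 45), (101, 1, 45), (103, 1, 45), (103, 2, 45);
INSERT INTO b VALUES (100, 1, 66), (102, 1, 76), (103, 2, 88);
""")

query = """
WITH b_avg AS (
    -- hypothetical fallback table: average per material over all locations
    SELECT material_id, AVG(other_column_b) AS avg_b
    FROM b
    GROUP BY material_id
)
SELECT a.material_id,
       a.location_id,
       b.other_column_b AS c1,          -- perfect match, NULL otherwise
       b_avg.avg_b      AS c2,          -- fallback value for the material
       CASE WHEN b.other_column_b IS NULL THEN b_avg.avg_b
            ELSE b.other_column_b END AS c3  -- the value actually used
FROM a
LEFT JOIN b
       ON a.material_id = b.material_id AND a.location_id = b.location_id
LEFT JOIN b_avg
       ON a.material_id = b_avg.material_id
ORDER BY a.material_id, a.location_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Unlike the row_number approach, the fallback here is an average, so with more than one other location the result is a blend rather than the value of one specific row.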

Related

Linking tables on multiple criteria

I've got myself in a bit of a mess on something I'm doing where I'm trying to get two tables linked together based on multiple bits of info.
I want to link one table to another based on these basic rules (in this hierarchy):
the main link is where orderid matches between the two tables,
only records from table 2 where valid=Y,
from those I want the valid records that have the highest seqn1 number, and from those the one that has the highest seqn2 value
table1
orderid | date | otherinfo
223344 | 22/10/2020 | okokkokokooeodijjf
table2
orderid | seqn1 | seqn2 | valid | additonaldata
223344 | 1 | 3 | y | sdfsfsf
223344 | 2 | 1 | y | sffferfr
223344 | 2 | 2 | y | sfrfrefr -- This row
223344 | 2 | 3 | n | rfrg66rr
223344 | 2 | 4 | n | adwere
223344 | 3 | 4 | n | adwere
So I would want the final record to be
orderid | date | otherinfo | seqn1 | seqn2 | valid | additonaldata
223344 | 22/10/2020 | okokkokokooeodijjf | 2 | 2 | y | sfrfrefr
I started off with the code below, but I'm not sure I'm doing it right, and I can't seem to get it to pay attention to the valid flag when I try to add it in.
SELECT * FROM table1
left JOIN table2
ON table1.orderid = table2.orderid
AND table2.seqn1 = (SELECT MAX(table2.seqn1) FROM table2 WHERE table1.orderid = table2.orderid)
AND table2.seqn2 = (SELECT MAX(table2.seqn2) FROM table2 WHERE table1.orderid = table2.orderid
AND table2.seqn1 = (SELECT MAX(table2.seqn1) FROM table2 WHERE table1.orderid = table2.orderid))
Could someone help me amend the code please.
Use the row_number analytic function with partition by orderid and order by the seqn columns in the order you need. No need for multiple subselects. To add more selection criteria for the single row, use CASE to map your values to numbers and order by them as well.
Fiddle here. The query below uses the names header and line for table1 and table2.
with l as (
select *,
row_number() over(partition by orderid order by seqn1 desc, seqn2 desc) as rn
from line
where valid = 'y'
)
select *
from header as h
join l
on h.orderid = l.orderid
and l.rn = 1
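As a quick check of this pattern on the question's own rows, here is a minimal sketch using the original table names (table1, table2) in an in-memory SQLite database via Python. SQLite is only a stand-in, since the question does not name its database, and a left join is used, as in the question's attempt, so that orders without any valid row would still appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (orderid INTEGER, date TEXT, otherinfo TEXT);
CREATE TABLE table2 (orderid INTEGER, seqn1 INTEGER, seqn2 INTEGER,
                     valid TEXT, additonaldata TEXT);
INSERT INTO table1 VALUES (223344, '22/10/2020', 'okokkokokooeodijjf');
INSERT INTO table2 VALUES
    (223344, 1, 3, 'y', 'sdfsfsf'),
    (223344, 2, 1, 'y', 'sffferfr'),
    (223344, 2, 2, 'y', 'sfrfrefr'),
    (223344, 2, 3, 'n', 'rfrg66rr'),
    (223344, 2, 4, 'n', 'adwere'),
    (223344, 3, 4, 'n', 'adwere');
""")

query = """
WITH ranked AS (
    -- only valid rows take part; the best one per order gets rn = 1
    SELECT t2.*,
           ROW_NUMBER() OVER (PARTITION BY orderid
                              ORDER BY seqn1 DESC, seqn2 DESC) AS rn
    FROM table2 t2
    WHERE valid = 'y'
)
SELECT t1.orderid, t1.date, t1.otherinfo,
       r.seqn1, r.seqn2, r.valid, r.additonaldata
FROM table1 t1
LEFT JOIN ranked r
       ON r.orderid = t1.orderid AND r.rn = 1
"""

rows = conn.execute(query).fetchall()
print(rows)
```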
How about something like this:
;
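with max1 as
(
SELECT orderid
,MAX(seqn1) as seqn1
FROM table2
where valid = 'y'
group by orderid
),
max2 as
(
-- take the highest seqn2 only among the valid rows that carry the highest seqn1;
-- two independent MAX() calls could combine values from different rows
SELECT t.orderid
,t.seqn1
,MAX(t.seqn2) as seqn2
FROM table2 t
inner join max1 m on m.orderid = t.orderid and m.seqn1 = t.seqn1
where t.valid = 'y'
group by t.orderid, t.seqn1
)
SELECT
t1.*
,t3.additonaldata
--,t3.[OtherFields]
from table1 t1
inner join max2 t2 on t1.orderid = t2.orderid -- first match on id
left join table2 t3 on t3.orderid = t2.orderid and t3.seqn1 = t2.seqn1 and t3.seqn2 = t2.seqn2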

In a PostgreSQL query how to filter results from a field in a join

My table structure is as follows:
Restaurants
| id | name |
|----|---------|
| 1 | The Hut |
| 2 | T Burger|
Dishes:
| id | name |
|----|---------|
| 1 | Pizza |
| 2 | Caramel |
Orders:
| id | locatio |
|----|---------|
| 1 | New York|
| 2 | London |
_RestaurantDishes:
In here, B represents restaurants and A represents dishes.
A restaurant can have many dishes. A dish can only have one restaurant.
| id | A | B |
|----|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
_DishOrders:
In here, B represents orders and A represents dishes.
A dish can have many orders. An order can have many dishes.
| id | A | B |
|----|---|---|
| 1 | 1 | 1 |
| 2 | 1 | 2 |
What I want to do is, get a list of dishes from a selected list of restaurants and sort them according to the orders count. I tried to do it like this:
SELECT count(dishOrder.id) as myCount, dish.id, name
FROM "default$default"."Dish" dish
left join "default$default"."_RestaurantDishes" dishRestaurant on dish.id = dishRestaurant."A"
left join "default$default"."_DishOrders" dishOrder on dish.id = dishorder."A"
where "dishRestaurant"."B" in ("1", "2")
group by dish.id order by mycount desc;
But it gives me the error ERROR: missing FROM-clause entry for table "dishRestaurant". I tried many other approaches but didn't work.
You are getting the error ERROR: missing FROM-clause entry for table "dishRestaurant" because the alias is written in double quotes in the where clause. PostgreSQL folds the unquoted alias dishRestaurant to lower case, while the quoted "dishRestaurant" is case-sensitive, so the two names do not match.
You also used double quotes around the numbers in ("1", "2"), which turns them into column names. You have to change that to in (1, 2).
SELECT count(dishOrder.id) as myCount, dish.id, name
FROM "default$default"."Dish" dish
left join "default$default"."_RestaurantDishes" dishRestaurant on dish.id = dishRestaurant."A"
left join "default$default"."_DishOrders" dishOrder on dish.id = dishorder."A"
where dishRestaurant."B" in (1, 2)
group by dish.id ,name
order by myCount desc;
You also selected dish.id, name but did not list name in the group by, so you have to add it there.
Don't escape identifiers if you can avoid it. The escaping makes the identifier case-sensitive -- and strange things happen.
I suspect that you want:
select count(*) as myCount, d.id, d.name
from "default$default"."Dish" d left join
"default$default"."_RestaurantDishes" dr
on dr."A" = d.id and
dr."B" in (1, 2) left join
"default$default"."_DishOrders" o -- "do" is a reserved word in PostgreSQL
on d.id = o."A"
group by d.id, d.name
order by mycount desc;
Notes:
I simplified the table aliases and removed the double quotes from the aliases defined in the query.
You should remove the double quotes from the column names -- if you can.
I added d.name back into the group by. This is not needed if d.id is unique.
I use count(*), so no columns are referenced.
I moved the filtering condition to the on clause. This keeps the left join intact.
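The last note, filtering in the on clause versus the where clause, is easy to see on a small example. The sketch below uses an in-memory SQLite database via Python rather than PostgreSQL (an assumption made so it runs anywhere), with simplified, hypothetical table and column names (dish, restaurant_dishes, a, b) in place of the quoted ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dish (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE restaurant_dishes (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
INSERT INTO dish VALUES (1, 'Pizza'), (2, 'Caramel');
INSERT INTO restaurant_dishes VALUES (1, 1, 1), (2, 1, 2);
""")

# Filter in WHERE: rows whose dr.b is NULL are discarded after the join,
# so the left join effectively behaves like an inner join.
filter_in_where = conn.execute("""
    SELECT d.name, COUNT(dr.id)
    FROM dish d
    LEFT JOIN restaurant_dishes dr ON dr.a = d.id
    WHERE dr.b IN (1, 2)
    GROUP BY d.id, d.name
    ORDER BY d.id
""").fetchall()

# Filter in ON: the condition only limits which rows are joined,
# so every dish is kept, with a count of 0 when nothing matches.
filter_in_on = conn.execute("""
    SELECT d.name, COUNT(dr.id)
    FROM dish d
    LEFT JOIN restaurant_dishes dr ON dr.a = d.id AND dr.b IN (1, 2)
    GROUP BY d.id, d.name
    ORDER BY d.id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

Which behaviour is wanted depends on the question being asked: the on-clause version lists every dish, while the where-clause version lists only dishes sold by the selected restaurants.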

Comparing different columns in SQL for each row

After some transformation I have a result from a cross join (of tables a and b) that I want to do some analysis on. The table looks like this:
+-----+------+------+------+------+-----+------+------+------+------+
| id | 10_1 | 10_2 | 11_1 | 11_2 | id | 10_1 | 10_2 | 11_1 | 11_2 |
+-----+------+------+------+------+-----+------+------+------+------+
| 111 | 1 | 0 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
| 111 | 1 | 0 | 1 | 0 | 333 | 0 | 0 | 0 | 0 |
| 111 | 1 | 0 | 1 | 0 | 444 | 1 | 0 | 1 | 1 |
| 112 | 0 | 1 | 1 | 0 | 222 | 1 | 0 | 1 | 0 |
+-----+------+------+------+------+-----+------+------+------+------+
The ids in the first column are different from the ids in the sixth column.
In a row are always two different IDs that are matched with each other. The other columns always have either 0 or 1 as a value.
I am now trying to find out how many values two IDs have in common on average (meaning both have "1" in 10_1, 10_2 etc.), but I don't really know how to do so.
I was trying something like this as a start:
SELECT SUM(CASE WHEN a.10_1 = 1 AND b.10_1 = 1 then 1 end)
But this would obviously only count how often two ids have 10_1 in common. I could make something like this for example for different columns:
SELECT SUM(CASE WHEN (a.10_1 = 1 AND b.10_1 = 1)
OR (a.10_2 = 1 AND b.10_2 = 1) OR [...] then 1 end)
This counts in general how often two IDs have at least one thing in common, but it would of course also count pairs that have two or more things in common. Plus, I would also like to know how often two IDs have two things, three things etc. in common.
One "problem" in my case is also that I have like ~30 columns I want to look at, so I can hardly write down for each case every possible combination.
Does anyone know how I can approach my problem in a better way?
Thanks in advance.
Edit:
A possible result could look like this:
+-----------+---------+
| in_common | count |
+-----------+---------+
| 0 | 100 |
| 1 | 500 |
| 2 | 1500 |
| 3 | 5000 |
| 4 | 3000 |
+-----------+---------+
With the codes as column names, you're going to have to write some code that explicitly references each column name. To keep that to a minimum, you could write those references in a single union statement that normalizes the data, such as:
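-- tbl is a placeholder for the source table
select id, '10_1' as code from tbl where "10_1" = 1
union
select id, '10_2' as code from tbl where "10_2" = 1
union
select id, '11_1' as code from tbl where "11_1" = 1
union
select id, '11_2' as code from tbl where "11_2" = 1;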
This needs to be modified to include whatever additional columns you need to link up different IDs. For the purpose of this illustration, I assume the following data model
create table p (
id integer not null primary key,
sex character(1) not null,
age integer not null
);
create table t1 (
id integer not null,
code character varying(4) not null,
constraint pk_t1 primary key (id, code)
);
Though your data evidently does not currently resemble this structure, normalizing your data into a form like this would allow you to apply the following solution to summarize your data in the desired form.
select
in_common,
count(*) as count
from (
select
count(*) as in_common
from (
select
a.id as a_id, a.code,
b.id as b_id, b.code
from
(select p.*, t1.code
from p left join t1 on p.id=t1.id
) as a
inner join (select p.*, t1.code
from p left join t1 on p.id=t1.id
) as b on b.sex <> a.sex and b.age between a.age-10 and a.age+10
where
a.id < b.id
and a.code = b.code
) as c
group by
a_id, b_id
) as summ
group by
in_common;
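A condensed sketch of the same normalize-then-join idea, applied to two small tables a and b, might look like this. It runs in an in-memory SQLite database via Python; the table contents are assumed from the ids shown in the question's sample (ids 111 and 112 in a; 222, 333 and 444 in b), the sex/age model above is left out, and a cross join plus left joins is swapped in for the inner join so that pairs with nothing in common are counted as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, "10_1" INTEGER, "10_2" INTEGER, "11_1" INTEGER, "11_2" INTEGER);
CREATE TABLE b (id INTEGER, "10_1" INTEGER, "10_2" INTEGER, "11_1" INTEGER, "11_2" INTEGER);
INSERT INTO a VALUES (111, 1, 0, 1, 0), (112, 0, 1, 1, 0);
INSERT INTO b VALUES (222, 1, 0, 1, 0), (333, 0, 0, 0, 0), (444, 1, 0, 1, 1);
""")

codes = ["10_1", "10_2", "11_1", "11_2"]

def unpivot(table):
    """Build the UNION ALL that turns each flagged column into an (id, code) row."""
    return " UNION ALL ".join(
        f"""SELECT id, '{c}' AS code FROM {table} WHERE "{c}" = 1""" for c in codes
    )

query = f"""
WITH a_long AS ({unpivot('a')}),
     b_long AS ({unpivot('b')}),
     pairs AS (
         -- every pair of ids; COUNT(bl.code) counts codes set on both sides
         SELECT a.id AS a_id, b.id AS b_id, COUNT(bl.code) AS in_common
         FROM a
         CROSS JOIN b
         LEFT JOIN a_long al ON al.id = a.id
         LEFT JOIN b_long bl ON bl.id = b.id AND bl.code = al.code
         GROUP BY a.id, b.id
     )
SELECT in_common, COUNT(*) AS count
FROM pairs
GROUP BY in_common
ORDER BY in_common
"""

rows = conn.execute(query).fetchall()
print(rows)
```

Because the column list lives in one Python list, extending this to around 30 columns only means extending codes.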
The proposed solution requires first to take one step back from the cross-join table, as the identical column names are super annoying. Instead, we take the ids from the two tables and put them in a temporary table. The following query gets the result wanted in the question. It assumes table_a and table_b from the question are the same and called tbl, but this assumption is not needed and tbl can be replaced by table_a and table_b in the two sub-SELECT queries. It looks complicated and uses the JSON trick to flatten the columns, but it works here:
WITH idtable AS (
SELECT a.id as id_1, b.id as id_2 FROM
-- put cross join of table a and table b here
)
SELECT in_common,
count(*)
FROM
(SELECT idtable.*,
sum(CASE
WHEN meltedR.value::text=meltedL.value::text THEN 1
ELSE 0
END) AS in_common
FROM idtable
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_a
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedL ON (idtable.id_1 = meltedL.id)
JOIN
(SELECT tbl.id,
b.*
FROM tbl, -- change here to table_b
json_each(row_to_json(tbl)) b -- and here too
WHERE KEY<>'id' ) meltedR ON (idtable.id_2 = meltedR.id
AND meltedL.key = meltedR.key)
GROUP BY idtable.id_1,
idtable.id_2) tt
GROUP BY in_common ORDER BY in_common;
The output here looks like this:
in_common | count
-----------+-------
2 | 2
3 | 1
4 | 1
(3 rows)

How to query 2 tables in sql server with many to many relationship to identify differences

I have two tables with a many to many relationship and I am trying to merge the 2 tables in a select statement. I want to see all of the records from both tables, but only match 1 record from table A to 1 record in table B, so null values are ok.
For example table A has 20 records that match only 15 records from table B. I want to see all 20 records, the 5 that are unable to be matched can show null.
Table 1
Something | Code#
apple | 75
pizza | 75
orange | 6
Ball | 75
green | 4
red | 6
Table 2
date | id#
Feb-15 | 75
Feb-11 | 75
Jan-10 | 6
Apr-08 | 4
The result I need is
Something | Date | Code# | ID#
apple | Feb-15 | 75 | 75
pizza | Feb-11 | 75 | 75
orange | Jan-10 | 6 | 6
Ball | NULL | 75 | NULL
green | Apr-08 | 4 | 4
red | NULL | 6 | NULL
I'm imagining something like this. You want to pair off the rows side by side, but one side is going to have more rows than the other.
select * /* change to whatever you need */
from
(
select *, row_number() over (partition by "code#" order by "something") as rn
from tableA
) as a
full outer join /* sounds like maybe left outer join will work too */
(
select *, row_number() over (partition by "id#" order by "date" desc) as rn
from tableB
) as b
on b."id#" = a."code#" and b.rn = a.rn
Actually I don't know how you're going to get "ball" to come after "apple" and "pizza" without some other column to sort on. Rows in SQL tables don't have any ordering and you can't rely on the default listing from select *... or assume that the order of insertion is significant.
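To illustrate both points, here is a minimal sketch in an in-memory SQLite database via Python (standing in for SQL Server, an assumption made for runnability). It adds a hypothetical seq column to each table to supply the missing sort order, renames Code# and id# to code and id, and uses the left outer join variant mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- seq is a hypothetical column that records the intended row order
CREATE TABLE table_a (seq INTEGER, something TEXT, code INTEGER);
CREATE TABLE table_b (seq INTEGER, date TEXT, id INTEGER);
INSERT INTO table_a VALUES
    (1, 'apple', 75), (2, 'pizza', 75), (3, 'orange', 6),
    (4, 'Ball', 75), (5, 'green', 4), (6, 'red', 6);
INSERT INTO table_b VALUES
    (1, 'Feb-15', 75), (2, 'Feb-11', 75), (3, 'Jan-10', 6), (4, 'Apr-08', 4);
""")

query = """
SELECT a.something, b.date, a.code, b.id
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY code ORDER BY seq) AS rn
    FROM table_a
) AS a
LEFT JOIN (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq) AS rn
    FROM table_b
) AS b
    -- the n-th row for a code pairs with the n-th row for the same id
    ON b.id = a.code AND b.rn = a.rn
ORDER BY a.seq
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With seq in place, "Ball" is the third row for code 75 and finds no third row for id 75, so it gets NULLs, which matches the result requested in the question.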
A regular Left-join should do it for you.
select tableA.*
, tableB.*
from tableA
left join tableB
on tableB.PrimaryKey = tableA.PrimaryKey
we would need to see the table structure to tell you for sure, but essentially you join on the full key (if possible)
SELECT * FROM TABLEA A
JOIN TABLEB B ON
A.FULLKEY = B.FULLKEY
Left outer join
Question changed
Make that a full outer join
select table1.*, table2.*
from table1
full outer join table2
on table1.Code# = table2.id#
This is probably not a true many to many but I think this is what you are asking for

Oracle join on first row of a subquery

This may seem simple, but somehow it isn't. I have a table of historical rate data called TBL_A that looks like this:
| id | rate | added_date |
|--------|--------|--------------|
| bill | 7.50 | 1/24/2011 |
| joe | 8.50 | 5/3/2011 |
| ted | 8.50 | 4/17/2011 |
| bill | 9.00 | 9/29/2011 |
In TBL_B, I have hours that need to be joined to a single row of TBL_A in order to get costing info:
| id | hours | added_date |
|--------|---------|--------------|
| bill | 10 | 2/26/2011 |
| ted | 4 | 7/4/2011 |
| bill | 9 | 10/14/2011 |
As you can see, for Bill there are two rates in TBL_A, but they have different dates. To properly get Bill's cost for a period of time, you have to join each row of TBL_B to the row in TBL_A that is appropriate for the date.
I figured this would be easy; because this didn't have to be an exceptionally fast query, I could just do a separate subquery for each row of costing info. However, joined subqueries apparently cannot "see" other tables that they are joined on. This query throws an invalid identifier (ORA-00904) on anything in the subquery that has the "h" alias:
SELECT h.id, r.rate * h.hours as "COST", h.added_date
FROM TBL_B h
JOIN (SELECT * FROM (
SELECT i.id, i.rate
FROM TBL_A i
WHERE i.id = h.id and i.added_date < h.added_date
ORDER BY i.added_date DESC)
WHERE rownum = 1) r
ON h.id = r.id
If the problem is simply scoping, I don't know if the approach I took can ever work. But all I'm trying to do here is get a single row based on some criteria, so I'm definitely open to other methods.
EDIT: The desired output would be this:
| id | cost | added_date |
|--------|---------|--------------|
| bill | 75 | 2/26/2011 |
| ted | 34 | 7/4/2011 |
| bill | 81 | 10/14/2011 |
Note that Bill has two different rates in the two entries in the table. The first row is 10 * 7.50 = 75 and the second row is 9 * 9.00 = 81.
Try using not exists:
select
b.id,
a.rate,
b.hours,
a.rate*b.hours as "COST",
b.added_date,
a.added_date
from
tbl_b b
inner join tbl_a a on
b.id = a.id
where
a.added_date < b.added_date
and not exists (
select
1
from
tbl_a a2
where
a2.added_date > a.added_date
and a2.added_date < b.added_date
and a2.id = a.id
)
As an explanation why this is happening: Only correlated subqueries are aware of the context in which they're being run, since they're run for each row. A joined subquery is actually executed prior to the join, and so it has no knowledge of the surrounding tables. You need to return all identifying information with it to make the join in the top level of the query, rather than trying to do it within the subquery.
select id, cost, added_date from (
select
h.id,
r.rate * h.hours as "COST",
h.added_date,
-- For each record, assign r=1 for 'newest' rate
row_number() over (partition by h.id, h.added_date order by r.added_date desc) r
from
tbl_b h,
tbl_a r
where
r.id = h.id and
-- Date of rate must be entered before
-- hours billed:
r.added_date < h.added_date
)
where r = 1
;
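For a quick check against the desired output, the same row_number pattern can be run on the question's data in an in-memory SQLite database via Python. This is only a sketch: SQLite stands in for Oracle, the dates are stored as ISO strings (an assumption, so that string comparison orders them correctly), and the row-number column is called rn here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_a (id TEXT, rate REAL, added_date TEXT);
CREATE TABLE tbl_b (id TEXT, hours INTEGER, added_date TEXT);
INSERT INTO tbl_a VALUES
    ('bill', 7.50, '2011-01-24'), ('joe', 8.50, '2011-05-03'),
    ('ted', 8.50, '2011-04-17'), ('bill', 9.00, '2011-09-29');
INSERT INTO tbl_b VALUES
    ('bill', 10, '2011-02-26'), ('ted', 4, '2011-07-04'), ('bill', 9, '2011-10-14');
""")

query = """
SELECT id, cost, added_date
FROM (
    SELECT h.id,
           r.rate * h.hours AS cost,
           h.added_date,
           -- rn = 1 marks the newest rate entered before the hours were billed
           ROW_NUMBER() OVER (PARTITION BY h.id, h.added_date
                              ORDER BY r.added_date DESC) AS rn
    FROM tbl_b h
    JOIN tbl_a r
      ON r.id = h.id AND r.added_date < h.added_date
)
WHERE rn = 1
ORDER BY added_date
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the join is an inner join, hours with no earlier rate would drop out of the result; a left join would keep them with a NULL cost.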