Oracle join on first row of a subquery - sql

This may seem simple, but somehow it isn't. I have a table of historical rate data called TBL_A that looks like this:
| id | rate | added_date |
|--------|--------|--------------|
| bill | 7.50 | 1/24/2011 |
| joe | 8.50 | 5/3/2011 |
| ted | 8.50 | 4/17/2011 |
| bill | 9.00 | 9/29/2011 |
In TBL_B, I have hours that need to be joined to a single row of TBL_A in order to get costing info:
| id | hours | added_date |
|--------|---------|--------------|
| bill | 10 | 2/26/2011 |
| ted | 4 | 7/4/2011 |
| bill | 9 | 10/14/2011 |
As you can see, for Bill there are two rates in TBL_A, but they have different dates. To properly get Bill's cost for a period of time, you have to join each row of TBL_B on an row in TBL_A that is appropriate for the date.
I figured this would be easy; because this didn't have to an exceptionally fast query, I could just do a separate subquery for each row of costing info. However, joined subqueries apparently cannot "see" other tables that they are joined on. This query throws an invalid identifier (ORA-00904) on anything in the subquery that has the "h" alias:
SELECT h.id, r.rate * h.hours as "COST", h.added_date
FROM TBL_B h
JOIN (SELECT * FROM (
SELECT i.id, i.rate
FROM TBL_A i
WHERE i.id = h.id and i.added_date < h.added_date
ORDER BY i.added_date DESC)
WHERE rownum = 1) r
ON h.id = r.id
If the problem is simply scoping, I don't know if the approach I took can ever work. But all I'm trying to do here is get a single row based on some criteria, so I'm definitely open to other methods.
EDIT: The desired output would be this:
| id | cost | added_date |
|--------|---------|--------------|
| bill | 75 | 2/26/2011 |
| ted | 34 | 7/4/2011 |
| bill | 81 | 10/14/2011 |
Note that Bill has two different rates in the two entries in the table. The first row is 10 * 7.50 = 75 and the second row is 9 * 9.00 = 81.

Try using not exists:
select
b.id,
a.rate,
b.hours,
a.rate*b.hours as "COST",
b.added_date,
a.added_date
from
tbl_b b
inner join tbl_a a on
b.id = a.id
where
a.added_date < b.added_date
and not exists (
select
1
from
tbl_a a2
where
a2.added_date > a.added_date
and a2.added_date < b.added_date
and a2.id = a.id
)
As an explanation why this is happening: Only correlated subqueries are aware of the context in which they're being run, since they're run for each row. A joined subquery is actually executed prior to the join, and so it has no knowledge of the surrounding tables. You need to return all identifying information with it to make the join in the top level of the query, rather than trying to do it within the subquery.

select id, cost, added_date from (
select
h.id,
r.rate * h.hours as "COST",
h.added_date,
-- For each record, assign r=1 for 'newest' rate
row_number() over (partition by h.id, h.added_date order by r.added_date desc) r
from
tbl_b h,
tbl_a r
where
r.id = h.id and
-- Date of rate must be entered before
-- hours billed:
r.added_date < h.added_date
)
where r = 1
;

Related

SQL Performance Inner Join

Let me ask you something I've been thinking about for a while. Imagine that you have two tables with data:
MAIN TABLE (A)
| ID | Date |
|:-----------|------------:|
| 1 | 01-01-1990|
| 2 | 01-01-1991|
| 3 | 01-01-1992|
| 4 | 01-01-2000|
| 5 | 01-01-2001|
| 6 | 01-01-2003|
SECONDARY TABLE (B)
| ID | Date | TOTAL |
|:-----------|------------:|--------:|
| 1 | 01-01-1990| 1 |
| 2 | 01-01-1991| 2 |
| 3 | 01-01-1992| 1 |
| 4 | 01-01-2000| 5 |
| 5 | 01-01-2001| 8 |
| 6 | 01-01-2003| 7 |
and you want to select only ID with date greater than 31-12-1999 and get the following columns: ID, Date and Total. For that we have many options but my question would be which of the following would be better in terms of performance:
OPTION 1
With main as(
select id,
date
from A
where date > '31-12-1999'
)
select main.id,
main.date,
B.total
from main inner join B on main.id = b.id
OPTION 1
With main as(
select id,
date
from A
where date > '31-12-1999'
),
secondary as (
select id,
total
from B
where date > '31-12-1999'
)
select main.id,
main.date,
secondary.total
from main inner join secondary on main.id = b.id
Which of both queries would be better in terms of performance? Thanks in advance!
DATE FOR BOTH TABLES MEANS THE SAME
You don't need to use CTE you can directly join two tables -
select A.id,
A.date,
B.total
from A inner join B on A.id = b.id
where A.date > '31-12-1999'
You would need to test on your data. But there is really no need for CTEs:
select a.id a.date, b.total
from a inner join
b
on a.id = b.id
where a.date > '1999-12-31' and b.date > '1999-12-31';
As for your specific question, the two queries are not the same, because the first is filtering on only one date and the second is filtering on two dates. You should run the query that implements the logic that you intend.

MAX plus other data with JOIN

This should be a simple one I reckon (at least I thought it would be when I started doing it a couple hours ago).
I am trying to select the MAX value from one table and join it with another to get the pertinent data.
I have two tables: ACCOUNTS & ACCOUNT_BALANCES
Here they are:
ACCOUNTS
ACC_ID | NAME | IMG_LOCATION
------------------------------------
0 | Cash | images/cash.png
500 | MyBank | images/mybank.png
and
ACCOUNT_BALANCES
ACC_ID | BALANCE | UPDATE_DATE
-------------------------------
500 | 100 | 2017-11-10
500 | 250 | 2018-01-11
0 | 100 | 2018-01-05
I would like the end result to look like:
ACC_ID | NAME | IMG_LOCATION | BALANCE | UPDATE_DATE
----------------------------------------------------------------
0 | Cash | images/cash.png | 100 | 2018-01-05
500 | MyBank | images/mybank.png | 250 | 2018-01-11
I thought I could select the MAX(UPDATE_DATE) from the ACCOUNT_BALANCES table, and join with the ACCOUNTS table to get the account name (as displayed above), but having to group by means my end result includes all records from the ACCOUNT_BALANCES table.
I can use this query to select only the records from ACCOUNT_BALANCES with the max UPDATE_DATE, but I can't include the balance.
SELECT
a.ACC_ID,
a.IMG_LOCATION,
a.NAME,
x.UDATE
FROM
ACCOUNTS a
RIGHT JOIN
(
SELECT
b.ACC_ID,
MAX(b.UPDATE_DATE) as UDATE
FROM
ACCOUNT_BALANCES b
GROUP BY
b.ACC_ID
) x
ON
a.ACC_ID = x.ACC_ID
If I include ACCOUNT_BALANCES.BALANCE in the above query (like so):
SELECT
a.ACC_ID,
a.IMG_LOCATION,
a.NAME,
x.UDATE
FROM
ACCOUNTS a
RIGHT JOIN
(
SELECT
b.ACC_ID,
b.BALANCE,
MAX(b.UPDATE_DATE) as UDATE
FROM
ACCOUNT_BALANCES b
GROUP BY
b.ACC_ID, b.BALANCE
) x
ON
a.ACC_ID = x.ACC_ID
The results returned looks like this:
ACC_ID | NAME | IMG_LOCATION | BALANCE | UPDATE_DATE
----------------------------------------------------------------
0 | Cash | images/cash.png | 100 | 2018-01-05
500 | MyBank | images/mybank.png | 100 | 2018-01-11
500 | MyBank | images/mybank.png | 250 | 2018-01-11
Which is obviously the case, since I'm grouping by BALANCE in the subquery.
I'm super hesitant to post this, as this seems exactly the kind of questions that's been answered n times, but I've searched around a lot, but couldn't find anything that really helped me.
This one has a really good answer, but isn't exactly what I'm looking for
I'm obviously missing something really simple and even pointers in the right direction will help. Thank you.
Try This
;WITH CTE
AS
(
SELECT
RN = ROW_NUMBER() OVER(PARTITION BY AC.ACC_ID ORDER BY AB.UPDATE_DATE DESC),
AC.ACC_ID,
NAME,
IMG_LOCATION,
BALANCE,
UPDATE_DATE
FROM ACCOUNTS AC
INNER JOIN ACCOUNT_BALANCES AB
ON AC.ACC_ID = AB.ACC_ID
)
SELECT
*
FROM CTE
WHERE RN = 1
I think the simplest method is outer apply:
select a.*, ab.*
from accounts a outer apply
(select top 1 ab.*
from account_balances ab
where ab.acc_id = a.acc_id
order by ab.update_date desc
) ab;
apply implements what is technically known as a "lateral join". This is a very powerful type of join -- something like a generalization of correlated subqueries.
The accepted answer will work. However, for large data sets try to avoid using the row_number() function. Instead, you can try something like this (assuming update_date is part of a unique constraint):
SELECT
a.Acc_Id,
a.Name,
a.Img_Location,
bDetail.Balance,
bDetail.Update_Date
FROM
#accounts AS a LEFT JOIN
(
SELECT Acc_Id, MAX(Update_Date) AS Update_Date
FROM #account_balances AS b
GROUP BY Acc_Id
) AS maxDate ON a.Acc_Id = maxDate.Acc_Id
LEFT JOIN #account_balances AS bDetail ON maxDate.Acc_Id = bDetail.Acc_Id AND
maxDate.Update_Date = bDetail.Update_Date

Join table A on table B and select only the first occurrence from B after specific date from table A

I'm trying to determine the best way to do the following.... Table a has a specific start_date. table b has a bunch of dollar amounts with various dates based on payments received and when. I only want to show the row from table b with the first date occurrence >= the start_date from table a. I also do not want to retrieve duplicates ID numbers which is what I am encountering now.
I have something like this so far...
Select a.ID, a.Start_Date
From a
Left Join (Select ID, Min(Recd_Dt) as Mindate, Total_Recd
Group by ID, Total_Recd) b on a.ID = b.ID and a.Start_Date <= b.Mindate
table a looks like this...
ID | Start_Dt
1 | 11/2/2017
2 | 11/3/2017
table b looks like this...
ID | Recd_Dt | Total_Recd
1 | 11/1/2017 | $600
1 | 11/10/2017 | $800
1 | 11/19/2017 | $100
2 | 11/2/2017 | $200
2 | 11/5/2017 | $600
2 | 11/6/2017 | $100
Id Like to see something like this...
ID | Recd_Dt | Total_Recd | Sum_of_Total_Recd_After_Start
1 | 11/10/2017 | $800 | $900
2 | 11/5/2017 | $600 | $700
furthermore, I'd like to also have a second join on the same table b that will give me a sum of any amount that occurred after the Start_Date
Give this a try:
SELECT
a.ID,
b.Recd_Dt,
b.Total_Recd,
SUM(Total_Recd) OVER(PARTITION BY a.ID) AS Sum_of_Total_Recd_After_Start
FROM a
INNER JOIN b ON a.ID = b.ID AND b.Recd_Dt > a.Start_Dt
QUALIFY ROW_NUMBER() OVER(PARTITION BY a.ID ORDER BY b.Start_Dt) = 1
1) Get all rows from table "a"
2) Get related rows from table "b" with Recd_Dt > Start_Dt
3) ROW_NUMBER orders rows by the earliest Start_Dt per each ID
4) QUALIFY ... = 1 keeps only the first row per ID grouping
5) SUM(Total_Recd) adds up the Total_Recd column per each ID grouping
I haven't tested it, but let me know if it works.

Hive / SQL - Left join with fallback

In Apache Hive I have to tables I would like to left-join keeping all the data from the left data and adding data where possible from the right table.
For this I use two joins, because the join is based on two fields (a material_id and a location_id).
This works fine with two traditional left joins:
SELECT
a.*,
b.*
FROM a
INNER JOIN (some more complex select) b
ON a.material_id=b.material_id
AND a.location_id=b.location_id;
For the location_id the database only contains two distinct values, say 1 and 2.
We now have the requirement that if there is no "perfect match", this means that only the material_id can be joined and there is no correct combination of material_id and location_id (e.g. material_id=100 and location_id=1) for the join for the location_id in the b-table, the join should "default" or "fallback" to the other possible value of the location_id e.g. material_id=001 and location_id=2 and vice versa. This should only be the case for the location_id.
We have already looked into all possible answers also with CASE etc. but to no prevail. A setup like
...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;
we tried or did not figure out how really to do in hive query language.
Thank you for your help! Maybe somebody has a smart idea.
Here is some sample data:
Table a
| material_id | location_id | other_column_a |
| 100 | 1 | 45 |
| 101 | 1 | 45 |
| 103 | 1 | 45 |
| 103 | 2 | 45 |
Table b
| material_id | location_id | other_column_b |
| 100 | 1 | 66 |
| 102 | 1 | 76 |
| 103 | 2 | 88 |
Left - Join Table
| material_id | location_id | other_column_a | other_column_b
| 100 | 1 | 45 | 66
| 101 | 1 | 45 | NULL (mat. not in b)
| 103 | 1 | 45 | DEFAULT TO where location_id=2 (88)
| 103 | 2 | 45 | 88
PS: As stated here exists etc. does not work in the sub-query ON.
The solution is to left join without a.location_id = b.location_id and number all rows in order of preference. Then filter by row_number. In the code below the join will duplicate rows first because all matching material_id will be joined, then row_number() function will assign 1 to rows where a.location_id = b.location_id and 2 to rows where a.location_id <> b.location_id if exist also rows where a.location_id = b.location_id and 1 if there are not exist such. b.location_id added to the order by in the row_number() function so it will "prefer" rows with lower b.location_id in case there are no exact matching. I hope you have caught the idea.
select * from
(
SELECT
a.*,
b.*,
row_number() over(partition by material_id
order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
FROM a
LEFT JOIN (some more complex select) b
ON a.material_id=b.material_id
)s
where rn=1
;
Maybe this is helpful for somebody in the future:
We also came up with a different approach.
First, we create another table to calculate averages from the table b based on material_id over all (!) locations.
Second, In the join table we create three columns:
c1 - the value where material_id and location_id are matching (result from a left join of table a with table b). This column is null if there is no perfect match.
c2 - the value from the table where we write the number from the averages (fallback) table for this material_id (regardless of the location)
c3 - the "actual value" column where we use a case statement to decide if when the column 1 is NULL (there is no perfect match of material and location) then we use the value from column 2 (the average over all the other locations for the material) for the further calculations.

How do I join to another table and return only the most recent matching row?

I have a table that stores the lines on a contract. Each contract line his it's own unique ID, it also has the ID of its parent contract. Example:
+-------------+---------+
| contract_id | line_id |
+-------------+---------+
| 1111 | 100 |
| 1111 | 101 |
| 1111 | 102 |
+-------------+---------+
I have another table that stores the historical changes to contract lines. For example, every time the number of units on a contract line is changed a new row is added to the table. Example:
+-------------+---------+--------------+-------+
| contract_id | line_id | date_changed | units |
+-------------+---------+--------------+-------+
| 1111 | 100 | 2016-01-01 | 1 |
| 1111 | 100 | 2016-02-01 | 2 |
| 1111 | 100 | 2016-03-01 | 3 |
+-------------+---------+--------------+-------+
As you can see the contract line with ID 100 belonging to the contract with ID 1111 has been edited 3 times over 3 months. The current value is 3 units.
I'm running a query against the contract lines table to select all data. I want to join to the historical data table and select the most recent row for each contract line and show the units in my results. How do I do this?
Expected results (there would single results for 101 and 102 as well):
+-------------+---------+-------+
| contract_id | line_id | units |
+-------------+---------+-------+
| 1111 | 100 | 3 |
+-------------+---------+-------+
I've tried the query below with a left join but it returns 3 rows instead of 1.
Query:
SELECT *, T1.units
FROM contract_lines
LEFT JOIN (
SELECT contract_id, line_id, units, MAX(date_changed) AS maxdate
FROM contract_history
GROUP BY contract_id, line_id, units) AS T1
ON contract_lines.contract_id = T1.contract_id
AND contract_lines.line_id = T1.line_id
Actual results:
+-------------+---------+-------+
| contract_id | line_id | units |
+-------------+---------+-------+
| 1111 | 100 | 1 |
| 1111 | 100 | 2 |
| 1111 | 100 | 3 |
+-------------+---------+-------+
An extra join to contract_history along with maxdate will work
SELECT contract_lines.*,T2.units
FROM contract_lines
LEFT JOIN (
SELECT contract_id, line_id, MAX(date_changed) AS maxdate
FROM contract_history
GROUP BY contract_id, line_id) AS T1
JOIN contract_history T2 ON
T1.contract_id=T2.contract_id and
T1.line_id= T2.line_id and
T1.maxdate=T2.date_changed
ON contract_lines.contract_id = T1.contract_id
AND contract_lines.line_id = T1.line_id
Output
This is my preferred style because it doesn't require self joining and cleanly expresses your intent. Also, it competes very well with the ROW_NUMBER() method in terms of performance.
select a.*
, b.units
from contract_lines as a
join (
select a.contract_id
, a.line_id
, a.units
, Max(a.date_changed) over(partition by a.contract_id, a.line_id) as max_date_changed
from contract_history as a
) as b
on a.contract_id = b.contract_id
and a.line_id = b.line_id
and b.date_changed = b.max_date_changed;
Another possible solution to this. This uses RANK to sort/filter this. Similar to what you did, just a different tact.
SELECT contract_lines.*, T1.units
FROM contract_lines
LEFT JOIN (
SELECT contract_id, line_id, units,
RANK() OVER (PARTITION BY contract_id, line_id ORDER BY date_changed DESC) AS [rank]
FROM contract_history) AS T1
ON contract_lines.contract_id = T1.contract_id
AND contract_lines.line_id = T1.line_id
AND T1.rank = 1
WHERE T1.units IS NOT NULL
You could change this to a INNER JOIN and remove the IS NOT NULL in the WHERE clause if you expect data to be present all the time.
Glad you figured it out!
Try this simple query:
SELECT TOP 1 T1.*
FROM contract_lines T0
INNER JOIN contract_history T1
ON T0.contract_id = T1.contract_id and
T0.line_id = T1.line_id
ORDER BY date_changed DESC
As always seems to be the way after spending an hour looking at it and shouting at StackOverflow for having a rare period of maintenance I solve my own problem not long after posting a question.
In an effort to help anyone else who's stuck I'll show what I found. It might not be an efficient way to achieve this so if someone has a better suggestion I'm all ears.
I adapted the answer from here: T-SQL Subquery Max(Date) and Joins
SELECT *,
Units = (SELECT TOP 1 units
FROM contract_history
WHERE contract_lines.contract_id = contract_history.contract_id
AND contract_lines.line_id = contract_history.line_id
ORDER BY date_changed DESC
)
FROM ....