Hive - Using Lateral View Explode with Joined Table - sql

I am building some analysis and need to prep the data by joining two tables, then unpivoting the date fields to create one record for each "date_type". I have been trying to work with the lateral view explode(array()) function, but I can't figure out how to do this with columns from two separate tables. Any help would be appreciated; I am open to completely different methods.
TableA:
loan_number  app_date
123          07/09/2022
456          07/11/2022
TableB:
loan_number  funding_date  amount
123          08/13/2022    12000
456          08/18/2022    10000
Desired Result:
loan_number  date_type     date_value  amount
123          app_date      07/09/2022  12000
456          app_date      07/11/2022  10000
123          funding_date  08/13/2022  12000
456          funding_date  08/18/2022  10000
Here is some sample code, related to the example above, that I was trying to make work:
SELECT
b.loan_number,
b.amount,
Date_Value
FROM TableA as a
LEFT JOIN
TableB as b
ON a.loan_number=b.loan_number
LATERAL VIEW explode(array(to_date(a.app_date),to_date(b.funding_date)) Date_List AS Date_value

There is no need for lateral view explode here; a union all is enough. Try the query below:
with base_data as (
select
a.loan_number,
a.app_date,
b.funding_date,
b.amount
from
tableA a
join
tableB b on a.loan_number = b.loan_number
)
select
loan_number,
'app_date' as date_type,
app_date as date_value,
amount
from
base_data
union all
select
loan_number,
'funding_date' as date_type,
funding_date as date_value,
amount
from
base_data
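To check the union all approach against the sample data from the question, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here (an assumption made so the example is runnable anywhere), the dates are stored as plain text, and the final ORDER BY is added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (loan_number INTEGER, app_date TEXT);
    CREATE TABLE TableB (loan_number INTEGER, funding_date TEXT, amount INTEGER);
    INSERT INTO TableA VALUES (123, '07/09/2022'), (456, '07/11/2022');
    INSERT INTO TableB VALUES (123, '08/13/2022', 12000), (456, '08/18/2022', 10000);
""")

# Join once, then emit one row per date column with UNION ALL.
query = """
WITH base_data AS (
    SELECT a.loan_number, a.app_date, b.funding_date, b.amount
    FROM TableA a
    JOIN TableB b ON a.loan_number = b.loan_number
)
SELECT loan_number, 'app_date' AS date_type, app_date AS date_value, amount
FROM base_data
UNION ALL
SELECT loan_number, 'funding_date' AS date_type, funding_date AS date_value, amount
FROM base_data
ORDER BY date_type, loan_number
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each loan appears twice in the result, once per date_type, which matches the desired result in the question.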

Related

Netezza - update one table with max data from another table

I have a table in netezza that I need to update. The columns I am working with are
TABLE A: ID_NO, ENTRY_DATE, PRICE
TABLE B: ID_NO, START_DATE, END_DATE, PRICE
So an example of the data would look like this:
TABLE A
ID_NO  ENTRY_DATE  PRICE
123    2020-05-01
123    2020-08-15
TABLE B
ID_NO  START_DATE  END_DATE    PRICE
123    2019-01-01  2019-11-01  $7.64
123    2020-04-30  2020-05-02  $6.19
123    2020-04-15  2020-08-30  $2.19
I need to update the PRICE in TABLE A to be the max PRICE from TABLE B where a.ENTRY_DATE is between b.START_DATE and b.END_DATE. So the final table should look like this:
TABLE A
ID_NO  ENTRY_DATE  PRICE
123    2020-05-01  $6.19
123    2020-08-15  $2.19
This is what I have so far, but it just ends up taking the max price that fits either row rather than doing the calculation for each row:
update TABLE_A
set PRICE=(select max(b.PRICE)
from TABLE_B b
inner join TABLE_A a on a.ID_NO=b.ID_NO
where a.ENTRY_DATE between b.START_DATE and b.END_DATE)
I don't have access to Netezza, but a usual format would be to use a correlated sub-query.
That is, instead of including TABLE_A again in the query, you refer to the outer reference to TABLE_A...
update
TABLE_A
set
PRICE = (
select max(b.PRICE)
from TABLE_B b
where TABLE_A.ID_NO = b.ID_NO
and TABLE_A.ENTRY_DATE between b.START_DATE and b.END_DATE
)
In this way, the correlated-sub-query is essentially invoked once for each row in TABLE_A and that invocation uses the current row from TABLE_A as its parameters.
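As a rough check of the correlated sub-query, the sketch below runs the same UPDATE against the sample rows using Python's built-in sqlite3 module. SQLite stands in for Netezza here (an assumption, since the behaviour of Netezza itself was not verified), prices are stored as plain numbers without the dollar sign, and dates are ISO text so that BETWEEN compares them correctly.

```python
import sqlite3

# In-memory SQLite database standing in for Netezza (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE_A (ID_NO INTEGER, ENTRY_DATE TEXT, PRICE REAL);
    CREATE TABLE TABLE_B (ID_NO INTEGER, START_DATE TEXT, END_DATE TEXT, PRICE REAL);
    INSERT INTO TABLE_A VALUES (123, '2020-05-01', NULL), (123, '2020-08-15', NULL);
    INSERT INTO TABLE_B VALUES
        (123, '2019-01-01', '2019-11-01', 7.64),
        (123, '2020-04-30', '2020-05-02', 6.19),
        (123, '2020-04-15', '2020-08-30', 2.19);
""")

# The sub-query refers to the outer TABLE_A row, so it is evaluated per row.
conn.execute("""
    UPDATE TABLE_A
    SET PRICE = (
        SELECT MAX(b.PRICE)
        FROM TABLE_B b
        WHERE TABLE_A.ID_NO = b.ID_NO
          AND TABLE_A.ENTRY_DATE BETWEEN b.START_DATE AND b.END_DATE
    )
""")

updated = conn.execute(
    "SELECT ID_NO, ENTRY_DATE, PRICE FROM TABLE_A ORDER BY ENTRY_DATE"
).fetchall()
for row in updated:
    print(row)
```

One thing to keep in mind: a TABLE_A row with no matching date range has its PRICE set to NULL, because MAX over an empty set returns NULL. Add a WHERE EXISTS clause to the UPDATE if such rows should keep their old price.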
An alternative could be...
update
TABLE_A
set
PRICE = revised.PRICE
from
(
select a.ID_NO, a.ENTRY_DATE, max(b.PRICE) AS PRICE
from TABLE_B b
inner join TABLE_A a on a.ID_NO=b.ID_NO
where a.ENTRY_DATE between b.START_DATE and b.END_DATE
group by a.ID_NO, a.ENTRY_DATE
)
AS revised
where
TABLE_A.ID_NO = revised.ID_NO
and TABLE_A.ENTRY_DATE = revised.ENTRY_DATE

Find last job change date with JOB_TITLE and EVENT_DATE

Hi, I am working in Azure Databricks and I am looking for a SQL query solution.
Assume that my table has four columns:
ID     EVENT_DATE  JOB_TITLE  PAY
12345  2021-01-01  VP1        100,000
12345  2020-01-10  VP1        90,000
12345  2019-01-20  Analyst1   80,000
12346  2021-02-01  VP2        200,000
12346  2020-02-10  Analyst2   150,000
12346  2020-01-20  Analyst2   110,000
Basically I want the EVENT_DATE when JOB_TITLE changed the last time. This is my desired output:
ID     JOB_TITLE  PAY      LAST_JOB_CHANGE_DATE
12345  VP1        90,000   2020-01-10
12346  VP2        200,000  2021-02-01
For the last column LAST_JOB_CHANGE_DATE, we are pulling from the 2nd and 4th row of the table because that's the date when they changed job the last time.
Thank you!
You can use an INNER JOIN against the latest event_date per id to accomplish that, i.e.:
%sql
SELECT a.*
FROM yourTable a
INNER JOIN
(
SELECT id, MAX(event_date) event_date
FROM yourTable b
GROUP BY id
) b ON a.id = b.id
AND a.event_date = b.event_date
The ROW_NUMBER approach would also work well:
%sql
WITH cte AS
(
SELECT
ROW_NUMBER() OVER( PARTITION BY id ORDER BY event_date DESC ) AS rn,
*
FROM yourTable a
)
SELECT *
FROM cte
WHERE rn = 1
There's probably a simpler solution for this but the following should work.
I'm assuming you wanted the most recent job change for each employee. To illustrate this, I added an extra row with the title Engineer1. The ROW_NUMBER() window function helps us with this.
ID     EVENT_DATE  JOB_TITLE  PAY
12345  2021-01-01  VP1        100,000
12345  2020-01-10  VP1        90,000
12345  2019-01-20  Analyst1   80,000
12345  2018-01-04  Engineer1  75,000
12346  2021-02-01  VP2        200,000
12346  2020-02-10  Analyst2   150,000
12346  2020-01-20  Analyst2   110,000
Here is the query:
SELECT -- (4)
c.ID,
c.JOB_TITLE,
c.PAY,
c.last_job_change_date
FROM
(
SELECT -- (3)
b.ID,
ROW_NUMBER() OVER (PARTITION BY b.ID ORDER BY b.last_job_change_date DESC) AS row_id,
b.JOB_TITLE,
b.PAY,
b.last_job_change_date
FROM
(
SELECT -- (2)
a.ID,
a.JOB_TITLE,
a.PAY,
a.EVENT_DATE as last_job_change_date
FROM
(
SELECT -- (1)
ID,
EVENT_DATE,
PAY,
JOB_TITLE,
LEAD(JOB_TITLE, 1) OVER (
PARTITION BY ID ORDER BY EVENT_DATE DESC) job_change
FROM yourtable
) a
WHERE JOB_TITLE <> job_change
) b
) c
WHERE row_id = 1
I used a four-step process and annotated the query with each step:
(1) Returns a table with a column holding each employee's previous job title (the next row when ordered by most recent EVENT_DATE first).
(2) Returns the table from (1) but keeps only the rows where the job title differs from the previous one, i.e. the rows where the employee changed jobs.
(3) Adds row numbers so we can pick the most recent job change of each employee.
(4) Returns the most recent job change for each employee.
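The four steps above can be sketched on the sample rows with Python's built-in sqlite3 module. This is only an illustration under a few assumptions: SQLite (3.25 or later, for window functions) stands in for Databricks SQL, PAY is stored without thousands separators, and the nested sub-queries are rewritten as CTEs named with_prev and changes, which are names made up for this example.

```python
import sqlite3

# In-memory SQLite database standing in for Databricks SQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourtable (ID INTEGER, EVENT_DATE TEXT, JOB_TITLE TEXT, PAY INTEGER);
    INSERT INTO yourtable VALUES
        (12345, '2021-01-01', 'VP1', 100000),
        (12345, '2020-01-10', 'VP1', 90000),
        (12345, '2019-01-20', 'Analyst1', 80000),
        (12345, '2018-01-04', 'Engineer1', 75000),
        (12346, '2021-02-01', 'VP2', 200000),
        (12346, '2020-02-10', 'Analyst2', 150000),
        (12346, '2020-01-20', 'Analyst2', 110000);
""")

query = """
WITH with_prev AS (
    -- (1) attach the chronologically previous title to every row
    SELECT ID, EVENT_DATE, JOB_TITLE, PAY,
           LEAD(JOB_TITLE, 1) OVER (
               PARTITION BY ID ORDER BY EVENT_DATE DESC) AS job_change
    FROM yourtable
),
changes AS (
    -- (2) keep only rows where the title changed, (3) number them newest first
    SELECT ID, EVENT_DATE, JOB_TITLE, PAY,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY EVENT_DATE DESC) AS row_id
    FROM with_prev
    WHERE JOB_TITLE <> job_change
)
-- (4) the newest change per employee
SELECT ID, JOB_TITLE, PAY, EVENT_DATE AS last_job_change_date
FROM changes
WHERE row_id = 1
ORDER BY ID
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Note that an employee whose title never changed has no row that passes step (2), because comparing against a NULL previous title is never true, so that employee is missing from the output.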

SQL: Take maximum value, but if a field is missing for a particular ID, ignore all values

This is somewhat difficult to explain...(this is using SQL Assistant for Teradata, which I'm not overly familiar with).
ID creation_date completion_date Difference
123 5/9/2016 5/16/2016 7
123 5/14/2016 5/16/2016 2
456 4/26/2016 4/30/2016 4
456 (null) 4/30/2016 (null)
789 3/25/2016 3/31/2016 6
789 3/1/2016 3/31/2016 30
An ID may have more than one creation_date, but it will always have the same completion_date. If the creation_date is populated for all records for an ID, I want to return the record with the most recent creation_date. However, if ANY creation_date for a given ID is missing, I want to ignore all records associated with this ID.
Given the data above, I would want to return:
ID creation_date completion_date Difference
123 5/14/2016 5/16/2016 2
789 3/25/2016 3/31/2016 6
No records are returned for 456 because the second record has a missing creation_date. The record with the most recent creation_date is returned for 123 and 789.
Any help would be greatly appreciated. Thanks!
Depending on your database, here's one option using row_number to get the max date per group. You can then filter those results with not exists to check against null values:
select *
from (
select *,
row_number() over (partition by id order by creation_date desc) rn
from yourtable
) t
where rn = 1 and not exists (
select 1
from yourtable t2
where t2.creation_date is null and t.id = t2.id
)
row_number is a window function that is supported in many databases. MySQL versions before 8.0 do not support it, but there you can achieve the same result using user-defined variables.
Here is a more generic version using conditional aggregation:
select t.*
from yourtable t
join (select id, max(creation_date) max_creation_date
from yourtable
group by id
having count(case when creation_date is null then 1 end) = 0
) t2 on t.id = t2.id and t.creation_date = t2.max_creation_date
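To see the conditional aggregation version at work on the sample rows, here is a small sketch using Python's built-in sqlite3 module. SQLite is a stand-in for Teradata (an assumption), the Difference column is left out, and the dates are rewritten as ISO text so that MAX picks the most recent one.

```python
import sqlite3

# In-memory SQLite database standing in for Teradata (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE yourtable (id INTEGER, creation_date TEXT, completion_date TEXT);
    INSERT INTO yourtable VALUES
        (123, '2016-05-09', '2016-05-16'),
        (123, '2016-05-14', '2016-05-16'),
        (456, '2016-04-26', '2016-04-30'),
        (456, NULL,         '2016-04-30'),
        (789, '2016-03-25', '2016-03-31'),
        (789, '2016-03-01', '2016-03-31');
""")

# HAVING counts the NULL creation dates per id and drops any id that has one.
query = """
SELECT t.id, t.creation_date, t.completion_date
FROM yourtable t
JOIN (SELECT id, MAX(creation_date) AS max_creation_date
      FROM yourtable
      GROUP BY id
      HAVING COUNT(CASE WHEN creation_date IS NULL THEN 1 END) = 0
     ) t2 ON t.id = t2.id AND t.creation_date = t2.max_creation_date
ORDER BY t.id
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

ID 456 disappears entirely because one of its rows has a NULL creation_date, while 123 and 789 each return their most recent row.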

How do I return multiple column values as new rows in Oracle 10g?

I have a table where multiple account numbers are associated with different IDs (DR_NAME). Each name could have as few as 0 accounts and as many as 16. I believe UNPIVOT would work, but I'm on Oracle 10g, which does not support it.
DR_NAME  ACCT1  ACCT2  ACCT3  ACCT4
======================================
SMITH    1234
JONES    5678   2541   2547
MARK     NULL
WARD     8754   6547
I want to display a new line for each name with only 1 account number per line
DR_NAME ACCT
==============
SMITH 1234
JONES 5678
JONES 2541
JONES 2547
MARK NULL
WARD 8754
WARD 6547
Oracle 10g does not have an UNPIVOT function but you can use a UNION ALL query to unpivot the columns into rows:
select t1.DR_NAME, d.Acct
from yourtable t1
left join
(
select DR_NAME, ACCT1 as Acct
from yourtable
where acct1 is not null
union all
select DR_NAME, ACCT2 as Acct
from yourtable
where acct2 is not null
union all
select DR_NAME, ACCT3 as Acct
from yourtable
where acct3 is not null
union all
select DR_NAME, ACCT4 as Acct
from yourtable
where acct4 is not null
) d
on t1.DR_NAME = d.DR_NAME;
This query uses a UNION ALL to convert the columns into rows. I included a where clause to remove any null values, otherwise you will get multiple rows for each account where the acct value is null. Excluding the null values will drop the dr_name = Mark which you showed that you want in the final result. To include the rows that only have null values, I added the join to the table again.
The most efficient way I know is to do a cross join with some logic:
select *
from (select t.dr_name,
(case when n.n = 1 then acct1
when n.n = 2 then acct2
when n.n = 3 then acct3
when n.n = 4 then acct4
end) as acct
from t cross join
(select 1 as n from dual union all
select 2 from dual union all
select 3 from dual union all
select 4 from dual
) n
) s
where acct is not null
The union all approach typically results in scanning the table once for each subquery. This approach will typically scan the table only once.
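For reference, the cross join idea can be tried on the sample data with Python's built-in sqlite3 module. This sketch assumes SQLite as a stand-in for Oracle 10g, so the four-row number list is built without dual, and an extra column n is kept only to order the output.

```python
import sqlite3

# In-memory SQLite database standing in for Oracle 10g (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (dr_name TEXT, acct1 INTEGER, acct2 INTEGER,
                    acct3 INTEGER, acct4 INTEGER);
    INSERT INTO t VALUES
        ('SMITH', 1234, NULL, NULL, NULL),
        ('JONES', 5678, 2541, 2547, NULL),
        ('MARK',  NULL, NULL, NULL, NULL),
        ('WARD',  8754, 6547, NULL, NULL);
""")

# Each source row is paired with n = 1..4; CASE picks the n-th account column.
query = """
SELECT dr_name, acct
FROM (SELECT t.dr_name, n.n,
             CASE n.n WHEN 1 THEN acct1
                      WHEN 2 THEN acct2
                      WHEN 3 THEN acct3
                      WHEN 4 THEN acct4
             END AS acct
      FROM t CROSS JOIN
           (SELECT 1 AS n UNION ALL
            SELECT 2 UNION ALL
            SELECT 3 UNION ALL
            SELECT 4) n
     ) s
WHERE acct IS NOT NULL
ORDER BY dr_name, n
"""
unpivoted = conn.execute(query).fetchall()
for row in unpivoted:
    print(row)
```

Because of the final acct IS NOT NULL filter, MARK does not appear in this output; keeping names that have no accounts at all would need an extra outer join like the one in the union all answer.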
If you're only interested in inserting these records then take a look at multitable insert -- a single scan of the data and multiple rows generated, so it's very efficient.
Code examples here: http://docs.oracle.com/cd/B28359_01/server.111/b28286/statements_9014.htm#SQLRF01604
Note that you can reference the same table multiple times, using syntax along the lines of ...
insert all
when acct1 is not null then into target_table (..) values (dr_name,acct1)
when acct2 is not null then into target_table (..) values (dr_name,acct2)
when acct3 is not null then into target_table (..) values (dr_name,acct3)
when acct4 is not null then into target_table (..) values (dr_name,acct4)
select
dr_name,
acct1,
acct2,
acct3,
acct4
from my_table;

Filling in for missing latest data with last available data

I have two tables, one (market_cap_data) with month_end_date, id, market_cap fields:
month_end_date id market_cap
2012-12-31 123456 5000
2011-12-31 123456 4000
and a second table (start_date_table) with month_end_date, id, start_date fields:
month_end_date id start_date
2011-12-31 123456 1980-12-31
I want to combine the two tables, but the start_date_table data ends a year before the market_cap_data table. Where start_date_table has no data, I want to fill in the most recent start_date. For example, instead of an outer join result like:
month_end_date id market_cap start_date
2012-12-31 123456 5000 NULL
2011-12-31 123456 4000 1980-12-31
I want it to look like
month_end_date id market_cap start_date
2012-12-31 123456 5000 1980-12-31
2011-12-31 123456 4000 1980-12-31
Tried a bunch of different things but can't figure it out.
Any help would be appreciated!
SELECT
m.month_end_date,
m.id,
m.market_cap,
CASE
WHEN s.start_date IS NOT NULL THEN s.start_date
ELSE (SELECT MAX(s2.start_date) FROM start_date_table s2 WHERE s2.id = m.id)
END AS start_date
FROM market_cap_data m
LEFT JOIN start_date_table s
ON m.id = s.id
AND m.month_end_date = s.month_end_date
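The query above can be exercised on the two sample tables with Python's built-in sqlite3 module. In this sketch SQLite is a stand-in for the unnamed database in the question (an assumption), COALESCE is used as a shorthand for the CASE expression, and an ORDER BY is added so the output order matches the desired result.

```python
import sqlite3

# In-memory SQLite database standing in for the asker's database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE market_cap_data (month_end_date TEXT, id INTEGER, market_cap INTEGER);
    CREATE TABLE start_date_table (month_end_date TEXT, id INTEGER, start_date TEXT);
    INSERT INTO market_cap_data VALUES
        ('2012-12-31', 123456, 5000),
        ('2011-12-31', 123456, 4000);
    INSERT INTO start_date_table VALUES ('2011-12-31', 123456, '1980-12-31');
""")

# Use the joined start_date when present, otherwise fall back to the
# latest start_date known for the same id.
query = """
SELECT m.month_end_date, m.id, m.market_cap,
       COALESCE(s.start_date,
                (SELECT MAX(s2.start_date)
                 FROM start_date_table s2
                 WHERE s2.id = m.id)) AS start_date
FROM market_cap_data m
LEFT JOIN start_date_table s
       ON m.id = s.id
      AND m.month_end_date = s.month_end_date
ORDER BY m.month_end_date DESC
"""
filled = conn.execute(query).fetchall()
for row in filled:
    print(row)
```

The 2012-12-31 row has no match in start_date_table, so it picks up 1980-12-31 from the fallback sub-query instead of NULL.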
I think you would benefit from a case statement. This is not tested, as I don't have a fiddle to validate against:
-- Pseudocode (untested): the exact function and procedure syntax depends on the database.
create function get_latest_date_from_table(varchar(100) table_name) returns Date
(
return select max(date) from #table_name
)
create procedure modify_null_dates_for_marker
(
max_date Date;
max_date = get_latest_date_from_table('table');
select
foo,
bar,
CASE WHEN start_date IS NULL
THEN max_date
ELSE start_date END AS start_date
FROM table
)
This should give a method to set the null columns correctly.