hive: coalesce over a single column


For the sample data below, I need to fill in the null productid values with the productid that is available for the same viewid. Is this possible using coalesce?
date viewid productid
5/1/17 100e9b59e70deb1493677845193 null
5/1/17 100e9b59e70deb1493677845193 12345
5/1/17 100e9b59e70deb1493677845193 null
Results:
date viewid productid
5/1/17 100e9b59e70deb1493677845193 12345
5/1/17 100e9b59e70deb1493677845193 12345
5/1/17 100e9b59e70deb1493677845193 12345

select `date`,viewid,min(productid) over (partition by viewid) as productid
from mytable
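The query works because MIN, like other aggregates, skips nulls, so taken over the whole viewid partition it returns the single non-null productid on every row. Below is a minimal sketch of the same query run against SQLite through Python's sqlite3 module instead of Hive. The SQLite engine is an assumption made only to keep the example self-contained, and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (date TEXT, viewid TEXT, productid INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("5/1/17", "100e9b59e70deb1493677845193", None),
        ("5/1/17", "100e9b59e70deb1493677845193", 12345),
        ("5/1/17", "100e9b59e70deb1493677845193", None),
    ],
)

# MIN ignores nulls, so every row in the partition gets the non-null value.
rows = conn.execute(
    "SELECT date, viewid, MIN(productid) OVER (PARTITION BY viewid) AS productid "
    "FROM mytable"
).fetchall()

for row in rows:
    print(row)
```

If a viewid can carry more than one distinct productid, MIN picks the smallest one, which may or may not be what is wanted.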

Related

Join and enrich data in one table by closest date BigQuery

I have a BigQuery table with data:
clientId revenue orderId order_date w_date w_source w_campaign
11111111 100 00000001 2022-08-02 null null null
11111111 1000 00000002 2022-08-07 null null null
11111111 2000 00000003 2022-08-07 null null null
11111111 null null null 2022-07-27 source_1 campaign_2
11111111 null null null 2022-06-30 source_2 campaign_4
22222222 250 00000011 2022-08-15 null null null
22222222 500 00000015 2022-08-22 null null null
22222222 100 00000087 2022-08-25 null null null
22222222 null null null 2022-08-02 source_4 campaign_6
22222222 null null null 2022-08-18 source_1 campaign_9
And I want to get the result:
clientId revenue orderId order_date w_date w_source w_campaign
11111111 100 00000001 2022-08-02 2022-07-27 source_1 campaign_2
11111111 1000 00000002 2022-08-07 2022-07-27 source_1 campaign_2
11111111 2000 00000003 2022-08-07 2022-07-27 source_1 campaign_2
22222222 250 00000011 2022-08-15 2022-08-02 source_4 campaign_6
22222222 500 00000015 2022-08-22 2022-08-18 source_1 campaign_9
22222222 100 00000087 2022-08-25 2022-08-18 source_1 campaign_9
I don't understand how to merge this data properly:
clientId is the only column on which the rows can be joined
w_date must be less than or equal to order_date
w_source and w_campaign must come from the same row as the chosen w_date
I tried a JOIN and a subquery with LAST_VALUE, but neither worked. I hope the query I need is simpler than that.

You need to use window functions with an over clause. I defined a named window, get_last.
last_value returns the last entry of the column within the window frame. With ignore nulls, only non-null values are taken into account and the null entries are skipped. Because the frame runs from the start of the partition to the current row, only values dated on or before the current row's date are considered.
With tbl as
(Select 11111111 clientId,100 revenue,1 orderId,date("2022-08-02") order_date,null w_date,null w_source,null w_campaign
union all Select 11111111,1000,2,date("2022-08-07"),null,null,null
union all Select 11111111,2000,3,date("2022-08-07"),null,null,null
union all Select 11111111,null,null,null,date("2022-07-27"),"source_1","campaign_2"
union all Select 11111111,null,null,null,date("2022-06-30"),"source_2","campaign_4"
union all Select 22222222,250,11,date("2022-08-15"),null,null,null
union all Select 22222222,500,15,date("2022-08-22"),null,null,null
union all Select 22222222,100,87,date("2022-08-25"),null,null,null
union all Select 22222222,null,null,null,date("2022-08-02"),"source_4","campaign_6"
union all Select 22222222,null,null,null,date("2022-08-18"),"source_1","campaign_9"
),
tmp as
(Select * except(w_date,w_source,w_campaign) ,
last_value(w_date ignore nulls) over get_last as w_date_,
last_value(w_source ignore nulls) over get_last as w_source_,
last_value(w_campaign ignore nulls) over get_last as w_campaign_,
from tbl
window get_last as (partition by clientid order by ifnull(order_date,w_date) range between unbounded preceding and current row )
)
Select * from tmp
where orderId is not null
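To make the window logic above easier to follow, here is a plain-Python emulation of it, not the BigQuery query itself. It sorts each client's rows by ifnull(order_date, w_date), carries the last non-null w_* values forward, and keeps only the order rows. Two assumptions are made for the sketch: dates are ISO strings (so they sort correctly as text), and the three w_* columns are always filled together, as in the sample data.

```python
from itertools import groupby

# (clientId, revenue, orderId, order_date, w_date, w_source, w_campaign)
rows = [
    (11111111, 100, 1, "2022-08-02", None, None, None),
    (11111111, 1000, 2, "2022-08-07", None, None, None),
    (11111111, 2000, 3, "2022-08-07", None, None, None),
    (11111111, None, None, None, "2022-07-27", "source_1", "campaign_2"),
    (11111111, None, None, None, "2022-06-30", "source_2", "campaign_4"),
    (22222222, 250, 11, "2022-08-15", None, None, None),
    (22222222, 500, 15, "2022-08-22", None, None, None),
    (22222222, 100, 87, "2022-08-25", None, None, None),
    (22222222, None, None, None, "2022-08-02", "source_4", "campaign_6"),
    (22222222, None, None, None, "2022-08-18", "source_1", "campaign_9"),
]


def sort_key(r):
    # Order by client, then by ifnull(order_date, w_date). On equal dates the
    # w_* rows sort first, so an order row sees them (a RANGE frame ending at
    # the current row also includes rows with the same date).
    return (r[0], r[3] or r[4], r[3] is not None)


result = []
for client, group in groupby(sorted(rows, key=sort_key), key=lambda r: r[0]):
    last_w = (None, None, None)  # last non-null w_date, w_source, w_campaign
    for r in group:
        if r[4] is not None:
            last_w = r[4:7]
        if r[2] is not None:  # same filter as "where orderId is not null"
            result.append(r[:4] + tuple(last_w))

for r in result:
    print(r)
```

The tracker is reset for every clientId, which plays the role of partition by clientid in the SQL version.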

SQL - Group vacations in a table based on a holidays

Here is the sample data from the employee vacation table.
Emp_id Vacation_Start_Date Vacation_End_Date Public_Hday
1234 06/01/2022 06/07/2022 null
1234 06/08/2022 06/14/2022 null
1234 06/15/2022 06/19/2022 06/17/2022
1234 06/20/2022 06/23/2022 null
1234 06/24/2022 06/28/2022 null
1234 06/29/2022 07/02/2022 06/30/2022
1234 07/03/2022 07/07/2022 null
1234 07/08/2022 07/12/2022 null
1234 07/13/2022 07/17/2022 07/15/2022
1234 07/18/2022 07/22/2022 null
I want to group these vacations based on the public holidays in between (Assuming that all the vacations are consecutive). Here is the output that I am trying to get.
Emp_id Vacation_Start_Date Vacation_End_Date Public_Hday Group
1234 06/01/2022 06/07/2022 null 0
1234 06/08/2022 06/14/2022 null 0
1234 06/15/2022 06/19/2022 06/17/2022 1
1234 06/20/2022 06/23/2022 null 1
1234 06/24/2022 06/28/2022 null 1
1234 06/29/2022 07/02/2022 06/30/2022 2
1234 07/03/2022 07/07/2022 null 2
1234 07/08/2022 07/12/2022 null 2
1234 07/13/2022 07/17/2022 07/15/2022 3
1234 07/18/2022 07/22/2022 null 3
Here is the code that I tried:
Select *, dense_rank() over (partition by Emp_id order by Public_Hday) - 1 AS Group from Emp_Vacation
But it gave the expected group values only for the vacations where Public_Hday is not null. How do I get group values for the other vacations?

You can use a conditional sum() over(), which keeps a running count of the rows that have a Public_Hday, in Vacation_Start_Date order:
Select *
,Grp = sum( case when [Public_Hday] is null then 0 else 1 end ) over (partition by [Emp_id] order by [Vacation_Start_Date])
from YourTable
Results
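As a quick check of the running-sum idea, the sketch below runs an equivalent query on SQLite through Python's sqlite3 module. This is an assumption for the sake of a self-contained example: the answer's own syntax (Grp = ..., square brackets) is SQL Server style, dates are stored here as ISO text so that they sort correctly, and SQLite 3.25 or later is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp_Vacation ("
    "Emp_id INTEGER, Vacation_Start_Date TEXT, Vacation_End_Date TEXT, Public_Hday TEXT)"
)
conn.executemany(
    "INSERT INTO Emp_Vacation VALUES (?, ?, ?, ?)",
    [
        (1234, "2022-06-01", "2022-06-07", None),
        (1234, "2022-06-08", "2022-06-14", None),
        (1234, "2022-06-15", "2022-06-19", "2022-06-17"),
        (1234, "2022-06-20", "2022-06-23", None),
        (1234, "2022-06-24", "2022-06-28", None),
        (1234, "2022-06-29", "2022-07-02", "2022-06-30"),
        (1234, "2022-07-03", "2022-07-07", None),
        (1234, "2022-07-08", "2022-07-12", None),
        (1234, "2022-07-13", "2022-07-17", "2022-07-15"),
        (1234, "2022-07-18", "2022-07-22", None),
    ],
)

# Each row with a holiday adds 1 to the running total, starting a new group.
query = """
SELECT Vacation_Start_Date,
       SUM(CASE WHEN Public_Hday IS NULL THEN 0 ELSE 1 END)
           OVER (PARTITION BY Emp_id ORDER BY Vacation_Start_Date) AS Grp
FROM Emp_Vacation
ORDER BY Emp_id, Vacation_Start_Date
"""
groups = [grp for _, grp in conn.execute(query)]
print(groups)  # [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
```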

Return first non null value in each column

I am looking to create a summary/rollup by day and customer ID from a table (table is updating from multiple sources currently).
For each customer ID and transaction date, I'm either looking to get a min, max, sum or first non null value in that column for that combination. I have no problem with min, max and sum, but am looking for suggestions on how to best handle the first non null value in a column.
Sample of what my table looks like:
Cust ID Trans Date Housing Housing $ Retail Retail $ Arrival
123 1/1/2019 test1 $500.00 NULL NULL 1/1/2019
123 1/1/2019 NULL NULL product1 $15.00 NULL
1235 5/10/2019 test2 $1,000.00 NULL NULL 5/10/2019
1234 10/15/2019 test2 $1,000.00 NULL NULL 10/15/2019
1234 10/15/2019 NULL NULL product2 $25.00 NULL
Results I'm looking for:
123 1/1/2019 test1 $500.00 product1 $15.00 1/1/2019
1235 5/10/2019 test2 $1,000.00 NULL NULL 5/10/2019
1234 10/15/2019 test2 $1,000.00 product2 $25.00 10/15/2019

SQL tables represent unordered sets. There is no "first value" in a column -- NULL or otherwise -- unless a column specifies the ordering.
However, for your result set, simple aggregation seems sufficient:
select CustID, TransDate, max(Housing), max(Housing$), max(Retail), max(Retail$), max(Arrival)
from t
group by CustID, TransDate;
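The aggregation works here because max() skips nulls and each customer/date group holds at most one non-null value per column. If a group held several non-null values, max() would return the largest rather than the "first". A minimal sketch on SQLite through Python's sqlite3 module follows. The engine choice and the column names HousingAmt and RetailAmt (standing in for Housing $ and Retail $) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (CustID INTEGER, TransDate TEXT, Housing TEXT, HousingAmt REAL,"
    " Retail TEXT, RetailAmt REAL, Arrival TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (123, "2019-01-01", "test1", 500.0, None, None, "2019-01-01"),
        (123, "2019-01-01", None, None, "product1", 15.0, None),
        (1235, "2019-05-10", "test2", 1000.0, None, None, "2019-05-10"),
        (1234, "2019-10-15", "test2", 1000.0, None, None, "2019-10-15"),
        (1234, "2019-10-15", None, None, "product2", 25.0, None),
    ],
)

# One output row per customer and date; MAX collapses the nulls away.
query = """
SELECT CustID, TransDate, MAX(Housing), MAX(HousingAmt),
       MAX(Retail), MAX(RetailAmt), MAX(Arrival)
FROM t
GROUP BY CustID, TransDate
ORDER BY CustID
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```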

Apply MAX value. Then add conditions based on the MAX Value Row

I have the table below. For each ID, I need the MAX value of Date among the rows where CategoryID = 201.
TableA
ID Date CategoryID
1 1/1/17 101
1 1/2/17 201
1 1/4/17 201
1 1/5/17 301
2 1/1/17 101
2 5/1/17 201
(Work) Query:
,MAX(TABLEA.DATE)
KEEP (DENSE_RANK LAST ORDER BY TABLEA.DATE)
OVER (PARTITION BY ID)
AS most_recent_dt
I need to add a condition in the query: When CategoryId = 201 Then take the MAX Date
Expected Output:
ID Date CategoryId Most_Recent_Dt
1 1/1/17 101 1/4/17
1 1/2/17 201 1/4/17
1 1/4/17 201 1/4/17
1 1/5/17 301 1/4/17
2 1/1/17 101 5/1/17
2 5/1/17 201 5/1/17
Edit:
Now that I have my MAX line, I need to add conditions based on the MAX line only.
In short:
Partition by ID.
Apply the MAX value when CategoryID = 201.
Then apply conditions based on the MAX-value row:
when Role = Gold and HistID is not null then "Approved", else "Pending".
Expected Output (the MAX-value rows are marked with *):
ID Date CategoryID Most_Recent_Dt Role HistId Category
1 1/1/17 101 1/4/17 Gold (Null) Approved
1 1/2/17 201 1/4/17 Bronze 201 Approved
*1 1/4/17 201 1/4/17 Gold 101 Approved
1 1/5/17 301 1/4/17 Gold 101 Approved
2 1/1/17 101 5/1/17 Gold (Null) Pending
*2 5/1/17 201 5/1/17 Bronze 101 Pending

You should not need the KEEP clause (since it is the same as the MAX) and can just do:
MAX( CASE When CategoryId = 201 THEN TABLEA.DATE END )
OVER (PARTITION BY ID)
AS most_recent_201_dt
Regarding the edit ("Now that I have my MAX Line I need to add Conditions based on the MAX line only"):
Case When (Role = Gold And HistId IS NOT NULL) OR () THEN 'Approved' WHEN... THEN 'NotApproved' ELSE 'Pending' END AS Category
This is when you would use the KEEP clause as you want the values for the Role and HistID columns for the latest date value.
Something like:
CASE
WHEN (
MAX( CASE Role WHEN 'Gold' THEN Role END )
KEEP ( DENSE_RANK LAST
ORDER BY CASE WHEN CategoryId = 201 THEN TABLEA.DATE END NULLS FIRST )
OVER ( PARTITION BY ID )
= Role
AND
MAX( HistID )
KEEP ( DENSE_RANK LAST
ORDER BY CASE WHEN CategoryId = 201 THEN TABLEA.DATE END NULLS FIRST,
CASE Role WHEN 'Gold' THEN Role END NULLS FIRST )
OVER ( PARTITION BY ID )
IS NOT NULL
)
OR ( ... )
THEN 'Approved'
WHEN ...
THEN 'NotApproved'
ELSE 'Pending'
END

I would do this as:
MAX(CASE WHEN CategoryId = 201 THEN TABLEA.DATE END) OVER (PARTITION BY id) as most_recent_dt
That is, don't think of this as a "first value" calculation. Think of it as a conditional maximum over all records with the same id.
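The conditional maximum can be tried out with the sketch below, which runs the expression on SQLite through Python's sqlite3 module instead of Oracle. That swap is an assumption made to keep the example self-contained; dates are stored as ISO text so that MAX compares them correctly, and SQLite 3.25 or later is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (ID INTEGER, Date TEXT, CategoryID INTEGER)")
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?, ?)",
    [
        (1, "2017-01-01", 101),
        (1, "2017-01-02", 201),
        (1, "2017-01-04", 201),
        (1, "2017-01-05", 301),
        (2, "2017-01-01", 101),
        (2, "2017-05-01", 201),
    ],
)

# The CASE yields NULL for other categories, and MAX ignores those NULLs,
# so only CategoryID = 201 dates compete within each ID partition.
query = """
SELECT ID, Date, CategoryID,
       MAX(CASE WHEN CategoryID = 201 THEN Date END)
           OVER (PARTITION BY ID) AS most_recent_dt
FROM TableA
ORDER BY ID, Date
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Every row of ID 1 gets 2017-01-04 and every row of ID 2 gets 2017-05-01, matching the first expected output in the question.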

sql combine only some duplicates

I want to combine one set of duplicates from my table, but not all.
example:
acct date bal
--------------------
123 1/1/2013 40.00
123 1/1/2013 2.00
456 1/2/2013 50.00
456 1/1/2013 5.00
789 1/1/2013 10.00
789 1/1/2013 17.00
I would like to combine acct 123 to only one row, summing the balance of those rows, but leave the rest.
desired output:
acct date bal
--------------------
123 1/1/2013 42.00
456 1/2/2013 50.00
456 1/1/2013 5.00
789 1/1/2013 10.00
789 1/1/2013 17.00
Working in SQL Server 2005.

Use CASE in GROUP BY clause
SELECT acct, date, SUM(bal) AS bal
FROM dbo.tets73
GROUP BY acct, date, CASE WHEN acct != 123 THEN bal END
Demo on SQLFiddle
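The CASE expression is constant (NULL) for acct 123, so those rows fall into one group per date, while for every other account it equals bal and keeps rows with different balances apart. A minimal sketch on SQLite through Python's sqlite3 module, used here instead of SQL Server 2005 as an assumption for a self-contained example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tets73 (acct INTEGER, date TEXT, bal REAL)")
conn.executemany(
    "INSERT INTO tets73 VALUES (?, ?, ?)",
    [
        (123, "2013-01-01", 40.0),
        (123, "2013-01-01", 2.0),
        (456, "2013-01-02", 50.0),
        (456, "2013-01-01", 5.0),
        (789, "2013-01-01", 10.0),
        (789, "2013-01-01", 17.0),
    ],
)

# Rows of acct 123 share a NULL group key; other rows are grouped by bal too.
query = """
SELECT acct, date, SUM(bal) AS bal
FROM tets73
GROUP BY acct, date, CASE WHEN acct != 123 THEN bal END
ORDER BY 1, 2 DESC, 3
"""
combined = conn.execute(query).fetchall()
for row in combined:
    print(row)
```

One caveat: two rows of another account with the same date and the same bal would still land in one group and be summed.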

SELECT acct, date, SUM(bal) AS bal
FROM T
WHERE acct = 123
GROUP BY acct, date
UNION ALL
SELECT acct, date, bal
FROM T
WHERE acct <> 123

select acct, date, sum(bal) as bal from your_table where acct = 123 group by acct, date
union all
select acct, date, bal from your_table where acct <> 123
