Hive Multi Select - hive

CID F_ID NME
1 A QR
1 B QB
2 A QR
3 B QB
4 A QR
4 B QB
Expected result:
CID F_ID NME
1 A QR
1 B QB
4 A QR
4 B QB
In Hive, what query returns only the rows whose CID appears with both
F_ID values, A and B? I can achieve the same using LISTAGG in Oracle.

This query will execute in a single MapReduce stage:
select CID, F_ID, NME from
(
select s.*,
sum(A) over (partition by CID) A_cnt,
sum(B) over (partition by CID) B_cnt
from
(
select s.*,
case when F_ID='A' then 1 else 0 end A,
case when F_ID='B' then 1 else 0 end B
from your_table s
)s
)s where A_cnt>=1 and B_cnt >=1
;
Demo:
select CID, F_ID, NME from
(
select s.*,
sum(A) over (partition by CID) A_cnt,
sum(B) over (partition by CID) B_cnt
from
(
select s.*,
case when F_ID='A' then 1 else 0 end A,
case when F_ID='B' then 1 else 0 end B
from
( --replace this subquery (s) with your table
select stack(6,
1, 'A', 'QR',
1, 'B', 'QB',
2, 'A', 'QR',
3, 'B', 'QB',
4, 'A', 'QR',
4, 'B', 'QB') as (CID, F_ID, NME)
) s
)s
)s where A_cnt>=1 and B_cnt >=1
;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.39 sec HDFS Read: 13549 HDFS Write: 28 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 390 msec
OK
1 B QB
1 A QR
4 B QB
4 A QR
Time taken: 108.779 seconds, Fetched: 4 row(s)
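The filtering logic can also be checked outside Hive. The following is a minimal Python sketch, not the Hive execution itself: it counts rows per CID and F_ID (the role played by the windowed sums) and keeps only rows whose CID has at least one A and one B. The function name rows_with_all_fids and the required parameter are illustrative names, not part of the original answer.

```python
from collections import defaultdict

# Sample rows from the question: (CID, F_ID, NME).
rows = [
    (1, "A", "QR"), (1, "B", "QB"),
    (2, "A", "QR"),
    (3, "B", "QB"),
    (4, "A", "QR"), (4, "B", "QB"),
]

def rows_with_all_fids(rows, required=("A", "B")):
    """Keep rows whose CID has at least one row for every required F_ID."""
    counts = defaultdict(lambda: defaultdict(int))
    for cid, f_id, _ in rows:
        # Same role as sum(case when F_ID='A' ...) over (partition by CID).
        counts[cid][f_id] += 1
    # Same role as the outer filter: A_cnt >= 1 and B_cnt >= 1.
    return [r for r in rows if all(counts[r[0]][f] >= 1 for f in required)]

result = rows_with_all_fids(rows)
print(result)
```

CIDs 2 and 3 each have only one of the two F_ID values, so their rows are dropped.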

Related

Subtract two columns in a query for 2 given types, otherwise leave as is

CREATE TABLE #test (Type1 VARCHAR(10),NUM1 VARCHAR(10), Grp1 VARCHAR(10), Amt1 INT, Amt2 INT)
INSERT INTO #test
(
Type1,
NUM1,
Grp1,
Amt1,
Amt2
)
VALUES
('CA', 'TIX_1', 'GG', 3, 5),
('PR', 'TIX_1', 'GG', 2, 1),
('PR', 'TIX_2', 'XX', 1, 5)
Let's say I have data in a table that looks like this:
Type1 NUM1 Grp1 Amt1 Amt2
CA TIX_1 GG 3 5
PR TIX_1 GG 2 1
PR TIX_2 XX 1 5
For a given NUM1 I will have either two Type1 values (CA and PR) or one Type1 value (either CA or PR).
The goal is, for those NUM1s that have both a PR and a CA record, to do a simple subtraction (CA - PR); if a NUM1 has only one Type1 (either CA or PR), leave it as is.
Goal:
NUM1 Grp1 Amt1 Amt2
TIX_1 GG 1 4
TIX_2 XX 1 5
I'm trying to do this with a FULL JOIN like so:
SELECT t1.NUM1,
t1.Grp1,
t1.Amt1 - t2.Amt1 AS X, --(CA - PR )
t1.Amt2 - t2.Amt2 AS Y --(CA - PR )
FROM #test t1 FULL JOIN #test t2
ON t1.NUM1=t2.NUM1
AND t1.Grp1=t2.Grp1
AND t2.Type1='PR'
WHERE t1.Type1='CA'
Result from above query:
NUM1 Grp1 X Y
TIX_1 GG 1 4
For TIX_2 there is only one row, with Type1 = PR, so no data is returned. Is there a way I can join the data and subtract it to achieve my goal?
#NBK Edit
NUM1 Grp1 X Y
TIX_1 GG 1 4
TIX_1 GG 0 0
TIX_2 XX 0 0
NULL NULL NULL NULL
Great work on your Minimal Reproducible Example!
One approach, which works with this dataset but may not work with a more complex one, is a straight GROUP BY with a conditional sum, e.g.
select min(Type1), NUM1, Grp1
, case when min(Type1) = 'CA' then sum(Amt1 * case when Type1 = 'CA' then 1 else -1 end) else sum(Amt1) end
, case when min(Type1) = 'CA' then sum(Amt2 * case when Type1 = 'CA' then 1 else -1 end) else sum(Amt2) end
from #test
group by NUM1, Grp1;
A more complex approach, which would also work with a more complex dataset, uses CTEs.
In the first CTE, get the count of rows per NUM1.
Then, in the second CTE, get a row number and a conditional sum over NUM1. The row number lets us keep only the first row per NUM1. The conditional sum adds up the amounts for a given NUM1, negating the Type1 = 'PR' amounts only when there is more than one row for that NUM1.
with cte1 as (
select *
, count(*) over (partition by NUM1) Cnt
from #test
), cte2 as (
select Type1, NUM1, Grp1, Amt1, Amt2
, row_number() over (partition by NUM1 order by Type1) rn
, sum(Amt1 * case when Type1 = 'CA' or Cnt = 1 then 1 else -1 end) over (partition by NUM1) Amt11
, sum(Amt2 * case when Type1 = 'CA' or Cnt = 1 then 1 else -1 end) over (partition by NUM1) Amt22
from cte1
)
select Type1, NUM1, Grp1, Amt11, Amt22
from cte2
where rn = 1;
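To see the conditional sum at work without a database, here is a small Python sketch of the same rule: CA amounts count as positive, PR amounts count as negative, unless the group has a single row, in which case the amounts are left as they are. It groups by NUM1 and Grp1 (an assumption borrowed from the GROUP BY version; the CTE version partitions by NUM1 only), and net_amounts is an illustrative name.

```python
from collections import defaultdict

# Sample rows from the question: (Type1, NUM1, Grp1, Amt1, Amt2).
rows = [
    ("CA", "TIX_1", "GG", 3, 5),
    ("PR", "TIX_1", "GG", 2, 1),
    ("PR", "TIX_2", "XX", 1, 5),
]

def net_amounts(rows):
    """Return (NUM1, Grp1, Amt1, Amt2) with PR subtracted from CA where both exist."""
    groups = defaultdict(list)
    for type1, num1, grp1, amt1, amt2 in rows:
        groups[(num1, grp1)].append((type1, amt1, amt2))
    out = []
    for (num1, grp1), items in sorted(groups.items()):
        single = len(items) == 1  # corresponds to Cnt = 1 in the CTE
        # Sign is +1 for CA rows or single-row groups, -1 otherwise.
        amt1 = sum(a1 if (t == "CA" or single) else -a1 for t, a1, _ in items)
        amt2 = sum(a2 if (t == "CA" or single) else -a2 for t, _, a2 in items)
        out.append((num1, grp1, amt1, amt2))
    return out

result = net_amounts(rows)
print(result)
```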

SQL- Cumulative sum based on condition

I have a scenario in which I have to calculate a counter based on the data below. If the status is A, B, or C, then the counter should be 0, which is working fine.
If STATUS is D, the counter should be a cumulative count, with the exception that if the status changes in between (as in 201907), the counter should reset and start again with 1, 2, 3, and so on. Any help is appreciated.
Input - 3 columns - Customer_No, Date, Status
CUSTOMER_NO Date STATUS
1234 201901 A
1234 201902 B
1234 201903 C
1234 201904 D
1234 201905 D
1234 201906 D
1234 201907 C
1234 201908 D
1234 201910 D
1234 201911 D
1234 201912 D
expected Output - Input columns + Counter Column
CUSTOMER_NO Date STATUS COUNTER
----------------------------------------
1234 201901 A 0
1234 201902 B 0
1234 201903 C 0
1234 201904 D 1
1234 201905 D 2
1234 201906 D 3
1234 201907 C 0
1234 201908 D 1
1234 201910 D 2
1234 201911 D 3
1234 201912 D 4
Thanks
You can create a numbering like a serial number for the ordering purpose using the ROW_NUMBER() function as shown below.
create table SampleData(CUSTOMER_NO int
, STATUS char(1)
, COUNTER int)
insert into SampleData Values
(1234, 'A', 0),
(1234, 'B', 0),
(1234, 'C', 0),
(1234, 'D', 1),
(1234, 'D', 2),
(1234, 'D', 3),
(1234, 'C', 0),
(1234, 'D', 1),
(1234, 'D', 2),
(1234, 'D', 3),
(1234, 'D', 4)
;with cte as(
Select *
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RN
from SampleData
)
select CUSTOMER_NO
, STATUS
, COUNTER
, (SELECT SUM(case STATUS when 'D' then Counter else 0 end) FROM cte t2 WHERE t2.RN <= cte.RN) AS Needed
from cte
Live db<>fiddle demo.
This is a similar approach to Gordon's; however, it uses CTEs and ROW_NUMBER to make the islands first, and then returns 0 if there is only 1 row in that island, using a windowed COUNT and a CASE expression:
WITH Grps AS(
SELECT ID,
CUSTOMER_NO,
[STATUS],
ROW_NUMBER() OVER (PARTITION BY CUSTOMER_NO ORDER BY ID) -
ROW_NUMBER() OVER (PARTITION BY CUSTOMER_NO, [STATUS] ORDER BY ID) AS Grp
FROM (VALUES(1,1234,'A'),
(2,1234,'B'),
(3,1234,'C'),
(4,1234,'D'),
(5,1234,'D'),
(6,1234,'D'),
(7,1234,'C'),
(8,1234,'D'),
(9,1234,'D'),
(10,1234,'D'),
(11,1234,'D'))V(ID,CUSTOMER_NO,[STATUS]))
SELECT ID,
CUSTOMER_NO,
[STATUS],
Grp,
CASE WHEN COUNT(ID) OVER (PARTITION BY CUSTOMER_NO, [STATUS], Grp) = 1 THEN 0
ELSE ROW_NUMBER() OVER (PARTITION BY CUSTOMER_NO, [STATUS], Grp ORDER BY ID) - 1
END AS [COUNTER]
FROM Grps;
As Gordon mentioned as well, if you don't have some kind of sequential ID/key, you can't do this with your data. You will need to implement some kind of sequential ID, and hope that your data retains its "insert order".
This is a variant of a gaps-and-islands problem. For this particular incarnation, you can identify the islands by counting the number of non-D statuses before a given row.
After identifying the groups, use case and row_number():
select t.*,
(case when status = 'D'
then row_number() over (partition by customer_no, grp, status order by date)
else 0
end) as counter
from (select t.*,
sum(case when status <> 'D' then 1 else 0 end) over (partition by customer_no order by date) as grp
from t
) t
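The island logic can be traced in a few lines of Python. This is a sketch under the assumption that rows can be ordered by customer and date: grp is the running count of non-D rows per customer (the windowed sum in the subquery), and D rows are numbered within each (customer, grp) pair (the row_number). The name add_counter is illustrative.

```python
# Sample rows from the question: (CUSTOMER_NO, Date, STATUS).
rows = [
    (1234, 201901, "A"), (1234, 201902, "B"), (1234, 201903, "C"),
    (1234, 201904, "D"), (1234, 201905, "D"), (1234, 201906, "D"),
    (1234, 201907, "C"),
    (1234, 201908, "D"), (1234, 201910, "D"), (1234, 201911, "D"),
    (1234, 201912, "D"),
]

def add_counter(rows):
    """Append a counter that numbers consecutive D rows and is 0 otherwise."""
    out = []
    grp = {}   # running count of non-D rows per customer
    seen = {}  # D rows seen so far per (customer, grp) island
    for cust, date, status in sorted(rows, key=lambda r: (r[0], r[1])):
        if status != "D":
            grp[cust] = grp.get(cust, 0) + 1  # a non-D row starts a new island
            out.append((cust, date, status, 0))
        else:
            key = (cust, grp.get(cust, 0))
            seen[key] = seen.get(key, 0) + 1  # row_number within the island
            out.append((cust, date, status, seen[key]))
    return out

result = add_counter(rows)
print([r[3] for r in result])
```

The C row at 201907 increments grp, so the D rows after it fall into a new island and their numbering restarts at 1.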

Assign column value based on the percentage of rows

In DB2 is there a way to assign a column value based on the first x%, then y% and remaining z% of rows?
I've tried using row_number() function but no luck!
Example below
Assuming that the below example count(id) is already arranged in descending order
Input:
ID count(id)
5 10
3 8
1 5
4 3
2 1
Output:
The first 30% of the rows of the above input should be assigned code H, the last 30% of the rows should have code L, and the remaining rows should have code M. If 30% of the rows evaluates to a fractional number, round it to 0 decimal places.
ID code
5 H
3 H
1 M
4 L
2 L
You can use window functions:
select t.id,
(case ntile(3) over (order by count(id) desc)
when 1 then 'H'
when 2 then 'M'
when 3 then 'L'
end) as grp
from t
group by t.id;
This puts them into equal sized groups.
For a 30-40-30% split with your conditions, you have to be more careful:
select t.id,
(case when (seqnum - 1.0) < 0.3 * num_ids then 'H'
when seqnum > 0.7 * num_ids then 'L'
else 'M'
end) as grp
from (select t.id,
count(*) as cnt,
count(*) over () as num_ids,
row_number() over (order by count(*) desc) as seqnum
from t
group by t.id
) t
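The boundary arithmetic for the 30-40-30 rule is easy to get wrong, so it can help to check it on the sample data. The sketch below is an illustration of the rule stated in the question, not a translation of any particular query: a row is H if fewer than 30% of the rows precede it, L if fewer than 30% of the rows follow it, and M otherwise. Note the assumption that this effectively rounds a fractional 30% upwards (1.5 rows becomes 2). The name assign_codes and its parameters are illustrative.

```python
# Sample data from the question: id -> count(id).
counts = {5: 10, 3: 8, 1: 5, 4: 3, 2: 1}

def assign_codes(counts, head=0.3, tail=0.3):
    """Return (id, code) pairs ordered by count descending."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ordered)
    out = []
    for seqnum, (id_, _) in enumerate(ordered, start=1):
        if seqnum - 1 < head * n:    # fewer than 30% of rows come before
            code = "H"
        elif n - seqnum < tail * n:  # fewer than 30% of rows come after
            code = "L"
        else:
            code = "M"
        out.append((id_, code))
    return out

result = assign_codes(counts)
print(result)
```

With five rows, 30% is 1.5 rows, so the first two rows get H, the last two get L, and the middle row gets M.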
Try this:
with t(ID, count_id) as (values
(5, 10)
, (3, 8)
, (1, 5)
, (4, 3)
, (2, 1)
)
select t.*
, case
when pst <=30 then 'H'
when pst <=70 then 'M'
else 'L'
end as code
from
(
select t.*
, rownumber() over (order by count_id desc) as rn
, 100*rownumber() over (order by count_id desc)/nullif(count(1) over(), 0) as pst
from t
) t;
The result is:
ID COUNT_ID RN PST CODE
-- -------- -- --- ----
5 10 1 20 H
3 8 2 40 M
1 5 3 60 M
4 3 4 80 L
2 1 5 100 L

oracle sql running total range

I have two tables. The first, tab_a:
SUB_ID AMOUNT
1 10
2 5
3 7
4 15
5 4
The second table, tab_b:
slab_number slab_start slab_end
1 12 20
2 21 25
3 26 35
slab_start will always be 1 more than the slab_end of the previous slab number.
If I run the running total for tab_a my result is
select sub_id , sum(amount) OVER(ORDER BY sub_id) run_sum
from tab_a
sub_id run_sum
1 10
2 15
3 22
4 37
5 41
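I need a SQL query that finds the slab_number for each run_sum. If run_sum is less than the slab_start of the first slab, the result should be zero. If run_sum is more than the slab_end of the last slab, the result should be blank (NULL), except for the first row that crosses the limit, which should get the last slab_number.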
Expected result is
sub_id run_sum slab_number
1 10 0
2 15 1
3 22 2
4 37 3
5 41 NULL
I have tried this.
First, find the first running sum that crosses the limit, i.e. the last slab_end:
select min( run_sum )
from (select sub_id , sum(amount) OVER(ORDER BY sub_id) run_sum
from tab_a ) where run_sum>=35
Then use the query below:
select sub_id,
run_sum,
case
when run_sum <
(select SLAB_START from tab_b where slab_number = '1') then
0
when run_sum = 37 then
(select max(slab_number) from tab_b)
when run_sum > 37 then
NULL
else
(select slab_number
from tab_b
where run_sum between SLAB_START and slab_end)
end slab_number
from (select sub_id, sum(amount) OVER(ORDER BY sub_id) run_sum from tab_a)
Is there a better way to do this?
Somewhat strange requirement :) Use some analytic functions and CASE WHENs: row_number() when you need to find the first of something, and max() over () and sum() over () when you need information from other rows:
with
a as (
select sub_id, row_number() over (order by sub_id) rn,
sum(amount) over (order by sub_id) rs
from tab_a),
b as (select tab_b.*, max(slab_number) over () msn from tab_b )
select sub_id, rs,
case when sn is null and row_number() over (partition by sn order by sub_id) = 1
then msn else sn
end sn
from (
select sub_id, rs, max(msn) over () msn,
case when slab_number is null and rn = 1 then 0 else slab_number end sn
from a left join b on rs between slab_start and slab_end)
dbfiddle demo
You could try this:
select a.sub_id , sum(a.amount) OVER(ORDER BY a.sub_id) run_sum
,case when b.slab_number=1 then 0 else lag(b.slab_number,1) over (order by a.sub_id)end slab_number
from tab_a a
left join tab_b b on a.SUB_ID = b.slab_number
I think this is basically a left join with a default value:
select a.*,
(case when a.run_sum < bb.min_slab_start then 0
else b.slab_number
end) as slab_num
from (select sub_id,
sum(amount) over (order by sub_id) as run_sum
from tab_a
) a left join
tab_b b
on a.run_sum between slab_start and slab_end cross join
(select min(slab_start) as min_slab_start
from tab_b
) bb;
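The full rule from the question (0 below the first slab, the matching slab inside the range, the last slab for the first row past the end, and nothing after that) can be sketched procedurally. This Python version is only an illustration under the assumptions stated in the question: slabs are contiguous, and amounts are positive so the running sum never falls back into range. The name slab_for_running_sum is illustrative.

```python
from itertools import accumulate

# Sample data from the question.
tab_a = [(1, 10), (2, 5), (3, 7), (4, 15), (5, 4)]  # (sub_id, amount)
tab_b = [(1, 12, 20), (2, 21, 25), (3, 26, 35)]     # (slab_number, slab_start, slab_end)

def slab_for_running_sum(tab_a, tab_b):
    """Return (sub_id, run_sum, slab_number) following the rule in the question."""
    ordered = sorted(tab_a)
    run_sums = accumulate(amount for _, amount in ordered)
    first_start = min(start for _, start, _ in tab_b)
    last_end = max(end for _, _, end in tab_b)
    last_number = max(number for number, _, _ in tab_b)
    out = []
    crossed = False  # becomes True once a row has passed the last slab_end
    for (sub_id, _), run_sum in zip(ordered, run_sums):
        if run_sum < first_start:
            slab = 0
        elif run_sum > last_end:
            # Only the first row past the limit gets the last slab number.
            slab = None if crossed else last_number
            crossed = True
        else:
            slab = next(n for n, s, e in tab_b if s <= run_sum <= e)
        out.append((sub_id, run_sum, slab))
    return out

result = slab_for_running_sum(tab_a, tab_b)
print(result)
```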

ROW_NUMBER query

I have a table:
Trip Stop Time
-----------------
1 A 1:10
1 B 1:16
1 B 1:20
1 B 1:25
1 C 1:31
1 B 1:40
2 A 2:10
2 B 2:17
2 C 2:20
2 B 2:25
I want to add one more column to my query output:
Trip Stop Time Sequence
-------------------------
1 A 1:10 1
1 B 1:16 2
1 B 1:20 2
1 B 1:25 2
1 C 1:31 3
1 B 1:40 4
2 A 2:10 1
2 B 2:17 2
2 C 2:20 3
2 B 2:25 4
The hard part is B: if B rows are next to each other I want them to have the same sequence; if not, they should count as a new row.
I know
row_number() over (partition by trip order by time)
row_number() over (partition by trip, stop order by time)
Neither of them meets the condition I want. Is there a way to query this?
create table test
(trip number
,stp varchar2(1)
,tm varchar2(10)
,seq number);
insert into test values (1, 'A', '1:10', 1);
insert into test values (1, 'B', '1:16', 2);
insert into test values (1, 'B', '1:20', 2);
insert into test values (1 , 'B', '1:25', 2);
insert into test values (1 , 'C', '1:31', 3);
insert into test values (1, 'B', '1:40', 4);
insert into test values (2, 'A', '2:10', 1);
insert into test values (2, 'B', '2:17', 2);
insert into test values (2, 'C', '2:20', 3);
insert into test values (2, 'B', '2:25', 4);
select t1.*
,sum(decode(t1.stp,t1.prev_stp,0,1)) over (partition by trip order by tm) new_seq
from
(select t.*
,lag(stp) over (order by t.tm) prev_stp
from test t
order by tm) t1
;
TRIP S TM SEQ P NEW_SEQ
------ - ---------- ---------- - ----------
1 A 1:10 1 1
1 B 1:16 2 A 2
1 B 1:20 2 B 2
1 B 1:25 2 B 2
1 C 1:31 3 B 3
1 B 1:40 4 C 4
2 A 2:10 1 B 1
2 B 2:17 2 A 2
2 C 2:20 3 B 3
2 B 2:25 4 C 4
10 rows selected
You want to see if the stop changes between one row and the next. If it does, you want to increment the sequence. So use lag to get the previous stop into the current row.
I used DECODE because of the way it handles NULLs and because it is more concise than CASE, but if you are following the textbook, you should probably use CASE.
Using SUM as an analytic function with an ORDER BY clause will give the answer you are looking for.
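The same idea, compare each stop with the previous one and add 1 to a running total whenever it changes, can be traced in Python. This is a sketch that assumes the rows are already ordered by time within each trip, and it tracks the previous stop per trip (the query above takes the lag across all rows instead). The name add_sequence is illustrative.

```python
# Sample rows from the question: (Trip, Stop, Time), ordered by time within each trip.
rows = [
    (1, "A", "1:10"), (1, "B", "1:16"), (1, "B", "1:20"), (1, "B", "1:25"),
    (1, "C", "1:31"), (1, "B", "1:40"),
    (2, "A", "2:10"), (2, "B", "2:17"), (2, "C", "2:20"), (2, "B", "2:25"),
]

def add_sequence(rows):
    """Append a sequence number that increases whenever the stop changes."""
    out = []
    prev = {}  # previous stop per trip (the LAG value)
    seq = {}   # running total of changes per trip (the analytic SUM)
    for trip, stop, tm in rows:
        if prev.get(trip) != stop:  # the first row of a trip always counts as a change
            seq[trip] = seq.get(trip, 0) + 1
        prev[trip] = stop
        out.append((trip, stop, tm, seq[trip]))
    return out

result = add_sequence(rows)
print([r[3] for r in result])
```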
select *, dense_rank() over(partition by trip, stop order by time) as sqnc
from yourtable;
Use dense_rank so you get all the numbers consecutively, with no skipped numbers in between.
I think this is more complicated than a simple row_number(). You need to identify groups of adjacent stops and then enumerate them.
You can identify the groups using a difference of row numbers. Then, a dense_rank() on the difference does what you want if there are no repeated stops on a trip:
select t.*,
dense_rank() over (partition by trip order by grp, stop)
from (select t.*,
(row_number() over (partition by trip order by time) -
row_number() over (partition by trip, stop order by time)
) as grp
from table t
) t;
If there are repeated stops:
select t.*, dense_rank() over (partition by trip order by mintime)
from (select t.*,
min(time) over (partition by trip, grp, stop) as mintime
from (select t.*,
(row_number() over (partition by trip order by time) -
row_number() over (partition by trip, stop order by time)
) as grp
from table t
) t
) t;
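The three steps of the last query (difference of row numbers, minimum time per island, dense rank on that minimum) can be followed step by step in Python. This is a sketch on the sample data; the helper to_minutes is an assumption added so that times compare numerically rather than as strings, and island_sequence is an illustrative name.

```python
from collections import defaultdict

# Sample rows from the question: (Trip, Stop, Time).
rows = [
    (1, "A", "1:10"), (1, "B", "1:16"), (1, "B", "1:20"), (1, "B", "1:25"),
    (1, "C", "1:31"), (1, "B", "1:40"),
    (2, "A", "2:10"), (2, "B", "2:17"), (2, "C", "2:20"), (2, "B", "2:25"),
]

def to_minutes(t):
    """Convert 'h:mm' to minutes so that times sort numerically."""
    h, m = t.split(":")
    return int(h) * 60 + int(m)

def island_sequence(rows):
    rows = sorted(rows, key=lambda r: (r[0], to_minutes(r[2])))
    rn_trip = defaultdict(int)       # row_number() over (partition by trip)
    rn_trip_stop = defaultdict(int)  # row_number() over (partition by trip, stop)
    keyed = []
    for trip, stop, tm in rows:
        rn_trip[trip] += 1
        rn_trip_stop[(trip, stop)] += 1
        grp = rn_trip[trip] - rn_trip_stop[(trip, stop)]
        keyed.append((trip, stop, tm, grp))
    # min(time) over (partition by trip, grp, stop)
    mintime = {}
    for trip, stop, tm, grp in keyed:
        k = (trip, grp, stop)
        mintime[k] = min(mintime.get(k, to_minutes(tm)), to_minutes(tm))
    # dense_rank() over (partition by trip order by mintime)
    ranks = {}
    for trip in {r[0] for r in keyed}:
        times = sorted({mt for (t, _, _), mt in mintime.items() if t == trip})
        ranks[trip] = {mt: i for i, mt in enumerate(times, start=1)}
    return [(trip, stop, tm, ranks[trip][mintime[(trip, grp, stop)]])
            for trip, stop, tm, grp in keyed]

result = island_sequence(rows)
print([r[3] for r in result])
```

For trip 1 the second run of B gets grp 2 while C gets grp 4, which is why ranking by grp alone puts them in the wrong order and the minimum time is needed.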