Failing to breakup Windowing invocations into Groups when using rank() - hive

I have a dataset of properties and I'm trying to rank each segment by its property count. I then want to assign the segment with the most properties as the segment for the xml_id which manages these properties. I have tried this:
select
yyyy_mm_dd,
xml_id,
ps_segment
from(
select
pspd.yyyy_mm_dd,
xppd.xml_id,
ps.ps_segment,
count(pspd.property_id) as property_count,
rank() over (partition by pspd.yyyy_mm_dd order by count(pspd.property_id) desc) rn
from(
select
yyyy_mm_dd,
property_id
from
t1
) pspd
left join(
select
yyyy_mm_dd,
property_id,
xml_id
from
t2
) xppd on xppd.property_id = pspd.property_id and xppd.yyyy_mm_dd = pspd.yyyy_mm_dd
inner join
t3 ps on ps.property_id = property_id
group by
1,2,3
) x
where
rn = 1
The above throws the following error:
Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 5:50 Expression not in GROUP BY key 'property_count'
Essentially, each xml_id should end up with the ps_segment that has the highest property count. What am I doing wrong in the query?

I solved this myself, and the query below works. It seems I should have been using ROW_NUMBER() instead of RANK(); note that the window now also partitions by xml_id:
select
yyyy_mm_dd,
xml_id,
ps_segment
from(
select
pspd.yyyy_mm_dd,
xppd.xml_id,
ps.ps_segment,
count(pspd.property_id) as property_count,
ROW_NUMBER() over (partition by pspd.yyyy_mm_dd, xml_id order by count(pspd.property_id) desc) rn
from(
select
yyyy_mm_dd,
property_id
from
t1
where
yyyy_mm_dd = '2019-10-10'
) pspd
left join(
select
yyyy_mm_dd,
property_id,
xml_id
from
t2
where
yyyy_mm_dd = '2019-10-10'
) xppd on xppd.property_id = pspd.property_id and xppd.yyyy_mm_dd = pspd.yyyy_mm_dd
inner join
t3 ps on ps.property_id = pspd.property_id
group by
1,2,3
) x
where
rn = 1
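For anyone who wants to try the pattern without a Hive cluster, here is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25 or later is needed for window functions). The table props and its rows are made up for illustration and stand in for the joined t1/t2/t3 data. The counts are aggregated in an inner query before ranking, which is one way to keep the window function from depending on an aggregate computed in the same SELECT; it is not the original query, just the same idea on toy data.

```python
import sqlite3

# Hypothetical toy data: one row per property, already joined to its xml_id and segment.
rows = [
    ("2019-10-10", "x1", "budget"),
    ("2019-10-10", "x1", "budget"),
    ("2019-10-10", "x1", "luxury"),
    ("2019-10-10", "x2", "luxury"),
    ("2019-10-10", "x2", "luxury"),
    ("2019-10-10", "x2", "luxury"),
    ("2019-10-10", "x2", "budget"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table props (yyyy_mm_dd text, xml_id text, ps_segment text)")
conn.executemany("insert into props values (?, ?, ?)", rows)

query = """
select yyyy_mm_dd, xml_id, ps_segment
from (
    -- Step 2: number the segments of each (date, xml_id), largest count first.
    select c.*,
           row_number() over (
               partition by yyyy_mm_dd, xml_id
               order by property_count desc
           ) as rn
    from (
        -- Step 1: count properties per (date, xml_id, segment).
        select yyyy_mm_dd, xml_id, ps_segment, count(*) as property_count
        from props
        group by yyyy_mm_dd, xml_id, ps_segment
    ) c
) x
where rn = 1
order by xml_id
"""
top_segments = conn.execute(query).fetchall()
print(top_segments)
```

Partitioning by xml_id as well as the date is what yields one winning segment per xml_id; partitioning by the date alone would keep only the single largest segment for the whole day.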

Related

Selecting the latest order

I need to select the data of all my customers with the records displayed in the image. But I need to get the most recent record only, for example I need to get the order # E987 for John and E888 for Adam. As you can see from the example, when I do the select statement, I get all the order records.
You don't mention the specific database, so I'll answer with a generic solution.
You can do:
select *
from (
select t.*,
row_number() over(partition by name order by order_date desc) as rn
from t
) x
where rn = 1
You can use the analytic function row_number():
Select * from
(Select t.*,
Row_number() over (partition by customer_id order by order_date desc) as rn
From your_table t) t
Where rn = 1
Or you can use not exists as follows:
Select *
From your_table t
Where not exists
(Select 1 from your_table tt
Where t.customer_id = tt.customer_id
And tt.order_date > t.order_date)
You can do it with a subquery that finds the last order date.
SELECT t.*
FROM your_table t
JOIN (SELECT tt.customer_id,
MAX(tt.order_date) AS MaxOrderDate
FROM your_table tt
GROUP BY tt.customer_id) AS tt
ON t.customer_id = tt.customer_id
AND t.order_date = tt.MaxOrderDate
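To compare these approaches side by side, here is a small sketch using Python's sqlite3 module (SQLite 3.25+ for window functions). The orders table, its columns, and the older orders E123 and E555 are invented for illustration; only E987 for John and E888 for Adam come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (name text, order_no text, order_date text)")
conn.executemany(
    "insert into orders values (?, ?, ?)",
    [
        ("John", "E123", "2020-01-05"),  # hypothetical older order
        ("John", "E987", "2020-03-01"),
        ("Adam", "E555", "2020-02-10"),  # hypothetical older order
        ("Adam", "E888", "2020-02-20"),
    ],
)

# Approach 1: number each customer's orders, newest first, and keep number 1.
by_row_number = conn.execute("""
    select name, order_no
    from (
        select o.*,
               row_number() over (partition by name order by order_date desc) as rn
        from orders o
    ) x
    where rn = 1
    order by name
""").fetchall()

# Approach 2: keep an order only if no newer order exists for the same customer.
by_not_exists = conn.execute("""
    select name, order_no
    from orders o
    where not exists (
        select 1 from orders newer
        where newer.name = o.name and newer.order_date > o.order_date
    )
    order by name
""").fetchall()

print(by_row_number)
print(by_not_exists)
```

On this data both queries return the same rows. They differ when a customer has two orders with the same latest date: row_number() still returns exactly one row per customer, while the not exists and MAX join versions return every tied row.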

Display Prev and Current value based on a ID - SQL

I am not sure if a similar question has been posted. I was unable to find one.
I have the following table:
What I am trying to get is the below:
Any advice will be appreciated.
Thanks in advance,
Sam
Worked both in Oracle and Snowflake:
SELECT t.ID,
t.prev_dt,
t.current_dt,
t.prev_code,
t.curr_code
FROM (
SELECT id,
order_dt,
LAG(order_dt, 1) OVER (PARTITION by id ORDER BY id, order_dt) prev_dt,
upd_dt current_dt,
LAG(code, 1) OVER (PARTITION by id ORDER BY id, upd_dt) prev_code,
code curr_code
FROM t111
) t
INNER JOIN (
SELECT id,
MAX(order_dt) max_date
FROM t111
GROUP BY id
) idm
ON idm.id=t.id AND t.order_dt=idm.max_date
You seem to want the window function lag():
select
id,
lag(order_dt) over(partition by id order by order_by_id) prev_dt,
order_dt current_dt,
lag(code) over(partition by id order by order_by_id) prev_code,
code curr_code
from mytable
Note that the above query does not filter the records of the table. When there is no preceding record, lag() returns null. If you want to filter out the first record per group, and assuming that such a record is identified by order_by_id = 1, you can do:
select *
from (
select
id,
lag(order_dt) over(partition by id order by order_by_id) prev_dt,
order_dt current_dt,
lag(code) over(partition by id order by order_by_id) prev_code,
code curr_code,
order_by_id
from mytable
) t
where order_by_id > 1
Window functions might be the best approach. But you could also use join:
select t1.id, t1.order_dt as prev_dt, t2.upd_dt as curr_date,
t1.code as prev_code, t2.code as curr_code
from t t1 join
t t2
on t1.id = t2.id and t1.order_by_id = 1 and t2.order_by_id = 2;
In Snowflake, I simply do not know whether this would have better, worse, or similar performance to using window functions.
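Since the question's sample data was posted as images, here is a minimal sketch of the lag() approach on made-up rows, run through Python's sqlite3 module (SQLite 3.25+). The column names follow the answers above; the values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table mytable (id integer, order_by_id integer, order_dt text, code text)"
)
conn.executemany(
    "insert into mytable values (?, ?, ?, ?)",
    [
        (1, 1, "2020-01-01", "A"),
        (1, 2, "2020-02-01", "B"),
        (2, 1, "2020-03-01", "C"),
        (2, 2, "2020-04-01", "D"),
    ],
)

# lag() looks one row back within each id, in order_by_id order.
rows = conn.execute("""
    select id,
           lag(order_dt) over (partition by id order by order_by_id) as prev_dt,
           order_dt as current_dt,
           lag(code) over (partition by id order by order_by_id) as prev_code,
           code as curr_code
    from mytable
    order by id, order_by_id
""").fetchall()

# The first row of each id has no predecessor, so its prev_* values are NULL (None).
with_previous = [r for r in rows if r[1] is not None]
print(with_previous)
```

Filtering out the rows where prev_dt is null has the same effect here as the order_by_id > 1 condition, without assuming that numbering starts at 1.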

SQL query to get first and last of a sequence

With the following query ...
select aa.trip_id, aa.arrival_time, aa.departure_time, aa.stop_sequence, aa.stop_id, bb.stop_name
from OeBB_Stop_Times aa
left join OeBB_Stops bb on aa.stop_id = bb.stop_id
I get the following table:
Now I want the first and last line/value from the column stop_sequence referring to column trip_id, so the result should be:
How can I do that?
Thanks
You can do a sub-query to get the min and max and join against that data.
Like this:
select aa.trip_id, aa.arrival_time, aa.departure_time, aa.stop_sequence, aa.stop_id, bb.stop_name
from OeBB_Stop_Times aa
join (
SELECT trip_id, max(stop_sequence) as max_stop, min(stop_sequence) as min_stop
FROM OeBB_Stop_Times
GROUP BY trip_id
) sub on aa.trip_id = sub.trip_id AND (aa.stop_sequence = sub.max_stop or aa.stop_sequence = sub.min_stop)
left join OeBB_Stops bb on aa.stop_id = bb.stop_id
You can use the ROW_NUMBER() window function twice to filter out rows, as in:
select *
from (
select *,
row_number() over(partition by trip_id order by arrival_time) as rn,
row_number() over(partition by trip_id order by arrival_time desc) as rnr
from OeBB_Stop_Times
) x
where rn = 1 or rnr = 1
order by trip_id, arrival_time
You can use row_number():
select s.*
from (select st.trip_id, st.arrival_time, st.departure_time,
st.stop_sequence, st.stop_id, s.stop_name,
row_number() over (partition by st.trip_id order by st.stop_sequence) as seqnum_asc,
row_number() over (partition by st.trip_id order by st.stop_sequence desc) as seqnum_desc
from OeBB_Stop_Times st left join
OeBB_Stops s
on st.stop_id = s.stop_id
) s
where 1 in (seqnum_asc, seqnum_desc);
Note that I fixed the table aliases so they are meaningful rather than arbitrary letters.
Actually, if the stop_sequence is guaranteed to start at 1, this is a bit simpler:
select s.*
from (select st.trip_id, st.arrival_time, st.departure_time,
st.stop_sequence, st.stop_id, s.stop_name,
max(st.stop_sequence) over (partition by st.trip_id) as max_stop_sequence
from OeBB_Stop_Times st left join
OeBB_Stops s
on st.stop_id = s.stop_id
) s
where stop_sequence in (1, max_stop_sequence);
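Here is a small runnable sketch of the double row_number() idea, using Python's sqlite3 module (SQLite 3.25+). The stop_times table and its rows are hypothetical stand-ins for OeBB_Stop_Times, and the join to the stops table is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table stop_times (trip_id text, stop_sequence integer, stop_id text)"
)
conn.executemany(
    "insert into stop_times values (?, ?, ?)",
    [
        ("T1", 1, "A"), ("T1", 2, "B"), ("T1", 3, "C"),
        ("T2", 1, "X"), ("T2", 2, "Y"),
        ("T3", 1, "Q"),  # a trip with a single stop
    ],
)

query = """
select trip_id, stop_sequence, stop_id
from (
    select st.*,
           row_number() over (partition by trip_id order by stop_sequence) as seqnum_asc,
           row_number() over (partition by trip_id order by stop_sequence desc) as seqnum_desc
    from stop_times st
) s
where 1 in (seqnum_asc, seqnum_desc)
order by trip_id, stop_sequence
"""
first_and_last = conn.execute(query).fetchall()
print(first_and_last)
```

A trip with only one stop comes back once, because that row is both first and last.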

Select most recent status for each ID and department code

I have the following table:
I want to get the most recent status for each dept_code that a CL_ID has. So the desired output would be this:
I have tried the following, but this gives me just the most recent status for each client and not for each of their dept_codes.
SELECT *
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT] C
INNER JOIN
(SELECT CLIENT_NUMBER, MAX(STATUS_DATE) AS SDATE
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
GROUP BY CLIENT_NUMBER) X
ON X.CLIENT_NUMBER = C.CLIENT_NUMBER
AND X.SDATE = C.STATUS_DATE
ORDER BY C.CLIENT_NUMBER
Any help would be much appreciated. Thanks.
A convenient method that works in SQL Server is:
select top (1) with ties cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
order by row_number() over (partition by cl_id, dept_code order by status_date desc);
A method that is efficient with the right indexes in almost any database is:
select cl.*
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl
where cl.status_date = (select max(cl2.status_date)
from [CIMSHR6_MERGED].[dbo].[C3CLSTAT] cl2
where cl2.cl_id = cl.cl_id and cl2.dept_code = cl.dept_code
);
The right index is on (cl_id, dept_code, status_date).
I would also use ROW_NUMBER, but with a subquery:
SELECT CL_ID, Status_date, Status, Dept_code
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY CL_ID, Dept_code ORDER BY Status_date DESC) rn
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) t
WHERE rn = 1;
1) First, partition the rows by Dept_Code and CL_ID and assign a rank to each row within its partition, ordered by Status_Date descending.
2) Select the rows with rnk = 1, which gives your desired result.
SELECT Z.CL_ID,
Z.Status_Date,
Z.Status,
Z.Dept_Code
FROM
(
SELECT *,
RANK() OVER( PARTITION BY Dept_Code, CL_ID ORDER BY Status_Date DESC ) AS rnk
FROM [CIMSHR6_MERGED].[dbo].[C3CLSTAT]
) Z
WHERE Z.rnk = 1;
This would work in almost all databases:
select * from c3clstat c
where exists
(select 1 from c3clstat c1
where c1.cl_id=c.cl_id
and c1.dept_code=c.dept_code
group by cl_id,dept_code
having c.status_date=max(c1.status_date)
)
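The ROW_NUMBER and RANK answers above only differ when two statuses share the same latest date. Here is a minimal sketch of that difference, using Python's sqlite3 module (SQLite 3.25+) and a made-up c3clstat table; the rows and the helper name latest are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table c3clstat (cl_id integer, dept_code text, status text, status_date text)"
)
conn.executemany(
    "insert into c3clstat values (?, ?, ?, ?)",
    [
        (1, "D1", "open", "2019-01-01"),
        (1, "D1", "closed", "2019-06-01"),
        (1, "D2", "open", "2019-03-01"),
        (1, "D2", "hold", "2019-03-01"),  # same date as the row above: a tie
    ],
)

def latest(window_fn):
    """Return the rows numbered 1 per (cl_id, dept_code) using rank or row_number."""
    sql = f"""
        select cl_id, dept_code, status
        from (
            select c.*,
                   {window_fn}() over (
                       partition by cl_id, dept_code
                       order by status_date desc
                   ) as rnk
            from c3clstat c
        ) z
        where rnk = 1
        order by dept_code, status
    """
    return conn.execute(sql).fetchall()

with_rank = latest("rank")              # keeps both tied D2 rows
with_row_number = latest("row_number")  # keeps one D2 row, chosen arbitrarily
print(with_rank)
print(with_row_number)
```

If exactly one row per CL_ID and dept_code is required, row_number() with an extra tie-breaking column in the order by is the safer choice; rank() is appropriate when all tied rows should be returned.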

SQL QUALIFY equivalent HIVE query

I'm trying to create a HIVE query from an Oracle SQL query. Essentially I want to select the first record, sorted descending by UPDATED_TM, DATETIME, ID_NUM.
SELECT
tbl1.NUM AS ID,
tbl1.UNIT AS UNIT,
tbl2.VALUE AS VALUE,
tbl1.CONTACT AS CONTACT_NAME,
'FILE' AS SOURCE,
CURDATE() AS DATE
FROM
DB1.TBL1 tbl1
LEFT JOIN DB1.TBL2 tbl2 ON tbl1.USR_ID = tbl2.USR_ID
WHERE
tbl1.UNIT IS NOT NULL
AND tbl1.TYPE = 'Generic'
QUALIFY
ROW_NUMBER() OVER (PARTITION BY tbl1.ROW_ID ORDER BY tbl1.UPDATED_TM DESC, tbl1.DATETIME DESC, tbl1.ID_NUM DESC) = 1
And my attempt at an equivalent Hive query (but also sql-compatible):
SELECT
tbl1.NUM AS ID,
tbl1.UNIT AS UNIT,
tbl2.VALUE AS VALUE,
tbl1.CONTACT AS CONTACT_NAME,
'FILE' AS SOURCE,
CURDATE() AS DATE
FROM (
SELECT
USR_ID, TYPE, NUM, UNIT, ROW_NUMBER() OVER (PARTITION BY tbl1.ROW_ID ORDER BY tbl1.UPDATED_TM DESC, tbl1.DATETIME DESC, tbl1.ID_NUM DESC) AS RNUM
FROM
DB1.TBL1
) tbl1
LEFT JOIN DB1.TBL2 tbl2 ON tbl1.USR_ID = tbl2.USR_ID
WHERE
tbl1.RNUM = 1
AND tbl1.UNIT IS NOT NULL
AND tbl1.TYPE = 'Generic'
Does that seem correct? Is there any way I can optimize the query? The tables I'm working with are quite large and I would like to make this as efficient as possible.
Thanks.
SELECT
tbl1.NUM AS ID,
tbl1.UNIT AS UNIT,
tbl2.VALUE AS VALUE,
tbl1.CONTACT AS CONTACT_NAME,
'FILE' AS SOURCE,
CURDATE() AS DATE
FROM
(
SELECT
USR_ID, TYPE, NUM, UNIT, CONTACT, ROW_NUMBER() OVER (PARTITION BY tbl.ROW_ID ORDER BY tbl.UPDATED_TM DESC, tbl.DATETIME DESC, tbl.ID_NUM DESC) AS RNUM
FROM
(
SELECT
USR_ID,TYPE,NUM,UNIT,CONTACT,ROW_ID,UPDATED_TM,DATETIME,ID_NUM
FROM DB1.TBL1
WHERE UNIT IS NOT NULL
AND TYPE = 'Generic'
)tbl
)tbl1
LEFT OUTER JOIN
DB1.TBL2 tbl2
ON tbl1.USR_ID = tbl2.USR_ID
WHERE tbl1.RNUM = 1;
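The query above applies the UNIT and TYPE filters before numbering the rows, which matches how QUALIFY behaves: WHERE is evaluated first, then the window function, then QUALIFY. A minimal sketch of why the order matters, using Python's sqlite3 module (SQLite 3.25+) and a cut-down, hypothetical tbl1 with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl1 (row_id integer, num text, unit text, updated_tm text)")
conn.executemany(
    "insert into tbl1 values (?, ?, ?, ?)",
    [
        (1, "N1", "kg", "2020-01-01"),
        (1, "N2", None, "2020-02-01"),  # newest row in its partition, but UNIT is null
    ],
)

# Numbering first and filtering afterwards: the newest row takes rnum = 1 and is
# then discarded by the UNIT filter, so nothing survives for row_id 1.
filter_after = conn.execute("""
    select num
    from (
        select t.*,
               row_number() over (partition by row_id order by updated_tm desc) as rnum
        from tbl1 t
    ) x
    where rnum = 1 and unit is not null
""").fetchall()

# Filtering first and numbering afterwards: the newest qualifying row is kept.
filter_before = conn.execute("""
    select num
    from (
        select t.*,
               row_number() over (partition by row_id order by updated_tm desc) as rnum
        from (select * from tbl1 where unit is not null) t
    ) x
    where rnum = 1
""").fetchall()

print(filter_after)
print(filter_before)
```

Filtering in the innermost query also reduces the number of rows that have to be shuffled and sorted for the window function, which tends to help on large tables.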