Converting Code from Teradata to HIVE Rank Order

I need help converting the query below from Teradata syntax to Hive.
I tried copying and pasting the subquery, but I can't get the QUALIFY clause to work.
CREATE MULTISET VOLATILE TABLE Month_Shifts AS (
SELECT "Month"
, Emp_ID
, Emp_NM
, MAX(ending_team) OVER (PARTITION BY Emp_ID ORDER BY "Month" ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS Starting_team
, ending_team
FROM
(
SELECT "Month"
, Emp_id
, current_team AS Ending_team
, COUNT(DISTINCT call_key) AS CallVolume
FROM data
GROUP BY 1,2,3
QUALIFY ROW_NUMBER() OVER (PARTITION BY "month", Emp_ID, Emp_NM ORDER BY CallVolume DESC) = 1
) a
) WITH DATA NO PRIMARY INDEX ON COMMIT PRESERVE ROWS;
I expected it to run without issue, but I am currently seeing this error message:
FAILED: ParseException line 1:260 missing EOF at 'QUALIFY' near '4'

In Hive, you can just move the condition to the outer query:
-- Hive quotes identifiers with backticks; in double quotes, "Month" would be a string literal.
SELECT `month`, Emp_ID, Emp_NM,
       LAG(ending_team) OVER (PARTITION BY Emp_ID ORDER BY `month`) AS Starting_team,
       ending_team
FROM (SELECT d.`month`, d.Emp_ID, d.Emp_NM,
             d.current_team AS Ending_team,
             COUNT(DISTINCT d.call_key) AS CallVolume,
             ROW_NUMBER() OVER (PARTITION BY d.`month`, d.Emp_ID, d.Emp_NM ORDER BY COUNT(DISTINCT d.call_key) DESC) AS seqnum
      FROM data d
      -- every non-aggregated column in the SELECT list must appear in the GROUP BY
      GROUP BY d.`month`, d.Emp_ID, d.Emp_NM, d.current_team
     ) d
WHERE seqnum = 1;
Notes:
The QUALIFY is replaced by the WHERE in the outer query.
Do not use SELECT * with GROUP BY. List the columns. Regardless of database.
Hive supports LAG(), which is more appropriate for the outer SELECT.
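
The QUALIFY-to-WHERE rewrite can be tried out locally. This is a minimal sketch, assuming Python's built-in sqlite3 module (SQLite 3.25 or later, for window functions) as a stand-in for Hive. The table name calls and its sample rows are made up for illustration, and the aggregation sits in its own subquery so the window function only sees plain columns.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calls (month TEXT, emp_id INTEGER, current_team TEXT, call_key INTEGER);
INSERT INTO calls VALUES
  ('2019-01', 1, 'A', 10), ('2019-01', 1, 'A', 11), ('2019-01', 1, 'B', 12),
  ('2019-02', 1, 'B', 13), ('2019-02', 1, 'B', 14), ('2019-02', 1, 'A', 15);
""")

query = """
SELECT month, emp_id,
       LAG(ending_team) OVER (PARTITION BY emp_id ORDER BY month) AS starting_team,
       ending_team
FROM (SELECT c.*,
             -- rank the teams of each employee and month by call volume
             ROW_NUMBER() OVER (PARTITION BY month, emp_id
                                ORDER BY call_volume DESC) AS seqnum
      FROM (SELECT month, emp_id, current_team AS ending_team,
                   COUNT(DISTINCT call_key) AS call_volume
            FROM calls
            GROUP BY month, emp_id, current_team) c
     ) ranked
WHERE seqnum = 1          -- plays the role of Teradata's QUALIFY ... = 1
ORDER BY emp_id, month
"""
month_shifts = conn.execute(query).fetchall()
for row in month_shifts:
    print(row)
```

Because the outer WHERE is evaluated before the outer window function, LAG() only sees the one winning team per month, which is the behaviour of the original Teradata query.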

Related

In Oracle How do we get multiple columns in result which are not in tables

For example :
I have below table named "T1"
and I need result like this:
If "earliest_run_date", "last_rundate", and "remaining_run_dates" were columns in the table T1, I could have used PIVOT.
But I don't know how to bring these three columns into the result set. Any solution will be much appreciated.

My guess is that you want something like this. There's probably a better way to eliminate the first and last row from the listagg that I'm not seeing off the top of my head but this should be reasonably efficient.
with ranked_t1 as (
select t1.*,
rank() over( partition by job_id
order by run_date asc ) asc_rank,
rank() over( partition by job_id
order by run_date desc ) desc_rank
from t1
)
select job_id,
min( run_date ) earliest_run_date,
max( run_date ) last_rundate,
listagg( (case when asc_rank != 1
and desc_rank != 1
then run_date
else null
end), ' ' )
within group( order by run_date ) remaining_run_dates
from ranked_t1
group by job_id;
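
The same shape can be checked on a small data set. This is a sketch only, assuming Python's built-in sqlite3 module (SQLite 3.25 or later) instead of Oracle: GROUP_CONCAT stands in for LISTAGG, and because this form of GROUP_CONCAT does not promise an order, the middle dates are sorted in Python. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (job_id INTEGER, run_date TEXT);
INSERT INTO t1 VALUES
  (1, '2020-01-01'), (1, '2020-01-05'), (1, '2020-01-09'), (1, '2020-01-12'),
  (2, '2020-02-01'), (2, '2020-02-03');
""")

query = """
WITH ranked_t1 AS (
  SELECT t1.*,
         RANK() OVER (PARTITION BY job_id ORDER BY run_date ASC)  AS asc_rank,
         RANK() OVER (PARTITION BY job_id ORDER BY run_date DESC) AS desc_rank
  FROM t1
)
SELECT job_id,
       MIN(run_date) AS earliest_run_date,
       MAX(run_date) AS last_rundate,
       -- NULLs (the first and last run) are skipped by the aggregate
       GROUP_CONCAT(CASE WHEN asc_rank != 1 AND desc_rank != 1
                         THEN run_date END, ' ') AS remaining_run_dates
FROM ranked_t1
GROUP BY job_id
ORDER BY job_id
"""

result = {}
for job_id, earliest, latest, remaining in conn.execute(query):
    # remaining is NULL (None) when a job has no runs between first and last
    middle = sorted(remaining.split()) if remaining else []
    result[job_id] = (earliest, latest, middle)
print(result)
```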

Removing the remaining_run_dates column, you get a query as simple as
select
JOB_ID,
min(RUN_DATE) as earliest_run_date,
max(RUN_DATE) as last_rundate
from T1
group by JOB_ID

Finding the highest COUNT of a group per individual GROUP BY query in Hive

I have a table of customer transactions where an individual_id appears once for every different transaction.
There is a category column called Name_desc. I would like to group by individual and find the most common Name_desc category per individual.
Suppose the data looks like this:
Id Name_desc
---- ------
1 a
2 c
1 b
2 c
1 b
I want the output below:
Id Name_desc (most occurring category)
------ ------
1 b
2 c
I tried the query below and got this error:
Error while compiling statement: FAILED: ParseException line 4:19 cannot recognize input near 'select' 'max' '(' in expression specification
select name_desc, count(*) as count_e
from db.cust_scan
group by id, name_desc
having count(*)= ( select max(count_e),id
from
(
select id, name_desc, count(*) as count_e
from
db.cust_scan
where
base_div_nbr =1
and
country_code ='US'
and
retail_channel_code=1
and visit_date between '2019-01-01' and '2019-12-31'
GROUP by
individual_id, tt_id_desc
order by individual_id, count_e desc
) as t
group by individual_id )
I would appreciate any suggestions or help with this query. If there is a more efficient way of getting this done, let me know.

The following script was written and tested on MSSQL, but since Hive also supports ROW_NUMBER() and subqueries, it should give you the required output:
SELECT A.Id, A.Name_desc
FROM
(
SELECT Id,Name_desc,
row_number() over (partition by id order by COUNT(*) desc) AS RN
FROM your_table
GROUP BY Id,Name_desc
) A
WHERE RN = 1

You need a subquery in Hive:
SELECT s.Id, s.Name_desc
FROM
(
select s.*, row_number() over (partition by s.id order by s.cnt desc) rn
from
(
SELECT Id, Name_desc, COUNT(*) cnt
FROM your_table
GROUP BY Id, Name_desc
) s
) s
WHERE rn= 1;
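
To check the nested-subquery form against the sample data from the question, here is a small sketch that assumes Python's built-in sqlite3 module (SQLite 3.25 or later) rather than Hive. The table name cust_scan is borrowed from the question, and the filter columns are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust_scan (id INTEGER, name_desc TEXT);
INSERT INTO cust_scan VALUES (1, 'a'), (2, 'c'), (1, 'b'), (2, 'c'), (1, 'b');
""")

query = """
SELECT id, name_desc
FROM (SELECT s.*,
             -- number the categories of each id from most to least frequent
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY cnt DESC) AS rn
      FROM (SELECT id, name_desc, COUNT(*) AS cnt
            FROM cust_scan
            GROUP BY id, name_desc) s
     ) ranked
WHERE rn = 1
ORDER BY id
"""
most_common = conn.execute(query).fetchall()
print(most_common)  # [(1, 'b'), (2, 'c')]
```

If two categories tie for an id, ROW_NUMBER() picks one of them arbitrarily; RANK() would keep both.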

Filter out null values resulting from window function lag() in SQL query

Example query:
SELECT *,
lag(sum(sales), 1) OVER(PARTITION BY department
ORDER BY date ASC) AS end_date_sales
FROM revenue
GROUP BY department, date;
I want to show only the rows where end_date_sales is not NULL.
Is there a clause meant specifically for this case? WHERE and HAVING do not accept window functions.

One method uses a subquery:
SELECT r.*
FROM (SELECT department, date,
             LAG(SUM(sales), 1) OVER (PARTITION BY department ORDER BY date ASC) AS end_date_sales
      FROM revenue
      GROUP BY department, date
     ) r
WHERE end_date_sales IS NOT NULL;
That said, I don't think the query is correct as you have written it. I would assume that you want something like this:
SELECT r.*
FROM (SELECT r.*,
LEAD(end_date, 1) OVER (PARTITION BY ? ORDER BY date ASC) AS end_date
FROM revenue r
) r
WHERE end_date IS NOT NULL;
Where ? is a column such as the customer id.
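
The subquery approach can be verified on a toy table. This is a sketch under assumptions: Python's built-in sqlite3 module (SQLite 3.25 or later), invented sample rows, and a column named prev_sales for the lagged value. The daily totals are aggregated first, LAG() runs over those totals, and the outermost WHERE drops the rows that have no previous day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revenue (department TEXT, date TEXT, sales INTEGER);
INSERT INTO revenue VALUES
  ('toys', '2020-01-01', 5), ('toys', '2020-01-01', 7),
  ('toys', '2020-01-02', 4), ('books', '2020-01-01', 3);
""")

query = """
SELECT department, date, total_sales, prev_sales
FROM (SELECT department, date, total_sales,
             LAG(total_sales, 1) OVER (PARTITION BY department
                                       ORDER BY date) AS prev_sales
      FROM (SELECT department, date, SUM(sales) AS total_sales
            FROM revenue
            GROUP BY department, date) daily
     ) lagged
WHERE prev_sales IS NOT NULL   -- a window result can only be filtered one level up
ORDER BY department, date
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('toys', '2020-01-02', 4, 12)]
```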

Try this:
select * from (select distinct *,SUM(sales) OVER (PARTITION BY dept) from test)t
where t.date in(select max(date) from test group by dept)
order by date,dept;
A simpler way, without a subquery:
SELECT distinct dept,MAX(date) OVER (PARTITION BY dept),
SUM(sales) OVER (PARTITION BY dept)
FROM test;

Oracle optimise SQL query - Multiple Max()

I have a table from which I first need to select rows by max(event_date), then filter them by max(event_sequence), and then filter again by max(event_number).
I wrote the following query, which works but is slow.
Here is the query:
SELECT DISTINCT a.stuid,
a.prog,
a.stu_prog_id,
a.event_number,
a.event_date,
a.event_sequence,
a.prog_status
FROM table1 a
WHERE a.event_date=
(SELECT max(b.event_date)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id)
AND a.event_sequence=
(SELECT max(b.event_sequence)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id
AND a.event_date=b.event_date)
AND a.event_number=
(SELECT max(b.event_number)
FROM table1 b
WHERE a.stuid=b.stuid
AND a.prog=b.prog
AND a.stu_prog_id=b.stu_prog_id
AND a.event_date=b.event_date
AND a.event_sequence=b.event_sequence);
I was wondering whether there is a faster way to get the data.
I am using Oracle 12c.

You could try rephrasing your query using analytic functions:
SELECT
stuid,
prog,
stu_prog_id,
event_number,
event_date,
event_sequence,
prog_status
FROM
(
SELECT t.*,
RANK() OVER (PARTITION BY stuid, prog, stu_prog_id
ORDER BY event_date DESC) rnk1,
RANK() OVER (PARTITION BY stuid, prog, stu_prog_id, event_date
ORDER BY event_sequence DESC) rnk2,
RANK() OVER (PARTITION BY stuid, prog, stu_prog_id, event_date, event_sequence
ORDER BY event_number DESC) rnk3
FROM table1 t
) t
WHERE rnk1 = 1 AND rnk2 = 1 AND rnk3 = 1;
Note: I don't actually know if you really need all three subqueries there. Adding sample data to your question might help someone else improve upon the solution I have given above.

I think you want a simple row_number() or rank():
select t1.*
from (select t1.*,
rank() over (partition by stuid, prog, stu_prog_id
order by event_date desc, event_sequence desc, event_number desc
) as seqnum
from table1 t1
) t1
where seqnum = 1;
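
A quick way to convince yourself that one composite ORDER BY replaces the three nested MAX() lookups is to run it on a few rows. The sketch below assumes Python's built-in sqlite3 module (SQLite 3.25 or later) in place of Oracle. The column names follow the question, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (stuid INTEGER, prog TEXT, stu_prog_id INTEGER,
                     event_date TEXT, event_sequence INTEGER,
                     event_number INTEGER, prog_status TEXT);
INSERT INTO table1 VALUES
  (1, 'BSC', 100, '2019-01-01', 1, 1, 'applied'),
  (1, 'BSC', 100, '2019-03-01', 1, 1, 'admitted'),
  (1, 'BSC', 100, '2019-03-01', 2, 1, 'enrolled'),
  (1, 'BSC', 100, '2019-03-01', 2, 2, 'active'),
  (2, 'MSC', 200, '2019-02-01', 1, 1, 'applied');
""")

query = """
SELECT stuid, prog, stu_prog_id, event_date, event_sequence,
       event_number, prog_status
FROM (SELECT t1.*,
             -- latest date first, then highest sequence, then highest number
             RANK() OVER (PARTITION BY stuid, prog, stu_prog_id
                          ORDER BY event_date DESC,
                                   event_sequence DESC,
                                   event_number DESC) AS seqnum
      FROM table1 t1) ranked
WHERE seqnum = 1
ORDER BY stuid
"""
latest = conn.execute(query).fetchall()
for row in latest:
    print(row)
```

RANK() keeps every row that ties on all three columns, while ROW_NUMBER() would keep exactly one row per group.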

If several records can share the maximum EVENT_DATE, EVENT_SEQUENCE, and EVENT_NUMBER, use DENSE_RANK in Tim's solution, or use the following query, which computes each maximum and compares it with the original column:
SELECT DISTINCT
A.STUID,
A.PROG,
A.STU_PROG_ID,
A.EVENT_NUMBER,
A.EVENT_DATE,
A.EVENT_SEQUENCE,
A.PROG_STATUS
FROM
(
SELECT
A.STUID,
A.PROG,
A.STU_PROG_ID,
A.EVENT_NUMBER,
A.EVENT_DATE,
A.EVENT_SEQUENCE,
A.PROG_STATUS,
MAX(A.EVENT_DATE) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID
) AS MAX_EVENT_DATE,
MAX(A.EVENT_SEQUENCE) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID, A.EVENT_DATE
) AS MAX_EVENT_SEQUENCE,
MAX(A.EVENT_NUMBER) OVER(
PARTITION BY A.STUID, A.PROG, A.STU_PROG_ID, A.EVENT_DATE, A.EVENT_SEQUENCE
) AS MAX_EVENT_NUMBER
FROM
TABLE1 A
) A
WHERE
A.MAX_EVENT_DATE = A.EVENT_DATE
AND A.MAX_EVENT_SEQUENCE = A.EVENT_SEQUENCE
AND A.MAX_EVENT_NUMBER = A.EVENT_NUMBER;
Cheers!!

As an Oracle 12c user, you can use the
[ OFFSET offset { ROW | ROWS } ]
[ FETCH { FIRST | NEXT } [ { rowcount | percent PERCENT } ]
{ ROW | ROWS } { ONLY | WITH TIES } ]
syntax, as in:
SELECT DISTINCT a.stuid,
a.prog,
a.stu_prog_id,
a.event_number,
a.event_date,
a.event_sequence,
a.prog_status
FROM table1 a
ORDER BY event_date DESC, event_sequence DESC, event_number DESC
FETCH FIRST 1 ROW ONLY;
The WITH TIES clause is not needed in your case, since you are looking for DISTINCT rows, and OFFSET is not needed either, since the starting point is simply the first row of the descending order. The keywords ROW and ROWS are interchangeable, even for plural counts, such as FETCH FIRST 5 ROW ONLY (ROW without the S). Note that this returns a single row for the whole table, so it only matches the original query if the table holds a single stuid/prog/stu_prog_id combination.

Oracle SQL query result into a temporary table for use in a sub query

I want to create a temporary table, derived from a query, and use it in another subquery to simplify the row_number() and PARTITION BY condition. The query I have entered is below, but it returns the error "t.trlr_num: invalid identifier".
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_number, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_number,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;
In the query above, I define t as a temporary table to be used in the select statement below it, but it errors out with an invalid identifier.

It looks like a typo: replace trlr_number with trlr_num and it works.
with t as
(select distinct
ym.trlr_num,
ym.arrdte,
ri.invnum,
ri.supnum
from rcvinv ri, yms_ymr ym
where ym.trlr_cod='RCV'
and ri.trknum = ym.trlr_num
and ym.wh_id <=50
and ym.trlr_stat in ('C','CI','R','OR')
and ym.arrdte is not null
order by ym.arrdte desc
)
select trlr_num, invnum, supnum
from
(
select
t.trlr_num, t.invnum, t.supnum,
row_number() over (partition by t.trlr_num,t.invnum order by t.arrdte) as rn
from t
)
where rn = 1;

You could use multiple subqueries in the WITH clause as separate temporary tables. It would be nice and easy to understand:
WITH t AS
(SELECT DISTINCT ym.trlr_num trlr_num,
ym.arrdte arrdte,
ri.invnum invnum,
ri.supnum supnum
FROM rcvinv ri,
yms_ymr ym
WHERE ym.trlr_cod ='RCV'
AND ri.trknum = ym.trlr_num
AND ym.wh_id <=50
AND ym.trlr_stat IN ('C','CI','R','OR')
AND ym.arrdte IS NOT NULL
),
t1 AS (
SELECT t.trlr_num,
t.arrdte,
t.invnum,
t.supnum,
row_number() OVER (PARTITION BY t.trlr_num, t.invnum ORDER BY t.arrdte) rn
FROM t
)
SELECT trlr_num, arrdte, invnum, supnum
FROM t1
WHERE rn = 1;
