Hive query to select rows from latest partition - hive

I have a partitioned table in Hive. The schema and a sample are shown below:
item_id | price | brand | partition_id
AX_12 | 340.22 | Apple | 356
AZ_47 | 230.00 | Samsung | 357
AX_12 | 321.00 | Apple | 357
AQ_17 | 125.00 | Lenovo | 356
If an item is present in multiple partitions, I need to select the row from the latest partition.
The expected output for this example is:
item_id | price | brand | partition_id
AX_12 | 321.00 | Apple | 357
AZ_47 | 230.00 | Samsung | 357
AQ_17 | 125.00 | Lenovo | 356
There are 10 partitions in the table, and each partition has 10 million rows.

You can use window functions to filter on the top record per group:
select t.*
from (
    select t.*, row_number() over(partition by item_id order by partition_id desc) rn
    from mytable t
) t
where rn = 1
A typical alternative is to filter with a correlated subquery:
select t.*
from mytable t
where t.partition_id = (
    select max(t1.partition_id) from mytable t1 where t1.item_id = t.item_id
)
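To see the window-function approach against the question's sample rows, here is a minimal sketch. It assumes an in-memory SQLite database (3.25 or later, for window functions) as a stand-in for Hive, driven from Python's standard sqlite3 module; the alias `ranked` is just an illustrative name.

```python
import sqlite3

# In-memory SQLite stands in for Hive here (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (item_id TEXT, price REAL, brand TEXT, partition_id INTEGER)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("AX_12", 340.22, "Apple", 356),
        ("AZ_47", 230.00, "Samsung", 357),
        ("AX_12", 321.00, "Apple", 357),
        ("AQ_17", 125.00, "Lenovo", 356),
    ],
)

# Number each item's rows from the newest partition to the oldest, then keep row 1.
latest = conn.execute(
    """
    SELECT item_id, price, brand, partition_id
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY partition_id DESC) AS rn
        FROM mytable t
    ) ranked
    WHERE rn = 1
    ORDER BY item_id
    """
).fetchall()

for row in latest:
    print(row)
```

Each item appears once, with AX_12 taken from partition 357 rather than 356.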

Related

Query to pull data from column based off max value of second column

I have a table that has [Order], [Yield], [Scrap], [OpAc] columns. I need to pull the yield based on the max value of [OpAc].
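| Order | Yield | Scrap | OpAc |
|-------|-------|-------|------|
| 1234 | 140 | 0 | 10 |
| 1234 | 140 | 0 | 20 |
| 1234 | 130 | 10 | 30 |
| 1234 | 130 | 0 | 40 |
| 1234 | 125 | 5 | 50 |
| 1234 | 110 | 15 | 60 |
| 1235 | 140 | 0 | 10 |
| 1235 | 138 | 2 | 20 |
| 1235 | 138 | 0 | 30 |
| 1235 | 138 | 0 | 40 |
| 1235 | 138 | 0 | 50 |
| 1235 | 137 | 1 | 60 |
| 1235 | 137 | 0 | 70 |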
Expected Results
| Order | Yield |
|-------|-------|
| 1234 | 110 |
| 1235 | 137 |
The query that I have tried is
select [Order], [Yield], MAX([OpAc]) as Max_OpAc
from SCRAP
GROUP BY [Order], [Yield]
order by [order]
This produces
| Order | Yield | Max_OpAc |
|-------|-------|----------|
| 1234 | 110 | 60 |
| 1234 | 125 | 50 |
| 1234 | 130 | 40 |
| 1234 | 140 | 20 |
| 1235 | 137 | 70 |
| 1235 | 138 | 50 |
| 1235 | 140 | 10 |
I've tried setting up some CTE queries to break it down into separate functions but I keep getting caught at this step.
WITH CTE1 AS(
SELECT ROW_NUMBER() OVER(PARTITION BY [Order] ORDER BY [Order],[OpAc]) AS RN , *
FROM SAP_SCRAP
),
This proved to be redundant due to the fact that the [OpAc] field is sequential for each step.
Thanks in advance for any help
You almost got it!
WITH Orders_By_OpAc_Desc AS (
    SELECT
        [Order],
        [Yield],
        ROW_NUMBER() OVER (PARTITION BY [Order] ORDER BY OpAc DESC) AS [rn]
    FROM
        SCRAP
)
SELECT [Order],
[Yield]
FROM
Orders_By_OpAc_Desc
WHERE
rn = 1
The trick here is ROW_NUMBER() OVER (PARTITION BY [Order] ORDER BY OpAc DESC) AS [rn]. It might be confusing to read in SQL, but it's a bit clearer when expressed in words.
This statement takes each group of rows with the same Order value (PARTITION BY [Order]), orders each group by OpAc in descending order so that the higher OpAc values end up "on top" of the group (ORDER BY OpAc DESC), and numbers each row in the group "top" to "bottom", starting with 1 (ROW_NUMBER()).
This means that each row numbered 1 has the highest OpAc value for its Order.
Wrap that into a CTE and then select just the rows with this number (rn) set to 1. Voilà.
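The three steps described above can also be spelled out by hand. The following is only a plain-Python emulation of what the window function does, using the question's sample rows; it is not how SQL Server evaluates the query, and the names `numbered` and `top_yield` are made up for this sketch.

```python
from itertools import groupby
from operator import itemgetter

# (Order, Yield, Scrap, OpAc) rows from the question's sample data.
rows = [
    (1234, 140, 0, 10), (1234, 140, 0, 20), (1234, 130, 10, 30),
    (1234, 130, 0, 40), (1234, 125, 5, 50), (1234, 110, 15, 60),
    (1235, 140, 0, 10), (1235, 138, 2, 20), (1235, 138, 0, 30),
    (1235, 138, 0, 40), (1235, 138, 0, 50), (1235, 137, 1, 60),
    (1235, 137, 0, 70),
]

numbered = []
# PARTITION BY [Order]: group the rows that share the same Order value.
for _order, group in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
    # ORDER BY OpAc DESC: highest OpAc first within the group.
    ordered = sorted(group, key=itemgetter(3), reverse=True)
    # ROW_NUMBER(): number the rows 1, 2, 3, ... from the top of the group.
    for rn, row in enumerate(ordered, start=1):
        numbered.append(row + (rn,))

# WHERE rn = 1: keep only the top row of each group.
top_yield = [(order, yld) for order, yld, _scrap, _opac, rn in numbered if rn == 1]
print(top_yield)  # [(1234, 110), (1235, 137)]
```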
You definitely want the OVER (PARTITION BY), but MAX() is also an option here. With MAX(), you want something like:
SELECT
    *
FROM
    (
        SELECT
            t3.*
            , MAX(OpAc) OVER (PARTITION BY [Order]) max1
        FROM
            SCRAP t3
    ) a
WHERE
    a.max1 = a.OpAc
Depending on your SQL Server version and query needs, you may be able to use FIRST_VALUE() (available since SQL Server 2012) as well:
SELECT
DISTINCT
t3.[Order],
FIRST_VALUE(Yield) OVER(PARTITION BY [Order] ORDER BY OpAc DESC) Yield
FROM
SCRAP t3
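As a rough check that the MAX() OVER and FIRST_VALUE() variants agree, here is a sketch that runs both against the sample data. It assumes SQLite 3.25+ (via Python's sqlite3) in place of SQL Server, so identifiers are quoted with double quotes instead of brackets, and the table name `scrap_demo` is invented for the example.

```python
import sqlite3

# SQLite stands in for SQL Server; "Order" is quoted because ORDER is a keyword.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE scrap_demo ("Order" INTEGER, Yield INTEGER, Scrap INTEGER, OpAc INTEGER)'
)
conn.executemany(
    "INSERT INTO scrap_demo VALUES (?, ?, ?, ?)",
    [
        (1234, 140, 0, 10), (1234, 140, 0, 20), (1234, 130, 10, 30),
        (1234, 130, 0, 40), (1234, 125, 5, 50), (1234, 110, 15, 60),
        (1235, 140, 0, 10), (1235, 138, 2, 20), (1235, 138, 0, 30),
        (1235, 138, 0, 40), (1235, 138, 0, 50), (1235, 137, 1, 60),
        (1235, 137, 0, 70),
    ],
)

# Variant 1: attach the per-order maximum OpAc to every row, keep the matching row.
max_rows = conn.execute(
    """
    SELECT a."Order", a.Yield
    FROM (
        SELECT t3.*, MAX(OpAc) OVER (PARTITION BY "Order") AS max1
        FROM scrap_demo t3
    ) a
    WHERE a.max1 = a.OpAc
    ORDER BY a."Order"
    """
).fetchall()

# Variant 2: take the Yield of the first row when each order is sorted by OpAc DESC.
first_value_rows = conn.execute(
    """
    SELECT DISTINCT
        "Order",
        FIRST_VALUE(Yield) OVER (PARTITION BY "Order" ORDER BY OpAc DESC) AS Yield
    FROM scrap_demo
    ORDER BY "Order"
    """
).fetchall()

print(max_rows)
print(first_value_rows)
```

One difference worth keeping in mind: if two rows of an order shared the maximum OpAc, the MAX() variant would return both of them.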
You were so close. Your ROW_NUMBER function was just missing an ORDER BY OpAc DESC.
SQL Fiddle
MS SQL Server 2017 Schema Setup:
CREATE TABLE orders (
[Order] int null
, Yield int null
, Scrap int null
, OpAc int null
);
INSERT INTO orders ([Order], Yield, Scrap, OpAc)
VALUES (1234,140,0,10)
, (1234,140,0,20)
, (1234,130,10,30)
, (1234,130,0,40)
, (1234,125,5,50)
, (1234,110,15,60)
, (1235,140,0,10)
, (1235,138,2,20)
, (1235,138,0,30)
, (1235,138,0,40)
, (1235,138,0,50)
, (1235,137,1,60)
, (1235,137,0,70)
;
Query 1:
WITH CTE1 AS (
SELECT *
, ROW_NUMBER() OVER(PARTITION BY [Order] ORDER BY OpAc DESC) as row_num
FROM orders
)
SELECT *
FROM CTE1 as c
WHERE c.row_num = 1
Results:
| Order | Yield | Scrap | OpAc | row_num |
|-------|-------|-------|------|---------|
| 1234 | 110 | 15 | 60 | 1 |
| 1235 | 137 | 0 | 70 | 1 |

select top 5 max records in "High" column and 5 min records from "Low" Column in same query and from same table partitioned by stock name

We have 6 months of historic data and need to find the top 2 highs and the bottom 2 lows for each stock, across all stocks. Below is the sample data:
Stock High Low Date prevclose ....
------------------------------------
ABB 100 75 29/12/2019 90
ABB 83 50 30/12/2019 87
ABB 73 45 30/12/2019 87
infy 1000 675 29/12/2019 900
infy 830 650 30/12/2019 810
infy 730 645 30/12/2019 788
I tried the following query, but I am not getting the expected results. I need the top 2 high rows and the bottom 3 low rows in one result set.
select * into SRTrend from (
--- Resistance
select * from (Select top (5) with ties 'H' as 'Resistance', RowN=Row_Number() over(partition by name order by High desc),* from Historic
order by Row_Number() over(partition by name order by High desc))B
Union all
--Support
select * from (Select top (5) with ties 'L' as 'Support', RowN=Row_Number() over(partition by name order by Low asc),* from Historic
--where name='ABB'
order by Row_Number() over(partition by name order by Low asc))C
)D
PS: The hurdle I faced is that when I tried to export the data to another table, I got very messed-up results: instead of the top 2 max (highs) and top 3 min (lows), I am getting single rows.
You can use rank() as follows:
select *
from (
select
t.*,
rank() over(partition by stock order by high desc) rn_high,
rank() over(partition by stock order by low asc) rn_low
from mytable t
) t
where rn_high <= 2 or rn_low <= 3
The inner query ranks records twice, by descending high and ascending low within groups of stocks. Then the outer query filters on top 2 and bottom 3 per stock (ties included).
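Here is a small sketch of the double-ranking idea on the sample data, assuming SQLite 3.25+ through Python's sqlite3 rather than SQL Server. Because the sample has only three rows per stock, a filter of top 2 highs or bottom 3 lows would keep every row, so the thresholds are passed as parameters and lowered to 1 and 1 purely to make the filtering visible. The table name `historic` follows the question; `top_high` and `top_low` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns needed for ranking are kept; the date and prevclose are omitted.
conn.execute("CREATE TABLE historic (stock TEXT, high INTEGER, low INTEGER)")
conn.executemany(
    "INSERT INTO historic VALUES (?, ?, ?)",
    [
        ("ABB", 100, 75), ("ABB", 83, 50), ("ABB", 73, 45),
        ("infy", 1000, 675), ("infy", 830, 650), ("infy", 730, 645),
    ],
)

# Thresholds lowered to 1 for this tiny sample (the answer uses 2 and 3).
top_high, top_low = 1, 1

picked = conn.execute(
    """
    SELECT stock, high, low, rn_high, rn_low
    FROM (
        SELECT t.*,
               RANK() OVER (PARTITION BY stock ORDER BY high DESC) AS rn_high,
               RANK() OVER (PARTITION BY stock ORDER BY low ASC) AS rn_low
        FROM historic t
    ) ranked
    WHERE rn_high <= ? OR rn_low <= ?
    ORDER BY stock, high DESC
    """,
    (top_high, top_low),
).fetchall()

for row in picked:
    print(row)
```

The rn_high and rn_low columns in the output show why each row was kept: a row qualifies if either rank is within its threshold.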

How can I select records based on this specific criteria?

I am using SQL Server 2014 and SSMS for executing my T-SQL queries.
I have the following SQL table (extract) called Table1:
BkgID ProfileID ArrivalDate
872 50 2018-01-03
876 50 2018-01-03
911 64 2018-02-15
924 64 2018-04-15
950 72 2018-05-04
I need my T-SQL query to give me the following output:
BkgID ProfileID ArrivalDate
872 50 2018-01-03
911 64 2018-02-15
924 64 2018-04-15
950 72 2018-05-04
The logic is that the query must list each ProfileID only once per ArrivalDate. Which BkgID it lists in that case is not important.
How do I write such a query?
You can use the ROW_NUMBER() window function to achieve this. The ORDER BY determines which record comes first within each partition.
;WITH RankingByProfile AS
(
SELECT
T.BkgID,
T.ProfileID,
T.ArrivalDate,
Ranking = ROW_NUMBER() OVER (PARTITION BY T.ProfileID, T.ArrivalDate ORDER BY T.BkgID ASC)
FROM
Table1 AS T
)
SELECT
R.BkgID,
R.ProfileID,
R.ArrivalDate
FROM
RankingByProfile AS R
WHERE
R.Ranking = 1
You can also use GROUP BY and retrieve the MIN(BkgID), but you won't be able to access other columns without aggregate functions.
SELECT
MinBkgID = MIN(T.BkgID),
T.ProfileID,
T.ArrivalDate
FROM
Table1 AS T
GROUP BY
T.ProfileID,
T.ArrivalDate
select BkgID, ProfileID, ArrivalDate
from (
    select BkgID, ProfileID, ArrivalDate,
           ROW_NUMBER() OVER(PARTITION BY ProfileID, ArrivalDate Order By BkgID) RowIdx
    from yourTable) t
where RowIdx = 1
You can also use a correlated subquery:
select t.*
from Table1 t
where BkgID = (select top 1 t1.BkgID
               from Table1 t1
               where t1.ProfileID = t.ProfileID and t1.ArrivalDate = t.ArrivalDate
               order by t1.BkgID asc
              );
This assumes ArrivalDate is stored in a consistent date format; otherwise, use the cast() function.
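To compare the ROW_NUMBER() and GROUP BY approaches on the question's rows, here is a minimal sketch. It assumes SQLite 3.25+ via Python's sqlite3 as a stand-in for SQL Server 2014, so the T-SQL `alias = expression` form is written with AS instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (BkgID INTEGER, ProfileID INTEGER, ArrivalDate TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        (872, 50, "2018-01-03"),
        (876, 50, "2018-01-03"),
        (911, 64, "2018-02-15"),
        (924, 64, "2018-04-15"),
        (950, 72, "2018-05-04"),
    ],
)

# Approach 1: number the rows within each (ProfileID, ArrivalDate) pair, keep row 1.
by_row_number = conn.execute(
    """
    SELECT BkgID, ProfileID, ArrivalDate
    FROM (
        SELECT T.*,
               ROW_NUMBER() OVER (PARTITION BY ProfileID, ArrivalDate
                                  ORDER BY BkgID ASC) AS Ranking
        FROM Table1 AS T
    ) R
    WHERE Ranking = 1
    ORDER BY BkgID
    """
).fetchall()

# Approach 2: collapse each pair with GROUP BY and take the smallest BkgID.
by_group = conn.execute(
    """
    SELECT MIN(BkgID) AS MinBkgID, ProfileID, ArrivalDate
    FROM Table1
    GROUP BY ProfileID, ArrivalDate
    ORDER BY MinBkgID
    """
).fetchall()

print(by_row_number)
print(by_group)
```

Both return the same four rows here; BkgID 876 is dropped because it shares ProfileID 50 and arrival date 2018-01-03 with BkgID 872.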

SQL: Take maximum value, but if a field is missing for a particular ID, ignore all values

This is somewhat difficult to explain...(this is using SQL Assistant for Teradata, which I'm not overly familiar with).
ID creation_date completion_date Difference
123 5/9/2016 5/16/2016 7
123 5/14/2016 5/16/2016 2
456 4/26/2016 4/30/2016 4
456 (null) 4/30/2016 (null)
789 3/25/2016 3/31/2016 6
789 3/1/2016 3/31/2016 30
An ID may have more than one creation_date, but it will always have the same completion_date. If the creation_date is populated for all records for an ID, I want to return the record with the most recent creation_date. However, if ANY creation_date for a given ID is missing, I want to ignore all records associated with this ID.
Given the data above, I would want to return:
ID creation_date completion_date Difference
123 5/14/2016 5/16/2016 2
789 3/25/2016 3/31/2016 6
No records are returned for 456 because the second record has a missing creation_date. The record with the most recent creation_date is returned for 123 and 789.
Any help would be greatly appreciated. Thanks!
Depending on your database, here's one option using row_number to get the max date per group. You can then filter those results with not exists to check against null values:
select *
from (
select *,
row_number() over (partition by id order by creation_date desc) rn
from yourtable
) t
where rn = 1 and not exists (
select 1
from yourtable t2
    where t2.creation_date is null and t.id = t2.id
)
row_number is a window function that is supported in many databases. MySQL versions before 8.0 don't support it, but there you can achieve the same result using user-defined variables.
Here is a more generic version using conditional aggregation:
select t.*
from yourtable t
join (select id, max(creation_date) max_creation_date
from yourtable
group by id
having count(case when creation_date is null then 1 end) = 0
) t2 on t.id = t2.id and t.creation_date = t2.max_creation_date
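The conditional-aggregation version can be tried on the sample rows with the sketch below. It assumes SQLite (via Python's sqlite3) instead of Teradata, and it stores the dates as ISO-8601 text so that MAX() compares them chronologically; that date format is an assumption of this example, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable "
    "(id INTEGER, creation_date TEXT, completion_date TEXT, difference INTEGER)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?)",
    [
        (123, "2016-05-09", "2016-05-16", 7),
        (123, "2016-05-14", "2016-05-16", 2),
        (456, "2016-04-26", "2016-04-30", 4),
        (456, None, "2016-04-30", None),  # missing creation_date disqualifies id 456
        (789, "2016-03-25", "2016-03-31", 6),
        (789, "2016-03-01", "2016-03-31", 30),
    ],
)

# The HAVING clause counts NULL creation dates per id and keeps only ids with none.
result = conn.execute(
    """
    SELECT t.*
    FROM yourtable t
    JOIN (
        SELECT id, MAX(creation_date) AS max_creation_date
        FROM yourtable
        GROUP BY id
        HAVING COUNT(CASE WHEN creation_date IS NULL THEN 1 END) = 0
    ) t2 ON t.id = t2.id AND t.creation_date = t2.max_creation_date
    ORDER BY t.id
    """
).fetchall()

for row in result:
    print(row)
```

The CASE expression yields NULL for populated dates, and COUNT ignores NULLs, so the count is exactly the number of missing creation dates for that id.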

SQL Query for avoiding any repetition for a specific column terms

I am looking to design a query in which I need DISTINCT terms in a column without repetition. I am using the SQL Server 2008 R2 edition.
Here is my sample table:
id bank_code bank_name interest_rate
----------------------------------------------------------
1 123 abc 3.5
2 456 xyz 3.7
3 123 abc 3.4
4 789 pqr 3.3
5 123 abc 3.6
6 456 xyz 3.1
What I want is to sort the table in descending order on the 'interest_rate' column, but without any repetition of the values in 'bank_code'.
Here is what I want:
id bank_code bank_name interest_rate
----------------------------------------------------------
2 456 xyz 3.7
5 123 abc 3.6
4 789 pqr 3.3
I have been trying the DISTINCT operator, but it selects the unique combinations of all the columns rather than removing repetition in a single column.
Here is what I am doing, which clearly does not get me what I want:
SELECT DISTINCT TOP 5 [ID], [BANK_CODE]
,[BANK_NAME]
,[INTEREST_RATE]
FROM [SAMPLE]
ORDER BY [INTEREST_RATE] DESC
Is there a way to achieve this?
Any help is appreciated.
;WITH x AS
(
SELECT id,bank_code,bank_name,interest_rate,
rn = ROW_NUMBER() OVER (PARTITION BY bank_code ORDER BY interest_rate DESC)
FROM dbo.[SAMPLE]
)
SELECT id,bank_code,bank_name,interest_rate
FROM x WHERE rn = 1
ORDER BY interest_rate DESC;
Try using analytical functions:
;WITH CTE AS
(
    SELECT *, ROW_NUMBER() OVER(PARTITION BY bank_code ORDER BY interest_rate DESC) Corr
    FROM [Sample]
)
SELECT id, bank_code, bank_name, interest_rate
FROM CTE
WHERE Corr = 1
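Not sure about the [] syntax, but you probably need something like this. Note that min([ID]) and max([INTEREST_RATE]) are computed independently, so the ID returned is not necessarily the one from the row holding the maximum rate: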
SELECT min([ID]), [BANK_CODE], [BANK_NAME], max([INTEREST_RATE])
FROM [SAMPLE]
GROUP BY [BANK_CODE], [BANK_NAME]
ORDER BY 4 DESC
How about something like this? It is simple, but it will return duplicates if a bank has several rows that share its maximum interest rate.
select ID, #sample.Bank_code, bank_name, #sample.interest_Rate
from #sample
join
(
SELECT [BANK_CODE], MAX(interest_rate) as interest_Rate
FROM #sample
GROUP BY bank_code
) as groupingtable
on groupingtable.bank_code = #sample.bank_code
and groupingtable.interest_Rate = #sample.interest_rate
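The duplicate caveat of the join-on-MAX approach can be shown with a short sketch. It assumes SQLite through Python's sqlite3 in place of SQL Server 2008 R2 and a regular table named `sample` instead of the temp table `#sample`. The row with id 7 is a hypothetical addition, not part of the question's data, inserted only to create a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (id INTEGER, bank_code INTEGER, bank_name TEXT, interest_rate REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    [
        (1, 123, "abc", 3.5), (2, 456, "xyz", 3.7), (3, 123, "abc", 3.4),
        (4, 789, "pqr", 3.3), (5, 123, "abc", 3.6), (6, 456, "xyz", 3.1),
    ],
)

# Join each row to its bank's maximum rate; only rows holding that maximum survive.
JOIN_ON_MAX = """
    SELECT s.id, s.bank_code, s.bank_name, s.interest_rate
    FROM sample s
    JOIN (
        SELECT bank_code, MAX(interest_rate) AS interest_rate
        FROM sample
        GROUP BY bank_code
    ) g ON g.bank_code = s.bank_code AND g.interest_rate = s.interest_rate
    ORDER BY s.interest_rate DESC, s.id
"""

best = conn.execute(JOIN_ON_MAX).fetchall()
print(best)

# Hypothetical extra row (not in the question) that ties bank 789's best rate.
conn.execute("INSERT INTO sample VALUES (7, 789, 'pqr', 3.3)")
with_tie = conn.execute(JOIN_ON_MAX).fetchall()
print(with_tie)  # bank 789 now appears twice
```

The ROW_NUMBER() answers avoid this because exactly one row per bank_code is numbered 1, although which of the tied rows that is depends on the ORDER BY.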