Hello I am very new to Hive and was learning WINDOWING functionality of Hive. I came across a problem.
I was trying to find the lowest closing price for each stock ticker (Each ticker have 22 records and I wanted to find the lowest)
I wrote a Query:
SELECT ticker, close FROM
(SELECT ticker, close, RANK() OVER (PARTITION BY ticker) AS rank FROM stocks) AS p
WHERE rank = 1 LIMIT 10;
I got the result
ticker close
A 28.15
A 27.93
A 28.82
A 27.84
A 28.29
A 28.46
A 27.58
A 28.73
A 29.82
A 29.3
But I wanted one for each ticker.
Then I ran the same query but added the ORDER BY clause
SELECT ticker, close FROM
(SELECT ticker, close, RANK() OVER (PARTITION BY ticker ORDER BY close) AS rank FROM stocks) AS p
WHERE rank = 1 LIMIT 20;
And I got the ideal result.
ticker close
A 27.16
AA 10.57
AAPL 247.64
ABC 28.71
ABT 48.68
ACE 52.43
ADBE 27.36
ADI 28.07
ADM 27.0
ADP 39.4
My Question here is how is this order by grouping tickers?
There is no order if you don't specify the "order by" clause, for this reason all the elements in the group have the same RANK, if you want different ranks you have to use DENSE_RANK function.
Related
I am using Spark SQL and I have some outliers that have incredibly high transaction counts in comparison to the rest. I only want the lowest 98% of the values and to cut off the top 2% outliers. How do I go about doing that? The TOP function is not being recognized in Spark SQL. This is a sample of the table but it is a very large table.
Date
ID
Name
Transactions
02/02/2022
ABC123
Bob
107
01/05/2022
ACD232
Emma
34
12/03/2022
HH254
Kirsten
23
12/11/2022
HH254
Kirsten
47
You need a couple of window functions to compute the relative rank; the row_number() will give absolute rank, but you won't know where to draw the cutoff line without a full record count to compute the percentile.
In an inner query,
Select t.*,
row_number() Over (Order By Transactions, Date desc) * 100
/ count(*) Over (Rows unbounded preceeding to rows unbounded following) as percentile
From myTable t
Then in an outer query just
Select * from (*inner query*)
Where percentile <= 98
You might be able to omit the Over clause on the Count(*), I don't know.
You can calculate the 98th percentile value for the Transactions column and then filter the rows where the value of Transactions is below the 98th percentile. You can use the following query to accomplish that:
WITH base_data AS (
SELECT Date, ID, Name, Transactions
FROM your_table
),
percentiles AS (
SELECT percentiles_approx(Transactions, array(0.98)) AS p
FROM base_data
)
SELECT Date, ID, Name, Transactions
FROM base_data
JOIN percentiles
ON Transactions <= p
The percentiles_approx method is used on the baseData DataFrame to obtain the 98th percentile value
I have a very simplified table / view like below to illustrate the issue:
The stock column represents the current stock quantity of the style at the retailer. The reason the stock column is included is to avoid joins for reporting. (the table is created for reporting only)
I want to query the table to get what is currently in stock, grouped by stylenumber (across retailers). Like:
select stylenumber,sum(sold) as sold,Max(stock) as stockcount
from MGTest
I Expect to get Stylenumber, Total Sold, Most Recent Stock Total:
A, 6, 15
B, 1, 6
But using ...Max(Stock) I get 10, and with (Sum) I get 25....
I have tried with over(partition.....) also without any luck...
How do I solve this?
I would answer this using window functions:
SELECT Stylenumber, Date, TotalStock
FROM (SELECT M.Stylenumber, M.Date, SUM(M.Stock) as TotalStock,
ROW_NUMBER() OVER (PARTITION BY M.Stylenumber ORDER BY M.Date DESC) as seqnum
FROM MGTest M
GROUP BY M.Stylenumber, M.Date
) m
WHERE seqnum = 1;
The query is a bit tricky since you want a cumulative total of the Sold column, but only the total of the Stock column for the most recent date. I didn't actually try running this, but something like the query below should work. However, because of the shape of your schema this isn't the most performant query in the world since it is scanning your table multiple times to join all of the data together:
SELECT MDate.Stylenumber, MDate.TotalSold, MStock.TotalStock
FROM (SELECT M.Stylenumber, MAX(M.Date) MostRecentDate, SUM(M.Sold) TotalSold
FROM [MGTest] M
GROUP BY M.Stylenumber) MDate
INNER JOIN (SELECT M.Stylenumber, M.Date, SUM(M.Stock) TotalStock
FROM [MGTest] M
GROUP BY M.Stylenumber, M.Date) MStock ON MDate.Stylenumber = MStock.Stylenumber AND MDate.MostRecentDate = MStock.Date
You can do something like this
SELECT B.Stylenumber,SUM(B.Sold),SUM(B.Stock) FROM
(SELECT Stylenumber AS 'Stylenumber',SUM(Sold) AS 'Sold',MAX(Stock) AS 'Stock'
FROM MGTest A
GROUP BY RetailerId,Stylenumber) B
GROUP BY B.Stylenumber
if you don't want to use joins
My solution, like that of Gordon Linoff, will use the window functions. But in my case, everything will turn around the RANK window function.
SELECT stylenumber, sold, SUM(stock) totalstock
FROM (
SELECT
stylenumber,
SUM(sold) OVER(PARTITION BY stylenumber) sold,
RANK() OVER(PARTITION BY stylenumber ORDER BY [Date] DESC) r,
stock
FROM MGTest
) T
WHERE r = 1
GROUP BY stylenumber, sold
I'm trying to find out how to find max of transaction_date per EAN_code
My table looks like:
Transaction_Date EAN_Code
09/04/2018 3029440000286
09/04/2018 3029440000286
08/04/2018 5000128221139
14/04/2018 5000128221139
08/04/2018 5000128221139
10/04/2018 5000128221108
Essentially what we need to do is for the list of items we want to pull out the latest date that it was sold across, e.g. one row per product, last date sold.
Both columns have non distinct values.
Simply do a GROUP BY. Use MAX() to get the latest date for each product.
select EAN_Code, max(Transaction_Date)
from tablename
group by EAN_Code
You could use ROW_NUMBER/RANK:
SELECT *
FROM (SELECT *,ROW_NUMBER() OVER(PARTITION BY Ean_Code
ORDER BY Transaction_Date DESC) AS rn
FROM table_name) s
WHERE s.rn = 1;
I am trying to find out if there is any way to aggregate a sales for each product. I realise I can achieve it either by using group-by clause or by writing a procedure.
example:
Table name: Details
Sales Product
10 a
20 a
4 b
12 b
3 b
5 c
Is there a way possible to perform the following query with out using group by query
select
product,
sum(sales)
from
Details
group by
product
having
sum(sales) > 20
I realize it is possible using Procedure, could it be done in any other way?
You could do
SELECT product,
(SELECT SUM(sales) FROM details x where x.product = a.product) sales
from Details a;
(and wrap it into another select to simulate the HAVING).
It's possible to use analytic functions to do the sum calculation, and then wrap that with another query to do your filtering.
See and play with the example here.
select
running_sum,
OwnerUserId
from (
select
id,
score,
OwnerUserId,
sum(score) over (partition by OwnerUserId order by Id) running_sum,
last_value(id) over (partition by OwnerUserId order by OwnerUserId) last_id
from
Posts
where
OwnerUserId in (2934433, 10583)
) inner_q
where inner_q.id = inner_q.last_id
--and running_sum > 20;
We keep a running sum going on the partition of the owner (product), and we tally up the last id for the same window, which is the ID we'll use to get the total sum. Wrap it all up with another query to make sure you get the "last id", take the sum, and then do any filtering you want on the result.
This is an extremely round-about way to avoid using GROUP BY though.
If you don't want nested select statements (run slower), use CASE:
select
sum(case
when c.qty > 20
then c.qty
else 0
end) as mySum
from Sales.CustOrders c
I have a table of data and want to retrieve the penultimate record.
How is this done?
TABLE: results
-------
30
31
35
I need to get 31.
I've been trying with rownum but it doesn't seem to work.
Assuming you want the second highest number and there are no ties
SELECT results
FROM (SELECT results,
rank() over (order by results desc) rnk
FROM your_table_name)
WHERE rnk = 2
Depending on how you want to handle ties, you may want either the rank, dense_rank, or row_number analytic function. If there are two 35's for example, would you want 35 returned? Or 31? If there are two 31's, would you want a single row returned? Or would you want both 31's returned.
This can use for n th rank ##
select Total_amount from (select Total_amount, rank() over (order by Total_amount desc) Rank from tbl_booking)tbl_booking where Rank=3