Calculating Top N items per dimension

I have the following query that shows total sales for each product on an hourly basis. However, the dataset is very large and I don't want to see all products, so I would like to see only the top 1000 product_id values by sales for each combination of date, hour, and category_id.
SELECT date,
hour,
category_id,
product_id,
sum(sales) AS sales
FROM a
LEFT JOIN b
ON a.product_id = b.product_id
WHERE date(date) >= date('2021-01-01')
GROUP BY 1, 2, 3, 4
How can I do this in Athena?
Thanks in advance.

You can apply the rank() function to your result and then keep only the rows whose rank is within the limit:
SELECT date,
hour,
category_id,
product_id,
sales
FROM
(
SELECT *,
rank() OVER (PARTITION BY date, hour, category_id
ORDER BY sales DESC) AS rnk
FROM (your query)
)
WHERE rnk <= 1000
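A minimal sketch of the same rank-then-filter pattern, run against an in-memory SQLite database from Python so it can be tried without Athena. The table name agg, the sample rows, and TOP_N = 2 are assumptions made for illustration, and SQLite 3.25 or later is assumed for window function support.

```python
import sqlite3

# Hypothetical pre-aggregated rows: (date, hour, category_id, product_id, sales).
rows = [
    ("2021-01-01", 0, 1, "p1", 50),
    ("2021-01-01", 0, 1, "p2", 30),
    ("2021-01-01", 0, 1, "p3", 30),
    ("2021-01-01", 0, 1, "p4", 10),
    ("2021-01-01", 1, 1, "p1", 5),
    ("2021-01-01", 1, 1, "p2", 7),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agg (date TEXT, hour INT, category_id INT, product_id TEXT, sales INT)"
)
conn.executemany("INSERT INTO agg VALUES (?, ?, ?, ?, ?)", rows)

TOP_N = 2  # stands in for the 1000 used in the answer

# Rank products inside each (date, hour, category_id) group, then filter.
top = conn.execute(
    """
    SELECT date, hour, category_id, product_id, sales
    FROM (
        SELECT *,
               rank() OVER (PARTITION BY date, hour, category_id
                            ORDER BY sales DESC) AS rnk
        FROM agg
    )
    WHERE rnk <= ?
    ORDER BY date, hour, category_id, sales DESC, product_id
    """,
    (TOP_N,),
).fetchall()

for row in top:
    print(row)
```

Because rank() gives tied rows the same rank, a group can return more than TOP_N rows (p2 and p3 tie here). If exactly N rows per group are required, row_number() can be used instead, at the cost of breaking ties arbitrarily unless the ORDER BY includes a tiebreaker.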

Related

How to get min value at max date in sql?

I have a table with snapshot data. It has productid, date, and quantity columns. I need to find the min quantity on the max date. Say product X had its last snapshot on date Y, but it has two snapshots on Y, with quantity values 9 and 8. I need to get:
product_id | date | quantity
X | Y | 8
So far I came up with this.
select
productid
, max(snapshot_date) max_date
, min(quantity) min_quantity
from snapshot_table
group by 1
It seems to work, but I don't know why. Why doesn't this return the min value for each date?

I would use RANK here along with a scalar subquery:
WITH cte AS (
SELECT *, RANK() OVER (ORDER BY quantity) rnk
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
)
SELECT productid, snapshot_date, quantity
FROM cte
WHERE rnk = 1;
Note that this solution caters to the possibility that two or more records are tied for the lowest quantity among the most recent records.
Edit: We could simplify by doing away with the CTE and instead using the QUALIFY clause for the restriction on the RANK:
SELECT productid, snapshot_date, quantity
FROM snapshot_table
WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM snapshot_table)
QUALIFY RANK() OVER (ORDER BY quantity) = 1;

Also consider the approach below:
select distinct product_id,
max(snapshot_date) over product as max_date,
first_value(quantity) over(product order by snapshot_date desc, quantity) as min_quantity
from your_table
window product as (partition by product_id)

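Use row_number(), ordering by date descending and then by quantity, so that the first row per product is the lowest quantity on the latest date:
with cte as (
select *,
row_number() over(partition by product_id order by date desc, quantity) rn
from table_name
)
select * from cte where rn = 1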
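To illustrate why the GROUP BY query in the question is not equivalent to the ranking approaches, here is a small sketch using SQLite from Python. The sample rows are invented, and SQLite 3.25 or later is assumed. With plain GROUP BY, max(snapshot_date) and min(quantity) are computed independently over all of a product's rows, so the minimum can come from an older date.

```python
import sqlite3

# Hypothetical snapshots: product X has quantity 3 on an older date,
# and quantities 9 and 8 on its latest date.
rows = [
    ("X", "2021-01-01", 3),
    ("X", "2021-01-02", 9),
    ("X", "2021-01-02", 8),
    ("Z", "2021-01-01", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snapshot_table (productid TEXT, snapshot_date TEXT, quantity INT)"
)
conn.executemany("INSERT INTO snapshot_table VALUES (?, ?, ?)", rows)

# The question's query: two independent aggregates per product.
naive = conn.execute(
    """
    SELECT productid, max(snapshot_date), min(quantity)
    FROM snapshot_table
    GROUP BY productid
    ORDER BY productid
    """
).fetchall()

# Ranking approach: latest date first, then lowest quantity on that date.
ranked = conn.execute(
    """
    SELECT productid, snapshot_date, quantity
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY productid
                                  ORDER BY snapshot_date DESC, quantity) AS rn
        FROM snapshot_table
    )
    WHERE rn = 1
    ORDER BY productid
    """
).fetchall()

print(naive)   # min(quantity) for X is taken across all dates
print(ranked)  # quantity for X is taken from the latest date only
```

The two queries agree only when the overall minimum quantity happens to fall on the latest date, which is presumably why the original query appeared to work on the asker's data.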

Calculate average days between orders The last three records tsql

I am trying to take an average per customer, but the query is not grouping by customer.
I would like to calculate the average days between order dates from a table called invoice. For each BusinessPartnerID, I want the average days between orders over the last three orders only.
I got the average over all orders for each user, but I need it for the last three orders only.
My current query is below:
;WITH temp (avg,invoiceid,carname,carid,fullname,mobail)
AS
(
SELECT AvgLag = AVG(Lag) , Lagged.idinvoice,
Lagged.carname ,
Lagged.carid ,Lagged.fullname,Lagged.mobail
FROM
(
SELECT
(car2.Name) as carname ,
(car2.id) as carid ,( busin.Name) as fullname, ( busin.Mobile) as mobail , INV.Id as idinvoice , Lag = CONVERT(int, DATEDIFF(DAY, LAG(Date,1)
OVER (PARTITION BY car2.Id ORDER BY Date ), Date))
FROM [dbo].[Invoice] AS INV
JOIN [dbo].[InvoiceItem] AS INITEM on INV.Id=INITEM.Invoiceid
JOIN [dbo].[BusinessPartner] as busin on busin.Id=INV.BuyerId and Type=5
JOIN [dbo].[Product] as pt on pt.Id=INITEM.ProductId and INITEM.ProductId is not null and pt.ProductTypeId=3
JOIN [dbo].[Car] as car2 on car2.id=INv.BusinessPartnerCarId
) AS Lagged
GROUP BY
Lagged.carname,
Lagged.carid,Lagged.fullname,Lagged.mobail, Lagged.idinvoice
-- order by Lagged.fullname
)
SELECT * FROM temp where avg is not null order by avg

I don't really see how your query relates to your question. Starting from a table called invoice that has columns businesspartnerid and date, here is how you would take the average of the day difference between the last 3 invoices of each business partner:
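select businesspartnerid,
avg(1.0 * diff_day) avg_diff_day
from (
-- lag() is computed over the 3 most recent rows only; SQL Server does not
-- allow a window function nested directly inside an aggregate
select i.*,
datediff(day, lag(date) over(partition by businesspartnerid order by date), date) diff_day
from (
select i.*,
row_number() over(partition by businesspartnerid order by date desc) rn
from invoice i
) i
where rn <= 3
) i
group by businesspartnerid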
Note that 3 rows give you only 2 intervals, and those are what gets averaged.
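The following is a rough sketch of the same idea on SQLite, run from Python. It is not T-SQL: julianday() subtraction stands in for DATEDIFF, and the invoice rows are made up. It keeps the three most recent invoices per partner, computes the gaps between them with lag(), and averages the gaps. SQLite 3.25 or later is assumed.

```python
import sqlite3

# Hypothetical invoices: partner A has four orders, partner B has two.
rows = [
    ("A", "2021-01-01"),
    ("A", "2021-01-11"),
    ("A", "2021-01-15"),
    ("A", "2021-01-21"),
    ("B", "2021-02-01"),
    ("B", "2021-02-03"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (businesspartnerid TEXT, date TEXT)")
conn.executemany("INSERT INTO invoice VALUES (?, ?)", rows)

avg_gaps = conn.execute(
    """
    SELECT businesspartnerid, avg(diff_day) AS avg_diff_day
    FROM (
        -- gaps are computed only among the rows that survived rn <= 3
        SELECT businesspartnerid,
               julianday(date) - julianday(
                   lag(date) OVER (PARTITION BY businesspartnerid ORDER BY date)
               ) AS diff_day
        FROM (
            SELECT *,
                   row_number() OVER (PARTITION BY businesspartnerid
                                      ORDER BY date DESC) AS rn
            FROM invoice
        )
        WHERE rn <= 3
    )
    GROUP BY businesspartnerid
    ORDER BY businesspartnerid
    """
).fetchall()

print(avg_gaps)
```

For partner A the last three invoices are 11, 15, and 21 January, giving gaps of 4 and 6 days; the 10-day gap from the oldest invoice is excluded. The first row of each partition has no previous row, so its gap is NULL and avg() ignores it.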

Difference between multiple dates

I am working in a database with multiple orders from multiple suppliers. I would like to know the difference in days between order 1 and order 2, order 2 and order 3, order 3 and order 4, and so on, for each supplier separately. I need this to calculate the standard deviation of the days between orders for each supplier.
Hopefully someone can help..

What you describe is lag() with aggregation:
select supplier,
stddev(orderdate - prev_orderdate) as std_orderdate
from (select t.*,
lag(orderdate) over (partition by supplier order by orderdate) as prev_orderdate
from t
) t
group by supplier;

You would typically use the window function lag() and date arithmetic.
Assuming the following data structure for table orders:
order_id int primary key
supplier_id int
order_date date
You would go:
select
o.*,
order_date
- lag(order_date) over(partition by supplier_id order by order_date) date_diff
from orders o
Which gives you, for each order, the difference in days from the previous order of the same supplier (or null if this is the first order of the supplier).
You can then compute the standard deviation with aggregation:
select supplier_id, stddev(date_diff)
from (
select
o.*,
order_date
- lag(order_date) over(partition by supplier_id order by order_date) date_diff
from orders o
) x
group by supplier_id
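As a hedged illustration of the two steps (gaps via lag(), then a standard deviation per supplier), here is a sketch using SQLite from Python. SQLite has no built-in stddev(), so the aggregation step is done with Python's statistics.stdev, which is a sample standard deviation. The orders rows are invented, and SQLite 3.25 or later is assumed.

```python
import sqlite3
import statistics
from collections import defaultdict

# Hypothetical orders: (order_id, supplier_id, order_date).
rows = [
    (1, 1, "2021-01-01"),
    (2, 1, "2021-01-03"),
    (3, 1, "2021-01-07"),
    (4, 1, "2021-01-09"),
    (5, 2, "2021-01-01"),
    (6, 2, "2021-01-11"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INT PRIMARY KEY, supplier_id INT, order_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

# Step 1: days since the previous order of the same supplier (NULL for the first).
gap_rows = conn.execute(
    """
    SELECT supplier_id,
           julianday(order_date) - julianday(
               lag(order_date) OVER (PARTITION BY supplier_id ORDER BY order_date)
           ) AS date_diff
    FROM orders
    """
).fetchall()

# Step 2: collect the non-NULL gaps per supplier and take the sample
# standard deviation; fewer than two gaps yields None.
gaps = defaultdict(list)
for supplier_id, date_diff in gap_rows:
    if date_diff is not None:
        gaps[supplier_id].append(date_diff)

std_by_supplier = {
    supplier_id: statistics.stdev(g) if len(g) >= 2 else None
    for supplier_id, g in gaps.items()
}
print(std_by_supplier)
```

Supplier 1 has gaps of 2, 4, and 2 days. Supplier 2 has a single gap, for which a sample standard deviation is undefined, so the sketch reports None for it.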

Output top 3 most profitable products every quarter

I'm trying to output the top 3 products per quarter, which should be 12 rows in total (3 top products for each of 4 quarters).
The closest I've got is the query below; I have no idea how to partition it by quarter.
SELECT * FROM (SELECT QUARTER, PRODUCT_NAME, SUM(QUANTITY) "QTY_SOLD", SALES, SUM(PROFIT) "PROFIT_GENERATED" FROM DELIVERIES_FACT
WHERE EXTRACT(YEAR from SHIP_DATE) = 2015 GROUP BY QUARTER, PRODUCT_NAME, SALES ORDER BY "PROFIT_GENERATED" DESC)
WHERE rownum <= 3
getting an output of

I've written this SQL extracting the calendar quarter from SHIP_DATE; you can adjust as needed.
Similarly, RANK(), ROW_NUMBER(), and DENSE_RANK() all are different; you may wish to experiment with each analytical function to see which best fits your data and handles ties the way you want them to.
SELECT *
FROM (SELECT RANK() OVER (PARTITION BY SHIP_QUARTER
ORDER BY PROFIT_GENERATED desc) AS PROFIT_RANK_BY_Q,
ORIG.*
FROM
(SELECT EXTRACT(QUARTER from SHIP_DATE) AS SHIP_QUARTER,
PRODUCT_NAME,
SUM(QUANTITY) "QTY_SOLD", SALES, SUM(PROFIT) "PROFIT_GENERATED"
FROM DELIVERIES_FACT
WHERE EXTRACT(YEAR from SHIP_DATE) = 2015
GROUP BY EXTRACT(QUARTER from SHIP_DATE), PRODUCT_NAME, SALES
)
)
WHERE PROFIT_RANK_BY_Q <= 3
order by SHIP_QUARTER, PROFIT_RANK_BY_Q
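To make the difference between RANK(), DENSE_RANK(), and ROW_NUMBER() concrete, the sketch below computes all three over the same partition using SQLite from Python. The table name profits and its rows are hypothetical, and SQLite 3.25 or later is assumed. The row_number() ordering includes product_name as a tiebreaker so the result is deterministic.

```python
import sqlite3

# Hypothetical per-quarter profit totals with a tie between B and C.
rows = [
    (1, "A", 100),
    (1, "B", 90),
    (1, "C", 90),
    (1, "D", 80),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profits (ship_quarter INT, product_name TEXT, profit INT)")
conn.executemany("INSERT INTO profits VALUES (?, ?, ?)", rows)

ranked = conn.execute(
    """
    SELECT product_name,
           profit,
           rank()       OVER (PARTITION BY ship_quarter ORDER BY profit DESC) AS rnk,
           dense_rank() OVER (PARTITION BY ship_quarter ORDER BY profit DESC) AS dense_rnk,
           row_number() OVER (PARTITION BY ship_quarter
                              ORDER BY profit DESC, product_name) AS row_num
    FROM profits
    ORDER BY ship_quarter, profit DESC, product_name
    """
).fetchall()

for row in ranked:
    print(row)
```

With a filter of "<= 3", rank() would keep A, B, and C (D is ranked 4), dense_rank() would keep all four rows (D is ranked 3), and row_number() would keep exactly three rows.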

Can I limit the amount of rows to be used for a group in a GROUP BY statement

I'm having an odd problem
I have a table with the columns product_id, sales and day
Not all products have sales every day. I'd like to get the average number of sales that each product had in the last 10 days where it had sales
Usually I'd get the average like this
SELECT product_id, AVG(sales)
FROM table
GROUP BY product_id
Is there a way to limit the amount of rows to be taken into consideration for each product?
I'm afraid it's not possible but I wanted to check if someone has an idea
Update to clarify:
Product may be sold on days 1,3,5,10,15,17,20.
Since I don't want the average over all days, but only over the days on which the product was actually sold, doing something like
SELECT product_id, AVG(sales)
FROM table
WHERE day > '01/01/2009'
GROUP BY product_id
won't work

If you want the last 10 calendar days up to each product's most recent sale:
SELECT t.product_id, AVG(t.sales)
FROM table t
JOIN (
SELECT product_id, MAX(sales_date) as max_sales_date
FROM table
GROUP BY product_id
) t_max ON t.product_id = t_max.product_id
AND DATEDIFF(day, t.sales_date, t_max.max_sales_date) < 10
GROUP BY t.product_id;
The DATEDIFF call is specific to SQL Server; you'd have to replace it with your server's syntax for date difference functions.
To get the last 10 days when the product had any sale:
SELECT product_id, AVG(sales)
FROM (
SELECT product_id, sales, DENSE_RANK() OVER
(PARTITION BY product_id ORDER BY sales_date DESC) AS rn
FROM Table
) As t_rn
WHERE rn <= 10
GROUP BY product_id;
This assumes sales_date is a date, not a datetime. You'd have to extract the date part if the field is a datetime.
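A small sketch of the DENSE_RANK approach, run on SQLite from Python with a limit of 2 sale days instead of 10 so the effect is visible on a handful of rows. The table name sales_log and its contents are assumptions for illustration, and SQLite 3.25 or later is assumed.

```python
import sqlite3

# Hypothetical sales rows: (product_id, sales_date, sales).
# Product 1 sold on three distinct days, with two rows on the latest day.
rows = [
    (1, "2021-01-01", 999),
    (1, "2021-01-05", 30),
    (1, "2021-01-09", 30),
    (1, "2021-01-09", 60),
    (2, "2021-01-02", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_log (product_id INT, sales_date TEXT, sales INT)")
conn.executemany("INSERT INTO sales_log VALUES (?, ?, ?)", rows)

LAST_N_DAYS = 2  # stands in for the 10 used in the answer

averages = conn.execute(
    """
    SELECT product_id, AVG(sales)
    FROM (
        SELECT product_id, sales,
               DENSE_RANK() OVER (PARTITION BY product_id
                                  ORDER BY sales_date DESC) AS rn
        FROM sales_log
    ) AS t_rn
    WHERE rn <= ?
    GROUP BY product_id
    ORDER BY product_id
    """,
    (LAST_N_DAYS,),
).fetchall()

print(averages)
```

DENSE_RANK numbers distinct sale days rather than rows, so both rows from 9 January count as day 1 and the row from 5 January as day 2; the 1 January row falls outside the limit. Using ROW_NUMBER here would instead limit the number of rows, not the number of days.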
And finally, a version that uses no window functions:
SELECT product_id, AVG(sales)
FROM Table t
WHERE sales_date IN (
SELECT TOP(10) sales_date
FROM Table s
WHERE t.product_id = s.product_id
ORDER BY sales_date DESC)
GROUP BY product_id;
Again, sales_date is assumed to be a date, not a datetime. Use other limiting syntax if TOP is not supported by your server.

Give this a whirl. The sub-query selects the last ten days on which a product had a sale; the outer query does the aggregation.
SELECT t1.product_id, SUM(t1.sales) / COUNT(*)
FROM table t1
WHERE t1.day IN (
SELECT TOP 10 t2.day
FROM table t2
WHERE t2.product_id = t1.product_id
ORDER BY t2.day DESC
)
GROUP BY t1.product_id
BTW: This approach uses a correlated subquery, which may not be very performant, but it should work in theory.

I'm not sure I understand correctly, but if you'd like the average sales over the last 10 days for your products, you can do the following:
SELECT ProductId, Sum(Sales)/Count(*) FROM (SELECT ProductId, Sales FROM Table WHERE SaleDate >= #Date) t GROUP BY ProductId HAVING Count(*) > 0
Or you can use the AVG aggregate function, which is easier:
SELECT ProductId, AVG(Sales) FROM (SELECT ProductId, Sales FROM Table WHERE SaleDate >= #Date) t GROUP BY ProductId
Updated
Now I see what you meant. As far as I know, it is not possible to do this in one query. It would be possible if we could do something like this (Northwind database):
select a.CustomerId,count(a.OrderId)
from Orders a INNER JOIN(SELECT CustomerId,OrderDate FROM Orders Order By OrderDate) AS b ON a.CustomerId=b.CustomerId GROUP BY a.CustomerId Having count(a.OrderId)<10
but you can't use ORDER BY in subqueries unless you use TOP, which is not suitable for this case. But maybe you can do it as follows:
SELECT ProductId,Sales INTO #temp FROM table Order By Day
select a.ProductId,Sum(a.Sales) /Count(a.Sales)
from table a INNER JOIN #temp AS b ON a.ProductId=b.ProductId GROUP BY a.ProductId Having count(a.Sales)<=10

If this is a table of sales transactions, then there should not be any rows in there for days on which there were no Sales. I.e., If ProductId 21 had no sales on 1 June, then this table should not have any rows with productId = 21 and day = '1 June'... Therefore you should not have to filter anything out - there should not be anything to filter out
Select ProductId, Avg(Sales) AvgSales
From Table
Group By ProductId
should work fine. So if it's not, then you have not explained the problem completely or accurately.
Also, in your question, you show Avg(Sales) in the example SQL query, but in the text you mention "average number of sales that each product ... ". Do you want the average sales amount, or the average count of sales transactions? And do you want this average by product alone (i.e., one output value reported for each product), or the average per product per day?
If you want the average per product alone, is it for just those sales in the ten days prior to now, or in the ten days prior to the date of the last sale for each product?
If the latter then
Select ProductId, Avg(Sales) AvgSales
From Table T
Where day > (Select Max(Day) - 10
From Table
Where ProductId = T.ProductID)
Group By ProductId
If you want the average per product alone, for just those sales in the ten days with sales prior to the date of the last sale for each product, then
Select ProductId, Avg(Sales) AvgSales
From Table T
Where (Select Count(Distinct day) From Table
Where ProductId = T.ProductID
And Day > T.Day) <= 10
Group By ProductId
